This article explains the theory and practical application of decision trees in R. It covers the terminology and important concepts related to decision trees. In this tutorial, we fit a decision tree to credit data, which gives you background on a financial project and shows how predictive modeling is used in the banking and finance domain.


**Decision Tree : Meaning**

A decision tree is a graphical representation of possible solutions to a decision based on certain conditions. It is called a decision tree because it starts with a single variable, which then branches off into a number of solutions, just like a tree.

A decision tree has three main components :

- **Root Node :** The topmost node is called the root node. It splits on the best predictor (independent variable).
- **Decision / Internal Node :** A node in which a predictor (independent variable) is tested; each branch represents an outcome of the test.
- **Leaf / Terminal Node :** It holds a class label (category), e.g. Yes or No (the final classification outcome).


**Advantages and Disadvantages of Decision Tree**

**Advantages :**

- A decision tree is easy to interpret.
- A decision tree works even if there are nonlinear relationships between variables. It does not require a linearity assumption.
- A decision tree is not sensitive to outliers.

**Disadvantages :**

- A decision tree model generally overfits. It fits the training sample well but does not perform well on a validation sample.
- It assumes all independent variables interact with each other, which is generally not the case.

**Terminologies related to decision tree**

**1. Pruning : Correct Overfitting**

It is a technique to correct the overfitting problem. It reduces the size of decision trees by removing sections of the tree that provide little power to classify instances. It is used to remove anomalies in the training data due to noise or outliers. Pruned trees are less complex.

**Pruning Method : Cost Complexity**

'CP' stands for the **Complexity Parameter** of the tree. **We want the cp value of the smallest tree that has the smallest cross-validation error.** In a regression tree, a split is kept only if it increases the overall R-squared by at least cp. The cost complexity is measured by the following two parameters:

- Number of leaves in the tree (i.e. size of the tree)
- Error rate of the tree (i.e. misclassification rate or Sum of Squared Error)
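The two ingredients listed above are usually combined into a single penalized criterion. The form below is the textbook cost-complexity measure; the symbols R(T), |T| and α are notation introduced here for illustration and do not appear in the original article. As far as I understand, rpart reports cp as α rescaled by the error of the root node, so treat the exact correspondence as an assumption.

```latex
% R(T)  : error of tree T (misclassification rate or sum of squared errors)
% |T|   : number of leaves (terminal nodes) of T
% alpha : penalty paid for each additional leaf
R_{\alpha}(T) = R(T) + \alpha \, |T|
```

For a fixed α, a larger tree is preferred only if it lowers R(T) by more than α per extra leaf.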

**Thus large trees with a low error rate are penalized in favor of smaller trees.**

| | CP | nsplit | rel error | xerror | xstd |
|---|---|---|---|---|---|
| 1 | 0.046948 | 0 | 1.00000 | 1.00000 | 0.057151 |
| 2 | 0.023474 | 4 | 0.75587 | 0.81221 | 0.053580 |
| 3 | 0.015649 | 5 | 0.73239 | 0.83099 | 0.053989 |
| 4 | 0.011737 | 10 | 0.64789 | 0.87324 | 0.054867 |
| 5 | 0.010955 | 12 | 0.62441 | 0.89671 | 0.055328 |
| 6 | 0.010000 | 17 | 0.56808 | 0.89671 | 0.055328 |

In this case, we pick the tree having **CP = 0.023474**, as it has the least cross-validation error (xerror). The rel error of each iteration of the tree is the fraction of misclassified cases in the iteration relative to the fraction of misclassified cases in the root node. Cost complexity (cp) is the tuning parameter in CART.
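As a small illustration of that selection rule, the sketch below re-enters the cp table shown above by hand and picks the row with the lowest xerror. The object names (cp_table, best_row, best_cp) are made up for this example; with a fitted model you would read the same columns from the cptable component instead.

```r
# The cp table from the article, typed in as a plain matrix.
cp_table <- matrix(
  c(0.046948,  0, 1.00000, 1.00000, 0.057151,
    0.023474,  4, 0.75587, 0.81221, 0.053580,
    0.015649,  5, 0.73239, 0.83099, 0.053989,
    0.011737, 10, 0.64789, 0.87324, 0.054867,
    0.010955, 12, 0.62441, 0.89671, 0.055328,
    0.010000, 17, 0.56808, 0.89671, 0.055328),
  ncol = 5, byrow = TRUE,
  dimnames = list(1:6, c("CP", "nsplit", "rel error", "xerror", "xstd"))
)

# Row with the smallest cross-validated error (which.min returns the first minimum).
best_row <- which.min(cp_table[, "xerror"])
best_cp  <- unname(cp_table[best_row, "CP"])

print(best_cp)
```

Here the second row wins, which matches the CP = 0.023474 chosen in the text.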

**2. Splitting**

It is the process of dividing a node into two or more sub-nodes.

**3. Branch**

A subsection of the entire tree is called a branch.

**4. Parent Node**

A node which splits into sub-nodes.

**5. Child Node**

It is a sub-node of a parent node.

**6. Surrogate Split**

When data are missing, a decision tree can still return predictions if it includes surrogate splits. If the primary splitter is missing for an observation, the number one surrogate is used. If the number one surrogate is also missing, the number two surrogate is used, and so on. In rpart, the usesurrogate option controls this behaviour: with usesurrogate = 2, an observation for which all surrogates are missing is sent in the majority direction.
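To see surrogate splits at work, here is a minimal sketch that uses the kyphosis data shipped with the rpart package instead of the credit data. The new_case row is hypothetical, and the control values are written out only to make them visible (to my knowledge they are also rpart's defaults).

```r
library(rpart)

# Fit a small classification tree; surrogates are kept for each split.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class",
             control = rpart.control(usesurrogate = 2, maxsurrogate = 5))

# Hypothetical new case in which one predictor (Start) is missing.
new_case <- data.frame(Age = 60, Number = 4, Start = NA_real_)

# A prediction is still returned: wherever a split needs Start, the case is
# routed by a surrogate variable, or in the majority direction as a fallback.
pred <- predict(fit, new_case, type = "class")
print(pred)
```

Calling summary(fit) lists the surrogate variables chosen at each split.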

**Classification and Regression Tree (CART)**

**Classification Tree :** The outcome (dependent) variable is a categorical variable (binary), and the predictor (independent) variables can be continuous or categorical variables (binary).

**Algorithm of Classification Tree: Gini Index**

The Gini index measures the impurity of a node. It varies between 0 and (1 - 1/n), where n is the number of categories in the dependent variable.
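The bounds quoted above are easy to check numerically. The helper below is a sketch written for this article (gini_index is not an rpart function); it computes one minus the sum of squared class proportions in a node.

```r
# Gini impurity of a node: 1 - sum of squared class proportions.
gini_index <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

pure_node  <- gini_index(c("Good", "Good", "Good", "Good"))  # one class only
mixed_node <- gini_index(c("Good", "Good", "Bad", "Bad"))    # two classes, 50/50
three_way  <- gini_index(c("a", "b", "c"))                   # three classes, equal shares

print(c(pure_node, mixed_node, three_way))
```

A pure node gives 0, and an evenly mixed node reaches the upper bound 1 - 1/n (0.5 for two categories, 2/3 for three).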

**Process :**

- Rules based on variables' values are selected to get the best split to differentiate observations based on the dependent variable
- Once a rule is selected and splits a node into two, the same process is applied to each "child" node (i.e. it is a recursive procedure)
- Splitting stops when CART detects no further gain can be made, or some pre-set stopping rules are met. (Alternatively, the data are split as much as possible and then the tree is later pruned.)
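The first step of the process above, choosing the best rule for one node, can be sketched for a single numeric predictor as follows. This is a simplified illustration and not rpart's actual implementation. The function names and the toy data are invented for the example.

```r
# Gini impurity of a node.
gini_index <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Try every midpoint between consecutive distinct values of x and keep the
# cut point that gives the lowest weighted impurity of the two child nodes.
best_split <- function(x, y) {
  vals <- sort(unique(x))
  if (length(vals) < 2) return(NULL)            # nothing to split on
  cuts <- (head(vals, -1) + tail(vals, -1)) / 2
  child_impurity <- sapply(cuts, function(cut) {
    left  <- y[x <  cut]
    right <- y[x >= cut]
    (length(left) * gini_index(left) + length(right) * gini_index(right)) / length(y)
  })
  i <- which.min(child_impurity)
  list(cut = cuts[i], gain = gini_index(y) - child_impurity[i])
}

# Toy data: low values are all "Bad", high values are all "Good".
x <- c(1, 2, 3, 10, 11, 12)
y <- factor(c("Bad", "Bad", "Bad", "Good", "Good", "Good"))
s <- best_split(x, y)
print(s)
```

A full tree would repeat this search over every predictor and then apply it recursively to each child node until a stopping rule is met.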

**Regression Tree :** The outcome (dependent) variable is a continuous variable, and the predictor (independent) variables can be continuous or categorical variables (binary).

**Algorithm of Regression Tree: Least-Squared Deviation or Least Absolute Deviation**

The impurity of a node is measured by the Least-Squared Deviation (LSD), which is simply the within-node variance.
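As a worked example of that impurity measure, the sketch below computes the within-node sum of squared deviations before and after a candidate split. The names and the toy values are assumptions made for illustration. Dividing the sum of squares by the node size gives the within-node variance.

```r
# Sum of squared deviations from the node mean.
node_sse <- function(y) sum((y - mean(y))^2)

y     <- c(1, 2, 3, 10, 11, 12)   # toy outcome values in the parent node
left  <- y[1:3]                   # observations sent to the left child
right <- y[4:6]                   # observations sent to the right child

sse_parent   <- node_sse(y)                        # 125.5
sse_children <- node_sse(left) + node_sse(right)   # 2 + 2 = 4
reduction    <- sse_parent - sse_children          # 121.5

print(c(parent = sse_parent, children = sse_children, reduction = reduction))
```

A regression tree prefers the split with the largest reduction, so this candidate would be a very good one.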

**Analysis of German Credit Data**

The German Credit Data contains data on 20 variables for 1000 loan applicants, along with a classification of whether each applicant is considered a Good or a Bad credit risk.

The objective of the model is to decide whether to approve a loan for a prospective applicant based on his/her profile.

- Make sure all the categorical variables are converted into factors.
- The function rpart will run a regression tree if the response variable is numeric, and a classification tree if it is a factor.

**rpart parameters**

- **method :** "class" for a classification tree; "anova" for a regression tree
- **minsplit :** minimum number of observations in a node before splitting. Default value - 20
- **minbucket :** minimum number of observations in a terminal node (leaf). Default value - 7 (i.e. minsplit/3, rounded)
- **xval :** number of cross-validations
- **Prediction (Scoring) :** If type = "prob": This is for a classification tree. It generates probabilities - Prob(Y=0) and Prob(Y=1).
- **Prediction (Classification) :** If type = "class": This is for a classification tree. It returns the predicted class (0/1).
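The parameters above can be tried out without the credit file. The sketch below is only an illustration: it uses the kyphosis data that ships with rpart, so the variable names differ from the German credit example, and the control values simply restate the defaults described above.

```r
library(rpart)

# Kyphosis is a factor, so method = "class" gives a classification tree.
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class",
             control = rpart.control(minsplit = 20, minbucket = 7, xval = 10))

# type = "prob" returns one probability column per class level.
probs <- predict(fit, kyphosis, type = "prob")

# type = "class" returns the predicted class label as a factor.
classes <- predict(fit, kyphosis, type = "class")

head(probs)
head(classes)
```

With a binary 0/1 response such as Creditability, the two probability columns correspond to Prob(Y=0) and Prob(Y=1).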

**R : Decision Tree**

#read data file

mydata= read.csv("C:\\Users\\Deepanshu Bhalla\\Desktop\\german_credit.csv")

# Check attributes of data

str(mydata)

'data.frame': 1000 obs. of 21 variables:
$ Creditability : Factor w/ 2 levels "0","1": 2 2 2 2
$ Account.Balance : int 1 1 2 1 1 1 1 1 4 2 ...
$ Duration.of.Credit..month. : int 18 9 12 12 12 10 8 6 18 24 ...
$ Payment.Status.of.Previous.Credit: int 4 4 2 4 4 4 4 4 4 2 ...
$ Purpose : int 2 0 9 0 0 0 0 0 3 3 ...
$ Credit.Amount : int 1049 2799 841 2122 2171 2241
$ Value.Savings.Stocks : int 1 1 2 1 1 1 1 1 1 3 ...
$ Length.of.current.employment : int 2 3 4 3 3 2 4 2 1 1 ...
$ Instalment.per.cent : int 4 2 2 3 4 1 1 2 4 1 ...
$ Sex...Marital.Status : int 2 3 2 3 3 3 3 3 2 2 ...
$ Guarantors : int 1 1 1 1 1 1 1 1 1 1 ...
$ Duration.in.Current.address : int 4 2 4 2 4 3 4 4 4 4 ...
$ Most.valuable.available.asset : int 2 1 1 1 2 1 1 1 3 4 ...
$ Age..years. : int 21 36 23 39 38 48 39 40 65 23 ...
$ Concurrent.Credits : int 3 3 3 3 1 3 3 3 3 3 ...
$ Type.of.apartment : int 1 1 1 1 2 1 2 2 2 1 ...
$ No.of.Credits.at.this.Bank : int 1 2 1 2 2 2 2 1 2 1 ...
$ Occupation : int 3 3 2 2 2 2 2 2 1 1 ...
$ No.of.dependents : int 1 2 1 2 1 2 1 2 1 1 ...
$ Telephone : int 1 1 1 1 1 1 1 1 1 1 ...
$ Foreign.Worker : int 1 1 1 2 2 2 2 2 1 1 ...

# Check number of rows and columns

dim(mydata)

# Make dependent variable as a factor (categorical)

mydata$Creditability = as.factor(mydata$Creditability)

# Split data into training (70%) and validation (30%)

dt = sort(sample(nrow(mydata), nrow(mydata)*.7))

train<-mydata[dt,]

val<-mydata[-dt,]

# Check number of rows in training data set

nrow(train)

# To view dataset

View(train)

# Decision Tree Model

library(rpart)

mtree <- rpart(Creditability~., data = train, method="class", control = rpart.control(minsplit = 20, minbucket = 7, maxdepth = 10, usesurrogate = 2, xval =10 ))

mtree

#Plot tree

plot(mtree)

text(mtree)

#Beautify tree

library(rattle)

library(rpart.plot)

library(RColorBrewer)

#view1

prp(mtree, faclen = 0, cex = 0.8, extra = 1)

#view2 - total count at each node

tot_count <- function(x, labs, digits, varlen)

{paste(labs, "\n\nn =", x$frame$n)}

prp(mtree, faclen = 0, cex = 0.8, node.fun=tot_count)

#view3- fancy Plot

fancyRpartPlot(mtree)

############################

########Pruning#############

############################

# Select the tree size that has least misclassification rate (prediction error).

#‘CP’ stands for Complexity Parameter of the tree.

# We want the cp value (with a simpler tree) that has least xerror(cross validated error).

printcp(mtree)

bestcp <- mtree$cptable[which.min(mtree$cptable[,"xerror"]),"CP"]

# Prune the tree using the best cp.

pruned <- prune(mtree, cp = bestcp)

# Plot pruned tree

prp(pruned, faclen = 0, cex = 0.8, extra = 1)

# confusion matrix (training data)

conf.matrix <- table(train$Creditability, predict(pruned,type="class"))

rownames(conf.matrix) <- paste("Actual", rownames(conf.matrix), sep = ":")

colnames(conf.matrix) <- paste("Pred", colnames(conf.matrix), sep = ":")

print(conf.matrix)

#Scoring

library(ROCR)

val1 = predict(pruned, val, type = "prob")

#Storing Model Performance Scores

pred_val <-prediction(val1[,2],val$Creditability)

# Calculating Area under Curve

perf_val <- performance(pred_val,"auc")

perf_val

# Plotting Lift curve

plot(performance(pred_val, measure="lift", x.measure="rpp"), colorize=TRUE)

# Calculating True Positive and False Positive Rate

perf_val <- performance(pred_val, "tpr", "fpr")

# Plot the ROC curve

plot(perf_val, col = "green", lwd = 1.5)

#Calculating KS statistics

ks1.tree <- max(attr(perf_val, "y.values")[[1]] - (attr(perf_val, "x.values")[[1]]))

ks1.tree

# Advanced Plot

prp(pruned, main="assorted arguments",

extra=106, # display probability of the second class and percent of obs

nn=TRUE, # display the node numbers

fallen.leaves=TRUE, # put the leaves on the bottom of the page

branch=.5, # change angle of branch lines

faclen=0, # do not abbreviate factor levels

trace=1, # print the automatically calculated cex

shadow.col="gray", # shadows under the leaves

branch.lty=3, # draw branches using dotted lines

split.cex=1.2, # make the split text larger than the node text

split.prefix="is ", # put "is " before split text

split.suffix="?", # put "?" after split text

split.box.col="lightgray", # lightgray split boxes (default is white)

split.border.col="darkgray", # darkgray border on split boxes

split.round=.5)
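To make the KS statistic in the script above less of a black box, here is a base R sketch of the same idea: the maximum gap between the true positive rate and the false positive rate as the score cutoff is lowered. The function ks_stat and the toy scores are invented for this illustration, labels are assumed to be coded 1 (positive) and 0 (negative), and tied scores are not grouped, so results may differ slightly from ROCR when ties are present.

```r
# KS statistic: largest difference between cumulative TPR and FPR
# when observations are ranked from highest to lowest score.
ks_stat <- function(score, label) {
  ord   <- order(score, decreasing = TRUE)
  label <- label[ord]
  tpr <- cumsum(label == 1) / sum(label == 1)
  fpr <- cumsum(label == 0) / sum(label == 0)
  max(tpr - fpr)
}

# Toy example: predicted probabilities and actual outcomes.
score <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.2)
label <- c(1,   1,   0,   1,   0,   0)

ks <- ks_stat(score, label)
print(ks)
```

In this toy case the gap peaks at 2/3. A value near 0 would mean the scores do not separate the two classes at all.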

Nice article. Could you please let me know how to calculate the root node error?

Please provide a decision tree example in SAS if you can, thanks.

The decision tree algorithm is not available in SAS STAT. It is available in SAS Enterprise Miner. I don't have access to SAS Enterprise Miner. I can share some tutorials about how to build a decision tree in SAS Enterprise Miner if you want. Thanks!

mtree <- rpart(Creditability~., data = train, method="class", control = rpart.control((minsplit = 20, minbucket = 7, maxdepth = 10, usesurrogate = 2, xval =10 ))

Error: unexpected ',' in "mtree <- rpart(Creditability~., data = train, method="class", control = rpart.control((minsplit = 20,"

I am getting this error. Can you please tell me how to avoid it?

There should be a single bracket in 'rpart.control(('. Use rpart.control( instead of rpart.control((. Let me know if it works. I am logged in via mobile. Will update the code in the article tomorrow.

I have a doubt about the "root" node mentioned at the beginning of the tutorial. Isn't that the dependent variable from which the parent and child nodes come? Just asking to clear my doubt, because I see it has been described as the most important predictor.

The root node is an independent variable (predictor). In this example, the dependent variable is binary in nature - whether to approve a loan to a prospective applicant.