Impute Missing Values with Decision Tree


CART has a built-in algorithm for handling missing data with surrogate variables. A surrogate split divides the data in nearly the same way as the primary split. In other words, we are looking for clones or close approximations: something else in the data that can do the same work that the primary split accomplished.

Imputation Process

Suppose there are 10 predictors x1 − x10 to be included in the CART analysis, and suppose there are missing values for x1 only, which happens to be the “best” predictor chosen to define the “optimal” split.

CART is applied with x1 as the dependent variable (more precisely, the two classes defined by the split on x1) and x2 through x10 as potential splitting variables. Only one partitioning is allowed here; a full tree is not constructed. The nine predictors are then ranked by the proportion of cases in x1 that are misclassified. Predictors that do no better than the marginal distribution of x1 are dropped from further consideration.

The variable with the lowest classification error for x1 is then used in place of x1 to assign cases with missing values on x1 to one of the two daughter nodes. That is, “the predicted classes for x1 are used when the actual classes for x1 are missing”. If a case is also missing the “best” surrogate variable (say, x2), the second “best” surrogate variable (say, x3) is used instead. And so on.
Surrogate splitting rules enable you to use the values of other input variables to perform a split for observations with missing values.
Important Note : The tree surrogate splitting rule method can impute missing values for both numeric and categorical variables.
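
The ranking step described above can be sketched in base R. This is only an illustration under stated assumptions, not the exact routine CART or rpart runs internally: the simulated data, the primary split x1 < 0, and the helper name surrogate_error are all made up for the example.

```r
# Simulated data (an assumption for illustration): x2 tracks x1 closely,
# x3 tracks it weakly, and x4 is unrelated to it.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)
x3 <- x1 + rnorm(n, sd = 1.5)
x4 <- rnorm(n)
dat <- data.frame(x1, x2, x3, x4)

# Assume the primary split is x1 < 0; this is the left/right assignment
# that a surrogate has to reproduce.
goes_left <- dat$x1 < 0

# Hypothetical helper: lowest misclassification rate that any single
# threshold on x achieves when predicting the primary split's direction.
surrogate_error <- function(x, target) {
  cuts <- sort(unique(x))
  errs <- vapply(cuts, function(cut) {
    agree <- mean((x < cut) == target)
    1 - max(agree, 1 - agree)  # the split may point in either direction
  }, numeric(1))
  min(errs)
}

errors <- sapply(dat[c("x2", "x3", "x4")], surrogate_error, target = goes_left)

# Baseline: error from sending every case in the majority direction.
baseline <- min(mean(goes_left), 1 - mean(goes_left))

# Keep candidates that beat the baseline, best surrogate first.
ranked <- sort(errors[errors < baseline])
print(ranked)
```

The first name in ranked is the variable that would stand in for x1 when x1 is missing; the next one is used when that surrogate is missing too.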

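In R, this behaviour is controlled by the usesurrogate argument of rpart.control in the rpart package; usesurrogate = 2, which is the default, uses the surrogates in order and sends a case in the majority direction if all of them are missing. Check out : GBM Missing Imputation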
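
A minimal sketch of how this might look with rpart, assuming a small simulated data set (the variables y, x1 and x2 and the 30 blanked-out values are invented for the example). rpart keeps rows with missing predictors when fitting, and predict() routes them through the surrogate splits.

```r
library(rpart)

# Simulated data (an assumption for illustration): y depends on x1,
# and x2 is a noisy copy of x1 that can act as a surrogate.
set.seed(42)
n  <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)
y  <- factor(ifelse(x1 > 0, "yes", "no"))
dat <- data.frame(y, x1, x2)

# Blank out some x1 values to mimic missing data.
dat$x1[sample(n, 30)] <- NA

# usesurrogate = 2: use surrogates in order; if all are missing,
# send the case in the majority direction.
fit <- rpart(y ~ x1 + x2, data = dat, method = "class",
             control = rpart.control(usesurrogate = 2, maxsurrogate = 5))

# Rows with missing x1 still receive a predicted class.
pred <- predict(fit, newdata = dat, type = "class")
table(missing_x1 = is.na(dat$x1), predicted = pred)
```

Calling summary(fit) prints the surrogate splits chosen at each node, which is a quick way to check which variables are standing in for the primary split.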

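Mice Package : Imputing Missing Values with CART
library(mice)
# Introduce missing values into the built-in anscombe data set
anscombe <- within(anscombe, {
  y1[1:3] <- NA
  y4[3:5] <- NA
})
# Impute with CART; minbucket is passed on to the tree-fitting routine
imp <- mice(anscombe, method = "cart", minbucket = 4)
# Extract the completed data set
imp1 <- complete(imp)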

Source : Mice Package in Detail 


About Author:

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has close to 7 years of experience in data science and predictive modeling. During his tenure, he has worked with global clients in various domains like retail and commercial banking, Telecom, HR and Automotive.

