This article explains how to select important variables using the Boruta package in R. Variable selection, also called feature selection, is an important step in a predictive modeling project. Private and public organizations now track and collect data on a large number of attributes, which gives the analyst access to many candidate predictors. But not every variable is useful for a particular prediction task, so it is essential to identify the important variables and remove the redundant ones. Before building a predictive model, we generally do not know the exact set of variables that yields an accurate and robust model.
Why is Variable Selection important?
- Removing a redundant variable helps to improve accuracy. Similarly, including a relevant variable has a positive effect on model accuracy.
- Too many variables might result in overfitting, which means the model is not able to generalize the pattern to new data.
- Too many variables lead to slow computation, which in turn requires more memory and hardware.
Why Boruta Package?
There are a lot of packages for feature selection in R. The question arises: "What makes the Boruta package so special?" Here are the reasons to use it for feature selection.
- It works well for both classification and regression problems.
- It takes into account multi-variable relationships.
- It is an improvement on the random forest variable importance measure, which is a very popular method for variable selection.
- It follows an all-relevant variable selection method, in which it considers all features that are relevant to the outcome variable. In contrast, most other variable selection algorithms follow a minimal-optimal method, where they rely on a small subset of features that yields a minimal error on a chosen classifier.
- It can handle interactions between variables.
- It can deal with the fluctuating nature of a random forest importance measure.
Basic Idea of Boruta Algorithm
Shuffle the predictors' values, join the shuffled copies with the original predictors, and build a random forest on the merged dataset. Then compare the original variables with the randomised ones to measure variable importance. Only variables with higher importance than the randomised variables are considered important.
How Boruta Algorithm Works
Follow the steps below to understand the algorithm -
- Create duplicate copies of all independent variables. When the number of independent variables in the original data is less than 5, create at least 5 copies using existing variables.
- Shuffle the values of the added duplicate copies to remove their correlations with the target variable. These are called shadow features or permuted copies.
- Combine the original variables with the shuffled copies.
- Run a random forest classifier on the combined dataset and compute a variable importance measure (the default is Mean Decrease Accuracy) to evaluate the importance of each variable, where higher means more important.
- Then compute the Z score, which is the mean of the accuracy loss divided by the standard deviation of the accuracy loss.
- Find the maximum Z score among shadow attributes (MZSA).
- Tag the variables as 'unimportant' when they have importance significantly lower than MZSA. Then permanently remove them from the process.
- Tag the variables as 'important' when they have importance significantly higher than MZSA.
- Repeat the above steps for a predefined number of iterations (random forest runs), or until all attributes are tagged either 'unimportant' or 'important', whichever comes first.
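The steps above can be sketched for a single iteration as follows. This is only an illustration and not the package's internal code. It assumes the ranger package is installed, uses the built-in iris data as a stand-in dataset, and the names shadow, combined, mzsa and hits are made up for this sketch. It also skips the "at least 5 copies" rule and the significance test that the real algorithm applies over many runs.

```r
library(ranger)

set.seed(42)
x <- iris[, 1:4]          # original predictors
y <- iris$Species         # target variable

# Steps 1-2: duplicate the predictors and shuffle each copy independently
shadow <- as.data.frame(lapply(x, sample))
names(shadow) <- paste0("shadow_", names(x))

# Step 3: combine the originals with their shadow copies
combined <- cbind(x, shadow, target = y)

# Steps 4-5: random forest with scaled permutation importance
fit <- ranger(dependent.variable.name = "target", data = combined,
              num.trees = 500, importance = "permutation",
              scale.permutation.importance = TRUE)
imp <- fit$variable.importance

# Step 6: maximum importance among the shadow attributes (MZSA)
is_shadow <- grepl("^shadow_", names(imp))
mzsa <- max(imp[is_shadow])

# A "hit" in this single run: an original variable that beats the best shadow
hits <- imp[!is_shadow] > mzsa
print(round(imp, 2))
print(hits)
```

One run like this is not enough to tag anything. Boruta accumulates such hits over many runs before it decides.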
Difference between Boruta and Random Forest Importance Measure
When I first learnt this algorithm, the question 'RF importance measure vs. Boruta' puzzled me for hours. After reading a lot about it, I figured out the exact difference between these two variable selection algorithms.
In random forest, the Z score is computed by dividing the average accuracy loss by its standard deviation, and it is used as the importance measure for all the variables. But this Z score cannot be used directly to decide which variables matter, because it is not directly related to the statistical significance of the variable importance. To work around this problem, the Boruta package runs random forest on both the original and the randomised (shadow) attributes and computes the importance of all of them, so that each original variable can be compared against a random reference. Since the whole process depends on the permuted copies, the random permutation procedure is repeated to get statistically robust results.
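To make the Z score concrete, here is a small base R sketch. The per-tree accuracy losses are simulated numbers and the column names signal and noise are invented for this illustration. They do not come from a real forest.

```r
set.seed(1)
# Hypothetical per-tree accuracy losses for two variables over 500 trees
# (made-up numbers: "signal" loses accuracy when permuted, "noise" does not)
loss <- data.frame(
  signal = rnorm(500, mean = 0.04, sd = 0.02),
  noise  = rnorm(500, mean = 0.00, sd = 0.02)
)

# Z score as described above: mean accuracy loss / sd of accuracy loss
z_score <- sapply(loss, function(v) mean(v) / sd(v))
print(round(z_score, 2))
```

The Z score on its own has no natural cutoff. That is why Boruta compares it with the Z scores of the shadow attributes instead of with a fixed threshold.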
Is Boruta a solution for all?
The answer is no. You need to test other algorithms as well. It is not possible to judge the best algorithm without knowing the data and assumptions. Since Boruta is an improvement on the random forest variable importance measure, it should work well most of the time.
What are shuffled features or permuted copies?
It simply means changing the order of values of a variable. See the practical example below -
set.seed(123)
mydata = data.frame(var1 = 1 : 6, var2=runif(6))
shuffle = data.frame(apply(mydata,2,sample))
head(cbind(mydata, shuffle))
    Original          Shuffled
  var1      var2    var1      var2
1    1 0.2875775       4 0.9404673
2    2 0.7883051       5 0.4089769
3    3 0.4089769       3 0.2875775
4    4 0.8830174       2 0.0455565
5    5 0.9404673       6 0.8830174
6    6 0.0455565       1 0.7883051
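Why does shuffling help? A permuted copy keeps exactly the same values as the original variable but loses its relationship with the target. The sketch below shows this on simulated data. The variables x, y and x_shadow are invented for the illustration.

```r
set.seed(123)
n <- 200
x <- rnorm(n)
y <- 2 * x + rnorm(n, sd = 0.5)   # y depends strongly on x

x_shadow <- sample(x)             # permuted copy of x

# Same values, different order
identical(sort(x), sort(x_shadow))

# Correlation with the target before and after shuffling
round(c(original = cor(x, y), shadow = cor(x_shadow, y)), 3)
```

The shadow copy therefore gives a baseline: the importance a variable with this distribution gets by pure chance.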
R : Feature Selection with Boruta Package
1. Get Data into R
The read.csv() function is used to read data from a CSV file and import it into the R environment.
#Read data
df = read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
2. List of variables
#Column Names
names(df)
Result : "admit" "gre" "gpa" "rank"
3. Define categorical variables
df$admit = as.factor(df$admit)
df$rank = as.factor(df$rank)
4. Explore Data
#Summarize Data
summary(df)
#Check number of missing values
sapply(df, function(y) sum(is.na(y)))
 admit        gre             gpa          rank
 0:273   Min.   :220.0   Min.   :2.260   1: 61
 1:127   1st Qu.:520.0   1st Qu.:3.130   2:151
         Median :580.0   Median :3.395   3:121
         Mean   :587.7   Mean   :3.390   4: 67
         3rd Qu.:660.0   3rd Qu.:3.670
         Max.   :800.0   Max.   :4.000
No missing values in the dataframe df.
Handle Missing Values
In this dataset, we have no missing values. If your dataset contains any, you need to impute them before running the Boruta package.
5. Run Boruta Algorithm
#Install and load Boruta package
install.packages("Boruta")
library(Boruta)
# Run Boruta Algorithm
set.seed(456)
boruta <- Boruta(admit~., data = df, doTrace = 2)
print(boruta)
plot(boruta)
Boruta performed 9 iterations in 4.870027 secs.
3 attributes confirmed important: gpa, gre, rank;
No attributes deemed unimportant.
The output shows that all three variables are considered important and none is tagged 'unimportant'. The plot() function shows a box plot of the importance scores of all the attributes, plus the minimum, average and maximum shadow scores (drawn in blue). Green box plots mark confirmed (important) attributes, red box plots mark rejected attributes, and yellow box plots mark tentative ones.
Tentative attributes have importance scores so close to their best shadow attributes that Boruta is unable to decide within the default number of random forest runs.
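Figure: Box Plot - Variable Selection
As you can see above, the label of shadowMean is not displayed because it got truncated due to insufficient space. To fix this problem, run the following program.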
plot(boruta, xlab = "", xaxt = "n")
k <-lapply(1:ncol(boruta$ImpHistory),function(i)
boruta$ImpHistory[is.finite(boruta$ImpHistory[,i]),i])
names(k) <- colnames(boruta$ImpHistory)
Labels <- sort(sapply(k,median))
axis(side = 1,las=2,labels = names(Labels),
at = 1:ncol(boruta$ImpHistory), cex.axis = 0.7)
Let's add some irrelevant data to our original dataset
This is to check whether the Boruta package is able to find unimportant variables. In the following program, we create duplicate copies of the original 3 variables and then randomise the order of values in these copies.
#Add some random permuted data
set.seed(777)
df.new<-data.frame(df,apply(df[,-1],2,sample))
names(df.new)[5:7]<-paste("Random",1:3,sep="")
#apply() returns a character matrix, so restore the column types
df.new$Random1 = as.numeric(as.character(df.new$Random1))
df.new$Random2 = as.numeric(as.character(df.new$Random2))
df.new$Random3 = as.factor(df.new$Random3)
> head(df.new)
admit gre gpa rank Random1 Random2 Random3
1 0 380 3.61 3 600 3.76 4
2 1 660 3.67 3 660 3.30 4
3 1 800 4.00 1 700 3.37 2
4 1 640 3.19 4 620 3.33 3
5 0 520 2.93 4 600 3.04 2
6 1 760 3.00 2 520 3.64 4
Run Boruta Algorithm
set.seed(456)
boruta2 <- Boruta(admit~., data = df.new, doTrace = 1)
print(boruta2)
plot(boruta2)
Boruta performed 55 iterations in 21.79995 secs.
3 attributes confirmed important: gpa, gre, rank;
3 attributes confirmed unimportant: Random1, Random2, Random3;
The irrelevant variables we added to the dataset came out as unimportant according to the Boruta algorithm.
> attStats(boruta2)
meanImp medianImp minImp maxImp normHits decision
gre 5.56458881 5.80124786 2.347609 8.410490 0.90909091 Confirmed
gpa 9.66289180 9.37140347 6.818527 13.405592 1.00000000 Confirmed
rank 10.16762154 10.22875211 6.173894 15.235444 1.00000000 Confirmed
Random1 0.05986751 0.18360283 -1.281078 2.219137 0.00000000 Rejected
Random2 1.15927054 1.35728128 -2.779228 3.816915 0.29090909 Rejected
Random3 0.05281551 -0.02874847 -3.126645 3.219810 0.05454545 Rejected
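The attStats() table can be handled like any other data frame. In the sketch below, a hand-built data frame with the values printed above stands in for the real attStats(boruta2) result, so that the example runs without fitting anything. The names stats, ranked and confirmed are made up for this sketch. The normHits column is the share of runs in which the attribute scored higher than the best shadow attribute.

```r
# Values copied from the attStats(boruta2) output printed above
stats <- data.frame(
  meanImp  = c(5.56458881, 9.66289180, 10.16762154,
               0.05986751, 1.15927054, 0.05281551),
  normHits = c(0.90909091, 1.00000000, 1.00000000,
               0.00000000, 0.29090909, 0.05454545),
  decision = c("Confirmed", "Confirmed", "Confirmed",
               "Rejected", "Rejected", "Rejected"),
  row.names = c("gre", "gpa", "rank", "Random1", "Random2", "Random3")
)

# Rank attributes by mean importance, strongest first
ranked <- stats[order(-stats$meanImp), ]
print(ranked)

# Names of the confirmed attributes, in order of importance
confirmed <- rownames(ranked)[ranked$decision == "Confirmed"]
print(confirmed)
```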
To save the final list of important variables in a vector, use the getSelectedAttributes() function.
#See list of finalvars
finalvars = getSelectedAttributes(boruta2, withTentative = FALSE)
finalvars
[1] "gre" "gpa" "rank"
In case you get tentative attributes in your dataset, you need to treat them. In this dataset, we did not get any. The following function compares the median Z score of each tentative attribute with the median Z score of the best shadow attribute and then decides whether the attribute should be confirmed or rejected.
Tentative.boruta <- TentativeRoughFix(boruta2)
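The comparison described above can be sketched in base R. This is a simplified illustration of the idea and not the code of TentativeRoughFix(). The importance history below is simulated, and the names varA, varB and imp_history are hypothetical.

```r
set.seed(10)
# Hypothetical importance history: one row per random forest run.
# Column names and numbers are invented for illustration only.
imp_history <- data.frame(
  varA      = rnorm(100, mean = 3.0, sd = 1),
  varB      = rnorm(100, mean = 1.0, sd = 1),
  shadowMax = rnorm(100, mean = 2.0, sd = 1)
)

tentative <- c("varA", "varB")
shadow_median <- median(imp_history$shadowMax)

# Confirm a tentative attribute if its median importance beats the
# median importance of the best shadow attribute; otherwise reject it
decision <- sapply(imp_history[tentative], function(v) {
  if (median(v) > shadow_median) "Confirmed" else "Rejected"
})
print(decision)
```

This rule is weaker than the test used for regular decisions, so treat attributes fixed this way with some caution.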
List of parameters used in Boruta
- maxRuns: maximal number of random forest runs. Default is 100.
- doTrace: verbosity level. 0 means no tracing. 1 means reporting each attribute decision as soon as it is made. 2 means all of 1 plus reporting each iteration. Default is 0.
- getImp: function used to obtain attribute importance. The default is getImpRfZ, which runs a random forest from the ranger package and gathers Z-scores of the mean decrease accuracy measure.
- holdHistory: if set to TRUE (the default), the full history of importance runs is stored.
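A call that sets these parameters explicitly might look like the sketch below. It assumes the Boruta package is installed and uses the built-in iris data only as a small stand-in dataset. The value maxRuns = 50 is an arbitrary choice for the example.

```r
library(Boruta)

set.seed(456)
# iris is used here only as a small built-in stand-in dataset
b <- Boruta(Species ~ ., data = iris,
            maxRuns = 50,          # stop after at most 50 random forest runs
            doTrace = 0,           # no progress messages
            getImp = getImpRfZ,    # default importance source (ranger Z-scores)
            holdHistory = TRUE)    # keep ImpHistory, which plot() needs

print(b)
# One row per run; columns are the attributes plus the three shadow summaries
dim(b$ImpHistory)
```

Lowering maxRuns makes the run faster but increases the chance that some attributes are left tentative.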
Compare Boruta with RFE Algorithm
In caret, there is a variable selection algorithm called recursive feature elimination (RFE). It is also called backward selection. A brief explanation of the algorithm is given below -
- Fit the model using all independent variables.
- Calculate the variable importance of all the variables.
- Rank each independent variable by its importance to the model.
- Drop the weakest variable (worst ranked), build a model using the remaining variables and calculate model accuracy.
- Repeat step 4 until all variables have been processed.
- Variables are then ranked according to when they were dropped.
- For regression, RMSE and R-Squared are used as metrics. For classification, 'Accuracy' and 'Kappa' are used.
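The steps above can be sketched in base R without caret. This is a simplified illustration and not how caret implements rfe(). It uses simulated data, a logistic regression instead of a random forest, the absolute z statistic of each coefficient as a stand-in importance measure, and a single hold-out set instead of cross-validation. All variable names are invented for the sketch.

```r
set.seed(456)
n <- 400
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
# Only x1 and x2 drive the outcome; x3 and x4 are pure noise
d$y <- factor(rbinom(n, 1, plogis(1.5 * d$x1 - 1.0 * d$x2)))

train <- d[1:300, ]
test  <- d[301:400, ]

vars <- c("x1", "x2", "x3", "x4")
dropped <- character(0)
history <- data.frame(n_vars = integer(0), accuracy = numeric(0))

while (length(vars) >= 1) {
  fit <- glm(reformulate(vars, response = "y"), data = train, family = binomial)
  # Hold-out accuracy of the current subset
  pred <- ifelse(predict(fit, test, type = "response") > 0.5, "1", "0")
  history <- rbind(history,
                   data.frame(n_vars = length(vars),
                              accuracy = mean(pred == test$y)))
  # Stand-in importance: absolute z statistic of each coefficient
  z <- setNames(abs(coef(summary(fit))[vars, "z value"]), vars)
  weakest <- names(z)[which.min(z)]
  dropped <- c(dropped, weakest)
  vars <- setdiff(vars, weakest)
}

print(history)
# Variables ranked by how long they survived (last dropped = strongest)
print(rev(dropped))
```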
In the code below, we build a random forest model inside the RFE algorithm. The function set 'rfFuncs' stands for random forest.
library(caret)
library(randomForest)
set.seed(456)
control <- rfeControl(functions=rfFuncs, method="cv", number=10)
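# Store the result under a name that does not mask the rfe() function
rfe.result <- rfe(df.new[,2:7], df.new[,1], rfeControl=control)
print(rfe.result, top=10)
plot(rfe.result, type=c("g", "o"), cex = 1.0)
predictors(rfe.result)
head(rfe.result$resample, 10)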
Outer resampling method: Cross-Validated (10 fold)
Resampling performance over subset size:
 Variables Accuracy  Kappa AccuracySD KappaSD Selected
         4   0.6477 0.1053    0.07009  0.1665
         6   0.7076 0.2301    0.06285  0.1580        *
The top 6 variables (out of 6):
gpa, rank, gre, Random2, Random3, Random1
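Figure: RFE - Variable Selection
In this case, the RFE algorithm returned all the variables based on model accuracy. Compared to RFE, the final variables from Boruta make more sense in terms of interpretation. It all depends on the data and the distribution of its variables. As analysts, we should explore both techniques and see which one works better for the dataset. There are many packages in R for variable selection, and every technique has pros and cons.
The following functions can be used for model fitting in RFE selections -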
- linear regression (lmFuncs)
- random forests (rfFuncs)
- naive Bayes (nbFuncs)
- bagged trees (treebagFuncs)
Does Boruta handle multicollinearity?
Multicollinearity means high correlation between independent variables. The absence of multicollinearity is an important assumption in linear and logistic regression models, because it makes the coefficient estimates unstable and inflates their standard errors. Let's check whether the Boruta algorithm takes care of it. Let's create some sample data. In this case, we are creating 3 predictors x1-x3 and a target variable y.
set.seed(123)
n <- 500
x1 <- runif(n)
x2 <- rnorm(n)
x3 <- x2 + rnorm(n,sd=0.5)
y <- x3 + runif(n)
cor(x2,x3)
[1] 0.8981247
The correlation between x2 and x3 is very high (close to 0.9).
mydata = data.frame(x1,x2,x3)
Boruta(mydata, y)
Boruta performed 9 iterations in 7.088029 secs.
2 attributes confirmed important: x2, x3;
1 attributes confirmed unimportant: x1;
Boruta considered both highly correlated variables to be important. This implies that it does not account for collinearity while selecting important variables. This follows from the way the algorithm works: it looks for all relevant variables rather than a minimal subset.
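A simple way to spot such pairs after running Boruta is to look at the correlation matrix of the confirmed variables. The base R sketch below recreates x2 and x3 from the example above. The 0.8 cutoff and the names selected, cm and high are arbitrary choices for this illustration.

```r
set.seed(123)
n  <- 500
x1 <- runif(n)                    # generated only to keep the same random sequence
x2 <- rnorm(n)
x3 <- x2 + rnorm(n, sd = 0.5)
selected <- data.frame(x2, x3)    # attributes Boruta confirmed above

# Absolute pairwise correlations, upper triangle only
cm <- abs(cor(selected))
cm[lower.tri(cm, diag = TRUE)] <- NA

# Pairs above a cutoff (0.8 is an arbitrary choice for this sketch)
high <- which(cm > 0.8, arr.ind = TRUE)
data.frame(var1 = rownames(cm)[high[, "row"]],
           var2 = colnames(cm)[high[, "col"]],
           cor  = cm[high])
```

The findCorrelation() function in caret offers a similar check if you already use that package.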
Important points related to Boruta
- Impute missing values - Make sure missing or blank values are filled in before running the Boruta algorithm.
- Collinearity - It is important to handle collinearity after getting the important variables from Boruta.
- Slow speed - It is slow compared to other traditional feature selection algorithms.
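For the first point, a very simple imputation can be sketched in base R as shown below. Median and mode imputation is only one basic option and may not suit your data. The data frame df_na and the helper impute_simple() are hypothetical and exist only in this example.

```r
# Small made-up data frame with gaps, used only for illustration
df_na <- data.frame(
  gre  = c(380, NA, 800, 640, 520),
  rank = factor(c("3", "3", NA, "4", "3"))
)

impute_simple <- function(col) {
  if (is.numeric(col)) {
    # Numeric column: replace missing values with the median
    col[is.na(col)] <- median(col, na.rm = TRUE)
  } else {
    # Factor column: replace missing values with the most frequent level
    mode_value <- names(which.max(table(col)))
    col[is.na(col)] <- mode_value
  }
  col
}

df_imputed <- as.data.frame(lapply(df_na, impute_simple))

# Number of missing values left per column
sapply(df_imputed, function(y) sum(is.na(y)))
```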