Dimensionality Reduction with R

Deepanshu Bhalla
In predictive modeling, dimensionality reduction (or dimension reduction) is the process of reducing the number of input variables by removing irrelevant or redundant ones. It is a very important step in predictive modeling; some predictive modelers call it 'feature selection' or 'variable selection'. If the goal is to improve the accuracy of the model, this is the step where you should invest most of your time, because the right variables produce more accurate models. Hence, it pays to have a variable reduction / selection strategy in place before building a predictive model. Some of these strategies are listed below -

Step I : Remove Redundant Variables

The following three simple analyses help to remove redundant variables -
  1. Remove variables having a high percentage of missing values (say, more than 50%)
  2. Remove zero and near zero-variance predictors
  3. Remove highly correlated variables (absolute pair-wise correlation greater than 0.7). If two variables have a high correlation, look at the mean absolute correlation of each variable and remove the one with the largest mean absolute correlation. See how it works below -
A sample correlation matrix is shown below (column names x1 to x5 are added so that findCorrelation() can return variable names later) -
Corrmatrix <- structure(c(1, 0.82, 0.54, 0.36, 0.85, 0.82, 1, 0.01, 0.74, 0.36,
                   0.54, 0.01, 1, 0.65, 0.91, 0.36, 0.74, 0.65, 1, 0.36,
                   0.85, 0.36, 0.91, 0.36, 1),
                 .Dim = c(5L, 5L),
                 .Dimnames = list(paste0("x", 1:5), paste0("x", 1:5)))
Steps :
  1. First, take the absolute values of the correlation matrix (use abs(Corrmatrix) in R).
  2. Replace all diagonal values (1s) in the matrix with NA (use diag(Corrmatrix) <- NA).
  3. Compute the mean of each column.
  4. Rank the mean values (calculated in Step 3) in descending order.
[Image: Average of columns]
  5. Reorder the correlation matrix based on this rank.
[Image: Updated correlation matrix, reordered by rank]
  6. Now, for each pair of variables (i, j) in the reordered matrix, if matrix[i, j] > cutoff, the function compares two quantities -

- The mean of the absolute correlations in row i, i.e. mean(matrix[i, ]). For the first pair (x1, x5), the mean of row x1 is 0.642.
- The mean of the absolute correlations over all rows except row j, i.e. mean(matrix[-j, ]). For the pair (x1, x5), the mean over all rows except row x5 is 0.545.

Since the first mean is larger, i.e. x1 has the higher mean absolute correlation, the function removes variable x1.

In the next iteration it checks the pair (x5, x3), excluding the removed x1. This time it compares the mean of (0.91, 0.36, 0.36) with the mean of (0.91, 0.36, 0.36, 0.36, 0.65, 0.74, 0.36, 0.01, 0.74).
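To make these steps concrete, here is a minimal sketch that reproduces steps 1 to 5 by hand; it assumes the Corrmatrix object (with the x1-x5 names) defined above -

# Reproduce the manual steps on the sample matrix
absCor <- abs(Corrmatrix)                  # Step 1: absolute correlations
diag(absCor) <- NA                         # Step 2: ignore self-correlations
avgCor <- colMeans(absCor, na.rm = TRUE)   # Step 3: mean absolute correlation per variable
round(avgCor, 3)                           # x1 = 0.642, x5 = 0.620, ...
ord <- order(avgCor, decreasing = TRUE)    # Step 4: rank in descending order
absCor[ord, ord]                           # Step 5: reorder the matrix by that rank

From here, findCorrelation() works down the reordered matrix pair by pair, as described in step 6.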

How to do it in R

The findCorrelation() function from the caret package performs the above steps.
library(caret)
findCorrelation(Corrmatrix, cutoff = .6, verbose = TRUE, names = TRUE)
It returns the variables that are highly correlated and can be removed prior to building a model.

Step II : Feature Selection with Random Forest

Random Forest is one of the most popular machine learning algorithms in data science. Unlike regression techniques, it makes no distributional assumptions about the data, it requires far less pre-processing than other techniques, and it overcomes the overfitting problem that a single decision tree has. Among its outputs is a list of predictor (independent) variables that are important in predicting the target (dependent) variable, with two measures of variable importance. The first is the Gini gain produced by the variable, averaged over all trees. The second is permutation importance, i.e. the mean decrease in classification accuracy after permuting the variable, averaged over all trees. Sort the permutation importance scores in descending order and select the top k variables.
Since permutation variable importance is affected by collinearity, it is necessary to handle collinearity before running a random forest to extract important variables.
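As a quick, self-contained illustration of the two importance measures, here is a minimal sketch using R's built-in iris data (not the example dataset used below); the object names and ntree value are purely illustrative -

library(randomForest)

set.seed(123)
# Fit a small forest; importance = TRUE is required for permutation importance
rf_demo <- randomForest(Species ~ ., data = iris, importance = TRUE, ntree = 500)

importance(rf_demo, type = 2)   # Gini importance (MeanDecreaseGini)
importance(rf_demo, type = 1)   # Permutation importance (MeanDecreaseAccuracy)

# Select the top k variables by permutation importance
perm_imp <- importance(rf_demo, type = 1)
k <- 2
rownames(perm_imp)[order(perm_imp[, 1], decreasing = TRUE)][1:k]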

Download : Example Dataset

R Code : Removing Redundant Variables
# load required libraries
library(caret)
library(corrplot)
library(plyr)

# load required dataset
dat <- read.csv("C:\\Users\\Deepanshu Bhalla\\Downloads\\pml-training.csv")

# Set seed
set.seed(227)

# Remove variables with more than 50% missing values
dat1 <- dat[, colMeans(is.na(dat)) <= .5]
dim(dat1)

# Remove Zero and Near Zero-Variance Predictors
nzv <- nearZeroVar(dat1)
dat2 <- dat1[, -nzv]
dim(dat2)

# Identifying numeric variables
numericData <- dat2[sapply(dat2, is.numeric)]

# Calculate correlation matrix
descrCor <- cor(numericData)

# Print correlation matrix and look at max correlation
print(descrCor)
summary(descrCor[upper.tri(descrCor)])

# Check Correlation Plot
corrplot(descrCor, order = "FPC", method = "color", type = "lower", tl.cex = 0.7, tl.col = rgb(0, 0, 0))

# Find attributes that are highly correlated
highlyCorrelated <- findCorrelation(descrCor, cutoff=0.7)

# print indexes of highly correlated attributes
print(highlyCorrelated)

# Identify variable names of highly correlated variables
highlyCorCol <- colnames(numericData)[highlyCorrelated]

# Print highly correlated attributes
highlyCorCol

# Remove highly correlated variables and create a new dataset
dat3 <- dat2[, -which(colnames(dat2) %in% highlyCorCol)]
dim(dat3)
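As a sanity check (a short sketch assuming the dat3 object created above), you can recompute the correlations among the remaining numeric variables and confirm that no pair exceeds the 0.7 cutoff -

# Recompute correlations on the remaining numeric variables
numericData2 <- dat3[sapply(dat3, is.numeric)]
descrCor2 <- cor(numericData2)
summary(descrCor2[upper.tri(descrCor2)])
max(abs(descrCor2[upper.tri(descrCor2)]))   # expected to be below the 0.7 cutoff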

R Code : Feature Selection with Random Forest

library(randomForest)

#Train Random Forest
rf <- randomForest(classe ~ ., data = dat3, importance = TRUE, ntree = 1000)

#Evaluate variable importance
imp <- importance(rf, type = 1)
imp <- data.frame(predictors = rownames(imp), imp)

# Order the predictor levels by importance
imp.sort <- arrange(imp, desc(MeanDecreaseAccuracy))
imp.sort$predictors <- factor(imp.sort$predictors, levels = imp.sort$predictors)

# Select the top 20 predictors
imp.20 <- imp.sort[1:20, ]
print(imp.20)

# Plot Important Variables
varImpPlot(rf, type=1)

# Subset data with 20 independent and 1 dependent variables
# use as.character() so columns are matched by name rather than by factor code
dat4 <- cbind(classe = dat3$classe, dat3[, as.character(imp.20$predictors)])
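To check that the reduced predictor set still carries the signal, one option (a sketch assuming the rf and dat4 objects created above) is to refit the forest on the 20 selected variables and compare the out-of-bag error with the full model -

# Refit the forest on the reduced dataset and compare OOB error rates
rf.reduced <- randomForest(classe ~ ., data = dat4, importance = TRUE, ntree = 1000)
print(rf)          # OOB error of the model built on all predictors in dat3
print(rf.reduced)  # OOB error of the model built on the top 20 predictors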
11 Responses to "Dimensionality Reduction with R"
  1. Great article. I am learning R and this was very helpful indeed. Please keep writing more hands-on "how to do this in R" posts :-)
    cheers
    Manish

  2. Hi Manish, nice article.
    One question
    If we are removing highly correlated variables, do we keep one variable among the highly correlated set (cluster) to represent the other removed variables of the set or do we remove all of them?

    1. It keeps one variable among the two highly correlated variables. If two variables have a high correlation, the function looks at the mean absolute correlation of each variable and removes only the one with the largest mean absolute correlation.

    2. What do you mean by mean absolute correlation of each variable? How can we compute the correlation of a single variable? Thanks.

    3. I have added more description in the article. Hope it helps.

  3. Hi Deepanshu,
    Thanks for your very useful code. I am getting an error in the mtry command, as shown below. Any clue how to solve it?
    mtry <- tuneRF(dat3[, -36], dat3[,36], ntreeTry=1000, stepFactor=1.5,improve=0.01, trace=TRUE, plot=TRUE)
    mtry = 5 OOB error = 0%
    Searching left ...
    mtry = 4 OOB error = 0%
    NaN 0.01
    Error in if (Improve > improve) { : missing value where TRUE/FALSE needed

    1. It is because the OOB error is very small, so the scope for improvement is minimal and improve = 0.01 fails. You can ignore this line of code and just skip to the next step. Set mtry = 4 instead of mtry = best.m. See the code below -
      rf <- randomForest(classe ~ ., data = dat3, mtry = 4, importance = TRUE, ntree = 1000)

  4. Really like the way you have created a whole correlation module. Thanks. Any other articles you have written on pre-analytics visualization in R?

  5. What an informative post! It's really helpful.
    But I have a question. The findCorrelation() function searches for columns to remove in order to reduce pair-wise correlations, while the vif() function measures multicollinearity. What's the difference between the two?

  6. With this I know I'm not alone in data science.
