How to Calculate Variable Importance using Random Forest in R

Calculating variable importance with Random Forest is a powerful technique used to understand the significance of different variables in a predictive model. Random Forest is an ensemble learning method that builds multiple decision trees and combines their predictions to achieve better accuracy and robustness.

Random Forest offers two common measures of variable importance: Mean Decrease in Gini (based on Gini impurity) and Mean Decrease in Accuracy (MDA). Mean Decrease in Gini adds up the reduction in node impurity from every split made on a variable and averages it over all trees. MDA measures how much out-of-bag prediction accuracy drops when the values of a particular variable are randomly permuted; the variable is shuffled, not removed from the dataset.
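To make the Gini idea concrete, here is a minimal sketch in base R that computes the impurity decrease for a single hand-picked split. The helper gini_impurity() and the threshold 2.45 on Petal.Length are illustrative assumptions, not part of the randomForest package, which accumulates such decreases over all splits and trees internally.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2)
# (gini_impurity is a hypothetical helper written for this illustration)
gini_impurity <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# One candidate split on the full iris data (threshold chosen by hand)
parent <- iris$Species
left   <- iris$Species[iris$Petal.Length < 2.45]
right  <- iris$Species[iris$Petal.Length >= 2.45]

# Impurity of the two child nodes, weighted by their share of the rows
weighted_child <- (length(left) * gini_impurity(left) +
                   length(right) * gini_impurity(right)) / length(parent)

# The decrease in impurity is what gets credited to Petal.Length for this split
gini_decrease <- gini_impurity(parent) - weighted_child
print(gini_decrease)
```

A variable that produces large decreases like this across many splits and trees ends up with a high Mean Decrease in Gini score.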

In the example below, we are using the built-in "iris" dataset, which contains measurements of iris flowers along with their species labels. The goal is to predict the species based on the measurements of iris flowers.

Please follow the steps below to calculate variable importance with Random Forest in R.

Step 1: Load the required package

First, make sure the randomForest package is installed. If you don't have it yet, install it with the first command below; the second command loads the package into your session:

install.packages("randomForest")
library(randomForest)
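If you rerun the script often, you may prefer to install the package only when it is missing. The following is one possible way to do that, using requireNamespace() from base R:

```r
# Install randomForest only when it is not already available
if (!requireNamespace("randomForest", quietly = TRUE)) {
  install.packages("randomForest")
}

# Load the package into the current session
library(randomForest)
```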

Step 2: Prepare and Split the Data

Split your data into training and testing sets (80% training, 20% testing). Setting a seed makes the random split reproducible.

set.seed(123)
train_indices <- sample(1:nrow(iris), 0.8 * nrow(iris))
train_data <- iris[train_indices, ]
test_data <- iris[-train_indices, ]
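A simple random split does not guarantee that each species is equally represented in the training set. The sketch below first checks the class counts of the split above, then shows a stratified alternative that samples 80% within each species. The names strat_indices, strat_train and strat_test are introduced here for illustration only.

```r
# Reproduce the simple random split and inspect it
set.seed(123)
train_indices <- sample(seq_len(nrow(iris)), 0.8 * nrow(iris))
train_data <- iris[train_indices, ]
test_data <- iris[-train_indices, ]

print(dim(train_data))            # rows and columns in the training set
print(table(train_data$Species))  # class counts may be uneven

# Stratified alternative: sample 80% of the row indices within each species
set.seed(123)
strat_indices <- unlist(lapply(
  split(seq_len(nrow(iris)), iris$Species),
  function(idx) sample(idx, round(0.8 * length(idx)))
))
strat_train <- iris[strat_indices, ]
strat_test <- iris[-strat_indices, ]

print(table(strat_train$Species))  # equal counts per species
```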

Step 3: Build the Random Forest Model

rf_model <- randomForest(Species ~ ., data = train_data, ntree = 100, mtry = 2)
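Before reading importance scores, it can help to confirm that the model predicts well, since importance from a poor model is hard to interpret. The sketch below repeats the split and fit from the steps above so it runs on its own, then scores the held-out test set; test_pred and test_accuracy are names introduced here for illustration.

```r
library(randomForest)

# Same split and model as in the steps above
set.seed(123)
train_indices <- sample(seq_len(nrow(iris)), 0.8 * nrow(iris))
train_data <- iris[train_indices, ]
test_data <- iris[-train_indices, ]
rf_model <- randomForest(Species ~ ., data = train_data, ntree = 100, mtry = 2)

# Printing the model shows the out-of-bag error estimate and confusion matrix
print(rf_model)

# Predict on the 20% of rows the model has not seen
test_pred <- predict(rf_model, newdata = test_data)
test_accuracy <- mean(test_pred == test_data$Species)
print(test_accuracy)

# Cross-tabulate predicted against actual species
print(table(Predicted = test_pred, Actual = test_data$Species))
```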

Step 4: Extract Variable Importance Scores

variable_importance <- importance(rf_model)
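Note that a model fitted without importance = TRUE stores only the Mean Decrease in Gini score. If you also want the permutation-based Mean Decrease in Accuracy, one option is to refit with importance = TRUE and select the measure through the type argument of importance(). The sketch below uses a separate object, rf_perm (a name chosen here for illustration), so it does not interfere with the model above.

```r
library(randomForest)

set.seed(123)
train_indices <- sample(seq_len(nrow(iris)), 0.8 * nrow(iris))
train_data <- iris[train_indices, ]

# importance = TRUE asks randomForest to compute permutation importance as well
rf_perm <- randomForest(Species ~ ., data = train_data,
                        ntree = 100, mtry = 2, importance = TRUE)

# type = 1: Mean Decrease in Accuracy (scaled by its standard error by default)
mda <- importance(rf_perm, type = 1)
print(mda)

# type = 2: Mean Decrease in Gini
mdg <- importance(rf_perm, type = 2)
print(mdg)

# scale = FALSE returns the raw, unscaled accuracy decrease
print(importance(rf_perm, type = 1, scale = FALSE))
```

Without the type argument, importance(rf_perm) returns one column per species plus both overall measures, so select columns by name rather than by position.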

Step 5: Rank and Visualize Variable Importance

Sort the variables based on importance in descending order. Create a bar plot to visualize variable importance.

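# Sort by Mean Decrease in Gini; drop = FALSE keeps the matrix shape and the variable names as row names
sorted_importance <- data.frame(variable_importance[order(-variable_importance[, "MeanDecreaseGini"]), , drop = FALSE])
# Select the column by name so the plot stays correct if more importance columns are present
barplot(sorted_importance[, "MeanDecreaseGini"], names.arg = rownames(sorted_importance), las = 2, col = "blue", main = "Variable Importance")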
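As an alternative to building the bar plot by hand, the randomForest package ships with varImpPlot(), which sorts the variables and draws a dot chart in one call. A minimal sketch, repeating the model fit so it runs on its own:

```r
library(randomForest)

set.seed(123)
train_indices <- sample(seq_len(nrow(iris)), 0.8 * nrow(iris))
train_data <- iris[train_indices, ]
rf_model <- randomForest(Species ~ ., data = train_data, ntree = 100, mtry = 2)

# Dot chart of importance, sorted from most to least important
plotted <- varImpPlot(rf_model, main = "Variable Importance")

# The plotted scores are returned invisibly and can be inspected
print(plotted)
```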

Step 6: Interpretation

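The bar plot displays the importance of each variable in descending order. Because the model in Step 3 was fitted without importance = TRUE, the scores shown are Mean Decrease in Gini values. The higher the bar, the more the variable contributes to predicting the species.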

About Author:
Deepanshu Bhalla

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and HR.
