Random Forest on Imbalanced Data

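In a random forest, you can oversample events relative to non-events without data loss. The sampling is done separately for each tree, so non-events left out of one tree's sample can still be drawn for other trees.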

There are 2 arguments in the randomForest package for sampling:

1. strata - A (factor) variable that is used for stratified sampling.

2. sampsize - Size(s) of sample to draw. For classification, if sampsize is a vector whose length equals the number of strata, then sampling is stratified by strata, and the elements of sampsize give the number of cases to draw from each stratum.

Example: sampsize = c(100, 50), or with class names: sampsize = c('0' = 100, '1' = 50)

Meaning: each tree is grown on a random sample of 100 cases from class 0 and 50 cases from class 1, drawn with replacement (the default).
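To make the example concrete, here is a minimal sketch that passes strata and sampsize directly to randomForest(). The toy data (900 non-events, 100 events, and the predictors x1 and x2) is made up for illustration and is not part of the original example.

```r
library(randomForest)

set.seed(1401)

# Hypothetical toy data: 900 non-events ("0") and 100 events ("1")
y <- factor(rep(c("0", "1"), times = c(900, 100)))
x <- data.frame(
  x1 = rnorm(1000, mean = ifelse(y == "1", 1, 0)),  # shifted for events
  x2 = rnorm(1000)                                   # pure noise
)

# Each tree is grown on 100 rows of class "0" and 50 rows of class "1"
fit <- randomForest(
  x = x, y = y,
  ntree = 200,
  strata = y,
  sampsize = c("0" = 100, "1" = 50)
)

# Out-of-bag confusion matrix with per-class error rates
fit$confusion
```

Note that each element of sampsize should not exceed the number of rows available in its class.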

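library(caret)
library(pROC)
set.seed(1401)
# metric = "ROC" needs class probabilities and twoClassSummary; the levels of
# training$class must then be valid R names (e.g. "no"/"yes", not "0"/"1")
ctrl <- trainControl(classProbs = TRUE, summaryFunction = twoClassSummary)
# Minority-class count; same as sum(training$class == 1) when 1 labels the events
n_min <- min(table(training$class))
rf <- train(class ~ ., data = training, method = "rf", ntree = 500, trControl = ctrl,
            strata = training$class, sampsize = rep(n_min, 2), metric = "ROC")
# Predicted probability of the first factor level, treated here as the event class
testing$rf <- predict(rf, testing, type = "prob")[, 1]
rf_roc <- roc(testing$class, testing$rf, levels = rev(levels(training$class)))
plot(rf_roc, col = rgb(1, 0, 0, .5), lwd = 2)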

In the above code, sampsize repeats the event (minority-class) count twice, so each tree is grown on the same number of cases from both classes. For example, with 100 events it is equivalent to sampsize = c('0' = 100, '1' = 100).
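The balanced draw, and why it does not discard data, can be checked with a small base R simulation. This is only a sketch that mimics the per-tree stratified draw. It does not call the randomForest package, and the label counts and the helper draw_one_tree() are hypothetical.

```r
set.seed(1401)

# Hypothetical labels: 900 non-events ("0") and 100 events ("1")
y <- factor(rep(c("0", "1"), times = c(900, 100)))

# Balanced per-tree sample size: the minority-class count for every class
n_min <- min(table(y))
sampsize <- setNames(rep(n_min, nlevels(y)), levels(y))

# Mimic one tree's stratified draw (with replacement): row indices per class
draw_one_tree <- function() {
  unlist(lapply(levels(y), function(lv) {
    idx <- which(y == lv)
    idx[sample.int(length(idx), sampsize[[lv]], replace = TRUE)]
  }))
}

# Repeat for 500 trees; each column holds the rows drawn for one tree
ntree <- 500
draws <- replicate(ntree, draw_one_tree())

# Class balance within the first tree's sample
table(y[draws[, 1]])

# Share of majority-class rows drawn by at least one tree
majority_rows <- which(y == "0")
coverage <- mean(majority_rows %in% draws)
coverage
```

Each tree sees an equal number of cases from both classes, while the coverage value should be close to 1, meaning nearly every non-event is used somewhere in the forest.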

About Author:
Deepanshu Bhalla

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and HR.
