Random Forest on Imbalanced Data

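In random forest, you can over-represent the events in the sample used for each tree without data loss. Every tree draws a fresh sample, so non-events left out of one tree can still be drawn for other trees.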

There are 2 arguments of the randomForest() function in the randomForest package that control sampling :

1. strata - A (factor) variable that is used for stratified sampling.
2. sampsize - Size(s) of sample to draw. For classification, if sampsize is a vector whose length equals the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from each stratum.

Example : sampsize= c(100,50) OR  you can write : sampsize=c('0'=100, '1'=50)
Meaning : This will randomly sample 100 and 50 cases from the two classes respectively (with replacement, which is the default) to grow each tree.
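
As a minimal sketch of these two arguments passed straight to randomForest(), the example below uses simulated data. The data frame d, its columns x1 and x2, and the 900/100 class split are assumptions made up for illustration, not part of the original example.

```r
library(randomForest)

set.seed(42)
# Simulated data (hypothetical): 900 non-events ("0") and 100 events ("1")
class <- factor(rep(c("0", "1"), times = c(900, 100)))
d <- data.frame(
  x1 = rnorm(1000) + (class == "1"),  # shifted upward for events
  x2 = rnorm(1000),                   # pure noise
  class = class
)

# Each tree is grown on 100 non-events and 50 events, drawn within each class.
# An element of sampsize may not exceed the number of cases in its class.
fit <- randomForest(
  x = d[, c("x1", "x2")],
  y = d$class,
  ntree = 50,
  strata = d$class,
  sampsize = c("0" = 100, "1" = 50)
)
print(fit$confusion)
```

Naming the elements of sampsize by class level, as here, avoids relying on the order of the factor levels.
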
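The same arguments can be passed through train() in caret :
library(caret)
library(pROC)
set.seed(1401)
# metric = "ROC" needs class probabilities and twoClassSummary; caret also needs
# the class levels to be valid R names (e.g. "yes"/"no" rather than "0"/"1")
ctrl = trainControl(classProbs = TRUE, summaryFunction = twoClassSummary)
# Number of cases in the smaller class
nmin = min(table(training$class))
rf = train(class ~ ., data = training, method = "rf", ntree = 500, metric = "ROC",
           trControl = ctrl, strata = training$class, sampsize = rep(nmin, 2))
# Column 1 holds the predicted probability of the first class level
testing$rf = predict(rf, testing, type = "prob")[,1]
auc <- roc(testing$class, testing$rf, levels = rev(levels(training$class)))
plot(auc, col = rgb(1, 0, 0, .5), lwd = 2)
In the above code, sampsize = rep(nmin, 2) means both classes will have the same frequency in the sample drawn for each tree, e.g. sampsize = c(100 cases of 0, 100 cases of 1) when the smaller class has 100 cases.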
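
To make the balanced draw concrete, here is a rough base R sketch of the idea. It is not randomForest's internal code: the helper draw_balanced(), the 900/100 class split and the 500 repetitions are assumptions for illustration only. It draws the same number of rows from each class with replacement, then repeats the draw once per tree to see how much of the larger class is used across the whole forest.

```r
set.seed(1)
# Hypothetical labels: 900 non-events ("0") and 100 events ("1")
class <- factor(rep(c("0", "1"), times = c(900, 100)))
nmin <- min(table(class))  # size of the smaller class

# Draw n row indices from every class, with replacement
draw_balanced <- function(labels, n) {
  by_class <- split(seq_along(labels), labels)
  unlist(
    lapply(by_class, function(idx) idx[sample.int(length(idx), n, replace = TRUE)]),
    use.names = FALSE
  )
}

# Sample for a single tree: equal counts from both classes
one_tree <- draw_balanced(class, nmin)
counts <- table(class[one_tree])
print(counts)

# Repeat the draw for 500 trees and measure the share of non-events
# that were drawn at least once
drawn <- unique(unlist(lapply(seq_len(500), function(i) draw_balanced(class, nmin))))
coverage <- mean(which(class == "0") %in% drawn)
print(coverage)
```

Each single tree sees only a small part of the larger class, but because the draw is repeated for every tree, the forest as a whole still uses nearly all of it.
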

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has close to 7 years of experience in data science and predictive modeling. During his tenure, he has worked with global clients in various domains like retail and commercial banking, Telecom, HR and Automotive.
