Random Forest on Imbalanced Data

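With a random forest, you can rebalance the classes (in effect, oversample the events relative to the non-events) without data loss: each tree is grown on its own stratified sample, so cases left out of one tree can still be drawn for others.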

The randomForest() function in the randomForest package has two arguments that control this sampling:

1. strata - A (factor) variable that is used for stratified sampling.
2. sampsize - Size(s) of sample to draw. For classification, if sampsize is a vector whose length equals the number of strata, sampling is stratified by strata, and the elements of sampsize give the number of cases to draw from each stratum.

Example: sampsize = c(100, 50), or written with names for readability: sampsize = c('0' = 100, '1' = 50)
Meaning: each tree is grown on a random sample of 100 cases from the first class and 50 cases from the second, drawn with replacement (the default).
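
To make the per-tree draw concrete, here is a minimal base R sketch of stratified sampling with replacement. It only illustrates the idea and is not the package's internal code. The toy outcome y and the helper draw_stratified() are hypothetical names invented for this example.

```r
# Toy imbalanced outcome: 900 cases of class "0", 100 of class "1" (made-up data)
set.seed(1)
y <- factor(rep(c("0", "1"), times = c(900, 100)))

# Hypothetical helper: draw one stratified sample of row indices,
# taking sampsize[k] rows (with replacement) from the k-th level of strata
draw_stratified <- function(strata, sampsize) {
  rows_by_class <- split(seq_along(strata), strata)
  unlist(
    Map(function(rows, n) rows[sample.int(length(rows), n, replace = TRUE)],
        rows_by_class, sampsize),
    use.names = FALSE
  )
}

idx <- draw_stratified(y, sampsize = c(100, 50))
table(y[idx])  # 100 rows from class "0", 50 rows from class "1"
```

A fresh sample like idx is drawn for every tree, which is why the majority class is not thrown away overall even though each tree sees only part of it.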
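
The example below trains a random forest through caret with balanced per-tree samples, then plots the ROC curve on the test set. It assumes class is a two-level factor whose levels are valid R names (for example "no" and "yes"), which caret requires in order to compute class probabilities.

library(caret)
library(pROC)
set.seed(1401)
# metric = "ROC" needs class probabilities and the two-class summary function
ctrl <- trainControl(classProbs = TRUE, summaryFunction = twoClassSummary)
nmin <- min(table(training$class))  # size of the minority class
rf <- train(class ~ ., data = training, method = "rf", ntree = 500,
            metric = "ROC", trControl = ctrl,
            strata = training$class, sampsize = rep(nmin, 2))
# Predicted probability of the first factor level
testing$rf <- predict(rf, testing, type = "prob")[, 1]
# rev() makes the first factor level the "case" level, matching the column used above
rf_roc <- roc(testing$class, testing$rf, levels = rev(levels(training$class)))
plot(rf_roc, col = rgb(1, 0, 0, .5), lwd = 2)

In the above code, sampsize = rep(nmin, 2) draws the same number of cases from each class for every tree, where nmin is the size of the smaller class. For example, if the minority class has 100 cases, this is equivalent to sampsize = c(100, 100).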
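
The same arguments can also be passed to randomForest() directly, without caret. The sketch below uses simulated data, so the data frame dat, its columns, and the roughly 10% event rate are assumptions made for illustration only.

```r
library(randomForest)

set.seed(1401)
# Simulated imbalanced data (illustrative assumption): roughly 10% "yes" cases
n  <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- factor(ifelse(x1 + rnorm(n) > 1.8, "yes", "no"), levels = c("no", "yes"))
dat <- data.frame(x1, x2, y)

nmin <- min(table(dat$y))  # size of the minority class

# Each tree is grown on nmin cases drawn from each class
rf_bal <- randomForest(y ~ ., data = dat, ntree = 500,
                       strata = dat$y, sampsize = rep(nmin, 2))

rf_bal$confusion  # out-of-bag confusion matrix with per-class error rates
```

Comparing the class.error column with that of a forest fitted without strata and sampsize is a quick way to see how balancing trades majority-class accuracy for minority-class accuracy.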

About Author:

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 7 years of experience in data science and predictive modeling. During his tenure, he has worked with global clients in various domains like banking, Telecom, HR and Health Insurance.
