Random Forest on Imbalanced Data

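With random forest, you can rebalance the classes without losing data. Each tree is grown on its own stratified sample, so events are over-represented in every tree, while across the whole forest most of the non-event rows still get used.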

The randomForest() function in the randomForest package has 2 arguments that control this sampling :

1. strata - A (factor) variable that is used for stratified sampling.
2. sampsize - Size(s) of sample to draw. For classification, if sampsize is a vector whose length equals the number of strata, then sampling is stratified by strata, and the elements of sampsize indicate the numbers to be drawn from each stratum.

Example : sampsize = c(100, 50), which you can also write with class labels as sampsize = c('0' = 100, '1' = 50)
Meaning : To grow each tree, 100 rows are randomly drawn from the first class and 50 from the second (with replacement, by default).
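To make the per-tree draw concrete, here is a minimal base R sketch of what a stratified sample with sampsize = c('0' = 100, '1' = 50) amounts to. It only mimics the idea and is not randomForest's internal code; the toy vector y and the helper draw_stratified are made up for illustration.

```r
set.seed(1401)

# Toy outcome: 950 non-events ("0") and 50 events ("1"), illustrative data only
y <- factor(c(rep("0", 950), rep("1", 50)))

# Hypothetical helper: draw sampsize[k] row indices from each class k,
# with replacement, the way one tree's training sample would be drawn
draw_stratified <- function(strata, sampsize) {
  idx_by_class <- split(seq_along(strata), strata)
  unlist(lapply(names(sampsize), function(k) {
    rows <- idx_by_class[[k]]
    rows[sample.int(length(rows), sampsize[[k]], replace = TRUE)]
  }), use.names = FALSE)
}

rows_for_one_tree <- draw_stratified(y, c("0" = 100, "1" = 50))
table(y[rows_for_one_tree])
```

The table shows 100 rows of class 0 and 50 rows of class 1, even though the full data has 19 times more non-events than events. Repeating the draw for every tree is what lets the forest see many different non-event rows overall.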
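library(caret)
library(pROC)
# Assumes training$class is a two-level factor whose levels are valid R names
# (e.g. "event"/"nonevent"); caret needs this to compute class probabilities.
n_min <- min(table(training$class))  # size of the minority class
# metric = "ROC" is only computed when twoClassSummary and classProbs are set
ctrl <- trainControl(classProbs = TRUE, summaryFunction = twoClassSummary)
set.seed(1401)
rf <- train(class ~ ., data = training, method = "rf", ntree = 500, trControl = ctrl,
            strata = training$class, sampsize = rep(n_min, 2), metric = "ROC")
# Predicted probability of the first factor level
testing$rf <- predict(rf, testing, type = "prob")[, 1]
auc <- roc(testing$class, testing$rf, levels = rev(levels(training$class)))
plot(auc, col = rgb(1, 0, 0, .5), lwd = 2)
In the above code, sampsize = rep(n_min, 2) draws the same number of rows from each class for every tree, namely the size of the minority class. For example, with 100 events in the training data it is equivalent to sampsize = c(100, 100).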
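The same balanced sampling can also be tried without caret by calling randomForest() directly. The sketch below assumes the randomForest package is installed and uses simulated data: the data frame toy, its columns x1 and x2, and the class labels are all invented for illustration.

```r
library(randomForest)

set.seed(1401)

# Simulated imbalanced data with a small share of events, illustrative only
n <- 2000
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
p_event <- plogis(-3.5 + 1.5 * toy$x1)
toy$class <- factor(ifelse(runif(n) < p_event, "event", "nonevent"),
                    levels = c("event", "nonevent"))

n_min <- min(table(toy$class))  # size of the minority class

# Each tree is grown on n_min rows drawn from each class
rf_balanced <- randomForest(class ~ ., data = toy, ntree = 500,
                            strata = toy$class,
                            sampsize = rep(n_min, 2))

# Out-of-bag confusion matrix with per-class error rates
print(rf_balanced$confusion)
```

Comparing the per-class error in this confusion matrix with a fit that omits strata and sampsize is a quick way to see how the balanced sampling shifts errors between the two classes.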
About Author:

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he has worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and Human Resource.
