Random Forest on Imbalanced Data

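In random forest, you can rebalance the classes seen by each tree without data loss: every tree draws its own stratified sample, so cases left out of one tree's sample can still be used by other trees.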

The randomForest() function in the randomForest package has two arguments that control sampling:

1. strata - A (factor) variable that is used for stratified sampling.
2. sampsize - Size(s) of sample to draw. For classification, if sampsize is a vector whose length equals the number of strata, then sampling is stratified by strata, and the elements of sampsize give the number of cases to draw from each stratum.
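
As a rough illustration of how the two arguments fit together, here is a minimal sketch that calls randomForest() directly. It assumes the randomForest package is installed. The toy data frame and its columns x1 and x2 are invented for this example and are not part of the original post.

```r
library(randomForest)

set.seed(42)

# Hypothetical imbalanced data: 900 cases of class "0", 100 cases of class "1"
n0 <- 900
n1 <- 100
toy <- data.frame(
  x1 = c(rnorm(n0, 0), rnorm(n1, 1.5)),
  x2 = c(rnorm(n0, 0), rnorm(n1, 1.5)),
  class = factor(rep(c("0", "1"), times = c(n0, n1)))
)

# Size of the smaller class; each tree draws this many cases from every class
n_min <- min(table(toy$class))

fit <- randomForest(class ~ ., data = toy, ntree = 200,
                    strata = toy$class,           # stratify on the outcome
                    sampsize = c(n_min, n_min))   # one size per class, in level order

print(fit$confusion)
```

Here every tree is grown on 100 cases from each class, but different trees draw different majority-class cases, so the forest as a whole can still use all 900 of them.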

Example: sampsize = c(100, 50), or with the class labels as names: sampsize = c('0' = 100, '1' = 50)
Meaning: Each tree is grown on a random sample of 100 cases from the first class and 50 cases from the second class, drawn with replacement.
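
To make the per-tree draw concrete, the sketch below mimics it in base R. This only illustrates the idea and is not the package's internal code. The class vector and the helper names (idx, drawn) are made up for the example.

```r
set.seed(1)

# Hypothetical outcome: 900 cases of class "0" and 100 cases of class "1"
class <- factor(rep(c("0", "1"), times = c(900, 100)))

# Requested number of cases per class, as in the example above
sampsize <- c("0" = 100, "1" = 50)

# For each class, sample row indices with replacement from that class only
idx <- unlist(lapply(names(sampsize), function(lv) {
  pool <- which(class == lv)
  sample(pool, size = sampsize[[lv]], replace = TRUE)
}))

# Class counts in the sample that one tree would be grown on
drawn <- table(class[idx])
print(drawn)
```

Repeating this draw for every tree is what lets the forest see a controlled class mix in each tree without permanently discarding any rows.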
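A full example using caret and pROC:
library(caret)  # train(), trainControl(), twoClassSummary()
library(pROC)   # roc()
# metric = "ROC" needs class probabilities and twoClassSummary; classProbs = TRUE
# in turn needs class levels that are valid R names (e.g. "no"/"yes", not "0"/"1").
ctrl = trainControl(classProbs = TRUE, summaryFunction = twoClassSummary)
rf = train( class ~ ., data = training, method = "rf", ntree = 500, strata = training$class,
sampsize = rep(min(table(training$class)), 2), metric = "ROC", trControl = ctrl)
testing$rf = predict(rf, testing, type = "prob")[,1]
auc <- roc(testing$class, testing$rf, levels = rev(levels(training$class)))
plot(auc, col = rgb(1, 0, 0, .5), lwd = 2)
In the above code, sampsize = rep(min(table(training$class)), 2) draws the minority-class count from each class, so both classes have the same frequency in every tree's sample, e.g. sampsize = c(100, 100) when the minority class has 100 cases.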