Oversampling for Rare Event with R


R Data Science: R Programming A-Z: R For Data Science With Real Exercises!

Oversampling

Oversampling occurs when you have less than 10 events per independent variable in your logistic regression model. Suppose, there are 9900 non-events and 100 events in 10k cases. You need to oversample the events (decrease the volume of non-events so that proportion of events and non-events gets balanced).
You take a small proportion of the many non-event cases and a large proportion of the relatively few event cases.
R Code: Oversampling for Rare Event Model
# Read data filelibrary(caret)
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
table(mydata$admit)
In the program below, we are keeping all the events and same number of non-events.
#OverSampling - 50:50
mydata$admit = as.factor(mydata$admit)
set.seed(9)
down_train <- downSample(x = subset(mydata, select = -c(admit)), y = mydata$admit, yname = "admit")
table(down_train$admit)
In the program below, we are keeping % of non-events as to maintain the event ratio 40% post oversampling. 
samplepcnt = 40
minClass <- floor(min(table(mydata$admit))*(100/samplepcnt-1))
dt =  subset(mydata, admit==0)
set.seed(112)
dt2 = sort(sample(nrow(dt), minClass))
dt3 = dt[dt2,]
dt4 =  rbind(subset(mydata, admit==1),dt3)
nrow(dt4)

Coursera Data Science

R Tutorials : 75 Free R Tutorials

Get Free Email Updates :
*Please confirm your email address by clicking on the link sent to your Email*

Related Posts:

0 Response to "Oversampling for Rare Event with R"

Post a Comment

Next → ← Prev