Oversampling for Rare Event with R

Oversampling

Oversampling is used when events are rare, for example when you have fewer than 10 events per independent variable in your logistic regression model. Suppose there are 9,900 non-events and 100 events in 10,000 cases, an event rate of 1%. You need to oversample the events, that is, sample them at a higher rate than the non-events. In practice this means decreasing the volume of non-events so that the proportion of events and non-events becomes balanced.
You take a small proportion of the many non-event cases and a large proportion of the relatively few event cases.
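To make the 10k example concrete, here is a minimal base R sketch on made-up data. The data frame `cases` and its columns are assumptions for illustration only; they are not part of the dataset used later in this post.

```r
# Synthetic illustration (made-up data): 10,000 cases, of which 100 are events
set.seed(1)
cases <- data.frame(id = 1:10000, event = rep(c(1, 0), times = c(100, 9900)))

events     <- cases[cases$event == 1, ]
non_events <- cases[cases$event == 0, ]

# Keep every event, but only a random sample of non-events of the same size
kept_non_events <- non_events[sample(nrow(non_events), nrow(events)), ]
balanced <- rbind(events, kept_non_events)

table(balanced$event)   # class counts after sampling
mean(cases$event)       # event rate before sampling
mean(balanced$event)    # event rate after sampling
```

Here all 100 events are kept, while only 100 of the 9,900 non-events are kept, which moves the event rate from 1% to 50%.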
R Code: Oversampling for Rare Event Model
# Load caret (provides downSample) and read the data file
library(caret)
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
# Count non-events (admit = 0) and events (admit = 1)
table(mydata$admit)
The program below keeps all the events and an equal number of randomly selected non-events.
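# Oversampling - 50:50
# downSample() expects the outcome as a factor
mydata$admit <- as.factor(mydata$admit)
set.seed(9)
# Randomly sample the larger class down to the size of the smaller class
down_train <- downSample(x = subset(mydata, select = -c(admit)), y = mydata$admit, yname = "admit")
table(down_train$admit)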
The program below keeps all the events and just enough non-events that events make up 40% of the sample after oversampling.
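# Target share of events (%) in the final sample
samplepcnt <- 40
# Number of non-events to keep = events * (100/samplepcnt - 1).
# min(table(...)) is the event count, assuming events are the smaller class.
# Despite its name, minClass holds the non-event count to sample.
minClass <- floor(min(table(mydata$admit))*(100/samplepcnt-1))
# All non-events
dt <- subset(mydata, admit==0)
set.seed(112)
# Row positions of the randomly selected non-events
dt2 <- sort(sample(nrow(dt), minClass))
dt3 <- dt[dt2,]
# Combine all events with the sampled non-events
dt4 <- rbind(subset(mydata, admit==1),dt3)
nrow(dt4)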
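The multiplier in the calculation above follows from a one-line derivation: if all E events are kept and events should make up a share p of the final sample, the number of non-events N must satisfy E / (E + N) = p, so N = E * (1/p - 1). For p = 0.4 this gives N = 1.5 * E. The sketch below wraps that logic in two helper functions. Both `non_events_needed` and `sample_to_event_rate` are hypothetical names introduced here, not functions from caret or base R, and the toy data is made up.

```r
# Hypothetical helper: non-events to keep so that events make up `target_rate`
# of the final sample, assuming every event is kept.
non_events_needed <- function(n_events, target_rate) {
  stopifnot(target_rate > 0, target_rate < 1)
  floor(n_events * (1 / target_rate - 1))
}

# Hypothetical helper: keep all events (outcome == 1) and a random sample of
# non-events (outcome == 0) sized to reach the target event rate.
sample_to_event_rate <- function(data, outcome, target_rate, seed = NULL) {
  events     <- data[data[[outcome]] == 1, , drop = FALSE]
  non_events <- data[data[[outcome]] == 0, , drop = FALSE]
  n_keep <- non_events_needed(nrow(events), target_rate)
  if (n_keep > nrow(non_events)) {
    stop("Not enough non-events to reach the requested event rate")
  }
  if (!is.null(seed)) set.seed(seed)
  rbind(events, non_events[sample(nrow(non_events), n_keep), , drop = FALSE])
}

# Made-up data: 100 events and 900 non-events
toy <- data.frame(x = rnorm(1000), y = rep(c(1, 0), times = c(100, 900)))
out <- sample_to_event_rate(toy, "y", target_rate = 0.4, seed = 112)
table(out$y)   # class counts after sampling
mean(out$y)    # event share after sampling
```

The explicit check on `n_keep` is a design choice: `sample()` would otherwise fail with a less helpful message when the requested event rate is lower than the data can support.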



About Author:

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has close to 7 years of experience in data science and predictive modeling. During his tenure, he has worked with global clients in various domains like retail and commercial banking, Telecom, HR and Automotive.


