Oversampling for Rare Event with R

Oversampling

Oversampling occurs when you have less than 10 events per independent variable in your logistic regression model. Suppose, there are 9900 non-events and 100 events in 10k cases. You need to oversample the events (decrease the volume of non-events so that proportion of events and non-events gets balanced).
You take a small proportion of the many non-event cases and a large proportion of the relatively few event cases.
R Code: Oversampling for Rare Event Model
# Read data filelibrary(caret)
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
table(mydata$admit)
In the program below, we are keeping all the events and same number of non-events.
#OverSampling - 50:50
mydata$admit = as.factor(mydata$admit)
set.seed(9)
down_train <- downSample(x = subset(mydata, select = -c(admit)), y = mydata$admit, yname = "admit")
table(down_train$admit)
In the program below, we are keeping % of non-events as to maintain the event ratio 40% post oversampling. 
samplepcnt = 40
minClass <- floor(min(table(mydata$admit))*(100/samplepcnt-1))
dt =  subset(mydata, admit==0)
set.seed(112)
dt2 = sort(sample(nrow(dt), minClass))
dt3 = dt[dt2,]
dt4 =  rbind(subset(mydata, admit==1),dt3)
nrow(dt4)

ListenData Logo
Spread the Word!
Share
Related Posts
About Author:
Deepanshu Bhalla

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he has worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and Human Resource.

0 Response to "Oversampling for Rare Event with R"

Post a Comment

Next → ← Prev
Looks like you are using an ad blocker!

To continue reading you need to turnoff adblocker and refresh the page. We rely on advertising to help fund our site. Please whitelist us if you enjoy our content.