Oversampling for Rare Events with R


Oversampling is typically used when you have fewer than 10 events per independent variable in your logistic regression model. Suppose there are 9,900 non-events and 100 events in 10,000 cases. You need to oversample the events, which in practice means decreasing the volume of non-events so that the proportions of events and non-events become more balanced.
You take a small proportion of the many non-event cases and a large proportion of the relatively few event cases.
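To make the arithmetic concrete, here is a minimal base-R sketch using the 9,900 / 100 counts from the example above. The 5% keep rate for non-events and the seed are illustrative assumptions, not recommended values.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Outcome vector matching the example: 9900 non-events (0) and 100 events (1)
y <- c(rep(0, 9900), rep(1, 100))

event_idx    <- which(y == 1)
nonevent_idx <- which(y == 0)

# Keep every event, but only a small random share of the non-events
# (5% is an illustrative assumption)
keep_nonevents <- sample(nonevent_idx, size = round(0.05 * length(nonevent_idx)))
y_sampled <- y[c(event_idx, keep_nonevents)]

mean(y)          # event rate before sampling
mean(y_sampled)  # event rate after sampling
```

All 100 events survive, while the non-events shrink to a fraction of their original count, so the event rate in the sampled data is much higher than in the full data.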
R Code: Oversampling for Rare Event Model
# Load caret (provides downSample) and read the data file
library(caret)
mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
In the program below, we keep all the events and an equal number of randomly selected non-events.
#OverSampling - 50:50
mydata$admit = as.factor(mydata$admit)
down_train <- downSample(x = subset(mydata, select = -c(admit)), y = mydata$admit, yname = "admit")
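For readers who want to see what a 50:50 split involves without relying on caret, the following is a rough base-R sketch of the same idea. The data frame `toy_data` is made up for illustration, and this is not caret's internal implementation, only an approximation of its effect.

```r
set.seed(42)  # arbitrary seed for reproducibility

# toy_data is hypothetical: 30 events and 170 non-events
toy_data <- data.frame(
  x     = rnorm(200),
  admit = factor(c(rep(1, 30), rep(0, 170)))
)

# Size of the smaller class
n_min <- min(table(toy_data$admit))

# Draw n_min rows at random from each class, then stack them
idx_by_class <- split(seq_len(nrow(toy_data)), toy_data$admit)
keep <- unlist(lapply(idx_by_class, function(i) i[sample.int(length(i), n_min)]))
balanced <- toy_data[keep, ]

# Both classes should now have the same number of rows
table(balanced$admit)
```

Calling `table()` on the target column is also a quick way to confirm that a down-sampled data set is balanced as intended.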
In the program below, we keep all the events and only as many non-events as needed so that events make up 40% of the data after sampling.
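# Target event rate (%) after sampling
samplepcnt = 40
# Non-events to keep = events * (100/samplepcnt - 1); assumes events are the smaller class
minClass <- floor(min(table(mydata$admit))*(100/samplepcnt-1))
# All non-events
dt = subset(mydata, admit==0)
# Randomly pick minClass rows from the non-events
dt2 = sort(sample(nrow(dt), minClass))
dt3 = dt[dt2,]
# Combine all events with the sampled non-events
dt4 = rbind(subset(mydata, admit==1),dt3)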
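The count of non-events to keep follows from the target rate: with E events and a target event share of p percent, the number of non-events needed is E * (100/p - 1). A minimal sketch that wraps this in a function and checks the resulting rate is shown below. The helper `downsample_to_rate()` and the data frame `toy` are hypothetical names introduced here for illustration, and the sketch assumes there are enough non-events to reach the target.

```r
# downsample_to_rate() is a hypothetical helper, not part of caret or base R
downsample_to_rate <- function(data, target, event_level, target_pct) {
  is_event <- data[[target]] == event_level
  n_events <- sum(is_event)

  # Non-events needed so that events make up target_pct percent of the result
  n_keep <- floor(n_events * (100 / target_pct - 1))

  nonevent_rows <- which(!is_event)
  if (n_keep > length(nonevent_rows)) {
    stop("Not enough non-events to reach the requested event rate")
  }

  # Keep all events plus a random subset of the non-events
  keep <- nonevent_rows[sample.int(length(nonevent_rows), n_keep)]
  data[c(which(is_event), keep), ]
}

set.seed(123)  # arbitrary seed for reproducibility

# toy is made-up data: 100 events and 900 non-events
toy <- data.frame(x = rnorm(1000), admit = c(rep(1, 100), rep(0, 900)))

out <- downsample_to_rate(toy, "admit", 1, 40)

# Share of each class after sampling
prop.table(table(out$admit))
```

Checking the class shares with `prop.table(table(...))` after sampling is a simple guard against an off-by-one in the formula or a mislabelled event level.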

About Author:

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 8 years of experience in data science. During his tenure, he has worked with global clients in various domains like Banking, Insurance, Telecom and Human Resource.
