Splitting Data into Training and Test Sets with R

In this tutorial, you will learn how to split a sample into training and test data sets with R.

The following code randomly selects 70% of the rows for the training set and puts the remaining 30% into the test set.

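# Read the data file
data <- read.csv("c:/datafile.csv")

# Randomly pick 70% of the row numbers, without replacement
dt <- sort(sample(nrow(data), floor(nrow(data) * 0.7)))
train <- data[dt, ]   # rows selected for training
test  <- data[-dt, ]  # remaining rows go to the test set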

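Here, the sample() function randomly picks 70% of the row numbers from the data set. It samples without replacement, so no row is selected twice. The sort() call only puts the selected row numbers in ascending order; it does not change which rows are selected.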
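
The split can be sanity-checked on a built-in data set. The sketch below is only an illustration: it uses iris in place of the data file and fixes an arbitrary seed so the result is repeatable (both are assumptions, not part of the original example). It confirms that the two sets do not overlap and that together they cover every row.

```r
# Sketch: 70/30 random split on the built-in iris data (stands in for `data`)
set.seed(123)                        # arbitrary seed, only for repeatability
n  <- nrow(iris)                     # iris has 150 rows
dt <- sort(sample(n, round(0.7 * n)))  # 105 distinct row numbers, sorted

train <- iris[dt, ]
test  <- iris[-dt, ]

nrow(train)                          # 105
nrow(test)                           # 45

# No row appears in both sets, and every row lands in exactly one of them
length(intersect(rownames(train), rownames(test)))  # 0
nrow(train) + nrow(test) == n                        # TRUE
```

Sorting the indices keeps the rows of train in their original file order, which is harmless for most models. If a shuffled order is wanted, dropping sort() is enough.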

Method 2 : Maintain the same event rate in both the training and validation data sets.
Here we perform stratified sampling using the caret package. The createDataPartition function from caret generates a stratified random split of the data. In simple words, suppose you have an event rate (the mean of a binary dependent variable) of 10%. The program below makes sure this 10% rate holds, approximately, in both the training and validation data sets. 70% of the data goes into the training set and the remaining 30% into the validation set.

library(caret)
set.seed(3456)
trainIndex <- createDataPartition(data$FD, p = .7,
                                  list = FALSE,
                                  times = 1)
Train <- data[ trainIndex,]
Valid <- data[-trainIndex,]

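In the above program, FD is the dependent (target) variable with two values, 1 and 0. Make sure it is stored as a factor (for example, convert it with as.factor() before the split), because createDataPartition samples within each class only when the outcome is a factor.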
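
To see what stratification does, here is a minimal base R sketch that does not use caret. It builds a hypothetical data frame with a 10% event rate, samples 70% of the rows within each class, and compares the event rate in the two parts. The toy data, the column x, and the seed are assumptions for illustration; only the name FD comes from the program above. It shows the idea of a stratified split and is not meant to reproduce how createDataPartition works internally.

```r
# Hypothetical toy data: 1000 rows, 10% events in the target FD
set.seed(3456)
toy <- data.frame(x  = rnorm(1000),
                  FD = factor(rep(c(1, 0), times = c(100, 900))))

# Sample 70% of the row numbers separately within each class of FD
idx_by_class <- split(seq_len(nrow(toy)), toy$FD)
trainIndex <- unlist(lapply(idx_by_class, function(rows) {
  rows[sample.int(length(rows), round(0.7 * length(rows)))]
}), use.names = FALSE)

Train <- toy[trainIndex, ]
Valid <- toy[-trainIndex, ]

# Event rate (share of FD == 1) in each part
mean(Train$FD == 1)   # 0.1
mean(Valid$FD == 1)   # 0.1
```

Because the sampling is done class by class, both parts keep the 10% event rate exactly here. A plain random split would only match it on average.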

About Author:
Deepanshu Bhalla

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and HR.

9 Responses to "Splitting Data into Training and Test Sets with R"

Saurabh, July 17, 2017 at 11:05 PM
This won't randomize the order, bad option
Unknown, November 7, 2017 at 12:10 AM
One more way to split data into two parts:

install.packages("caTools")
library(caTools)

iris <- iris

# sample.split() expects a vector of labels, not the whole data frame
iris$spl <- sample.split(iris$Species, SplitRatio = 0.7)
head(iris)
train <- subset(iris, spl == TRUE)
test <- subset(iris, spl == FALSE)
Bh@v@n@, February 22, 2019 at 2:08 PM
It helped me a lot. Thank you
Vaisakh, March 20, 2019 at 7:03 AM
Very helpful and to the point, thanks Deepanshu!
Anonymous, January 20, 2021 at 12:46 AM
This is a really helpful post - thank you.