Splitting Data into Training and Test Sets with R

Deepanshu Bhalla
In this tutorial, you will learn how to split a sample into training and test data sets with R.

The following code randomly selects 70% of the rows for the training set and assigns the remaining 30% to the test set.
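# Read the data (adjust the path to your own file)
data <- read.csv("c:/datafile.csv")

# Randomly pick 70% of the row indices; sorting keeps the original row order
dt <- sort(sample(nrow(data), floor(nrow(data) * 0.7)))
train <- data[dt, ]   # rows selected for training
test <- data[-dt, ]   # remaining rows for testing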
Here the sample() function randomly picks 70% of the row numbers from the data set. It samples without replacement, so no row is selected twice and the training and test sets do not overlap.
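Because the split is random, the rows you get will differ on each run unless you fix the seed. The sketch below is one way to make the split reproducible and to check it. It uses the built-in mtcars data as a stand-in for your own file, and the seed value 42 is an arbitrary choice; both are assumptions for illustration only.

```r
# Stand-in data: mtcars ships with R (32 rows); replace with your own data frame
data <- mtcars

# Fix the seed so the same rows are picked on every run (42 is arbitrary)
set.seed(42)

# Number of training rows: 70% of the data, rounded down
n_train <- floor(0.7 * nrow(data))

# Sample row indices without replacement, then split
dt <- sort(sample(nrow(data), n_train))
train <- data[dt, ]
test <- data[-dt, ]

# Sanity checks: sizes add up and no row appears in both sets
nrow(train)
nrow(test)
length(intersect(rownames(train), rownames(test)))
```

With 32 rows, this puts 22 rows in train and 10 in test, and the intersection of row names is empty.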

Method 2: Maintain the same event rate in both the training and validation data sets.
Here we perform stratified sampling using the caret package. The createDataPartition function from caret generates a stratified random split of the data. In simple words, suppose your event rate (the mean of a binary dependent variable) is 10%. The program below makes sure this 10% rate holds, as closely as possible, in both the training and validation data sets. 70% of the data goes into training and the remaining 30% into validation.
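library(caret)
set.seed(3456)  # fix the seed so the split is reproducible
# Stratified sample of row indices: 70% of the rows within each level of FD
trainIndex <- createDataPartition(data$FD, p = .7,
                                  list = FALSE,
                                  times = 1)
Train <- data[trainIndex, ]   # 70% for training
Valid <- data[-trainIndex, ]  # remaining 30% for validation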
In the above program, FD is the dependent (target) variable, which takes two values, 1 and 0. Make sure it is stored as a factor.
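To see what "same event rate in both sets" means in practice, here is a minimal base-R sketch of the stratified idea. It is not caret's implementation: it simply samples 70% of the rows within each level of FD. The data frame is made up for illustration (1,000 rows with exactly 100 events, an assumption), and the FD column starts as numeric so the conversion to a factor is shown too.

```r
# Hypothetical data: 1000 rows, 100 events (10% event rate)
data <- data.frame(
  x  = seq_len(1000),
  FD = rep(c(0, 1), times = c(900, 100))
)

# Store the target as a factor, as recommended above
data$FD <- as.factor(data$FD)

set.seed(3456)

# Group the row numbers by class, then sample 70% within each class
idx_by_class <- split(seq_len(nrow(data)), data$FD)
trainIndex <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), floor(0.7 * length(idx)))]
}), use.names = FALSE)

Train <- data[trainIndex, ]
Valid <- data[-trainIndex, ]

# Event rate in each set
mean(Train$FD == "1")
mean(Valid$FD == "1")
```

Since 70% is taken from each class separately, Train gets 630 non-events and 70 events, Valid gets 270 and 30, and both keep the 10% event rate. With real data the class counts rarely divide this evenly, so the two rates will be close rather than identical.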
About Author:
Deepanshu Bhalla

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and HR.

15 Responses to "Splitting Data into Training and Test Sets with R"
  1. This comment has been removed by the author.

  2. This won't randomize the order, bad option

    Replies
    1. What do you mean by "This won't randomize the order"? The sample function randomizes the order.

    2. How can I select specific data? I don't want to randomize the data.

  3. One more way to split data into two parts:

    install.packages("caTools")
    library(caTools)

    data(iris)

    # sample.split expects the vector of outcome labels, not the whole data frame
    iris$spl <- sample.split(iris$Species, SplitRatio = 0.7)
    head(iris)
    train <- subset(iris, spl == TRUE)
    test <- subset(iris, spl == FALSE)

    Replies
    1. This is the correct solution when you have a dependent column. Thank you, Rishabh.

    2. But you split only a variable in the iris dataset... Will this apply to all the other variables?

  4. Very helpful and to the point, thanks Deepanshu!

  5. This is a really helpful post - thank you.

  6. I really appreciate your work, thanks for sharing.

  7. Wrong approach. If the data is in order, you will get a poor split: your test data might end up being just one logical group of the data.
    set.seed(123) # Produces the same random set every time the code is run, even when run by a co-worker

    data1 <- runif(nrow(data)) # Generates one random number per row and saves them as data1

    data2 <- data[order(data1), ] # Reorders the rows by the random numbers in data1 and saves the shuffled data as data2

  8. The work of your website is really commendable, thanks for sharing the post

  9. This is a really helpful post, thanks.

  10. Great blog, I really love your content. Thanks for sharing a nice post.
