Splitting Data into Training and Test Sets with R

In this tutorial, you will learn how to split a sample into training and test data sets with R.

The following code randomly assigns 70% of the rows to the training set and the remaining 30% to the test set.
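# Read the data (adjust the path to your own file)
data <- read.csv("c:/datafile.csv")

# Randomly pick 70% of the row indices, without replacement
dt <- sort(sample(nrow(data), floor(nrow(data) * 0.7)))
train <- data[dt, ]
test <- data[-dt, ]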

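Here, the sample() function randomly picks 70% of the rows from the data set. It samples without replacement, so no row is selected twice. The sort() call only orders the selected row indices; it does not change which rows are picked.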
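To check that the split behaves as described, here is a minimal sketch. It assumes the built-in mtcars data set as a stand-in for the CSV file, and the seed value is arbitrary; neither comes from the original example.

```r
# Assumption: mtcars stands in for the CSV file read in the tutorial
set.seed(123)                 # assumed seed; any fixed value makes the split repeatable
data <- mtcars
n <- nrow(data)

# sample() without replacement: every row index appears at most once
dt <- sort(sample(n, floor(n * 0.7)))
train <- data[dt, ]
test  <- data[-dt, ]

# Sanity checks: the two sets cover all rows and share none
nrow(train) + nrow(test) == n
length(intersect(rownames(train), rownames(test))) == 0
```

Setting the seed before calling sample() is what makes the same rows land in the training set on every run.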

Method 2: To maintain the same event rate (class proportion) in both the training and validation data sets.
library(caret)
set.seed(3456)
trainIndex <- createDataPartition(data$FD, p = .7,
                                  list = FALSE,
                                  times = 1)
Train <- data[ trainIndex,]
Valid <- data[-trainIndex,]

In the above program, FD is the dependent variable, which takes two values, 1 and 0. Make sure it is stored as a factor (for example, with as.factor) so that createDataPartition samples within each class.
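To see what keeping the event rate equal means in practice, the sketch below does a stratified split in base R. It is an illustration only, not caret's implementation: the toy data frame, its 20% event rate, and the helper stratified_index() are all assumptions made for this example.

```r
# Assumptions: a toy data frame with a binary factor FD (20% events);
# stratified_index() is a hypothetical helper, not part of caret.
set.seed(3456)
data <- data.frame(x  = rnorm(200),
                   FD = factor(rep(c(0, 1), times = c(160, 40))))

stratified_index <- function(y, p) {
  # Split row numbers by class, then sample a fraction p within each class
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), round(length(idx) * p))]
  })
  sort(unlist(picked, use.names = FALSE))
}

trainIndex <- stratified_index(data$FD, p = 0.7)
Train <- data[trainIndex, ]
Valid <- data[-trainIndex, ]

# Compare the class proportions in the two sets
prop.table(table(Train$FD))
prop.table(table(Valid$FD))
```

Because the sampling happens separately inside each class, the share of events in Train and Valid stays close to the share in the full data, which a plain random split does not guarantee for small or imbalanced samples.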
About Author:

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 8 years of experience in data science. During his tenure, he has worked with global clients in various domains like Banking, Insurance, Telecom and Human Resource.

7 Responses to "Splitting Data into Training and Test Sets with R"
  1. This comment has been removed by the author.

  2. This won't randomize the order, bad option

    1. What do you mean by "This won't randomize the order"? The sample function randomizes the order.

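  3. One more way to split data into two parts:

    # install.packages("caTools")  # run once if the package is not installed
    library(caTools)

    iris <- iris  # work on a copy of the built-in data set

    # sample.split() expects the outcome vector, not the whole data frame
    iris$spl <- sample.split(iris$Species, SplitRatio = 0.7)
    head(iris)
    train <- subset(iris, spl == TRUE)
    test <- subset(iris, spl == FALSE)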

    1. This is the correct solution when you have a dependent column. Thank you, Rishabh.

  4. Very helpful and to the point, thanks Deepanshu!

