The following code randomly selects 70% of the rows for the training set and assigns the remaining 30% to the test set.

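    # Read the data and draw 70% of the row indices at random
    data <- read.csv("c:/datafile.csv")
    dt = sort(sample(nrow(data), nrow(data)*.7))
    # Selected rows form the training set; the rest form the test set
    train <- data[dt,]
    test <- data[-dt,]

Here, the **sample()** function randomly picks 70% of the rows from the data set. It is **sampling without replacement**, so no row is selected twice.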
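To make the split reproducible and to verify it, a minimal sketch follows. It uses the built-in `iris` data set in place of the CSV file, and the seed value 123 is arbitrary; both are assumptions made only for illustration.

```r
# Illustrative sketch: the built-in iris data stands in for the CSV file.
data <- iris

# Fix the seed (arbitrary value) so the same rows are picked on every run.
set.seed(123)

# Number of training rows: 70% of the data, rounded to a whole number.
n_train <- round(0.7 * nrow(data))

# Draw row indices without replacement (the default for sample()).
dt <- sort(sample(nrow(data), n_train))

train <- data[dt, ]
test  <- data[-dt, ]

# Sanity checks: the sizes add up and no row appears in both sets.
nrow(train)                                          # 105
nrow(test)                                           # 45
length(intersect(rownames(train), rownames(test)))   # 0
```

Sorting the indices only keeps the selected rows in their original order; which rows are selected is still random.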

**Method 2:** Maintain the same event rate in both the training and validation datasets.

Here we perform stratified sampling with the caret package. Its `createDataPartition` function generates a stratified random split of the data. In simple words, suppose the event rate (the mean of a binary dependent variable) is 10%. The program below keeps this rate at approximately 10% in both the training and validation datasets. 70% of the data goes into the training dataset and the remaining 30% into the validation dataset.

    library(caret)
    set.seed(3456)
    trainIndex <- createDataPartition(data$FD, p = .7, list = FALSE, times = 1)
    Train <- data[ trainIndex,]
    Valid <- data[-trainIndex,]

In the above program, **FD** is a dependent (target) variable with two values, 1 and 0. Make sure it is stored as a **factor** so that the split is stratified by class.
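To see what a stratified split does to the event rate, the idea can be sketched in base R on simulated data. This is not caret's implementation; the data frame, the column `x`, and the helper name `idx_by_class` are hypothetical and exist only for this illustration.

```r
# Illustrative sketch with simulated data (not caret's implementation).
set.seed(3456)
data <- data.frame(
  x  = rnorm(1000),
  FD = c(rep(1, 100), rep(0, 900))   # 10% event rate
)

# Store the target as a factor.
data$FD <- factor(data$FD, levels = c(0, 1))

# Sample 70% of the row indices within each class separately.
idx_by_class <- split(seq_len(nrow(data)), data$FD)
trainIndex <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}), use.names = FALSE)

Train <- data[trainIndex, ]
Valid <- data[-trainIndex, ]

# Event rate in each set; both are 0.10 for this simulated data.
mean(Train$FD == "1")
mean(Valid$FD == "1")
```

Because the 70% sample is drawn inside each class, the share of events is preserved in both sets, which a plain random split only achieves on average.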

This won't randomize the order, bad option

What do you mean by "This won't randomize the order"? The sample function randomizes the order.

How can I select specific data? I don't want to randomize the data.

One more way to split data into two parts:

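    install.packages("caTools")
    library(caTools)
    iris <- iris
    # Pass the target column, not the whole data frame, so rows are split on it
    iris$spl <- sample.split(iris$Species, SplitRatio = 0.7)
    head(iris)
    # Rows flagged TRUE go to the training set, the rest to the test set
    train <- subset(iris, spl == TRUE)
    test <- subset(iris, spl == FALSE)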

This is the correct solution when you have a dependent column. Thank you, Rishabh.

But you split only a variable in the iris dataset... Will this apply to all the other variables?

It helped me a lot. Thank you.

Very helpful and to the point, thanks Deepanshu!

This is a really helpful post - thank you.
