The following code randomly selects 70% of the rows as the training set and assigns the remaining 30% to the test set.
data <- read.csv("c:/datafile.csv")
dt <- sort(sample(nrow(data), nrow(data) * 0.7))   # randomly draw 70% of the row indices
train <- data[dt, ]
test <- data[-dt, ]

Here the sample() function randomly picks 70% of the rows from the data set. It is sampling without replacement.
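If you need the same split every time the script runs, fix the random seed before calling sample(). A minimal sketch, assuming the same file path and 70% ratio as above (the seed value 123 is arbitrary):

set.seed(123)   # any fixed value makes the split reproducible
data <- read.csv("c:/datafile.csv")
dt <- sort(sample(nrow(data), nrow(data) * 0.7))
train <- data[dt, ]
test <- data[-dt, ]
nrow(train) / nrow(data)   # should be close to 0.7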
Method 2 : Maintain the same event rate in both the training and validation datasets.
Here we are performing stratified sampling using the caret package. The createDataPartition function from the caret package generates a stratified random split of the data. In simple words, suppose your event rate (the mean of a binary dependent variable) is 10%. The program below makes sure this 10% holds in both the training and validation datasets: 70% of the data goes into training and the remaining 30% into validation.
library(caret)
set.seed(3456)
trainIndex <- createDataPartition(data$FD, p = 0.7, list = FALSE, times = 1)
Train <- data[trainIndex, ]
Valid <- data[-trainIndex, ]

In the above program, FD is the dependent (target) variable, which takes two values: 1 and 0. Make sure it is defined as a factor.
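To confirm that the event rate is actually preserved, a quick check along these lines should work (a sketch, assuming FD is a factor with levels 0 and 1 as described above):

prop.table(table(data$FD))    # overall event rate
prop.table(table(Train$FD))   # event rate in the 70% training partition
prop.table(table(Valid$FD))   # event rate in the 30% validation partition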
This won't randomize the order, bad option
What do you mean by "This won't randomize the order"? The sample() function does randomize the order.
How can I select specific data? I don't want to randomize the data.
One more way to split the data into two parts:
install.packages("caTools")
library(caTools)
iris <- iris   # copy the built-in iris dataset
iris$spl <- sample.split(iris$Species, SplitRatio = 0.7)   # stratify the split on the dependent column
head(iris)
train <- subset(iris, iris$spl == TRUE)
test <- subset(iris, iris$spl == FALSE)
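As a quick sanity check on the split above (a sketch, assuming the split was made on iris$Species as shown), compare the class proportions and the overall split ratio:

prop.table(table(train$Species))   # class proportions in the training subset
prop.table(table(test$Species))    # class proportions in the test subset
nrow(train) / nrow(iris)           # should be close to the 0.7 SplitRatio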
This is the correct solution when you have a dependent column. Thank you, Rishabh.
But you split on only one variable in the iris dataset... Will this apply to all the other variables?
It helped me a lot. Thank you.
Very helpful and to the point, thanks Deepanshu!
ReplyDeleteThis is a really helpful post - thank you.