The following code randomly selects 70% of the rows for the training set and puts the remaining 30% in the test set.
data <- read.csv("c:/datafile.csv")
# Randomly pick 70% of the row indices
dt <- sort(sample(nrow(data), nrow(data) * .7))
train <- data[dt, ]
test <- data[-dt, ]
Here the sample() function randomly picks 70% of the rows from the data set. It samples without replacement.
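For a reproducible version of the split above, a minimal sketch follows. It assumes the built-in iris data set in place of c:/datafile.csv, and the seed value 123 is arbitrary. The floor() call only makes the training size an explicit whole number.

```r
# Sketch: reproducible 70/30 split (iris stands in for the CSV file)
data <- iris

set.seed(123)                        # arbitrary seed; fixes the random draw
n_train <- floor(0.7 * nrow(data))   # whole number of training rows
dt <- sort(sample(nrow(data), n_train))

train <- data[dt, ]
test  <- data[-dt, ]

# Sanity checks: the sizes add up and no row lands in both sets
nrow(train) + nrow(test) == nrow(data)
length(intersect(rownames(train), rownames(test))) == 0
```

Running the block twice with the same seed should give the same train and test rows, which makes results easier to compare between runs.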
Method 2: Maintain the same event rate in both the training and validation datasets.
Here we perform stratified sampling using the caret package. The createDataPartition function from the caret package generates a stratified random split of the data. In simple words, suppose you have an event rate (the mean of a binary dependent variable) of 10%. The program below makes sure this rate of roughly 10% holds in both the training and validation datasets. 70% of the data goes into training and the remaining 30% into validation.
library(caret)
set.seed(3456)
# p = .7 sends 70% of the rows to training; list = FALSE returns a matrix of row indices
trainIndex <- createDataPartition(data$FD, p = .7, list = FALSE, times = 1)
Train <- data[trainIndex, ]
Valid <- data[-trainIndex, ]
In the above program, FD is the dependent (target) variable with two values, 1 and 0. Make sure it is stored as a factor.
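To see what a stratified split does to the event rate, here is a minimal sketch in base R. It is not caret's implementation; it only draws 70% of the rows within each level of the target. The data frame toy and its column x are made up for illustration, and FD is built with a 10% event rate.

```r
# Sketch: stratified 70/30 split in base R on a made-up data set.
# 'toy' and 'x' are hypothetical; FD is a 0/1 target with a 10% event rate.
set.seed(3456)
toy <- data.frame(x  = rnorm(1000),
                  FD = factor(rep(c(1, 0), times = c(100, 900))))

# Draw 70% of the row indices within each level of FD
idx_by_level <- split(seq_len(nrow(toy)), toy$FD)
trainIndex <- unlist(lapply(idx_by_level, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))

Train <- toy[trainIndex, ]
Valid <- toy[-trainIndex, ]

# Event rate in each set; both should match the overall 10% rate
mean(Train$FD == 1)
mean(Valid$FD == 1)
```

Because each level is sampled separately, neither set can end up with a very different share of events, which a plain random split does not guarantee on small or imbalanced data.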
This won't randomize the order, bad option.
What do you mean by "This won't randomize the order"? The sample function randomizes the order.
How can I select specific data? I don't want to randomize the data.
One more way to split the data into two parts:
install.packages("caTools")
library(caTools)
# sample.split expects the outcome vector, not the whole data frame
iris$spl <- sample.split(iris$Species, SplitRatio = 0.7)
head(iris)
train <- subset(iris, spl == TRUE)
test <- subset(iris, spl == FALSE)
This is the correct solution when you have a dependent column. Thank you, Rishabh.
But you split only a variable in the iris dataset... Will this apply to all the other variables?
It helped me a lot. Thank you.
Very helpful and to the point, thanks Deepanshu!
This is a really helpful post - thank you.
I really appreciate your work, thanks for sharing.
Wrong approach. If the data is in order, you will get a poor split; your test data might end up being just one logical group of the data.
set.seed(123) # Produces the same random set every time the code is run, even when run by a co-worker
data1 <- runif(nrow(data)) # Draws one random number per row and saves them as data1
data2 <- data[order(data1), ] # Reorders the rows by those random numbers and saves the shuffled data set as data2
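The shuffle in the comment above stops before the split itself. A minimal sketch of how the two steps could fit together is shown here, assuming the built-in iris data set (which is stored sorted by Species) as a stand-in and an arbitrary seed. Note that sample() in the post already draws rows at random, so this is an alternative route and not a required fix.

```r
# Sketch: shuffle the rows, then split by position (iris is a stand-in data set)
data <- iris                         # rows are stored sorted by Species

set.seed(123)                        # arbitrary seed for reproducibility
shuffled <- data[order(runif(nrow(data))), ]   # random row order

n_train <- floor(0.7 * nrow(shuffled))
train <- shuffled[seq_len(n_train), ]
test  <- shuffled[-seq_len(n_train), ]

# Inspect how the species are spread across the test set
table(test$Species)
```

Splitting the unshuffled rows by position would put only the last rows in the test set, which is the ordering problem the comment warns about.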
The work of your website is really commendable, thanks for sharing the post.
This is a really helpful post, thanks.
Great blog, I really love your content, thanks for sharing this nice post.