In R, there is a package called caret, which stands for Classification And REgression Training. It makes predictive modeling easy: it can run most of the common predictive modeling techniques with cross-validation, and it also handles data slicing and pre-processing steps.
Loading required libraries
library(C50)
library(ROCR)
library(caret)
library(plyr)
install.packages("doMC")Splitting data into training and validation
library(doMC)
registerDoMC(cores = 5)
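Note that doMC relies on forking, which is not available on Windows; as a hedged alternative (not from the original post), the doParallel package can be registered instead:
# Windows-friendly alternative to doMC
install.packages("doParallel")
library(doParallel)
cl <- makeCluster(5)   # start 5 worker processes
registerDoParallel(cl)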
Splitting data into training and validation
The following code splits 60% of the data into training and the remaining 40% into validation.
trainIndex <- createDataPartition(data[,1], p = .6, list = FALSE, times = 1)
dev <- data[ trainIndex,]
val <- data[-trainIndex,]
In this code, a data.frame named "data" contains the full dataset. Setting list = FALSE avoids returning the data as a list. The function also has an argument, times, that can create multiple splits at once; the data indices are then returned in a list of integer vectors.
Similarly, createResample can be used to make simple bootstrap samples and createFolds can be used to generate balanced cross-validation groupings from a set of data; a short sketch of both follows.
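A minimal sketch of these two helpers, assuming the same data.frame named "data" and using its first column for stratification (purely illustrative):
# 10 bootstrap samples of the rows, drawn with replacement
boot_samples <- createResample(data[,1], times = 10)
# 10 balanced cross-validation folds; returnTrain = FALSE returns the hold-out indices
cv_folds <- createFolds(data[,1], k = 10, list = TRUE, returnTrain = FALSE)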
Repeated K-Fold Cross-Validation
cvCtrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3, classProbs = TRUE,
                       summaryFunction = twoClassSummary)
Explanation :
- repeatedcv : repeated K-fold cross-validation
- number = 10 : 10-fold cross-validations
- repeats = 3 : three separate 10-fold cross-validations are used.
- classProbs = TRUE : It should be TRUE if metric = "ROC" is used in the train function. It can be skipped if metric = "Kappa" is used.
- summaryFunction = twoClassSummary : computes the area under the ROC curve, sensitivity and specificity so that train can select models by metric = "ROC".
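For comparison, a minimal control object when metric = "Kappa" (or "Accuracy") is used instead; classProbs and a ROC-based summary function are not needed in that case:
kappaCtrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)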
There are two ways to tune an algorithm in the caret R package:
- tuneLength = It allows the system to tune the algorithm automatically. It specifies the number of different values to try for each tuning parameter. For example, mtry for randomForest: if tuneLength = 5, train tries 5 different mtry values and picks the optimal one among them (a random forest sketch after Example 2 below illustrates this).
- tuneGrid = The user specifies the tuning grid manually. In the grid, each algorithm parameter can be specified as a vector of possible values, and these vectors combine to define all the possible combinations to try.
For example, a manual grid for a gbm (boosted trees) model:
grid = expand.grid(.interaction.depth = seq(1, 7, by = 2), .n.trees = seq(100, 1000, by = 50), .shrinkage = c(0.01, 0.1))
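expand.grid builds one row per combination of the supplied values, so this grid describes 4 x 19 x 2 = 152 candidate models:
nrow(grid)   # 4 interaction depths x 19 tree counts x 2 shrinkage values = 152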
Example 1 : train with tuneGrid (Manual Grid)
grid <- expand.grid(.model = "tree", .trials = c(1:100), .winnow = FALSE)
set.seed(825)
tuned <- train(dev[, -1], dev[,1], method = "C5.0", metric = "ROC",
               tuneGrid = grid, trControl = cvCtrl)
Example 2 : train with tuneLength (Automatic Grid)
set.seed(825)
tuned <- train(dev[, -1], dev[,1], method = "C5.0", metric = "ROC",
               tuneLength = 10, trControl = cvCtrl)
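As a further illustration of tuneLength (not part of the original example), a hypothetical random forest fit that evaluates 5 candidate values of mtry, assuming the randomForest package is installed:
set.seed(825)
rf_tuned <- train(dev[, -1], dev[,1], method = "rf", metric = "ROC",
                  tuneLength = 5, trControl = cvCtrl)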
Finding the Tuning Parameter for each of the algorithms
Visit this link - http://topepo.github.io/caret/modelList.html
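The tuning parameters can also be listed from within R with caret's modelLookup function, for example:
modelLookup("C5.0")   # tuning parameters: trials, model, winnow
modelLookup("rf")     # tuning parameter: mtry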
Calculating the Variable Importance
# variable importance for the tuned model (scale = FALSE keeps the raw scores)
varImp(tuned, scale = FALSE)
plot(varImp(tuned))
To get the area under the ROC curve for each predictor, the filterVarImp function can be used. The area under the ROC curve is computed for each class.
RocImp <- filterVarImp(x = dev[, -1], y = dev[,1])
RocImp
# Seeing result
tuned
# Seeing Parameter Tuning
trellis.par.set(caretTheme())
plot(tuned, metric = "ROC")
# Seeing final model result
print(tuned$finalModel)
# Summaries of C5.0 Model
summary(tuned$finalModel)
# Variable importance
C5imp(tuned$finalModel, metric="usage")
# Scoring
val1 = predict(tuned$finalModel, val[, -1], type = "prob")
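As a follow-up sketch (not in the original post), the scored validation set could be evaluated with confusionMatrix and ROCR, assuming val[,1] is a two-level factor whose second level is the class of interest:
# class predictions and confusion matrix on the validation set
val_class <- predict(tuned, val[, -1])
confusionMatrix(val_class, val[,1])
# ROC curve and AUC with ROCR, using the probability of the second class level
pred_obj <- prediction(val1[, 2], val[,1])
plot(performance(pred_obj, "tpr", "fpr"))
performance(pred_obj, "auc")@y.values[[1]]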
Other Useful Functions
- nearZeroVar: a function to identify predictors that are sparse and highly unbalanced so they can be removed (see the sketch after this list)
- findCorrelation: a function to find the set of predictors to remove so that pair-wise correlations stay low
- preProcess: pre-processing of predictors, e.g. centering, scaling, imputation and PCA
- predictors: a class for determining which predictors are included in the prediction equations (e.g. rpart, earth, lars models) (currently 7 methods)
- confusionMatrix, sensitivity, specificity, posPredValue, negPredValue: functions for assessing classifier performance
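A brief sketch of how two of these helpers could be applied to the training predictors, assuming dev from above and numeric predictors in dev[, -1] for the correlation step:
# identify sparse / highly unbalanced predictors by column position
nzv <- nearZeroVar(dev[, -1])
dev_x <- if (length(nzv) > 0) dev[, -1][, -nzv] else dev[, -1]
# columns to drop so that no pair of predictors has absolute correlation above 0.75
high_cor <- findCorrelation(cor(dev_x), cutoff = 0.75)
if (length(high_cor) > 0) dev_x <- dev_x[, -high_cor]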