In R, there is a package called caret, which stands for Classification And REgression Training. It makes predictive modeling easy: it can run most of the common predictive modeling techniques with cross-validation, and it also handles data slicing and pre-processing steps.
Loading required libraries
library(C50)
library(ROCR)
library(caret)
library(plyr)
install.packages("doMC")Splitting data into training and validation
library(doMC)
registerDoMC(cores = 5)
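Note that doMC relies on forking, which is not available on Windows; as a hedged alternative (not from the original post), the doParallel package can be registered instead:
# Windows-friendly alternative to doMC
install.packages("doParallel")
library(doParallel)
cl <- makeCluster(5)   # start 5 worker processes
registerDoParallel(cl)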
Splitting data into training and validation
The following code splits 60% of the data into training and the remaining 40% into validation.
trainIndex <- createDataPartition(data[,1], p = .6, list = FALSE, times = 1)
dev <- data[ trainIndex,]
val <- data[-trainIndex,]
In this code, a data.frame named "data" contains the full dataset. Setting list = FALSE avoids returning the data as a list. The function also has an argument, times, that can create multiple splits at once; the data indices are then returned in a list of integer vectors.
Similarly, createResample can be used to make simple bootstrap samples and createFolds can be used to generate balanced cross-validation groupings from a set of data; a short sketch of both follows.
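A minimal sketch of these two helpers, assuming the same data.frame named "data" and using its first column for stratification (purely illustrative):
# 10 bootstrap samples of the rows, drawn with replacement
boot_samples <- createResample(data[,1], times = 10)
# 10 balanced cross-validation folds; returnTrain = FALSE returns the hold-out indices
cv_folds <- createFolds(data[,1], k = 10, list = TRUE, returnTrain = FALSE)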
Repeated K-Fold Cross-Validation
cvCtrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3, classProbs = TRUE,
                       summaryFunction = twoClassSummary)
Explanation :
- repeatedcv : repeated K-fold cross-validation
- number = 10 : 10-fold cross-validations
- repeats = 3 : three separate 10-fold cross-validations are used.
- classProbs = TRUE : It should be TRUE if metric = "ROC" is used in the train function. It can be skipped if metric = "Kappa" is used.
- summaryFunction = twoClassSummary : computes the area under the ROC curve, sensitivity and specificity so that train can select models by metric = "ROC".
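For comparison, a minimal control object when metric = "Kappa" (or "Accuracy") is used instead; classProbs and a ROC-based summary function are not needed in that case:
kappaCtrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)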
There are two ways to tune an algorithm in the caret R package:
- tuneLength = It allows the system to tune the algorithm automatically. It specifies the number of different values to try for each tuning parameter. For example, mtry for randomForest: if tuneLength = 5, train tries 5 different mtry values and picks the optimal one among them (a random forest sketch after Example 2 below illustrates this).
- tuneGrid = The user specifies the tuning grid manually. In the grid, each algorithm parameter can be specified as a vector of possible values, and these vectors combine to define all the possible combinations to try.
For example, a manual grid for a gbm (boosted trees) model:
grid = expand.grid(.interaction.depth = seq(1, 7, by = 2), .n.trees = seq(100, 1000, by = 50), .shrinkage = c(0.01, 0.1))
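expand.grid builds one row per combination of the supplied values, so this grid describes 4 x 19 x 2 = 152 candidate models:
nrow(grid)   # 4 interaction depths x 19 tree counts x 2 shrinkage values = 152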
Example 1 : train with tuneGrid (Manual Grid)
grid <- expand.grid(.model = "tree", .trials = c(1:100), .winnow = FALSE)
set.seed(825)
tuned <- train(dev[, -1], dev[,1], method = "C5.0", metric = "ROC",
               tuneGrid = grid, trControl = cvCtrl)
Example 2 : train with tuneLength (Automatic Grid)
set.seed(825)
tuned <- train(dev[, -1], dev[,1], method = "C5.0", metric = "ROC",
               tuneLength = 10, trControl = cvCtrl)
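As a further illustration of tuneLength (not part of the original example), a hypothetical random forest fit that evaluates 5 candidate values of mtry, assuming the randomForest package is installed:
set.seed(825)
rf_tuned <- train(dev[, -1], dev[,1], method = "rf", metric = "ROC",
                  tuneLength = 5, trControl = cvCtrl)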
Finding the Tuning Parameter for each of the algorithms
Visit this link - http://topepo.github.io/caret/modelList.html
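The tuning parameters can also be listed from within R with caret's modelLookup function, for example:
modelLookup("C5.0")   # tuning parameters: trials, model, winnow
modelLookup("rf")     # tuning parameter: mtry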
Calculating the Variable Importance
# variable importance for the tuned model (scale = FALSE keeps the raw scores)
varImp(tuned, scale = FALSE)
plot(varImp(tuned))
To get the area under the ROC curve for each predictor, the filterVarImp function can be used. The area under the ROC curve is computed for each class.
RocImp <- filterVarImp(x = dev[, -1], y = dev[,1])
RocImp
# Seeing result
tuned
# Seeing Parameter Tuning
trellis.par.set(caretTheme())
plot(tuned, metric = "ROC")
# Seeing final model result
print(tuned$finalModel)
# Summaries of C5.0 Model
summary(tuned$finalModel)
# Variable importance
C5imp(tuned$finalModel, metric="usage")
# Scoring
val1 = predict(tuned$finalModel, val[, -1], type = "prob")
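As a follow-up sketch (not in the original post), the scored validation set could be evaluated with confusionMatrix and ROCR, assuming val[,1] is a two-level factor whose second level is the class of interest:
# class predictions and confusion matrix on the validation set
val_class <- predict(tuned, val[, -1])
confusionMatrix(val_class, val[,1])
# ROC curve and AUC with ROCR, using the probability of the second class level
pred_obj <- prediction(val1[, 2], val[,1])
plot(performance(pred_obj, "tpr", "fpr"))
performance(pred_obj, "auc")@y.values[[1]]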
Other Useful Functions
- nearZeroVar: a function to identify predictors that are sparse and highly unbalanced so they can be removed (see the sketch after this list)
- findCorrelation: a function to find the set of predictors to remove so that pair-wise correlations stay low
- preProcess: pre-processing of predictors, e.g. centering, scaling, imputation and PCA
- predictors: a class for determining which predictors are included in the prediction equations (e.g. rpart, earth, lars models) (currently 7 methods)
- confusionMatrix, sensitivity, specificity, posPredValue, negPredValue: functions for assessing classifier performance
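A brief sketch of how two of these helpers could be applied to the training predictors, assuming dev from above and numeric predictors in dev[, -1] for the correlation step:
# identify sparse / highly unbalanced predictors by column position
nzv <- nearZeroVar(dev[, -1])
dev_x <- if (length(nzv) > 0) dev[, -1][, -nzv] else dev[, -1]
# columns to drop so that no pair of predictors has absolute correlation above 0.75
high_cor <- findCorrelation(cor(dev_x), cutoff = 0.75)
if (length(high_cor) > 0) dev_x <- dev_x[, -high_cor]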