CARET Package Implementation in R

In R, there is a package called caret, which stands for Classification And REgression Training. It makes predictive modeling easy: it can run most predictive modeling techniques with cross-validation, and it also handles data slicing and pre-processing steps.

Loading required libraries
library(C50)
library(ROCR)
library(caret)
library(plyr)
Set Parallel Processing - Decrease computation time
install.packages("doMC")
library(doMC)
registerDoMC(cores = 5)
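Note: doMC works on Linux and macOS only. On Windows, the doParallel package provides equivalent functionality. A minimal sketch (the core count of 5 is just an example):
library(doParallel)
cl <- makePSOCKcluster(5)  # create a cluster of 5 worker processes
registerDoParallel(cl)     # caret's train() will now run resamples in parallel
# stopCluster(cl)          # release the workers when done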
Splitting data into training and validation

The following code splits 60% of the data into training (development) and the remaining 40% into validation.
trainIndex <- createDataPartition(data[,1], p = .6, list = FALSE, times = 1)
dev <- data[ trainIndex,]
val  <- data[-trainIndex,]
In this code, a data.frame named "data" contains the full dataset, with the outcome in the first column. Setting list = FALSE returns the row indices as a matrix rather than a list. This function also has an argument, times, that can create multiple splits at once; in that case the data indices are returned in a list of integer vectors.
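As a minimal sketch of the times argument (assuming the same "data" object), the following creates three independent 60/40 splits at once:
splits <- createDataPartition(data[,1], p = .6, list = TRUE, times = 3)
str(splits)  # a list of 3 integer vectors of training-row indices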

Similarly, createResample can be used to make simple bootstrap samples, and createFolds can be used to generate balanced cross-validation groupings from a set of data.
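A minimal sketch of both helpers (again assuming the "data" object; the counts are arbitrary):
# 10 bootstrap samples, each the same size as the original data
boots <- createResample(data[,1], times = 10)
# row indices for 10 cross-validation folds, stratified on the outcome
folds <- createFolds(data[,1], k = 10, list = TRUE)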

Repeated K Fold Cross-Validation
cvCtrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3, classProbs = TRUE, summaryFunction = twoClassSummary)
Explanation : 
  1. repeatedcv : repeated K-fold cross-validation 
  2. number = 10 : 10-fold cross-validation
  3. repeats = 3 : three separate rounds of 10-fold cross-validation are run.
  4. classProbs = TRUE : required when metric = "ROC" is used in the train function, since class probabilities are needed to compute the ROC curve. It can be skipped if metric = "Kappa" is used.
  5. summaryFunction = twoClassSummary : computes the ROC, sensitivity and specificity summaries that metric = "ROC" selects on.
Note : Kappa measures agreement between predicted and observed classes, corrected for the agreement expected by chance; it is a more reliable accuracy measure when classes are imbalanced.

There are two ways to tune an algorithm in the caret R package :
  1. tuneLength = lets the system choose the candidate values automatically. It specifies the number of different values to try for each tuning parameter, for example mtry in randomForest. With tuneLength = 5, train tries 5 different mtry values and picks the optimal one among them (see the random forest sketch after the grid examples below).
  2. tuneGrid = the user specifies a tuning grid manually. In the grid, each algorithm parameter is given as a vector of possible values, and these vectors combine to define all the possible combinations to try.
For example,  grid = expand.grid(.mtry= c(1:100))

grid = expand.grid(.interaction.depth = seq(1, 7, by = 2), .n.trees = seq(100, 1000, by = 50), .shrinkage = c(0.01, 0.1))
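As a minimal sketch of the automatic approach for a random forest (assuming dev from the split above, with the outcome in the first column; rfTuned is just an illustrative name):
set.seed(825)
rfTuned <- train(dev[, -1], dev[,1], method = "rf", metric = "ROC",
                 tuneLength = 5, trControl = cvCtrl)  # tries 5 mtry values
rfTuned$bestTune  # the mtry value with the best resampled ROC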

Example 1 : train with tuneGrid (Manual Grid)
grid <- expand.grid(.model = "tree", .trials = c(1:100), .winnow = FALSE)

set.seed(825)
tuned <- train(dev[, -1], dev[,1], method = "C5.0", metric = "ROC",
               tuneGrid = grid, trControl = cvCtrl)

Example 2 : train with tuneLength (Automatic Grid)
set.seed(825)
tuned <- train(dev[, -1], dev[,1], method = "C5.0", metric = "ROC",
               tuneLength = 10, trControl = cvCtrl)

Finding the Tuning Parameter for each of the algorithms

Visit this link - http://topepo.github.io/caret/modelList.html
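From within R, caret's modelLookup function lists the tuning parameters for a given method. A quick sketch:
modelLookup("C5.0")  # shows the tunable parameters: trials, model, winnow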

Calculating the Variable Importance
# varImp on the train object returns a varImp.train object with a plot method
varImp(tuned, scale = FALSE)
plot(varImp(tuned))
To get the area under the ROC curve for each predictor, the filterVarImp function can be used. The area under the ROC curve is computed for each class.
RocImp <- filterVarImp(x = dev[, -1], y = dev[,1])
RocImp
# Seeing result
tuned

# Seeing Parameter Tuning
trellis.par.set(caretTheme())
plot(tuned, metric = "ROC")

# Seeing final model result
print(tuned$finalModel)

#Summaries of C5.0 Model
summary(tuned$finalModel)

# Variable importance (C5.0-specific usage metric)
C5imp(tuned$finalModel, metric = "usage")

#Scoring
val1 <- predict(tuned$finalModel, val[, -1], type = "prob")
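The ROCR library loaded at the top can then score these probabilities against the observed outcomes. A minimal sketch, assuming a binary outcome whose second probability column corresponds to the positive class:
pred <- prediction(val1[, 2], val[, 1])    # ROCR prediction object
perf <- performance(pred, "tpr", "fpr")    # ROC curve coordinates
plot(perf)                                 # plot the ROC curve
performance(pred, "auc")@y.values[[1]]     # area under the curve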

Other Useful Functions
  1. nearZeroVar: a function to identify predictors that are sparse and highly unbalanced so they can be removed
  2. findCorrelation: a function to find the set of predictors to remove in order to achieve low pair-wise correlations
  3. preProcess: pre-processing of predictors, including centering, scaling, imputation and PCA
  4. predictors: a function for determining which predictors are included in the prediction equations (e.g. rpart, earth, lars models) (currently 7 methods)
  5. confusionMatrix, sensitivity, specificity, posPredValue, negPredValue: functions for assessing classifier performance
A short sketch of the first three appears below.
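A minimal sketch of nearZeroVar, findCorrelation and preProcess, assuming the dev predictors from above:
nzv <- nearZeroVar(dev[, -1])                     # column indices of near-zero-variance predictors
corMat <- cor(dev[, -1])                          # assumes numeric predictors
highCor <- findCorrelation(corMat, cutoff = .75)  # columns to drop to reduce pair-wise correlation
pp <- preProcess(dev[, -1], method = c("center", "scale", "pca"))
devTrans <- predict(pp, dev[, -1])                # apply the transformations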