Caret Package : Train Models

This article explains some useful functions of the caret package in R. If you are new to the caret package, check out the Part I tutorial.

Resampling Methods in trainControl (the method Argument)

none - No cross validation or Bootstrapping
boot - Bootstrapping
cv - Cross validation
repeatedcv - Repeated Cross Validation
oob - Out of Bag (only for random forest, bagged trees, bagged earth, bagged flexible discriminant analysis, or conditional tree forest models)

Cross validation or bootstrapping of the training sample gives an estimate of model performance for each candidate setting, so that the best tuning parameters for a model can be selected. See the section 'How cross validation works in caret' below for a detailed explanation.
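As a minimal sketch, the resampling methods listed above could be requested as follows. It assumes the caret package is installed; the object names and the number and repeats values are illustrative choices, not recommendations.

```r
library(caret)

# No resampling: train() then fits a single model, so the tuning grid
# passed to train() must contain exactly one parameter combination
ctrl_none <- trainControl(method = "none")

# Bootstrapping with 25 resamples
ctrl_boot <- trainControl(method = "boot", number = 25)

# 10-fold cross validation
ctrl_cv <- trainControl(method = "cv", number = 10)

# 10-fold cross validation repeated 3 times
ctrl_rcv <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# Out-of-bag estimates (only for the bagging-type models listed above)
ctrl_oob <- trainControl(method = "oob")
```

Each object is then passed to train() through its trControl argument.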

Example

1. Specify a parameter grid for tuning the GBM model

grid <- expand.grid(n.trees = seq(10, 50, 10), interaction.depth = seq(1, 4, 1), shrinkage = c(0.01, 0.001), n.minobsinnode = seq(5, 20, 5))

n.trees = seq(10, 50, 10) means the model is tuned over 10, 20, 30, 40 and 50 trees. expand.grid then builds every combination of the values supplied for the four parameters.
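To check what the grid contains before training, it can be inspected with base R alone. The sketch below rebuilds the same grid; writing the column names without leading dots is a stylistic assumption here and does not change the values.

```r
# Same tuning grid as above, built with base R's expand.grid
grid <- expand.grid(n.trees = seq(10, 50, 10),
                    interaction.depth = seq(1, 4, 1),
                    shrinkage = c(0.01, 0.001),
                    n.minobsinnode = seq(5, 20, 5))

nrow(grid)            # 5 * 4 * 2 * 4 = 160 parameter combinations
unique(grid$n.trees)  # 10 20 30 40 50
head(grid, 3)         # first few rows, one combination per row
```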

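2. Set up 10-fold cross validation

# classProbs and twoClassSummary are needed to compute ROC (AUC) for a two-class outcome
train_control <- trainControl(method = 'cv', number = 10, classProbs = TRUE, summaryFunction = twoClassSummary)

3. Train the GBM model

# x holds the predictors and y is a two-level factor outcome
# metric names are case sensitive: use "ROC", not "roc"
fit <- train(x, y, method = "gbm", metric = "ROC", trControl = train_control, tuneGrid = grid)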

How cross validation works in caret

This approach is used to select the final model.

In the above example there are 160 (5*4*2*4) possible parameter combinations.
For each parameter combination, train performs a 10-fold cross validation.
For each parameter combination and for each of the 10 folds, the performance metric (AUC) is computed, which gives 160 * 10 = 1600 AUC scores.
For each parameter combination, the mean of the performance metric is computed over the 10 folds.
The parameter combination with the best mean performance metric is taken as the best set of parameters, and the final model is refit on the full training set using those parameters.
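The steps above can be sketched in base R. This is an illustration of the procedure, not caret's internal code: the data are simulated, the model is a logistic regression rather than GBM, and the single tuning parameter (polynomial degree) and the helper auc() are hypothetical stand-ins.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(1.5 * x))     # simulated binary outcome
dat <- data.frame(x = x, y = y)

# Rank-based AUC (Mann-Whitney statistic)
auc <- function(score, label) {
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

k <- 10
fold_id <- sample(rep(seq_len(k), length.out = n))  # assign each row to a fold
grid <- data.frame(degree = 1:3)                    # hypothetical tuning grid

# One AUC score per (parameter combination, fold)
scores <- matrix(NA_real_, nrow = nrow(grid), ncol = k)
for (i in seq_len(nrow(grid))) {
  degree <- grid$degree[i]
  for (f in seq_len(k)) {
    train_set <- dat[fold_id != f, ]   # fit on the other k - 1 folds
    test_set  <- dat[fold_id == f, ]   # score on the held-out fold
    model <- glm(y ~ poly(x, degree), data = train_set, family = binomial)
    p <- predict(model, newdata = test_set, type = "response")
    scores[i, f] <- auc(p, test_set$y)
  }
}

grid$mean_auc <- rowMeans(scores)              # mean over the folds
best <- grid[which.max(grid$mean_auc), ]       # best mean metric wins
degree <- best$degree
final_model <- glm(y ~ poly(x, degree), data = dat, family = binomial)  # refit on all data
```

With 3 combinations and 10 folds the loop produces 30 AUC scores, in the same way that the GBM example produces 1600.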

How to see the best model and best tuning parameter

1. Run fit$results to see the resampled performance for each parameter combination

2. Run fit$bestTune to see the best tuning parameters

3. Run fit$finalModel to see the final model, which is refit on the full training set with the best parameters

4. Run fit$resample to see the performance of the selected model on each of the 10 folds

5. To see the individual predictions made during cross validation, set savePredictions = TRUE in trainControl, then look at fit$pred
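The following end-to-end sketch shows where these elements come from. It assumes the caret, gbm and pROC packages are installed, uses data simulated with caret's twoClassSim rather than a real data set, and uses a deliberately small, hypothetical grid (small_grid) so that it runs quickly.

```r
library(caret)   # the gbm and pROC packages must also be installed

set.seed(42)
dat <- twoClassSim(300)   # simulated two-class data; the outcome column is Class

ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     savePredictions = TRUE)   # keep the hold-out predictions

# Small illustrative grid: 2 * 2 * 1 * 1 = 4 combinations
small_grid <- expand.grid(n.trees = c(25, 50),
                          interaction.depth = c(1, 2),
                          shrinkage = 0.1,
                          n.minobsinnode = 10)

fit <- train(Class ~ ., data = dat, method = "gbm", metric = "ROC",
             trControl = ctrl, tuneGrid = small_grid, verbose = FALSE)

fit$results      # mean ROC, Sens and Spec for each of the 4 combinations
fit$bestTune     # the combination with the highest mean ROC
fit$finalModel   # gbm object refit on all 300 rows
fit$resample     # fold-by-fold performance of the selected combination
head(fit$pred)   # hold-out predictions saved during cross validation
```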

Selecting the Least Complex Model

Step I : Train your model

set.seed(825)
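# fitControl is a trainControl object and gbmGrid a tuning grid, assumed to be defined beforehand in the same way as train_control and grid above
gbmFit3 <- train(Class ~ ., data = training, method = "gbm", trControl = fitControl, verbose = FALSE, tuneGrid = gbmGrid, metric = "ROC")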

Step II : Tolerance function

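The tolerance function selects the least complex model whose performance is within a given percentage of the best value. In the call below, tol = 2 means that a loss of up to 2% in AUC relative to the best model is accepted in exchange for a simpler model. The function returns the row index of that model in the results table.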

whichTwoPct <- tolerance(gbmFit3$results, metric = "ROC", tol = 2, maximize = TRUE)
cat("best model within 2 pct of best:\n")
gbmFit3$results[whichTwoPct,1:6]
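The percent-tolerance rule can be illustrated in base R on a toy table. This is a sketch of the idea, not caret's source code: the results values are made up, pick_within_tol is a hypothetical helper, and the rows are assumed to be ordered from the simplest model to the most complex.

```r
# Toy results table, ordered from simplest to most complex (hypothetical values)
results <- data.frame(n.trees = c(50, 100, 150, 200),
                      ROC     = c(0.880, 0.895, 0.900, 0.899))

pick_within_tol <- function(x, metric, tol, maximize = TRUE) {
  best <- if (maximize) max(x[[metric]]) else min(x[[metric]])
  # Percent difference from the best value (always >= 0)
  pct_loss <- abs(x[[metric]] - best) / best * 100
  # First row inside the tolerance, i.e. the simplest acceptable model
  min(which(pct_loss < tol))
}

idx <- pick_within_tol(results, metric = "ROC", tol = 2)
results[idx, ]   # the 100-tree row: about 0.56% below the best ROC of 0.900
```

The 50-tree model is rejected because its ROC of 0.880 is about 2.2% below the best value, which is outside the 2% tolerance.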

About Author:
Deepanshu Bhalla

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and HR.
