Caret Package : Train Models

This article explains useful functions of the caret package in R. If you are new to the caret package, check out the Part I tutorial.

Resampling Methods Available in trainControl
1. none - No cross validation or Bootstrapping
2. boot - Bootstrapping
3. cv - Cross validation
4. repeatedcv - Repeated Cross Validation
5. oob - Out of Bag (only for random forest, bagged trees, bagged earth, bagged flexible discriminant analysis, or conditional tree forest models)

The idea of cross validation or bootstrapping the training samples is to select the best tuning parameters for a model. See the detailed explanation below under the 'How cross validation works in caret' section.
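
As a rough sketch, each option in the list above corresponds to a value of the method argument of trainControl. The values of number and repeats used here are illustrative choices only, not recommendations from this article.

```r
library(caret)

# One control object per resampling method listed above
ctrl_none     <- trainControl(method = "none")                  # no resampling
ctrl_boot     <- trainControl(method = "boot", number = 25)     # 25 bootstrap samples
ctrl_cv       <- trainControl(method = "cv", number = 10)       # 10-fold cross validation
ctrl_repeated <- trainControl(method = "repeatedcv",
                              number = 10, repeats = 3)         # 10-fold CV repeated 3 times
ctrl_oob      <- trainControl(method = "oob")                   # out-of-bag estimate

# Each object is a plain list, so the settings can be checked directly
ctrl_repeated$method
ctrl_repeated$repeats
```

Any of these objects can then be passed to train through the trControl argument.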

Example

1. Specify a parameter grid for tuning the GBM model
grid <- expand.grid(.n.trees = seq(10, 50, 10), .interaction.depth = seq(1, 4, 1), .shrinkage = c(0.01, 0.001), .n.minobsinnode = seq(5, 20, 5))
n.trees = seq(10, 50, 10) means the model is tuned over 10, 20, 30, 40 and 50 trees. expand.grid builds one row for every combination of the values supplied for the four parameters.
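
To make the shape of the grid concrete, the short base R sketch below rebuilds the same grid and inspects it. It needs no packages, because expand.grid is part of base R.

```r
# Same grid as in the step above
grid <- expand.grid(.n.trees = seq(10, 50, 10),
                    .interaction.depth = seq(1, 4, 1),
                    .shrinkage = c(0.01, 0.001),
                    .n.minobsinnode = seq(5, 20, 5))

n_combinations <- nrow(grid)   # one row per parameter combination
n_combinations                 # 5 * 4 * 2 * 4 = 160
head(grid)                     # first few combinations
```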

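2. 10-fold cross validation
train_control <- trainControl(method = 'cv', number = 10, classProbs = TRUE, summaryFunction = twoClassSummary)
summaryFunction = twoClassSummary is needed so that the area under the ROC curve is computed on each fold. The default summary function reports only accuracy and kappa.

3. Train the GBM model
fit <- train(x, y, method = "gbm", metric = "ROC", trControl = train_control, tuneGrid = grid)
The metric name must be written as "ROC", which is the column name returned by twoClassSummary. If the name does not match, train warns and falls back to a metric that is available.
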
How cross validation works in caret

This approach is used to select the final model.

1. In the above example there are 160 (5*4*2*4) possible parameter combinations
2. For each parameter combination train performs a 10-fold cross validation
3. For each parameter combination and for each fold (of the 10 folds) the performance metric (AUC) is computed (1600 AUC scores are computed)
4. For each parameter combination the mean of the performance metric is computed over the 10 folds
5. The parameter combination that has the best mean performance metric is considered the best set of parameters for the model. train then refits the model on the full training set using these parameters.
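
The steps above can be sketched by hand in base R. This is only an illustration of the procedure, not caret's internal code. It uses assumptions that are not part of the example above: simulated data, a logistic regression in place of GBM, a hypothetical one-parameter grid (polynomial degree) and a small hand-written auc helper.

```r
set.seed(42)

# Simulated two-class data (assumption: stands in for the real x and y)
n <- 200
x <- rnorm(n)
dat <- data.frame(x = x, y = rbinom(n, 1, plogis(1.5 * x)))

# AUC from the rank-sum (Mann-Whitney) statistic
auc <- function(score, label) {
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical tuning grid: 3 parameter combinations
grid <- expand.grid(degree = 1:3)

# Assign every row to one of 10 folds
k <- 10
fold_id <- sample(rep(seq_len(k), length.out = n))

# One AUC per (parameter combination, fold): 3 * 10 = 30 scores
scores <- matrix(NA_real_, nrow = nrow(grid), ncol = k)
for (i in seq_len(nrow(grid))) {
  d <- grid$degree[i]
  for (f in seq_len(k)) {
    train_set <- dat[fold_id != f, ]
    test_set  <- dat[fold_id == f, ]
    model <- glm(y ~ poly(x, degree = d, raw = TRUE),
                 data = train_set, family = binomial)
    pred <- predict(model, newdata = test_set, type = "response")
    scores[i, f] <- auc(pred, test_set$y)
  }
}

# Mean AUC per combination, then pick the combination with the highest mean
mean_auc <- rowMeans(scores)
best <- grid[which.max(mean_auc), , drop = FALSE]
best
```

With the 160-row GBM grid, the same double loop would produce the 1600 fold-level scores mentioned in step 3.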

How to see the best model and best tuning parameter

1. Submit fit$results to see the model results of each parameter combination
2. Submit fit$bestTune to see the best tuning parameters
3. Submit fit$finalModel to see the results of the final model
4. Submit fit$resample to see the performance of the best parameter combination on each of the 10 folds
5. To see the individual predictions made during cross validation, set savePredictions = TRUE in trainControl, then look at fit$pred
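
The following is a small end-to-end sketch of these accessors. To keep it quick and self-contained it makes assumptions that differ from the GBM example: the data are simulated with caret's twoClassSim, the model is k-nearest neighbours (method = "knn"), and the three values of k are arbitrary. The object name knn_fit is only illustrative.

```r
library(caret)

set.seed(1)
dat <- twoClassSim(200)   # simulated data with a two-level factor named Class

ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     savePredictions = TRUE)   # keep the hold-out predictions

knn_fit <- train(Class ~ ., data = dat, method = "knn",
                 metric = "ROC", trControl = ctrl,
                 tuneGrid = data.frame(k = c(5, 9, 13)))

knn_fit$results      # mean ROC, Sens and Spec for each value of k
knn_fit$bestTune     # value of k with the highest mean ROC
knn_fit$finalModel   # model refit on all rows with the best k
knn_fit$resample     # fold-by-fold performance for the best k
head(knn_fit$pred)   # individual hold-out predictions
```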

Selecting the Least Complex Model

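Step I : Train your model
set.seed(825)
gbmFit3 <- train(Class ~ ., data = training, method = "gbm", trControl = fitControl, verbose = FALSE, tuneGrid = gbmGrid, metric = "ROC")
Here training is the training data frame, fitControl is a trainControl object and gbmGrid is a tuning grid. All three must be created before this call, and fitControl needs classProbs = TRUE and summaryFunction = twoClassSummary for the ROC metric to be available.
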
Step II : Tolerance function

It selects the least complex model whose performance is within some percent tolerance of the best value. In the call below, tol = 2 accepts a loss of up to 2% in AUC relative to the best model.
whichTwoPct <- tolerance(gbmFit3$results, metric = "ROC", tol = 2, maximize = TRUE)
cat("best model within 2 pct of best:\n")
gbmFit3$results[whichTwoPct,1:6]
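
The idea behind the tolerance rule can be sketched in base R on a toy results table. This is an illustration under stated assumptions, not caret's own code: the helper pick_within_tolerance and the ROC values are made up, and the rows are assumed to be ordered from the simplest to the most complex model.

```r
# Toy results table, ordered from simplest to most complex (assumption)
results <- data.frame(n.trees = c(10, 20, 30, 40),
                      ROC     = c(0.80, 0.88, 0.895, 0.90))

# Hypothetical helper: index of the first row within tol percent of the best value
pick_within_tolerance <- function(res, metric, tol, maximize = TRUE) {
  values <- res[[metric]]
  best <- if (maximize) max(values) else min(values)
  # Percent loss of each row relative to the best value
  pct_loss <- if (maximize) {
    (best - values) / best * 100
  } else {
    (values - best) / best * 100
  }
  # First (simplest) row whose loss is below the tolerance
  min(which(pct_loss < tol))
}

pick <- pick_within_tolerance(results, "ROC", tol = 2)
results[pick, ]   # row 3: 30 trees, ROC 0.895 (about 0.56% below the best)
```

Here the 20-tree model loses about 2.2% of the best AUC and is rejected, so the 30-tree model is chosen over the 40-tree model that has the highest AUC.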