CARET Package Implementation in R

In R, there is a package called caret, which stands for Classification And REgression Training. It makes predictive modeling easy: it can run most of the common predictive modeling techniques with cross-validation, and it also handles data slicing and pre-processing steps.

Loading required libraries
library(C50)    # C5.0 decision trees and rule-based models
library(ROCR)   # ROC curves and performance measures
library(caret)  # unified interface for training and tuning models
library(plyr)   # data manipulation helpers
Set Parallel Processing - Decrease computation time
install.packages("doMC")
library(doMC)
registerDoMC(cores = 5)  # use 5 CPU cores for model training
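Note that doMC relies on forking and is only available on Unix-like systems. On Windows, a socket-based cluster via the doParallel package is a workable alternative (a minimal sketch):
install.packages("doParallel")
library(doParallel)
cl <- makePSOCKcluster(5)  # start 5 worker processes
registerDoParallel(cl)
# ... run train() calls here ...
stopCluster(cl)            # release the workers when done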
Splitting data into training and validation

The following code assigns 60% of the data to the training set and the remainder to the validation set.
trainIndex <- createDataPartition(data[,1], p = .6, list = FALSE, times = 1)
dev <- data[ trainIndex,]
val  <- data[-trainIndex,]
In this code, a data.frame named "data" contains the full dataset. Setting list = FALSE prevents the indices from being returned as a list. This function also has an argument, times, that can create multiple splits at once; in that case the indices are returned in a list of integer vectors.

Similarly, createResample can be used to make simple bootstrap samples and createFolds can be used to generate balanced cross-validation groupings from a set of data.
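For example (a minimal sketch, assuming the outcome is in the first column of data):
folds <- createFolds(data[, 1], k = 10)        # list of 10 held-out index vectors
boots <- createResample(data[, 1], times = 5)  # list of 5 bootstrap index vectors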

Repeated K-Fold Cross-Validation
cvCtrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                       classProbs = TRUE, summaryFunction = twoClassSummary)
Explanation : 
  1. repeatedcv : repeated K-fold cross-validation 
  2. number = 10 : 10-fold cross-validation
  3. repeats = 3 : three separate 10-fold cross-validations are used.
  4. classProbs = TRUE : required when metric = "ROC" is used in the train function; the outcome's factor levels must be valid R variable names for this to work. It can be skipped if metric = "Kappa" is used.
  5. summaryFunction = twoClassSummary : computes the ROC, sensitivity and specificity measures that metric = "ROC" relies on; without it, train has no "ROC" column to optimize.
Note : Kappa measures the agreement between predicted and observed classes, corrected for the agreement expected by chance; it is a stricter measure than raw accuracy.
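For instance, a Kappa-based setup drops classProbs and the ROC summary (a minimal sketch; cvCtrlKappa is an illustrative name):
cvCtrlKappa <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
# then call train(..., metric = "Kappa", trControl = cvCtrlKappa)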

There are two ways to tune an algorithm in the caret R package :
  1. tuneLength = It allows the system to tune the algorithm automatically. It indicates the number of different values to try for each tuning parameter. For example, mtry for randomForest: with tuneLength = 5, caret tries 5 different mtry values and picks the optimal mtry among them.
  2. tuneGrid = The user specifies the tuning grid manually. In the grid, each algorithm parameter is specified as a vector of possible values, and these vectors are combined to define all the possible combinations to try. 
For example, grid = expand.grid(mtry = c(1:100))

grid = expand.grid(interaction.depth = seq(1, 7, by = 2), n.trees = seq(100, 1000, by = 50), shrinkage = c(0.01, 0.1), n.minobsinnode = 10)
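Note that current caret versions expect the gbm grid to contain all four tuning parameters (n.trees, interaction.depth, shrinkage, n.minobsinnode). A hypothetical call passing this grid to train, assuming a binary outcome and the cvCtrl object defined above (gbmTuned is an illustrative name):
set.seed(825)
gbmTuned <- train(dev[, -1], dev[, 1], method = "gbm", metric = "ROC",
                  tuneGrid = grid, trControl = cvCtrl, verbose = FALSE)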

Example 1 : train with tuneGrid (Manual Grid)
grid <- expand.grid(model = "tree", trials = c(1:100), winnow = FALSE)

set.seed(825)
tuned <- train(dev[, -1], dev[,1], method = "C5.0", metric = "ROC",
               tuneGrid = grid, trControl = cvCtrl)

Example 2 : train with tuneLength (Automatic Grid)
set.seed(825)
tuned <- train(dev[, -1], dev[,1], method = "C5.0", metric = "ROC",
               tuneLength = 10, trControl = cvCtrl)

Finding the Tuning Parameter for each of the algorithms

Visit this link - http://topepo.github.io/caret/modelList.html
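The tuning parameters can also be listed directly from R with caret's modelLookup function:
modelLookup("C5.0")  # trials, model, winnow
modelLookup("gbm")   # n.trees, interaction.depth, shrinkage, n.minobsinnode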

Calculating the Variable Importance
varImp(tuned, scale = FALSE)
plot(varImp(tuned))
To get the area under the ROC curve for each predictor, the filterVarImp function can be used. The area under the ROC curve is computed for each class.
RocImp <- filterVarImp(x = dev[, -1], y = dev[,1])
RocImp
# Seeing the results
tuned

# Seeing the parameter tuning
trellis.par.set(caretTheme())
plot(tuned, metric = "ROC")

# Seeing the final model
print(tuned$finalModel)

# Summaries of the C5.0 model
summary(tuned$finalModel)

# Variable importance (C5.0 usage metric)
C5imp(tuned$finalModel, metric = "usage")

# Scoring
val1 <- predict(tuned, val[, -1], type = "prob")
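To evaluate the scored model, class predictions can be compared against the observed classes (a minimal sketch, assuming the first column of val holds the factor outcome; val_pred is an illustrative name):
val_pred <- predict(tuned, val[, -1])  # predicted classes
confusionMatrix(val_pred, val[, 1])    # accuracy, Kappa, sensitivity, specificity, etc.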

Other Useful Functions
  1. nearZeroVar: a function to identify predictors that are sparse and highly unbalanced (near-zero variance)
  2. findCorrelation: a function to find the set of predictors to remove so that pair-wise correlations stay low
  3. preProcess: pre-processing of predictors, including centering, scaling, imputation and PCA
  4. predictors: a function for determining which predictors are included in the prediction equations (e.g. rpart, earth, lars models)
  5. confusionMatrix, sensitivity, specificity, posPredValue, negPredValue: functions for assessing classifier performance
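A minimal sketch of the first two functions, assuming the predictors in dev are numeric:
nzv <- nearZeroVar(dev[, -1])                             # indices of near-zero-variance columns
corIdx <- findCorrelation(cor(dev[, -1]), cutoff = 0.75)  # indices of columns to drop to keep pair-wise correlations below 0.75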
