In R, there is a package called **caret**, which stands for Classification And REgression Training. It makes predictive modeling easier: it can run most predictive modeling techniques with cross-validation, and it also handles data slicing and pre-processing steps.

**Loading required libraries**

library(C50)

library(ROCR)

library(caret)

library(plyr)

**Set Parallel Processing - Decrease computation time**

install.packages("doMC")

library(doMC)

registerDoMC(cores = 5)
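The right number of cores depends on the machine, so a fixed value of 5 may not suit every system. As a minimal sketch, assuming the base `parallel` package is available and that doMC is installed only on Unix-like systems, the core count could be derived at run time (the names `available` and `n_cores` are illustrative):

```r
library(parallel)

# detectCores() can return NA on some platforms, so fall back to 1
available <- detectCores()
if (is.na(available)) available <- 1L

# Leave one core free for the rest of the system (a rule of thumb, not a caret requirement)
n_cores <- max(1L, available - 1L)

# doMC relies on forking and is not available on Windows,
# so only register it when the package is installed
if (requireNamespace("doMC", quietly = TRUE)) {
  doMC::registerDoMC(cores = n_cores)
}
```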

**Splitting data into training and validation**

The following code puts 60% of the data into the training set and the remaining 40% into the validation set.

trainIndex <- createDataPartition(data[,1], p = .6, list = FALSE, times = 1)

dev <- data[ trainIndex,]

val <- data[-trainIndex,]

In this code, a data.frame named **"data"** contains the full dataset, with the target variable in the first column. Setting **list = FALSE** returns the row indices as a matrix instead of a list. This function also has an argument, **times**, that can create multiple splits at once; with list = TRUE, the data indices are returned in a list of integer vectors.

Similarly, **createResample** can be used to make simple bootstrap samples and **createFolds** can be used to generate balanced cross-validation groupings from a set of data.

**Repeated K Fold Cross-Validation**
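Before configuring repeated cross-validation, it can help to look at what caret's resampling helpers return. The following is a minimal sketch on a simulated outcome vector; the vector `y` and the object names are illustrative and not part of the original example.

```r
library(caret)

set.seed(825)
# Hypothetical two-class outcome with 100 observations
y <- factor(sample(c("yes", "no"), 100, replace = TRUE))

# Stratified 60/40 split: a one-column matrix of training row indices
trainIndex <- createDataPartition(y, p = 0.6, list = FALSE, times = 1)

# Ten balanced folds: a list of integer vectors, one per held-out fold
folds <- createFolds(y, k = 10, list = TRUE, returnTrain = FALSE)

# Three bootstrap samples: a list of row indices drawn with replacement
boots <- createResample(y, times = 3, list = TRUE)

length(folds)          # number of folds
sapply(boots, length)  # each bootstrap sample has length(y) rows
```

Each row index appears in exactly one element of `folds`, which is what makes the folds usable as hold-out sets.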

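cvCtrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3, classProbs = TRUE, summaryFunction = twoClassSummary)

**Explanation :**

**repeatedcv :** repeated K-fold cross-validation

**number = 10 :** 10-fold cross-validation

**repeats = 3 :** three separate 10-fold cross-validations are run.

**classProbs = TRUE :** It should be TRUE if metric = "ROC" is used in the train function, because ROC is computed from class probabilities. It can be skipped if metric = "Kappa" is used.

**summaryFunction = twoClassSummary :** It is needed when metric = "ROC" is used; with the default summary function, train only computes Accuracy and Kappa.

**Note :** Kappa measures the agreement between predicted and observed classes, corrected for the agreement expected by chance.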
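To make the Kappa statistic concrete, here is a small worked sketch in base R. The confusion table is made up for illustration; Kappa is computed as (observed accuracy - expected accuracy) / (1 - expected accuracy), where the expected accuracy comes from the row and column totals.

```r
# Hypothetical confusion table: rows are predictions, columns are observed classes
tab <- matrix(c(40, 10,
                 5, 45),
              nrow = 2, byrow = TRUE,
              dimnames = list(predicted = c("no", "yes"),
                              observed  = c("no", "yes")))

n <- sum(tab)

# Observed accuracy: share of cases on the diagonal
p_observed <- sum(diag(tab)) / n

# Accuracy expected by chance, from the marginal totals
p_expected <- sum(rowSums(tab) * colSums(tab)) / n^2

kappa_value <- (p_observed - p_expected) / (1 - p_expected)
kappa_value  # 0.7, while plain accuracy is 0.85
```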

**There are two ways to tune an algorithm in the caret R package :**

**tuneLength =** It lets the system tune the algorithm **automatically**. It indicates the number of different values to try for each tuning parameter. **For example,** mtry for randomForest: with tuneLength = 5, train tries 5 different mtry values and picks the best of those 5.

**tuneGrid =** It means the user has to specify a tune grid **manually**. In the grid, each algorithm parameter can be specified as a vector of possible values. These vectors combine to define all the possible combinations to try.

**For example,** a grid for randomForest and a grid for gbm:

grid = expand.grid(.mtry= c(1:100))

grid = expand.grid(.interaction.depth = seq(1, 7, by = 2), .n.trees = seq(100, 1000, by = 50), .shrinkage = c(0.01, 0.1))
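Because expand.grid builds every combination of the supplied vectors, a manual grid grows quickly. The sketch below uses only base R to count the candidate settings in the second grid above; the estimate of model fits assumes the 10-fold, 3-repeat scheme from this post and is an upper bound, since caret can avoid some refits for certain models.

```r
grid <- expand.grid(.interaction.depth = seq(1, 7, by = 2),
                    .n.trees = seq(100, 1000, by = 50),
                    .shrinkage = c(0.01, 0.1))

# 4 depths x 19 tree counts x 2 shrinkage values
nrow(grid)  # 152

# First few candidate combinations
head(grid)

# Upper bound on resampled fits with number = 10 and repeats = 3
nrow(grid) * 10 * 3  # 4560
```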

**Example 1 : train with tuneGrid (Manual Grid)**

grid <- expand.grid(.model = "tree", .trials = c(1:100), .winnow = FALSE)

set.seed(825)

tuned <- train(dev[, -1], dev[,1], method = "C5.0", metric = "ROC",

tuneGrid = grid, trControl = cvCtrl)

**Example 2 : train with tuneLength (Automatic Grid)**

set.seed(825)

tuned <- train(dev[, -1], dev[,1], method = "C5.0", metric = "ROC",

tuneLength = 10, trControl = cvCtrl)
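The examples above do not say which dataset was used. As a self-contained sketch, the same steps can be run on simulated data from caret's twoClassSim function. The sample size, the smaller resampling scheme, and the object names `sim` and `ctrl` are assumptions chosen to keep the run short, and the C50 package must be installed for method = "C5.0".

```r
library(caret)  # method = "C5.0" also needs the C50 package installed

set.seed(825)
# twoClassSim() simulates a two-class problem; Class is the last column
sim <- twoClassSim(n = 300)

# Move the outcome to the first column to match the layout used in this post
data <- sim[, c(ncol(sim), seq_len(ncol(sim) - 1))]

trainIndex <- createDataPartition(data[, 1], p = 0.6, list = FALSE)
dev <- data[trainIndex, ]
val <- data[-trainIndex, ]

# Small resampling scheme; twoClassSummary is what makes ROC available
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 1,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

tuned <- train(dev[, -1], dev[, 1], method = "C5.0", metric = "ROC",
               tuneLength = 2, trControl = ctrl)

# Parameter combination with the best resampled ROC
tuned$bestTune
```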

**Finding the Tuning Parameter for each of the algorithms**

**Visit this link:** http://topepo.github.io/caret/modelList.html

**Calculating the Variable Importance**

varImp(tuned$finalModel , scale=FALSE)

plot(varImp(tuned))

To get the area under the ROC curve for each predictor, the **filterVarImp** function can be used. The area under the ROC curve is computed for each class.

RocImp <- filterVarImp(x = dev[, -1], y = dev[,1])

RocImp

# Seeing result

tuned

# Seeing Parameter Tuning

trellis.par.set(caretTheme())

plot(tuned, metric = "ROC")

# Seeing final model result

print(tuned$finalModel)

#Summaries of C5.0 Model

summary(tuned$finalModel)

# variable Importance

C5imp(tuned$finalModel, metric="usage")

#Scoring

val1 = predict(tuned$finalModel, val[, -1], type = "prob")
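ROCR is loaded at the top of this post but not used in the scoring step. As a hedged sketch, the area under the ROC curve on the validation set could be computed as below. The two short vectors are toy values standing in for the observed classes in val[, 1] and the predicted probabilities of the positive class in val1.

```r
library(ROCR)

# Toy values standing in for val[, 1] (observed) and val1[, 2] (scored)
observed <- c(0, 0, 1, 1)
scores   <- c(0.10, 0.40, 0.35, 0.80)

pred <- prediction(scores, observed)

# Area under the ROC curve
auc <- performance(pred, measure = "auc")@y.values[[1]]
auc  # 0.75: three of the four positive/negative pairs are ranked correctly

# ROC curve: true positive rate against false positive rate
roc <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(roc)
```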

**Other Useful Functions**

**nearZeroVar:** a function to identify predictors that are sparse and highly unbalanced so they can be removed

**findCorrelation:** a function to find the set of predictors to remove in order to achieve low pair-wise correlations

**preProcess:** pre-processing of predictors, such as centering, scaling and dimension reduction using PCA

**predictors:** a function for determining which predictors are included in the prediction equations (e.g. rpart, earth, lars models) (currently 7 methods)

**confusionMatrix, sensitivity, specificity, posPredValue, negPredValue:** functions for assessing classifier performance
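A minimal sketch of how some of these helpers fit together is shown below. The data frame `predictors_df`, its columns, and the toy class vectors are made up for illustration.

```r
library(caret)

set.seed(825)
x1 <- rnorm(100)
predictors_df <- data.frame(
  x1    = x1,
  x2    = x1 + rnorm(100, sd = 0.05),  # nearly a copy of x1
  x3    = rnorm(100),
  const = rep(1, 100)                  # zero-variance column
)

# Column indices of near-zero-variance predictors (here: const)
nzv <- nearZeroVar(predictors_df)
filtered <- predictors_df[, -nzv]

# Column indices to drop so that pairwise correlations stay below 0.9
high_cor <- findCorrelation(cor(filtered), cutoff = 0.9)
filtered <- filtered[, -high_cor]

# Center and scale the remaining predictors
pp <- preProcess(filtered, method = c("center", "scale"))
scaled <- predict(pp, filtered)

# Toy observed and predicted classes for the performance helpers
obs  <- factor(c("yes", "yes", "no", "no", "yes", "no"))
pred <- factor(c("yes", "no", "no", "no", "yes", "yes"), levels = levels(obs))

sens <- sensitivity(pred, obs, positive = "yes")  # 2 of 3 "yes" cases found
spec <- specificity(pred, obs, negative = "no")   # 2 of 3 "no" cases found
```

Note that the negative indexing used here assumes `nzv` and `high_cor` are non-empty; with real data it is safer to check their length first.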
