**Ensemble learning**

The main advantages of Ensemble learning methods are :

**Reduced variance :**Overcome overfitting problem. Low variance means model independent of training data. Results are less dependent on features of a single model and training set.**Reduced bias :**Overcome underfitting problem. Low bias means linear regression applied to linear data, second degree polynomial applied to quadratic data. Combination of multiple classifiers may produce more reliable classification than single classifier.

**Bagging (Bootstrap Aggregating)**

Generates m new training data sets. Each new training data set picks a sample of observations with replacement (bootstrap sample) from original data set. By sampling with replacement, some observations may be repeated in each new training data set. The m models are fitted using the above m bootstrap samples and combined by averaging the output (for regression) or voting (for classification).

**Example :**

**Random forest**is one of the most important bagging ensemble learning algorithm, In random forest, approx. 2/3rd of the total training data (63.2%) is used for growing each tree. And the remaining one-third of the cases (36.8%) are left out and not used in the construction of each tree. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes over all the trees in the forest. For a binary dependent variable, the vote will be YES or NO, count up the YES votes. This is the RF score and the percent YES votes received is the predicted probability. In regression case, it is average of dependent variable.

**Uses of Bagging Algorithm**

- It is designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression.
- It also
**reduces variance**and helps to**avoid overfitting**.

**Boosting Algorithm**

**Layman's Term : AdaBoost Algorithm**

In this example, we focus on boosting algorithm, especially Adaboost algorithm.

- Start by creating a tree on training data, where each observation is assigned an
**equal weight**. - Then compute the predicted classification and weights are redetermined and assign
**greater weight**to those observations that are difficult to classify and**lower weights**to those that are easy to classify. Weights for all observations must sum to 1. - Second tree is grown on weighted data (weights are based on residuals or misclassification error). Idea is to improve prediction of first tree.
- Weights are redetermined and assign higher weights if it is classified incorrectly.
- New Model is now Tree 1 + Tree 2
- Compute residuals or classification error from this new 2-tree model (Tree1 + Tree2) and grow 3rd tree to predict revised residuals.
- Then subsequent trees help in classifying observations that are not well classified by preceding trees.
- The final prediction is a weighted sum of the predictions made by previous tree models.

**Loss or Error Function for Classification**

Minimizing Loss Function means maximising accuracy of the classifier.AdaBoost minimizes

**Exponential Loss**function. Gradient Boosting minimize logistic regression's LogLoss (aka negative log-likelihood or bernoulli function)

**.**Bernoulli distribution is a special case of the binomial distribution with n=1.

**Exponential Loss :**∑exp(-(b0 + b1x1 + b2x2+ ....+bkxk))

**Log Loss or Bernoulli :**∑log(1 + exp (-(b0 + b1x1 + b2x2+ ....+bkxk)))

LogLoss can also be expressed in form of

**Logloss = [∑ y log(p) + (1-y) log (1 - p)] * -(1/N)**

LogLoss = - (sum((actual*log(predicted)+(1-actual)*log(1-predicted)) / length(actual)In this case,

**actual**means 1 and

**predicted**means predicted probability (It's same probability that is derived from model).

*LogLoss of predicted probability 0.9 is lower than logloss of predicted probability 0.7.*

**Important Points about LogLoss**

**negative log-likelihood**. It can also be read as maximising log-likelihood.

The reason for taking the negative of the logarithm of the likelihood are as follows -

- It is more convenient to work with the log, because the log-likelihood of statistically independent observation will simply be the sum of the log-likelihood of each observation.
- We usually prefer to write the objective function as a cost function to minimize.

**Loss function for Regression : Squared Error**

(y - f(x))^2

**Why Boosting works?**

**Main Idea**

- Combine several simple decision trees (can be hundred or thousands)
- Each tree complements the previous ones (Avoid redundancy)
- Keep track of the errors of the previous trees

**This technique is faced with a problem of overfitting**

**Stochastic Gradient Boosting Algorithm**

To overcome problem of overfitting, stochastic gradient boosting comes into picture.

A general solution to the overfitting problem is to evaluate the quality of the fitted model by predicting observations in a test-sample of data that have not been used before to estimate the respective model(s). In Stochastic Gradient Boosting (GBM), a similar solution is implemented.

It tries to optimize the loss function by tuning different values of weights or coefficients to minimize the error. It is a method which comprises a vector of weights (or coefficients) where we calculate their partial derivative with respective to zero. The motive behind calculating their partial derivative is to find the local minima of the loss function (Residual Sum of Squares), which is convex in nature.

A general solution to the overfitting problem is to evaluate the quality of the fitted model by predicting observations in a test-sample of data that have not been used before to estimate the respective model(s). In Stochastic Gradient Boosting (GBM), a similar solution is implemented.

**Let's first understand Gradient Descent!****What is Gradient Descent?**It tries to optimize the loss function by tuning different values of weights or coefficients to minimize the error. It is a method which comprises a vector of weights (or coefficients) where we calculate their partial derivative with respective to zero. The motive behind calculating their partial derivative is to find the local minima of the loss function (Residual Sum of Squares), which is convex in nature.

**Loss function**: L(y, F(x)) = (y - F(x))^2 / 2**We want to minimize**J = ∑ L(yi, F(xi )) by adjusting F(x1), F(x2),.... F(xn)**Important Points :**- Residuals can be interpreted as negative gradients.
- For Classification, Gradient Descent Loss Function = Exponential Loss or Bernoulli function.

**How Stochastic Gradient Boosting Works**

- Simple tree is built on original target variable by taking only a randomly selected
**subsample**of the dataset**.**It implements**'Stochastic'**boosting technique**-**only 50% of the data used at each iteration.**Parameter :**bag.fraction : sampling rate (0.5). In this case, 0.5 implies - The tree is built as small as two nodes and typically in the range of 4-8 nodes (
**Parameter:****interaction.depth :**Nodes per tree) - We score full training data using this tree model and compute the negative gradient (i.e. residual or classification error) for every record
- Make residual as a new target (dependent) variable
- Grow second tree to predict the
**residuals**from first tree. Idea is to improve first tree. - Each tree is weighted by a small weight prior to being used to compute the current prediction or score produced by the model.This means that the model prediction changes by very small amounts after a new tree is added to the model. It is to reduce the impact of each additional tree. The weighting factor is also called
**shrinkage or learning rate**. - We
**combine two trees**by taking score for any record that come from the first tree and add to it the score that comes from the second tree. The second tree can therefore be thought of as correcting the predictions of the first tree. So a model from the end of a Gradient Boosting process is simply the**sum of all the scores of all the trees that were built.** - Now, we have revised model with two trees. Third tree is grown on residuals from model consisting of first two tree. The whole process keeps on going.
- It yield models (i.e.
**additive weighted expansions of simple trees**) that generalize well to new observations, i.e., exhibit good predictive validity.

**Important Point**

Prediction in Gradient Boosting models are made not be combined voting of all the trees, but cumulative predictions of all the trees.

**Difference between Bagging and Boosting**

- In Boosting, each model is built on top of the previous ones. Whereas in bagging each model is built independently.
- The final boosting ensemble uses weighted majority vote while bagging uses a simple majority vote.
- Bagging is a method of reducing variance while boosting can reduce the variance and bias of the base classifier
- Boosting is better than bagging on non-noisy data
- Bagging is effective more often than boosting

**R Packages for Boosting and Bagging**

- gbm - The gbm package offers two versions of boosting for classification (gentle boost under logistic and exponential loss). In addition, it includes squared error, absolute error, Poisson and Cox type loss functions.
- ada - Performs discrete, real, and gentle boost under both exponential and logistic loss on a given data set.
- randomForest - Random Forest Package (Bagging)

## 0 Response to "Ensemble Learning : Boosting and Bagging"

## Post a Comment