There are two main measures for assessing performance of a predictive model:

**Discrimination** and **Calibration**. These measures are not restricted to logistic regression. They can also be used with other classification techniques such as decision trees, random forests, gradient boosting, and support vector machines (SVM). Both measures are explained below.

**1. Discrimination**

Discrimination refers to the ability of the model to distinguish between events and non-events.

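**Area under the ROC curve (AUC / C statistic)**

The ROC curve plots the true positive rate (aka sensitivity) against the false positive rate (aka 1 - specificity) at every possible probability cutoff, and the AUC is the area under that curve. Mathematically, it is calculated from the pair percentages defined below:

AUC = (Concordant Percent + 0.5 * Tied Percent) / 100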

**Concordant:** Percentage of pairs where the observation with the desired outcome (event) has a higher predicted probability than the observation without the outcome (non-event).

**Discordant:** Percentage of pairs where the observation with the desired outcome (event) has a lower predicted probability than the observation without the outcome (non-event).

**Tied:** Percentage of pairs where the observation with the desired outcome (event) has the same predicted probability as the observation without the outcome (non-event).
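The pair definitions above can be sketched in a few lines of Python. This is only an illustration, not how SAS computes the statistic internally: the function name `concordance_stats` and the toy data are made up, and it assumes the relation AUC = (concordant % + 0.5 * tied %) / 100, which agrees with the two Somers' D formulas given later in this article.

```python
from itertools import product

def concordance_stats(y_true, y_prob):
    """Return (% concordant, % discordant, % tied, AUC) over all event/non-event pairs."""
    events = [p for y, p in zip(y_true, y_prob) if y == 1]
    non_events = [p for y, p in zip(y_true, y_prob) if y == 0]
    concordant = discordant = tied = 0
    # Compare every event with every non-event.
    for p_event, p_non_event in product(events, non_events):
        if p_event > p_non_event:
            concordant += 1
        elif p_event < p_non_event:
            discordant += 1
        else:
            tied += 1
    total = len(events) * len(non_events)
    pct_c = 100.0 * concordant / total
    pct_d = 100.0 * discordant / total
    pct_t = 100.0 * tied / total
    # Assumed relation: ties count as half a concordant pair.
    auc = (pct_c + 0.5 * pct_t) / 100.0
    return pct_c, pct_d, pct_t, auc

# Hypothetical toy data: 3 events and 3 non-events give 9 pairs.
y = [1, 1, 1, 0, 0, 0]
p = [0.9, 0.7, 0.4, 0.6, 0.4, 0.2]
pct_c, pct_d, pct_t, auc = concordance_stats(y, p)
print(pct_c, pct_d, pct_t, auc)
```

In the toy data, 7 of the 9 pairs are concordant, 1 is discordant and 1 is tied. The pairwise loop is quadratic in the number of observations, so it suits small samples and checks rather than production scoring.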

**Rules:**

- If C >= 0.9, the model is considered to have outstanding discrimination. **Caution:** The model may be faced with the problem of over-fitting.
- If 0.8 <= C < 0.9, the model is considered to have excellent discrimination.
- If 0.7 <= C < 0.8, the model is considered to have acceptable discrimination.
- If C = 0.5, the model has no discrimination (random case).
- If C < 0.5, the model is worse than random.

**Gini (Somers' D)**

It is a common measure for assessing the predictive power of a credit risk model. It measures how much better the model discriminates than a model that assigns random scores: 0 corresponds to random scoring and 1 to perfect separation.

Somers' D = 2 * AUC - 1

or

Somers' D = (Concordant Percent - Discordant Percent) / 100

It should be greater than 0.4.
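As a quick check that the two formulas agree, here is a small Python sketch. The function names and the percentages are hypothetical, and the AUC is derived from them assuming AUC = (concordant % + 0.5 * tied %) / 100.

```python
def somers_d_from_auc(auc):
    """Somers' D (Gini) from the area under the ROC curve."""
    return 2.0 * auc - 1.0

def somers_d_from_pairs(pct_concordant, pct_discordant):
    """Somers' D (Gini) from the concordant and discordant percentages."""
    return (pct_concordant - pct_discordant) / 100.0

# Hypothetical pair percentages; they must add up to 100.
pct_c, pct_d, pct_t = 70.0, 20.0, 10.0
auc = (pct_c + 0.5 * pct_t) / 100.0        # assumed relation, gives 0.75
d_from_auc = somers_d_from_auc(auc)
d_from_pairs = somers_d_from_pairs(pct_c, pct_d)
print(d_from_auc, d_from_pairs)
```

Both routes give 0.5 here, which is above the 0.4 rule of thumb.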

**Kolmogorov-Smirnov Statistic (KS)**

It looks at the maximum difference between the cumulative distribution of events and the cumulative distribution of non-events, with observations ranked by predicted probability.

The maximum **KS** should occur within the top 3 deciles, and the **KS statistic** should be between 40 and 70.

The model should also rank-order well: it should capture the highest number of events in the first decile, with the count going progressively down in later deciles. For example, decile 2 should not capture more events than decile 1.
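A decile-based KS calculation can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `ks_table` and the toy data are invented, bins are cut as equal-sized slices of the ranked data, and KS is expressed in percentage points to match the 40 to 70 range quoted above.

```python
def ks_table(y_true, y_prob, n_bins=10):
    """Return (KS, bin where the maximum occurs, per-bin rows)."""
    # Rank observations from highest to lowest predicted probability.
    ranked = sorted(zip(y_prob, y_true), key=lambda t: t[0], reverse=True)
    n = len(ranked)
    total_events = sum(y for _, y in ranked)
    total_non_events = n - total_events
    rows, cum_events, cum_non_events = [], 0, 0
    for b in range(n_bins):
        chunk = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        events = sum(y for _, y in chunk)
        cum_events += events
        cum_non_events += len(chunk) - events
        cum_event_pct = 100.0 * cum_events / total_events
        cum_non_event_pct = 100.0 * cum_non_events / total_non_events
        rows.append((b + 1, events, cum_event_pct, cum_non_event_pct,
                     cum_event_pct - cum_non_event_pct))
    best = max(rows, key=lambda r: r[4])   # row with the largest gap
    return best[4], best[0], rows

# Hypothetical toy data: 20 observations, so each "decile" holds 2 rows.
probs = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.50,
         0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05, 0.02]
actual = [1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
ks, ks_bin, rows = ks_table(actual, probs)
print(round(ks, 2), ks_bin)
```

The `rows` list also holds the number of events per bin, which is what you would inspect to check the rank-ordering requirement described above.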

**2. Calibration**

It is a measure of how close the predicted probabilities are to the actual rate of events.

**I. Hosmer and Lemeshow Test (HL)**

It measures the association between actual events and predicted probability.

In the HL test, the null hypothesis states that the observed events and non-events agree with the predicted events and non-events. In other words, the model fits the data well.

**Calculation**

- Calculate the estimated probability of the event for each observation
- Sort the data in descending order of probability and split it into 10 sections (deciles)
- Count the actual events and non-events in each section
- Calculate Predicted Probability = 1 as the average probability in each section
- Calculate Predicted Probability = 0 as 1 minus Predicted Probability = 1
- Calculate the expected frequencies of events and non-events by multiplying the number of cases in each section by Predicted Probability = 1 and Predicted Probability = 0, respectively
- Calculate the chi-square statistic from the observed (actual) and expected frequencies of events and non-events across all sections
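The steps above can be sketched in Python using only the standard library. Several things here are assumptions rather than part of the original article: the function names, the simulated data, the use of g - 2 degrees of freedom (the conventional choice for g groups), and the closed-form chi-square tail probability, which holds only for an even number of degrees of freedom (8 for 10 groups).

```python
import math
import random

def hosmer_lemeshow(y_true, y_prob, n_groups=10):
    """Return (chi-square statistic, degrees of freedom) for the HL test."""
    # Sort by descending predicted probability, then cut into equal-sized groups.
    ranked = sorted(zip(y_prob, y_true), key=lambda t: t[0], reverse=True)
    n = len(ranked)
    chi_sq = 0.0
    for g in range(n_groups):
        chunk = ranked[g * n // n_groups:(g + 1) * n // n_groups]
        size = len(chunk)
        obs_events = sum(y for _, y in chunk)
        obs_non_events = size - obs_events
        p1 = sum(p for p, _ in chunk) / size   # "Predicted Probability = 1"
        p0 = 1.0 - p1                          # "Predicted Probability = 0"
        exp_events = size * p1
        exp_non_events = size * p0
        chi_sq += (obs_events - exp_events) ** 2 / exp_events
        chi_sq += (obs_non_events - exp_non_events) ** 2 / exp_non_events
    return chi_sq, n_groups - 2   # assumed df convention: groups - 2

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; this closed form needs an even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form only valid for positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i)
                                 for i in range(df // 2))

# Simulated data where outcomes are drawn from the stated probabilities,
# so the "model" is well calibrated by construction (illustrative only).
random.seed(0)
probs = [random.uniform(0.05, 0.95) for _ in range(500)]
outcomes = [1 if random.random() < p else 0 for p in probs]
chi_sq, df = hosmer_lemeshow(outcomes, probs)
p_value = chi2_sf_even_df(chi_sq, df)
print(round(chi_sq, 3), df, round(p_value, 3))
```

In practice a statistics package would supply the chi-square p-value; the helper is included only so the sketch has no external dependencies.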

[Figure: Hosmer and Lemeshow Test]

Rule: If the p-value > 0.05, the model fits the data well.

**II. Deviance and Residual Test**

The null hypothesis states that the model fits the data well. In other words, the null hypothesis is that the fitted model is correct.

[Figure: Deviance and Residual Test]

Since the p-value is greater than 0.05 for both tests, we can say the model fits the data well. In SAS, these tests can be computed by using the option **scale = none aggregate** in PROC LOGISTIC.

**III. Brier Score**

The Brier score is an important measure of calibration i.e. the mean squared difference between the predicted probability and the actual outcome.

The lower the Brier score for a set of predictions, the better calibrated the predictions are.

- If the predicted probability is 1 and it happens, then the Brier Score is 0, the best score achievable.
- If the predicted probability is 1 and it does not happen, then the Brier Score is 1, the worst score achievable.
- If the predicted probability is 0.8 and it happens, then the Brier Score is (0.8-1)^2 =0.04.
- If the predicted probability is 0.2 and it happens, then the Brier Score is (0.2-1)^2 =0.64.
- If the predicted probability is 0.5, then the Brier Score is (0.5-1)^2 = 0.25, regardless of whether it happens.
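The cases listed above can be reproduced with a short Python sketch. The function name `brier_score` and the sample values are illustrative; outcomes are coded 1 if the event happens and 0 otherwise.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and actual outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

# Single-prediction cases from the list above.
best = brier_score([1], [1.0])        # predicted 1, it happens
worst = brier_score([0], [1.0])       # predicted 1, it does not happen
likely = brier_score([1], [0.8])      # predicted 0.8, it happens
unlikely = brier_score([1], [0.2])    # predicted 0.2, it happens
coin_yes = brier_score([1], [0.5])    # predicted 0.5, it happens
coin_no = brier_score([0], [0.5])     # predicted 0.5, it does not happen

# For a set of predictions, the score is the average over all of them.
overall = brier_score([1, 0, 1, 1, 1], [1.0, 1.0, 0.8, 0.2, 0.5])
print(best, worst, round(likely, 2), round(unlikely, 2), coin_yes, coin_no,
      round(overall, 3))
```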

With the **fitstat** option in PROC LOGISTIC, SAS returns the Brier score and other fit statistics such as AUC, AIC and BIC.

proc logistic data=train;

model y(event="1") = entry;

/* score the validation data; fitstat requests fit statistics for the scored data */

score data=valid out=valpred fitstat;

run;

*A complete assessment of model performance should take into consideration both discrimination and calibration. It is believed that discrimination is more important than calibration.*
