## Cumulative Accuracy Profile (CAP)

The Cumulative Accuracy Profile (CAP) of a credit rating model plots the cumulative percentage of all borrowers (debtors) on the x-axis against the cumulative percentage of defaulters (bad customers) on the y-axis. In marketing analytics, it is called a `Gain Chart`. It is also called a Power Curve in some other domains.

The steps to build the CAP are as follows:
- Sort the estimated probabilities of default (PD) in descending order and split them into 10 parts (deciles), so that the riskiest borrowers (highest PD) fall in the top decile and the safest borrowers in the bottom decile. Splitting into 10 parts is not a fixed rule; you can use rating grades instead.
- Calculate the number of borrowers (observations) in each decile.
- Calculate the number of bad customers in each decile.
- Calculate the cumulative number of bad customers up to each decile.
- Calculate the percentage of bad customers in each decile (as a share of all bad customers).
- Calculate the cumulative percentage of bad customers up to each decile.

**So far, the calculations are based on the PD model (remember that the first step uses the probabilities obtained from the PD model).**

**Next step:** What should the number of bad customers in each decile be under a perfect model?

- In a perfect model, the first decile should capture all the bad customers, since it corresponds to the worst rating grade, that is, the borrowers most likely to default. In our case, the first decile cannot capture all the bad customers because the number of borrowers falling in the first decile is less than the total number of bad customers, so the remaining bad customers spill over into the following decile(s).
- Calculate the cumulative number of bad customers in each decile based on the perfect model.
- Calculate the cumulative % of bad customers in each decile based on the perfect model.

**Next step:** Calculate the cumulative percentage of bad customers in each decile based on a random model. In a random model, each decile should contain 10% of the bad customers. The cumulative percentage is therefore 10% in decile 1, 20% in decile 2, and so on up to 100% in decile 10.

**Next step:** Create a plot with the cumulative % of bads based on the current, random and perfect models. The x-axis shows the percentage of borrowers (observations) and the y-axis shows the percentage of bad customers.
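The steps above can be sketched in base R as follows. This is only an illustration under stated assumptions: the portfolio is simulated (the variables `pd` and `bad` and the 1,000 borrowers are hypothetical, not data from this article), and deciles are formed as ten equal-sized groups after sorting.

```r
# Hypothetical simulated portfolio (an assumption, not the article's data)
set.seed(42)
n   <- 1000
pd  <- runif(n)                              # stand-in for model PDs
bad <- rbinom(n, size = 1, prob = 0.3 * pd)  # 1 = defaulter, 0 = good

# Sort riskiest first and assign ten equal-sized deciles (1 = riskiest)
ord    <- order(pd, decreasing = TRUE)
pd     <- pd[ord]
bad    <- bad[ord]
decile <- ceiling(seq_len(n) / (n / 10))

# Borrowers and bad customers per decile
n_obs     <- as.vector(tapply(bad, decile, length))
n_bad     <- as.vector(tapply(bad, decile, sum))
total_bad <- sum(n_bad)

cap <- data.frame(
  decile              = 1:10,
  n_obs               = n_obs,
  n_bad               = n_bad,
  cum_pct_borrowers   = cumsum(n_obs) / n,
  # Current model: cumulative share of all bads captured so far
  cum_pct_bad_model   = cumsum(n_bad) / total_bad,
  # Perfect model: each decile is filled with bads until none remain
  cum_pct_bad_perfect = pmin(cumsum(n_obs), total_bad) / total_bad,
  # Random model: bads are captured in proportion to borrowers
  cum_pct_bad_random  = cumsum(n_obs) / n
)
print(cap)

# Plot the three curves (skipped when run non-interactively)
if (interactive()) {
  x <- c(0, cap$cum_pct_borrowers)
  plot(x, c(0, cap$cum_pct_bad_perfect), type = "l", lty = 2,
       xlab = "% of borrowers", ylab = "% of bad customers")
  lines(x, c(0, cap$cum_pct_bad_model))
  lines(x, c(0, cap$cum_pct_bad_random), lty = 3)
  legend("bottomright", legend = c("Perfect", "Current", "Random"),
         lty = c(2, 1, 3))
}
```

With rating grades instead of deciles, the same table can be built by replacing `decile` with the grade label, ordered from worst to best.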

## Accuracy Ratio

In the case of the CAP (Cumulative Accuracy Profile), the Accuracy Ratio is the ratio of the area between your current predictive model and the diagonal line to the area between the perfect model and the diagonal line. In other words, it is the ratio of the performance improvement of the current model over the random model to the performance improvement of the perfect model over the random model.

The first step is to compute the area under the current model's curve by summing, over all subintervals, the area of each trapezoid:

$$(x_{i+1} - x_{i}) \cdot (y_{i} + y_{i+1}) \cdot 0.5$$

In this case, $(x_{i+1} - x_{i})$ is the width of the subinterval and $(y_{i} + y_{i+1}) \cdot 0.5$ is the average height. **x** refers to the cumulative proportion of borrowers at the different decile levels and **y** refers to the cumulative proportion of bad customers at the different decile levels. The values of $x_{0}$ and $y_{0}$ are 0.

Once the above step is completed, the next step is to subtract 0.5 from the area it returned. You may be wondering about the relevance of 0.5: it is the area below the diagonal line. We subtract it because we only need the area between the current model and the diagonal line (let's call it `B`).

Now we need the denominator, which is the area between the perfect model and the diagonal line, `A + B`. It is equivalent to `0.5*(1 - Prob(Bad))`. See all the calculation steps shown in the table below.

In the R code below, we prepare sample data as an example. The variable `pred` refers to the predicted probabilities and the variable `y` refers to the dependent variable (the actual event). We only need these two variables to calculate the Accuracy Ratio.

```r
library(magrittr)
library(dplyr)

# Sample data for demonstration
mydata = data.frame(pred = c(0.6,0.1,0.8,0.3,0.5,0.6,0.4,0.3,0.5),
                    y = c(1,0,1,0,1,1,0,1,0))

# Sort data in descending order of predicted probability
mydata %<>% arrange(desc(pred))

# Cumulative % of borrowers
random = 1:length(mydata$pred)/length(mydata$pred)

# Cumulative % of bads
cumpercentbad = cumsum(mydata$y)/sum(mydata$y)

# Calculate AR: prepend the origin, then apply the trapezoidal rule
random = c(0,random)
cumpercentbad = c(0,cumpercentbad)
idx = 2:length(cumpercentbad)
testdf = data.frame(cumpercentpop = (random[idx] - random[idx-1]),
                    cumpercentbad = (cumpercentbad[idx] + cumpercentbad[idx-1]))
Area = sum(testdf$cumpercentbad * testdf$cumpercentpop/2)
Numerator = Area - 0.5
Denominator = 0.5*(1-mean(mydata$y))
(AR = Numerator / Denominator)
```
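The denominator `0.5*(1 - Prob(Bad))` can be checked with a short derivation. The sketch below assumes that the perfect model's CAP rises in a straight line from (0, 0) to (p, 1) and then stays flat at 1, where p = Prob(Bad) is a symbol introduced here only for brevity:

```latex
\begin{aligned}
\text{Area under the perfect curve} &= \tfrac{1}{2}\, p \cdot 1 + (1 - p) \cdot 1 = 1 - \tfrac{p}{2} \\
A + B &= \left(1 - \tfrac{p}{2}\right) - \tfrac{1}{2} = \tfrac{1}{2}\,(1 - p)
\end{aligned}
```

For the sample data, 5 of the 9 observations are bad, so p = 5/9 and A + B = 0.5 * (4/9), which is about 0.222.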

## Gini Coefficient

The Gini coefficient chart is very similar to the CAP, but its x-axis shows the cumulative proportion of good customers instead of all customers. It shows the extent to which the model has better classification capabilities than the random model. It is also called the Gini Index. The Gini coefficient can take values between -1 and 1. Negative values correspond to a model with reversed meanings of scores.

Gini = B / (A+B). Or Gini = 2B, since the area of A + B is 0.5.

See the calculation steps of the Gini coefficient below:

The Gini coefficient is a special case of Somers' D statistic. If you have the concordance and discordance percentages, you can compute the Gini coefficient.

`Gini Coefficient = (Concordance percent - Discordance Percent)`

Concordance percent refers to the proportion of pairs in which the defaulter has a higher predicted probability than the good customer.

Discordance percent refers to the proportion of pairs in which the defaulter has a lower predicted probability than the good customer.

*Another way of calculating the Gini coefficient is to use the concordance and discordance percentages (as explained above). Refer to the R code below.*

```r
ModelPerformance <- function(actuals, predictedScores) {
  # Actuals and fitted values side by side
  fitted <- data.frame(Actuals = actuals, PredictedScores = predictedScores)
  colnames(fitted) <- c('Actuals', 'PredictedScores')  # rename columns
  ones  <- fitted[fitted$Actuals == 1, ]  # subset events (defaulters)
  zeros <- fitted[fitted$Actuals == 0, ]  # subset non-events (goods)
  totalPairs <- nrow(ones) * nrow(zeros)  # total number of pairs to check

  # A pair is concordant if the 1 (event) has a higher predicted probability than the 0
  conc <- sum(c(vapply(ones$PredictedScores,
                       function(x) {((x > zeros$PredictedScores))},
                       FUN.VALUE = logical(nrow(zeros)))), na.rm = T)

  # A pair is discordant if the 1 (event) has a lower predicted probability than the 0
  disc <- sum(c(vapply(ones$PredictedScores,
                       function(x) {((x < zeros$PredictedScores))},
                       FUN.VALUE = logical(nrow(zeros)))), na.rm = T)

  # Calculate concordance, discordance, ties, Gini and AUC
  concordance <- conc/totalPairs
  discordance <- disc/totalPairs
  tiesPercent <- (1 - concordance - discordance)
  Gini <- (conc - disc)/totalPairs
  AUC  <- concordance + 0.5*tiesPercent
  return(list("Concordance" = concordance, "Discordance" = discordance,
              "Tied" = tiesPercent, "Gini" = Gini, "AUC" = AUC))
}

ModelPerformance(mydata$y, mydata$pred)
```

## Are Gini Coefficient and Accuracy Ratio equivalent?

Yes, they are always equal. Hence the Gini coefficient is sometimes called the Accuracy Ratio (AR).

Yes, the axes in the Gini and AR charts are different, so the question arises how the two can still be the same. If you solve the equations, you find that Area B in the Gini chart equals Area B / Prob(Good) in the CAP chart, which in turn equals (1/2)*AR, because AR = Area B / (Area A + B) and Area A + B in the CAP chart is 0.5*Prob(Good). Multiplying both sides by 2 gives Gini = 2*B (Gini chart) = Area B / (Area A + B) (CAP chart) = AR.
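The argument can be written out as a short derivation. The notation is introduced here as an assumption and is not the article's: p = Prob(Bad), TPR and FPR are the cumulative shares of bad and good customers captured down to a given score, and AUC is the area under the curve of TPR against FPR. The curves are treated as continuous (piecewise linear across tied scores).

```latex
\begin{aligned}
x_{\text{CAP}} &= p \cdot \mathrm{TPR} + (1 - p) \cdot \mathrm{FPR}, \qquad y_{\text{CAP}} = \mathrm{TPR} \\
\int y_{\text{CAP}} \, dx_{\text{CAP}}
  &= p \int \mathrm{TPR} \, d\,\mathrm{TPR} + (1 - p) \int \mathrm{TPR} \, d\,\mathrm{FPR}
   = \tfrac{p}{2} + (1 - p)\,\mathrm{AUC} \\
B_{\text{CAP}} &= \tfrac{p}{2} + (1 - p)\,\mathrm{AUC} - \tfrac{1}{2}
   = (1 - p)\left(\mathrm{AUC} - \tfrac{1}{2}\right) \\
\mathrm{AR} &= \frac{B_{\text{CAP}}}{\tfrac{1}{2}(1 - p)}
   = 2\left(\mathrm{AUC} - \tfrac{1}{2}\right)
   = 2\,B_{\text{Gini}} = \mathrm{Gini}
\end{aligned}
```

Here B in the Gini chart is AUC - 1/2, so B in the CAP chart divided by Prob(Good) = 1 - p is the same quantity, which is the identity stated above.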

## Area under ROC Curve (AUC)

The ROC curve plots the proportion of true positives (a defaulter correctly classified as a defaulter) against the proportion of false positives (a non-defaulter wrongly classified as a defaulter). The AUC score is the area under this curve, obtained by summing the individual areas calculated at each rating grade or decile level.

See also: **4 Methods to calculate AUC Mathematically**
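As a minimal sketch of that summation, the code below builds one ROC point per distinct score and adds up the trapezoids between consecutive points. It reuses the sample values from the Accuracy Ratio example; the names `cutoffs`, `tpr` and `fpr` are introduced here for illustration only.

```r
# Sample data (same values as in the Accuracy Ratio example)
pred <- c(0.6, 0.1, 0.8, 0.3, 0.5, 0.6, 0.4, 0.3, 0.5)
y    <- c(1, 0, 1, 0, 1, 1, 0, 1, 0)

# One ROC point per distinct score, from the strictest cut-off to the loosest
cutoffs <- sort(unique(pred), decreasing = TRUE)

# Share of defaulters / non-defaulters flagged at each cut-off
tpr <- sapply(cutoffs, function(k) sum(pred >= k & y == 1) / sum(y == 1))
fpr <- sapply(cutoffs, function(k) sum(pred >= k & y == 0) / sum(y == 0))

# Start the curve at the origin, then sum the trapezoid areas
tpr <- c(0, tpr)
fpr <- c(0, fpr)
idx <- 2:length(tpr)
auc <- sum((fpr[idx] - fpr[idx - 1]) * (tpr[idx] + tpr[idx - 1]) / 2)
auc  # 0.85 on this sample
```

Using one point per distinct score (rather than per observation) means tied scores are joined by a straight segment, so the result does not depend on how ties happen to be ordered.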

## Relationship between AUC and Gini Coefficient

You must be wondering how they are related: Gini = 2*AUC - 1.

If you reverse the axes of the chart shown in the section above named "Gini Coefficient", you get something similar to the chart below. Here `Gini = B / (A + B)`. The area of A + B is 0.5, so Gini = B / 0.5, which simplifies to `Gini = 2*B`. `AUC = B + 0.5`, which further simplifies to B = AUC - 0.5. Substituting this into `Gini = 2*B`:

Gini = 2*(AUC - 0.5)

Gini = 2*AUC - 1
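The relationship can be checked numerically. The sketch below computes AUC from ranks (the Mann-Whitney form, which is a swapped-in technique and not the method used elsewhere in this article) and Gini from concordant and discordant pairs, then compares the two. Variable names are illustrative.

```r
# Sample data (same values as in the Accuracy Ratio example)
pred <- c(0.6, 0.1, 0.8, 0.3, 0.5, 0.6, 0.4, 0.3, 0.5)
y    <- c(1, 0, 1, 0, 1, 1, 0, 1, 0)

n1 <- sum(y == 1)  # number of defaulters
n0 <- sum(y == 0)  # number of good customers

# AUC from ranks; rank() assigns average ranks to tied scores
r   <- rank(pred)
auc <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# Gini from concordant minus discordant (defaulter, good) pairs
ones  <- pred[y == 1]
zeros <- pred[y == 0]
conc  <- mean(outer(ones, zeros, ">"))
disc  <- mean(outer(ones, zeros, "<"))
gini_pairs <- conc - disc

# Both routes should agree: Gini = 2*AUC - 1
gini_auc <- 2 * auc - 1
c(AUC = auc, Gini_from_pairs = gini_pairs, Gini_from_AUC = gini_auc)
```

On this sample both routes give a Gini of 0.7 with an AUC of 0.85.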

Thanks for this post. I have been waiting for this post.

Just wanted to reconfirm AR and Gini values will be same for any number of defaults.

Yes, it will be same even for low default portfolios.

Thanks for the clarification.

I really appreciate the post, it was concise and most helpful.

Please can you include your own/other references at the end.

I'm busy doing a study on low/high defaults credit models and would also love to read more.

It would much be appreciated.

Thank you very much :)

Hey good explanations there! But is it true that (A + B) is theoretically less than 0.5 as long as there are defaulters? In other words, AR and Gini are not always equal?

Hi Deepanshu! Thank you for this very informative content.

Hope you could expound further on this:

GINI: Negative values correspond to a model with reversed meanings of scores.

How do I actually interpret the model with negative Gini? Is this actually inconclusive?

Hi,

Could you please tell me the significance of Gini over KS?

Does AR has limitations? If the model has a lower bad rate, it makes the perfect model curve a lot higher & the overall AR becomes low. For eg, I had a model with 6% bad rate for 300K model population. AR was 47% as compared to a model with 32% bad rate and 300K population, AR was 66%. It does not mean 47% AR is bad right?
