A Complete Guide to Area Under Curve (AUC)

This tutorial explains the area under the ROC curve (AUC) and shows several ways to calculate it mathematically, with implementations in SAS and R. Most statistical packages report this model performance statistic by default when you run a classification model. However, it is important to know how it is calculated. Understanding the calculation gives you the confidence to explain the metric, and an edge over your peers when your predictive model needs calibration or refitting. The calculation is shown in both SAS and R so that readers with access to either commercial or open-source software can follow along and code without any technical issue.
What is Area under Curve?
The area under the receiver operating characteristic (ROC) curve, or AUC, is used to evaluate and compare the performance of binary classification models. It measures the discrimination power of your predictive classification model. In simple words, it checks how well the model is able to distinguish (separate) events from non-events. Suppose you are building a predictive model for a bank to identify customers who are likely to buy a credit card. In this case, purchase of a credit card is the event (or desired outcome) and non-purchase is the non-event.
The ROC curve is a plot of the proportion of true positives (events correctly predicted to be events) versus the proportion of false positives (non-events wrongly predicted to be events) at different probability cut-offs; AUC is the area under this curve. True Positive Rate is also called Sensitivity. False Positive Rate is also called (1-Specificity). Sensitivity is on the Y-axis and (1-Specificity) is on the X-axis. The higher the AUC score, the better the model.

The diagonal line represents a random classification model. It is equivalent to prediction by tossing a coin. At every point along the diagonal, the true positive rate equals the false positive rate.
See below how it works. The cut-off is the minimum threshold at or above which a predicted probability is classified as an 'event' (desired outcome). In other words, a predicted probability greater than or equal to the cut-off is classified as 1. Let's say the cut-off is 0.5. In the case of a propensity-to-buy model, a predicted probability >= 0.5 would be classified as 'purchase of product'. To generate the ROC curve, we calculate Sensitivity and (1-Specificity) at all possible cut-offs and then plot them.
Cut-off   Sensitivity   Specificity   1-Specificity
0         1             0             1
0.01      0.979         0.081         0.919
0.02      0.938         0.158         0.842
...       ...           ...           ...
0.99      0.02          0.996         0.004
1         0             1             0
ROC Curve
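The cut-off sweep described above can be sketched in a few lines of R. This is only an illustration under stated assumptions: the eight observations are made-up toy data, and `roc_point` is a hypothetical helper name, not a function from any package.

```r
# Toy data (hypothetical): actual outcomes and predicted probabilities
actual <- c(1, 1, 0, 1, 0, 0, 1, 0)
pred   <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)

# Sensitivity and specificity at one cut-off.
# An observation is classified as an event when pred >= cutoff.
roc_point <- function(cutoff, actual, pred) {
  predicted   <- as.integer(pred >= cutoff)
  sensitivity <- sum(predicted == 1 & actual == 1) / sum(actual == 1)
  specificity <- sum(predicted == 0 & actual == 0) / sum(actual == 0)
  c(cutoff = cutoff, sensitivity = sensitivity,
    specificity = specificity, one_minus_spec = 1 - specificity)
}

# Evaluate a coarse grid of cut-offs; a real ROC curve would use many more
cutoffs   <- seq(0, 1, by = 0.25)
roc_table <- as.data.frame(t(sapply(cutoffs, roc_point,
                                    actual = actual, pred = pred)))
print(roc_table)
```

Plotting `one_minus_spec` on the x-axis against `sensitivity` on the y-axis gives a rough ROC curve for this toy data.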
Example
Consider a customer attrition model, which predicts the likelihood that a customer will leave or end their relationship with us. Here the ROC curve shows the trade-off between the proportion of attritors correctly predicted as attritors and the proportion of non-attritors wrongly predicted as attritors, and AUC summarizes that trade-off in a single number.
Methods to calculate Area under Curve

AUC using Concordance and Tied Percent

  1. Calculate the predicted probability in logistic regression (or any other binary classification model). It is not restricted to logistic regression.
  2. Divide the data into two datasets. The first contains the observations whose actual value of the dependent variable is 1 (event), together with their predicted probabilities. The second contains the observations whose actual value is 0 (non-event), together with their predicted probabilities.
  3. Compare each predicted value in first dataset with each predicted value in second dataset.
  4. Total Number of pairs to compare = x * y
    x : Number of observations in first dataset (actual values of 1 in dependent variable)
    y : Number of observations in second dataset (actual values of 0 in dependent variable).

    In this step, we are performing cartesian product (cross join) of events and non-events. For example, you have 100 events and 1000 non-events. It would create 100k (100*1000) pairs for comparison.
  5. A pair is concordant if 1 (observation with the desired outcome i.e. event) has a higher predicted probability than 0 (observation without the outcome i.e. non-event).
  6. A pair is discordant if 0 (observation without the desired outcome i.e. non-event) has a higher predicted probability than 1 (observation with the outcome i.e. event).
  7. A pair is tied if 1 (observation with the desired outcome i.e. event) has the same predicted probability as 0 (observation without the outcome i.e. non-event).
  8. The final percent values are calculated using the formulas below:
    Percent Concordant = 100*[(Number of concordant pairs)/Total number of pairs]
    Percent Discordant = 100*[(Number of discordant pairs)/Total number of pairs]
    Percent Tied = 100*[(Number of tied pairs)/Total number of pairs]
  9. Area under curve (AUC) = (Percent Concordant + 0.5 * Percent Tied)/100
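To make the pair counting concrete, here is a small worked sketch in R. The scores are invented for illustration (3 events, 4 non-events), and `outer()` is used as one convenient way to form every event/non-event pair; it is not the only way to do it.

```r
# Hypothetical predicted probabilities
events    <- c(0.9, 0.6, 0.4)        # observations with actual value 1
nonevents <- c(0.7, 0.4, 0.3, 0.2)   # observations with actual value 0

# Total number of pairs = x * y
total_pairs <- length(events) * length(nonevents)

# outer() compares every event score with every non-event score (3 x 4 matrix)
concordant <- sum(outer(events, nonevents, ">"))
discordant <- sum(outer(events, nonevents, "<"))
tied       <- sum(outer(events, nonevents, "=="))

pct_concordant <- 100 * concordant / total_pairs
pct_discordant <- 100 * discordant / total_pairs
pct_tied       <- 100 * tied / total_pairs

auc <- (pct_concordant + 0.5 * pct_tied) / 100
print(c(concordant = concordant, discordant = discordant, tied = tied, auc = auc))
```

With these toy scores there are 12 pairs: 9 concordant, 2 discordant and 1 tied, so AUC = (75 + 0.5 * 8.33)/100, roughly 0.79.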

Interpretation of Concordant, Discordant and Tied Percent

Percent Concordant : Percentage of pairs where the observation with the desired outcome (event) has a higher predicted probability than the observation without the outcome (non-event).

Percent Discordant : Percentage of pairs where the observation with the desired outcome (event) has a lower predicted probability than the observation without the outcome (non-event).

Percent Tied : Percentage of pairs where the observation with the desired outcome (event) has the same predicted probability as the observation without the outcome (non-event).

AUC : Area under curve (AUC) is also known as the c-statistic. Some statisticians also call it AUROC, which stands for area under the receiver operating characteristic curve. It is calculated by adding Percent Concordant and 0.5 times Percent Tied, then dividing by 100.

The Gini coefficient, or Somers' D statistic, is closely related to AUC. It is calculated as (2*AUC - 1). It can also be calculated as (Percent Concordant - Percent Discordant)/100. The two forms agree because the concordant, discordant and tied percentages sum to 100.

In general, higher percentages of concordant pairs and lower percentages of discordant and tied pairs indicate a more desirable model.

SAS and R Code for ROC, Concordant / Discordant :
Download the CSV data file from UCLA website.
The code below calculates these performance metrics in SAS and R. It executes each step explained above theoretically.

SAS Code

FILENAME PROBLY TEMP;
PROC HTTP
 URL="https://stats.idre.ucla.edu/stat/data/binary.csv"
 METHOD="GET"
 OUT=PROBLY;
RUN;

OPTIONS VALIDVARNAME=ANY;
PROC IMPORT
  FILE=PROBLY
  OUT=WORK.binary REPLACE
  DBMS=CSV;
RUN;

ods graphics on;
Proc logistic data= WORK.binary descending plots(only)=roc;
class rank / param=ref ;
model admit = gre gpa rank;
output out = estprob p= pred;
run;

/*split the data into two datasets- event and non-event*/ 
Data event nonevent;
Set estprob;
If admit = 1 then output event;
else if admit = 0 then output nonevent;
run;

/*Cartesian product of event and non-event actual cases*/ 
Proc SQL noprint;
create table pairs as
select a.admit as admit1, b.admit as admit0,
a.pred as pred1,b.pred as pred0
from event a cross join nonevent b;
quit;

/*Calculating concordant,discordant and tied percent*/
Data pairs;
set pairs;
concordant =0;
discordant=0;
tied=0;
If pred1 > pred0 then concordant = 1;
else If pred1 < pred0 then discordant = 1;
else tied = 1;
run;

/*Mean values - Final Result*/
proc sql;
select mean(Concordant)*100 as Percent_Concordant,
mean(Discordant) *100 as Percent_Discordant,
mean(Tied)*100 as Percent_Tied,
(calculated Percent_Concordant + 0.5* calculated Percent_Tied)/100 as AUC,
2*calculated AUC - 1 as somers_d
from pairs;
quit;

R Code

# Read Data
df = read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

# Factor Variables
df$admit = as.factor(df$admit)
df$rank = as.factor(df$rank)

# Logistic Model
df$rank <- relevel(df$rank, ref='4')
mylogistic <- glm(admit ~ ., data = df, family = "binomial")
summary(mylogistic)$coefficient

# Predict
pred = predict(mylogistic, type = "response")
finaldata = cbind(df, pred)


# Concordance, discordance, ties, AUC and Gini from actual labels and predicted scores
AUC <- function (actuals, predictedScores){
  fitted <- data.frame (Actuals=actuals, PredictedScores=predictedScores)
  ones <- fitted[fitted$Actuals==1, ] # Subset ones (events)
  zeros <- fitted[fitted$Actuals==0, ] # Subset zeros (non-events)
  totalPairs <- nrow (ones) * nrow (zeros) # calculate total number of pairs to check
  # For each event score, count the non-event scores it exceeds (concordant pairs)
  conc <- sum (c(vapply(ones$PredictedScores, function(x) {((x > zeros$PredictedScores))}, FUN.VALUE=logical(nrow(zeros)))), na.rm=TRUE)
  # For each event score, count the non-event scores it falls below (discordant pairs)
  disc <- sum(c(vapply(ones$PredictedScores, function(x) {((x < zeros$PredictedScores))}, FUN.VALUE = logical(nrow(zeros)))), na.rm = TRUE)
  concordance <- conc/totalPairs
  discordance <- disc/totalPairs
  tiesPercent <- (1-concordance-discordance) # remaining pairs are ties
  AUC <- concordance + 0.5*tiesPercent
  Gini <- 2*AUC - 1
  return(list("Concordance"=concordance, "Discordance"=discordance,
              "Tied"=tiesPercent, "AUC"=AUC, "Gini or Somers D"=Gini))
}

AUC(finaldata$admit, finaldata$pred)

Calculate AUC using Integration Method

The trapezoidal rule of numerical integration is used to find the area under the curve. The area of a single trapezoid is
$(x_{i+1} - x_i) \cdot (y_i + y_{i+1}) / 2$
In our case, x refers to the values of false positive rate (1-Specificity) at different probability cut-offs, and y refers to the true positive rate (Sensitivity) at the same cut-offs. Vector x needs to be sorted. Any observation with a predicted probability that equals or exceeds the probability cut-off is predicted to be an event; otherwise, it is predicted to be a non-event.
$(\mathrm{fpr}_{i+1} - \mathrm{fpr}_i) \cdot (\mathrm{tpr}_i + \mathrm{tpr}_{i+1}) / 2$
fpr represents false positive rate (1-Specificity) and tpr represents true positive rate (Sensitivity). AUC is the sum of these trapezoid areas over all consecutive pairs of cut-offs. See the image below showing the step-by-step calculation. It includes only a few cut-offs for demonstration purposes.
AUC Calculation Steps
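As a stand-in for the step-by-step calculation, the trapezoidal rule can be sketched on a handful of ROC points. The four (fpr, tpr) points below are hypothetical and chosen only to keep the arithmetic easy to follow.

```r
# Hypothetical ROC points, already sorted by false positive rate
fpr <- c(0, 0.25, 0.50, 1)
tpr <- c(0, 0.50, 0.75, 1)

idx <- 2:length(fpr)

# Width of each trapezoid along the x-axis: fpr[i+1] - fpr[i]
widths <- fpr[idx] - fpr[idx - 1]

# Average height of each trapezoid: (tpr[i] + tpr[i+1]) / 2
heights <- (tpr[idx] + tpr[idx - 1]) / 2

# Area of each trapezoid, then the total area under the curve
areas <- widths * heights
auc   <- sum(areas)

print(data.frame(widths, heights, areas))
print(auc)
```

Here the three trapezoids have areas 0.0625, 0.15625 and 0.4375, which sum to an AUC of 0.65625.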

SAS Code

In the SAS program below, we use PROC IML to perform the integration calculations.
FILENAME PROBLY TEMP;
PROC HTTP
 URL="https://stats.idre.ucla.edu/stat/data/binary.csv"
 METHOD="GET"
 OUT=PROBLY;
RUN;

OPTIONS VALIDVARNAME=ANY;
PROC IMPORT
  FILE=PROBLY
  OUT=WORK.binary REPLACE
  DBMS=CSV;
RUN;

ods graphics on;
Proc logistic data= WORK.binary descending plots(only)=roc;
class rank / param=ref ;
model admit = gre gpa rank / outroc=performance;
output out = estprob p= pred;
run;

proc sort data=performance;
by _1MSPEC_;
run;

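proc iml;
use performance;
read all var {_SENSIT_} into sensitivity;
read all var {_1MSPEC_} into falseposrate;
close performance;
N  = 2 : nrow(falseposrate);
/* widths of the trapezoids along the x-axis */
fpr = falseposrate[N] - falseposrate[N-1];
/* sums of adjacent heights along the y-axis */
tpr = sensitivity[N] + sensitivity[N-1];
/* sum of width * (sum of heights) / 2 over all trapezoids */
ROC = fpr`*tpr/2;
Gini= 2*ROC - 1;
print ROC Gini;
quit;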

R Code

# Read Data
df = read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")

# Factor Variables
df$admit = as.factor(df$admit)
df$rank = as.factor(df$rank)

# Logistic Model
df$rank <- relevel(df$rank, ref='4')
mylogistic <- glm(admit ~ ., data = df, family = "binomial")
summary(mylogistic)$coefficient

# Predict
pred = predict(mylogistic, type = "response")
finaldata = cbind(df, pred)

library(ROCR)
predobj <- prediction(finaldata$pred, finaldata$admit)
perf <- performance(predobj,"tpr","fpr")
plot(perf)

# Trapezoidal rule of integration
# Computes the integral of Sensitivity (Y) with respect to FalsePosRate (x)
x = perf@x.values[[1]]
y = perf@y.values[[1]]
idx = 2:length(x)
testdf=data.frame(FalsePosRate = (x[idx] - x[idx-1]), Sensitivity = (y[idx] + y[idx-1]))
(AUROC = sum(testdf$FalsePosRate * testdf$Sensitivity)/2)

Calculate AUC using Mann–Whitney U Test

Area under curve (AUC) is directly related to the Mann–Whitney U test, which is also known as the Wilcoxon rank-sum test.

This test assumes that the predicted probabilities of events and non-events are two independent continuous random variables. The area under the curve is the probability that a randomly chosen event receives a higher predicted probability than a randomly chosen non-event: AUC = P(Event > Non-Event). When ties occur, each tied pair counts as one half, as in the concordance method.

$AUC = U_1 / (n_1 \cdot n_2)$, where $U_1 = R_1 - n_1 (n_1 + 1) / 2$

Here U1 is the Mann–Whitney U statistic and R1 is the sum of the ranks of the predicted probabilities of the actual events. It is calculated by ranking all predicted probabilities, selecting only those cases where the dependent variable is 1, and then summing their ranks. n1 is the number of 1s (events) in the dependent variable. n2 is the number of 0s (non-events) in the dependent variable.

n1*n2 is the total number of pairs (or cross product of number of events and non-events). It is similar to what we have done in concordance method to calculate AUC.


R Code

# AUC using Mann–Whitney test
auc_mannWhitney  <- function(y, pred){
  y  <- as.logical(y)
  n1 <- sum(y)
  n2 <- sum(!y)
  R1 <- sum(rank(pred)[y])
  U1 <- R1 - n1 * (n1 + 1)/2
  U1/(n1 * n2)
}

auc_mannWhitney(as.numeric(as.character(finaldata$admit)), finaldata$pred)
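
As a sanity check on the rank-sum formula, the sketch below compares it with the W statistic reported by `wilcox.test()` from R's built-in stats package. The scores are hypothetical toy values, and `exact = FALSE` is set only to avoid the warning R would otherwise give when ties are present.

```r
# Hypothetical predicted probabilities
events    <- c(0.9, 0.6, 0.4)        # actual value 1
nonevents <- c(0.7, 0.4, 0.3, 0.2)   # actual value 0
n1 <- length(events)
n2 <- length(nonevents)

# Rank-sum route: rank all scores together (ties get average ranks),
# then sum the ranks that belong to the events
scores <- c(events, nonevents)
R1 <- sum(rank(scores)[seq_len(n1)])
U1 <- R1 - n1 * (n1 + 1) / 2
auc_rank <- U1 / (n1 * n2)

# wilcox.test route: its W statistic for the first sample equals U1
W <- unname(wilcox.test(events, nonevents, exact = FALSE)$statistic)
auc_wilcox <- W / (n1 * n2)

print(c(R1 = R1, U1 = U1, auc_rank = auc_rank, auc_wilcox = auc_wilcox))
```

For these toy scores R1 is 15.5 and U1 is 9.5, so both routes give 9.5/12, about 0.79, the same value the concordance method produces for this data (9 concordant pairs plus half of 1 tied pair, out of 12).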

SAS Code

ods select none;
ods output WilcoxonScores=WilcoxonScore;
proc npar1way wilcoxon data= estprob ;
where admit^=.;
class admit;
var  pred;
run;
ods select all;

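data AUC;
set WilcoxonScore end=eof;
retain v1 v2 1;
/* v1: distance of the first class's rank sum from its expected value. */
/* Because of abs(), the resulting AUC is always >= 0.5. */
if _n_=1 then v1=abs(ExpectedSum - SumOfScores);
/* v2: running product of class sizes, i.e. n1*n2 after the last row */
v2=N*v2;
if eof then do;
d=v1/v2;
Gini=d * 2;
AUC=d+0.5;
put AUC=  GINI=;
keep AUC Gini;
output;
end;
run;

proc print data=AUC noobs;
run;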

Calculate AUC using Cumulative Events and Non-Events

In this method, we will see how we can calculate area under curve using decile (binned) data.
  1. Sort predicted probabilities in descending order. It means customer having high likelihood to buy a product should appear at top (in case of propensity model)
  2. Split or rank into 10 parts. It is similar to concept of calculating decile.
  3. Calculate number of cases in each decile level. It would be same in each level as we divided the data in 10 equal parts.
  4. Calculate number of 1s (event) in each decile level. Maximum 1s should be captured in first decile (if your model is performing fine!)
  5. Calculate cumulative percent of 1s in each decile level. Last decile should have 100% as it is cumulative in nature.
  6. Similar to the above step, we will calculate cumulative percent of 0s in each decile level.
  7. AUC would be calculated using the trapezoidal rule numerical integration formula. In this case, x is the cumulative % of 0s and y is the cumulative % of 1s.
This method returns an approximation of the AUC score since we are using 10 bins instead of raw values. In other words, the number of observations is greater than the number of bins here.
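The decile steps above can be sketched in R as follows. Everything here is an assumption made for illustration: the data are simulated (uniform scores, with outcomes drawn so that higher scores mean more events), and the variable names are invented. It is a minimal sketch of the binned approach, not a definitive implementation.

```r
# Simulated data (hypothetical): 1000 scores and matching 0/1 outcomes
set.seed(42)
n      <- 1000
pred   <- runif(n)
actual <- rbinom(n, size = 1, prob = pred)

# Steps 1-2: sort by predicted probability (descending) and split into 10 bins;
# decile 1 holds the highest scores
ord           <- order(pred, decreasing = TRUE)
actual_sorted <- actual[ord]
decile        <- ceiling(seq_len(n) / (n / 10))

# Steps 3-4: number of events (1s) and non-events (0s) in each decile
events    <- tapply(actual_sorted, decile, sum)
nonevents <- tapply(1 - actual_sorted, decile, sum)

# Steps 5-6: cumulative proportions, with the origin (0, 0) prepended
cum_events    <- c(0, cumsum(events) / sum(events))
cum_nonevents <- c(0, cumsum(nonevents) / sum(nonevents))

# Step 7: trapezoidal rule with x = cumulative 0s and y = cumulative 1s
idx <- 2:length(cum_events)
auc_decile <- sum((cum_nonevents[idx] - cum_nonevents[idx - 1]) *
                  (cum_events[idx] + cum_events[idx - 1]) / 2)
print(auc_decile)
```

Binning effectively treats all pairs that fall in the same decile as ties, which is why the result is only an approximation of the AUC computed from raw scores.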
About Author:

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 8 years of experience in data science. During his tenure, he has worked with global clients in various domains like Banking, Insurance, Telecom and Human Resource.

19 Responses to "A Complete Guide to Area Under Curve (AUC)"
  1. Very precise and clear explanation of concordance and discordance. Also the code helps in better understanding of the phenomenon. Thanks.

  2. Neat explanations, really helpful to understood these definitions. Thanks!

  3. Very clear explanation, thank you :)

  4. Thanks for the post! Shouldn't it be proc logistic with descending option? as we are treating 1s as events and 0 as nonevents

  5. First time I understood concordance and discordance. Thanks

  6. For a good model what should be the concordance?

  7. Very informative, clear, and to the point

  8. Very good explanation and informative. Thanks Buddy keep sharing

  9. Can you please give the calculation of concordance and disconcordance in excel format with example which will be easy to understand the calculation.

  10. The above codes are very useful. Any suggestions for weighted data?

  11. Hello, I want to know, what to do in cases where tied percentage is high, say 20%. How to reduce tied percentage?

  12. Excellent Work. Thanks for such detailed description.

  13. Thorough and very useful. However can you let me know how to derive the equation: AUC = (Percent Concordant + 0.5 * Percent Tied)/100. Basically I want to know the steps to get the above equation

