Precision Recall Curve Simplified

Deepanshu Bhalla
This article explains the precision-recall curve and how it is used in real-world data science applications. It covers how the curve differs from the ROC curve, highlights a limitation of the ROC curve and how the area under the precision-recall curve addresses it, and shows how to compute the area under the precision-recall curve in Python, R and SAS.

What is Precision Recall Curve?

Before getting into technical details, we first need to understand precision and recall in layman's terms. It is essential to grasp these concepts in simple words so that you can recall them later when needed. Both precision and recall are important metrics for checking the performance of a binary classification model.

Precision

Precision is also called Positive Predictive Value. Suppose you are building a customer attrition model whose objective is to identify customers who are likely to close their relationship with the company. The model is used to prevent attrition and boost customer profitability. It is a binary classification problem in which the dependent variable is binary in nature: it takes only two values, 0 or 1. 1 refers to customers who left us, and 0 refers to active customers who are still with us. In this case, precision is the proportion of customers our predictive model flags as attritors who actually left us (attrited).

Let's understand it with a confusion matrix.

Precision = True Positive / (True Positive + False Positive)
  • True Positive : Number of customers who actually attrited whom we correctly predicted as attritors.
  • False Positive : Number of customers who actually did not attrite whom we incorrectly predicted as attritors.

Recall

Recall is also called Sensitivity. It tells us the proportion of customers who actually left us (attrited) whom we predicted as attritors.
Recall = True Positive / (True Positive + False Negative)
  • True Positive : Number of customers who actually attrited whom we correctly predicted as attritors.
  • False Negative: Number of customers who actually attrited whom we incorrectly predicted as non-attritors.

Difference between Precision and Recall

The main difference is the denominator. Precision includes False Positives in its denominator, whereas Recall includes False Negatives.

Let's say there are in total 160 customers. Out of 160, 80 customers actually left us (attrited). Out of 80, we correctly predicted 70 of them as attritors. Recall would be 87.5% (70 divided by 80).

Out of total 160 customers, we predicted 130 customers as attritors. Out of 130, we correctly predicted 70 of them as attritors. Precision would be 53.8% (70 divided by 130).
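A minimal sketch of this arithmetic in Python, using the counts from the example above:

# Worked example: 160 customers, 80 actual attritors,
# 130 predicted attritors, 70 of them predicted correctly (true positives)
true_positive  = 70
false_negative = 80 - 70    # actual attritors we missed
false_positive = 130 - 70   # predicted attritors who did not actually attrite

recall    = true_positive / (true_positive + false_negative)
precision = true_positive / (true_positive + false_positive)
print(round(recall, 3), round(precision, 3))   # 0.875 0.538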

Suppose your wife asked you about the dates of 4 important events - your wedding anniversary, her birthday, and your mother-in-law's and father-in-law's birthdays. You were able to recall all 4 dates, but it took you 8 attempts in total. Your recall score is 100%, but your precision score is only 50% (4 divided by 8).

Precision Recall Curve Demystified

The precision-recall curve is a popular way to evaluate the performance of a binary classification model. The x-axis shows recall and the y-axis shows precision.

Step 1: Calculate recall and precision values from multiple confusion matrices for different cut-offs (thresholds). Let's say the cut-off is 0.5, which means every customer with a probability score greater than 0.5 is considered an attritor. For Prob(Attrition) > 0.5, you calculate recall and precision from the True Positives, True Negatives, False Positives and False Negatives. Similarly, you calculate them for the remaining thresholds.
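To illustrate this step, here is a short Python sketch (not from the original article) that sweeps a few cut-offs over hypothetical predicted probabilities y_prob and true labels y_true; the numbers it prints are illustrative only, not the values in the table below.

import numpy as np

# Hypothetical labels and predicted probabilities, for illustration only
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_prob = np.clip(0.3 * y_true + 0.7 * rng.random(1000), 0, 1)

for cutoff in [0.9, 0.75, 0.6, 0.5, 0.4]:
    y_pred = (y_prob > cutoff).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    recall = tp / (tp + fn)
    print(f"cutoff={cutoff:.2f}  recall={recall:.2f}  precision={precision:.2f}")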

Cutoff          0.90    0.75    0.60    0.50    0.40
Recall (X)      0.12    0.39    0.67    0.85    0.90
Precision (Y)   0.90    0.84    0.83    0.83    0.50
Let's plot these 5 data points to create a precision-recall curve.
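Here is a small matplotlib sketch that plots these five points (an illustrative stand-in for the article's chart):

import matplotlib.pyplot as plt

recall    = [0.12, 0.39, 0.67, 0.85, 0.90]
precision = [0.90, 0.84, 0.83, 0.83, 0.50]

plt.plot(recall, precision, marker="o")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.ylim(0, 1)
plt.show()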

How to calculate area under precision recall curve mathematically?

Once we have precision and recall values for different thresholds, we can calculate the area under the curve using the Trapezoidal Rule of numerical integration. In simple words, it adds up the areas of all the trapezoids under the curve. Make sure the recall values are sorted before applying the Trapezoidal Rule.

Note: The trapezoidal rule is not the most accurate way of calculating integrals; Simpson's rule is more accurate.

Follow the calculations shown in the image below -
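If you want to verify the arithmetic programmatically, NumPy's built-in trapezoidal integration gives the same result on these five points (a quick cross-check; np.trapz is named np.trapezoid in NumPy 2.0 and later):

import numpy as np

recall    = np.array([0.12, 0.39, 0.67, 0.85, 0.90])   # already sorted by recall
precision = np.array([0.90, 0.84, 0.83, 0.83, 0.50])

# Trapezoidal rule: sum of (recall[i+1] - recall[i]) * (precision[i+1] + precision[i]) / 2
auprc = np.trapz(precision, recall)
print(auprc)   # 0.65135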

ROC Curve vs Area under Precision Recall Curve (AUPRC)

The ROC curve shows the trade-off between the True Positive Rate and the False Positive Rate at different probability cut-offs, whereas the precision-recall curve represents a different trade-off: the one between the true positive rate (recall) and the positive predictive value (precision).

ROC Curve

Advantage of using AUPRC over ROC
ROC curves can be misleading in rare-event problems (also known as imbalanced data), where the percentage of non-events is significantly higher than the percentage of events. For example, suppose you are building a machine learning model that identifies cancer patients. In general, the percentage of cancer patients is very small compared to non-cancer patients, so the percentage of events is very low (less than 1%). Here a True Positive means a patient who had cancer and whom we correctly diagnosed as having cancer. In this type of rare-event problem, the ROC score can be misleading, and AUPRC is a better choice for assessing model performance.
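To see this effect in practice, here is a hedged illustration (not part of the original article): we fit a simple model on synthetic data with roughly 1% events and compare ROC AUC against average precision, a common way to approximate AUPRC.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

# Synthetic, highly imbalanced data (~1% events)
X, y = make_classification(n_samples=20000, n_features=20, weights=[0.99, 0.01],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

print("ROC AUC           :", round(roc_auc_score(y_test, scores), 3))
print("Average precision :", round(average_precision_score(y_test, scores), 3))
print("Event rate        :", round(y_test.mean(), 3))   # baseline AUPRC of a random model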

R : Area under precision recall curve

Let's see how to compute the area under the curve using the Trapezoidal Rule.
recall=c(0.12, 0.39, 0.67, 0.85, 0.90)
precision=c(0.90, 0.84, 0.83, 0.83, 0.50)
i = 2:length(recall)
recall    = recall[i] - recall[i-1]
precision = precision[i] + precision[i-1]
(AUPRC = sum(recall * precision)/2)

Output
0.65135
Let's build a logistic regression model on slightly imbalanced data. It has a target variable named admit.

Let's import the dataset. To make it imbalanced for demonstration purposes, we remove some 1s from the target variable. The event rate is now 15%.

df = read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
df = df[!(df$gre>=600 & df$gre>=3.5 & df$admit==1),]
mean(df$admit)
# Factor Variables
df$admit = as.factor(df$admit)
df$rank  = as.factor(df$rank)

# Logistic Model
df$rank <- relevel(df$rank, ref='4')
mylogistic <- glm(admit ~ ., data = df, family = "binomial")
summary(mylogistic)$coefficient

# Predict
pred = predict(mylogistic, type = "response")
finaldata = cbind(df, pred)

# Store precision and recall scores at different cutoffs
library(ROCR)
predobj <- prediction(finaldata$pred, finaldata$admit)
perf <- performance(predobj,"prec", "rec")
plot(perf)

# Trapezoidal rule of integration
x = perf@x.values[[1]]
y = perf@y.values[[1]]

idx = 2:length(x)
testdf=data.frame(recall = (x[idx] - x[idx-1]), precision = (y[idx] + y[idx-1]))

# Ignore NAs
testdf = subset(testdf, !is.na(testdf$precision))
(AUPRC = sum(testdf$recall * testdf$precision)/2)

# ROC Curve
(AUROC <- performance(predobj,"auc")@y.values)
Findings: The AUROC score is more than 0.75, whereas the AUPRC is only 0.3315.

Python : Area under precision recall curve

Calculate the area under the curve using recall and precision scores at different probability thresholds.
import numpy as np
recall    = np.array([0.12, 0.39, 0.67, 0.85, 0.90])
precision = np.array([0.90, 0.84, 0.83, 0.83, 0.50])
i = np.arange(1, len(recall))
re = recall[i] - recall[i-1]
pr = precision[i] + precision[i-1]

#Multiply re and pr lists and then take sum and divide by 2
np.sum(re * pr)/2
Here we replicate the R program in Python to build the logistic regression model.
#Load Required Libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from patsy import dmatrices, Treatment
from sklearn.metrics import precision_recall_curve, auc

#Read CSV File
df = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
df = df[~((df.gre>=600) & (df.gre>=3.5) & (df.admit==1))]

#Set Reference Group
y, X = dmatrices('admit ~ gre + gpa + C(rank, Treatment(reference=4))', df, return_type = 'dataframe')

#Sklearn
model = LogisticRegression(fit_intercept = False, C = 1e16,  solver = "newton-cg", max_iter=10000000)
y = np.ravel(y)
model = model.fit(X, y)
model.coef_

#Combining coeff and variable names
coef_dict = {}
for coef, feat in zip(model.coef_[0,:],X.columns):
    coef_dict[feat] = coef

#Probability estimates
y_scores=model.predict_proba(X)
precision, recall, thresholds = precision_recall_curve(y, y_scores[:,1])

#Area under precision recall curve
auc(recall, precision)

#Sort Recall scores
ct = np.column_stack([recall, precision])
ct= ct[ct[:, 0].argsort()]
recall    = ct[:,0]
precision = ct[:,1]

#Integration
i =  np.array(range(1,len(recall)))
re = recall[i] - recall[i-1]
pr = precision[i] + precision[i-1]

#Multiply re and pr lists and then take sum and divide by 2
print(np.sum(re * pr)/2)

SAS : Area under precision recall curve

With the use of PROC IML, we can calculate area under curve.
proc iml;
recall = {0.12, 0.39, 0.67, 0.85, 0.90}; 
precision = {0.90, 0.84, 0.83, 0.83, 0.50};

N  = 2 : nrow(recall);
re = recall[N] - recall[N-1];
pr = precision[N] + precision[N-1];
AUPRC= re`*pr/2;
print AUPRC;
In this section, we build a logistic regression model and then calculate the area under the precision-recall curve in SAS.
FILENAME PROBLY TEMP;
PROC HTTP
 URL="https://stats.idre.ucla.edu/stat/data/binary.csv"
 METHOD="GET"
 OUT=PROBLY;
RUN;

OPTIONS VALIDVARNAME=ANY;
PROC IMPORT
  FILE=PROBLY
  OUT=WORK.binary REPLACE
  DBMS=CSV;
RUN;

proc sql;
create table binary2 as 
select * from binary
where not(gre>=600 and gre>=3.5 and admit=1);
quit;

Proc logistic data= WORK.binary2 descending plots(only)=roc;
class rank / param=ref ;
model admit = gre gpa rank / outroc=performance;
output out = estprob p= pred;
run;

/* Precision Recall Curve SAS */
data precision_recall;
set performance;
precision = _POS_/(_POS_ + _FALPOS_);
recall = _POS_/(_POS_ + _FALNEG_);
F_stat = harmean(precision,recall);
run;

proc sort data=precision_recall;
by recall;
run;

proc iml;
use precision_recall;
read all var {recall} into sensitivity;
read all var {precision} into precision;
N  = 2 : nrow(sensitivity);
tpr = sensitivity[N] - sensitivity[N-1];
prec = precision[N] + precision[N-1];
AUPRC = tpr`*prec/2;
print AUPRC;

title1 "Area under Precision Recall Curve";
symbol1 interpol=join value=dot;
proc gplot data=precision_recall;
plot precision*recall /  haxis=0 to 1 by .2
                        vaxis=0 to 1 by .2;
run;
quit;

What is a good area under precision-recall curve score?

There is no rule of thumb for AUPRC that says whether a model is good or bad. If you have 10% events, a random estimator would have an AUPRC of about 0.1 (the event rate). You should evaluate several machine learning algorithms, compare their AUPRC scores, and see how far each score is from the random-estimator baseline. It also depends on the context of the problem. Suppose you are building a model that identifies people who are likely to be terrorists; it is a highly imbalanced problem, and in this case AUPRC is more informative than AUC. It is important to note that a good ROC score does not necessarily mean a good precision-recall score.
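As a quick sanity check (an illustrative sketch, assuming you have your own y_true labels and model scores), you can compare an AUPRC against the event rate, which is roughly the score a random estimator would achieve:

import numpy as np
from sklearn.metrics import average_precision_score

# Hypothetical labels and uninformative random scores, for illustration only
rng = np.random.default_rng(1)
y_true = (rng.random(5000) < 0.10).astype(int)   # ~10% events
random_scores = rng.random(5000)                 # scores from a random (useless) model

print("Random-model AUPRC   :", round(average_precision_score(y_true, random_scores), 3))
print("Baseline (event rate):", round(y_true.mean(), 3))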