This article explains the precision-recall curve and how it is used in real-world data science applications. It covers how the curve differs from the ROC curve, highlights a limitation of the ROC curve, and shows how the area under the precision-recall curve addresses it. It also covers how to compute the area under the precision-recall curve in Python, R and SAS.

## What is Precision Recall Curve?

Before getting into the technical details, we first need to understand precision and recall in layman's terms. Understanding the concepts in simple words makes them easier to remember when you need them later. Both precision and recall are important metrics for checking the performance of a binary classification model.

### Precision

Precision is also called **Positive Predictive Value**. Suppose you are building a customer attrition model whose objective is to identify customers who are likely to end their relationship with the company. The model is used to prevent attrition and boost customer profitability.

It's a binary classification problem: the dependent variable takes only two values, 0 or 1. 1 refers to customers who left us, and 0 refers to active customers who are still with us. In this case, **precision is the proportion of customers our predictive model calls attritors who actually left us (attrited).**

Let's understand it using the confusion matrix.

Precision = True Positive / (True Positive + False Positive)

- True Positive: Number of customers who actually attrited and whom we correctly predicted as attritors.
- False Positive: Number of customers who did not actually attrite but whom we incorrectly predicted as attritors.

### Recall

Recall is also called **Sensitivity**. It tells us the proportion of customers who actually left us (attrited) that we predicted as attritors.

Recall = True Positive / (True Positive + False Negative)

- True Positive: Number of customers who actually attrited and whom we correctly predicted as attritors.
- False Negative: Number of customers who actually attrited but whom we incorrectly predicted as non-attritors.

### Difference between Precision and Recall

The main difference is the denominator: precision includes false positives, whereas recall includes false negatives.

Let's say there are 160 customers in total. Out of 160, 80 customers *actually* left us (attrited). Out of those 80, we *correctly* predicted 70 as attritors. **Recall would be 87.5% (70 divided by 80).**

Out of the total 160 customers, we predicted 130 customers as attritors. Out of those 130, we *correctly* predicted 70 as attritors. **Precision would be 53.8% (70 divided by 130).**
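The arithmetic above can be checked with a short Python sketch. The variable names are chosen here for illustration only; the counts come directly from the example (160 customers, 80 actual attritors, 130 predicted attritors, 70 correctly predicted).

```python
# Counts taken from the worked example above.
total_customers = 160
actual_attritors = 80        # customers who really left
predicted_attritors = 130    # customers the model flagged as attritors
true_positive = 70           # flagged and really left

# The remaining confusion-matrix cells follow from those counts.
false_positive = predicted_attritors - true_positive   # flagged but stayed
false_negative = actual_attritors - true_positive      # left but not flagged
true_negative = (total_customers - true_positive
                 - false_positive - false_negative)    # stayed and not flagged

precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)

print(f"Precision: {precision:.1%}")  # Precision: 53.8%
print(f"Recall:    {recall:.1%}")     # Recall:    87.5%
```

Note that the true negative count never enters either formula, which is why both metrics focus on how well the positive (attrited) class is handled.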

Suppose your wife asks you the dates of 4 important events: your wedding anniversary, her birthday, and your mother-in-law's and father-in-law's birthdays. You recall all 4 dates, but it takes you 8 attempts in total. Your recall score is 100%, but your precision score is 50%, which is 4 divided by 8.

### Precision Recall Curve Demystified

It is a popular performance metric for evaluating a binary classification model. The x-axis shows recall and the y-axis shows precision.

**Step 1:** Calculate recall and precision values from the confusion matrices at different cut-offs (thresholds). Say the cut-off is 0.5, which means every customer with a probability score greater than 0.5 is considered an attritor. For Prob(Attrition) > 0.5, you calculate precision and recall from the true positives, false positives and false negatives (true negatives are not used by either metric). Then you repeat the calculation for the remaining thresholds.

Cutoff | 0.9 | 0.75 | 0.6 | 0.5 | 0.4 |
---|---|---|---|---|---|
Recall (X) | 0.12 | 0.39 | 0.67 | 0.85 | 0.90 |
Precision (Y) | 0.90 | 0.84 | 0.83 | 0.83 | 0.50 |
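The Step 1 procedure can be sketched in plain Python as follows. This is a minimal illustration, not the code that produced the table: the function name `precision_recall_at` and the toy labels and scores are hypothetical, so the printed values will differ from the table above.

```python
def precision_recall_at(cutoff, y_true, y_prob):
    """Return (precision, recall) when scores above `cutoff` are labelled 1."""
    tp = fp = fn = 0
    for actual, prob in zip(y_true, y_prob):
        predicted = 1 if prob > cutoff else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 1:
            fn += 1
    # Each metric is undefined when its denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    recall = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
    return precision, recall


# Hypothetical toy data: 1 = attrited, 0 = active.
y_true = [1, 1, 0, 1, 0, 1, 0, 0]
y_prob = [0.95, 0.80, 0.70, 0.65, 0.45, 0.42, 0.30, 0.10]

# One confusion matrix, and so one (recall, precision) point, per cut-off.
for cutoff in (0.9, 0.75, 0.6, 0.5, 0.4):
    p, r = precision_recall_at(cutoff, y_true, y_prob)
    print(f"cutoff={cutoff:.2f}  recall={r:.2f}  precision={p:.2f}")
```

Lowering the cut-off flags more customers as attritors, so recall can only stay the same or rise, while precision typically falls.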

Let's plot these 5 data points to create a precision-recall curve.

### How to calculate area under precision recall curve mathematically?

Once we have precision and recall values for different thresholds, we can calculate the area under the curve using numerical integration with the trapezoidal rule. In simple words, it adds up the areas of all the trapezoids under the curve. Make sure the recall values are sorted before applying the trapezoidal rule.

**Note:** The trapezoidal rule is not the most accurate way of calculating integrals. Simpson's rule is generally more accurate for smooth curves.

*Follow the calculations shown in the image below -*
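As a worked version of the calculation, here is the trapezoidal rule applied to the five (recall, precision) points from the table above. The symbols $R_i$ and $P_i$ (recall and precision at the $i$-th point, sorted by recall) are notation introduced here for illustration.

$$
\mathrm{AUPRC} \approx \sum_{i=2}^{n} (R_i - R_{i-1})\,\frac{P_i + P_{i-1}}{2}
$$

$$
\begin{aligned}
\mathrm{AUPRC} &\approx \tfrac{1}{2}\big[(0.39-0.12)(0.84+0.90) + (0.67-0.39)(0.83+0.84) \\
&\qquad + (0.85-0.67)(0.83+0.83) + (0.90-0.85)(0.50+0.83)\big] \\
&= \tfrac{1}{2}\,(0.4698 + 0.4676 + 0.2988 + 0.0665) \\
&= \tfrac{1}{2}\,(1.3027) = 0.65135
\end{aligned}
$$

Note that these five points only cover recall from 0.12 to 0.90, so the result is the area over that range rather than over the full 0 to 1 interval.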

## ROC Curve vs Area under Precision Recall Curve (AUPRC)

The ROC curve shows the trade-off between the True Positive Rate and the False Positive Rate across different probability cut-offs. The precision-recall curve, which AUPRC summarizes, represents a different trade-off: the one between the true positive rate (recall) and the positive predictive value (precision).

ROC curves can be misleading in rare-event problems (also called imbalanced data), where the percentage of non-events is significantly higher than that of events. For example, suppose you are building a machine learning model that identifies cancer patients. In general, the percentage of cancer patients is very small compared to non-cancer patients, so the event rate is very low (less than 1%). Here, True Positive means patients who had cancer and whom we correctly diagnosed as having cancer. In this type of rare-event problem, the ROC score can be misleading and AUPRC is a better choice for assessing model performance.
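To see why, here is a small illustrative calculation. It assumes a hypothetical classifier with a fixed true positive rate of 0.90 and false positive rate of 0.05 (numbers chosen for illustration, not taken from any model in this article) and shows how the implied precision changes with the event rate. The function name `precision_from_rates` is also just an illustrative choice.

```python
def precision_from_rates(tpr, fpr, prevalence):
    """Precision implied by a TPR/FPR pair at a given event rate (Bayes' rule)."""
    true_pos = tpr * prevalence           # share of all cases that are true positives
    false_pos = fpr * (1 - prevalence)    # share of all cases that are false positives
    return true_pos / (true_pos + false_pos)


# Same point on the ROC curve, three different event rates.
tpr, fpr = 0.90, 0.05
for prevalence in (0.50, 0.10, 0.01):
    p = precision_from_rates(tpr, fpr, prevalence)
    print(f"event rate={prevalence:.0%}  precision={p:.3f}")
```

The ROC point (TPR 0.90, FPR 0.05) is identical in all three cases, yet precision drops from roughly 0.95 at a 50% event rate to roughly 0.15 at a 1% event rate. The ROC curve does not reflect this drop, while the precision-recall curve does.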

## R : Area under precision recall curve

Let's see how to compute area under curve based on Trapezoidal Rule.

```r
recall = c(0.12, 0.39, 0.67, 0.85, 0.90)
precision = c(0.90, 0.84, 0.83, 0.83, 0.50)

i = 2:length(recall)
recall = recall[i] - recall[i-1]
precision = precision[i] + precision[i-1]
(AUPRC = sum(recall * precision)/2)
```

Output: `0.65135`

Let's build a logistic regression model on slightly imbalanced data. It has a target variable named `admit`.

**Let's import the dataset.** To make it imbalanced for demonstration purposes, we remove some 1s from the target variable. The event rate is now 15%.

```r
df = read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
# Note: the second condition (gre >= 3.5) is redundant given gre >= 600.
# It is left as is so that the results quoted in this article are unchanged.
df = df[!(df$gre>=600 & df$gre>=3.5 & df$admit==1),]
mean(df$admit)
```

```r
# Factor Variables
df$admit = as.factor(df$admit)
df$rank = as.factor(df$rank)

# Logistic Model
df$rank <- relevel(df$rank, ref='4')
mylogistic <- glm(admit ~ ., data = df, family = "binomial")
summary(mylogistic)$coefficient

# Predict
pred = predict(mylogistic, type = "response")
finaldata = cbind(df, pred)

# Store precision and recall scores at different cutoffs
library(ROCR)
predobj <- prediction(finaldata$pred, finaldata$admit)
perf <- performance(predobj,"prec", "rec")
plot(perf)

# Trapezoidal rule of integration
x = perf@x.values[[1]]
y = perf@y.values[[1]]
idx = 2:length(x)
testdf=data.frame(recall = (x[idx] - x[idx-1]), precision = (y[idx] + y[idx-1]))

# Ignore NAs
testdf = subset(testdf, !is.na(testdf$precision))
(AUPRC = sum(testdf$recall * testdf$precision)/2)

# ROC Curve
(AUROC <- performance(predobj,"auc")@y.values)
```

Findings: The AUROC score is more than 0.75, whereas the AUPRC is 0.3315.

## Python : Area under precision recall curve

Calculate the area under the curve using recall and precision scores at different probability thresholds.

```python
import numpy as np

recall = np.array([0.12, 0.39, 0.67, 0.85, 0.90])
precision = np.array([0.90, 0.84, 0.83, 0.83, 0.50])

i = np.arange(1, len(recall))
re = recall[i] - recall[i-1]
pr = precision[i] + precision[i-1]

# Multiply re and pr element-wise, then take the sum and divide by 2
np.sum(re * pr)/2
```

Here we replicate the R program in Python to build the logistic regression model.

```python
# Load Required Libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from patsy import dmatrices, Treatment
from sklearn.metrics import precision_recall_curve, auc

# Read CSV File
df = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
# Note: the second condition (gre >= 3.5) is redundant given gre >= 600.
# It is left as is so that the results match the R program.
df = df[~((df.gre>=600) & (df.gre>=3.5) & (df.admit==1))]

# Set Reference Group
y, X = dmatrices('admit ~ gre + gpa + C(rank, Treatment(reference=4))', df, return_type = 'dataframe')

# Sklearn (a very large C effectively turns off regularization;
# the intercept is already a column in the patsy design matrix)
model = LogisticRegression(fit_intercept = False, C = 1e16, solver = "newton-cg", max_iter=10000000)
y = np.ravel(y)
model = model.fit(X, y)
model.coef_

# Combining coeff and variable names
coef_dict = {}
for coef, feat in zip(model.coef_[0,:],X.columns):
    coef_dict[feat] = coef

# Probability estimates
y_scores=model.predict_proba(X)
precision, recall, thresholds = precision_recall_curve(y, y_scores[:,1])

# Area under precision recall curve
auc(recall, precision)

# Sort Recall scores
ct = np.column_stack([recall, precision])
ct= ct[ct[:, 0].argsort()]
recall = ct[:,0]
precision = ct[:,1]

# Integration
i = np.arange(1, len(recall))
re = recall[i] - recall[i-1]
pr = precision[i] + precision[i-1]

# Multiply re and pr element-wise, then take the sum and divide by 2
print(np.sum(re * pr)/2)
```

## SAS : Area under precision recall curve

With the use of `PROC IML`, we can calculate the area under the curve.

```sas
proc iml;
recall = {0.12, 0.39, 0.67, 0.85, 0.90};
precision = {0.90, 0.84, 0.83, 0.83, 0.50};
N = 2 : nrow(recall);
re = recall[N] - recall[N-1];
pr = precision[N] + precision[N-1];
AUPRC= re`*pr/2;
print AUPRC;
```

In this section, we build a logistic regression model and then calculate the area under the precision-recall curve in SAS.

```sas
FILENAME PROBLY TEMP;
PROC HTTP
  URL="https://stats.idre.ucla.edu/stat/data/binary.csv"
  METHOD="GET"
  OUT=PROBLY;
RUN;

OPTIONS VALIDVARNAME=ANY;
PROC IMPORT
  FILE=PROBLY
  OUT=WORK.binary REPLACE
  DBMS=CSV;
RUN;

/* Note: the second condition (gre>=3.5) is redundant given gre>=600. */
/* It is left as is so that the results match the R program.          */
proc sql;
  create table binary2 as
  select * from binary
  where not(gre>=600 and gre>=3.5 and admit=1);
quit;

Proc logistic data= WORK.binary2 descending plots(only)=roc;
  class rank / param=ref ;
  model admit = gre gpa rank / outroc=performance;
  output out = estprob p= pred;
run;

/* Precision Recall Curve SAS */
data precision_recall;
  set performance;
  precision = _POS_/(_POS_ + _FALPOS_);
  recall = _POS_/(_POS_ + _FALNEG_);
  F_stat = harmean(precision,recall);
run;

/* Recall must be sorted before applying the trapezoidal rule */
proc sort data=precision_recall;
  by recall;
run;

proc iml;
  use precision_recall;
  read all var {recall} into sensitivity;
  read all var {precision} into precision;
  N = 2 : nrow(sensitivity);
  tpr = sensitivity[N] - sensitivity[N-1];
  prec = precision[N] + precision[N-1];
  AUPRC = tpr`*prec/2;
  print AUPRC;

title1 "Area under Precision Recall Curve";
symbol1 interpol=join value=dot;
proc gplot data=precision_recall;
  plot precision*recall / haxis=0 to 1 by .2 vaxis=0 to 1 by .2;
run;
quit;
```
