# How to Calculate AUC (Area Under Curve) in R

In this article, we will cover how to calculate the AUC (Area Under Curve) in R.

## What is Area Under Curve?

The Area Under Curve (AUC) is a metric used to evaluate the performance of a binary classification model. It measures the ability of a model to distinguish between events and non-events.

The ROC (Receiver Operating Characteristic) curve is created by plotting the true positive rate (sensitivity) against the false positive rate (1 − specificity) at various classification thresholds; the AUC is the area under this curve. The true positive rate is the proportion of events correctly classified as events, and the false positive rate is the proportion of non-events incorrectly classified as events.
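As a small illustration, the true and false positive rates at a single threshold can be computed in base R. The labels and probabilities below are made up for this sketch:

```
# Hypothetical predicted probabilities and true labels (1 = event)
labels <- c(1, 1, 1, 0, 0)
probs  <- c(0.90, 0.60, 0.30, 0.40, 0.20)

# Classify as an event when the probability exceeds a chosen threshold
threshold <- 0.5
pred_pos  <- probs > threshold

tpr <- sum(pred_pos & labels == 1) / sum(labels == 1)  # true positive rate
fpr <- sum(pred_pos & labels == 0) / sum(labels == 0)  # false positive rate
c(TPR = tpr, FPR = fpr)
```

Sweeping the threshold from 1 down to 0 and recording each (FPR, TPR) pair traces out the ROC curve.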

The AUC ranges from 0 to 1, where:

• AUC = 0.5: The classifier performs no better than random chance.
• AUC > 0.5: The classifier performs better than random chance. A higher AUC value indicates better performance, with 1 representing a perfect classifier.
• AUC < 0.5: The classifier performs worse than random chance; its predictions are systematically inverted.
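Equivalently, the AUC is the probability that a randomly chosen event receives a higher score than a randomly chosen non-event. A minimal rank-based sketch of this in base R, using made-up scores and labels:

```
# Hypothetical scores and true labels (1 = event)
labels <- c(0, 0, 1, 1)
scores <- c(0.10, 0.40, 0.35, 0.80)

n_pos <- sum(labels == 1)
n_neg <- sum(labels == 0)

# Sum of the events' ranks, minus its minimum possible value,
# divided by the number of event/non-event pairs
auc <- (sum(rank(scores)[labels == 1]) - n_pos * (n_pos + 1) / 2) /
  (n_pos * n_neg)
auc  # 0.75: 3 of the 4 event/non-event pairs are ranked correctly
```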

We need to have two R packages, `ISLR` and `ROCR`, installed as prerequisites. If they are not already installed, you can install them using the following commands:

```
install.packages("ISLR")
install.packages("ROCR")
```

The `performance()` function from the ROCR package is used to calculate the Area Under the Curve (AUC) as a performance metric for the model. The following R code builds a logistic regression model for binary classification on the "Default" dataset from the "ISLR" package and then calculates the AUC.

```
library(ISLR)
library(ROCR)

# Load a binary classification dataset from ISLR package
mydata <- ISLR::Default

# Set seed
set.seed(1234)

# 70% of dataset goes to training data and remaining 30% to test data
train_idx  <- sample(c(TRUE, FALSE), nrow(mydata), replace=TRUE, prob=c(0.7,0.3))
train <- mydata[train_idx, ]
test <- mydata[!train_idx, ]

# Build logistic regression model
model <- glm(default~., family="binomial", data=train)

# Calculate predicted probability of default of test data
predicted <- predict(model, test, type="response")

# Storing Model Performance Scores
pred  <- prediction(predicted, test$default)

# Calculating Area under Curve
perf <- performance(pred,"auc")
auc <- as.numeric(perf@y.values)
auc
```

Result: 0.9466106

## How Does the Above Code Work?

1. The "Default" dataset is loaded from the "ISLR" package.
2. A seed is set using `set.seed(1234)` to ensure reproducibility, meaning the same split (and hence the same output) is generated on every run.
3. The dataset is split into a training set (70%) and a test set (30%) using random sampling.
4. A logistic regression model is built using the `glm` function, where "default" is the binary dependent variable, and the rest of the variables are used as independent variables.
5. The `predict` function is used to calculate the predicted probabilities of default for the test data based on the logistic regression model.
6. The "ROCR" package is used to create a prediction object (pred) based on the predicted probabilities and the true default values from the test set.
7. The performance of the model is evaluated by calculating the Area Under the Curve (AUC) using the `performance` function from "ROCR."
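As a sanity check on what `performance(pred, "auc")` computes, the same area can be obtained with the trapezoidal rule over the ROC points. The sketch below is self-contained, using made-up scores and labels rather than the Default model:

```
# Hypothetical scores and true labels (1 = event)
labels <- c(1, 0, 1, 0)
scores <- c(0.80, 0.40, 0.35, 0.10)

# Sort by decreasing score and accumulate TPR/FPR, starting from (0, 0)
ord <- order(scores, decreasing = TRUE)
lab <- labels[ord]
tpr <- c(0, cumsum(lab) / sum(lab))
fpr <- c(0, cumsum(1 - lab) / sum(1 - lab))

# Trapezoidal rule: segment width times average segment height
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc  # 0.75
```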
## Plot ROC Curve in R

Let's see how we can plot the ROC curve in R. In the following code, we first compute the ROC curve using the `performance()` function with "tpr" (True Positive Rate, or sensitivity) and "fpr" (False Positive Rate) as arguments. Then, we use the `plot()` function to draw the ROC curve. The `abline()` function draws the diagonal line from (0, 0) to (1, 1), representing the ROC curve of a random classifier.

```
# Plot ROC curve
roc_curve <- performance(pred, "tpr", "fpr")
plot(roc_curve, col = "blue", main = "ROC Curve", lwd = 2)
abline(0, 1, col = "gray", lty = 2, lwd = 1)
text(0.5, 0.3, paste("AUC =", round(auc, 2)), adj = c(0.5, 0.5), col = "black", cex = 1.5)
```