How to Calculate AUC (Area Under Curve) of Training Dataset in R

The following code builds a logistic regression model for binary classification and evaluates its performance on the training data using the AUC metric. It uses the ROCR package to calculate the area under the ROC curve (AUC) for the model.

library(ISLR)
library(ROCR)

# Load a binary classification dataset from ISLR package
mydata <- ISLR::Default

# Set seed for reproducibility
set.seed(1234)

# Randomly assign each row to training with probability 0.7 (roughly a 70/30 split)
train_idx <- sample(c(TRUE, FALSE), nrow(mydata), replace=TRUE, prob=c(0.7,0.3))
train <- mydata[train_idx, ]
test <- mydata[!train_idx, ]

# Build logistic regression model
model <- glm(default~., family="binomial", data=train)

# Calculate predicted probability of default on the training data
# (with no newdata argument, predict() returns the fitted values)
predicted <- predict(model, type="response")

# Storing Model Performance Scores
pred  <- prediction(predicted, train$default)

# Calculating Area under Curve
perf <- performance(pred,"auc")
auc <- as.numeric(perf@y.values)
auc

Result : auc = 0.9522256
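
The split above also creates a test set that the walkthrough never scores. As a hedged sketch, the same ROCR calls can be pointed at the held-out rows by passing newdata = test to predict(). The names test_predicted, test_pred and test_auc are illustrative and do not come from the original code. A training AUC is usually somewhat optimistic, so comparing it with the test AUC is a common sanity check.

```r
library(ISLR)
library(ROCR)

# Recreate the same split and model as in the walkthrough
mydata <- ISLR::Default
set.seed(1234)
train_idx <- sample(c(TRUE, FALSE), nrow(mydata), replace = TRUE, prob = c(0.7, 0.3))
train <- mydata[train_idx, ]
test  <- mydata[!train_idx, ]

model <- glm(default ~ ., family = "binomial", data = train)

# Score the held-out rows: here newdata is required,
# because the model has never seen these observations
test_predicted <- predict(model, newdata = test, type = "response")

# Same ROCR steps as for the training data, with the test labels
test_pred <- prediction(test_predicted, test$default)
test_auc  <- performance(test_pred, "auc")@y.values[[1]]
test_auc
```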

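A common point of confusion when calculating the AUC of a training dataset in R is whether the training data must be passed to the predict function the same way test data is.

Redundant Syntax

predicted <- predict(model, train, type="response")

This tells R to treat the training dataset as a new dataset and score it again. It is not needed: when newdata is omitted, predict() returns the fitted values for the rows the model was trained on. For a model fitted on data without missing values, such as this one, both calls return the same probabilities. The results can differ in length when the training data contains missing values, because rows dropped by glm() during fitting have no fitted values by default.

Simpler Syntax

predicted <- predict(model, type="response")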
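
One way to see how the two calls relate is to run both and compare the results. The sketch below assumes the same split and model as above and a training set with no missing values; fitted_probs, newdata_probs, same_length and same_values are illustrative names. Under those assumptions both checks are expected to come out TRUE.

```r
library(ISLR)

# Recreate the training data and model from the walkthrough
mydata <- ISLR::Default
set.seed(1234)
train_idx <- sample(c(TRUE, FALSE), nrow(mydata), replace = TRUE, prob = c(0.7, 0.3))
train <- mydata[train_idx, ]

model <- glm(default ~ ., family = "binomial", data = train)

# Fitted values stored with the model (no newdata argument)
fitted_probs <- predict(model, type = "response")

# Training data passed explicitly as newdata
newdata_probs <- predict(model, newdata = train, type = "response")

# Compare lengths and values, ignoring element names
same_length <- length(fitted_probs) == length(newdata_probs)
same_values <- isTRUE(all.equal(unname(fitted_probs), unname(newdata_probs)))
same_length
same_values
```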

Steps to Calculate AUC of Training Dataset

Split the dataset into training and test sets, with roughly 70% of the rows going to the training set and the remaining 30% to the test set.
Build a logistic regression model on the training data, with "default" as the response variable and all other variables as predictors.
Calculate the predicted probabilities of default for the training data using the logistic regression model.
Create a prediction object from the predicted probabilities and the true binary labels of the training data.
Calculate the Area Under Curve (AUC) with the performance() function from the ROCR package, setting "auc" as the performance measure.
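
The AUC can also be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, which equals a rescaled Wilcoxon rank-sum statistic. The sketch below is a base R cross-check on a tiny hand-checkable example. It is not how ROCR computes the value internally, and rank_auc, scores, labels and toy_auc are hypothetical names introduced only for illustration.

```r
# Rank-based AUC:
# (sum of ranks of positives - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
rank_auc <- function(scores, labels) {
  stopifnot(length(scores) == length(labels), is.logical(labels))
  r <- rank(scores)        # average ranks, so tied scores count as half a win
  n_pos <- sum(labels)
  n_neg <- sum(!labels)
  (sum(r[labels]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Two negatives and two positives; the positives outrank the
# negatives in 3 of the 4 possible positive/negative pairs
scores <- c(0.1, 0.4, 0.35, 0.8)
labels <- c(FALSE, FALSE, TRUE, TRUE)

toy_auc <- rank_auc(scores, labels)
toy_auc  # 0.75
```

Applied to the walkthrough, rank_auc(predicted, train$default == "Yes") should agree closely with the value returned by performance(pred, "auc"), assuming "Yes" is the positive class.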

About Author:
Deepanshu Bhalla

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and HR.
