# Steps to Build Logistic Regression in Excel

This tutorial explains the steps to build a logistic regression model in Excel using a practical example. It also covers how to calculate the following metrics of logistic regression in Excel.

• Coefficient Table : Coefficients, P-Value, Standardized Coefficients, Odds Ratio etc.
• Model Performance Metrics : AUC, AIC, Confusion Matrix etc.
• Scoring or predicting outcomes for new data

Logistic regression is a statistical technique used for binary classification tasks. It is used when dependent variable has two possible values such as 0 and 1. For example : success/failure, attrition/retention etc.

Scenario

Suppose you are asked to perform logistic regression analysis in Excel to check whether a student will be admitted or not into graduate school based on their GRE and GPA grades along with the ranking of the undergraduate school they attended.

Sample Data

The raw data looks like the image shown below. We have 397 observations and 5 independent variables in our dataset. Data are in range A2:F398 in Excel. The column `admit` is a binary dependent variable in this case and the remaining columns contain data for independent variables.

## Step 1 : Setting Initial Coefficients

In this step, we are creating a coefficient table for independent variables and intercept. We are setting each coefficient to 0.001. Later we will be adjusting them to maximize the log likelihood.

1. Enter intercept and independent variable names in range M2:M7.
2. Enter initial values for coefficients as 0.001 in range N2:N7.

## Step 2 : Calculations of Logistic Regression

1. Logit

It is linear combination of the independent variables weighted by their coefficients. To calculate logit, we need to multiply coefficients with values of independent variables and then sum them.

In Excel, enter the following formula in cell G2 and then paste it down to the last row i.e cell G398.

`=\$N\$2+B2*\$N\$3+C2*\$N\$4+D2*\$N\$5+E2*\$N\$6+F2*\$N\$7`

Alternative Method : If you have a lot of independent variables, it is inefficient to write out the lengthy formula above. Alternatively, you can use the SUMPRODUCT formula.

`=\$N\$2+SUMPRODUCT(B2:F2,TRANSPOSE(\$N\$3:\$N\$7))`
Please note that the above formula is an array formula. If you are using versions of Excel prior to Office 365, you have to press CTRL + SHIFT + ENTER to confirm it as an array formula instead of just pressing ENTER. Otherwise you will get #VALUE error.
2. Exponential value of logit

In this step, we simply take exponential value of the logit using the Excel's EXP() function. In cell H2, enter the following formula.

`=EXP(G2)`
3. Probability of dependent variable is 1

The probability that the dependent variable y is 1 is calculated as:

prob(y=1) = exp(logit) / (1 + exp(logit))

In cell I2, enter the following formula and then paste it down to cell I398.

`=H2/(1+H2)`
4. Predicted Probabilities

In this step, we are calculating the probability of the dependent variable (y) is either 0 or 1.

prob(y=0) = 1 - prob(y=1)

In cell J2, enter the following formula and then paste it down to cell J398.

`=IF(A2=0,1-I2,I2)`
5. Log Likelihood

To calculate the log likelihood, we need to take the natural log of predicted probabilities.

In cell K2, enter the following formula to calculate the log likelihood and then paste it down to cell K398.

`=LN(J2)`

Next step is to sum the log likelihood for each observation. In the last row of the column i.e. cell K399, sum this column.

`=SUM(K2:K398)`

## Step 3 : Maximizing Log Likelihood

In this step, we need to adjust the coefficients to maximize the log likelihood.

Excel's solver add-in will be used to optimizate the likelihood function.

Installing Solver Instructions : Go to File > Options > Add-Ins. In the Manage box, select Excel Add-ins and click "Go". Check Solver Add-in and click "OK".

• Go to the "Data" tab and click on "Solver".
• Set the objective function to the cell K399 containing the sum of the log-likelihood.
• Select "Max" in the "To:" section.
• By Changing Variable Cells: Select cells N2:N7 containing coefficients (including intercept).
• Make Unconstrained Variables Non-Negative: Uncheck this option.
• Choose the Solver engine GRG Nonlinear.
• Click "Solve" to run the optimization.

The final logistic regression coefficients and predicted probabilities are shown below.

How to Predict Outcomes for New Data

Suppose you have a new data which was not used to build the logistic regression model. You need to estimate the predicted probability for the new data using the model you built in the previous steps.

gregparank1rank2rank3
6503.67001

All you need to do is drag the formula down (as shown in the previous steps) to the new data to calculate prob(y=1). Refer to the image below.

Standard Error

The standard error of a coefficient in logistic regression explains how reliable coefficient estimates are.

To calculate the standard error of coefficients, run the VBA code below.

``````
Sub StandardErrors()
Dim XData As String
Dim MatrixSize As Long
Dim CoeffData As String
Dim outputStartingCell As String
Dim i As Long, j As Long
Dim Hessian() As Double
Dim ws As Worksheet

' Set input values
Set ws = ThisWorkbook.Sheets("Sheet1") ' Change sheet name if needed
XData = "B2:F398"          ' Range for independent variables
CoeffData = "N2:N7"       ' Range for coefficients (incld. intercept)
outputStartingCell = "O2"  ' Starting cell where you want to populate standard errors (output)

' Initialize the Hessian matrix
MatrixSize = Range(XData).Columns.Count + 1 ' Include intercept
ReDim Hessian(1 To MatrixSize, 1 To MatrixSize)

Dim Coeff() As Double
Dim rng As Range
Set rng = ws.Range(CoeffData) ' Convert the string to a range
ReDim Coeff(1 To rng.Cells.Count)
i = 1
For Each cell In rng
Coeff(i) = cell.Value
i = i + 1
Next cell

Dim X() As Double
Dim inputRange As Range
Set inputRange = ws.Range(XData)
n = inputRange.Rows.Count

ReDim X(1 To n, 1 To MatrixSize)
Dim rowIndex As Long, colIndex As Long
For rowIndex = 1 To n
X(rowIndex, 1) = 1 ' Intercept
For colIndex = 1 To MatrixSize - 1
X(rowIndex, colIndex + 1) = inputRange.Cells(rowIndex, colIndex).Value
Next colIndex
Next rowIndex

' Calculate the Hessian matrix
Dim p_hat() As Double
ReDim p_hat(1 To n)
For rowIndex = 1 To n
Dim Xb As Double
Xb = 0
For j = 1 To MatrixSize
Xb = Xb + X(rowIndex, j) * Coeff(j)
Next j
p_hat(rowIndex) = Exp(Xb) / (1 + Exp(Xb)) ' Predicted probabilities
Next rowIndex

' Calculate the Hessian elements
For i = 1 To MatrixSize
For j = 1 To MatrixSize
Dim sum_term As Double
sum_term = 0
For rowIndex = 1 To n
sum_term = sum_term + X(rowIndex, i) * X(rowIndex, j) * p_hat(rowIndex) * (1 - p_hat(rowIndex))
Next rowIndex
Hessian(i, j) = sum_term
Next j
Next i

' Invert the Hessian matrix
Dim invHessian() As Double
invHessian = InverseMatrix(Hessian)

' Calculate standard errors
Dim se() As Double
ReDim se(1 To MatrixSize)
For i = 1 To MatrixSize
se(i) = Sqr(invHessian(i, i))
Next i

' Output standard errors to the worksheet
ws.Range(outputStartingCell).Value = se(1) ' Intercept
For i = 2 To MatrixSize
ws.Range(outputStartingCell).Offset(i - 1, 0).Value = se(i)
Next i

End Sub

Function InverseMatrix(mat() As Double) As Variant
' This function computes the inverse of a matrix using the Gaussian elimination method
' Note: This function assumes that the input matrix is square and invertible

Dim n As Long
n = UBound(mat, 1)

Dim augmented() As Double
ReDim augmented(1 To n, 1 To 2 * n)

Dim i As Long, j As Long, k As Long
Dim factor As Double

' Create augmented matrix [mat | I]
For i = 1 To n
For j = 1 To n
augmented(i, j) = mat(i, j)
augmented(i, j + n) = IIf(i = j, 1, 0)
Next j
Next i

' Perform Gaussian elimination
For i = 1 To n
' Make the diagonal element 1
factor = augmented(i, i)
For j = 1 To 2 * n
augmented(i, j) = augmented(i, j) / factor
Next j

' Make the other elements in the column 0
For k = 1 To n
If k <> i Then
factor = augmented(k, i)
For j = 1 To 2 * n
augmented(k, j) = augmented(k, j) - factor * augmented(i, j)
Next j
End If
Next k
Next i

' Extract the inverse matrix
Dim invMat() As Double
ReDim invMat(1 To n, 1 To n)

For i = 1 To n
For j = 1 To n
invMat(i, j) = augmented(i, j + n)
Next j
Next i

InverseMatrix = invMat
End Function
```
```
How to Run Macro
1. In Excel, open the VBA editor by pressing Alt + F11 keyboard shortcut key.
2. Select Insert > Module to create a module.
3. Paste the above macro code in the module.
4. Close the VBA editor by clicking the 'X' in the top-right corner.
5. Run the macro by pressing Alt + F8 and select "StandardErrors" macro
How to Modify Code
• `Set ws = ThisWorkbook.Sheets("Sheet1")`: Change "Sheet1" to your desired sheet name.
• `XData = "B2:F398"`: Specifies the range for independent variables. Modify "B2:F398" to match your data range.
• `CoeffData = "N2:N7"`: Defines the range for coefficients including the intercept. Adjust "N2:N7" to the range for your coefficients.
• `outputStartingCell = "O2"`: Sets the starting cell for outputting results. Change "O2" to the cell where you want to begin outputting data.
P-Value and Z-Statistic

The z-statistic measures how many standard deviations a coefficient estimate is away from 0. A small p-value less than 0.05 means the coefficient of independent variable is significantly different from 0.

`z-statistic = coefficient / standard error`

To calculate z-statistic, enter the following formula in cell P2. Cell N2 contains coefficients and cell O2 contains standard error.

`=N2 / O2`

To calculate p-value, enter the following formula in cell Q2.

`=2*(1-NORM.S.DIST(ABS(N2)/O2,1))`
Wald Chi-Square

To calculate wald chi-square, all we need to do is to take square of z-statistics. In cell R2, enter the following formula.

`=P2^2`
Confidence Intervals

The formula for calculating confidence intervals for a coefficient is as follows:

`CI = coefficient ± z * Standard Error`

Here z is the critical value. It is 1.96 for a 95% confidence interval

In cell S2, enter the following formula for lower confidence interval.

`=N2 - 1.96*O2`

In cell T2, enter the following formula for upper confidence interval.

`=N2 + 1.96*O2`
Standardized Coefficients

Standardized Coefficients help in understanding the relative importance of each independent variable. Variable with the largest absolute value of a standardized coefficient means the most important variable.

The formula for calculating confidence intervals for a coefficient is as follows:

`Standardized Coefficient = (β * SDX) / (π / √3)`

SDX is a standard deviation of an independent variable.

In cell U3, enter the following formula. Please note that it is not entered against intercept. Paste the formula down to the cells below to get standardized coefficients for each independent variable. In this formula, cell N3 refers to unstandardized coefficient; range B2:B398 refers to raw values of variable. By using the OFFSET function, this formula became dynamic and automatically changing the range of variables as it's pasted down.

`=(N3*STDEV.S(OFFSET(\$B\$2:\$B\$398,,ROW(A1)-1))/(PI()/SQRT(3)))`
Odds Ratio

Odds ratio is exponential value of a coefficient. It means how the odds of the dependent variable change with each unit change in the independent variable.

In cell V3, enter the following formula to calculate odds ratio in Excel.

`=EXP(N3)`
Model Performance Metrics

The main model performance metrics of logistic regression are Area Under Curve (AUC) of the ROC Curve, AIC and Hosmer-Lemeshow (HL) Test.

Area Under Curve (AUC)

AUC measures how well the model can differentiate between events (1) and non-events (0).

Insert the following VBA code by following the first four steps shown in this section of article.

```Function AUROC(predictions As Range, actuals As Range) As Double
Dim n As Long
Dim tpr() As Double, fpr() As Double
Dim sorted_indices() As Long
Dim i As Long

' Check if the predictions and actuals ranges have the same number of elements
If predictions.Count <> actuals.Count Then
MsgBox "Predictions and actuals must have the same number of elements."
AUROC = -1
Exit Function
End If

n = predictions.Count
ReDim tpr(0 To n + 1), fpr(0 To n + 1)

' Sort predictions and actuals in descending order of predictions
sorted_indices = SortIndicesDescending(predictions)

Dim num_pos As Long, num_neg As Long
num_pos = WorksheetFunction.CountIf(actuals, 1)
num_neg = n - num_pos

If num_pos = 0 Or num_neg = 0 Then
MsgBox "There must be both positive and negative actual values."
AUROC = -1
Exit Function
End If

Dim tp_count As Long, fp_count As Long
tp_count = 0
fp_count = 0

' Calculate TPR and FPR for each threshold
For i = 1 To n
If actuals.Cells(sorted_indices(i)).Value = 1 Then
tp_count = tp_count + 1
Else
fp_count = fp_count + 1
End If

tpr(i) = tp_count / num_pos
fpr(i) = fp_count / num_neg
Next i

' Append (0,0) and (1,1) to the ROC curve
tpr(0) = 0
tpr(n + 1) = 1
fpr(0) = 0
fpr(n + 1) = 1

' Calculate AUC from the ROC curve using trapezoidal rule
Dim auc As Double
auc = 0
For i = 1 To n
auc = auc + (tpr(i) + tpr(i - 1)) * (fpr(i - 1) - fpr(i)) / 2
Next i

' Ensure AUC is non-negative
AUROC = Abs(auc)
End Function

Function SortIndicesDescending(predictions As Range) As Variant
Dim i As Long, j As Long
Dim temp As Double
Dim indices() As Long
Dim sorted_predictions() As Double

Dim n As Long
n = predictions.Count
ReDim indices(1 To n)
ReDim sorted_predictions(1 To n)

For i = 1 To n
indices(i) = i
sorted_predictions(i) = predictions.Cells(i).Value
Next i

' Simple bubble sort
For i = 1 To n - 1
For j = i + 1 To n
If sorted_predictions(i) < sorted_predictions(j) Then
' Swap predictions
temp = sorted_predictions(i)
sorted_predictions(i) = sorted_predictions(j)
sorted_predictions(j) = temp

' Swap indices
temp = indices(i)
indices(i) = indices(j)
indices(j) = temp
End If
Next j
Next i

SortIndicesDescending = indices
End Function
```

After inserting the above code into the module, you can use the function `=AUROC(prob(y=1),dependent variable)` in any cell within your Excel workbook.

`=AUROC(I2:I398,A2:A398)`
Akaike Information Criterion (AIC)

AUC is used for model selection. Lower AIC values means better-fitting models. It penalizes models for their complexity and favors simpler models that explain the data well.

`AIC = −2×log-likelihood+2×(Number of Independent Variables+1)`

Cells K2:K398 refers to range of log-likelihood column. Cells B1:F1 refers to headers of independent variables.

`=-2*SUM(K2:K398)+2*(COUNTA(B1:F1)+1)`
Confusion Matrix

A confusion matrix is used to evaluate the performance of a classification model. It compares the actual values with the model's predictions. You can refer to this tutorial - Confusion Matrix in Excel

Hosmer-Lemeshow (HL) Test

The Hosmer-Lemeshow (HL) test is used to check the goodness of fit of a logistic regression model. It explains how well the predicted probabilities from the model match the actual outcomes.
Calculating Hosmer-Lemeshow (HL) Test in Excel

Related Posts