Steps to Build Logistic Regression in Excel

Deepanshu Bhalla Add Comment

This tutorial explains the steps to build a logistic regression model in Excel using a practical example. It also covers how to calculate the following metrics of logistic regression in Excel.

  • Coefficient Table : Coefficients, P-Value, Standardized Coefficients, Odds Ratio etc.
  • Model Performance Metrics : AUC, AIC, Confusion Matrix etc.
  • Scoring or predicting outcomes for new data

Logistic regression is a statistical technique used for binary classification tasks. It is used when dependent variable has two possible values such as 0 and 1. For example : success/failure, attrition/retention etc.

Scenario

Suppose you are asked to perform logistic regression analysis in Excel to check whether a student will be admitted or not into graduate school based on their GRE and GPA grades along with the ranking of the undergraduate school they attended.

Sample Data

The raw data looks like the image shown below. We have 397 observations and 5 independent variables in our dataset. Data are in range A2:F398 in Excel. The column admit is a binary dependent variable in this case and the remaining columns contain data for independent variables.

Calculating Logistic Regression in Excel

Step 1 : Setting Initial Coefficients

In this step, we are creating a coefficient table for independent variables and intercept. We are setting each coefficient to 0.001. Later we will be adjusting them to maximize the log likelihood.

  1. Enter intercept and independent variable names in range M2:M7.
  2. Enter initial values for coefficients as 0.001 in range N2:N7.
Setting Coefficients for Logistic Regression in Excel

Step 2 : Calculations of Logistic Regression

1. Logit

It is linear combination of the independent variables weighted by their coefficients. To calculate logit, we need to multiply coefficients with values of independent variables and then sum them.

In Excel, enter the following formula in cell G2 and then paste it down to the last row i.e cell G398.

=$N$2+B2*$N$3+C2*$N$4+D2*$N$5+E2*$N$6+F2*$N$7
Calculating logit in Excel

Alternative Method : If you have a lot of independent variables, it is inefficient to write out the lengthy formula above. Alternatively, you can use the SUMPRODUCT formula.

=$N$2+SUMPRODUCT(B2:F2,TRANSPOSE($N$3:$N$7))
Please note that the above formula is an array formula. If you are using versions of Excel prior to Office 365, you have to press CTRL + SHIFT + ENTER to confirm it as an array formula instead of just pressing ENTER. Otherwise you will get #VALUE error.
2. Exponential value of logit

In this step, we simply take exponential value of the logit using the Excel's EXP() function. In cell H2, enter the following formula.

=EXP(G2)
Exponential value of logit
3. Probability of dependent variable is 1

The probability that the dependent variable y is 1 is calculated as:

prob(y=1) = exp(logit) / (1 + exp(logit))

In cell I2, enter the following formula and then paste it down to cell I398.

=H2/(1+H2)
Excel : Probability of dependent variable is 1
4. Predicted Probabilities

In this step, we are calculating the probability of the dependent variable (y) is either 0 or 1.

prob(y=0) = 1 - prob(y=1)

In cell J2, enter the following formula and then paste it down to cell J398.

=IF(A2=0,1-I2,I2)
Predicted Probabilities
5. Log Likelihood

To calculate the log likelihood, we need to take the natural log of predicted probabilities.

In cell K2, enter the following formula to calculate the log likelihood and then paste it down to cell K398.

=LN(J2)
Log Likelihood in Excel

Next step is to sum the log likelihood for each observation. In the last row of the column i.e. cell K399, sum this column.

=SUM(K2:K398)
sum of log likelihood

Step 3 : Maximizing Log Likelihood

In this step, we need to adjust the coefficients to maximize the log likelihood.

Excel's solver add-in will be used to optimizate the likelihood function.

Installing Solver Instructions : Go to File > Options > Add-Ins. In the Manage box, select Excel Add-ins and click "Go". Check Solver Add-in and click "OK".

After installing the solver add-in, follow the steps below.

  • Go to the "Data" tab and click on "Solver".
  • Set the objective function to the cell K399 containing the sum of the log-likelihood.
  • Select "Max" in the "To:" section.
  • By Changing Variable Cells: Select cells N2:N7 containing coefficients (including intercept).
  • Make Unconstrained Variables Non-Negative: Uncheck this option.
  • Choose the Solver engine GRG Nonlinear.
  • Click "Solve" to run the optimization.
Maximizing Log Likelihood in Excel

The final logistic regression coefficients and predicted probabilities are shown below.

Excel : Logistic Regression Coefficients
How to Predict Outcomes for New Data

Suppose you have a new data which was not used to build the logistic regression model. You need to estimate the predicted probability for the new data using the model you built in the previous steps.

gregparank1rank2rank3
6503.67001

All you need to do is drag the formula down (as shown in the previous steps) to the new data to calculate prob(y=1). Refer to the image below.

Predicting New Data
Standard Error

The standard error of a coefficient in logistic regression explains how reliable coefficient estimates are.

Standard Error in Logistic Regression

To calculate the standard error of coefficients, run the VBA code below.


Sub StandardErrors()
    Dim XData As String
    Dim MatrixSize As Long
    Dim CoeffData As String
    Dim outputStartingCell As String
    Dim i As Long, j As Long
    Dim Hessian() As Double
    Dim ws As Worksheet

    ' Set input values
    Set ws = ThisWorkbook.Sheets("Sheet1") ' Change sheet name if needed
    XData = "B2:F398"          ' Range for independent variables
    CoeffData = "N2:N7"       ' Range for coefficients (incld. intercept)
    outputStartingCell = "O2"  ' Starting cell where you want to populate standard errors (output)

    ' Initialize the Hessian matrix
    MatrixSize = Range(XData).Columns.Count + 1 ' Include intercept
    ReDim Hessian(1 To MatrixSize, 1 To MatrixSize)

    ' Read coefficient values from cells (adjust cell references as needed)
    Dim Coeff() As Double
    Dim rng As Range
    Set rng = ws.Range(CoeffData) ' Convert the string to a range
    ReDim Coeff(1 To rng.Cells.Count)
    i = 1
    For Each cell In rng
        Coeff(i) = cell.Value
        i = i + 1
    Next cell

    ' Read feature data from cells (adjust cell references as needed)
    Dim X() As Double
    Dim inputRange As Range
    Set inputRange = ws.Range(XData)
    n = inputRange.Rows.Count

    ReDim X(1 To n, 1 To MatrixSize)
    Dim rowIndex As Long, colIndex As Long
    For rowIndex = 1 To n
        X(rowIndex, 1) = 1 ' Intercept
        For colIndex = 1 To MatrixSize - 1
            X(rowIndex, colIndex + 1) = inputRange.Cells(rowIndex, colIndex).Value
        Next colIndex
    Next rowIndex

    ' Calculate the Hessian matrix
    Dim p_hat() As Double
    ReDim p_hat(1 To n)
    For rowIndex = 1 To n
        Dim Xb As Double
        Xb = 0
        For j = 1 To MatrixSize
            Xb = Xb + X(rowIndex, j) * Coeff(j)
        Next j
        p_hat(rowIndex) = Exp(Xb) / (1 + Exp(Xb)) ' Predicted probabilities
    Next rowIndex

    ' Calculate the Hessian elements
    For i = 1 To MatrixSize
        For j = 1 To MatrixSize
            Dim sum_term As Double
            sum_term = 0
            For rowIndex = 1 To n
                sum_term = sum_term + X(rowIndex, i) * X(rowIndex, j) * p_hat(rowIndex) * (1 - p_hat(rowIndex))
            Next rowIndex
            Hessian(i, j) = sum_term
        Next j
    Next i

    ' Invert the Hessian matrix
    Dim invHessian() As Double
    invHessian = InverseMatrix(Hessian)

    ' Calculate standard errors
    Dim se() As Double
    ReDim se(1 To MatrixSize)
    For i = 1 To MatrixSize
        se(i) = Sqr(invHessian(i, i))
    Next i

    ' Output standard errors to the worksheet
    ws.Range(outputStartingCell).Value = se(1) ' Intercept
    For i = 2 To MatrixSize
        ws.Range(outputStartingCell).Offset(i - 1, 0).Value = se(i)
    Next i

End Sub


Function InverseMatrix(mat() As Double) As Variant
    ' This function computes the inverse of a matrix using the Gaussian elimination method
    ' Note: This function assumes that the input matrix is square and invertible

    Dim n As Long
    n = UBound(mat, 1)

    Dim augmented() As Double
    ReDim augmented(1 To n, 1 To 2 * n)

    Dim i As Long, j As Long, k As Long
    Dim factor As Double

    ' Create augmented matrix [mat | I]
    For i = 1 To n
        For j = 1 To n
            augmented(i, j) = mat(i, j)
            augmented(i, j + n) = IIf(i = j, 1, 0)
        Next j
    Next i

    ' Perform Gaussian elimination
    For i = 1 To n
        ' Make the diagonal element 1
        factor = augmented(i, i)
        For j = 1 To 2 * n
            augmented(i, j) = augmented(i, j) / factor
        Next j

        ' Make the other elements in the column 0
        For k = 1 To n
            If k <> i Then
                factor = augmented(k, i)
                For j = 1 To 2 * n
                    augmented(k, j) = augmented(k, j) - factor * augmented(i, j)
                Next j
            End If
        Next k
    Next i

    ' Extract the inverse matrix
    Dim invMat() As Double
    ReDim invMat(1 To n, 1 To n)

    For i = 1 To n
        For j = 1 To n
            invMat(i, j) = augmented(i, j + n)
        Next j
    Next i

    InverseMatrix = invMat
End Function

How to Run Macro
  1. In Excel, open the VBA editor by pressing Alt + F11 keyboard shortcut key.
  2. Select Insert > Module to create a module.
  3. Paste the above macro code in the module.
  4. Close the VBA editor by clicking the 'X' in the top-right corner.
  5. Run the macro by pressing Alt + F8 and select "StandardErrors" macro
How to Modify Code
  • Set ws = ThisWorkbook.Sheets("Sheet1"): Change "Sheet1" to your desired sheet name.
  • XData = "B2:F398": Specifies the range for independent variables. Modify "B2:F398" to match your data range.
  • CoeffData = "N2:N7": Defines the range for coefficients including the intercept. Adjust "N2:N7" to the range for your coefficients.
  • outputStartingCell = "O2": Sets the starting cell for outputting results. Change "O2" to the cell where you want to begin outputting data.
P-Value and Z-Statistic

The z-statistic measures how many standard deviations a coefficient estimate is away from 0. A small p-value less than 0.05 means the coefficient of independent variable is significantly different from 0.

z-statistic = coefficient / standard error

To calculate z-statistic, enter the following formula in cell P2. Cell N2 contains coefficients and cell O2 contains standard error.

=N2 / O2

To calculate p-value, enter the following formula in cell Q2.

=2*(1-NORM.S.DIST(ABS(N2)/O2,1))
Wald Chi-Square

To calculate wald chi-square, all we need to do is to take square of z-statistics. In cell R2, enter the following formula.

=P2^2
Confidence Intervals

The formula for calculating confidence intervals for a coefficient is as follows:

CI = coefficient ± z * Standard Error

Here z is the critical value. It is 1.96 for a 95% confidence interval

In cell S2, enter the following formula for lower confidence interval.

=N2 - 1.96*O2

In cell T2, enter the following formula for upper confidence interval.

=N2 + 1.96*O2
Standardized Coefficients

Standardized Coefficients help in understanding the relative importance of each independent variable. Variable with the largest absolute value of a standardized coefficient means the most important variable.

The formula for calculating confidence intervals for a coefficient is as follows:

Standardized Coefficient = (β * SDX) / (π / √3)

SDX is a standard deviation of an independent variable.

In cell U3, enter the following formula. Please note that it is not entered against intercept. Paste the formula down to the cells below to get standardized coefficients for each independent variable. In this formula, cell N3 refers to unstandardized coefficient; range B2:B398 refers to raw values of variable. By using the OFFSET function, this formula became dynamic and automatically changing the range of variables as it's pasted down.

=(N3*STDEV.S(OFFSET($B$2:$B$398,,ROW(A1)-1))/(PI()/SQRT(3)))
Standardized Coefficients in Excel
Odds Ratio

Odds ratio is exponential value of a coefficient. It means how the odds of the dependent variable change with each unit change in the independent variable.

In cell V3, enter the following formula to calculate odds ratio in Excel.

=EXP(N3)
Model Performance Metrics

The main model performance metrics of logistic regression are Area Under Curve (AUC) of the ROC Curve, AIC and Hosmer-Lemeshow (HL) Test.

Area Under Curve (AUC)

AUC measures how well the model can differentiate between events (1) and non-events (0).

Insert the following VBA code by following the first four steps shown in this section of article.

Function AUROC(predictions As Range, actuals As Range) As Double
    Dim n As Long
    Dim tpr() As Double, fpr() As Double
    Dim sorted_indices() As Long
    Dim i As Long
    
    ' Check if the predictions and actuals ranges have the same number of elements
    If predictions.Count <> actuals.Count Then
        MsgBox "Predictions and actuals must have the same number of elements."
        AUROC = -1
        Exit Function
    End If
    
    n = predictions.Count
    ReDim tpr(0 To n + 1), fpr(0 To n + 1)
    
    ' Sort predictions and actuals in descending order of predictions
    sorted_indices = SortIndicesDescending(predictions)
    
    Dim num_pos As Long, num_neg As Long
    num_pos = WorksheetFunction.CountIf(actuals, 1)
    num_neg = n - num_pos
    
    If num_pos = 0 Or num_neg = 0 Then
        MsgBox "There must be both positive and negative actual values."
        AUROC = -1
        Exit Function
    End If
    
    Dim tp_count As Long, fp_count As Long
    tp_count = 0
    fp_count = 0
    
    ' Calculate TPR and FPR for each threshold
    For i = 1 To n
        If actuals.Cells(sorted_indices(i)).Value = 1 Then
            tp_count = tp_count + 1
        Else
            fp_count = fp_count + 1
        End If
        
        tpr(i) = tp_count / num_pos
        fpr(i) = fp_count / num_neg
    Next i
    
    ' Append (0,0) and (1,1) to the ROC curve
    tpr(0) = 0
    tpr(n + 1) = 1
    fpr(0) = 0
    fpr(n + 1) = 1
    
    ' Calculate AUC from the ROC curve using trapezoidal rule
    Dim auc As Double
    auc = 0
    For i = 1 To n
        auc = auc + (tpr(i) + tpr(i - 1)) * (fpr(i - 1) - fpr(i)) / 2
    Next i
    
    ' Ensure AUC is non-negative
    AUROC = Abs(auc)
End Function

Function SortIndicesDescending(predictions As Range) As Variant
    Dim i As Long, j As Long
    Dim temp As Double
    Dim indices() As Long
    Dim sorted_predictions() As Double
    
    Dim n As Long
    n = predictions.Count
    ReDim indices(1 To n)
    ReDim sorted_predictions(1 To n)
    
    For i = 1 To n
        indices(i) = i
        sorted_predictions(i) = predictions.Cells(i).Value
    Next i
    
    ' Simple bubble sort
    For i = 1 To n - 1
        For j = i + 1 To n
            If sorted_predictions(i) < sorted_predictions(j) Then
                ' Swap predictions
                temp = sorted_predictions(i)
                sorted_predictions(i) = sorted_predictions(j)
                sorted_predictions(j) = temp
                
                ' Swap indices
                temp = indices(i)
                indices(i) = indices(j)
                indices(j) = temp
            End If
        Next j
    Next i
    
    SortIndicesDescending = indices
End Function

After inserting the above code into the module, you can use the function =AUROC(prob(y=1),dependent variable) in any cell within your Excel workbook.

=AUROC(I2:I398,A2:A398)
Akaike Information Criterion (AIC)

AUC is used for model selection. Lower AIC values means better-fitting models. It penalizes models for their complexity and favors simpler models that explain the data well.

AIC = −2×log-likelihood+2×(Number of Independent Variables+1)

Cells K2:K398 refers to range of log-likelihood column. Cells B1:F1 refers to headers of independent variables.

=-2*SUM(K2:K398)+2*(COUNTA(B1:F1)+1)
Confusion Matrix

A confusion matrix is used to evaluate the performance of a classification model. It compares the actual values with the model's predictions. You can refer to this tutorial - Confusion Matrix in Excel

Hosmer-Lemeshow (HL) Test

The Hosmer-Lemeshow (HL) test is used to check the goodness of fit of a logistic regression model. It explains how well the predicted probabilities from the model match the actual outcomes.
Calculating Hosmer-Lemeshow (HL) Test in Excel

Related Posts
Spread the Word!
Share
About Author:
Deepanshu Bhalla

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and HR.

Post Comment 0 Response to "Steps to Build Logistic Regression in Excel"
Next → ← Prev