Multiple Imputation with SAS

This tutorial explains what multiple imputation is and how it works. It also shows how to implement it in SAS and discusses the challenges involved.

Multiple Imputation

Instead of filling in a single value for each missing value, multiple imputation (Rubin 1976, 1987) replaces each missing value with a set of plausible values that represent the uncertainty about the right value to impute. The multiply imputed data sets are then analyzed by using standard procedures for complete data and combining the results from these analyses. No matter which complete-data analysis is used, the process of combining results from different data sets is essentially the same. [Source : SAS Website]

Each imputation draws a random sample of the missing values from their predictive distribution given the observed data. This results in valid statistical inferences that properly reflect the uncertainty due to missing values, for example, confidence intervals with the correct probability coverage.

Multiple imputation inference involves three distinct phases:

1. The missing data are filled in m times to generate m complete data sets.
2. Each of the m complete data sets is analyzed with a standard procedure, such as a regression.
3. The results from the m analyses are combined. The parameter estimates are averaged across the m data sets to produce a single point estimate. The standard error is computed from the average of the m squared standard errors (within-imputation variance) plus the variance of the m parameter estimates across data sets (between-imputation variance), with the latter inflated by a factor of 1 + 1/m.
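
The combining step is usually written as Rubin's rules. The formulas below are a sketch in notation chosen here for illustration (the symbols are not from the SAS output): \hat{Q}_i is the parameter estimate from imputed data set i, and \hat{U}_i is its squared standard error.

```latex
\bar{Q} = \frac{1}{m} \sum_{i=1}^{m} \hat{Q}_i
\qquad
\bar{U} = \frac{1}{m} \sum_{i=1}^{m} \hat{U}_i
\qquad
B = \frac{1}{m-1} \sum_{i=1}^{m} \left( \hat{Q}_i - \bar{Q} \right)^2

T = \bar{U} + \left( 1 + \frac{1}{m} \right) B
\qquad
\mathrm{SE}(\bar{Q}) = \sqrt{T}
```

Here \bar{Q} is the pooled point estimate, \bar{U} the within-imputation variance, B the between-imputation variance, and T the total variance. The pooled standard error is the square root of T, not a plain average of the m standard errors.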

Objective of Multiple Imputation

The main goal of multiple imputation is to obtain robust estimates from your model, that is, parameter estimates and standard errors that are not biased by the missing data.

In SAS, PROC MI creates the multiply imputed data sets.

Missing Data Patterns

1. Monotone :

A data set has a monotone missing pattern when, for every observation, a missing value in one variable implies that all variables to its right in the rectangular data array are also missing.

For example, if an observation's first missing value is in the third of five variables, a monotone pattern looks like o o m m m (every variable to the right is also missing), whereas o o m o m is non-monotone. Here o indicates observed and m indicates missing.

2. Arbitrary:

An arbitrary missing data pattern has missing values scattered among the observed values with no particular structure.
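
To see which pattern a data set has before choosing a method, one option is to run PROC MI with NIMPUTE=0, which prints the Missing Data Patterns table without creating any imputations. The sketch below reuses the data set and variable names from the MCMC example in this tutorial; substitute your own.

```sas
/* Display the Missing Data Patterns table only.
   NIMPUTE=0 asks PROC MI to skip the imputation step. */
proc mi data=mi_input_data nimpute=0;
   var outcome age gender ethnicity BMI FBS heart_rate;
run;
```

In the resulting table, an X marks an observed variable and a period marks a missing one. The pattern is monotone if, reading each row in VAR order, every period is followed only by periods.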

I. MCMC : Markov Chain Monte Carlo method (Default Method)

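The MCMC method is used to impute missing values for a data set with an arbitrary missing pattern. It assumes that the variables follow a multivariate normal distribution. This is the default method in PROC MI and can be requested explicitly with the MCMC statement.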

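/* Step 1: create 5 imputed data sets with the MCMC method.
   MCMC assumes multivariate normality, so every variable in VAR must be numeric. */
PROC MI data=mi_input_data seed=44853 nimpute=5 out=mi_output_data;
   mcmc;
   var outcome age gender ethnicity BMI FBS heart_rate;
run;

/* Step 2: fit the same logistic regression to each imputed data set */
PROC LOGISTIC data=mi_output_data;
   class gender;
   model outcome = age gender ethnicity BMI FBS heart_rate;
   by _Imputation_;
   ods output ParameterEstimates=mi_parms;
run;

/* Step 3: combine the 5 sets of estimates.
   CLASSVAR=CLASSVAL tells MIANALYZE how PROC LOGISTIC labels CLASS levels. */
PROC MIANALYZE parms(classvar=classval)=mi_parms;
   class gender;
   modeleffects Intercept age gender ethnicity BMI FBS heart_rate;
run;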

NIMPUTE = specifies the number of imputations. In this case, it is 5.

_Imputation_ : When this program runs, it produces a new, stacked data set with 5 times the number of observations in the input data set. It also includes an index variable called _Imputation_. For example, if the input data set has 150 observations, the first 150 rows of the output have _Imputation_ = 1, the next 150 have _Imputation_ = 2, and so on.
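
A quick way to confirm this layout is to count the rows per imputation. This is a minimal sketch that assumes the output data set is named mi_output_data, as in the program above.

```sas
/* Each value of _Imputation_ should appear once per input observation,
   for example 150 rows for each of the values 1 through 5. */
proc freq data=mi_output_data;
   tables _Imputation_ / nocum;
run;
```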

PROC MIANALYZE : It performs the final analysis, which takes the results of the five logistic regressions and combines them.

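Interpretation : PROC MIANALYZE averages the five sets of coefficients to get the point estimates. For the standard errors, it averages the five squared standard errors (within-imputation variance) and adds the variability of the regression coefficients across the five imputed data sets (between-imputation variance).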
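
To make the combining step concrete, the pooled estimates and standard errors can be reproduced by hand. This is only an illustrative sketch, not a replacement for PROC MIANALYZE, which also reports degrees of freedom, confidence limits and p-values. It assumes that mi_parms is the ParameterEstimates table saved by PROC LOGISTIC above, with the columns Variable, ClassVal0, Estimate and StdErr. The output name manual_pool is hypothetical.

```sas
/* Manual pooling of the per-imputation estimates.
   Assumes mi_parms has one row per parameter per imputation. */
proc sql;
   create table manual_pool as
   select Variable,
          ClassVal0,
          count(*)        as m,                /* number of imputations     */
          mean(Estimate)  as pooled_estimate,  /* average coefficient       */
          mean(StdErr**2) as within_var,       /* average squared std error */
          var(Estimate)   as between_var,      /* variance across data sets */
          sqrt(calculated within_var
               + (1 + 1 / calculated m) * calculated between_var)
                          as pooled_se         /* square root of total var  */
   from mi_parms
   group by Variable, ClassVal0;
quit;
```

The pooled_estimate and pooled_se columns should agree with the Estimate and Std Error columns that PROC MIANALYZE prints for the same parameters.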

Important PROC MI Statements

CLASS specifies the classification variables in the VAR statement
MONOTONE specifies imputation methods for a data set with a monotone missing pattern
VAR specifies the variables to be analyzed (both those with missing values and those without)

II. Regression Imputation (Linear Regression)

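proc mi data=Fish1 seed=13951639 out=outex3;
   /* REG imputes continuous variables by linear regression */
   monotone reg(Length1 Length2);
   var Length1 Length2 Length3;
run;

Note : Length1 and Length2 are the variables to be imputed with the regression method. The REG method applies to continuous variables, so they must not appear in a CLASS statement. When no covariates are listed inside REG(), PROC MI uses the variables that precede the imputed variable in the VAR statement as covariates. The MONOTONE statement requires the data to have a monotone missing pattern in the order given by the VAR statement.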

III. Imputing Nominal and Ordinal variables (Discriminant and Logistic Regression)

proc mi data=Fish seed=1305417 out=OutFish;
   class Species;
   /* LOGISTIC is a method of the MONOTONE statement, not a statement of its own */
   monotone logistic(Species = Length Height Width Height*Width / details);
   var Length Height Width Species;
run;

Binary and ordinal variables : LOGISTIC method

Binary and nominal variables : DISCRIM method

In the above code, Species is the variable that has missing values. The VAR statement lists all variables used in the imputation models, both the variable being imputed and the fully observed variables that serve as predictors.

Example Code : Both Categorical and Continuous Data

proc mi data=Fish seed=1305417 out=OutFish;
   class Species;
   monotone reg(Height Width / details)
            logistic(Species = Length Height Width Height*Width / details);
   var Length Height Width Species;
run;

About Author:
Deepanshu Bhalla
