This tutorial explains what multiple imputation is and how it works. It also shows how to implement it in SAS and discusses the challenges involved.
Multiple Imputation
Instead of filling in a single value for each missing value, multiple imputation (Rubin 1976, 1987) replaces each missing value with a set of plausible values that represent the uncertainty about the right value to impute. The multiply imputed data sets are then analyzed by using standard procedures for complete data and combining the results from these analyses. No matter which complete-data analysis is used, the process of combining results from different data sets is essentially the same. [Source : SAS Website]
Each imputed value is a random draw from the distribution of the missing value. This process results in valid statistical inferences that properly reflect the uncertainty due to missing values, for example, confidence intervals with the correct probability coverage.
In SAS, PROC MI is used to replace missing values with multiple imputation.
Multiple imputation inference involves the following steps:
- The missing data are filled in m times to generate m complete data sets.
- Perform regression or any other analysis on each of the m complete data sets.
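- Average the parameter estimates across the m imputed data sets to produce a single point estimate.
- Calculate the standard error of that estimate by combining two sources of variability: the within-imputation variance (the average of the m squared standard errors) and the between-imputation variance (the variance of the m parameter estimates across data sets). The total variance is the within-imputation variance plus (1 + 1/m) times the between-imputation variance, and the standard error is its square root.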
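The last two steps above are usually called Rubin's rules. As a sketch, write the estimate from the i-th imputed data set as \hat{Q}_i and its squared standard error as \hat{U}_i (these symbols are notation introduced here for illustration, not taken from the text above):

```latex
\bar{Q} = \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i, \qquad
\bar{U} = \frac{1}{m}\sum_{i=1}^{m}\hat{U}_i, \qquad
B = \frac{1}{m-1}\sum_{i=1}^{m}\left(\hat{Q}_i - \bar{Q}\right)^2, \qquad
T = \bar{U} + \left(1 + \frac{1}{m}\right)B
```

Here \bar{Q} is the pooled point estimate, \bar{U} the within-imputation variance, B the between-imputation variance, and \sqrt{T} the pooled standard error.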
Objective of Multiple Imputation
The main goal of multiple imputation is to obtain model estimates that are not biased by missing data, with standard errors that reflect the uncertainty the missing data introduce.
Missing Data Patterns
1. Monotone :
If a variable is missing for an observation, all variables to its right in the rectangular data array are also missing for that observation.
For example, if an observation is missing its third variable, a monotone pattern looks like o o m m m (all variables to the right are also missing), whereas o o m o m is one kind of non-monotone pattern. Here o indicates observed and m indicates missing.
2. Arbitrary:
An arbitrary pattern is one in which missing values are scattered among the observed values with no particular structure.
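Before choosing an imputation method, it can help to look at which pattern the data actually follow. The sketch below assumes the data set and variable names of the MCMC example later in this tutorial. With NIMPUTE=0, PROC MI performs no imputation and only reports the "Missing Data Patterns" table, in which X marks an observed value and a period marks a missing one.

```sas
/* Sketch only: mi_input_data and the variable list are the ones used
   in the MCMC example of this tutorial. NIMPUTE=0 skips the imputation
   step, so only the descriptive tables are produced. */
proc mi data=mi_input_data nimpute=0;
   var outcome age gender ethnicity BMI FBS heart_rate;
run;
```

Whether a pattern counts as monotone depends on the order of the variables in the VAR statement, so reordering them can sometimes turn an apparently arbitrary pattern into a monotone one.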
I. MCMC : Markov Chain Monte Carlo method (Default Method)
The MCMC method is used to impute missing values for a data set with an arbitrary missing pattern. This is the default method in PROC MI (METHOD=MCMC).
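/* Step 1: create 5 imputed data sets with the MCMC method */
PROC MI data=mi_input_data seed=44853 nimpute=5 out=mi_output_data;
mcmc;
var outcome age gender ethnicity BMI FBS heart_rate;
run;
/* Step 2: fit the same model to each imputed data set */
PROC LOGISTIC data=mi_output_data;
class gender;
model outcome = age gender ethnicity BMI FBS heart_rate / covb;
by _imputation_;
ods output ParameterEstimates=mi_parms CovB=mi_covb;
run;
/* Step 3: combine the estimates. CLASSVAR=CLASSVAL tells PROC MIANALYZE */
/* that the levels of CLASS effects are stored in the ClassVal0 column */
PROC MIANALYZE parms(classvar=classval)=mi_parms;
class gender;
modeleffects Intercept age gender ethnicity BMI FBS heart_rate;
run;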
NIMPUTE = specifies the number of imputations. In this case, it is 5.
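Interpretation : The pooled coefficients are the averages of the five sets of coefficients. Each pooled standard error combines the average of the five squared standard errors (within-imputation variance) with the variability of the coefficient across the five imputed data sets (between-imputation variance).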
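To see what the combining step does, the pooled estimates can be recomputed by hand. This is a minimal sketch, not a replacement for PROC MIANALYZE. It assumes the mi_parms data set created by the ODS OUTPUT statement above has one row per parameter per imputation, with the columns Variable, Estimate and StdErr. The names mi_parms_sq and pooled are hypothetical.

```sas
/* Sketch only: square each standard error to get a variance */
data mi_parms_sq;
   set mi_parms;
   se_sq = StdErr**2;
run;

/* One output row per parameter, summarised over the imputations */
proc means data=mi_parms_sq noprint nway;
   class Variable;
   var Estimate se_sq;
   output out=pooled
      mean(Estimate)=q_bar   /* pooled point estimate */
      var(Estimate)=b        /* between-imputation variance */
      mean(se_sq)=w          /* within-imputation variance */
      n(Estimate)=m;         /* number of imputations */
run;

/* Total variance = within + (1 + 1/m) * between */
data pooled;
   set pooled;
   pooled_se = sqrt(w + (1 + 1/m)*b);
run;

proc print data=pooled noobs;
   var Variable m q_bar w b pooled_se;
run;
```

If a CLASS variable has more than two levels, each level gets its own row, so the grouping would also need the level column. The sketch assumes a single row per Variable in each imputation.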
Important PROC MI Statements
- CLASS specifies the classification variables in the VAR statement
- MONOTONE specifies imputation methods for a data set with a monotone missing pattern
- VAR specifies the variables to be analyzed (both those with missing values and those without)
II. Monotone : Regression Method
proc mi data=Fish1 seed=13951639 out=outex3;
/* REG imputes continuous variables, so no CLASS statement is needed here */
monotone reg(Length1 Length2);
var Length1 Length2 Length3;
run;
Note : Length1 and Length2 are the variables to be imputed.
III. Imputing Nominal and Ordinal variables (Discriminant and Logistic Regression)
proc mi data=Fish seed=1305417 out=OutFish;
class Species;
monotone logistic( Species= Length Height Width Height*Width/ details);
var Length Height Width Species;
run;
Ordinal and Binary : Logistic option
Nominal and Binary : Discrim option
In the above code, Species is the variable with missing data. The VAR statement lists all variables used in the imputation models, both those with missing values and the fully observed predictors.
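The Discrim option is listed above but not shown in code. As a hedged sketch, a nominal variable such as Species could be imputed with the discriminant function method as follows. It reuses the Fish data set and variable names from the example above. OutFishDiscrim is a hypothetical output name, and the data are assumed to have a monotone missing pattern in the VAR order.

```sas
/* Sketch only: discriminant function method for a nominal CLASS variable.
   OutFishDiscrim is a hypothetical output data set name. */
proc mi data=Fish seed=1305417 out=OutFishDiscrim;
   class Species;
   /* Continuous covariates entered as main effects only,
      to keep the sketch simple */
   monotone discrim(Species = Length Height Width / details);
   var Length Height Width Species;
run;
```

The covariates must appear before Species in the VAR statement, because each variable is imputed from the variables that precede it.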
Example Code : Both Categorical and Continuous Data
proc mi data=Fish seed=1305417 out=OutFish;
class Species;
monotone reg(Height Width/ details)
logistic( Species= Length Height Width Height*Width/ details);
var Length Height Width Species;
run;
What is the use of Seed?
It's the seed for the randomness of the proc. If you use the same seed for the same setup, you'll get the same results. This is important if you need to replicate the same results over different runs.
Can the logistic method under PROC MI handle a variable with several levels, say 0, 1, 2, 3?