This article explains how to check the assumptions of multiple regression and the solutions to violations of assumptions.

Download the dataset (Source : UCLA)

**SAS Code : Reading downloaded file into SAS**

/* Read data from a folder where the file is stored*/

libname reg "C:\Users\Deepanshu Bhalla\Downloads";

/* Checking the number of observations, number of variables in a data set*/

proc contents data = reg.crime varnum;

run;

**Read Statistical Properties of OLS Coefficient Estimators**

**Checking Assumptions of Multiple Regression with SAS**

**1. Detecting Outlier**

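**I. Box Plot Method**

If a value lies more than 1.5*IQR above the upper quartile (Q3), it is considered an outlier. Similarly, if a value lies more than 1.5*IQR below the lower quartile (Q1), it is considered an outlier. In SAS, the **PLOTS** option in **PROC UNIVARIATE** tells SAS to generate a box plot.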
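The 1.5*IQR rule used by a box plot can be illustrated with a minimal Python sketch. This is not SAS code and the sample values are hypothetical (they are not from the crime data set). Quartile conventions differ between packages, so the fences may not match PROC UNIVARIATE output exactly.

```python
import statistics

def iqr_outliers(values):
    """Return (lower_fence, upper_fence, outliers) using the 1.5*IQR rule."""
    # quantiles(n=4) returns the three quartile cut points: Q1, median, Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return lower_fence, upper_fence, outliers

# Hypothetical values chosen so that one point is clearly extreme.
sample = [10, 12, 13, 14, 15, 16, 17, 18, 19, 60]
low, high, flagged = iqr_outliers(sample)
print("fences:", low, high, "outliers:", flagged)
```

Any value outside the two fences is what a box plot draws as an individual point beyond the whiskers.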

**II. Studentized Residuals Method**

**Studentized Residuals : Meaning**

Before jumping into studentized residuals, we need to understand the meaning of residuals.

A **residual** is the difference between the observed value and the predicted value.

A **standardized residual** is the residual divided by the standard error of estimate.

A **studentized residual** is the residual divided by the standard error of the residual with that case deleted. If the absolute value of the studentized residual is greater than 3, the observation is considered an outlier.

**SAS Code**

/* Studentized residuals - Check Outliers*/

ods graphics on;

proc reg data=reg.crime;

model crime=pctmetro poverty single / stb clb;

output out=stdres p= predict r = resid rstudent=r h=lev cookd=cookd dffits=dffit;

run;

quit;

ods graphics off;

/* Print only those observations having absolute value of studentized residual greater than 3*/

proc print data= stdres;

var r crime pctmetro poverty single;

where abs(r) > 3;

run;

**III. Cook's D Method**

The higher the Cook's D, the more influential the point. If the Cook's D value of an observation is greater than 4/(number of observations), the observation is considered an outlier.

**SAS Code**

proc print data=stdres;

where cookd > (4/51);

var cookd crime pctmetro poverty single;

run;

**Consequences of Outliers**

Outliers can distort the coefficient estimates of the independent variables.

**Treatment of Outlier**

**1. Percentile capping based on distribution of a variable**

It means replacing extreme values with the largest/smallest non-extreme observation.

In layman's terms, capping at the 1st and 99th percentile means values that are less than the value at 1st percentile are replaced by the value at 1st percentile, and values that are greater than the value at 99th percentile are replaced by the value at 99th percentile.
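As an illustration of the capping rule, here is a minimal Python sketch (not SAS). The function name `cap_at_percentiles` and the data are hypothetical, and the percentile definition used by `statistics.quantiles` may differ slightly from the one SAS applies.

```python
import statistics

def cap_at_percentiles(values, lower_pct=1, upper_pct=99):
    """Replace values below/above the given percentiles with the percentile values."""
    # quantiles(n=100) returns 99 cut points; index p-1 holds the p-th percentile.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]

# Hypothetical data: 1..100 plus one extreme value.
data = list(range(1, 101)) + [1000]
capped = cap_at_percentiles(data)
print(min(capped), max(capped))
```

Only the values in the two tails change; everything between the 1st and 99th percentile is left as it was.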

**2. Compare Models with or without Outliers**

The smaller the RMSE, the better the model.

The RMSE is the square root of the variance of the residuals. It indicates the absolute fit of the model to the data, that is, how close the observed data points are to the model's predicted values. Whereas R-squared is a relative measure of fit, RMSE is an absolute measure of fit.
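The RMSE calculation can be sketched in a few lines of Python (not SAS). The observed and predicted values are hypothetical. The `n_params` argument is an assumption added for illustration: setting it to the number of estimated coefficients (including the intercept) divides by the error degrees of freedom, which is how PROC REG computes Root MSE.

```python
import math

def rmse(observed, predicted, n_params=0):
    """Square root of the sum of squared residuals over (n - n_params)."""
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(sse / (len(observed) - n_params))

# Hypothetical observed and predicted values.
observed = [3, 5, 7, 9]
predicted = [2, 5, 8, 9]

print(rmse(observed, predicted))              # divides by n
print(rmse(observed, predicted, n_params=2))  # divides by n - 2
```

Fitting the model with and without the suspected outliers and comparing the two RMSE values shows how much those points degrade the fit.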

*Lower values of RMSE indicate better fit.*

**2. Linear Relationship between Dependent and Independent Variables**

**I. Scatter plot of independent variable vs. dependent variable**

ods graphics on;

proc reg data=reg.crime;

model crime=pctmetro poverty single / partial;

run;

quit;

ods graphics off;

**II. Run correlation between dependent variable and independent variables**

There should be a moderate and **SIGNIFICANT** correlation score between the dependent variable and the independent variables.

proc corr data=reg.crime;

var pctmetro poverty single;

with crime;

run;

**Check out :**SAS Macro for detecting non-linear relationship

**Consequences of Non-Linear Relationship**

If the assumption of linearity is violated, the linear regression model will return incorrect (biased) estimates. In short, the coefficients as well as R-square will be underestimated.

**Treatment of Non linear Relationship**

1. When the error variance appears to be constant (**homoscedasticity**), only X needs to be transformed to linearize the relationship. Transform the independent variable using Log10(X), Square root(X), Square(X), Exp(X), 1/X or Exp(-X).

2. When the error variance does not appear constant, it may be necessary to transform Y, or both X and Y.

**Run Box-Cox Transformations for Dependent Variable**

**3. Errors (Residuals) should be normally distributed**

The Shapiro-Wilk W test can be used to check the normality assumption. The null hypothesis is that the residuals are normally distributed.

If the p-value is greater than .05, we cannot reject the null hypothesis that the residuals are normally distributed.

**SAS Code**

proc reg data=reg.crime;

model crime=pctmetro poverty single / stb clb;

output out=stdres p= predict r = resid;

run;

proc univariate data=stdres normal;

var resid;

run;

**Consequences of Non-Normality of Errors**

Many common tests of null hypotheses on regression results require normality. So if the residuals are not normal, you cannot rely on these hypothesis tests.

**Treatment of Non Normality**

Transform the **DEPENDENT** variable. Try log, square root and reciprocal transformations.

**Run Box-Cox Transformations for Dependent Variable**

**4. Homoscedasticity**
There should be homogeneity of variance of the residuals. In other words, the variance of the residuals is approximately equal for all predicted values of the dependent variable.

Another way of thinking of this is that the variability in the values of the dependent variable is the same at all values of the independent variables.

**I. Plot Residuals by Predicted values**

proc reg data= reg.crime;

model crime = poverty single;

plot r.*p.;

run;

quit;

**II. White, Pagan and Lagrange multiplier (LM) Test**

The White test tests the null hypothesis that the variance of the residuals is homogeneous (equal). We use the SPEC option on the MODEL statement to obtain the White test.

If the p-value of the White test is greater than .05, the assumption of homogeneity of variance of the residuals has been met.

**With PROC REG ( No CLASS statement , No Pagan Test)**

proc reg data= reg.crime;

model crime = poverty single/ SPEC;

run;

Note: A p-value greater than .05 indicates homoscedasticity.

**With PROC AUTOREG (LM Test, CLASS statement for categorical variables)**

proc autoreg data=reg.crime;

model crime = pctmetro poverty single / archtest;

output out=r r=yresid;

run;

Note: Check the p-values of the Q statistics and LM tests. A p-value greater than .05 indicates homoscedasticity.

**With PROC MODEL (White and Pagan Test, No CLASS statement for categorical variables)**

proc model data= reg.crime;

parms a1 b1 b2;

crime = a1 + b1*poverty + b2*single;

fit crime / white pagan=(1 poverty single)

out=resid1 outresid;

run;

quit;

If the p-values of the White test and the Breusch-Pagan test are greater than .05, the assumption of homogeneity of variance of the residuals has been met.

**Consequences of Heteroscedasticity**

- The coefficient estimates remain unbiased and consistent but become inefficient: the estimators are no longer the Best Linear Unbiased Estimators (BLUE).
- The hypothesis tests (t-test and F-test) are no longer valid.

**Treatment of Heteroscedasticity**

**Testing and Correcting Heteroscedasticity**

**5. Multicollinearity**

**Multicollinearity means there is a high correlation between independent variables. The linear regression model MUST NOT suffer from multicollinearity.**

**VIF (Variance Inflation Factor)**

It measures how much the variance of an estimated regression coefficient is increased because of collinearity.

**Interpretation:** If the variance inflation factor of a predictor variable were 9 (Sqrt(9) = 3), this means that the standard error for the coefficient of that predictor variable is 3 times as large as it would be if that predictor variable were uncorrelated with the other predictor variables.

If VIF is greater than 5, there is a multicollinearity problem in the model.

proc reg data = reg.crime;

model crime = poverty single / vif;

run;
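The arithmetic behind VIF can be sketched in Python (not SAS). The VIF of a predictor equals 1/(1 - R²), where R² comes from regressing that predictor on the other predictors. With only two predictors, that R² is simply their squared correlation. The function names and data below are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_from_r_squared(r_squared):
    """VIF given the R-squared of one predictor regressed on the others."""
    return 1.0 / (1.0 - r_squared)

def vif_two_predictors(x1, x2):
    """With exactly two predictors, R-squared is the squared correlation."""
    return vif_from_r_squared(pearson_r(x1, x2) ** 2)

# R-squared of 8/9 gives VIF = 9, so the standard error is inflated by sqrt(9) = 3.
vif = vif_from_r_squared(8 / 9)
print(vif, math.sqrt(vif))

# Hypothetical uncorrelated predictors give VIF = 1 (no inflation).
print(vif_two_predictors([1, 2, 3, 4], [1, -1, -1, 1]))
```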

**Consequences of Multicollinearity**

Multicollinearity inflates the standard errors, making it difficult to determine the relative importance of the predictors. In other words, the coefficients will be unreliable. Note that multicollinearity does not affect the efficiency of the estimators: they remain BLUE (Best Linear Unbiased Estimators).

**Treatment of Multicollinearity**

1. Run **PROC VARCLUS** with the **HI** option (Principal Component Analysis). A variable that has the lowest 1-R2 ratio is likely to be a good representative for the cluster.

2. Use centering, that is, subtract the mean from the predictor values before generating the squared term. The resulting centered data may well display considerably lower multicollinearity.

For example: Weight and Weight^2 are highly correlated with each other.

First Step : Center_Weight = Weight - mean(Weight)

Second Step : Center_Weight2 = Center_Weight ^2
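The effect of the two centering steps can be checked with a small Python sketch (not SAS). The weights are hypothetical. Because this sample is symmetric around its mean, the correlation after centering is exactly zero; with real data it is usually reduced rather than eliminated.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical weights.
weight = [50, 55, 60, 65, 70, 75, 80]
weight_sq = [w ** 2 for w in weight]

# First step: subtract the mean. Second step: square the centered values.
mean_weight = sum(weight) / len(weight)
center_weight = [w - mean_weight for w in weight]
center_weight_sq = [c ** 2 for c in center_weight]

r_raw = pearson_r(weight, weight_sq)
r_centered = pearson_r(center_weight, center_weight_sq)
print(round(r_raw, 4), round(r_centered, 4))
```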

**6. Independence of error terms - No Autocorrelation**

It states that the errors associated with one observation are not correlated with the errors of any other observation.

**Suppose you have collected data from laborers in eight different districts. It is likely that the laborers within each district will tend to be more like one another than laborers from different districts, that is, their errors are not independent.**

*It is a problem when you use time series data.*

proc reg data = reg.crime;

model crime = poverty single / dw;

run;

PROC REG tests for **first-order autocorrelation** using the **Durbin-Watson** coefficient (DW). The null hypothesis is no autocorrelation.

A DW value between 1.5 and 2.5 suggests the absence of first-order autocorrelation. A DW value less than 1.5 indicates positive autocorrelation. A DW value greater than 2.5 indicates negative autocorrelation.
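The Durbin-Watson coefficient is the sum of squared differences between successive residuals divided by the sum of squared residuals. A minimal Python sketch (not SAS) with hypothetical residual series shows how the value moves with the sign pattern. The `interpret_dw` helper simply encodes the 1.5 and 2.5 rule of thumb stated in this article.

```python
def durbin_watson(residuals):
    """DW = sum((e[t] - e[t-1])^2) / sum(e[t]^2), residuals in time order."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

def interpret_dw(dw):
    """Rule of thumb from the article: 1.5 to 2.5 means no first-order autocorrelation."""
    if dw < 1.5:
        return "positive autocorrelation"
    if dw > 2.5:
        return "negative autocorrelation"
    return "no first-order autocorrelation"

# Hypothetical residuals: long runs of the same sign vs. alternating signs.
runs = [1, 1, 1, -1, -1, -1]
alternating = [1, -1, 1, -1, 1, -1]

dw_runs = durbin_watson(runs)
dw_alt = durbin_watson(alternating)
print(dw_runs, interpret_dw(dw_runs))
print(dw_alt, interpret_dw(dw_alt))
```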

**Another alternative test: Lagrange Multiplier Test**

**It can be used for more than one order of autocorrelation.** It consists of several steps. First, regress Y on the Xs to get the residuals. Compute lagged values of the residuals up to the pth order. Replace missing values of the lagged residuals with zeros. Rerun the regression model, including the lagged residuals as independent variables.

proc autoreg data = reg.crime;

model crime = poverty single / dwprob godfrey;

run;

**Consequences of Autocorrelation**

Autocorrelation inflates t-statistics by underestimating the standard errors of the coefficients. Hypothesis testing will therefore lead to incorrect conclusions. Estimators no longer have minimum variance but they will remain unbiased.

**Treatment of Autocorrelation**

1. Add lagged transforms (lag value) of the dependent variable

2. Use PROC AUTOREG

It is advisable to build auto-regressive model with PROC AUTOREG for time series data.


**Important Point 1 :** A Box-Cox transformation of the dependent variable can solve the problems of non-linearity, non-normality of errors and heteroscedasticity.

**Run Box-Cox Transformations for Dependent Variable**
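For reference, the Box-Cox transformation for a given lambda is (y^lambda - 1)/lambda, with log(y) used when lambda is 0, and it requires positive values of y. The Python sketch below (not SAS) only applies the transformation for a lambda you supply. It does not estimate the best lambda, and the function name and values are hypothetical.

```python
import math

def box_cox(y, lam):
    """Apply the Box-Cox transformation with parameter lam to positive values."""
    if any(v <= 0 for v in y):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1) / lam for v in y]

# Hypothetical dependent-variable values.
y = [1.0, 4.0, 9.0]
print(box_cox(y, 0.5))  # close to a square root transformation
print(box_cox(y, 0))    # log transformation
print(box_cox(y, -1))   # close to a reciprocal transformation
```

Lambda values of 0.5, 0 and -1 correspond to the square root, log and reciprocal transformations, up to a shift and scale.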

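**Important Point 2 : RMSE for Training vs Test Sample**

The RMSE for your training and your test sets should be very similar if you have built a good model. If the RMSE for the test set is much higher than that of the training set, it is likely that you have badly overfit the data, i.e. you have created a model that tests well in sample but has little predictive value when tested out of sample.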

**Important Point 3 : Transformation Rules**


The specific transformation used depends on the extent of the deviation from normality.

1. If the distribution differs moderately from normality, a square root transformation is often the best.

2. A log transformation is usually best if the data are more substantially non-normal.

3. An inverse transformation should be tried for severely non-normal data.

**If nothing can be done to "normalize" the variable, then you might want to dichotomize (2 categories) the variable.**

You did an outstanding job. Thanks!

I am not able to locate the data at UCLA website. Can you please mention the specific filename and path for the same?

Is this applicable for logistic model?

Outlier and multicollinearity assumptions are applicable for logistic model as well.