Chi-Square : Variable Reduction Technique

Chi-Square as variable selection / reduction technique

The Pearson / Wald / Score Chi-Square Test can be used to test the association between the independent variables and the dependent variable.

Wald and Score chi-square tests can be used for both continuous and categorical variables, whereas the Pearson chi-square test is used for categorical variables only. In each case, a small p-value is evidence of an association; for the Wald and Score tests, it means the variable's coefficient is significantly different from zero.

Note : Wald and Score Chi-Square tests are asymptotically equivalent.
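
For a single coefficient, the two statistics can be written out as below. This is the standard textbook form for a logistic regression with an intercept and one predictor x, not something taken from the SAS output; the symbols (\hat{\beta}, x_i, y_i, \bar{x}, \bar{y}) are introduced here for illustration only.

```latex
% Wald: uses the fitted coefficient and its estimated variance
W = \frac{\hat{\beta}^{2}}{\widehat{\operatorname{Var}}(\hat{\beta})}

% Score: uses only the intercept-only (null) fit, with \bar{y} the event rate
S = \frac{\left[\sum_{i=1}^{n} x_i \,(y_i - \bar{y})\right]^{2}}
         {\bar{y}\,(1 - \bar{y}) \sum_{i=1}^{n} (x_i - \bar{x})^{2}}

% Under H_0: \beta = 0, both are approximately \chi^{2} with 1 degree of freedom
```

Because the Score statistic needs only the null model, it can be computed for every candidate variable without fitting a separate regression for each one.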

For Continuous variables

Run a UNIVARIATE Score / Wald chi-square analysis of each continuous independent variable against the dependent variable. That is, test the dependent variable against only one independent variable at a time. In PROC LOGISTIC, a single call can do this for all candidate variables at once. Use the options: selection=stepwise maxstep=1 details

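MAXSTEP=1 limits the stepwise procedure to a single step, that is, one addition or removal of an independent variable. With DETAILS, the procedure prints the Score chi-square of every candidate variable before that step is taken.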

Look at the table "Analysis of Effects Eligible for Entry".

It lists the Score chi-square and p-value of each variable tested on its own, which shows the univariate predictive power of each continuous variable.

Rule :

Keep all variables whose chi-square p-value is less than 0.25.
Select the 50 variables with the highest chi-square score.

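/* Store the "Analysis of Effects Eligible for Entry" table in the data set TEST */
ODS OUTPUT EFFECTNOTINMODEL = TEST;
PROC LOGISTIC DATA = BHALLA.MODELING DESC;
    /* SELECTION = S is the abbreviation for STEPWISE */
    MODEL ATTRITION = DDA DDABAL DEP DEPAMT
        / SELECTION = S MAXSTEP = 1 DETAILS;
RUN;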
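
One way to apply the rule to the stored table is sketched below. It assumes the TEST data set written by the ODS OUTPUT statement contains the columns ScoreChiSq and ProbChiSq (worth confirming with PROC CONTENTS), and it reads the two lines of the rule as applying together: first the p-value cutoff, then the top 50 by chi-square. The data set names TEST_SORTED and KEEP_VARS are made up for illustration.

```sas
/* Check the column names in TEST before relying on them */
proc contents data = test;
run;

/* Rank candidates from strongest to weakest Score chi-square */
proc sort data = test out = test_sorted;
    by descending ScoreChiSq;
run;

/* Keep p < 0.25, then at most the 50 highest chi-square values.
   WHERE filters rows before they are read, so _N_ counts only
   the rows that passed the p-value cutoff. */
data keep_vars;
    set test_sorted;
    where ProbChiSq < 0.25;
    if _n_ <= 50;
run;
```

If TEST turns out to hold rows from more than one step of the stepwise procedure, it is safer to restrict it to the first set of rows before ranking, since only those are univariate tests.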

For Categorical variables

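Run a UNIVARIATE Pearson chi-square analysis of each categorical independent variable against the dependent variable. In PROC FREQ, use the options: missing chisq. MISSING treats missing values as a category of their own, and CHISQ requests the chi-square tests.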

proc freq data= newdata;
tables attrition * balance/missing ChiSq;
run;
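
For reference, the Pearson statistic reported by CHISQ compares the observed cell counts with the counts expected if the two variables were independent. The notation below (O_{ij}, E_{ij}, R rows, C columns) is the usual textbook one and is introduced here only for illustration.

```latex
\chi^{2} = \sum_{i=1}^{R}\sum_{j=1}^{C} \frac{(O_{ij} - E_{ij})^{2}}{E_{ij}},
\qquad E_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{n},
\qquad \text{df} = (R-1)(C-1)
```

Here n_{i\cdot} and n_{\cdot j} are the row and column totals and n is the total number of observations.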

Keep a categorical variable only if each of its categories has more than about 25 observations and its chi-square p-value is at most 0.25.
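
Both conditions can be checked from PROC FREQ output, as in the rough sketch below. It reads "category" as a level of the predictor BALANCE, and it assumes the ODS table ChiSq carries the columns Statistic and Prob (confirm with PROC CONTENTS). The data set names CHISQ_OUT, LEVEL_COUNTS and PEARSON are hypothetical.

```sas
/* Capture the chi-square tests and the size of each category */
ods output ChiSq = chisq_out;
proc freq data = newdata;
    tables attrition * balance / missing chisq;
    tables balance / missing out = level_counts;
run;

/* Keep only the Pearson chi-square row and flag p <= 0.25 */
data pearson;
    set chisq_out;
    where Statistic = "Chi-Square";
    keep_by_p = (Prob <= 0.25);
run;

/* Flag whether the smallest category has more than 25 observations */
proc sql;
    select min(count)        as min_level_n,
           (min(count) > 25) as keep_by_size
    from level_counts;
quit;
```

A variable would be kept when both flags equal 1. The 25 here is the approximate cutoff from the rule, so borderline cases still call for judgment.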

For Transformed Variables (Log, sqrt, exp, cube root, inverse etc.)

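In PROC LOGISTIC, use the options: selection=stepwise maxstep=2 details

Look at the "Summary of Stepwise Procedure" table. With MAXSTEP=2 the procedure stops after two steps, so the table lists up to two variables: the best-performing forms among the original and transformed versions.

Use the statement ODS OUTPUT ModelBuildingSummary = TEST; to store the "Summary of Stepwise Procedure" table in a SAS data set.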
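
As an illustration, the sketch below builds a few transforms of one predictor and lets the stepwise procedure pick among them. The choice of DDABAL, the transform names (LOG_DDABAL and so on) and the data set WORK.TRANSFORMED are assumptions made for the example, not part of the original analysis.

```sas
/* Hypothetical transforms of one predictor, DDABAL */
data work.transformed;
    set bhalla.modeling;
    /* log and inverse are only defined for positive values */
    if ddabal > 0 then do;
        log_ddabal = log(ddabal);
        inv_ddabal = 1 / ddabal;
    end;
    if ddabal >= 0 then sqrt_ddabal = sqrt(ddabal);
    /* sign-preserving cube root, valid for negative values too */
    cbrt_ddabal = sign(ddabal) * abs(ddabal) ** (1/3);
run;

/* Store the "Summary of Stepwise Procedure" table in TEST */
ods output ModelBuildingSummary = test;
proc logistic data = work.transformed desc;
    model attrition = ddabal log_ddabal sqrt_ddabal cbrt_ddabal inv_ddabal
        / selection = stepwise maxstep = 2 details;
run;
```

Note that PROC LOGISTIC drops any observation with a missing value in a model variable, so rows where the log or inverse is undefined are excluded from this comparison.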

Note : A conservative significance level (0.05) often leads to important variables being excluded from the analysis. A higher level is recommended, based on the work by Mickey and Greenland (1989) on logistic regression. It is important to CHECK the IMPORTANCE of these variables at the multivariable model stage before deciding on your final variables.

About Author:
Deepanshu Bhalla


8 Responses to "Chi-Square : Variable Reduction Technique"

Anonymous, May 13, 2015 at 4:34 PM
Why 0.25 and not 0.05?
Deepanshu Bhalla, May 14, 2015 at 5:02 AM
A conservative significance level (0.05) often leads to important variables being excluded from the analysis. A higher level is recommended, based on the work by Mickey and Greenland (1989) on logistic regression. It is important to CHECK the IMPORTANCE of these variables at the multivariable model stage before deciding on your final variables.
Anonymous, May 14, 2015 at 6:21 AM
Ok, you mean running the logistic regression with no selection options and viewing the p-values of each variable before setting the p-value... but not more than 0.25?
Anonymous, May 14, 2015 at 6:58 AM
Please, I posted questions under the following topics: model validation, scoring in logistic, bootstrapping.

Thanks
Ayush, June 2, 2018 at 8:59 PM
Hi Sir, do you have another chapter which deals with feature selection of categorical variable using chi square test in R?