SAS : Proc Varclus Explained

Deepanshu Bhalla
The VARCLUS procedure is a useful SAS procedure for variable reduction. It is based on a divisive clustering technique.
  1. All variables start in one cluster. A principal components analysis is then run on the variables in the cluster to determine whether the cluster should be split into two subsets of variables.
  2. If the second eigenvalue for the cluster is greater than the specified cutoff, the initial cluster is split into two clusters. A large second eigenvalue means that at least two principal components account for a large amount of variation among the inputs.
  3. To determine which inputs go into each cluster, the principal component scores are rotated obliquely to maximize the correlation within a cluster and minimize the correlation between clusters.
  4. The process is repeated on the resulting clusters and ends when the second eigenvalues of all current clusters fall below the cutoff.
If a cluster contains only one variable, it has only one principal component, so its second eigenvalue is 0.
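The split rule described above can be illustrated outside SAS. The following is a minimal Python sketch, not PROC VARCLUS itself: it estimates the second eigenvalue of a cluster's correlation matrix by power iteration with deflation and compares it with the cutoff. The function names (`second_eigenvalue`, `should_split`) and the two sample correlation matrices are hypothetical, and the oblique rotation and assignment steps are left out.

```python
import math

def matvec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def top_eigenpair(m, iters=1000):
    """Largest eigenvalue and eigenvector of a symmetric, positive
    semi-definite matrix, found by plain power iteration."""
    v = [1.0 + 0.1 * i for i in range(len(m))]  # arbitrary start vector
    for _ in range(iters):
        w = matvec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:          # matrix is numerically zero
            return 0.0, v
        v = [x / norm for x in w]
    lam = sum(a * b for a, b in zip(v, matvec(m, v)))  # Rayleigh quotient
    return lam, v

def second_eigenvalue(corr):
    """Second-largest eigenvalue of a correlation matrix."""
    lam1, v1 = top_eigenpair(corr)
    n = len(corr)
    # Deflation: subtract the first component, then take the top one again.
    deflated = [[corr[i][j] - lam1 * v1[i] * v1[j] for j in range(n)]
                for i in range(n)]
    lam2, _ = top_eigenpair(deflated)
    return lam2

def should_split(corr, maxeigen=1.0):
    """Split the cluster when the second eigenvalue exceeds the cutoff."""
    return second_eigenvalue(corr) > maxeigen

# Hypothetical two-variable clusters (made-up correlations).
tight = [[1.0, 0.9], [0.9, 1.0]]   # second eigenvalue is 1 - 0.9 = 0.1
loose = [[1.0, 0.1], [0.1, 1.0]]   # second eigenvalue is 1 - 0.1 = 0.9

print(should_split(tight, maxeigen=0.7))  # False: 0.1 is below the cutoff
print(should_split(loose, maxeigen=0.7))  # True: 0.9 is above the cutoff
```

For a cluster of two variables with correlation r, the eigenvalues are 1 + |r| and 1 - |r|, so with a cutoff of 0.7 the pair is split only when |r| is below 0.3. A one-variable matrix `[[1.0]]` gives a second eigenvalue of 0, matching the note above.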
proc varclus data=imputed maxeigen=.7 short hi;
var Q1-Q5 VAR1-VAR20;
run;
The MAXEIGEN option specifies the largest permissible value of the second eigenvalue in each cluster. The default value is 1 when the correlation matrix is analyzed and neither MAXCLUSTERS nor PROPORTION is specified.

The HI option specifies that the clusters at different levels maintain a hierarchical structure, which prevents variables from transferring from one cluster to another after a split is made. In other words, once a variable has been assigned to a cluster, it cannot be reassigned to a different one.

The SHORT option suppresses some of the output generated by PROC VARCLUS.

Important Points
  1. By default, the maximum number of clusters is equal to the number of variables in the model. The MAXCLUSTERS option can be used to specify the largest number of clusters desired. It is usually better not to specify the option and let SAS decide the number of clusters.
  2. By default, PROC VARCLUS uses a non-hierarchical version of this algorithm, in which variables can be reassigned to other clusters. The HI option runs the hierarchical version.
  3. Larger eigenvalue thresholds result in fewer clusters, and smaller thresholds yield more clusters.
  4. Variables belonging to different clusters may still be correlated, because PROC VARCLUS is a type of oblique component analysis.

How to select best variables from each cluster

The best variable in a cluster has a high correlation with its own cluster and a low correlation with the other clusters.
The variable with the lowest 1 - R-squared ratio in a cluster is likely to be a good representative for that cluster: it combines a high correlation with its own cluster and a low correlation with the next closest cluster.
Why the lowest 1 - R-squared ratio?

The 1 - R**2 ratio is smallest when a variable has the maximum correlation with its own cluster and the minimum correlation with the next closest cluster, as the formula shows:

1 - R**2 ratio = (1 - R**2 own cluster) / (1 - R**2 next closest cluster)
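A quick numeric check of the ratio, (1 - R-squared own) / (1 - R-squared next closest), may help. The Python sketch below uses made-up R-squared values; the variable names and numbers are hypothetical and are not PROC VARCLUS output.

```python
def one_minus_rsq_ratio(rsq_own, rsq_next):
    """(1 - R-squared with own cluster) / (1 - R-squared with next closest)."""
    return (1.0 - rsq_own) / (1.0 - rsq_next)

# Hypothetical variables in one cluster: (R-squared own, R-squared next closest)
candidates = {
    "VAR1": (0.85, 0.10),  # strong own-cluster fit, weak tie to the next cluster
    "VAR2": (0.60, 0.10),  # weaker own-cluster fit
    "VAR3": (0.85, 0.50),  # strong fit, but also close to the next cluster
}

ratios = {name: one_minus_rsq_ratio(own, nxt)
          for name, (own, nxt) in candidates.items()}

for name, ratio in ratios.items():
    print(name, round(ratio, 3))

# The representative is the variable with the smallest ratio.
best = min(ratios, key=ratios.get)
print("best:", best)
```

VAR1 wins here because it has both the high own-cluster R-squared of VAR3 and the low next-closest R-squared of VAR2; lowering the numerator or raising the denominator both shrink the ratio.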


SAS Macro for Variable Selection

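%macro varsel(input=, vars= , output =);
%local nvar;

/* Suppress printed output but capture two ODS tables as data sets */
ods select none;
ods output clusterquality=summary
           rsquare=clusters;

proc varclus data=&input maxeigen=.7 short hi;
   var &vars;
run;
ods select all;

/* The last row of SUMMARY holds the final number of clusters */
data _null_;
set summary;
call symputx('nvar', NumberOfClusters);
run;

/* Keep only the R-squared rows of the final cluster solution */
data selvars;
set clusters (where = (NumberOfClusters=&nvar));
keep Cluster Variable RSquareRatio;
run;

/* Carry the cluster label forward onto rows where it is blank */
data cv / view=cv;
retain dummy 1;
set selvars;
keep dummy cluster;
run;

data filled;
update cv(obs=0) cv;
by dummy;
set selvars(drop=cluster);
output;
drop dummy;
run;

/* Within each cluster, put the lowest 1 - R**2 ratio first */
proc sort data = filled;
by cluster RSquareRatio;
run;

/* Keep the first row per cluster: the variable with the lowest ratio */
data &output;
set filled (rename = (variable = Best_Variables));
by cluster;
if first.cluster then output;
run;

%mend varsel;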

%varsel(input= abc, vars= _numeric_ , output = rest);
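
The last steps of the macro (fill the cluster label down, sort, keep the first row per cluster) can be traced on a small example. The Python sketch below is an illustration only, assuming, as the macro's `filled` step suggests, that the cluster label appears only on the first row of each cluster. The rows and the function names `fill_down` and `best_per_cluster` are hypothetical.

```python
def fill_down(rows):
    """Carry the last non-blank cluster label forward onto blank rows."""
    filled, current = [], None
    for cluster, variable, ratio in rows:
        if cluster:               # a non-blank label starts a new cluster
            current = cluster
        filled.append((current, variable, ratio))
    return filled

def best_per_cluster(rows):
    """Return the variable with the lowest 1 - R**2 ratio in each cluster."""
    best = {}
    for cluster, variable, ratio in fill_down(rows):
        if cluster not in best or ratio < best[cluster][1]:
            best[cluster] = (variable, ratio)
    return {cluster: variable for cluster, (variable, _) in best.items()}

# Made-up rows shaped like the macro's SELVARS data set:
# (Cluster, Variable, RSquareRatio)
rows = [
    ("Cluster 1", "Q1",   0.42),
    ("",          "Q2",   0.18),
    ("",          "VAR3", 0.55),
    ("Cluster 2", "VAR7", 0.30),
    ("",          "VAR9", 0.61),
]

print(best_per_cluster(rows))
```

With these rows the sketch keeps Q2 for Cluster 1 and VAR7 for Cluster 2, which is what the sort followed by `first.cluster` does in the macro.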

A Similar Variable Clustering Technique (varclus) in R
install.packages("Hmisc")
library(Hmisc)
# x is a numeric matrix of the candidate variables
v = varclus(x, similarity="spearman")
R Code : IV and Clustering
About Author:
Deepanshu Bhalla

Deepanshu founded ListenData with a simple objective - Make analytics easy to understand and follow. He has over 10 years of experience in data science. During his tenure, he worked with global clients in various domains like Banking, Insurance, Private Equity, Telecom and HR.

12 Responses to "SAS : Proc Varclus Explained"
  1. Excellent Article!!!.I have been looking for this logic.
  2. Great article, simple and to the point without technical jargon clutters. Excellent for someone new to proc varclus looking to get started with it...
  3. Amazing! Easy to follow with clear explanations.

    Thank you!
  4. Hi,

    Can you suggest how do we perform the same in R?

    Thanks,
    Krishna
  5. Good job Deepanshu ! The 'R-square own cluster' & 'R-square next cluster' is not clear from your article. Specifically I mean 'next' is not explicitly presented here with clarity.
  6. Can't we use proc factor for variables reduction?
  7. whats th password for the excel?
  8. Can someone please explain how to calculate R-squre within and next closest cluster?
  9. what is the password of the excel?
  10. Can you please provide the code with example dataset? This helps in understanding the code flow! Appreciate your efforts. Thank you
  11. Thank you..great articles...what is the password for the excel file?