SAS : Proc Varclus Explained

The VARCLUS procedure is a useful SAS procedure for variable reduction. It is based on a divisive clustering technique that works as follows:
  1. All variables start in one cluster. Then, a principal components analysis is done on the variables in the cluster to determine whether the cluster should be split into two subsets of variables.
  2. If the second eigenvalue for the cluster is greater than the specified cutoff, then the initial cluster is split into two clusters. A large second eigenvalue means that at least two principal components account for a large amount of variation among the inputs.
  3. To determine which inputs are included in each cluster, the principal component scores are rotated obliquely to maximize the correlation within a cluster and minimize the correlation between clusters. 
  4. This process ends when the second eigenvalues of all current clusters fall below the cutoff.
A cluster that contains only one variable has only one principal component, so its second eigenvalue is 0.
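The split test in step 2 can be checked by hand for the initial cluster. The sketch below assumes the same imputed data set and variable list used in this article. It runs PROC PRINCOMP, which analyzes the correlation matrix by default, and displays only the eigenvalue table. If the second eigenvalue in that table is above the chosen cutoff, a first split is to be expected.

```sas
/* Sketch: inspect the eigenvalues of the correlation matrix of all inputs. */
/* Data set and variable names follow the article's example.                */
ods select Eigenvalues;          /* show only the eigenvalue table */
proc princomp data=imputed;
   var Q1-Q5 VAR1-VAR20;
run;
ods select all;                  /* restore normal output */
```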
proc varclus data=imputed maxeigen=.7 short hi;
var Q1-Q5 VAR1-VAR20;
run;
The MAXEIGEN option specifies the largest permissible value of the second eigenvalue in each cluster (the default value is 1).

The HI option specifies that the clusters at different levels maintain a hierarchical structure. Once a split is made, a variable cannot be transferred to a different cluster.

The SHORT option suppresses some of the output generated by PROC VARCLUS.

Important Points
  1. By default, the maximum number of clusters is equal to the number of variables in the model. The MAXCLUSTERS option can be used to specify the largest number of clusters desired. It is better not to specify the option and to let SAS decide the number of clusters.
  2. By default, PROC VARCLUS uses a non-hierarchical version of this algorithm, in which variables can also be reassigned to other clusters. The HI option is used to run the hierarchical version.
  3. Larger eigenvalue thresholds result in fewer clusters, and smaller thresholds yield more clusters.
  4. Variables belonging to different clusters may still be correlated, because VARCLUS is a type of oblique component analysis.
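As a rough illustration of points 1 and 2, the sketch below caps the number of clusters with MAXCLUSTERS= and omits HI, so variables may be reassigned between clusters. The cap of 5 is an arbitrary value chosen for illustration, and the data set and variable names are those of the earlier example.

```sas
/* Sketch: non-hierarchical run with a hypothetical cap of 5 clusters. */
proc varclus data=imputed maxclusters=5 short;
   var Q1-Q5 VAR1-VAR20;
run;
```

Comparing this output with the MAXEIGEN= run on the same data shows how the two stopping rules differ.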

How to select best variables from each cluster

The best variable in a cluster has a high correlation with its own cluster and a low correlation with the other clusters.
The variable with the lowest 1 - R-squared ratio is likely to be a good representative of its cluster. A low ratio means a high R-squared with its own cluster and a low R-squared with the next closest cluster.
Why lowest 1 - R-squared ratio?

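When a variable has a high R-squared with its own cluster and a low R-squared with the next closest cluster, the numerator of the ratio is small and the denominator is close to 1, so the 1 - R**2 ratio is at its lowest. The formula is:

  1 - R**2 ratio = (1 - R**2 own cluster) / (1 - R**2 next closest cluster)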
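A small worked example with made-up R-squared values shows how the ratio behaves. It assumes the usual definition, (1 - R-squared with own cluster) divided by (1 - R-squared with next closest cluster). The data set and variable names in the DATA step are hypothetical.

```sas
/* Sketch: compute the 1 - R**2 ratio for two hypothetical variables. */
data ratio_demo;
   input name $ rsq_own rsq_next;
   ratio = (1 - rsq_own) / (1 - rsq_next);
   datalines;
good 0.80 0.10
weak 0.40 0.30
;
run;

proc print data=ratio_demo noobs;
run;
```

The first variable gets a ratio of about 0.22 (0.20 / 0.90) and the second about 0.86 (0.60 / 0.70), so the first would be the better representative of its cluster.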


SAS Macro for Variable Selection

%macro varsel(input=, vars= , output =);
ods select none;
ods output clusterquality=summary
           rsquare=clusters;

proc varclus data=&input maxeigen=.7 short hi;
   var &vars;
run;
ods select all;

/* Store the final number of clusters: the last row of SUMMARY wins */
data _null_;
set summary;
call symputx('nvar', NumberOfClusters);
run;

data selvars;
set clusters (where = (NumberOfClusters=&nvar));
keep Cluster Variable RSquareRatio;
run;

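/* Fill down the Cluster label: the view adds a constant BY key for UPDATE */
data cv / view=cv;
retain dummy 1;
set selvars;
keep dummy cluster;
run;

/* UPDATE ignores missing values, so a blank Cluster keeps the last nonmissing label */
data filled;
update cv(obs=0) cv;
by dummy;
set selvars(drop=cluster);
output;
drop dummy;
run;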

proc sort data = filled;
by cluster RSquareRatio;
run;

/* Keep the variable with the lowest 1 - R**2 ratio in each cluster */
data &output;
set filled (rename = (variable = Best_Variables));
by cluster;
if first.cluster then output;
run;

%mend;

%varsel(input= abc, vars= _numeric_ , output = rest);
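Once the macro has run, the selected variables can be collected into a macro variable for later modeling steps. This is a sketch that assumes the output data set is named rest, as in the call above, and that it holds the Best_Variables column created by the macro. The macro variable name keepvars is hypothetical.

```sas
/* Sketch: gather the selected variable names into one macro variable. */
proc sql noprint;
   select Best_Variables
      into :keepvars separated by ' '
      from rest;
quit;

/* Write the list to the log for a quick check */
%put Selected variables: &keepvars;
```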

Same Variable Selection Technique (Varclus) in R
install.packages("Hmisc")
library(Hmisc)
# x: numeric matrix of candidate variables
v = varclus(x, similarity="spear")

7 Responses to "SAS : Proc Varclus Explained"

  1. Excellent Article!!!.I have been looking for this logic.

  2. Great article, simple and to the point without technical jargon clutters. Excellent for someone new to proc varclus looking to get started with it...

  3. Amazing! Easy to follow with clear explanations.

    Thank you!

    Replies
    1. Thank you for your lovely words. Cheers!

  4. Hi,

    Can you suggest how do we perform the same in R?

    Thanks,
    Krishna

  5. Good job Deepanshu ! The 'R-square own cluster' & 'R-square next cluster' is not clear from your article. Specifically I mean 'next' is not explicitly presented here with clarity.

  6. Can't we use proc factor for variables reduction?

