Cluster Analysis using SAS

Deepanshu Bhalla
This tutorial explains how to perform cluster analysis in SAS. It also gives a detailed explanation of the statistical techniques behind cluster analysis, with worked examples. Cluster analysis is mainly used for segmentation and has gained popularity in almost every domain for segmenting customers.

Cluster Analysis

Cluster analysis finds similarities between data objects on the basis of the characteristics found in the data and groups similar objects into clusters. It is an unsupervised learning technique (there is no dependent variable).

Examples of Clustering Applications

Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.

Insurance: Identifying groups of motor insurance policy holders with some interesting characteristics.

Games: Identify player groups on the basis of age group, location and the types of games they have shown interest in previously.

Internet: Clustering webpages based on their content.

Quality of Clustering

A good clustering method produces high-quality clusters with minimum intra-cluster distance (high similarity within a cluster) and maximum inter-cluster distance (low similarity between clusters).

Ways to measure Distance

There are multiple ways to measure distance. The two most popular methods are as follows -

Euclidean distance: sqrt((x2-x1)^2 + (y2-y1)^2)

Manhattan distance: |x2-x1| + |y2-y1|
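
As a quick illustration, the data step below computes both distances between two hypothetical points, (2, 5) and (5, 8), and writes the results to the log:

data _null_;
/* two hypothetical points */
x1 = 2; y1 = 5;
x2 = 5; y2 = 8;
/* Euclidean distance : straight-line distance */
euclidean = sqrt((x2 - x1)**2 + (y2 - y1)**2);
/* Manhattan distance : sum of absolute coordinate differences */
manhattan = abs(x2 - x1) + abs(y2 - y1);
put euclidean= manhattan=;
run;

This prints euclidean of roughly 4.24 and manhattan of 6 in the log.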

Data Preparation
  1. Adequate Sample Size
  2. Remove outliers
  3. Variable Type : Continuous or binary variable
  4. Check Multicollinearity
  5. Standardize Continuous Variables / Transformation

Adequate Sample Size

Sufficient size is needed to ensure representativeness of the population and its underlying structure, particularly small groups within the population.

Minimum group sizes are based on the relevance of each group to the research question and the confidence needed in characterizing that group.

Remove outliers / Percentile Capping

Outliers are observations that fall below Q1 - 1.5(IQR) or above Q3 + 1.5(IQR), where IQR = Q3 - Q1. Another method to handle outliers is percentile capping - cap large values at the 99th percentile (and, similarly, floor small values at the 1st percentile), as shown below.
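
A minimal sketch of percentile capping in SAS, assuming an input data set readin and a single analysis variable v1:

* Get the 1st and 99th percentiles of v1;
proc means data=readin noprint;
var v1;
output out=pctl p1=p1 p99=p99;
run;

* Floor at the 1st percentile and cap at the 99th percentile;
data capped;
if _n_ = 1 then set pctl;
set readin;
if v1 < p1 then v1 = p1;
else if v1 > p99 then v1 = p99;
drop p1 p99 _type_ _freq_;
run;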


Variable Type

Cluster analysis works most appropriately with binary or continuous data (numeric variables). If you have categorical variables (ordinal or nominal data), you have to recode them into binary (0/1) variables.

Another approach for handling categorical data :

We create (k-1) binary variables for a categorical variable with k categories. For example, suppose you have a categorical variable containing 3 categories - Retail, Bank and HR. We create two variables to handle the 3 levels, taking one level as the reference level.

One variable would be Retail - all retail values tagged as 1 and the other two levels as 0
The second variable would be Bank - all bank values tagged as 1 and the other two levels as 0

Note : The HR level is treated as the reference level. It is zero in both variables.

By including Retail and Bank, you capture all three levels, as the sketch below shows.
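
A minimal data step for this dummy coding, assuming a hypothetical character variable dept that holds the three categories:

data dummies;
set readin;
/* a SAS comparison returns 1 (true) or 0 (false) */
/* HR is the reference level : it is 0 in both dummy variables */
retail = (dept = 'Retail');
bank   = (dept = 'Bank');
run;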

Check Multicollinearity

Multicollinearity means variables are highly correlated with each other. In cluster analysis there is no dependent variable, so every variable acts as an input to the distance calculation, and correlations among the inputs still matter.

When variables used in clustering are highly correlated, the dimension they share effectively receives a higher weight than the others.
A correlation coefficient greater than 0.7 between two input variables is an appropriate indicator of multicollinearity.
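
You can inspect the pairwise correlations with PROC CORR and scan the correlation matrix for pairs above 0.7:

proc corr data=readin;
var v1-v14;
run;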
Standardize Continuous Variables

If one variable has a much wider range than others then this variable will tend to dominate. For example, if body measurements had been taken for a number of different people, the range (in mm) of heights would be much wider than the range in wrist circumference (in cm).

If you do not standardize your data, then variables measured on larger scales will dominate the computed dissimilarity and variables measured on smaller scales will contribute very little.

Prior to running cluster analysis, we standardize all the analysis variables (real numeric variables) to a mean of zero and standard deviation of one (i.e., convert them to z-scores).


SAS Code : Standardization

In the code below, input data set is named readin and output data set is named outdata. The analysis variables are V1 through V14.
proc standard data=readin out=outdata mean=0 std=1;
var V1-V14;
run;
Standardization can also be done using PROC STDIZE
proc stdize data=readin out=outdata method=std;
var V1-V14;
run;
If you want to apply the min-max standardization method, which is (X - min) / (max - min), specify METHOD=RANGE in PROC STDIZE.
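
For example, applying min-max scaling to the same variables:

proc stdize data=readin out=outdata method=range;
var V1-V14;
run;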

Alternative Method to Standardize Continuous Variables

When you suspect that the data contain non-convex or non-spherical clusters, you should estimate the within-cluster covariance matrix and use it to transform the data instead of standardizing.

You can use the ACECLUS procedure to transform the data such that the resulting within-cluster covariance matrix is spherical. It computes canonical variables, which are then used in the subsequent analyses. The canonical variables are linear combinations of the original variables.
*p : proportion of pairs used for estimating the within-cluster covariance;
proc aceclus data=readin out=outdata p=.03 noprint;
var v1-v14;
run;

* Run cluster analysis on the transformed data;
ods graphics on;
proc cluster data=outdata method=ward ccc pseudo print=15 outtree=Tree;
var can1 can2 can3;
id srl;
run;
ods graphics off;

Note : The VAR statement specifies that the canonical variables computed in the ACECLUS procedure are used in the cluster analysis. The ID statement specifies that the variable SRL should be added to the Tree output data set.

If the clusters have very different covariance matrices, PROC ACECLUS is not useful. In that case, you can rely on single linkage clustering.
proc cluster data=nonconvex outtree=tree method=SINGLE noprint;
run;


Type of Clustering
  1. K-means Clustering (Flat Clustering)
  2. Hierarchical clustering (Agglomerative clustering)

K-means Clustering

In the k-means clustering algorithm, we take a single input k, the number of clusters to form from the data set. The value of k is defined by the user. Each observation is assigned to the cluster whose center is nearest, with the distances between observations and cluster centers calculated using the Euclidean distance formula.

Steps to perform k-means clustering

1. Choose the number of clusters k

2. Compute the centers of these clusters, i.e., the centroids or cluster seeds (the mean of the points in a cluster). We can take k random objects, or the first k objects in sequence, as the initial centroids.

3. Determine the distance of each object to the centroids using the Euclidean distance:

distance = sqrt((x2-x1)^2 + (y2-y1)^2)
4. Group the object based on minimum distance.

5. Computing New Cluster Seeds - Recompute the centroids (centers) of these clusters by taking mean of all points in each cluster formed above.

6. Repeat steps 3, 4 and 5 until the centroids no longer change (i.e., convergence is reached).


Calculation Steps : How k-means clustering works

Dataset :  A1(2,10), A2(2,5), A3(8,5), B1(5, 8), B2(7,5), B3(6,4), C1(1,2), C2(4,9)

Step 1 : We choose 3 clusters.
Step 2 : The initial cluster centers – means, are (2, 10), (5, 8)  and  (1, 2) - chosen randomly. They are also called cluster seeds.
Step 3 :  We need to calculate the distance between each data point and the cluster centers. For simplicity, this example uses the Manhattan distance.

For two points (x1, y1) and (x2, y2):

Euclidean distance = sqrt((x2-x1)^2 + (y2-y1)^2)
Manhattan distance = |x2-x1| + |y2-y1|

1st Row:

Calculate the distance between data point A2 and the centroids A1, B1, C1.
Distance between A2(2,5) & A1(2,10) = |2-2| + |5-10| = 0+5 = 5
Distance between A2(2,5) & B1(5,8) = |2-5| + |5-8| = 3+3 = 6
Distance between A2(2,5) & C1(1,2) = |2-1| + |5-2| = 1+3 = 4
The nearest cluster center to A2 is C1.

2nd Row:

Calculate the distance between data point A3 and the centroids A1, B1, C1.
Distance between A3(8,5) & A1(2,10) = 11
Distance between A3(8,5) & B1(5,8) = 6
Distance between A3(8,5) & C1(1,2) = 10
The nearest cluster center to A3 is B1.

3rd Row:

Calculate the distance between data point B2 and the centroids A1, B1, C1.
Distance between B2(7,5) & A1(2,10) = 10
Distance between B2(7,5) & B1(5,8) = 5
Distance between B2(7,5) & C1(1,2) = 9
The nearest cluster center to B2 is B1.

4th Row:

Calculate the distance between data point B3 and the centroids A1, B1, C1.
Distance between B3(6,4) & A1(2,10) = 10
Distance between B3(6,4) & B1(5,8) = 5
Distance between B3(6,4) & C1(1,2) = 7
The nearest cluster center to B3 is B1.

5th Row:

Calculate the distance between data point C2 and the centroids A1, B1, C1.
Distance between C2(4,9) & A1(2,10) = 3
Distance between C2(4,9) & B1(5,8) = 2
Distance between C2(4,9) & C1(1,2) = 10
The nearest cluster center to C2 is B1.

Step 4 : Calculate Cluster Seeds (Mean Values)

Cluster A1(2, 10): the only nearby point is A1(2,10) itself, which was the old mean, so the cluster center remains the same.

Cluster B1(5,8): nearby points are B1(5,8), A3(8,5), B2(7,5), B3(6,4), C2(4,9)
B1 mean value = ((5+8+7+6+4)/5, (8+5+5+4+9)/5) = (6, 6.2)

Cluster C1(1,2): nearby points are C1(1,2), A2(2,5)
C1 mean value = ((1+2)/2, (2+5)/2) = (1.5, 3.5)

The updated cluster seeds are : A1(2, 10), B1(6, 6.2), C1(1.5, 3.5)

Step 5 : Go for the next iteration with the updated cluster seeds.

We need to calculate the distance between each of the eight data points and the updated centroids, and reassign every point to its nearest seed.

Cluster A1(2, 10): nearby points are A1(2,10), C2(4,9)
A1 mean value = ((2+4)/2, (10+9)/2) = (3, 9.5)

Cluster B1(6, 6.2): nearby points are B1(5,8), A3(8,5), B2(7,5), B3(6,4)
B1 mean value = ((5+8+7+6)/4, (8+5+5+4)/4) = (6.5, 5.5)

Cluster C1(1.5, 3.5): nearby points are C1(1,2), A2(2,5)
C1 mean value = ((1+2)/2, (2+5)/2) = (1.5, 3.5)

The updated cluster seeds are : A1(3, 9.5), B1(6.5, 5.5), C1(1.5, 3.5)

Because the iteration 2 seeds are not identical to the iteration 1 seeds, we go on to iteration 3.

Step 6 : Check Convergence

We keep reassigning points and recomputing seeds in the same way. Once the cluster seeds no longer change between two consecutive iterations, we stop - the algorithm has converged.
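
You can reproduce this toy example with PROC FASTCLUS. Note that FASTCLUS minimizes Euclidean (least-squares) distances and chooses its own initial seeds, so its intermediate steps (and possibly the final seeds) can differ from the Manhattan-distance hand calculation above:

* The eight points from the worked example;
data points;
input id $ x y;
datalines;
A1 2 10
A2 2 5
A3 8 5
B1 5 8
B2 7 5
B3 6 4
C1 1 2
C2 4 9
;
run;

* k-means with k = 3, run to full convergence;
proc fastclus data=points maxclusters=3 maxiter=100 converge=0 out=clustered;
var x y;
run;

* Inspect the cluster assignment of each point;
proc print data=clustered;
run;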

Limitations of k-means clustering
  1. The number of clusters k must be specified before running k-means clustering.
  2. Sensitive to outliers and noise, because the cluster center is a mean.
  3. When the data set is small, the initial choice of seeds can influence the final clusters significantly.

Determine the number of clusters in k-means Clustering

Run the k-means clustering code multiple times with different numbers of clusters and look for consensus between the two statistics - that is, local peaks of the CCC and pseudo-F statistic.

1. Pseudo F

Look for Pseudo F to increase to a maximum as we increment the number of clusters by 1, and then observe when the Pseudo F starts to decrease. At that point we take the number of clusters at the (local) maximum.

2. Cubic Clustering Criterion (CCC)

Look for CCC to increase to a maximum as we increment the number of clusters by 1, and then observe when the CCC starts to decrease. At that point we take the number of clusters at the (local) maximum.

Note :
  • A peak value of the CCC greater than 2 or 3 indicates good clustering.
  • A peak value of the CCC between 0 and 2 indicates possible clusters but should be interpreted cautiously.


SAS Code
proc fastclus data=sashelp.iris maxclusters=2 maxiter=100 converge=0
mean=mean out=prelim;
var petal: sepal:;
run;
Note : If you want complete convergence (i.e. no relative change in cluster seeds), set CONVERGE=0 and a large value for the MAXITER= option.

Explanation of the above code
The MAXCLUSTERS= option tells SAS the maximum number of clusters to form with the k-means algorithm.
The MEAN=[SAS-data-set] option creates an output data set mean that contains the cluster means and other statistics for each cluster.

The OUT=[SAS-data-set] option creates an output data set that contains the original variables and two new variables, cluster and distance. The variable cluster contains the cluster identification number to which each observation has been assigned. The variable distance contains the distance from the observation to its cluster seed.

The next thing we want to do is to look at the cluster means so that we can characterize the individual clusters.
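
For example, a quick way to profile the clusters, using the mean and prelim data sets created above:

* Cluster means and statistics saved by the MEAN= option;
proc print data=mean;
run;

* Means of the clustering variables by assigned cluster;
proc means data=prelim mean;
class cluster;
var petal: sepal:;
run;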


SAS Macro for k-means clustering

This macro helps us to run the fastclus code multiple times and look for consensus among the two statistics—that is, local peaks of the CCC and pseudo-F statistic.

Another metric to check is the R-square value. R-square increases as more clusters are specified. The optimal number of clusters would be where R-square reaches a local maximum and then starts falling or stops increasing appreciably. However, Milligan and Cooper demonstrated that changes in R-square are not very useful for estimating the number of clusters, though it may be useful if you are interested solely in data reduction.

The macro variable K represents the number of clusters defined in the PROC FASTCLUS procedure.

/*Run the fastclus multiple times*/
%macro kmean(K);

proc fastclus data=readin out=outdata&K. maxclusters= &K. maxiter=100 converge=0;
var v1-v14;
run;

%mend;

%kmean(3);
%kmean(4);
%kmean(5);
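To avoid reading the CCC and pseudo-F off the printed output for every run, you can also save the clustering statistics to a data set with the OUTSTAT= option of PROC FASTCLUS. A sketch; inspect the _TYPE_ column of the resulting data set to locate the CCC and pseudo-F rows:

/* Variant of the macro that also saves the statistics for each K */
%macro kmeanstat(K);

proc fastclus data=readin out=outdata&K. outstat=stat&K.
maxclusters=&K. maxiter=100 converge=0;
var v1-v14;
run;

%mend;

%kmeanstat(3);

proc print data=stat3;
run;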
Next Step : Visualize Clusters

Suppose the optimal number of clusters comes out to be 4. We need to check whether or not the clusters overlap with each other in terms of their location in the 14-dimensional space defined by the clustering variables. It is not possible to visualize clusters in 14 dimensions. To work around this problem, we can use canonical discriminant analysis, a data reduction technique that creates a smaller number of variables that are linear combinations of the 14 clustering variables. The new variables, called canonical variables, are ordered in terms of the proportion of variance in the clustering variables that is accounted for by each of them. So the first canonical variable accounts for the largest proportion of the variance.
Canonical discriminant analysis finds linear combinations of the numeric variables that provide maximal separation between classes or groups.
SAS : Canonical Discriminant Analysis
* Canonical Discriminant Analysis;
proc candisc data=outdata4 out=egclustcan;
class cluster;
var v1-v14;
run;
* Plot the two canonical variables generated from PROC CANDISC, can1 and can2;
proc sgplot data=egclustcan;
scatter y=can2 x=can1 / group=cluster;
run; 


Hierarchical clustering
  1. Each observation begins in a cluster by itself 
  2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one less cluster.
  3. Compute distances (similarities) between the new cluster and each of the old clusters.
  4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.

Clustering algorithms
  1. Single linkage or nearest neighbor – the similarity between two clusters is the shortest distance between any object in one cluster and any object in the other. It is widely used and very flexible, and it can define a wide range of clustering patterns, but it can create problems when clusters are poorly delineated.
  2. Complete linkage or farthest neighbor – cluster similarity is based on the maximum distance between observations in each cluster; the similarity between two clusters is the smallest circle that could encompass both of them. It eliminates some of the problems of single linkage and has been found to generate the most compact clustering solutions.
  3. Centroid method – the similarity between two clusters is the distance between their centroids. It can produce confusing results.
  4. Ward's Method – the measure of similarity is the sum of squares within the clusters, summed over all variables; the merge that yields the smallest value is retained.
Basically, Ward's Method treats cluster analysis as an analysis of variance problem, instead of using distance metrics or measures of association.

Determine the number of clusters in Hierarchical Clustering

Look for consensus among the three statistics - that is, local peaks of the CCC and pseudo-F statistic combined with a small value of the pseudo-T2 statistic, followed by a quickly increasing pseudo-T2 value at the next cluster fusion. These criteria are appropriate only for compact or slightly elongated clusters, preferably clusters that are roughly multivariate normal.

1. Pseudo-T2 statistic

Look for the first relatively large value, then move up one cluster (the clustering in step k+1 is selected as the optimal solution).

2. Pseudo-F statistic

Look for the pseudo-F to increase to a maximum as we increment the number of clusters by 1, and then observe when the pseudo-F starts to decrease. At that point we take the number of clusters at the (local) maximum.

3. Cubic Clustering Criterion (CCC)

Look for the CCC to increase to a maximum as we increment the number of clusters by 1, and then observe when the CCC starts to decrease. At that point we take the number of clusters at the (local) maximum.
  • A peak of the CCC greater than 2 or 3 indicates good clustering.
  • A peak of the CCC between 0 and 2 indicates possible clusters but should be interpreted cautiously.
  • If the CCC increases continually as the number of clusters increases, the distribution may be grainy or the data may have been excessively rounded or recorded with just a few digits.

Dendrogram

A dendrogram is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering. In step 0, each observation begins in a cluster by itself. In each successive step, the closest (most similar) pair of clusters is found and merged into a single cluster.



How to interpret a Dendrogram

Reading the dendrogram above: data points 4 and 5 are more similar to each other than either is to data point 3. In addition, data points 1 and 2 are more similar to each other than 4 and 5 are to 3.
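
In SAS, you can draw the dendrogram by running PROC TREE on the OUTTREE= data set produced by PROC CLUSTER. A minimal sketch, assuming a tree data set named tree; the HORIZONTAL option simply orients the tree sideways for readability:

proc tree data=tree horizontal;
run;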

Limitation of Hierarchical clustering

It is a slow process, so it is not feasible for large data sets.

SAS Code
ods graphics on;
proc cluster data=sashelp.iris method=centroid ccc pseudo outtree=tree;
var petal: sepal:;
copy species;
run;
ods graphics off;

Criteria for Number of Clusters

Based on the CCC, pseudo-F and pseudo-T2 statistics printed above, suppose we choose 3 clusters. PROC TREE cuts the dendrogram accordingly:
proc tree data=tree noprint nclusters=3 out=out;
copy petal: sepal: species;
run;
proc candisc data=out out=cluscan distance anova;
class cluster;
var petal: sepal:;
run;
proc sgplot data=cluscan;
scatter y=can2 x=can1 / group=cluster;
run;

Visualize Clusters

The scatter plot of the first two canonical variables (CAN1 vs CAN2), grouped by cluster, shows how well the clusters are separated.
PROC CLUSTER

The METHOD= option determines the clustering method used by the procedure. Here, we are using the CENTROID method.

The CCC and PSEUDO options display the CCC, pseudo-F and pseudo-T2 statistics.

The OUTTREE=[Dataset Name] option creates an output data set that contains the results of the hierarchical clustering as a tree structure.

The VAR statement lists the numeric variables to be used in the cluster analysis. If you omit the VAR statement, all numeric variables not listed in other statements are used. The colon (:) selects all variables whose names begin with the prefix preceding it.

The COPY statement specifies extra variables to be copied to the OUTTREE= data set.


PROC TREE

The PROC TREE procedure creates a data set containing a variable cluster that holds the cluster identification number to which each observation has been assigned.

The NCLUSTERS= option specifies the number of clusters desired in the OUT= data set.

The COPY statement specifies extra variables to be copied to the OUT= data set.


Interpretation of Results

Semipartial R-squared

Semipartial R-squared is a measure of the homogeneity of merged clusters: it is the loss of homogeneity due to combining two groups or clusters to form a new group or cluster. Thus, the SPRSQ value should be small, implying that we are merging two homogeneous groups.

R-squared

R-squared measures the extent to which groups or clusters are different from each other (so, when you have just one cluster, the RSQ value is, intuitively, zero). Thus, the RSQ value should be high, or as close to 1 as possible, as it explains the proportion of variance accounted for by the clusters. It is an important measure for evaluating the quality of clusters.


Cubic Clustering Criterion

The Cubic Clustering Criterion (CCC) is a comparative measure of the deviation of the clusters from the distribution expected if data points were obtained from a uniform distribution.

Larger positive values of the CCC indicate a better solution, as they show a larger difference from a uniform (no clusters) distribution. However, the CCC may be misleading if the clustering variables are highly correlated.


Pseudo-F statistic

The pseudo-F statistic is intended to capture the 'tightness' of clusters, and is in essence a ratio of the mean sum of squares between groups to the mean sum of squares within a group.
Larger values of the pseudo-F usually indicate a better clustering solution. If the pseudo-F increases with k up to a maximum value and then decreases, the value of k at the maximum (or immediately prior to it) is a candidate for the number of clusters.

Centroid Distance

Centroid Distance is simply the Euclidean distance between the centroids of the two clusters that are to be joined or merged. So, Centroid Distance is a measure of the homogeneity of merged clusters, and the value should be small.

Best Approach : Combination of both techniques

First use a hierarchical technique to generate a complete set of cluster solutions and establish the appropriate number of clusters. Then use a k-means (nonhierarchical) method with that number of clusters, as sketched below.
One should analyze and examine the profile of each defined cluster. Clusters with a small number of observations should be thoroughly examined - do they represent valid segments or simply outliers?
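
Here is a sketch of the combined approach in SAS. It assumes the standardized data set outdata with variables v1-v14, and that the hierarchical step suggested 4 clusters; the hierarchical cluster means are passed to PROC FASTCLUS as initial seeds through its SEED= option:

* Step 1 : hierarchical clustering to establish the number of clusters;
proc cluster data=outdata method=ward ccc pseudo outtree=tree;
var v1-v14;
run;

* Cut the dendrogram at the chosen number of clusters (4 here);
proc tree data=tree noprint nclusters=4 out=hcut;
copy v1-v14;
run;

* Compute the mean of each hierarchical cluster to use as k-means seeds;
proc means data=hcut noprint nway;
class cluster;
var v1-v14;
output out=seeds mean=;
run;

* Step 2 : k-means clustering starting from those seeds;
proc fastclus data=outdata seed=seeds maxclusters=4 maxiter=100 out=final;
var v1-v14;
run;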