Cluster Analysis with R


Cluster Analysis
Cluster analysis finds similarities between data objects on the basis of their characteristics and groups similar objects into clusters. It is an unsupervised learning technique (there is no dependent variable).

Examples of Clustering Applications

Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.

Insurance: Identifying groups of motor insurance policy holders with shared characteristics, such as a high average claim cost.

Games: Identifying player groups on the basis of their age group, location and the types of games they have previously shown interest in.

Internet: Clustering webpages based on their content.

Quality of Clustering

A good clustering method produces high quality clusters with minimum within-cluster distance (high intra-cluster similarity) and maximum between-cluster distance (low inter-cluster similarity).

Ways to measure Distance

Euclidean distance: √((x2-x1)² + (y2-y1)²)

Manhattan distance: |x2-x1| + |y2-y1|
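
In R, the dist() function computes both measures; a minimal illustration on two arbitrary 2-D points:

pts <- rbind(c(1, 2), c(4, 6))    # two example points (values are arbitrary)
dist(pts, method = "euclidean")   # sqrt((4-1)^2 + (6-2)^2) = 5
dist(pts, method = "manhattan")   # |4-1| + |6-2| = 7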

Data Preparation
  1. Ensure an adequate sample size
  2. Standardize continuous variables
  3. Remove outliers
  4. Check variable types: continuous or binary variables
  5. Check multicollinearity (a short R sketch of these steps follows the list)
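
A minimal R sketch of steps 2 to 5, using the iris measurements as a stand-in dataset (the z-score cutoff of 3 is an illustrative convention, not a fixed rule):

df <- iris[, -5]    # keep the continuous variables only

# Step 2 : standardize to mean 0 and standard deviation 1
df <- scale(df)

# Step 3 : flag potential outliers - rows with any |z-score| above 3
outlier_rows <- apply(abs(df) > 3, 1, any)
df <- df[!outlier_rows, ]

# Step 5 : check multicollinearity via the correlation matrix;
# highly correlated variables can dominate the distance calculations
round(cor(df), 2)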

Types of Clustering
  1. K-means Clustering (Flat Clustering)
  2. Hierarchical clustering (Agglomerative clustering)

Detailed Theoretical Explanation - How Cluster Analysis works

Assess clustering tendency (clusterability)

It is important to assess clustering tendency (i.e. to determine whether the data contain meaningful clusters) before running any clustering algorithm. In unsupervised learning, clustering methods will return clusters even if the data do not contain any meaningful cluster pattern.

The Hopkins statistic is used to assess the clustering tendency of a dataset by measuring the probability that the dataset was generated by a uniform distribution (i.e. that it contains no meaningful clusters).

The null and the alternative hypotheses are defined as follows:
  1. Null hypothesis: the dataset is uniformly distributed (i.e., no meaningful clusters)
  2. Alternative hypothesis: the dataset is not uniformly distributed (i.e., contains meaningful clusters)
If the value of the Hopkins statistic is close to 0, the data is highly clusterable. If the value is close to 0.5, the data contains no meaningful clusters.
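
In the convention used by the clustertend package (applied in the R code below), the statistic can be written in plain notation as

H = Σwᵢ / (Σuᵢ + Σwᵢ)

where wᵢ are the nearest-neighbour distances of points sampled from the dataset and uᵢ are the nearest-neighbour distances of the same number of points generated uniformly over the data space. Clustered data makes the wᵢ small relative to the uᵢ, pulling H towards 0. Note that some references define the statistic the other way round, so that values close to 1 indicate clusterability.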

Determine the optimal number of clusters

In R, the "NbClust" package provides 30 indices for determining the optimal number of clusters. Two widely used methods among them are as follows -
  1. Look for a bend or "elbow" in the scree plot of the within-cluster sum of squared errors (SSE) against the number of clusters. The location of the elbow suggests a suitable number of clusters for k-means.
  2. Silhouette analysis measures how well each observation fits its assigned cluster by comparing its average distance to the points in its own cluster with its average distance to the points in the nearest neighbouring cluster. A higher average silhouette width indicates a better choice for the number of clusters; observations with a negative width are probably placed in the wrong cluster.
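
For reference, the standard quantities behind these two methods, in plain notation:

SSE = Σ (over clusters) Σ (over points x in a cluster) ||x - c||², where c is the centroid of the cluster.

s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is observation i's average distance to the other points in its own cluster and b(i) is its average distance to the points in the nearest neighbouring cluster; s(i) ranges from -1 to 1.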

Important Steps : Cluster Analysis
  1. Select important variables for the analysis
  2. Check for and remove outliers
  3. Standardize the variables
  4. Assess clusterability
  5. Select the right clustering algorithm and the optimal number of clusters - internal validation (clValid package; see the sketch below)
  6. External validation (if true labels are available)
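
Step 5 can be illustrated with the clValid package; the following is a minimal sketch, assuming the scaled iris measurements (species column dropped) as input:

if(!require(clValid)) install.packages("clValid")
library(clValid)

df <- scale(iris[, -5])

# Compare k-means, hierarchical and PAM over 2 to 6 clusters using
# internal measures (connectivity, Dunn index, average silhouette width)
intern <- clValid(df, nClust = 2:6,
                  clMethods = c("kmeans", "hierarchical", "pam"),
                  validation = "internal")
summary(intern)
optimalScores(intern)    # best score and method/cluster count per measure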

R Code : Cluster Analysis
Note : Selecting the best clustering algorithm and the optimal number of clusters can be automated with the clValid package, as sketched in the previous section.
# Loading data
data <- iris[, -5]

# Standardize the variables
data <- scale(data)

# Assessing cluster tendency
if(!require(clustertend)) install.packages("clustertend")
library(clustertend)
# Compute Hopkins statistic for the dataset
# (set.seed makes the result reproducible: hopkins() samples points at random)
set.seed(123)
hopkins(data, n = nrow(data)-1)
# The Hopkins statistic H = 0.1815, far below the 0.5 threshold, so the data is highly clusterable

###########################################################################
####################### K Means clustering ################################
###########################################################################

# K-mean - Determining optimal number of clusters
# NbClust Package : 30 indices to determine the number of clusters in a dataset
# If index = 'all' - run 30 indices to determine the optimal no. of clusters
# If index = "silhouette" - It is a measure to estimate the dissimilarity between clusters.
# A higher silhouette width is preferred to determine the optimal number of clusters

if(!require(NbClust)) install.packages("NbClust")
library(NbClust)
nb <- NbClust(data,  distance = "euclidean", min.nc=2, max.nc=15, method = "kmeans",
              index = "silhouette")
nb$All.index
nb$Best.nc

# Method II : the same silhouette-width analysis with the fpc package
if(!require(fpc)) install.packages("fpc")
library(fpc)
pamkClus <- pamk(data, krange = 2:15, criterion="multiasw", ns=2, critout=TRUE)
pamkClus$nc
cat("number of clusters estimated by optimum average silhouette width:", pamkClus$nc, "\n")

# Method III : Scree plot (elbow method) to determine the number of clusters
set.seed(123)    # kmeans uses random starting centroids
wss <- (nrow(data)-1)*sum(apply(data,2,var))    # total SS, i.e. wss for k = 1
for (i in 2:15) {
  wss[i] <- sum(kmeans(data,centers=i)$withinss)
}
plot(1:15, wss, type="b", xlab="Number of Clusters",ylab="Within groups sum of squares")

# K-Means Cluster Analysis
set.seed(123)
fit <- kmeans(data, pamkClus$nc, nstart = 25)    # nstart = 25 random starts for a stable solution

# get cluster means
aggregate(data,by=list(fit$cluster),FUN=mean)

# append cluster assignment to a copy, so 'data' stays a plain numeric matrix
# for the hierarchical clustering below
kmeans_out <- data.frame(data, clusterid = fit$cluster)
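
# Optional: a quick visual check of the k-means solution. plotcluster() comes
# from the fpc package loaded above and projects the data onto discriminant
# coordinates (a sketch for inspection only)
plotcluster(data, fit$cluster)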

###########################################################################
####################### Hierarchical clustering############################
###########################################################################

# Hierarchical clustering - Determining optimal number of clusters
# index = "kl" - the Krzanowski-Lai index, one of the 30 NbClust indices
library(NbClust)
res <- NbClust(data, diss=NULL, distance = "euclidean", min.nc=2, max.nc=6,
               method = "ward.D2", index = "kl")
res$All.index
res$Best.nc

# Ward Hierarchical Clustering
d <- dist(data, method = "euclidean")
fit <- hclust(d, method="ward.D2")
plot(fit) # display the dendrogram

# cluster assignment (members) - cut the tree into 2 clusters
groups <- cutree(fit, k=2)
hclust_out <- data.frame(data, clusterid = groups)

# draw dendrogram with red borders around the 2 clusters
rect.hclust(fit, k=2, border="red")
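
# External validation sketch (step 6): the true iris species labels are
# available here, so we can cross-tabulate them against the cluster assignment
table(iris$Species, groups)

# adjustedRandIndex() from the mclust package is one common agreement measure
# (values near 1 indicate strong agreement between labels and clusters)
if(!require(mclust)) install.packages("mclust")
library(mclust)
adjustedRandIndex(groups, iris$Species)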
Next Step : Validate Cluster Analysis 