**Cluster Analysis**

Cluster analysis finds similarities between data objects based on their characteristics and groups similar objects into clusters. It is an unsupervised learning technique (there is no dependent variable).

**Examples of Clustering Applications**

**Marketing:** Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.

**Insurance:** Identify groups of motor insurance policy holders with some interesting characteristics.

**Games:** Identify player groups on the basis of their age group, location and the types of games they have shown interest in.

**Internet:** Cluster webpages based on their content.

**Quality of Clustering**

**Ways to measure Distance**

Manhattan distance: |x2-x1| + |y2-y1|
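As a quick check of the formula above, here is a minimal sketch that computes the Manhattan distance by hand and with base R's `dist()`. The two points `p1` and `p2` are made-up values used only for illustration.

```r
# Two illustrative points (hypothetical values, not from any dataset)
p1 <- c(x = 1, y = 2)
p2 <- c(x = 4, y = 6)

# Manhattan distance by hand: |x2 - x1| + |y2 - y1|
manual <- sum(abs(p2 - p1))

# Same distance using base R's dist() with method = "manhattan"
via_dist <- as.numeric(dist(rbind(p1, p2), method = "manhattan"))

print(manual)
print(via_dist)
```

Both values should agree, since `dist()` sums the absolute coordinate differences when `method = "manhattan"` is used.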

**Data Preparation**

- Adequate Sample Size
- Standardize Continuous Variables
- Remove outliers
- Variable Type : Continuous or binary variable
- Check Multicollinearity
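Two of the checks above (standardizing and multicollinearity) can be sketched in base R as follows. This is only an illustration on the built-in iris data; the 0.9 correlation cutoff and the object names (`x_scaled`, `cor_mat`, `high`) are assumptions chosen for the example, not fixed rules.

```r
# Numeric columns of the built-in iris data
x <- iris[, 1:4]

# Standardize: each column gets mean 0 and standard deviation 1
x_scaled <- scale(x)
col_means <- colMeans(x_scaled)
col_sds <- apply(x_scaled, 2, sd)

# Multicollinearity check: flag variable pairs with |correlation| above
# an illustrative cutoff of 0.9 (the cutoff is an assumption)
cor_mat <- cor(x)
high <- which(abs(cor_mat) > 0.9 & upper.tri(cor_mat), arr.ind = TRUE)

# Print the flagged pairs by variable name
print(data.frame(var1 = rownames(cor_mat)[high[, "row"]],
                 var2 = colnames(cor_mat)[high[, "col"]]))
```

Highly correlated variables effectively get counted twice in a distance calculation, so you may want to drop one variable from each flagged pair before clustering.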

**Types of Clustering**

- K-means Clustering (Flat Clustering)
- Hierarchical clustering (Agglomerative clustering)

**Detailed Theoretical Explanation - How Cluster Analysis works**

**Assess clustering tendency (clusterability)**

It is important to assess clustering tendency (i.e. to determine whether the data contain meaningful clusters) before running any clustering algorithm. In unsupervised learning, clustering methods return clusters even if the data do not contain any meaningful cluster pattern.

**Hopkins statistic** is used to assess the clustering tendency of a dataset by measuring the probability that a given dataset is generated by a uniform data distribution (i.e. no meaningful clusters).

The null and the alternative hypotheses are defined as follows:

- **Null hypothesis:** the dataset is uniformly distributed (i.e., no meaningful clusters)
- **Alternative hypothesis:** the dataset is not uniformly distributed (i.e., contains meaningful clusters)

If the value of the Hopkins statistic is close to 0, the data is highly clusterable. If the value is close to 0.5, the data contains no meaningful clusters. (This is the convention followed by the `hopkins()` call in the R code in this post; some other implementations report the complement, so that values near 1 indicate clusterable data.)
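To make the idea concrete, here is a simplified base-R sketch of a Hopkins-style statistic. It is not the implementation used by any package: the function name `hopkins_sketch` and its argument `m` are hypothetical, distances are not raised to any power, and it follows this post's convention that values near 0 indicate clusterable data.

```r
# Simplified Hopkins-style statistic (illustrative sketch only)
hopkins_sketch <- function(x, m = 30) {
  x <- as.matrix(x)
  n <- nrow(x)
  p <- ncol(x)

  # w: for m sampled real points, distance to the nearest OTHER real point
  idx <- sample(n, m)
  d_real <- as.matrix(dist(x))
  diag(d_real) <- Inf
  w <- apply(d_real[idx, , drop = FALSE], 1, min)

  # m artificial points drawn uniformly from the bounding box of the data
  mins <- apply(x, 2, min)
  maxs <- apply(x, 2, max)
  u_pts <- matrix(runif(m * p, rep(mins, each = m), rep(maxs, each = m)),
                  nrow = m)

  # u: for each artificial point, distance to the nearest real point
  u <- apply(u_pts, 1, function(pt) min(sqrt(colSums((t(x) - pt)^2))))

  # Clustered data: real points sit close together, so w is small
  # relative to u and the ratio moves toward 0
  sum(w) / (sum(u) + sum(w))
}

set.seed(123)
h <- hopkins_sketch(scale(iris[, 1:4]))
print(h)
```

For data spread uniformly over its bounding box, `w` and `u` tend to be of similar size, so the ratio stays around 0.5.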

**Determine the optimal number of clusters**

In R, the **"NbClust"** package provides 30 indices to determine the optimal number of clusters. Two important methods are:

- **Look for a bend or elbow in the sum of squared error (SSE) scree plot.** The location of the elbow in the plot suggests a suitable number of clusters for k-means.
- **Silhouette analysis** measures how well an observation is clustered and it estimates the average distance between clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters. **A higher silhouette width is preferred to determine the optimal number of clusters.** Observations with a negative width are probably placed in the wrong cluster.
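Both ideas can be tried quickly without NbClust. The sketch below uses `silhouette()` from the `cluster` package (a recommended package that typically ships with R) and base `kmeans()`. The range `2:6`, the `nstart = 25` setting and the object names are assumptions chosen for illustration.

```r
library(cluster)  # provides silhouette()

x <- scale(iris[, 1:4])
d <- dist(x)
ks <- 2:6  # candidate numbers of clusters (illustrative range)

set.seed(123)

# Average silhouette width for each k: higher is better
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(x, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- ks

# Total within-cluster sum of squares for each k: look for the elbow
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)
names(wss) <- ks

# k with the highest average silhouette width
best_k <- ks[which.max(avg_sil)]

print(round(avg_sil, 3))
print(round(wss, 1))
print(best_k)
```

The silhouette criterion and the elbow do not always agree, so it is worth looking at both before settling on a number of clusters.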

**Important Steps : Cluster Analysis**

- Select important variables for analysis
- Check and remove outliers
- Standardize variables
- Assess clusterability
- Select the right clustering algorithm and optimal number of clusters, using:
  - **Internal Validation (clValid Package)**
  - **External Validation (if true label available)**

**R Code : Cluster Analysis**

**Note:** We can automate selecting the best clustering algorithm and optimal number of clusters with the clValid package.

# Loading data

data<-iris[,-c(5)]

# To standardize the variables

data = scale(data)

# Assessing cluster tendency

if(!require(clustertend)) install.packages("clustertend")

library(clustertend)

# Compute Hopkins statistic for the dataset (set.seed makes the random sampling reproducible)

set.seed(123)

hopkins(data, n = nrow(data)-1)

# The H value is 0.1815, which is far below the 0.5 threshold, so the data is highly clusterable

###########################################################################

####################### K Means clustering ################################

###########################################################################

# K-means - Determining optimal number of clusters

# NbClust Package : 30 indices to determine the number of clusters in a dataset

# If index = 'all' - run 30 indices to determine the optimal no. of clusters

# If index = "silhouette" - It is a measure to estimate the dissimilarity between clusters.

# A higher silhouette width is preferred to determine the optimal number of clusters

if(!require(NbClust)) install.packages("NbClust")

library(NbClust)

nb <- NbClust(data, distance = "euclidean", min.nc=2, max.nc=15, method = "kmeans",

index = "silhouette")

nb$All.index

nb$Best.nc

#Method II : Same Silhouette Width analysis with fpc package

library(fpc)

pamkClus <- pamk(data, krange = 2:15, criterion="multiasw", ns=2, critout=TRUE)

pamkClus$nc

cat("number of clusters estimated by optimum average silhouette width:", pamkClus$nc, "\n")

#Method III : Scree plot to determine the number of clusters

wss <- (nrow(data)-1)*sum(apply(data,2,var))

for (i in 2:15) {

wss[i] <- sum(kmeans(data,centers=i)$withinss)

}

plot(1:15, wss, type="b", xlab="Number of Clusters",ylab="Within groups sum of squares")

# K-Means Cluster Analysis

fit <- kmeans(data,pamkClus$nc)

# get cluster means

aggregate(data,by=list(fit$cluster),FUN=mean)

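# append cluster assignment (stored in a separate object so the k-means cluster id is not used as a variable in the hierarchical clustering below)

kmeans_data <- data.frame(data, clusterid=fit$cluster)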

###########################################################################

####################### Hierarchical clustering ###########################

###########################################################################

# Hierarchical clustering - Determining optimal number of clusters

library(NbClust)

res<-NbClust(data, diss=NULL, distance = "euclidean", min.nc=2, max.nc=6,

method = "ward.D2", index = "kl")

res$All.index

res$Best.nc

# Ward Hierarchical Clustering

d <- dist(data, method = "euclidean")

fit <- hclust(d, method="ward.D2")

plot(fit) # display dendrogram

# cluster assignment (members)

groups <- cutree(fit, k=2)

data = cbind(data,groups)

# draw dendrogram with red borders around the 2 clusters

rect.hclust(fit, k=2, border="red")

**Next Step : Validate Cluster Analysis**

Hi folks,

In the beginning you wrote code to standardize the variables, but it's a little wrong and it's not working. I don't think there is any function like Data, but there is a function like scale.

Kindly help.

That's a typo. Corrected! Thank you for bringing this issue to my attention.

Why did you put set.seed(123)? Can you please explain?