Data should be in a tabular format where rows are observations and columns are variables.
No missing observations.
Data must be standardized, i.e. each variable scaled as \(\frac{x_i-\bar{x}}{s}\), where \(\bar{x}\) is the variable's mean and \(s\) its standard deviation.
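For example, a quick base-R check (a small sketch using the Murder column of USArrests purely for illustration) that scale() applies exactly this transformation:

# Manual standardization vs. scale()
x <- USArrests$Murder
manual <- (x - mean(x)) / sd(x)            # (x_i - mean) / sd
scaled <- scale(USArrests$Murder)[, 1]     # scale() returns a matrix; take its first column
all.equal(manual, as.numeric(scaled))      # should be TRUE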
Here we’ll mainly use the following R packages: cluster, for computing clustering algorithms, and factoextra, for elegant ggplot2-based visualization of clustering results. factoextra contains many functions for cluster analysis and visualization, including:
Functions | Descriptions
---|---
dist(fviz_dist, get_dist) | Distance Matrix Computation and Visualization
get_clust_tendency | Assessing Clustering Tendency
fviz_nbclust(fviz_gap_stat) | Determining the Optimal Number of Clusters
fviz_dend | Enhanced Visualization of Dendrograms
fviz_cluster | Visualize Clustering Results
fviz_mclust | Visualize Model-based Clustering Results
fviz_silhouette | Visualize Silhouette Information from Clustering
hcut | Computes Hierarchical Clustering and Cuts the Tree
hkmeans | Hierarchical k-means clustering
eclust | Visual enhancement of clustering analysis
install.packages(c("cluster", "factoextra"))
The classification of observations into groups requires some method for computing the distance or the (dis)similarity between each pair of observations. The result of this computation is known as a dissimilarity or distance matrix. The choice of distance measure is a critical step in clustering: it defines how the similarity of two elements (x, y) is calculated, and it influences the shape of the clusters.
There are many methods to calculate this distance information. The most common distance measures are described below. Let \(x^{n\times 1}\) and \(y^{n\times 1}\) be two vectors.
Euclidean distance:
\[d_{euc}(x,y)=\sqrt{{\sum}_{i=1}^n (x_i-y_i)^2}\]
Manhattan distance:
\[d_{man}(x,y)={\sum}_{i=1}^n \mid x_i-y_i\mid\]
Correlation-based distances are widely used for gene expression data analyses. A correlation-based distance is defined by subtracting the correlation coefficient from 1. Different types of correlation methods can be used, such as:

Pearson correlation distance:
\[d_{cor}(x,y)=1-r_{xy} \quad \text{where}\; r_{xy}\; \text{is the Pearson correlation between}\; x \;\text{and}\; y\]
Pearson correlation measures the degree of a linear relationship between two profiles.
Eisen cosine correlation distance: a special case of Pearson’s correlation with \(\bar{x}\) and \(\bar{y}\) both replaced by zero:
\[d_{eisen}(x,y)=1-\frac{\mid{\sum}_{i=1}^n x_i y_i \mid}{\sqrt{{\sum}_{i=1}^n x_i^2 {\sum}_{i=1}^n y_i^2}}\]
Spearman correlation distance: the Spearman correlation method computes the correlation between the rank of \(x\) and the rank of \(y\).
\[d_{spear}(x,y)=1-r_{spear}(x,y) \quad \text{where}\; r_{spear}(x,y)\; \text{is the Spearman correlation between}\; x \;\text{and}\; y\]
Kendall correlation distance: the Kendall correlation method computes the correlation between the rank of \(x\) and the rank of \(y\).
\[d_{kend}(x,y)=1-r_{kend}(x,y) \quad \text{where}\; r_{kend}(x,y)\; \text{is the Kendall correlation between}\; x \;\text{and}\; y\]
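As a quick sanity check, the formulas above can be reproduced by hand and compared with dist() and get_dist() (a minimal sketch with two toy vectors; the object names are illustrative):

library(factoextra)
x <- c(1, 2, 3, 4)
y <- c(2, 4, 1, 3)
m <- rbind(x, y)

# Euclidean and Manhattan distances: by hand vs. dist()
sqrt(sum((x - y)^2));  dist(m, method = "euclidean")
sum(abs(x - y));       dist(m, method = "manhattan")

# Pearson correlation distance: 1 - r vs. get_dist()
1 - cor(x, y);         get_dist(m, method = "pearson")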
NOTE
- Pearson correlation is used for two quantitative continuous variables that have a linear relationship.
- Spearman correlation is used for two quantitative variables when the relationship is only partially linear, or for one qualitative ordinal variable and one quantitative variable.
- Kendall correlation is often used for two qualitative ordinal variables.
The choice of distance measures is very important. For most common clustering software, the default distance measure is the Euclidean distance but depending on the type of the data other dissimilarity measures might be preferred. For example, correlation-based distance is often used in gene expression data analysis.
Correlation-based distance considers two objects to be similar if their features are highly correlated, even though the observed values may be far apart in terms of Euclidean distance.
The distance between two objects is 0 when they are perfectly correlated.
Pearson’s correlation is quite sensitive to outliers. This does not matter when clustering samples, because the correlation is over thousands of genes.
When clustering genes, it is important to be aware of the possible impact of outliers. This can be mitigated by using Spearman’s correlation instead of Pearson’s correlation.
If we want to identify clusters of observations with the same overall profiles regardless of their magnitudes, then we should use correlation-based distance as the dissimilarity measure. This is particularly the case in gene expression data analysis, where we might want to consider genes similar when they go “up” and “down” together. It is also the case in marketing, if we want to identify groups of shoppers with the same preferences in terms of items, regardless of the volume of items they bought.
If Euclidean distance is chosen, then observations with high values of features will be clustered together. The same holds true for observations with low values of features.
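A tiny illustration of this difference (two toy profiles with the same shape but very different magnitudes; p1 and p2 are illustrative names):

library(factoextra)
p1 <- c(1, 3, 2, 5)
p2 <- 10 * p1                      # same profile, 10x the magnitude
m  <- rbind(p1, p2)

dist(m, method = "euclidean")      # large: the magnitudes differ
get_dist(m, method = "pearson")    # ~0: the profiles are perfectly correlated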
Here we use a random subset of 15 observations from the USArrests dataset. Next, we standardize the data using the function scale().
set.seed(123)
id= sample(1:dim(USArrests)[1],15,replace = FALSE)
data= USArrests[id,]
data=scale(data)
head(data)
## Murder Assault UrbanPop Rape
## New Mexico 0.5850809 1.02300309 0.2250557 0.61101857
## Iowa -1.7022042 -1.54760088 -0.6892332 -1.43885018
## Indiana -0.4591145 -0.90775622 -0.1265939 -0.48290177
## Arizona -0.2353583 1.12403120 0.9283549 0.50261205
## Tennessee 1.0325932 -0.06585536 -0.5485734 0.09855138
## Texas 0.9082842 0.08007413 0.9283549 -0.03942055
There are many R functions for computing distances between pairs of observations:

- dist() [stats package]: accepts only numeric data as input.
- get_dist() [factoextra package]: accepts only numeric data as input. Compared to the standard dist() function, it supports correlation-based distance measures, including the “pearson”, “kendall” and “spearman” methods.
- daisy() [cluster package]: able to handle other variable types (e.g. nominal, ordinal, (a)symmetric binary). In that case, Gower’s coefficient is automatically used as the metric. It is one of the most popular measures of proximity for mixed data types.

dist.eucl= dist(data, method = "euclidean")
## Other methods are “euclidean”, “maximum”, “manhattan”, “canberra”, “binary”, “minkowski”.
round(as.matrix(dist.eucl)[1:3,1:3],1)
## New Mexico Iowa Indiana
## New Mexico 0.0 4.1 2.5
## Iowa 4.1 0.0 1.8
## Indiana 2.5 1.8 0.0
In this matrix, the values represent the distances between objects. The values on the diagonal represent the distance between each object and itself (which is zero).
Note that, if we want to compute pairwise distances between variables, we must start by transposing the data before using the dist() function. The function t() is used for transposing the data.
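For example, a short sketch on the scaled subset created above (dist.var is an illustrative name):

# Distances between variables (columns) rather than observations (rows)
dist.var <- dist(t(data), method = "euclidean")
round(as.matrix(dist.var), 1)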
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
dist.cor= get_dist(data, method = "pearson")
round(as.matrix(dist.cor)[1:3,1:3],1)
## New Mexico Iowa Indiana
## New Mexico 0.0 1.7 2.0
## Iowa 1.7 0.0 0.3
## Indiana 2.0 0.3 0.0
The function daisy() [cluster package] provides a solution (Gower’s metric) for computing the distance matrix when the data contain non-numeric columns. The flower dataset is such a data set:
library(cluster)
# Load data
data(flower)
head(flower, 3)
## V1 V2 V3 V4 V5 V6 V7 V8
## 1 0 1 1 4 3 15 25 15
## 2 1 0 0 2 1 3 150 50
## 3 0 1 0 3 3 1 150 50
# Data structure
str(flower)
## 'data.frame': 18 obs. of 8 variables:
## $ V1: Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 2 2 ...
## $ V2: Factor w/ 2 levels "0","1": 2 1 2 1 2 2 1 1 2 2 ...
## $ V3: Factor w/ 2 levels "0","1": 2 1 1 2 1 1 1 2 1 1 ...
## $ V4: Factor w/ 5 levels "1","2","3","4",..: 4 2 3 4 5 4 4 2 3 5 ...
## $ V5: Ord.factor w/ 3 levels "1"<"2"<"3": 3 1 3 2 2 3 3 2 1 2 ...
## $ V6: Ord.factor w/ 18 levels "1"<"2"<"3"<"4"<..: 15 3 1 16 2 12 13 7 4 14 ...
## $ V7: num 25 150 150 125 20 50 40 100 25 100 ...
## $ V8: num 15 50 50 50 15 40 20 15 15 60 ...
# Distance matrix
dist.mixed= daisy(flower)
round(as.matrix(dist.mixed)[1:3, 1:3], 2)
## 1 2 3
## 1 0.00 0.89 0.53
## 2 0.89 0.00 0.51
## 3 0.53 0.51 0.00
library(factoextra)
fviz_dist(dist.eucl)
K-means clustering (MacQueen 1967) is one of the most commonly used unsupervised machine learning algorithms for partitioning a given data set into a set of k groups (i.e. k clusters), where k represents the number of groups pre-specified by the analyst. It classifies objects into multiple groups (i.e., clusters), such that objects within the same cluster are as similar as possible (high intra-class similarity), whereas objects from different clusters are as dissimilar as possible (low inter-class similarity). In k-means clustering, each cluster is represented by its center (i.e., centroid), which corresponds to the mean of the points assigned to the cluster.
Let \(\mathbf{x}^{d\times 1}_1,\mathbf{x}^{d\times 1}_2,\ldots,\mathbf{x}^{d\times 1}_n\) be \(n\) samples. K-means clustering aims to partition the \(n\) observations into \(k\;(\leq n)\) sets \(\mathbf{S}=\{S_1,S_2,\ldots,S_k\}\) so as to minimize the within-cluster sum of squares (WCSS, i.e. the variance). Formally, the objective is to find:
\[\underset{\mathbf{S}}{\arg\min}\;{\sum}_{i=1}^k{\sum}_{\mathbf{x}\in S_i} \left\| \mathbf{x}- \mathbf{\mu}_i\right\|^2 = \underset{\mathbf{S}}{\arg\min}\;{\sum}_{i=1}^k \mid S_i \mid \text{Var}(S_i)\] where \(\mathbf{\mu}_i\) is the mean (also called the centroid) of the points in \(S_i\), i.e., \[\mathbf{\mu}_i=\frac{1}{\mid S_i \mid} {\sum}_{\mathbf{x} \in S_i} \mathbf{x},\] \(\mid S_i \mid\) is the size of \(S_i\), and \(\left\|\cdot\right\|\) is the usual \(L^2\) norm. This is equivalent to minimizing the pairwise squared deviations of points in the same cluster: \[\underset{\mathbf{S}}{\arg\min}\; {\sum}_{i=1}^k \frac{1}{\mid S_i \mid} {\sum}_{\mathbf{x}, \mathbf{y} \in S_i} \left\| \mathbf{x}- \mathbf{y}\right\|^2.\]
The equivalence can be deduced from the identity \(\mid S_i \mid {\sum}_{\mathbf{x} \in S_i} \left\| \mathbf{x}- \mathbf{\mu}_i\right\|^2= \frac{1}{2} {\sum}_{\mathbf{x}, \mathbf{y} \in S_i} \left\| \mathbf{x}- \mathbf{y}\right\|^2\). Since the total variance is constant, this is also equivalent to maximizing the sum of squared deviations between points in different clusters (the between-cluster sum of squares, BCSS). This deterministic relationship is related to the law of total variance in probability theory.
Several k-means algorithms are available. The standard algorithm is the Hartigan-Wong algorithm (Hartigan and Wong 1979), which defines the total within-cluster variation as the sum of squared Euclidean distances between items and the corresponding centroid: \[{\sum}_{i=1}^k{\sum}_{\mathbf{x}\in S_i} \left(\mathbf{x}- \mathbf{\mu}_i\right)^2.\]
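As a sanity check on these definitions, the components returned by kmeans() can be compared against a manual computation (a small sketch on the scaled USArrests data; km0 and wcss_manual are illustrative names):

# Check the variance decomposition behind the k-means objective:
# total SS = total within-cluster SS + between-cluster SS
set.seed(123)
dat <- scale(USArrests)
km0 <- kmeans(dat, centers = 4, nstart = 25)

wcss_manual <- sum(sapply(1:4, function(i) {
  ci <- dat[km0$cluster == i, , drop = FALSE]
  sum(sweep(ci, 2, colMeans(ci))^2)   # squared distances to the cluster centroid
}))

all.equal(wcss_manual, km0$tot.withinss)                 # should be TRUE
all.equal(km0$totss, km0$tot.withinss + km0$betweenss)   # should be TRUE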
USArrests Dataset

The R function fviz_nbclust() [in the factoextra package] provides a convenient solution to estimate the optimal number of clusters.
data= scale(USArrests)
library(factoextra)
fviz_nbclust(data,kmeans ,method = "wss")+geom_vline(xintercept = 4,linetype="dotted")
Note that here we specify nstart = 25. This means that R will try 25 different random starting assignments and then select the best result, i.e. the one with the lowest within-cluster variation. The default value of nstart in R is 1, but it is strongly recommended to compute k-means clustering with a large value of nstart, such as 25 or 50, in order to obtain a more stable result.
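To see the effect in practice, here is a small comparison sketch (km1 and km25 are illustrative names; the exact values depend on the random starts):

# Compare a single random start with 25 random starts:
# the multi-start solution typically matches or improves the objective
set.seed(123)
km1  <- kmeans(data, centers = 4, nstart = 1)
km25 <- kmeans(data, centers = 4, nstart = 25)
c(nstart_1 = km1$tot.withinss, nstart_25 = km25$tot.withinss)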
set.seed(123)
km= kmeans(data,4,nstart = 25)
km
## K-means clustering with 4 clusters of sizes 8, 13, 16, 13
##
## Cluster means:
## Murder Assault UrbanPop Rape
## 1 1.4118898 0.8743346 -0.8145211 0.01927104
## 2 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 3 -0.4894375 -0.3826001 0.5758298 -0.26165379
## 4 0.6950701 1.0394414 0.7226370 1.27693964
##
## Clustering vector:
## Alabama Alaska Arizona Arkansas California
## 1 4 4 1 4
## Colorado Connecticut Delaware Florida Georgia
## 4 3 3 4 1
## Hawaii Idaho Illinois Indiana Iowa
## 3 2 4 3 2
## Kansas Kentucky Louisiana Maine Maryland
## 3 2 1 2 4
## Massachusetts Michigan Minnesota Mississippi Missouri
## 3 4 2 1 4
## Montana Nebraska Nevada New Hampshire New Jersey
## 2 2 4 2 3
## New Mexico New York North Carolina North Dakota Ohio
## 4 4 1 2 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 3 3 3 3 1
## South Dakota Tennessee Texas Utah Vermont
## 2 1 4 3 2
## Virginia Washington West Virginia Wisconsin Wyoming
## 3 3 2 2 3
##
## Within cluster sum of squares by cluster:
## [1] 8.316061 11.952463 16.212213 19.922437
## (between_SS / total_SS = 71.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
It’s possible to compute the mean of each variable by cluster, using the original data:
aggregate(USArrests, by=list(cluster=km$cluster), mean)
## cluster Murder Assault UrbanPop Rape
## 1 1 13.93750 243.62500 53.75000 21.41250
## 2 2 3.60000 78.53846 52.07692 12.17692
## 3 3 5.65625 138.87500 73.87500 18.78125
## 4 4 10.81538 257.38462 76.00000 33.19231
If we want to add the point classifications to the original data:
dd= cbind(USArrests, cluster = km$cluster)
head(dd)
## Murder Assault UrbanPop Rape cluster
## Alabama 13.2 236 58 21.2 1
## Alaska 10.0 263 48 44.5 4
## Arizona 8.1 294 80 31.0 4
## Arkansas 8.8 190 50 19.5 1
## California 9.0 276 91 40.6 4
## Colorado 7.9 204 78 38.7 4
Now we want to visualize the data in a scatter plot, coloring each data point according to its cluster assignment.
The problem is that the data contain more than two variables, and the question is which variables to choose for the xy scatter plot.
A solution is to reduce the number of dimensions by applying a dimensionality reduction algorithm, such as Principal Component Analysis (PCA), that operates on the four variables and outputs two new variables (that represent the original variables) that you can use to do the plot. In other words, if we have a multi-dimensional data set, a solution is to perform Principal Component Analysis (PCA) and to plot data points according to the first two principal components coordinates.
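To make this concrete, here is a manual sketch of the same idea using prcomp() and base graphics; fviz_cluster(), described below, does this more elegantly (pca is an illustrative name):

# Project the 4 variables onto the first two principal components
# and color the points by their k-means cluster
pca <- prcomp(data)   # data is already scaled
plot(pca$x[, 1:2], col = km$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "K-means clusters in PCA space")
text(pca$x[, 1:2], labels = rownames(data), pos = 3, cex = 0.6, col = km$cluster)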
The function fviz_cluster() [factoextra package] can be used to easily visualize k-means clusters. It takes the k-means results and the original data as arguments. In the resulting plot, observations are represented by points, using principal components if the number of variables is greater than 2. It’s also possible to draw a concentration ellipse around each cluster.
library(factoextra)
fviz_cluster(km,data,ellipse.type = "norm")
A robust alternative to k-means is PAM, which is based on medoids. PAM clustering can be computed using the function pam() [cluster package]. The function pamk() [fpc package] is a wrapper for PAM that also prints the suggested number of clusters based on the optimum average silhouette width.
The k-medoids algorithm is a clustering approach related to k-means clustering.
In k-medoids clustering, each cluster is represented by one of the data points in the cluster, called the cluster medoid.
The term medoid refers to the object within a cluster for which the average dissimilarity between it and all the other members of the cluster is minimal. It corresponds to the most centrally located point in the cluster. In other words, the algorithm minimizes the sum of the dissimilarities between objects and their closest selected object.
K-medoids is a robust alternative to k-means clustering, i.e., the algorithm is less sensitive to noise and outliers compared to k-means, because it uses medoids as cluster centers instead of means (used in k-means).
This method requires the user to specify k, the number of clusters to be generated (like in k-means clustering). A useful approach to determine the optimal number of clusters is the silhouette method.
The most common k-medoids clustering methods is the PAM algorithm (Partitioning Around Medoids, (Kaufman and Rousseeuw 1990)).
The PAM algorithm is based on the search for k representative objects, or medoids, among the observations of the data set. In summary, the PAM algorithm proceeds in two phases as follows:

Build phase:
1. Select k objects to become the medoids, or, in case these objects were provided, use them as the medoids.
2. Calculate the dissimilarity matrix if it was not provided.
3. Assign every object to its closest medoid.

Swap phase:
For each cluster, search whether swapping the medoid with another object of the cluster decreases the average dissimilarity; if it does, select the object that decreases it the most as the new medoid. Repeat the assignment and swap steps until the medoids no longer change.
As mentioned above, the PAM algorithm works with a matrix of dissimilarities; to compute this matrix, the algorithm can use two metrics:
- the Euclidean distance, which is the root of the sum of squared differences;
- and the Manhattan distance, which is the sum of absolute differences.
Note that, in practice, we should get similar results most of the time using either the Euclidean or the Manhattan distance. If the data contain outliers, the Manhattan distance should give more robust results, whereas the Euclidean distance would be influenced by unusual values.
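For instance, a quick comparison sketch of the two metrics (object names are illustrative; only the cluster assignments are compared):

# PAM with Euclidean vs. Manhattan dissimilarities
library(cluster)
pam.euc <- pam(scale(USArrests), k = 2, metric = "euclidean")
pam.man <- pam(scale(USArrests), k = 2, metric = "manhattan")
table(euclidean = pam.euc$clustering, manhattan = pam.man$clustering)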
USArrests Dataset

The functions pam() [cluster package] and pamk() [fpc package] can be used to compute PAM.
The function pamk() does not require a user to decide the number of clusters K.
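Assuming the fpc package is installed, a minimal sketch of pamk() might look like this (pk is an illustrative name):

# pamk() chooses k automatically by maximizing the average silhouette width
library(fpc)
pk <- pamk(scale(USArrests))
pk$nc                   # suggested number of clusters
pk$pamobject$medoids    # medoids of the corresponding PAM fit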
To create a beautiful graph of the clusters generated with the pam() function, we will use the factoextra package.
library(cluster)
library(factoextra)
The R function fviz_nbclust() [in the factoextra package] provides a convenient solution to estimate the optimal number of clusters. Here we’ll use the average silhouette method. A high average silhouette width indicates a good clustering.
data= scale(USArrests)
fviz_nbclust(data,kmeans ,method = "silhouette")
Pam= pam(data, 2) #computes PAM algorithm with k = 2
Pam
## Medoids:
## ID Murder Assault UrbanPop Rape
## New Mexico 31 0.8292944 1.3708088 0.3081225 1.1603196
## Nebraska 27 -0.8008247 -0.8250772 -0.2445636 -0.5052109
## Clustering vector:
## Alabama Alaska Arizona Arkansas California
## 1 1 1 2 1
## Colorado Connecticut Delaware Florida Georgia
## 1 2 2 1 1
## Hawaii Idaho Illinois Indiana Iowa
## 2 2 1 2 2
## Kansas Kentucky Louisiana Maine Maryland
## 2 2 1 2 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 1 2 1 1
## Montana Nebraska Nevada New Hampshire New Jersey
## 2 2 1 2 2
## New Mexico New York North Carolina North Dakota Ohio
## 1 1 1 2 2
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 2 2 1
## South Dakota Tennessee Texas Utah Vermont
## 2 1 1 2 2
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 2 2 2
## Objective function:
## build swap
## 1.441358 1.368969
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call" "data"
The printed output shows:
- the cluster medoids: a matrix whose rows are the medoids and whose columns are the variables
- the clustering vector: A vector of integers (from 1:k) indicating the cluster to which each point is allocated
If we want to add the point classifications to the original data:
dd= cbind(USArrests, cluster = Pam$clustering)
head(dd, n = 3)
## Murder Assault UrbanPop Rape cluster
## Alabama 13.2 236 58 21.2 1
## Alaska 10.0 263 48 44.5 1
## Arizona 8.1 294 80 31.0 1
To visualize the partitioning results, we’ll use the function fviz_cluster() [factoextra package].
fviz_cluster(Pam,data,ellipse.type = "norm")
Note that, for large data sets, pam() may need too much memory or too much computation time; in this case, the function clara() is preferable. For data sets of moderate size, this is not a problem on modern computers.
Note that there is a variant of PAM named CLARA (Clustering Large Applications, (Kaufman and Rousseeuw 1990)) which is used for analyzing large data sets. It is an extension of the k-medoids (PAM) method for data containing a large number of objects (more than several thousand observations), designed to reduce computing time and RAM storage problems. This is achieved using a sampling approach.
Instead of finding medoids for the entire data set, CLARA considers a small sample of the data with fixed size (sampsize) and applies the PAM algorithm to generate an optimal set of medoids for the sample. The quality of resulting medoids is measured by the average dissimilarity between every object in the entire data set and the medoid of its cluster, defined as the cost function.
CLARA repeats the sampling and clustering processes a pre-specified number of times in order to minimize the sampling bias. The final clustering results correspond to the set of medoids with the minimal cost.
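A minimal clara() sketch on the same data (the sample settings here are illustrative, not tuned):

# CLARA: PAM applied to repeated samples of the data
library(cluster)
clara.res <- clara(scale(USArrests), k = 2, samples = 50, pamLike = TRUE)
clara.res$medoids            # medoids found on the best sample
head(clara.res$clustering)   # cluster assignment for every observation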
Agglomerative clustering is the most common type of hierarchical clustering used to group objects into clusters based on their similarity. It works in a “bottom-up” manner. That is:
- Each object is initially considered as a single-element cluster (leaf).
- At each step, the two clusters that are most similar are combined into a new, bigger cluster (node).
- This procedure is iterated until all points are members of one single big cluster (root).
The steps of agglomerative hierarchical clustering are:
1. Preparing the data (standardizing the variables).
2. Computing (dis)similarity information between every pair of objects in the data set.
3. Using a linkage function to group objects into a hierarchical cluster tree, based on the distance information generated in step 2. Objects/clusters that are in close proximity are linked together using the linkage function.
4. Determining where to cut the hierarchical tree into clusters. This creates a partition of the data.
Similarity measures

There are many methods to calculate the (dis)similarity information, including Euclidean and Manhattan distances. In R, we can use the function dist() to compute the distance between every pair of objects in a data set; the result is known as a distance or dissimilarity matrix.
Linkage

The linkage function takes the distance information returned by dist() and groups pairs of objects into clusters based on their similarity. Next, these newly formed clusters are linked to each other to create bigger clusters. This process is iterated until all the objects in the original data set are linked together in a hierarchical tree.

There are many cluster agglomeration methods (i.e., linkage methods). The most common linkage methods are:
- Maximum or complete linkage
- Minimum or single linkage
- Mean or average linkage
- Centroid linkage
- Ward’s minimum variance method: it minimizes the total within-cluster variance. At each step, the pair of clusters with the minimum between-cluster distance is merged.

Complete linkage and Ward’s method are generally preferred.
The R base function hclust() can be used to create the hierarchical tree. Alternatively, the cluster package makes it easy to perform cluster analysis in R: it provides the functions agnes() and diana() for computing agglomerative and divisive clustering, respectively.
USArrests Dataset

The function hclust() [stats package] is used to compute hierarchical clustering. Dendrograms correspond to the graphical representation of the hierarchical tree generated by hclust(). A dendrogram can be produced in R using the base function plot(). We can also use fviz_dend() [in the factoextra package] to produce a beautiful dendrogram.
library(cluster)
library(factoextra)
data= scale(USArrests)
res.dist= dist(data, method = "euclidean")
dim(as.matrix(res.dist))
## [1] 50 50
round(as.matrix(res.dist)[1:10,1:10],2)
## Alabama Alaska Arizona Arkansas California Colorado Connecticut
## Alabama 0.00 2.70 2.29 1.29 3.26 2.65 3.22
## Alaska 2.70 0.00 2.70 2.83 3.01 2.33 4.74
## Arizona 2.29 2.70 0.00 2.72 1.31 1.37 3.26
## Arkansas 1.29 2.83 2.72 0.00 3.76 2.83 2.61
## California 3.26 3.01 1.31 3.76 0.00 1.29 4.07
## Colorado 2.65 2.33 1.37 2.83 1.29 0.00 3.33
## Connecticut 3.22 4.74 3.26 2.61 4.07 3.33 0.00
## Delaware 2.02 3.62 1.91 1.80 3.07 2.55 1.76
## Florida 2.30 3.00 1.75 3.37 2.03 2.45 4.47
## Georgia 1.13 2.82 2.79 2.21 3.38 2.86 3.97
## Delaware Florida Georgia
## Alabama 2.02 2.30 1.13
## Alaska 3.62 3.00 2.82
## Arizona 1.91 1.75 2.79
## Arkansas 1.80 3.37 2.21
## California 3.07 2.03 3.38
## Colorado 2.55 2.45 2.86
## Connecticut 1.76 4.47 3.97
## Delaware 0.00 3.06 2.98
## Florida 3.06 0.00 2.18
## Georgia 2.98 2.18 0.00
Hc= hclust(d = res.dist, method = "ward.D2")
Note that the function hcut() [in the factoextra package] can also be used to compute hierarchical clustering (hclust, agnes, diana) and cut the tree into k clusters.
plot(Hc)
Similarly, using the factoextra package:
library("factoextra")
fviz_dend(Hc, cex = 0.5)
After linking the objects in a data set into a hierarchical cluster tree, we usually want to assess that the distances (i.e., heights) in the tree reflect the original distances accurately.
One way to measure this is to compute the correlation between the cophenetic distances and the original distance data generated by the dist() function.
If the clustering is valid, the linking of objects in the cluster tree should have a strong correlation with the distances between objects in the original distance matrix.
The closer the value of the correlation coefficient is to 1, the more accurately the clustering solution reflects your data. Values above 0.75 are felt to be good.
The R base function cophenetic() can be used to compute the cophenetic distances for hierarchical clustering.
# Compute cophentic distance
coph= cophenetic(Hc)
# Correlation between cophenetic distance and the original distance
cor(coph, res.dist)
## [1] 0.6975266
Execute the hclust() function again using the average linkage method. Next, call cophenetic() to evaluate the clustering solution.
Hc2= hclust(res.dist, method = "average")
cor(res.dist, cophenetic(Hc2))
## [1] 0.7180382
The correlation coefficient shows that using a different linkage method creates a tree that represents the original distances slightly better.
One of the problems with hierarchical clustering is that it does not tell us how many clusters there are, or where to cut the dendrogram to form clusters.
We can cut the hierarchical tree at a given height in order to partition the data into clusters.
The R base function cutree() can be used to cut a tree generated by the hclust() function into several groups, either by specifying the desired number of groups or the cut height.
It returns a vector containing the cluster number of each observation.
# Cut tree into 4 groups
grp= cutree(Hc, k = 4)
head(grp, n = 4)
## Alabama Alaska Arizona Arkansas
## 1 2 2 3
# Number of members in each cluster
table(grp)
## grp
## 1 2 3 4
## 7 12 19 12
# Get the names for the members of cluster 1
rownames(data)[grp == 1]
## [1] "Alabama" "Georgia" "Louisiana" "Mississippi"
## [5] "North Carolina" "South Carolina" "Tennessee"
The result of the cuts can be visualized easily using the function fviz_dend() [in factoextra]:
fviz_dend(Hc, cex = 0.5, k = 4, color_labels_by_k = TRUE)
Using the function fviz_cluster() [in factoextra], we can also visualize the result in a scatter plot. Note that this function cannot be applied to an object generated by the base function hclust(); instead, we use the hcut() function [from factoextra] to perform the hierarchical clustering. Observations are represented by points in the plot, using principal components. A frame is drawn around each cluster.
hc.cut= hcut(data, k = 4, hc_method = "ward.D2")
hc.cut
##
## Call:
## stats::hclust(d = x, method = hc_method)
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 50
fviz_cluster(hc.cut, ellipse.type = "convex",show.clust.cent=T,geom = c("point", "text") )
As discussed earlier, the R package cluster makes it easy to perform cluster analysis in R. The agnes() and diana() functions perform all the necessary steps for us; we don’t need to execute the scale(), dist() and hclust() functions separately.
library("cluster")
# Agglomerative Nesting (Hierarchical Clustering)
res.agnes= agnes(x = USArrests,      # data matrix
                 stand = TRUE,       # standardize the data
                 metric = "euclidean", # metric for distance matrix
                 method = "ward"     # linkage method
                 )
# DIvisive ANAlysis Clustering
res.diana= diana(x = USArrests,      # data matrix
                 stand = TRUE,       # standardize the data
                 metric = "euclidean" # metric for distance matrix
                 )
After running agnes() and diana(), you can use the function fviz_dend() [in factoextra] to visualize the output:
fviz_dend(res.agnes, cex = 0.6, k = 4)
Divisive hierarchical clustering is the inverse of agglomerative clustering.
It starts by including all objects in a single large cluster.
At each step of iteration, the most heterogeneous cluster is divided into two.
The process is iterated until all objects are in their own cluster.
Recall that, divisive clustering is good at identifying large clusters while agglomerative clustering is good at identifying small clusters.
The R function diana() [cluster package] can be used to compute divisive clustering. It returns an object of class diana (see ?diana.object) which also has methods for the functions print(), summary(), plot(), pltree(), as.dendrogram(), as.hclust() and cutree().
The output can be visualized as a dendrogram using the function fviz_dend() [factoextra package]. For example:
# Compute diana()
library(cluster)
res.diana <- diana(USArrests, stand = TRUE)
# Plot the dendrogram
library(factoextra)
fviz_dend(res.diana, cex = 0.5,
k = 4, # Cut in four groups
palette = "jco" # Color palette
)
Now we will see how to compare cluster dendrograms in R using the dendextend R package. Suppose we have computed two dendrograms by applying hierarchical clustering (HC) with two different linkage methods (“average” and “ward.D2”).
data= scale(USArrests)
# working with a small random subset of the data set with 10 observations among the 50 observations.
set.seed(123)
id= sample(1:50, 10)
data= data[id,]
library(dendextend)
library(dplyr)
# Compute distance matrix
res.dist= dist(data, method = "euclidean")
# Compute 2 hierarchical clusterings
hc1= hclust(res.dist, method = "average")
hc2= hclust(res.dist, method = "ward.D2")
# Create two dendrograms
dend1= as.dendrogram (hc1)
dend2= as.dendrogram (hc2)
# Create a list to hold dendrograms
dend_list= dendlist(dend1, dend2)
To visually compare two dendrograms, we’ll use the following R functions [dendextend package]:

- untangle(): finds the best layout to align dendrogram lists, using heuristic methods.
- tanglegram(): plots the two dendrograms side by side, with their labels connected by lines.
- entanglement(): computes the quality of the alignment of the two trees. Entanglement is a measure between 1 (full entanglement) and 0 (no entanglement); a lower entanglement coefficient corresponds to a better alignment.
Draw a tanglegram:
# Align and plot two dendrograms side by side
dendlist(dend1, dend2) %>%
untangle(method = "step1side") %>% # Find the best alignment layout
tanglegram() # Draw the two dendrograms
# Compute alignment quality. Lower value = good alignment quality
dendlist(dend1, dend2) %>%
untangle(method = "step1side") %>% # Find the best alignment layout
entanglement() # Alignment quality
## [1] 0
The function cor.dendlist() is used to compute the “Baker” or “Cophenetic” correlation matrix between a list of trees. The value can range from -1 to 1; values near 0 mean that the two trees are not statistically similar.
# Cophenetic correlation matrix
cor.dendlist(dend_list, method = "cophenetic")
## [,1] [,2]
## [1,] 1.0000000 0.9925544
## [2,] 0.9925544 1.0000000
# Baker correlation matrix
cor.dendlist(dend_list, method = "baker")
## [,1] [,2]
## [1,] 1.0000000 0.9895528
## [2,] 0.9895528 1.0000000
The correlation between two trees can be also computed as follow:
# Cophenetic correlation coefficient
cor_cophenetic(dend1, dend2)
## [1] 0.9925544
# Baker correlation coefficient
cor_bakers_gamma(dend1, dend2)
## [1] 0.9895528
It’s also possible to compare simultaneously multiple dendrograms:
# Create multiple dendrograms by chaining
dend1= data %>% dist %>% hclust("complete") %>% as.dendrogram
dend2= data %>% dist %>% hclust("single") %>% as.dendrogram
dend3= data %>% dist %>% hclust("average") %>% as.dendrogram
dend4= data %>% dist %>% hclust("centroid") %>% as.dendrogram
# Compute correlation matrix
dend_list= dendlist("Complete" = dend1, "Single" = dend2,
"Average" = dend3, "Centroid" = dend4)
cors= cor.dendlist(dend_list)
# Print correlation matrix
round(cors, 2)
## Complete Single Average Centroid
## Complete 1.00 0.46 0.45 0.30
## Single 0.46 1.00 0.23 0.17
## Average 0.45 0.23 1.00 0.31
## Centroid 0.30 0.17 0.31 1.00
# Visualize the correlation matrix using corrplot package
library(corrplot)
## corrplot 0.92 loaded
corrplot(cors, "pie", "lower")
Clusters can be found in a data set by chance, due to clustering noise or sampling error. The R package pvclust (Suzuki and Shimodaira 2015) uses bootstrap resampling techniques to compute a p-value for each hierarchical cluster.
The algorithm for computing the p-values is:
1. Generate thousands of bootstrap samples by randomly sampling elements of the data.
2. Compute hierarchical clustering on each bootstrap copy.
3. For each cluster, compute the bootstrap probability (BP) value, i.e. the frequency with which the cluster appears in the bootstrap copies, and the approximately unbiased (AU) p-value obtained by multiscale bootstrap resampling.

Clusters with AU >= 95% are considered to be strongly supported by the data.
The function pvclust() can be used as follows:
pvclust(data, method.hclust = "average",
method.dist = "correlation", nboot = 1000)
Argument | Description
---|---
data | numeric data matrix or data frame.
method.hclust | the agglomerative method used in hierarchical clustering. Possible values are one of “average”, “ward”, “single”, “complete”, “mcquitty”, “median” or “centroid”. The default is “average”. See the method argument in ?hclust.
method.dist | the distance measure to be used. Possible values are one of “correlation”, “uncentered”, “abscor”, or those allowed for the method argument of the dist() function, such as “euclidean” and “manhattan”.
nboot | the number of bootstrap replications. The default is 1000.
iseed | an integer for random seeds. Use the iseed argument to achieve reproducible results.
The function pvclust() returns an object of class pvclust containing many elements, including hclust, which contains the hierarchical clustering result for the original data generated by the function hclust().
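A minimal sketch of a full pvclust run (note that pvclust clusters the columns of the data matrix; nboot is reduced here for speed, and res.pv is an illustrative name):

# Bootstrap assessment of the hierarchical clustering of the (scaled) USArrests variables
library(pvclust)
res.pv <- pvclust(scale(USArrests), method.hclust = "average",
                  method.dist = "correlation", nboot = 100, iseed = 123)
plot(res.pv)                    # dendrogram annotated with AU/BP values
pvrect(res.pv, alpha = 0.95)    # highlight clusters with AU >= 95%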
K-means is one of the most popular clustering algorithms. However, it has some limitations: it requires the user to specify the number of clusters in advance, and it selects the initial centroids randomly. The final k-means clustering solution is very sensitive to this initial random selection of cluster centers, so the result might be (slightly) different each time k-means is computed.
The hierarchical k-means algorithm is summarized as follows:
1. Compute hierarchical clustering and cut the tree into k clusters.
2. Compute the center (i.e. the mean) of each cluster.
3. Compute k-means using the set of cluster centers (defined in step 2) as the initial cluster centers.
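To make the three steps concrete, here is a manual sketch using hclust(), cutree() and kmeans() with explicit initial centers; this is essentially what hkmeans() automates (object names and the ward.D2 linkage are illustrative choices):

# The three steps by hand
data <- scale(USArrests)

# 1. Hierarchical clustering, cut into k = 4 clusters
hc  <- hclust(dist(data), method = "ward.D2")
grp <- cutree(hc, k = 4)

# 2. Cluster centers = mean of each variable within each cluster
centers <- apply(data, 2, function(v) tapply(v, grp, mean))

# 3. K-means initialized with these centers (instead of random starts)
km.h <- kmeans(data, centers = centers)
km.h$size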
The R function hkmeans() [in the factoextra package] provides an easy solution to compute hierarchical k-means clustering.
# Compute hierarchical k-means clustering
#library(factoextra)
data= scale(USArrests)
hk= hkmeans(data, 4)
# Elements returned by hkmeans()
names(hk)
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "data"
## [11] "hclust"
To print all the results:
hk
## Hierarchical K-means clustering with 4 clusters of sizes 8, 13, 16, 13
##
## Cluster means:
## Murder Assault UrbanPop Rape
## 1 1.4118898 0.8743346 -0.8145211 0.01927104
## 2 0.6950701 1.0394414 0.7226370 1.27693964
## 3 -0.4894375 -0.3826001 0.5758298 -0.26165379
## 4 -0.9615407 -1.1066010 -0.9301069 -0.96676331
##
## Clustering vector:
## Alabama Alaska Arizona Arkansas California
## 1 2 2 1 2
## Colorado Connecticut Delaware Florida Georgia
## 2 3 3 2 1
## Hawaii Idaho Illinois Indiana Iowa
## 3 4 2 3 4
## Kansas Kentucky Louisiana Maine Maryland
## 3 4 1 4 2
## Massachusetts Michigan Minnesota Mississippi Missouri
## 3 2 4 1 2
## Montana Nebraska Nevada New Hampshire New Jersey
## 4 4 2 4 3
## New Mexico New York North Carolina North Dakota Ohio
## 2 2 1 4 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 3 3 3 3 1
## South Dakota Tennessee Texas Utah Vermont
## 4 1 2 3 4
## Virginia Washington West Virginia Wisconsin Wyoming
## 3 3 4 4 3
##
## Within cluster sum of squares by cluster:
## [1] 8.316061 19.922437 16.212213 11.952463
## (between_SS / total_SS = 71.2 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "data"
## [11] "hclust"
# Visualize the tree
fviz_dend(hk, cex = 0.6, palette = "jco",
rect = TRUE, rect_border = "jco", rect_fill = TRUE)
# Visualize the hkmeans final clusters
fviz_cluster(hk, palette = "jco", repel = TRUE,
ggtheme = theme_classic())
The fuzzy clustering is considered as soft clustering, in which each element has a probability of belonging to each cluster. In other words, each element has a set of membership coefficients corresponding to the degree of being in a given cluster.
This is different from k-means and k-medoids clustering, where each object is assigned to exactly one cluster. K-means and k-medoids clustering are known as hard or non-fuzzy clustering.
In fuzzy clustering, points close to the center of a cluster may belong to the cluster to a higher degree than points at the edge of the cluster. The degree to which an element belongs to a given cluster is a numerical value varying from 0 to 1.
The fuzzy c-means (FCM) algorithm is one of the most widely used fuzzy clustering algorithms. The centroid of a cluster is calculated as the mean of all points, weighted by their degree of belonging to the cluster.
We’ll use the following R packages: (1) cluster for computing fuzzy clustering and (2) factoextra for visualizing clusters.

The function fanny() [cluster R package] can be used to compute fuzzy clustering. FANNY stands for fuzzy analysis clustering.
A simplified format is:

fanny(x, k, metric = "euclidean", stand = FALSE)

Argument | Description
---|---
x | a data matrix, data frame or dissimilarity matrix
k | the desired number of clusters to be generated
metric | the metric for calculating dissimilarities between observations
stand | if TRUE, variables are standardized before calculating the dissimilarities
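A minimal usage sketch on the scaled USArrests data (res.fanny is an illustrative name):

# Fuzzy clustering with k = 2
library(cluster)
library(factoextra)

res.fanny <- fanny(scale(USArrests), k = 2, metric = "euclidean")
head(res.fanny$membership, 3)   # membership coefficients (each row sums to 1)
res.fanny$coeff                 # Dunn's partition coefficient
fviz_cluster(res.fanny, ellipse.type = "norm", repel = TRUE)
fviz_silhouette(res.fanny)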
Note that there are more advanced clustering methods available, such as model-based clustering and DBSCAN (density-based clustering).
Boehmke, Bradley, and Brandon Greenwell. Hands-On Machine Learning with R.
Ester, Martin, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.” In, 226–31. AAAI Press.
Kaufman, Leonard, and Peter Rousseeuw. 1990. Finding Groups in Data: An Introduction to Cluster Analysis.