0.1 Data Preparation

  • Data should be in a tabular format where rows are observations and columns are variables.

  • Any missing value in the data should be removed or estimated.

  • Data must be standardized \(\left(\frac{x_i-\bar{x}}{s}\right)\) to make variables comparable; a minimal sketch follows this list.
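A minimal preparation sketch, assuming the built-in USArrests data used throughout this document (na.omit() is shown only for completeness, since USArrests has no missing values):

df= na.omit(USArrests)  # drop rows with missing values (none here)
df= scale(df)           # standardize each column: (x - mean) / sd
head(df, 3)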

0.2 Required Packages

Here we’ll use mainly the following R packages:

  • cluster for computing clustering algorithms, and

  • factoextra for ggplot2-based elegant visualization of clustering results.

factoextra contains many functions for cluster analysis and visualization, including:

Functions                         Descriptions
get_dist(), fviz_dist()           Distance matrix computation and visualization
get_clust_tendency()              Assessing clustering tendency
fviz_nbclust(), fviz_gap_stat()   Determining the optimal number of clusters
fviz_dend()                       Enhanced visualization of dendrograms
fviz_cluster()                    Visualize clustering results
fviz_mclust()                     Visualize model-based clustering results
fviz_silhouette()                 Visualize silhouette information from clustering
hcut()                            Compute hierarchical clustering and cut the tree
hkmeans()                         Hierarchical k-means clustering
eclust()                          Visual enhancement of clustering analysis

The two packages can be installed as follows:

install.packages(c("cluster", "factoextra"))

0.3 Clustering Distance Measures

The classification of observations into groups requires some method for computing the distance or the (dis)similarity between each pair of observations. The result of this computation is known as a dissimilarity or distance matrix. The choice of distance measure is a critical step in clustering: it defines how the similarity of two elements (x, y) is calculated, and it will influence the shape of the clusters.

There are many methods to calculate this distance information. Here are the most common distance measures.

Let \(x\) and \(y\) be two \(n\times 1\) vectors.

0.3.1 The classical methods:

0.3.2 (1) Euclidean distance:

\[d_{euc}(x,y)=\sqrt{{\sum}_{i=1}^n (x_i-y_i)^2}\]

0.3.3 (2) Manhattan distance:

\[d_{man}(x,y)={\sum}_{i=1}^n \mid x_i-y_i\mid\]
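As a quick numerical check of the two formulas (the vectors x and y below are made up for illustration), the hand computations agree with the base dist() function:

x= c(1, 2, 3); y= c(4, 0, 3)
sqrt(sum((x - y)^2))               # Euclidean: sqrt(13)
sum(abs(x - y))                    # Manhattan: 5
dist(rbind(x, y), method = "euclidean")
dist(rbind(x, y), method = "manhattan")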

0.3.4 Correlation-based distances:

These are widely used for gene expression data analyses. Correlation-based distance is defined by subtracting the correlation coefficient from 1. Different types of correlation methods can be used, such as:

0.3.5 (1) Pearson correlation distance:

\[d_{cor}(x,y)=1-r_{xy} \:\: \text{where}\: r_{xy}\; \text{is the Pearson correlation b/w x and y}\]

Pearson correlation measures the degree of a linear relationship between two profiles.

0.3.6 (2) Eisen cosine correlation distance:

It’s a special case of Pearson’s correlation with \(\bar{x}\) and \(\bar{y}\) both replaced by zero:

\[d_{eisen}(x,y)=1-\frac{\mid{\sum}_{i=1}^n x_i y_i \mid}{\sqrt{{\sum}_{i=1}^n x_i^2 {\sum}_{i=1}^n y_i^2}}\]

0.3.7 (3) Spearman correlation distance:

The Spearman correlation method computes the correlation between the ranks of the x and y variables.

\[d_{spear}(x,y)=1-r_{spear}(xy) \:\: \text{where}\: r_{spear}(xy)\; \text{is the Spearman correlation b/w x and y}\]

0.3.8 (4) Kendall correlation distance:

The Kendall correlation method measures the correspondence between the rankings of the x and y variables, based on the numbers of concordant and discordant pairs.

\[d_{kend}(x,y)=1-r_{kend}(xy) \:\: \text{where}\: r_{kend}(xy)\; \text{is the Kendall correlation b/w x and y}\]
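The correlation-based distances above can be computed directly from R's cor() function. A small sketch on two made-up vectors (the Eisen cosine distance is written out from its formula):

x= c(1, 2, 3, 4, 5); y= c(2, 3, 1, 6, 4)
1 - cor(x, y, method = "pearson")                 # Pearson correlation distance
1 - cor(x, y, method = "spearman")                # Spearman correlation distance
1 - cor(x, y, method = "kendall")                 # Kendall correlation distance
1 - abs(sum(x * y)) / sqrt(sum(x^2) * sum(y^2))   # Eisen cosine correlation distance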

NOTE

  • Pearson, used for two quantitative continuous variables which have a linear relationship

  • Spearman, used for two quantitative variables if the link is partially linear, or for one qualitative ordinal variable and one quantitative variable

  • Kendall, often used for two qualitative ordinal variables

0.3.9 What type of distance measures should we choose?

  • The choice of distance measures is very important. For most common clustering software, the default distance measure is the Euclidean distance but depending on the type of the data other dissimilarity measures might be preferred. For example, correlation-based distance is often used in gene expression data analysis.

  • Correlation-based distance considers two objects to be similar if their features are highly correlated, even though the observed values may be far apart in terms of Euclidean distance.

  • The distance between two objects is 0 when they are perfectly correlated.

  • Pearson’s correlation is quite sensitive to outliers. This does not matter when clustering samples, because the correlation is over thousands of genes.

  • When clustering genes, it is important to be aware of the possible impact of outliers. This can be mitigated by using Spearman’s correlation instead of Pearson’s correlation.

  • If we want to identify clusters of observations with the same overall profiles regardless of their magnitudes, then we should use a correlation-based distance as the dissimilarity measure. This is particularly the case in gene expression data analysis, where we might want to consider genes similar when they are “up” and “down” together. It is also the case in marketing, if we want to identify groups of shoppers with the same preferences in terms of items, regardless of the volume of items they bought.

  • If Euclidean distance is chosen, then observations with high values of features will be clustered together. The same holds true for observations with low values of features.

0.4 Distance matrix computation

0.4.0.1 Data preparation

Here we use a random subset of 15 observations from the USArrests dataset. Next, we standardize the data using the scale() function.

set.seed(123)
id= sample(1:nrow(USArrests), 15, replace = FALSE)  # pick 15 random states
data= USArrests[id, ]
data= scale(data)                                   # standardize the variables
head(data)
##                Murder     Assault   UrbanPop        Rape
## New Mexico  0.5850809  1.02300309  0.2250557  0.61101857
## Iowa       -1.7022042 -1.54760088 -0.6892332 -1.43885018
## Indiana    -0.4591145 -0.90775622 -0.1265939 -0.48290177
## Arizona    -0.2353583  1.12403120  0.9283549  0.50261205
## Tennessee   1.0325932 -0.06585536 -0.5485734  0.09855138
## Texas       0.9082842  0.08007413  0.9283549 -0.03942055

0.4.0.2 R functions and packages

There are many R functions for computing distances between pairs of observations:

    1. dist() R base function [stats package]: Accepts only numeric data as an input.
    2. get_dist() function [factoextra package]: Accepts only numeric data as an input. Compared to the standard dist() function, it supports correlation-based distance measures, including the “pearson”, “kendall” and “spearman” methods.
    3. daisy() function [cluster package]: Able to handle other variable types (e.g. nominal, ordinal, (a)symmetric binary). In that case, Gower’s coefficient is automatically used as the metric. It’s one of the most popular measures of proximity for mixed data types.

0.4.0.3 Computing Euclidean distance

dist.eucl= dist(data, method = "euclidean")

## Other methods are “euclidean”, “maximum”, “manhattan”, “canberra”, “binary”, “minkowski”.

round(as.matrix(dist.eucl)[1:3,1:3],1)
##            New Mexico Iowa Indiana
## New Mexico        0.0  4.1     2.5
## Iowa              4.1  0.0     1.8
## Indiana           2.5  1.8     0.0

In this matrix, the values represent the distances between objects. The values on the diagonal represent the distance between each object and itself (which is zero).

Note that, if we want to compute pairwise distances between variables, we must start by transposing the data before using the dist() function. The function t() is used for transposing the data.
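For example, to get the pairwise distances between the four variables of the scaled subset used above (a small sketch, assuming the data object from the previous chunk):

# Distances between variables (columns) rather than observations (rows)
round(as.matrix(dist(t(data), method = "euclidean")), 1)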

0.4.0.4 Computing correlation-based distances

library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
dist.cor= get_dist(data, method = "pearson")
round(as.matrix(dist.cor)[1:3,1:3],1)
##            New Mexico Iowa Indiana
## New Mexico        0.0  1.7     2.0
## Iowa              1.7  0.0     0.3
## Indiana           2.0  0.3     0.0

0.4.0.5 Computing distances for mixed data

The function daisy() [cluster package] provides a solution (Gower’s metric) for computing the distance matrix when the data contain non-numeric columns. The flower dataset is an example of such data:

library(cluster)

# Load data

data(flower)
head(flower, 3)
##   V1 V2 V3 V4 V5 V6  V7 V8
## 1  0  1  1  4  3 15  25 15
## 2  1  0  0  2  1  3 150 50
## 3  0  1  0  3  3  1 150 50
# Data structure
str(flower)
## 'data.frame':    18 obs. of  8 variables:
##  $ V1: Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 2 2 ...
##  $ V2: Factor w/ 2 levels "0","1": 2 1 2 1 2 2 1 1 2 2 ...
##  $ V3: Factor w/ 2 levels "0","1": 2 1 1 2 1 1 1 2 1 1 ...
##  $ V4: Factor w/ 5 levels "1","2","3","4",..: 4 2 3 4 5 4 4 2 3 5 ...
##  $ V5: Ord.factor w/ 3 levels "1"<"2"<"3": 3 1 3 2 2 3 3 2 1 2 ...
##  $ V6: Ord.factor w/ 18 levels "1"<"2"<"3"<"4"<..: 15 3 1 16 2 12 13 7 4 14 ...
##  $ V7: num  25 150 150 125 20 50 40 100 25 100 ...
##  $ V8: num  15 50 50 50 15 40 20 15 15 60 ...
# Distance matrix
dist.mixed= daisy(flower)
round(as.matrix(dist.mixed)[1:3, 1:3], 2)
##      1    2    3
## 1 0.00 0.89 0.53
## 2 0.89 0.00 0.51
## 3 0.53 0.51 0.00

0.4.0.6 Visualizing distance matrices

library(factoextra)
fviz_dist(dist.eucl)

  • Red: high similarity (i.e., low dissimilarity) | Blue: low similarity

0.5 K-Means Clustering

K-means clustering (MacQueen 1967) is one of the most commonly used unsupervised machine learning algorithms for partitioning a given data set into a set of k groups (i.e., k clusters), where k represents the number of groups pre-specified by the analyst. It classifies objects into multiple groups (i.e., clusters), such that objects within the same cluster are as similar as possible (high intra-class similarity), whereas objects from different clusters are as dissimilar as possible (low inter-class similarity). In k-means clustering, each cluster is represented by its center (i.e., centroid), which corresponds to the mean of the points assigned to the cluster.

0.5.1 K-means basic ideas

Let \(\mathbf{x}^{d\times 1}_1,\mathbf{x}^{d\times 1}_2,\ldots,\mathbf{x}^{d\times 1}_n\) be \(n\) samples. K-means clustering aims to partition the \(n\) observations into \(k\;(\leq n)\) sets \(\mathbf{S}=\{S_1,S_2,\ldots,S_k\}\) so as to minimize the within-cluster sum of squares (WCSS) (i.e., variance). Formally, the objective is to find:

\[\underset{\mathbf{S}}{\operatorname{argmin}} \;{\sum}_{i=1}^k{\sum}_{\mathbf{x}\in S_i} \left\| \mathbf{x}- \boldsymbol{\mu}_i\right\|^2 = \underset{\mathbf{S}}{\operatorname{argmin}} \;{\sum}_{i=1}^k \mid S_i \mid \text{Var}(S_i)\] where \(\boldsymbol{\mu}_i\) is the mean (also called the centroid) of the points in \(S_i\), i.e., \[\boldsymbol{\mu}_i=\frac{1}{\mid S_i \mid} {\sum}_{\mathbf{x} \in S_i } \mathbf{x},\] \(\mid S_i \mid\) is the size of \(S_i\), and \(\left\|\cdot\right\|\) is the usual \(L^2\) norm. This is equivalent to minimizing the pairwise squared deviations of points in the same cluster: \[\underset{\mathbf{S}}{\operatorname{argmin}} \;{\sum}_{i=1}^k \frac{1}{\mid S_i \mid} {\sum}_{\mathbf{x}, \mathbf{y} \in S_i} \left\| \mathbf{x}- \mathbf{y}\right\|^2.\]

The equivalence can be deduced from the identity \(\mid S_i \mid {\sum}_{\mathbf{x} \in S_i} \left\| \mathbf{x}- \boldsymbol{\mu}_i\right\|^2= \frac{1}{2} {\sum}_{\mathbf{x}, \mathbf{y} \in S_i} \left\| \mathbf{x}- \mathbf{y}\right\|^2\). Since the total variance is constant, this is equivalent to maximizing the sum of squared deviations between points in different clusters (between-cluster sum of squares, BCSS). This deterministic relationship is also related to the law of total variance in probability theory.
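This decomposition (total SS = within SS + between SS) can be checked numerically on a k-means fit; a small sketch on the scaled USArrests data (the object name km.chk is arbitrary):

km.chk= kmeans(scale(USArrests), centers = 4, nstart = 25)
all.equal(km.chk$totss, km.chk$tot.withinss + km.chk$betweenss)  # TRUE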

There are several k-means algorithms available. The standard algorithm is the Hartigan-Wong algorithm (Hartigan and Wong 1979), which defines the total within-cluster variation as the sum of squared Euclidean distances between items and the corresponding centroid: \[W_{\text{total}}={\sum}_{i=1}^k{\sum}_{\mathbf{x}\in S_i} \left\|\mathbf{x}- \boldsymbol{\mu}_i\right\|^2.\]

0.5.2 Computing k-means clustering

0.5.2.1 Data

USArrests Dataset

0.5.2.2 Estimating the optimal number of clusters

The R function fviz_nbclust() [in factoextra package] provides a convenient solution to estimate the optimal number of clusters.

data= scale(USArrests)

library(factoextra)
fviz_nbclust(data, kmeans, method = "wss") + geom_vline(xintercept = 4, linetype = "dotted")

0.5.2.3 k-means clustering

Note that here we specify nstart = 25. This means that R will try 25 different random starting assignments and then select the best result, corresponding to the one with the lowest within-cluster variation. The default value of nstart in R is one, but it is strongly recommended to compute k-means clustering with a large value of nstart, such as 25 or 50, in order to obtain a more stable result.
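As a quick illustration (a sketch; the exact numbers depend on the random seed), a single random start can end up in a poorer local optimum than a multi-start run:

set.seed(42)
kmeans(data, 4, nstart = 1)$tot.withinss    # single start: possibly a worse local optimum
kmeans(data, 4, nstart = 25)$tot.withinss   # best of 25 starts: lowest within-cluster variation found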

set.seed(123)
km= kmeans(data,4,nstart = 25)
km
## K-means clustering with 4 clusters of sizes 8, 13, 16, 13
## 
## Cluster means:
##       Murder    Assault   UrbanPop        Rape
## 1  1.4118898  0.8743346 -0.8145211  0.01927104
## 2 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 3 -0.4894375 -0.3826001  0.5758298 -0.26165379
## 4  0.6950701  1.0394414  0.7226370  1.27693964
## 
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              4              4              1              4 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              4              3              3              4              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              2              4              3              2 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              2              1              2              4 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              3              4              2              1              4 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              2              2              4              2              3 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              4              4              1              2              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              3              3              3              3              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              2              1              4              3              2 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              3              3              2              2              3 
## 
## Within cluster sum of squares by cluster:
## [1]  8.316061 11.952463 16.212213 19.922437
##  (between_SS / total_SS =  71.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

It’s possible to compute the mean of each variable by cluster using the original data:

aggregate(USArrests, by=list(cluster=km$cluster), mean)
##   cluster   Murder   Assault UrbanPop     Rape
## 1       1 13.93750 243.62500 53.75000 21.41250
## 2       2  3.60000  78.53846 52.07692 12.17692
## 3       3  5.65625 138.87500 73.87500 18.78125
## 4       4 10.81538 257.38462 76.00000 33.19231

If we want to add the point classifications to the original data:

dd= cbind(USArrests, cluster = km$cluster)
head(dd)
##            Murder Assault UrbanPop Rape cluster
## Alabama      13.2     236       58 21.2       1
## Alaska       10.0     263       48 44.5       4
## Arizona       8.1     294       80 31.0       4
## Arkansas      8.8     190       50 19.5       1
## California    9.0     276       91 40.6       4
## Colorado      7.9     204       78 38.7       4

0.5.2.4 Visualizing k-means clusters

Now, we want to visualize the data in a scatter plot, coloring each data point according to its cluster assignment.

The problem is that the data contain more than two variables, and the question is which variables to choose for the xy scatter plot.

A solution is to reduce the number of dimensions by applying a dimensionality reduction algorithm, such as Principal Component Analysis (PCA), which operates on the four variables and outputs two new variables (summarizing the original ones) that can be used for the plot. In other words, for a multi-dimensional data set, we can perform PCA and plot the data points according to the first two principal component coordinates.
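Roughly, this is what a PCA-based cluster plot does; a manual sketch using the data and km objects from above:

pca= prcomp(data)                      # data is already scaled
plot(pca$x[, 1:2], col = km$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "K-means clusters on the first two principal components")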

The function fviz_cluster() [factoextra package] can be used to easily visualize k-means clusters. It takes the k-means results and the original data as arguments. In the resulting plot, observations are represented by points, using principal components if the number of variables is greater than 2. It’s also possible to draw a concentration ellipse around each cluster.

library(factoextra)
fviz_cluster(km,data,ellipse.type = "norm")

0.5.2.5 Alternative to k-means clustering

A robust alternative to k-means is PAM, which is based on medoids. PAM clustering can be computed using the function pam() [cluster package]. The function pamk() [fpc package] is a wrapper for PAM that also prints the suggested number of clusters based on the optimum average silhouette width.

0.6 K-Medoids Clustering: Partitioning Around Medoids (PAM)

  • The k-medoids algorithm is a clustering approach related to k-means clustering.

  • In k-medoids clustering, each cluster is represented by one of the data points in the cluster, called the cluster medoid.

  • The term medoid refers to an object within a cluster for which the average dissimilarity between it and all the other members of the cluster is minimal; it corresponds to the most centrally located point in the cluster. In other words, the algorithm minimizes the sum of the dissimilarities between objects and their closest selected medoid (a small sketch follows this list).

  • K-medoids is a robust alternative to k-means clustering, i.e., the algorithm is less sensitive to noise and outliers compared to k-means, because it uses medoids as cluster centers instead of means (used in k-means).

  • This method requires the user to specify k, the number of clusters to be generated (like in k-means clustering). A useful approach to determine the optimal number of clusters is the silhouette method.

  • The most common k-medoids clustering method is the PAM algorithm (Partitioning Around Medoids; Kaufman and Rousseeuw 1990).
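As a small sketch of the medoid idea, the observation with the smallest total dissimilarity to all other observations can be located directly from a distance matrix (find_medoid() is a hypothetical helper written for illustration, not a cluster-package function):

find_medoid= function(x) {
  d= as.matrix(dist(x))                 # pairwise dissimilarities
  rownames(d)[which.min(rowSums(d))]    # row with the smallest total dissimilarity
}
find_medoid(scale(USArrests))           # the single most central observation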

0.6.1 PAM algorithm

The PAM algorithm is based on the search for k representative objects, or medoids, among the observations of the data set. In summary, the PAM algorithm proceeds in two phases as follows:

Build phase:

  1. Select k objects to become the medoids, or, in case these objects were provided, use them as the medoids.

  2. Calculate the dissimilarity matrix if it was not provided.

  3. Assign every object to its closest medoid.

Swap phase:

  1. For each cluster, search whether any object of the cluster decreases the average dissimilarity coefficient; if it does, select the object that decreases this coefficient the most as the medoid for this cluster.

  2. If at least one medoid has changed, go back to step 3 of the build phase; otherwise, end the algorithm.

As mentioned above, the PAM algorithm works with a matrix of dissimilarity, and to compute this matrix the algorithm can use two metrics:

  1. The Euclidean distance, which is the root of the sum of squared differences; and

  2. The Manhattan distance, which is the sum of absolute differences.

Note that, in practice, we should get similar results most of the time using either the Euclidean or the Manhattan distance. If the data contain outliers, the Manhattan distance should give more robust results, whereas the Euclidean distance would be influenced by unusual values.
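A minimal sketch of the two choices with pam() (same data, k = 2); with no strong outliers the two partitions are usually very similar:

library(cluster)
pam.euc= pam(scale(USArrests), k = 2, metric = "euclidean")
pam.man= pam(scale(USArrests), k = 2, metric = "manhattan")
table(pam.euc$clustering, pam.man$clustering)  # cross-tabulate the two partitions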

0.6.2 Computing PAM

0.6.2.1 Data

USArrests Dataset

0.6.2.2 Required R packages and functions

The function pam() [cluster package] and pamk() [fpc package] can be used to compute PAM.

The function pamk() does not require a user to decide the number of clusters K.

To create a beautiful graph of the clusters generated with the pam() function, we will use the factoextra package.

library(cluster)
library(factoextra)

0.6.2.3 Estimating the optimal number of clusters

The R function fviz_nbclust() [in factoextra package] provides a convenient solution to estimate the optimal number of clusters. Here we’ll use the average silhouette method. A high average silhouette width indicates a good clustering.

data= scale(USArrests)

fviz_nbclust(data, pam, method = "silhouette")

0.6.2.4 Computing PAM

Pam= pam(data, 2)  # compute the PAM algorithm with k = 2
Pam
## Medoids:
##            ID     Murder    Assault   UrbanPop       Rape
## New Mexico 31  0.8292944  1.3708088  0.3081225  1.1603196
## Nebraska   27 -0.8008247 -0.8250772 -0.2445636 -0.5052109
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              1              1              2              1 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              1              2              2              1              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              2              2              1              2              2 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              2              2              1              2              1 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              2              1              2              1              1 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              2              2              1              2              2 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              1              1              1              2              2 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              2              2              2              2              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              2              1              1              2              2 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              2              2              2              2              2 
## Objective function:
##    build     swap 
## 1.441358 1.368969 
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"
The printed output shows:

- the cluster medoids: a matrix whose rows are the medoids and whose columns are variables

- the clustering vector: a vector of integers (from 1:k) indicating the cluster to which each point is allocated

If we want to add the point classifications to the original data:

dd= cbind(USArrests, cluster = Pam$clustering)
head(dd, n = 3)
##         Murder Assault UrbanPop Rape cluster
## Alabama   13.2     236       58 21.2       1
## Alaska    10.0     263       48 44.5       1
## Arizona    8.1     294       80 31.0       1

0.6.2.5 Visualizing PAM clusters

To visualize the partitioning results, we’ll use the function fviz_cluster() [factoextra package].

fviz_cluster(Pam,data,ellipse.type = "norm")

Note that, for large data sets, pam() may need too much memory or too much computation time. In this case, the function clara() is preferable. This should not be a problem for modern computers.

Note that there is a variant of PAM named CLARA (Clustering Large Applications, (Kaufman and Rousseeuw 1990)) which is used for analyzing large data sets. It is an extension to k-medoids (PAM) methods to deal with data containing a large number of objects (more than several thousand observations) in order to reduce computing time and RAM storage problem. This is achieved using the sampling approach.

Instead of finding medoids for the entire data set, CLARA considers a small sample of the data with fixed size (sampsize) and applies the PAM algorithm to generate an optimal set of medoids for the sample. The quality of resulting medoids is measured by the average dissimilarity between every object in the entire data set and the medoid of its cluster, defined as the cost function.

CLARA repeats the sampling and clustering processes a pre-specified number of times in order to minimize the sampling bias. The final clustering results correspond to the set of medoids with the minimal cost.
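A minimal CLARA sketch on the same (scaled) data; samples sets the number of random subsets drawn and pamLike = TRUE makes the swap phase match pam() more closely:

library(cluster)
library(factoextra)
res.clara= clara(scale(USArrests), k = 2, samples = 50, pamLike = TRUE)
res.clara$medoids                          # medoids found on the best sample
fviz_cluster(res.clara, ellipse.type = "norm")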

0.7 Agglomerative Hierarchical Clustering

Agglomerative clustering is the most common type of hierarchical clustering used to group objects into clusters based on their similarity. It works in a “bottom-up” manner:

  • Each object is initially considered as a single-element cluster (leaf).

  • At each step of the algorithm, the two clusters that are the most similar are combined into a new, bigger cluster (node).

  • This procedure is iterated until all points are members of just one single big cluster (root).

The steps to agglomerative hierarchical clustering are:

  1. Preparing the data (standardize the variables)

  2. Computing (dis)similarity information between every pair of objects in the data set.

  3. Using a linkage function to group objects into a hierarchical cluster tree, based on the distance information generated at step 2. Objects/clusters that are in close proximity are linked together using the linkage function.

  4. Determining where to cut the hierarchical tree into clusters. This creates a partition of the data.

Similarity measures

There are many methods to calculate the (dis)similarity information, including Euclidean and Manhattan distances. In R, we can use the function dist() to compute the distance between every pair of objects in a data set; the result is known as a distance or dissimilarity matrix.

Linkage

The linkage function takes the distance information, returned by the function dist(), and groups pairs of objects into clusters based on their similarity. Next, these newly formed clusters are linked to each other to create bigger clusters. This process is iterated until all the objects in the original data set are linked together in a hierarchical tree.

There are many cluster agglomeration methods (i.e., linkage methods). The most common linkage methods are:

  • Maximum or complete linkage

  • Minimum or single linkage

  • Mean or average linkage

  • Centroid linkage

  • Ward’s minimum variance method: it minimizes the total within-cluster variance. At each step, the pair of clusters with the minimum between-cluster distance is merged.

Complete linkage and Ward’s method are generally preferred.

0.7.1 Computing Agglomerative Hierarchical Clustering

The R base function hclust() can be used to create the hierarchical tree. Alternatively, the cluster package makes it easy to perform cluster analysis in R: it provides the functions agnes() and diana() for computing agglomerative and divisive clustering, respectively.

0.7.1.1 Data

USArrests Dataset

0.7.1.2 Required R packages and functions

The function hclust() [stats package] is used to compute hierarchical clustering.

Dendrograms correspond to the graphical representation of the hierarchical tree generated by the function hclust().

Dendrograms can be produced in R using the base function plot(). We can also use fviz_dend() [in factoextra package] to produce a beautiful dendrogram.

library(cluster)
library(factoextra)

0.7.1.3 Calculating dissimilarity matrix

data= scale(USArrests)

res.dist= dist(data, method = "euclidean")

dim(as.matrix(res.dist))
## [1] 50 50
round(as.matrix(res.dist)[1:10,1:10],2)
##             Alabama Alaska Arizona Arkansas California Colorado Connecticut
## Alabama        0.00   2.70    2.29     1.29       3.26     2.65        3.22
## Alaska         2.70   0.00    2.70     2.83       3.01     2.33        4.74
## Arizona        2.29   2.70    0.00     2.72       1.31     1.37        3.26
## Arkansas       1.29   2.83    2.72     0.00       3.76     2.83        2.61
## California     3.26   3.01    1.31     3.76       0.00     1.29        4.07
## Colorado       2.65   2.33    1.37     2.83       1.29     0.00        3.33
## Connecticut    3.22   4.74    3.26     2.61       4.07     3.33        0.00
## Delaware       2.02   3.62    1.91     1.80       3.07     2.55        1.76
## Florida        2.30   3.00    1.75     3.37       2.03     2.45        4.47
## Georgia        1.13   2.82    2.79     2.21       3.38     2.86        3.97
##             Delaware Florida Georgia
## Alabama         2.02    2.30    1.13
## Alaska          3.62    3.00    2.82
## Arizona         1.91    1.75    2.79
## Arkansas        1.80    3.37    2.21
## California      3.07    2.03    3.38
## Colorado        2.55    2.45    2.86
## Connecticut     1.76    4.47    3.97
## Delaware        0.00    3.06    2.98
## Florida         3.06    0.00    2.18
## Georgia         2.98    2.18    0.00

0.7.1.4 Computing Agglomerative Hierarchical Clustering with Linkage

Hc= hclust(d = res.dist, method = "ward.D2")

Note that the function hcut() [in factoextra package] can also be used to compute hierarchical clustering (hclust, agnes, diana) and cut the tree into k clusters.

0.7.1.5 Dendrogram

plot(Hc)

Similarly by using factoextra package:

library("factoextra")
fviz_dend(Hc, cex = 0.5)

0.7.1.6 Verify the cluster tree

After linking the objects in a data set into a hierarchical cluster tree, we usually want to assess that the distances (i.e., heights) in the tree reflect the original distances accurately.

  • One way to measure this, is to compute the correlation between the cophenetic distances and the original distance data generated by the dist() function.

  • If the clustering is valid, the linking of objects in the cluster tree should have a strong correlation with the distances between objects in the original distance matrix.

  • The closer the value of the correlation coefficient is to 1, the more accurately the clustering solution reflects your data. Values above 0.75 are felt to be good.

  • The R base function cophenetic() can be used to compute the cophenetic distances for hierarchical clustering.

# Compute cophentic distance
coph= cophenetic(Hc)

# Correlation between cophenetic distance and  the original distance
cor(coph, res.dist)
## [1] 0.6975266

Execute the hclust() function again using the average linkage method. Next, call cophenetic() to evaluate the clustering solution.

Hc2= hclust(res.dist, method = "average")

cor(res.dist, cophenetic(Hc2))
## [1] 0.7180382

The correlation coefficient shows that using a different linkage method creates a tree that represents the original distances slightly better.
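A compact way to compare several linkage methods at once is to loop over them and compute the cophenetic correlation for each (a sketch, reusing res.dist from above):

methods= c("average", "single", "complete", "ward.D2")
sapply(methods, function(m)
  cor(res.dist, cophenetic(hclust(res.dist, method = m))))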

0.7.1.7 Cut the dendrogram into different groups

One of the problems with hierarchical clustering is that, it does not tell us how many clusters there are, or where to cut the dendrogram to form clusters.

  • We can cut the hierarchical tree at a given height in order to partition your data into clusters.

  • The R base function cutree() can be used to cut a tree, generated by the hclust() function, into several groups either by specifying the desired number of groups or the cut height.

  • It returns a vector containing the cluster number of each observation.

# Cut tree into 4 groups
grp= cutree(Hc, k = 4)
head(grp, n = 4)
##  Alabama   Alaska  Arizona Arkansas 
##        1        2        2        3
# Number of members in each cluster
table(grp)
## grp
##  1  2  3  4 
##  7 12 19 12
# Get the names for the members of cluster 1
rownames(data)[grp == 1]
## [1] "Alabama"        "Georgia"        "Louisiana"      "Mississippi"   
## [5] "North Carolina" "South Carolina" "Tennessee"

The result of the cuts can be visualized easily using the function fviz_dend() [in factoextra]:

fviz_dend(Hc, cex = 0.5, k = 4, color_labels_by_k = TRUE)

Using the function fviz_cluster() [in factoextra], we can also visualize the result in a scatter plot. Note that this function cannot be applied to an object generated by the base function hclust(); instead, we use the hcut() function [factoextra] to perform the hierarchical clustering. Observations are represented by points in the plot, using principal components, and a frame is drawn around each cluster.

hc.cut= hcut(data, k = 4, hc_method = "ward.D2")
hc.cut
## 
## Call:
## stats::hclust(d = x, method = hc_method)
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 50
fviz_cluster(hc.cut, ellipse.type = "convex",show.clust.cent=T,geom = c("point", "text") )

0.7.1.8 Cluster R package

As discussed earlier, the R package cluster makes it easy to perform cluster analysis in R. The agnes() and diana() functions perform all the necessary steps for us, so we don’t need to execute the scale(), dist() and hclust() functions separately.

library("cluster")

# Agglomerative Nesting (Hierarchical Clustering)

res.agnes= agnes(x = USArrests,        # data matrix
                 stand = TRUE,         # standardize the data
                 metric = "euclidean", # metric for distance matrix
                 method = "ward"       # linkage method
                )

# DIvisive ANAlysis Clustering

res.diana= diana(x = USArrests,        # data matrix
                 stand = TRUE,         # standardize the data
                 metric = "euclidean"  # metric for distance matrix
                )

After running agnes() and diana(), you can use the function fviz_dend()[in factoextra] to visualize the output:

fviz_dend(res.agnes, cex = 0.6, k = 4)

0.8 Divisive Hierarchical Clustering

  • Divisive hierarchical clustering is the inverse of agglomerative clustering.

  • It starts by including all objects in a single large cluster.

  • At each step of iteration, the most heterogeneous cluster is divided into two.

  • The process is iterated until all objects are in their own cluster.

Recall that, divisive clustering is good at identifying large clusters while agglomerative clustering is good at identifying small clusters.

0.8.1 Computation

The R function diana() [cluster package] can be used to compute divisive clustering. It returns an object of class diana (see ?diana.object) which has also methods for the functions: print(), summary(), plot(), pltree(), as.dendrogram(), as.hclust() and cutree().

The output can be visualized as dendrograms using the function fviz_dend() [factoextra package]. For example:

# Compute diana()
library(cluster)
res.diana <- diana(USArrests, stand = TRUE)

# Plot the dendrogram
library(factoextra)
fviz_dend(res.diana, cex = 0.5,
          k = 4, # Cut in four groups
          palette = "jco" # Color palette
          )

0.9 Comparing Cluster Dendrograms

Now, we will see how to compare cluster dendrograms in R using the dendextend R package. Suppose we have computed two dendrograms by applying hierarchical clustering (HC) with two different linkage methods (“average” and “ward.D2”).

0.9.1 Data

data= scale(USArrests)

# Work with a small random subset of 10 of the 50 observations
set.seed(123)
id= sample(1:50, 10)
data= data[id,]
library(dendextend)
library(dplyr)
# Compute distance matrix
res.dist= dist(data, method = "euclidean")

# Compute 2 hierarchical clusterings
hc1= hclust(res.dist, method = "average")
hc2= hclust(res.dist, method = "ward.D2")

# Create two dendrograms
dend1= as.dendrogram (hc1)
dend2= as.dendrogram (hc2)

# Create a list to hold dendrograms
dend_list= dendlist(dend1, dend2)

0.9.1.1 Visual comparison of two dendrograms

To visually compare two dendrograms, we’ll use the following R functions [dendextend package]:

  • untangle(): finds the best layout to align dendrogram lists, using heuristic methods

  • tanglegram(): plots the two dendrograms, side by side, with their labels connected by lines.

  • entanglement(): computes the quality of the alignment of the two trees. Entanglement is a measure between 1 (full entanglement) and 0 (no entanglement). A lower entanglement coefficient corresponds to a good alignment.

Draw a tanglegram:

# Align and plot two dendrograms side by side
dendlist(dend1, dend2) %>%
  untangle(method = "step1side") %>% # Find the best alignment layout
  tanglegram()                       # Draw the two dendrograms

# Compute alignment quality. Lower value = good alignment quality
dendlist(dend1, dend2) %>%
  untangle(method = "step1side") %>% # Find the best alignment layout
  entanglement()                     # Alignment quality
## [1] 0

0.9.1.2 Correlation matrix between a list of dendrogams

  • The function cor.dendlist() is used to compute the “Baker” or “Cophenetic” correlation matrix between a list of trees.

  • The value can range from -1 to 1, with values near 0 meaning that the two trees are not statistically similar.

# Cophenetic correlation matrix
cor.dendlist(dend_list, method = "cophenetic")
##           [,1]      [,2]
## [1,] 1.0000000 0.9925544
## [2,] 0.9925544 1.0000000
# Baker correlation matrix
cor.dendlist(dend_list, method = "baker")
##           [,1]      [,2]
## [1,] 1.0000000 0.9895528
## [2,] 0.9895528 1.0000000

The correlation between two trees can be also computed as follow:

# Cophenetic correlation coefficient
cor_cophenetic(dend1, dend2)
## [1] 0.9925544
# Baker correlation coefficient
cor_bakers_gamma(dend1, dend2)
## [1] 0.9895528

It’s also possible to compare simultaneously multiple dendrograms:

# Create multiple dendrograms by chaining
dend1= data %>% dist %>% hclust("complete") %>% as.dendrogram
dend2= data %>% dist %>% hclust("single") %>% as.dendrogram
dend3= data %>% dist %>% hclust("average") %>% as.dendrogram
dend4= data %>% dist %>% hclust("centroid") %>% as.dendrogram

# Compute correlation matrix
dend_list= dendlist("Complete" = dend1, "Single" = dend2,
                    "Average" = dend3, "Centroid" = dend4)

cors= cor.dendlist(dend_list)

# Print correlation matrix
round(cors, 2)
##          Complete Single Average Centroid
## Complete     1.00   0.46    0.45     0.30
## Single       0.46   1.00    0.23     0.17
## Average      0.45   0.23    1.00     0.31
## Centroid     0.30   0.17    0.31     1.00
# Visualize the correlation matrix using corrplot package
library(corrplot)
## corrplot 0.92 loaded
corrplot(cors, "pie", "lower")

0.10 Computing P-value for Hierarchical Clustering

Clusters can be found in a data set by chance due to clustering noise or sampling error. The R package pvclust (Suzuki and Shimodaira 2015) uses bootstrap resampling techniques to compute a p-value for each cluster in a hierarchical clustering.

The algorithm for computing p-values is as follows:

  1. Generate thousands of bootstrap samples by randomly sampling elements of the data.

  2. Compute hierarchical clustering on each bootstrap copy.

  3. For each cluster:

    1. Compute the bootstrap probability (BP) value, which corresponds to the frequency with which the cluster is identified in the bootstrap copies.
    2. Compute the approximately unbiased (AU) probability value (p-value) by multiscale bootstrap resampling.

Clusters with AU >= 95% are considered to be strongly supported by data.

The function pvclust() can be used as follow:

pvclust(data, method.hclust = "average",
        method.dist = "correlation", nboot = 1000)
--------------------------------------------------------------------------------------------------------------
data: numeric data matrix or data frame.
--------------------------------------------------------------------------------------------------------------
method.hclust: the agglomerative method used in hierarchical clustering. Possible values are one of “average”,
               “ward”, “single”, “complete”,“mcquitty”, “median” or “centroid”. The default is “average”.
               See method argument in ?hclust.
--------------------------------------------------------------------------------------------------------------
method.dist: the distance measure to be used. Possible values are one of “correlation”, “uncentered”,
             “abscor” or those which are allowed for method argument in dist() function, such
             “euclidean” and “manhattan”.
--------------------------------------------------------------------------------------------------------------
nboot: the number of bootstrap replications. The default is 1000.
--------------------------------------------------------------------------------------------------------------
iseed: an integer for random seeds. Use the iseed argument to achieve reproducible results.
--------------------------------------------------------------------------------------------------------------

The function pvclust() returns an object of class pvclust containing many elements, including hclust, which holds the hierarchical clustering result for the original data as generated by the function hclust().
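A minimal usage sketch; note that pvclust() clusters the columns of its input, so applying it to the scaled USArrests data groups the four crime variables (nboot could be reduced for a quicker run):

library(pvclust)
data= scale(USArrests)
res.pv= pvclust(data, method.hclust = "average",
                method.dist = "correlation", nboot = 1000)
plot(res.pv)                  # dendrogram annotated with AU/BP values
pvrect(res.pv, alpha = 0.95)  # highlight clusters with AU >= 95%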

0.11 Hierarchical K-Means Clustering: Optimize Clusters

K-means represents one of the most popular clustering algorithms. However, it has some limitations:

  • It requires the user to specify the number of clusters in advance and selects the initial centroids randomly.

  • The final k-means clustering solution is very sensitive to this initial random selection of cluster centers.

  • The result might be (slightly) different each time you compute k-means.

0.11.1 Algorithm

The algorithm is summarized as follows (a manual sketch is given after the list):

  1. Compute hierarchical clustering and cut the tree into k-clusters.

  2. Compute the center (i.e., the mean) of each cluster.

  3. Compute k-means by using the set of cluster centers (defined in step 2) as the initial cluster centers.
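A manual sketch of these three steps (hkmeans() wraps essentially this workflow; the object names are illustrative):

data= scale(USArrests)
res.hc= hclust(dist(data), method = "ward.D2")              # step 1: hierarchical tree ...
grp.hc= cutree(res.hc, k = 4)                               # ... cut into k clusters
ctrs= aggregate(data, by = list(grp.hc), FUN = mean)[, -1]  # step 2: cluster means
km.hk= kmeans(data, centers = as.matrix(ctrs))              # step 3: k-means seeded with those means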

0.11.2 Computation

The R function hkmeans() [in factoextra package] provides an easy solution to compute hierarchical k-means clustering.

# Compute hierarchical k-means clustering
library(factoextra)

data= scale(USArrests)

hk= hkmeans(data, 4)
# Elements returned by hkmeans()
names(hk)
##  [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
##  [6] "betweenss"    "size"         "iter"         "ifault"       "data"        
## [11] "hclust"

To print all the results:

hk
## Hierarchical K-means clustering with 4 clusters of sizes 8, 13, 16, 13
## 
## Cluster means:
##       Murder    Assault   UrbanPop        Rape
## 1  1.4118898  0.8743346 -0.8145211  0.01927104
## 2  0.6950701  1.0394414  0.7226370  1.27693964
## 3 -0.4894375 -0.3826001  0.5758298 -0.26165379
## 4 -0.9615407 -1.1066010 -0.9301069 -0.96676331
## 
## Clustering vector:
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              2              2              1              2 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              3              3              2              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              4              2              3              4 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              4              1              4              2 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              3              2              4              1              2 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              4              4              2              4              3 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              2              2              1              4              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              3              3              3              3              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              4              1              2              3              4 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              3              3              4              4              3 
## 
## Within cluster sum of squares by cluster:
## [1]  8.316061 19.922437 16.212213 11.952463
##  (between_SS / total_SS =  71.2 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
##  [6] "betweenss"    "size"         "iter"         "ifault"       "data"        
## [11] "hclust"
# Visualize the tree
fviz_dend(hk, cex = 0.6, palette = "jco", 
          rect = TRUE, rect_border = "jco", rect_fill = TRUE)

# Visualize the hkmeans final clusters
fviz_cluster(hk, palette = "jco", repel = TRUE,
             ggtheme = theme_classic())

0.12 Fuzzy Clustering

Fuzzy clustering is considered a soft clustering method, in which each element has a probability of belonging to each cluster. In other words, each element has a set of membership coefficients corresponding to the degree of belonging to a given cluster.

This is different from k-means and k-medoids clustering, where each object is assigned to exactly one cluster. K-means and k-medoids clustering are known as hard or non-fuzzy clustering.

In fuzzy clustering, points close to the center of a cluster may belong to the cluster to a higher degree than points at the edge of the cluster. The degree to which an element belongs to a given cluster is a numerical value varying from 0 to 1.

The fuzzy c-means (FCM) algorithm is one of the most widely used fuzzy clustering algorithms. The centroid of a cluster is calculated as the mean of all points, weighted by their degree of belonging to the cluster.


We’ll use the following R packages: 1) cluster for computing fuzzy clustering and 2) factoextra for visualizing clusters.

The function fanny() [cluster R package] can be used to compute fuzzy clustering. FANNY stands for fuzzy analysis clustering. A short usage sketch follows the argument table below.

A simplified format is:

fanny(x, k, metric = "euclidean", stand = FALSE)
-----------------------------------------------------------------------------------
x: A data matrix or data frame or dissimilarity matrix
-----------------------------------------------------------------------------------
k: The desired number of clusters to be generated
-----------------------------------------------------------------------------------
metric: Metric for calculating dissimilarities between observations
-----------------------------------------------------------------------------------
stand: If TRUE, variables are standardized before calculating the dissimilarities
-----------------------------------------------------------------------------------
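A short usage sketch on the scaled USArrests data, with k = 2 chosen purely for illustration (fviz_silhouette() works here because the fanny object carries silhouette information):

library(cluster)
library(factoextra)
data= scale(USArrests)
res.fanny= fanny(data, k = 2, metric = "euclidean", stand = FALSE)
head(res.fanny$membership, 3)   # membership coefficients of the first states
fviz_cluster(res.fanny, ellipse.type = "norm", repel = TRUE)
fviz_silhouette(res.fanny)      # silhouette of the resulting partition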

Note that there are more advanced clustering methods available, such as model-based clustering and DBSCAN (density-based clustering).

0.13 References

  1. Boehmke, Bradley, and Brandon Greenwell. Hands-On Machine Learning with R.

  2. Ester, Martin, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.” In, 226–31. AAAI Press.

  3. Kaufman, Leonard, and Peter Rousseeuw. 1990. Finding Groups in Data: An Introduction to Cluster Analysis.