Week 9: K-means and hierarchical clustering
We will cover: distance measures, k-means clustering, hierarchical clustering, and choosing the number of clusters.
Clustering is easy when we can see the clusters ourselves, but what an algorithm sees might be quite different.
Let \(A=(x_{a1}, x_{a2}, ..., x_{ap})\) and \(B=(x_{b1}, x_{b2}, ..., x_{bp})\).
Euclidean
\[\begin{align*} d_{EUC}(A, B) &= \sqrt{\sum_{j=1}^p (x_{aj}-x_{bj})^2} \\ &= \sqrt{(X_A-X_B)^\top (X_A-X_B)} \end{align*}\]
Other distance metrics
Mahalanobis (or statistical) distance: \(\sqrt{(X_A-X_B)^\top S^{-1} (X_A-X_B)}\), where \(S\) is the sample variance-covariance matrix.
Manhattan: \(\sum_{j=1}^p|X_{aj}-X_{bj}|\)
Minkowski: \(\left(\sum_{j=1}^p|X_{aj}-X_{bj}|^m\right)^{1/m}\)
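As a rough sketch of how these might be computed in R: `dist()` covers Euclidean, Manhattan and Minkowski, and the Mahalanobis distance can be done by hand. The observations and the simulated matrix used to estimate \(S\) below are made up for illustration.

```r
# Two illustrative observations (made-up values)
A <- c(1, 2, 3, 4)
B <- c(2, 4, 1, 3)

dist(rbind(A, B), method = "euclidean")         # sqrt(sum((A - B)^2))
dist(rbind(A, B), method = "manhattan")         # sum(abs(A - B))
dist(rbind(A, B), method = "minkowski", p = 4)  # Minkowski with m = 4

# Mahalanobis distance: S is the sample variance-covariance matrix,
# estimated here from simulated data so that it is invertible
X <- matrix(rnorm(100 * 4), ncol = 4)
S <- var(X)
sqrt(t(A - B) %*% solve(S) %*% (A - B))
```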
Count data
Canberra: \(\frac{1}{n_z}\sum_{j=1}^p\frac{|X_{aj}-X_{bj}|}{X_{aj}+X_{bj}}\), where \(n_z\) is the number of pairs with a non-zero denominator.
Bray-Curtis: \(\frac{\sum_{j=1}^p|X_{aj}-X_{bj}|}{\sum_{j=1}^p(X_{aj}+X_{bj})}\)
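A small sketch for the count-data distances: base R's `dist()` has a Canberra option (without the \(1/n_z\) scaling), and Bray-Curtis can be computed by hand or, assuming the `vegan` package is available, with `vegan::vegdist()`. The count vectors are made up.

```r
# Two made-up count vectors
a <- c(10, 0, 3, 4)
b <- c( 8, 1, 0, 4)

dist(rbind(a, b), method = "canberra")   # Canberra (unscaled version)

# Bray-Curtis by hand
sum(abs(a - b)) / sum(a + b)
# or, if the vegan package is installed:
# vegan::vegdist(rbind(a, b), method = "bray")
```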
Categorical variables
Mixed variable types
Other
Hamming: for binary variables, the number of variables at which the values differ.
Cosine: \(\frac{\sum_{j=1}^p X_{aj}X_{bj}}{||X_a||\,||X_b||}\) (How does this compare to a calculation of correlation?)
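A minimal sketch of the cosine calculation (the helper `cosine_sim()` is just an illustration, not a package function). The connection to correlation: correlation is the cosine similarity of the *centred* vectors.

```r
# Cosine similarity; 1 - similarity gives a dissimilarity
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

a1 <- c(7.3, 7.6, 7.7, 8.0)
a2 <- c(7.4, 7.2, 7.3, 7.2)

cosine_sim(a1, a2)
cor(a1, a2)   # correlation = cosine similarity after centring each vector
```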
V1 | V2 | V3 | V4 | point |
---|---|---|---|---|
7.3 | 7.6 | 7.7 | 8.0 | a1 |
7.4 | 7.2 | 7.3 | 7.2 | a2 |
4.1 | 4.6 | 4.6 | 4.8 | a3 |
Compute Euclidean distance between \(a_1\) and \(a_2\).
🔑 Standardise your variables! Otherwise variables on larger scales dominate the distance.
Could you compute a correlation distance, \(d_{cor} = 1-|r|\)? Is \(a_1\) closer to \(a_3\) than to \(a_2\)?
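Working this through in R with the three observations above:

```r
a1 <- c(7.3, 7.6, 7.7, 8.0)
a2 <- c(7.4, 7.2, 7.3, 7.2)
a3 <- c(4.1, 4.6, 4.6, 4.8)

# Euclidean: a1 is much closer to a2 than to a3
sqrt(sum((a1 - a2)^2))
sqrt(sum((a1 - a3)^2))

# Correlation distance d = 1 - |r|: a1 is closer to a3 than to a2,
# because a1 and a3 follow the same increasing pattern across V1-V4
1 - abs(cor(a1, a2))
1 - abs(cor(a1, a3))
```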
Anything can be a distance if it follows these rules: \(d(A,B) \geq 0\); \(d(A,A) = 0\); \(d(A,B) = d(B,A)\); and the triangle inequality \(d(A,B) \leq d(A,C) + d(C,B)\).
Distance is a dissimilarity measure because small means close and large means far.
Correlation is a similarity measure because the larger the value the closer the objects. It can be converted to a dissimilarity with a transformation such as \(d_{cor} = 1-|r|\).
This is an iterative procedure. To use it the number of clusters, \(k\), must be decided first. The stages of the iteration are:

1. Choose \(k\) initial seed means.
2. Compute the distance between each observation and each mean.
3. Assign each observation to the cluster of its closest mean.
4. Recompute the cluster means.
5. Repeat steps 2-4 until the cluster memberships no longer change.

Example data, 12 observations on two variables:
lbl | x1 | x2 |
---|---|---|
a | 16 | 4 |
b | 19 | 8 |
c | 14 | 4 |
d | 19 | 9 |
e | 10 | 21 |
f | 7 | 19 |
g | 1 | 20 |
h | 2 | 15 |
i | 3 | 6 |
j | 3 | 7 |
k | 6 | 2 |
l | 6 | 5 |
Select \(k=2\), and set initial seed means
\(\bar{x}_1=\) (10, 11) , \(\bar{x}_2=\) (11, 9)
Compute distances \((d_1, d_2)\) between each observation and each mean,
\(\bar{x}_1=\) (10, 11) , \(\bar{x}_2=\) (11, 9)
lbl | x1 | x2 | d1 | d2 |
---|---|---|---|---|
a | 16 | 4 | 9.2 | 7.1 |
b | 19 | 8 | 9.5 | 8.1 |
c | 14 | 4 | 8.1 | 5.8 |
d | 19 | 9 | 9.2 | 8.0 |
e | 10 | 21 | 10.0 | 12.0 |
f | 7 | 19 | 8.5 | 10.8 |
g | 1 | 20 | 12.7 | 14.9 |
h | 2 | 15 | 8.9 | 10.8 |
i | 3 | 6 | 8.6 | 8.5 |
j | 3 | 7 | 8.1 | 8.2 |
k | 6 | 2 | 9.8 | 8.6 |
l | 6 | 5 | 7.2 | 6.4 |
Assign each observation to a cluster, based on which mean is closest.
lbl | x1 | x2 | d1 | d2 | cl |
---|---|---|---|---|---|
a | 16 | 4 | 9.2 | 7.1 | 2 |
b | 19 | 8 | 9.5 | 8.1 | 2 |
c | 14 | 4 | 8.1 | 5.8 | 2 |
d | 19 | 9 | 9.2 | 8.0 | 2 |
e | 10 | 21 | 10.0 | 12.0 | 1 |
f | 7 | 19 | 8.5 | 10.8 | 1 |
g | 1 | 20 | 12.7 | 14.9 | 1 |
h | 2 | 15 | 8.9 | 10.8 | 1 |
i | 3 | 6 | 8.6 | 8.5 | 2 |
j | 3 | 7 | 8.1 | 8.2 | 1 |
k | 6 | 2 | 9.8 | 8.6 | 2 |
l | 6 | 5 | 7.2 | 6.4 | 2 |
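A sketch of this first iteration in R, reproducing the d1, d2 and cluster columns above:

```r
# The 12 observations from the table
df <- data.frame(
  lbl = letters[1:12],
  x1  = c(16, 19, 14, 19, 10,  7,  1,  2,  3,  3,  6,  6),
  x2  = c( 4,  8,  4,  9, 21, 19, 20, 15,  6,  7,  2,  5)
)

# Initial seed means
m1 <- c(10, 11)
m2 <- c(11,  9)

# Euclidean distance from each observation to each mean, then assign
df$d1 <- sqrt((df$x1 - m1[1])^2 + (df$x2 - m1[2])^2)
df$d2 <- sqrt((df$x1 - m2[1])^2 + (df$x2 - m2[2])^2)
df$cl <- ifelse(df$d1 < df$d2, 1, 2)
df
```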
Recompute means, and re-assign the cluster membership
\(\bar{x}_1=\) (5, 16) , \(\bar{x}_2=\) (12, 5)
lbl | x1 | x2 | d1 | d2 | cl |
---|---|---|---|---|---|
a | 16 | 4 | 16.8 | 4.4 | 2 |
b | 19 | 8 | 16.7 | 7.6 | 2 |
c | 14 | 4 | 15.6 | 2.6 | 2 |
d | 19 | 9 | 16.2 | 8.0 | 2 |
e | 10 | 21 | 7.1 | 15.7 | 1 |
f | 7 | 19 | 3.5 | 14.4 | 1 |
g | 1 | 20 | 5.1 | 18.2 | 1 |
h | 2 | 15 | 3.0 | 13.7 | 1 |
i | 3 | 6 | 10.5 | 8.9 | 2 |
j | 3 | 7 | 9.5 | 9.0 | 2 |
k | 6 | 2 | 14.5 | 6.8 | 2 |
l | 6 | 5 | 11.5 | 5.9 | 2 |
Recompute means, and re-assign the cluster membership
\(\bar{x}_1=\) (5, 19) , \(\bar{x}_2=\) (11, 6)
lbl | x1 | x2 | d1 | d2 | cl |
---|---|---|---|---|---|
a | 16 | 4 | 18.4 | 5.5 | 2 |
b | 19 | 8 | 17.7 | 8.6 | 2 |
c | 14 | 4 | 17.3 | 3.6 | 2 |
d | 19 | 9 | 17.1 | 8.9 | 2 |
e | 10 | 21 | 5.5 | 15.4 | 1 |
f | 7 | 19 | 2.0 | 13.9 | 1 |
g | 1 | 20 | 4.2 | 17.4 | 1 |
h | 2 | 15 | 4.8 | 12.8 | 1 |
i | 3 | 6 | 12.9 | 7.8 | 2 |
j | 3 | 7 | 11.9 | 7.9 | 2 |
k | 6 | 2 | 16.8 | 6.0 | 2 |
l | 6 | 5 | 13.8 | 4.8 | 2 |
Recompute means, and re-assign the cluster membership
\(\bar{x}_1=\) (5, 19) , \(\bar{x}_2=\) (11, 6)
lbl | x1 | x2 | d1 | d2 | cl |
---|---|---|---|---|---|
a | 16 | 4 | 18.4 | 5.5 | 2 |
b | 19 | 8 | 17.7 | 8.6 | 2 |
c | 14 | 4 | 17.3 | 3.6 | 2 |
d | 19 | 9 | 17.1 | 8.9 | 2 |
e | 10 | 21 | 5.5 | 15.4 | 1 |
f | 7 | 19 | 2.0 | 13.9 | 1 |
g | 1 | 20 | 4.2 | 17.4 | 1 |
h | 2 | 15 | 4.8 | 12.8 | 1 |
i | 3 | 6 | 12.9 | 7.8 | 2 |
j | 3 | 7 | 11.9 | 7.9 | 2 |
k | 6 | 2 | 16.8 | 6.0 | 2 |
l | 6 | 5 | 13.8 | 4.8 | 2 |
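The last two iterations give the same means and memberships, so the procedure has converged. Continuing the sketch above, the recompute/re-assign steps can be wrapped in a loop that stops when the memberships stabilise:

```r
repeat {
  # Recompute the means from the current memberships
  m1 <- colMeans(df[df$cl == 1, c("x1", "x2")])
  m2 <- colMeans(df[df$cl == 2, c("x1", "x2")])
  # Re-assign each observation to its closest mean
  d1 <- sqrt((df$x1 - m1[1])^2 + (df$x2 - m1[2])^2)
  d2 <- sqrt((df$x1 - m2[1])^2 + (df$x2 - m2[2])^2)
  new_cl <- ifelse(d1 < d2, 1, 2)
  if (all(new_cl == df$cl)) break   # no change: converged
  df$cl <- new_cl
}
round(rbind(m1, m2))   # final means, approximately (5, 19) and (11, 6)
```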
We know there are three clusters, but generally we don’t know this.
Will \(k=3\)-means clustering see three?
Fit for various values of \(k\). Add cluster label to data.
Examine solution in plots of the data.
Compute cluster metrics.
NOTE: use `set.seed()` because the results can depend on the random initialisation.
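In practice this is done with `kmeans()`; a minimal sketch using the data frame `df` built above (the seed value is arbitrary):

```r
set.seed(912)   # results can depend on the random initialisation
km <- kmeans(df[, c("x1", "x2")], centers = 3, nstart = 25)
df$cl_km <- factor(km$cluster)   # add the cluster label to the data
km$centers                       # cluster means
# Examine the solution in plots of the data, e.g. with ggplot2:
# ggplot(df, aes(x1, x2, colour = cl_km)) + geom_point()
```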
\(n\times n\) distance matrix
a | b | c | d | e | f | g | h | i | j | k | l | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
a | 0.0 | 5.0 | 2.0 | 5.8 | 18.0 | 17.5 | 21.9 | 17.8 | 13.2 | 13.3 | 10.2 | 10.0 |
b | 5.0 | 0.0 | 6.4 | 1.0 | 15.8 | 16.3 | 21.6 | 18.4 | 16.1 | 16.0 | 14.3 | 13.3 |
c | 2.0 | 6.4 | 0.0 | 7.1 | 17.5 | 16.6 | 20.6 | 16.3 | 11.2 | 11.4 | 8.2 | 8.1 |
d | 5.8 | 1.0 | 7.1 | 0.0 | 15.0 | 15.6 | 21.1 | 18.0 | 16.3 | 16.1 | 14.8 | 13.6 |
e | 18.0 | 15.8 | 17.5 | 15.0 | 0.0 | 3.6 | 9.1 | 10.0 | 16.6 | 15.7 | 19.4 | 16.5 |
f | 17.5 | 16.3 | 16.6 | 15.6 | 3.6 | 0.0 | 6.1 | 6.4 | 13.6 | 12.6 | 17.0 | 14.0 |
g | 21.9 | 21.6 | 20.6 | 21.1 | 9.1 | 6.1 | 0.0 | 5.1 | 14.1 | 13.2 | 18.7 | 15.8 |
h | 17.8 | 18.4 | 16.3 | 18.0 | 10.0 | 6.4 | 5.1 | 0.0 | 9.1 | 8.1 | 13.6 | 10.8 |
i | 13.2 | 16.1 | 11.2 | 16.3 | 16.6 | 13.6 | 14.1 | 9.1 | 0.0 | 1.0 | 5.0 | 3.2 |
j | 13.3 | 16.0 | 11.4 | 16.1 | 15.7 | 12.6 | 13.2 | 8.1 | 1.0 | 0.0 | 5.8 | 3.6 |
k | 10.2 | 14.3 | 8.2 | 14.8 | 19.4 | 17.0 | 18.7 | 13.6 | 5.0 | 5.8 | 0.0 | 3.0 |
l | 10.0 | 13.3 | 8.1 | 13.6 | 16.5 | 14.0 | 15.8 | 10.8 | 3.2 | 3.6 | 3.0 | 0.0 |
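This matrix can be reproduced from the data above with `dist()` (Euclidean by default):

```r
d  <- dist(df[, c("x1", "x2")])   # pairwise Euclidean distances
dm <- round(as.matrix(d), 1)
dimnames(dm) <- list(df$lbl, df$lbl)
dm
```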
What is the distance between the new cluster (b,d) and each of the other observations? It depends on how we combine the distances between points in the cluster and points not in the cluster: this is the choice of linkage.
Distance (b,d): 1.0
Distance (a,c): 2.0
Linkage between (b,d) and (a,c):
Single: minimum of the four between-cluster distances, min(5.0, 6.4, 5.8, 7.1) = 5.0
Complete: maximum, max(5.0, 6.4, 5.8, 7.1) = 7.1
Average: mean, (5.0 + 6.4 + 5.8 + 7.1)/4 ≈ 6.1
Centroid: distance between the cluster centroids (19, 8.5) and (15, 4) ≈ 6.0
Cut the tree to partition the data into \(k\) clusters.
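A sketch with `hclust()`, using the distance object `d` from above; the `method` argument chooses the linkage, and `cutree()` cuts the dendrogram:

```r
hc_single   <- hclust(d, method = "single")
hc_complete <- hclust(d, method = "complete")
hc_average  <- hclust(d, method = "average")
hc_centroid <- hclust(d, method = "centroid")   # usually used with squared Euclidean distances

plot(hc_average, labels = df$lbl)   # dendrogram
cutree(hc_average, k = 3)           # partition into k = 3 clusters
```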
Model-in-the-data-space
We know there are three clusters, but generally we don’t know this.
Will hierarchical clustering, cut at \(k=3\), see three?
Fit for various values of \(k\). Add cluster label to data.
Examine solution in plots of the data.
Compute cluster metrics.
NOTE: No need for `set.seed()` because the results are deterministic.
`within.cluster.ss` and `wb.ratio` suggest 3, and 5. `pearsongamma` (Hubert) suggests 2-3. `dunn`, `dunn2`, `ch` all suggest 2?
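One way to obtain these metrics is `cluster.stats()` from the `fpc` package (assuming it is installed), which compares a clustering against the distance matrix; a sketch using the small example data from above:

```r
library(fpc)

set.seed(912)
km3 <- kmeans(df[, c("x1", "x2")], centers = 3, nstart = 25)
cs  <- cluster.stats(d, km3$cluster)

cs$within.cluster.ss   # total within-cluster sum of squares (smaller is better)
cs$wb.ratio            # average within-cluster / between-cluster distance
cs$pearsongamma        # Hubert's gamma
cs$dunn                # Dunn index
cs$dunn2
cs$ch                  # Calinski-Harabasz index
```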