ETC3250/5250 Introduction to Machine Learning

Week 9: K-means and hierarchical clustering

Professor Di Cook

Department of Econometrics and Business Statistics

Overview

We will cover:

  • Defining distance measures
  • \(k\)-means algorithm
  • Hierarchical algorithms
  • Making and using dendrograms

Cluster analysis


  • The aim of cluster analysis is to group cases (objects) according to their similarity on the variables. It is also often called unsupervised classification, meaning that classification is the ultimate goal, but the classes (groups) are not known ahead of time.
  • Hence the first task in cluster analysis is to construct the class information. To determine closeness we start with measuring the interpoint distances.

Cluster these!

It’s easy if we can see the clusters, but what an algorithm sees might be quite different.

Seeing the clusters using spin-and-brush

library(detourr)
set.seed(645)
# Spin-and-brush: grand tour of the four standardised measurements (bl to bm)
detour(p_std[, 2:5],
       tour_aes(projection = bl:bm)) |>
  tour_path(grand_tour(2), fps = 60, max_bases = 40) |>
  show_scatter(alpha = 0.7, axes = FALSE)

How do you measure “close”?

Common interpoint distance measures

Let \(A=(x_{a1}, x_{a2}, ..., x_{ap})\) and \(B=(x_{b1}, x_{b2}, ..., x_{bp})\).

Euclidean

\[\begin{align*} d_{EUC}(A, B) &= \sqrt{\sum_{j=1}^p (x_{aj}-x_{bj})^2} &\\ &= \sqrt{((X_A-X_B)^\top (X_A-X_B))}& \end{align*}\]

Other distance metrics

  • Mahalanobis (or statistical) distance: \(\sqrt{((X_A-X_B)^\top S^{-1} (X_A-X_B))}\)

  • Manhattan: \(\sum_{j=1}^p|(X_{aj}-X_{bj})|\)

  • Minkowski: \((\sum_{j=1}^p|(X_{aj}-X_{bj})|^m)^{1/m}\)
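These can be computed in R with the base dist() function; a minimal sketch on two made-up observations (Mahalanobis distance is not handled by dist() and needs an estimate of \(S\)):

A <- c(7.3, 7.6, 7.7, 8.0)   # two toy observations with p = 4 variables
B <- c(7.4, 7.2, 7.3, 7.2)
X <- rbind(A, B)

dist(X, method = "euclidean")          # square root of the sum of squared differences
dist(X, method = "manhattan")          # sum of absolute differences
dist(X, method = "minkowski", p = 3)   # Minkowski with m = 3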

Count data

  • Canberra: \(\frac{1}{n_z}\sum_{j=1}^p\frac{|X_{aj}-X_{bj}|}{X_{aj}+X_{bj}}\)

  • Bray-Curtis: \(\frac{\sum_{j=1}^p|X_{aj}-X_{bj}|}{\sum_{j=1}^p(X_{aj}+X_{bj})}\)

Categorical variables

  • One minus the simple matching coefficient: \(1-(\text{number of matches})/p\)
  • Convert to dummy variables, and use Euclidean distance

Mixed variable types

  • Gower’s distance

Other

  • Hamming: for binary variables, the number of variables at which the values differ.

  • Cosine: \(\frac{\sum_{j=1}^p X_{aj}X_{bj}}{||X_a||\, ||X_b||}\) (How does this compare to correlation?)
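As a hint for the question above: correlation is the cosine similarity of the mean-centred vectors. A minimal sketch, with made-up values:

cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

a <- c(7.3, 7.6, 7.7, 8.0)
b <- c(7.4, 7.2, 7.3, 7.2)

cosine_sim(a, b)                       # cosine similarity of the raw vectors
cosine_sim(a - mean(a), b - mean(b))   # cosine of the centred vectors ...
cor(a, b)                              # ... is the Pearson correlation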

Distance calculations

   V1  V2  V3  V4 point
1 7.3 7.6 7.7 8.0    a1
2 7.4 7.2 7.3 7.2    a2
3 4.1 4.6 4.6 4.8    a3

Compute Euclidean distance between \(a_1\) and \(a_2\).
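A quick check of this calculation in R, using the rows printed above:

a1 <- c(7.3, 7.6, 7.7, 8.0)
a2 <- c(7.4, 7.2, 7.3, 7.2)

# sqrt((7.3-7.4)^2 + (7.6-7.2)^2 + (7.7-7.3)^2 + (8.0-7.2)^2)
sqrt(sum((a1 - a2)^2))   # approximately 0.98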




🔑 Standardise your variables!!!!

Could you compute a correlation distance, \(d_{cor} = 1-|r|\)? Is \(a_1\) closer to \(a_3\) than to \(a_2\)?
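A sketch of the comparison, again using the rows printed above:

a1 <- c(7.3, 7.6, 7.7, 8.0)
a2 <- c(7.4, 7.2, 7.3, 7.2)
a3 <- c(4.1, 4.6, 4.6, 4.8)

d_cor <- function(x, y) 1 - abs(cor(x, y))   # correlation distance

sqrt(sum((a1 - a2)^2))   # Euclidean: about 1.0, a1 is close to a2
sqrt(sum((a1 - a3)^2))   # Euclidean: about 6.3, a1 is far from a3
d_cor(a1, a2)            # correlation distance: about 0.22
d_cor(a1, a3)            # correlation distance: about 0.05, the ordering flips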

Basic rules of a distance metric

Anything can be a distance if it follows these rules:

  1. \(d(A, B) \geq 0\)
  2. \(d(A, A) = 0\)
  3. \(d(A, B) = d(B, A)\)
  4. Metric dissimilarity satisfies \(d(A, B) \leq d(A, C) + d(C, B)\)
For example, think of measuring distances in a town built on two banks of a river, joined by a single bridge:

  • If both points are on the left bank, or both on the right bank, use Euclidean distance.
  • If they are on opposite sides, use the Euclidean distances to the bridge, plus the length of the bridge crossing.
  • Does this satisfy the rules?

Dissimilarity vs similarity

  • Distance is a dissimilarity measure because small means close and large means far.

  • Correlation is a similarity measure because the larger the value the closer the objects. It can be converted to a dissimilarity with a transformation.

\(k\)-means clustering

k-means clustering - algorithm (1/8)

This is an iterative procedure. To use it the number of clusters, \(k\), must be decided first. The stages of the iteration are:

  1. Initialise by either (a) partitioning the data into \(k\) groups and computing the \(k\) group means, or (b) choosing an initial set of \(k\) points as the first estimate of the cluster means (seed points).
  2. Loop over all observations reassigning them to the group with the closest mean.
  3. Recompute group means.
  4. Iterate steps 2 and 3 until convergence.

Thean C. Lim’s blog post
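A minimal from-scratch sketch of these four steps (for illustration only; the function name and details here are made up, and the kmeans() function used later is what you would use in practice):

# Sketch of the k-means iteration: x is a numeric data matrix, k the number of clusters
simple_kmeans <- function(x, k, max_iter = 100) {
  # 1. Initialise: use k randomly chosen observations as the seed means
  means <- x[sample(nrow(x), k), , drop = FALSE]
  cl <- rep(0, nrow(x))
  for (iter in seq_len(max_iter)) {
    # 2. Assign each observation to the group with the closest mean
    d <- as.matrix(dist(rbind(means, x)))[-(1:k), 1:k]
    new_cl <- apply(d, 1, which.min)
    if (all(new_cl == cl)) break   # 4. stop when assignments no longer change
    cl <- new_cl
    # 3. Recompute the group means (no handling of empty clusters here)
    for (j in seq_len(k)) means[j, ] <- colMeans(x[cl == j, , drop = FALSE])
  }
  list(cluster = cl, centers = means)
}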

k-means clustering - algorithm (2/8)

lbl x1 x2
a 16 4
b 19 8
c 14 4
d 19 9
e 10 21
f 7 19
g 1 20
h 2 15
i 3 6
j 3 7
k 6 2
l 6 5

k-means clustering - algorithm (3/8)




Select \(k=2\), and set initial seed means
\(\bar{x}_1=\) (10, 11) , \(\bar{x}_2=\) (11, 9)

k-means clustering - algorithm (4/8)




Compute distances \((d_1, d_2)\) between each observation and each mean,

\(\bar{x}_1=\) (10, 11) , \(\bar{x}_2=\) (11, 9)

lbl x1 x2 d1 d2
a 16 4 9.2 7.1
b 19 8 9.5 8.1
c 14 4 8.1 5.8
d 19 9 9.2 8.0
e 10 21 10.0 12.0
f 7 19 8.5 10.8
g 1 20 12.7 14.9
h 2 15 8.9 10.8
i 3 6 8.6 8.5
j 3 7 8.1 8.2
k 6 2 9.8 8.6
l 6 5 7.2 6.4

k-means clustering - algorithm (5/8)





Assign each observation to a cluster, based on which mean is closest.

lbl x1 x2 d1 d2 cl
a 16 4 9.2 7.1 2
b 19 8 9.5 8.1 2
c 14 4 8.1 5.8 2
d 19 9 9.2 8.0 2
e 10 21 10.0 12.0 1
f 7 19 8.5 10.8 1
g 1 20 12.7 14.9 1
h 2 15 8.9 10.8 1
i 3 6 8.6 8.5 2
j 3 7 8.1 8.2 1
k 6 2 9.8 8.6 2
l 6 5 7.2 6.4 2
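The \(d_1\), \(d_2\) and cluster columns can be reproduced from the seed means; a sketch, assuming the twelve points are stored in a data frame called toy with columns lbl, x1 and x2:

library(dplyr)

m1 <- c(10, 11)   # seed mean for cluster 1
m2 <- c(11, 9)    # seed mean for cluster 2

toy |>
  mutate(d1 = sqrt((x1 - m1[1])^2 + (x2 - m1[2])^2),
         d2 = sqrt((x1 - m2[1])^2 + (x2 - m2[2])^2),
         cl = if_else(d1 < d2, 1, 2))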

k-means clustering - algorithm (6/8)

Recompute means, and re-assign the cluster membership

\(\bar{x}_1=\) (5, 16) , \(\bar{x}_2=\) (12, 5)

lbl x1 x2 d1 d2 cl
a 16 4 16.8 4.4 2
b 19 8 16.7 7.6 2
c 14 4 15.6 2.6 2
d 19 9 16.2 8.0 2
e 10 21 7.1 15.7 1
f 7 19 3.5 14.4 1
g 1 20 5.1 18.2 1
h 2 15 3.0 13.7 1
i 3 6 10.5 8.9 2
j 3 7 9.5 9.0 2
k 6 2 14.5 6.8 2
l 6 5 11.5 5.9 2
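The means \(\bar{x}_1\) and \(\bar{x}_2\) used here are the averages of the clusters assigned on the previous step (the exact values are about (4.6, 16.4) and (11.9, 5.4), displayed rounded above); a sketch, assuming toy now carries the cl column:

# Recompute the group means from the current cluster assignments
toy |>
  group_by(cl) |>
  summarise(x1 = mean(x1), x2 = mean(x2))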

k-means clustering - algorithm (7/8)

Recompute means, and re-assign the cluster membership

\(\bar{x}_1=\) (5, 19) , \(\bar{x}_2=\) (11, 6)

lbl x1 x2 d1 d2 cl
a 16 4 18.4 5.5 2
b 19 8 17.7 8.6 2
c 14 4 17.3 3.6 2
d 19 9 17.1 8.9 2
e 10 21 5.5 15.4 1
f 7 19 2.0 13.9 1
g 1 20 4.2 17.4 1
h 2 15 4.8 12.8 1
i 3 6 12.9 7.8 2
j 3 7 11.9 7.9 2
k 6 2 16.8 6.0 2
l 6 5 13.8 4.8 2

k-means clustering - algorithm (8/8)

Recompute means, and re-assign the cluster membership. The means and memberships do not change, so the algorithm has converged.

\(\bar{x}_1=\) (5, 19) , \(\bar{x}_2=\) (11, 6)

lbl x1 x2 d1 d2 cl
a 16 4 18.4 5.5 2
b 19 8 17.7 8.6 2
c 14 4 17.3 3.6 2
d 19 9 17.1 8.9 2
e 10 21 5.5 15.4 1
f 7 19 2.0 13.9 1
g 1 20 4.2 17.4 1
h 2 15 4.8 12.8 1
i 3 6 12.9 7.8 2
j 3 7 11.9 7.9 2
k 6 2 16.8 6.0 2
l 6 5 13.8 4.8 2

Example: penguins

  • We know there are three clusters, but generally we don’t know this.

  • Will \(k=3\)-means clustering see three?

  • Fit for various values of \(k\). Add cluster label to data.

  • Examine solution in plots of the data.

  • Compute cluster metrics.

  • NOTE: use set.seed(), because the result can depend on the random initialisation.

set.seed(712)
# Three-cluster k-means on the four standardised measurements
p_km3 <- kmeans(p_std[, 2:5], centers = 3)
# Add the cluster label to the data for plotting
p_std_km <- p_std |>
  mutate(cl = factor(p_km3$cluster))
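One way to examine the solution in plots of the data: colour a pair of the standardised variables by the cluster label (assuming p_std contains the columns bl and bm used in the tour code earlier; swap in whichever pair you want to look at):

library(ggplot2)

ggplot(p_std_km, aes(x = bl, y = bm, colour = cl)) +
  geom_point(alpha = 0.8) +
  theme(aspect.ratio = 1)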

Choosing \(k\) with cluster statistics (1/2)

  • within.cluster.ss: within-cluster sum of squares. Want it to be low, but it always drops with each additional cluster, so look for large drops.
  • wb.ratio: average within-cluster distance / average between-cluster distance. Want it to be low, but it always drops with each additional cluster, so look for large drops.
  • Hubert Gamma: \((s^+ - s^-)/(s^+ + s^-)\), where \(s^+\) counts the times a within-cluster distance is smaller than a between-cluster distance, and \(s^-\) counts the times it is larger. Want this to be high.
  • Dunn: ratio of (smallest distance between points from different clusters) to (maximum distance between points within any cluster). Want this to be high.
  • Calinski-Harabasz Index: \(\frac{\sum_{i=1}^p B_{ii}/(k-1)}{\sum_{i=1}^p W_{ii}/(n-k)}\). Want this to be high.
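These statistics can be computed with cluster.stats() from the fpc package; a minimal sketch for the \(k=3\) k-means solution from the previous slide:

library(fpc)

p_d <- dist(p_std[, 2:5])
p_cs3 <- cluster.stats(p_d, clustering = p_km3$cluster)

p_cs3$within.cluster.ss   # within-cluster sum of squares
p_cs3$wb.ratio            # average within / average between distance
p_cs3$pearsongamma        # Hubert gamma (normalised version)
p_cs3$dunn                # Dunn index
p_cs3$ch                  # Calinski-Harabasz index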

Choosing \(k\) with cluster statistics (2/2)

  • Results are inconclusive. No agreement between metrics.
  • Not unusual. Stay tuned for nuisance variables and observations.

Hierarchical clustering

Hierarchical clustering 1/4


  • Agglomerative: Begin with all observations in singleton clusters. Sequentially join points into clusters, until all are in one cluster.
  • Divisive: Begin with all observations in one cluster, and sequentially divide until all observations are in singleton clusters.
  • Produces a tree diagram illustrating the process, called a dendrogram.

Hierarchical clustering 2/4

\(n\times n\) distance matrix

a b c d e f g h i j k l
a 0.0 5.0 2.0 5.8 18.0 17.5 21.9 17.8 13.2 13.3 10.2 10.0
b 5.0 0.0 6.4 1.0 15.8 16.3 21.6 18.4 16.1 16.0 14.3 13.3
c 2.0 6.4 0.0 7.1 17.5 16.6 20.6 16.3 11.2 11.4 8.2 8.1
d 5.8 1.0 7.1 0.0 15.0 15.6 21.1 18.0 16.3 16.1 14.8 13.6
e 18.0 15.8 17.5 15.0 0.0 3.6 9.1 10.0 16.6 15.7 19.4 16.5
f 17.5 16.3 16.6 15.6 3.6 0.0 6.1 6.4 13.6 12.6 17.0 14.0
g 21.9 21.6 20.6 21.1 9.1 6.1 0.0 5.1 14.1 13.2 18.7 15.8
h 17.8 18.4 16.3 18.0 10.0 6.4 5.1 0.0 9.1 8.1 13.6 10.8
i 13.2 16.1 11.2 16.3 16.6 13.6 14.1 9.1 0.0 1.0 5.0 3.2
j 13.3 16.0 11.4 16.1 15.7 12.6 13.2 8.1 1.0 0.0 5.8 3.6
k 10.2 14.3 8.2 14.8 19.4 17.0 18.7 13.6 5.0 5.8 0.0 3.0
l 10.0 13.3 8.1 13.6 16.5 14.0 15.8 10.8 3.2 3.6 3.0 0.0
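This matrix is what the agglomeration works from; a minimal sketch on the twelve toy points (again assuming they are stored in a data frame toy):

toy_d  <- dist(toy[, c("x1", "x2")])          # the 12 x 12 matrix shown above
toy_hc <- hclust(toy_d, method = "average")   # agglomerate with average linkage
plot(toy_hc)                                  # dendrogram
cutree(toy_hc, k = 2)                         # cut into two clusters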

Hierarchical clustering 3/4

Hierarchical clustering 4/4

Linkage

What is the distance between the new cluster (b,d) and all of the other observations?

Linkage measures the distance between points in the cluster and points not in the cluster:

  • Single: minimum distance between points in the different clusters
  • Complete: maximum distance between points in the different clusters
  • Average: average of distances between points in the different clusters
  • Centroid: distances between the average of the different clusters
  • Ward's: joins the pair of clusters whose merge gives the smallest increase in total within-cluster variance

Linkage

Calculations with different linkage choices

a b c d e f g h i j k l
a 0.0 5.0 2.0 5.8 18.0 17.5 21.9 17.8 13.2 13.3 10.2 10.0
b 5.0 0.0 6.4 1.0 15.8 16.3 21.6 18.4 16.1 16.0 14.3 13.3
c 2.0 6.4 0.0 7.1 17.5 16.6 20.6 16.3 11.2 11.4 8.2 8.1
d 5.8 1.0 7.1 0.0 15.0 15.6 21.1 18.0 16.3 16.1 14.8 13.6
e 18.0 15.8 17.5 15.0 0.0 3.6 9.1 10.0 16.6 15.7 19.4 16.5
f 17.5 16.3 16.6 15.6 3.6 0.0 6.1 6.4 13.6 12.6 17.0 14.0
g 21.9 21.6 20.6 21.1 9.1 6.1 0.0 5.1 14.1 13.2 18.7 15.8
h 17.8 18.4 16.3 18.0 10.0 6.4 5.1 0.0 9.1 8.1 13.6 10.8
i 13.2 16.1 11.2 16.3 16.6 13.6 14.1 9.1 0.0 1.0 5.0 3.2
j 13.3 16.0 11.4 16.1 15.7 12.6 13.2 8.1 1.0 0.0 5.8 3.6
k 10.2 14.3 8.2 14.8 19.4 17.0 18.7 13.6 5.0 5.8 0.0 3.0
l 10.0 13.3 8.1 13.6 16.5 14.0 15.8 10.8 3.2 3.6 3.0 0.0

Distance (b,d):
Distance (a,c):

Linkage between (b,d) and (a,c)

Single:
Complete:
Average:
Centroid:
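A sketch of these calculations using the values in the distance matrix: the four between-cluster distances are d(b,a) = 5.0, d(b,c) = 6.4, d(d,a) = 5.8 and d(d,c) = 7.1, and the centroid calculation uses the raw coordinates of the four points:

bd_ac <- c(5.0, 6.4, 5.8, 7.1)   # d(b,a), d(b,c), d(d,a), d(d,c) from the matrix

min(bd_ac)    # single linkage: 5.0
max(bd_ac)    # complete linkage: 7.1
mean(bd_ac)   # average linkage: about 6.1

# Centroid linkage: distance between the cluster centroids
c_bd <- colMeans(rbind(c(19, 8), c(19, 9)))   # centroid of b = (19,8), d = (19,9)
c_ac <- colMeans(rbind(c(16, 4), c(14, 4)))   # centroid of a = (16,4), c = (14,4)
sqrt(sum((c_bd - c_ac)^2))                    # about 6.0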

Results from different linkage choices

Dendrogram

  • Each leaf of the dendrogram represents one observation
  • Leaves fuse into branches and branches fuse, either with leaves or other branches.
  • Fusions lower in the tree mean the groups of observations are more similar to each other.

Cut the tree to partition the data into \(k\) clusters.

Results from different linkage choices

Model-in-the-data-space

Example: penguins

  • We know there are three clusters, but generally we don’t know this.

  • Will a three-cluster hierarchical solution see three?

  • Fit for various values of \(k\). Add cluster label to data.

  • Examine solution in plots of the data.

  • Compute cluster metrics.

  • NOTE: No need for set.seed() because results are deterministic.

# Ward linkage on Euclidean distances of the standardised measurements
p_hc_w3 <- hclust(dist(p_std[, 2:5]), method = "ward.D2")
# Cut the dendrogram into three clusters and add the label to the data
p_std_hc_w3 <- p_std |>
  mutate(cl = factor(cutree(p_hc_w3, 3)))
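The statistics on the next slide can be computed by cutting the tree at several values of \(k\); a sketch using fpc::cluster.stats() (the object and column names here are my own):

library(fpc)
library(purrr)

p_d <- dist(p_std[, 2:5])

# Cut the Ward tree at k = 2, ..., 8 and collect a few cluster statistics
p_hc_stats <- map_dfr(2:8, function(k) {
  cs <- cluster.stats(p_d, clustering = cutree(p_hc_w3, k))
  tibble::tibble(k = k,
                 within.cluster.ss = cs$within.cluster.ss,
                 wb.ratio = cs$wb.ratio,
                 pearsongamma = cs$pearsongamma,
                 dunn = cs$dunn,
                 dunn2 = cs$dunn2,
                 ch = cs$ch)
})
p_hc_stats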

Choosing \(k\) with cluster statistics

  • within.cluster.ss and wb.ratio suggest 3, and 5
  • pearsongamma (Hubert) suggests 2-3
  • dunn, dunn2 and ch all suggest 2?

Next: Model-based clustering and self-organising maps