---
title: "Lab 11"
subtitle: "Monash University, Econ & Bus Stat, ETC3250/5250"
author: "prepared by Professor Di Cook"
date: "Materials for Week 11"
output:
html_document: default
editor_options:
chunk_output_type: console
---
```{r, echo = FALSE, message = FALSE, warning = FALSE, warning = FALSE}
knitr::opts_chunk$set(
echo = FALSE,
message = FALSE,
warning = FALSE,
error = FALSE,
collapse = TRUE,
comment = "#",
fig.height = 4,
fig.width = 6,
fig.align = "center",
cache = FALSE
)
```
# Objective
The objectives for this week are:
- Think critically about cluster analysis
- Learn about distance metrics
- Apply the $k$-means cluster algorithm
# About the data
The lab uses cluster analysis to analyse the happy paintings by Bob Ross. This was the subject of the [538 post](http://fivethirtyeight.com/features/a-statistical-analysis-of-the-work-of-bob-ross/), "A Statistical Analysis of the Work of Bob Ross".
We have taken the painting images from the [sales site](http://www.saleoilpaintings.com/paintings/bob-ross/bob-ross-sale-3_1.html), read the images into R, and resized them all to be 20 by 20 pixels. Each painting has been classified into one of 8 classes based on the title of the painting. This is the data that you will work with.
Each row corresponds to one painting, and the rgb color values at each pixel are in each column. With a $20\times20$ image, this leads to $400\times3=1200$ columns.
We have given you a set of 178 paintings. Your job is to cluster the data and thus organise the paintings.
Here is one of the original paintings and how it appears after image processing:
```{r out.width="70%"}
library(tidyverse)
library(ggdendro)
library(ggthemes)
library(magick)
library(fpc)
br5 <- image_read("data/paintings/bobross5.jpg")
plot(br5)
```
```{r out.width="40%"}
# Long form data is easier to plot.
p_long <- read_csv("data/paintings-long-train.csv")
p_long %>% filter(id == 5) %>%
ggplot(aes(x=x, y=-y, fill=h)) +
geom_tile() +
scale_fill_identity() +
theme_solid() +
theme(legend.position="none")
```
# Class discussion
Browse the paintings web site. As a group come up with some ways to organise the paintings into groups, eg red tones and blue tones, or water, forest, flowers, rustic. Agree on a set of roughly mutually exclusive, that you could put each painting into one of the sets. This is an impossible challenge, but give it a go.
At the end of this exercise, as a class, have a set of 5-10 categories for that a painting can be assigned into. Share this with everyone.
# Warm up
For the data below, compute
1. Euclidean distance
2. Maximum distance
3. Manhattan
```{r}
set.seed(20200602)
library(tidyverse)
scale2 <- function(x, na.rm = FALSE) (x - mean(x, na.rm = na.rm)) / sd(x, na.rm)
df <- tibble(x1 = rnorm(10), x2 = c(rnorm(5, -5), rnorm(5, 5)), x3 = rexp(10), x4 = rt(10, 3), x5 = rnorm(10), x6 = rgamma(10, 2), x7 = rgeom(10, 0.3)) %>%
mutate_all(scale2)
df
```
and display the distance matrix using a heatmap.
```{r eval=FALSE}
d_euc <- dist(df, diag = TRUE, upper = TRUE)
heatmap(as.matrix(d_euc), symm=TRUE, scale="none", Rowv=NA)
d_max <- dist(df, method = "maximum", diag = ???, upper = ???)
heatmap(as.matrix(d_max), symm=TRUE, scale="none", Rowv=???)
d_man <- dist(df, method = "manhattan", diag = ???, upper = ???)
heatmap(as.matrix(d_man), symm=TRUE, scale="none", Rowv=???)
```
Does the choice of distance metric change how similar two points are considered to be?
# Activities
Make sure you have a copy of the files `paintings_train.csv` and `paintings.zip` from the course web site, in your `data` directory for this week's lab project. Unxip the latter file to get copies of all the images.
## 1.
Read in the data. How many variables? What are the variables summarising? How many paintings are processed?
```{r eval=FALSE}
# Read the data to cluster
paintings <- read_csv("data/paintings-train.csv")
```
## 2.
Compute a principal component analysis on the data. Create a new data set, that uses 12 PCs.
```{r eval=FALSE}
paintings_pca <- prcomp(paintings[,-c(1:3)])
plot(paintings_pca, type="l", npcs=50)
paintings_pc <- paintings_pca$x[,???] %>%
as_tibble() %>%
mutate_all(scale2)
```
## 3.
Run a $k$-means clustering, on both data sets, raw and dimension reduced, using $k=2, ..., 10$.
```{r eval=FALSE}
p_kcl <- paintings[,1:3]
kcl <- NULL
for (k in 2:10) {
cl <- kmeans(paintings[,4:1203], k, nstart=5)$???
kcl <- cbind(kcl,cl)
}
colnames(kcl) <- paste0("cl", 2:10)
p_kcl <- bind_cols(p_kcl, as.data.frame(kcl))
```
```{r eval=FALSE}
p_pc_kcl <- paintings[,1:3]
kcl_pc <- NULL
for (k in 2:10) {
cl <- kmeans(paintings_pc, ???, nstart=5)$cluster
kcl_pc <- cbind(kcl_pc,cl)
}
colnames(kcl_pc) <- paste0("cl", 2:10)
p_pc_kcl <- bind_cols(p_pc_kcl, as.data.frame(kcl_pc))
```
a. For the raw data, compute a range of cluster statistics, and decide on an appropriate number of clusters.
```{r eval=FALSE}
library(fpc)
p_d <- dist(paintings[,-c(1:3)])
p_stats <- NULL
for (k in 2:10) {
x <- cluster.stats(p_d, kcl[, k-1])
p_stats <- rbind(p_stats, c(x$within.cluster.ss, ???, x$ch, x$pearsongamma, x$dunn, x$dunn2))
}
colnames(p_stats) <- c("within.cluster.ss","wb.ratio", "ch", "pearsongamma", "dunn", "dunn2")
p_stats <- data.frame(p_stats)
p_stats$cl <- 2:10
p_stats_m <- p_stats %>%
pivot_longer(cols=within.cluster.ss:dunn2, names_to="stat")
ggplot(data=p_stats_m) +
geom_line(aes(x=???, y=value)) +
xlab("# clusters") + ylab("") +
facet_wrap(~???, ncol=3, scales = "free_y") +
theme_bw()
```
b. For the dimension reduced data, compute a range of cluster statistics, and decide on an appropriate number of clusters.
```{r eval=FALSE}
p_pc_d <- dist(paintings_pc)
p_pc_stats <- NULL
for (k in 2:10) {
x <- cluster.stats(p_pc_d, kcl_pc[, k-1])
p_pc_stats <- rbind(p_pc_stats, c(x$within.cluster.ss, x$wb.ratio, x$ch, x$pearsongamma, x$dunn, x$dunn2))
}
colnames(p_pc_stats) <- c("within.cluster.ss","wb.ratio", "ch", ???, "dunn", "dunn2")
p_pc_stats <- data.frame(p_pc_stats)
p_pc_stats$cl <- 2:10
p_pc_stats_m <- p_pc_stats %>%
pivot_longer(cols=within.cluster.ss:dunn2, names_to="stat")
ggplot(data=p_pc_stats_m) +
geom_line(aes(x=???, y=value)) +
xlab("# clusters") + ylab("") +
facet_wrap(~???, ncol=3, scales = "free_y") +
theme_bw()
```
c. Compute a confusion table to compare the results from best choice of $k$ for the two cluster solutions.
```{r eval=FALSE}
# raw data cluster ids in rows
table(p_kcl$???, p_pc_kcl$???)
# 1 to 7 and 9
# 2 to 8
# 3 to 2
# 4 to 5
# 5 to 4
# 6 to 5
# 7 to 3
# 8 to 6 and 9
# 9 to 6 and 7
```
d. Take a look at the names of the paintings, that fall into your chosen cluster solution. Do the names match anything similar to the original breakdown of paintings made by the class?
```{r eval=FALSE}
p_kcl %>%
filter(cl9 == 1) %>%
select(name) %>% print(n=32)
```
```{r eval=FALSE}
library(magick)
p_ids <- list.files("data/paintings")
plot.new() # Clear the plot window
par(mfrow=c(6, 6), mai=c(0,0,0,0), mar=c(0,0,0,0))
cl_to_view <- 9
cl_ids <- paintings[p_kcl$cl9==cl_to_view, 1:2]
for (i in 1:nrow(cl_ids)) {
img <- image_read(paste0("data/paintings/bobross",cl_ids$id[i],".jpg"))
plot(img)
}
```