ETC3250/5250 Tutorial 11

Model-based clustering and self-organising maps

Author

Prof. Di Cook

Published

13 May 2024

Load the libraries, avoid conflicts, and prepare the data
# Load libraries used everywhere
library(tidyverse)
library(tidymodels)
library(tidyclust)
library(purrr)
library(ggdendro)
library(fpc)
library(mclust)
library(kohonen)
library(patchwork)
library(mulgar)
library(tourr)
library(geozoo)
library(ggbeeswarm)
library(colorspace)
library(detourr)
library(crosstalk)
library(plotly)
library(ggthemes)
library(conflicted)
conflicts_prefer(dplyr::filter)
conflicts_prefer(dplyr::select)
conflicts_prefer(dplyr::slice)
conflicts_prefer(purrr::map)

🎯 Objectives

The goal for this week is to practise fitting model-based clustering and self-organising maps.

πŸ”§ Preparation

  • Make sure you have all the necessary libraries installed.

Exercises:

1. Clustering spotify data with model-based clustering

This exercise is motivated by this blog post on using \(k\)-means to identify anomalies.

You can read and pre-process the data with this code. Variables mode, time_signature and instrumentalness are removed because they have just a few values. We have also transformed some variables to remove skewness, and standardised the data.

# https://towardsdatascience.com/unsupervised-anomaly-detection-on-spotify-data-k-means-vs-local-outlier-factor-f96ae783d7a7
spotify <- read_csv("https://raw.githubusercontent.com/isaacarroyov/spotify_anomalies_kmeans-lof/main/data/songs_atributtes_my_top_100_2016-2021.csv") |>
  select(-c(mode, time_signature, instrumentalness)) # variables with few values

spotify_tf <- spotify |>
  mutate(speechiness = log10(speechiness),
         liveness = log10(liveness),
         duration_ms = log10(duration_ms),
         danceability = danceability^2,
         artist_popularity = artist_popularity^2,
         acousticness = log10(acousticness)) |>
  mutate_if(is.numeric, function(x) (x-mean(x))/sd(x)) 

  a. Fit model-based clustering to the transformed data, with the number of clusters ranging from 1 to 15 and all possible parametrisations. Summarise the best models, and plot the BIC values for the models. You can also simplify the plot, and show just the 10 best models.
  b. Why are some variance-covariance parametrisations fitted to fewer than 15 clusters?
  c. Make a 2D sketch that illustrates conceptually what cluster shapes the best variance-covariance parametrisation corresponds to.
  d. How many parameters need to be estimated for the VVE model with 7 and 8 clusters? Compare this to the number of observations, and explain why the model is not estimated for 8 clusters.
  e. Fit just the best model, and extract the parameter estimates. Write a few sentences describing what can be learned about the way the clusters subset the data.
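To get started on the fitting and the parameter counts, a minimal sketch using mclust, assuming `spotify_tf` has been created with the code above (the seed is an arbitrary choice):

```r
library(mclust)

# Keep only the numeric variables for clustering
spotify_num <- spotify_tf |> select(where(is.numeric))

# Fit all variance-covariance parametrisations, for 1-15 clusters
set.seed(1110)
spotify_mc <- Mclust(spotify_num, G = 1:15)
summary(spotify_mc$BIC)        # top models ranked by BIC
plot(spotify_mc, what = "BIC") # BIC for every parametrisation and G

# For part d: number of parameters, e.g. the VVE model with 7 clusters
nMclustParams("VVE", d = ncol(spotify_num), G = 7)
```

Comparing `nMclustParams()` for G = 7 and G = 8 against `nrow(spotify_num)` is one way to reason about why the larger model cannot be estimated.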

2. Clustering simulated data with known cluster structure

  a. In the tutorial of week 10 you clustered c1 from the mulgar package, after also examining this data using the tour in week 3. We know that there are 6 clusters, but with different sizes. For model-based clustering, what would you expect is the best variance-covariance parametrisation, based on what you know about the data so far?
  b. Fit a range of models for a choice of clusters that you believe will cover the range needed to select the best model for this data. Make your plot of the BIC values, and summarise what you learn. Be sure to explain whether this matches what you expected or not.
  c. Fit the best model, and examine it in the data space using a tour. How well does it fit? Does it capture the clusters that we know about?
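A sketch of one way to approach parts b and c; the range `G = 1:10` and the use of cluster labels as point colours in the tour are illustrative choices, not the official solution:

```r
library(mclust)

# c1 is provided by the mulgar package
data(c1, package = "mulgar")

# Fit all parametrisations over a range that should cover 6 clusters
c1_mc <- Mclust(c1, G = 1:10)
plot(c1_mc, what = "BIC")
summary(c1_mc)

# Examine the chosen model in the data space, colouring by cluster
animate_xy(c1, col = c1_mc$classification)
```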

3. Music similarity

  a. The music data was collected by extracting the first 40 seconds of each track from CDs using the music editing software Amadeus II, saved as a WAV file and analysed using the R package tuneR. Only a subset of the data is provided, with details:
  • title: Title of the track
  • artist: Abba, Beatles, Eels, Vivaldi, Mozart, Beethoven, Enya
  • type: rock, classical, or new wave
  • lvar, lave, lmax: variance, average and maximum of the frequencies of the left channel
  • lfener: an indicator of the amplitude or loudness of the sound
  • lfreq: median of the location of the 15 highest peaks in the periodogram

You can read the data into R using:

music <- read_csv("http://ggobi.org/book/data/music-sub.csv") |>
  rename(title = `...1`)

How many observations are in the data? Explain how this should determine the maximum grid size for an SOM.
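A quick check for this part: the number of rows caps how many map nodes are sensible, since each node should summarise several observations.

```r
# Number of observations; compare this with the number of nodes
# in a candidate grid (e.g. a 4x4 grid has 16 nodes)
nrow(music)
```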

  b. Fit an SOM to the data using a 4x4 grid, with a large rlen value. Be sure to standardise your data prior to fitting the model. Make a map of the results, and show the map in both 2D and 5D (using a tour).
  c. Let’s take a look at how it has divided the data into clusters. Set up linked brushing between detourr and the map view using the code below.
music_som1_shared <- SharedData$new(music_som1_data)

music_detour <- detour(music_som1_shared, tour_aes(
  projection = lvar:lfreq)) |>
  tour_path(grand_tour(2),
            max_bases=50, fps = 60) |>
  show_scatter(alpha = 0.9, axes = FALSE,
               width = "100%", height = "450px")

music_map <- plot_ly(music_som1_shared,
     x = ~map1,
     y = ~map2,
     text = ~paste(title, artist),
     marker = list(color="black", size=8),
     height = 450) |>
  highlight(on = "plotly_selected",
            off = "plotly_doubleclick") |>
  add_trace(type = "scatter",
            mode = "markers")

bscols(
  music_detour, music_map,
  widths = c(5, 6)
)
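The linked-brushing code above assumes a data frame `music_som1_data` holding the original variables plus map coordinates. A sketch of how it might be constructed; the object names, seed and rlen value are assumptions for illustration:

```r
set.seed(947)  # arbitrary seed; SOM results vary with initialisation

# Standardise the numeric variables before fitting
music_std <- music |>
  mutate(across(where(is.numeric), \(x) (x - mean(x)) / sd(x)))

# Fit the SOM on a 4x4 rectangular grid, with many training iterations
music_som1 <- som(as.matrix(music_std |> select(lvar:lfreq)),
                  grid = somgrid(4, 4, "rectangular"),
                  rlen = 2000)

# Attach jittered map coordinates, so co-located tracks are visible
music_som1_data <- music_std |>
  mutate(map1 = jitter(music_som1$grid$pts[music_som1$unit.classif, 1]),
         map2 = jitter(music_som1$grid$pts[music_som1$unit.classif, 2]))
```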

Can you see a small cluster of Abba songs? Which two songs are outliers? Which Beethoven piece is most like a Beatles song?

πŸ‘‹ Finishing up

Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.