Load the libraries, avoid conflicts, and prepare the data
```r
# Load libraries used everywhere
library(tidyverse)
library(tidymodels)
library(tidyclust)
library(purrr)
library(ggdendro)
library(fpc)
library(mclust)
library(kohonen)
library(patchwork)
library(mulgar)
library(tourr)
library(geozoo)
library(ggbeeswarm)
library(colorspace)
library(detourr)
library(crosstalk)
library(plotly)
library(ggthemes)
library(conflicted)
conflicts_prefer(dplyr::filter)
conflicts_prefer(dplyr::select)
conflicts_prefer(dplyr::slice)
conflicts_prefer(purrr::map)
```
🎯 Objectives
The goal for this week is to practice fitting model-based clustering and self-organising maps.
🔧 Preparation
Make sure you have all the necessary libraries installed.
Exercises:
1. Clustering spotify data with model-based clustering
This exercise is motivated by this blog post on using \(k\)-means to identify anomalies.
You can read and pre-process the data with this code. The variables mode, time_signature and instrumentalness are removed because they take only a few distinct values. Some variables have also been transformed to remove skewness, and the data has been standardised.
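A minimal sketch of the pre-processing described above, should the chunk not be shown: the file path `spotify.csv` and the choice of which variables to log-transform are assumptions, so substitute the actual code and data provided with the tutorial.

```r
# Sketch only: "spotify.csv" is a placeholder path, and the log-transformed
# variables are assumptions about which features are skewed.
library(tidyverse)
spotify <- read_csv("spotify.csv") |>
  select(-mode, -time_signature, -instrumentalness)  # few distinct values
spotify_std <- spotify |>
  mutate(across(c(speechiness, liveness), log1p)) |>  # assumed skewness fix
  mutate(across(where(is.numeric), ~ (.x - mean(.x)) / sd(.x)))  # standardise
```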
Fit model-based clustering to the transformed data, with the number of clusters ranging from 1-15 and all possible variance-covariance parametrisations. Summarise the best models, and plot the BIC values for the models. You can also simplify the plot and show just the 10 best models.
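A sketch of one way to do this with mclust, assuming the standardised data from the previous step is in `spotify_std`:

```r
library(mclust)
library(dplyr)
# All parametrisations, 1 to 15 clusters; combinations that cannot be
# estimated are returned as NA
spotify_bic <- spotify_std |>
  select(where(is.numeric)) |>
  mclustBIC(G = 1:15)
summary(spotify_bic)  # the top models ranked by BIC
plot(spotify_bic)     # BIC curves for every parametrisation
# mulgar::ggmcbic() gives a ggplot alternative that can show only the top models
```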
Why are some variance-covariance parametrisations fitted with fewer than 15 clusters?
Make a 2D sketch illustrating, conceptually, the cluster shapes implied by the best variance-covariance parametrisation.
How many parameters need to be estimated for the VVE model with 7 and 8 clusters? Compare this to the number of observations, and explain why the model is not estimated for 8 clusters.
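To check a hand count, mclust can report the number of estimated parameters for a given model, dimension and number of clusters; here `d` is assumed to be the number of numeric variables left after the pre-processing above.

```r
library(mclust)
d <- sum(sapply(spotify_std, is.numeric))  # number of variables in the model
nMclustParams("VVE", d = d, G = 7)
nMclustParams("VVE", d = d, G = 8)
```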
Fit just the best model, and extract the parameter estimates. Write a few sentences describing what can be learned about the way the clusters subset the data.
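A sketch for fitting a single model and pulling out the estimates; the model name and number of clusters below are placeholders for whichever combination the BIC summary selects.

```r
library(mclust)
library(dplyr)
# Placeholders: swap in the best model name and G from the BIC summary
spotify_mc <- spotify_std |>
  select(where(is.numeric)) |>
  Mclust(G = 3, modelNames = "VVE")
spotify_mc$parameters$pro             # mixing proportions (cluster weights)
spotify_mc$parameters$mean            # cluster means
spotify_mc$parameters$variance$sigma  # variance-covariance matrices
table(spotify_mc$classification)      # observations assigned to each cluster
```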
2. Clustering simulated data with known cluster structure
In the tutorial for week 10 you clustered c1 from the mulgar package, after also examining this data using the tour in week 3. We know that there are 6 clusters, but with different sizes. For model-based clustering, what would you expect to be the best variance-covariance parametrisation, based on what you know about the data so far?
Fit models for a range of cluster numbers that you believe will be wide enough to select the best model for this data. Make your plot of the BIC values, and summarise what you learn. Be sure to explain whether or not this matches what you expected.
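A sketch, assuming that 2-12 clusters is a wide enough range; widen it if the best model sits at the boundary.

```r
library(mclust)
library(mulgar)
c1_bic <- mclustBIC(c1, G = 2:12)
summary(c1_bic)
plot(c1_bic)
```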
Fit the best model, and examine the model in the data space using a tour. How well does it fit? Does it capture the clusters that we know about?
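One way to look at the fit in the data space is to colour points by the fitted cluster labels in a grand tour; the model name and number of clusters below are placeholders for the model you selected.

```r
library(mclust)
library(mulgar)
library(tourr)
# Placeholders for the chosen parametrisation and number of clusters
c1_mc <- Mclust(c1, G = 6, modelNames = "EEE")
animate_xy(c1, col = c1_mc$classification)
```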
3. Music similarity
The music data was collected by extracting the first 40 seconds of each track from CDs using the music editing software Amadeus II, saving each clip as a WAV file, and analysing it with the R package tuneR. Only a subset of the data is provided, with these variables:
- lvar, lave, lmax: variance, average and maximum of the frequencies of the left channel
- lfener: an indicator of the amplitude or loudness of the sound
- lfreq: median of the location of the 15 highest peaks in the periodogram
You can read the data into R using:
```r
music <- read_csv("http://ggobi.org/book/data/music-sub.csv") |>
  rename(title = `...1`)
```
How many observations are in the data? Explain how this should determine the maximum grid size for an SOM.
Fit an SOM to the data using a 4x4 grid and a large rlen value. Be sure to standardise your data before fitting the model. Make a map of the results, and show the map in both 2D and 5D (using a tour).
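A sketch of the model fit with kohonen; the `rlen` value and the seed are arbitrary choices, and only the numeric audio variables are used (the character columns are dropped before standardising).

```r
library(kohonen)
library(dplyr)
music_std <- music |>
  select(where(is.numeric)) |>
  scale()                      # standardise each variable
set.seed(947)
music_som <- som(music_std,
                 grid = somgrid(4, 4, topo = "rectangular"),
                 rlen = 2000)
# music_som$codes[[1]]   : node positions in the 5D data space (for the tour view)
# music_som$unit.classif : the node each track is mapped to (for the 2D map view)
```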
Let's take a look at how it has divided the data into clusters. Set up linked brushing between the detourr and map views using the code below.