ETC3250/5250 Tutorial 3

Visualising your data and models

Author

Prof. Di Cook

Published

11 March 2024

Load the libraries and avoid conflicts

# Load libraries used everywhere
library(tidyverse)
library(tidymodels)
library(conflicted)
library(colorspace)
library(patchwork)
library(MASS)
library(randomForest)
library(gridExtra)
library(GGally)
library(geozoo)
library(mulgar)
conflicts_prefer(dplyr::filter)
conflicts_prefer(dplyr::select)
conflicts_prefer(dplyr::slice)
conflicts_prefer(palmerpenguins::penguins)
conflicts_prefer(tourr::flea)

🎯 Objectives

The goal for this week is for you to learn and practice visualising high-dimensional data.

🔧 Preparation

Complete the quiz
Do the reading related to week 2

Exercises:

Open your project for this unit called iml.Rproj.

1. The sparseness of high dimensions

Randomly generate data points that are uniformly distributed in a hyper-cube of 3, 5 and 10 dimensions, with 500 points in each sample, using the cube.solid.random function of the geozoo package. What differences do we expect to see? Now visualise each set in a grand tour and describe how they differ, and whether this matched your expectations?

The code to generate and view the cubes is:

Code to generate the data and show in a tour

library(tourr)
library(geozoo)
set.seed(1234)
cube3 <- cube.solid.random(3, 500)$points
cube5 <- cube.solid.random(5, 500)$points
cube10 <- cube.solid.random(10, 500)$points

animate_xy(cube3, axes="bottomleft")
animate_xy(cube5, axes="bottomleft")
animate_xy(cube10, axes="bottomleft")

2. Detecting clusters

For the data sets, c1, c3 from the mulgar package, use the grand tour to view and try to identify structure (outliers, clusters, non-linear relationships).

Code to show in a tour

animate_xy(c1)
animate_xy(c3)

3. Effect of covariance

Examine 5D multivariate normal samples drawn from populations with a range of variance-covariance matrices. (You can use the mvtnorm package to do the sampling, for example.) Examine the data using a grand tour. What changes when you change the correlation from close to zero to close to 1? Can you see a difference between strong positive correlation and strong negative correlation?

Code to generate the samples

library(mvtnorm)
set.seed(501)

s1 <- diag(5)
s2 <- diag(5)
s2[3,4] <- 0.7
s2[4,3] <- 0.7
s3 <- s2
s3[1,2] <- -0.7
s3[2,1] <- -0.7

s1

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    0    0    0
[2,]    0    1    0    0    0
[3,]    0    0    1    0    0
[4,]    0    0    0    1    0
[5,]    0    0    0    0    1

Code to generate the samples

s2

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0  0.0  0.0    0
[2,]    0    1  0.0  0.0    0
[3,]    0    0  1.0  0.7    0
[4,]    0    0  0.7  1.0    0
[5,]    0    0  0.0  0.0    1

Code to generate the samples

s3

     [,1] [,2] [,3] [,4] [,5]
[1,]  1.0 -0.7  0.0  0.0    0
[2,] -0.7  1.0  0.0  0.0    0
[3,]  0.0  0.0  1.0  0.7    0
[4,]  0.0  0.0  0.7  1.0    0
[5,]  0.0  0.0  0.0  0.0    1

Code to generate the samples

set.seed(1234)
d1 <- as.data.frame(rmvnorm(500, sigma = s1))
d2 <- as.data.frame(rmvnorm(500, sigma = s2))
d3 <- as.data.frame(rmvnorm(500, sigma = s3))

4. Principal components analysis on the simulated data

🧐 For data sets d2 and d3 what would you expect would be the number of PCs suggested by PCA?

👨🏽‍💻👩‍💻Conduct the PCA. Report the variances (eigenvalues), and cumulative proportions of total variance, make a scree plot, and the PC coefficients.

🤯Often, the selected number of PCs are used in future work. For both d3 and d4, think about the pros and cons of using 4 PCs and 3 PCs, respectively.

5. PCA on cross-currency time series

The rates.csv data has 152 currencies relative to the USD for the period of Nov 1, 2019 through to Mar 31, 2020. Treating the dates as variables, conduct a PCA to examine how the cross-currencies vary, focusing on this subset: ARS, AUD, BRL, CAD, CHF, CNY, EUR, FJD, GBP, IDR, INR, ISK, JPY, KRW, KZT, MXN, MYR, NZD, QAR, RUB, SEK, SGD, UYU, ZAR.

rates <- read_csv("https://raw.githubusercontent.com/numbats/iml/master/data/rates_Nov19_Mar20.csv") |>
  select(date, ARS, AUD, BRL, CAD, CHF, CNY, EUR, FJD, GBP, IDR, INR, ISK, JPY, KRW, KZT, MXN, MYR, NZD, QAR, RUB, SEK, SGD, UYU, ZAR)

Standardise the currency columns to each have mean 0 and variance 1. Explain why this is necessary prior to doing the PCA or is it? Use this data to make a time series plot overlaying all of the cross-currencies.

Code to standardise currencies

library(plotly)
rates_std <- rates |>
  mutate_if(is.numeric, function(x) (x-mean(x))/sd(x))
rownames(rates_std) <- rates_std$date
p <- rates_std |>
  pivot_longer(cols=ARS:ZAR, 
               names_to = "currency", 
               values_to = "rate") |>
  ggplot(aes(x=date, y=rate, 
             group=currency, label=currency)) +
    geom_line() 
ggplotly(p, width=400, height=300)

Conduct a PCA. Make a scree plot, and summarise proportion of the total variance. Summarise these values and the coefficients for the first five PCs, nicely.

Code to do PCA and screeplot

rates_pca <- prcomp(rates_std[,-1], scale=FALSE)
mulgar::ggscree(rates_pca, q=24)
options(digits=2)
summary(rates_pca)

Code to make a nice summary

# Summarise the coefficients nicely
rates_pca_smry <- tibble(evl=rates_pca$sdev^2) |>
  mutate(p = evl/sum(evl), 
         cum_p = cumsum(evl/sum(evl))) |> 
  t() |>
  as.data.frame()
colnames(rates_pca_smry) <- colnames(rates_pca$rotation)
rates_pca_smry <- bind_rows(as.data.frame(rates_pca$rotation),
                            rates_pca_smry)
rownames(rates_pca_smry) <- c(rownames(rates_pca$rotation),
                              "Variance", "Proportion", 
                              "Cum. prop")
rates_pca_smry[,1:5]

Make a biplot of the first two PCs. Explain what you learn.

Biplot code

library(ggfortify)
autoplot(rates_pca, loadings = TRUE, 
         loadings.label = TRUE)

Make a time series plot of PC1 and PC2. Explain why this is useful to do for this data.

Code to plot PCs

rates_pca$x |>
  as.data.frame() |>
  mutate(date = rates_std$date) |>
  ggplot(aes(x=date, y=PC1)) + geom_line()

rates_pca$x |>
  as.data.frame() |>
  mutate(date = rates_std$date) |>
  ggplot(aes(x=date, y=PC2)) + geom_line()

You’ll want to drill down deeper to understand what the PCA tells us about the movement of the various currencies, relative to the USD, over the volatile period of the COVID pandemic. Plot the first two PCs again, but connect the dots in order of time. Make it interactive with plotly, where the dates are the labels. What does following the dates tell us about the variation captured in the first two principal components?

Code to use interaction of the PC plot

library(plotly)
p2 <- rates_pca$x |>
  as.data.frame() |>
  mutate(date = rates_std$date) |>
  ggplot(aes(x=PC1, y=PC2, label=date)) +
    geom_point() +
    geom_path()
ggplotly(p2, width=400, height=400)

6. Write a simple question about the week’s material and test your neighbour, or your tutor.

👋 Finishing up

Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.