ETC3250/5250 Tutorial 3

Visualising your data and models

Author

Prof. Di Cook

Published

11 March 2024

Load the libraries and avoid conflicts
# Load libraries used everywhere
library(tidyverse)
library(tidymodels)
library(conflicted)
library(colorspace)
library(patchwork)
library(MASS)
library(randomForest)
library(gridExtra)
library(GGally)
library(geozoo)
library(mulgar)
conflicts_prefer(dplyr::filter)
conflicts_prefer(dplyr::select)
conflicts_prefer(dplyr::slice)
conflicts_prefer(palmerpenguins::penguins)
conflicts_prefer(tourr::flea)

🎯 Objectives

The goal for this week is for you to learn and practise visualising high-dimensional data.

🔧 Preparation

  • Complete the quiz
  • Do the reading related to week 2

Exercises:

Open your project for this unit called iml.Rproj.

1. The sparseness of high dimensions

Randomly generate data points that are uniformly distributed in a hyper-cube of 3, 5 and 10 dimensions, with 500 points in each sample, using the cube.solid.random function of the geozoo package. What differences do we expect to see? Now visualise each set in a grand tour and describe how they differ. Does this match your expectations?

The code to generate and view the cubes is:

Code to generate the data and show in a tour
library(tourr)
library(geozoo)
set.seed(1234)
cube3 <- cube.solid.random(3, 500)$points
cube5 <- cube.solid.random(5, 500)$points
cube10 <- cube.solid.random(10, 500)$points

animate_xy(cube3, axes="bottomleft")
animate_xy(cube5, axes="bottomleft")
animate_xy(cube10, axes="bottomleft")

Each of the projections has a boxy shape, which gets less distinct as the dimension increases.

As the dimension increases, the points tend to concentrate in the centre of the plot window, with only a smattering of points near the edges.
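To make this concentration concrete, here is a minimal numerical sketch: count how many points fall in a central sub-cube of half the side length. It uses runif() directly, so the sampling region is known to be [0,1]^p; the expected proportion is 0.5^p, which shrinks rapidly with dimension.

Code to check the concentration numerically
set.seed(1234)
for (p in c(3, 5, 10)) {
  x <- matrix(runif(500 * p), ncol = p)
  # proportion of points within 0.25 of the centre in every coordinate
  inner <- mean(apply(abs(x - 0.5) < 0.25, 1, all))
  cat("p =", p, " observed:", round(inner, 3),
      " expected:", round(0.5^p, 3), "\n")
}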

2. Detecting clusters

For the data sets c1 and c3 from the mulgar package, use the grand tour to view them, and try to identify structure (outliers, clusters, non-linear relationships).

Code to show in a tour
animate_xy(c1)
animate_xy(c3)

The first data set c1 has six clusters: four small ones and two big ones. The two big ones look like planes because they have no variation in some dimensions.

The second data set c3 has a triangular prism shape, which itself is divided into several smaller triangular prisms. It also has several dimensions with little or no variation, which is why the points collapse into a line in some projections.
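A quick way to check the low-variation claim for c3, as a sketch: near-zero standard deviations in the trailing principal components indicate that the points lie close to a lower-dimensional subspace. (For c1 the flatness is within individual clusters, so a global check like this may not reveal it.)

Code to check for globally flat directions in c3
# Near-zero values at the end of this vector are directions in which
# c3 as a whole barely varies
prcomp(c3)$sdev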

3. Effect of covariance

Examine 5D multivariate normal samples drawn from populations with a range of variance-covariance matrices. (You can use the mvtnorm package to do the sampling, for example.) Examine the data using a grand tour. What changes when you change the correlation from close to zero to close to 1? Can you see a difference between strong positive correlation and strong negative correlation?

Code to generate the samples
library(mvtnorm)
set.seed(501)

s1 <- diag(5)
s2 <- diag(5)
s2[3,4] <- 0.7
s2[4,3] <- 0.7
s3 <- s2
s3[1,2] <- -0.7
s3[2,1] <- -0.7

s1
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0    0    0    0
[2,]    0    1    0    0    0
[3,]    0    0    1    0    0
[4,]    0    0    0    1    0
[5,]    0    0    0    0    1
s2
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    0  0.0  0.0    0
[2,]    0    1  0.0  0.0    0
[3,]    0    0  1.0  0.7    0
[4,]    0    0  0.7  1.0    0
[5,]    0    0  0.0  0.0    1
s3
     [,1] [,2] [,3] [,4] [,5]
[1,]  1.0 -0.7  0.0  0.0    0
[2,] -0.7  1.0  0.0  0.0    0
[3,]  0.0  0.0  1.0  0.7    0
[4,]  0.0  0.0  0.7  1.0    0
[5,]  0.0  0.0  0.0  0.0    1
Code to generate the samples
set.seed(1234)
d1 <- as.data.frame(rmvnorm(500, sigma = s1))
d2 <- as.data.frame(rmvnorm(500, sigma = s2))
d3 <- as.data.frame(rmvnorm(500, sigma = s3))
animate_xy(d1)
animate_xy(d2)
animate_xy(d3)

The points in d1 are spread fairly evenly in every projection. Data sets d2 and d3 have some projections in which the points are concentrated along a line. This happens when variables 3 and 4 contribute to the projection in d2, and when variables 1, 2, 3 and 4 contribute to the projection in d3.
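This can also be checked in a single static projection, sketched below: plot d2 against the correlated pair V3 and V4 (rmvnorm returns an unnamed matrix, so its columns become V1 to V5 when converted to a data frame).

Code to plot the correlated pair in d2
library(ggplot2)
# The 0.7 correlation between variables 3 and 4 in s2 shows up as
# points concentrated along a line in this projection
ggplot(d2, aes(x = V3, y = V4)) +
  geom_point(alpha = 0.5) +
  theme(aspect.ratio = 1)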

4. Principal components analysis on the simulated data

🧐 For data sets d2 and d3, what number of PCs would you expect to be suggested by PCA?

👨🏽‍💻👩‍💻 Conduct the PCA. Report the variances (eigenvalues) and cumulative proportions of total variance, make a scree plot, and report the PC coefficients.

🤯 Often, only the selected number of PCs is kept for future work. For d2 and d3, think about the pros and cons of using 4 PCs and 3 PCs, respectively.

Thinking about it: In d2 there is strong correlation between variables 3 and 4, which means that probably only 4 PCs would be needed. In d3 there is also strong correlation between variables 1 and 2, which would mean that only 3 PCs would be needed.

d2_pca <- prcomp(d2, scale=TRUE)
d2_pca
Standard deviations (1, .., p=5):
[1] 1.2944925 1.0120246 0.9995775 0.9840652 0.5766767

Rotation (n x k) = (5 x 5):
           PC1         PC2          PC3         PC4         PC5
V1 0.009051897  0.60982755  0.600760775  0.51637067 -0.02182300
V2 0.042039564  0.44070702 -0.798335151  0.40808929  0.01158053
V3 0.702909484  0.03224989  0.034228444 -0.06034512  0.70715280
V4 0.702411571  0.03021836  0.002269932 -0.08050218 -0.70655437
V5 0.103377852 -0.65721722  0.023890154  0.74612487 -0.01027051
d2_pca$sdev^2/5
[1] 0.33514216 0.20483875 0.19983102 0.19367686 0.06651121
mulgar::ggscree(d2_pca, q=5)

Four PCs explain 93% of the variation. PC1 is a combination of variables 3 and 4, and captures this reduced dimension.
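A quick check of the 93% figure from the eigenvalues reported above:

cumsum(d2_pca$sdev^2 / 5)  # cumulative proportion reaches ~0.93 at PC4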

d3_pca <- prcomp(d3, scale=TRUE)
d3_pca
Standard deviations (1, .., p=5):
[1] 1.3262816 1.2831152 0.9984103 0.5561311 0.5371102

Rotation (n x k) = (5 x 5):
           PC1         PC2          PC3         PC4          PC5
V1  0.47372917  0.52551030  0.007091154 -0.55745578  0.434295265
V2 -0.49362867 -0.50367594 -0.047544823 -0.58444458  0.398503844
V3 -0.50057768  0.49960926  0.030888892 -0.40488840 -0.578726039
V4 -0.52968729  0.46318477  0.073441704  0.42649507  0.563559684
V5  0.02765464 -0.07745919  0.995661287 -0.04283613 -0.007678753
d3_pca$sdev^2/5
[1] 0.35180458 0.32927695 0.19936462 0.06185637 0.05769748
mulgar::ggscree(d3_pca, q=5)

Three PCs explain 88% of the variation, and the last two PCs have much smaller variance than the others. PCs 1 and 2 are combinations of variables 1, 2, 3 and 4, which capture this reduced dimension, and PC3 is primarily variable 5.
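The same check for d3:

cumsum(d3_pca$sdev^2 / 5)  # cumulative proportion reaches ~0.88 at PC3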

The PCs are awkward combinations of the original variables. For d2, it would make sense to use PC1 (or equivalently, an equal combination of V3 and V4), and then keep the original variables V1, V2 and V5.
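For d2, a minimal sketch of this simpler reduction (the name V34 is made up here):

Code to make the simpler reduction for d2
# (V3 + V4)/sqrt(2) matches PC1's roughly equal 0.70 weights on V3
# and V4; the other variables are kept as they are
d2_reduced <- d2 |>
  mutate(V34 = (V3 + V4) / sqrt(2)) |>
  select(V1, V2, V34, V5)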

For d3 it’s harder to make this call because the first two PCs are combinations of four variables. It’s hard to see from this output that the ideal solution would be to use an equal combination of V1 and V2, an equal combination of V3 and V4, and V5 on its own.

Often the variance explained by the PCs is hard to interpret in terms of the original variables.

5. PCA on cross-currency time series

The rates.csv data has 152 currencies relative to the USD for the period of Nov 1, 2019 through to Mar 31, 2020. Treating the dates as variables, conduct a PCA to examine how the cross-currencies vary, focusing on this subset: ARS, AUD, BRL, CAD, CHF, CNY, EUR, FJD, GBP, IDR, INR, ISK, JPY, KRW, KZT, MXN, MYR, NZD, QAR, RUB, SEK, SGD, UYU, ZAR.

rates <- read_csv("https://raw.githubusercontent.com/numbats/iml/master/data/rates_Nov19_Mar20.csv") |>
  select(date, ARS, AUD, BRL, CAD, CHF, CNY, EUR, FJD, GBP, IDR, INR, ISK, JPY, KRW, KZT, MXN, MYR, NZD, QAR, RUB, SEK, SGD, UYU, ZAR)
  a. Standardise the currency columns to each have mean 0 and variance 1. Explain why this is necessary prior to doing the PCA, or is it? Use this data to make a time series plot overlaying all of the cross-currencies.
Code to standardise currencies
library(plotly)
rates_std <- rates |>
  mutate_if(is.numeric, function(x) (x-mean(x))/sd(x))
rownames(rates_std) <- rates_std$date
p <- rates_std |>
  pivot_longer(cols=ARS:ZAR, 
               names_to = "currency", 
               values_to = "rate") |>
  ggplot(aes(x=date, y=rate, 
             group=currency, label=currency)) +
    geom_line() 
ggplotly(p, width=400, height=300)

It isn’t necessary to standardise the variables before calling prcomp because we can set scale=TRUE to have this done as part of the PCA computation. However, standardising is useful for making the time series plot in which all of the currencies are overlaid, and that plot helps with interpreting the principal components.
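Equivalently, sketched here, the standardising can be left to prcomp itself, which gives the same components (up to the sign of each PC):

Code to do the PCA without manual standardising
# scale=TRUE standardises each column inside prcomp
rates_pca_alt <- prcomp(rates[,-1], scale = TRUE)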

  b. Conduct a PCA. Make a scree plot, and summarise the proportion of total variance explained. Present these values and the coefficients for the first five PCs, nicely.
Code to do PCA and screeplot
rates_pca <- prcomp(rates_std[,-1], scale=FALSE)
mulgar::ggscree(rates_pca, q=24)
options(digits=2)
summary(rates_pca)
Code to make a nice summary
# Summarise the coefficients nicely
rates_pca_smry <- tibble(evl=rates_pca$sdev^2) |>
  mutate(p = evl/sum(evl), 
         cum_p = cumsum(evl/sum(evl))) |> 
  t() |>
  as.data.frame()
colnames(rates_pca_smry) <- colnames(rates_pca$rotation)
rates_pca_smry <- bind_rows(as.data.frame(rates_pca$rotation),
                            rates_pca_smry)
rownames(rates_pca_smry) <- c(rownames(rates_pca$rotation),
                              "Variance", "Proportion", 
                              "Cum. prop")
rates_pca_smry[,1:5]

Importance of components:
                         PC1   PC2    PC3    PC4    PC5    PC6     PC7     PC8
Standard deviation     4.193 1.679 1.0932 0.9531 0.7358 0.5460 0.38600 0.33484
Proportion of Variance 0.733 0.118 0.0498 0.0379 0.0226 0.0124 0.00621 0.00467
Cumulative Proportion  0.733 0.850 0.8999 0.9377 0.9603 0.9727 0.97893 0.98360
                           PC9    PC10    PC11    PC12    PC13    PC14    PC15
Standard deviation     0.30254 0.25669 0.25391 0.17893 0.16189 0.15184 0.14260
Proportion of Variance 0.00381 0.00275 0.00269 0.00133 0.00109 0.00096 0.00085
Cumulative Proportion  0.98741 0.99016 0.99284 0.99418 0.99527 0.99623 0.99708
                          PC16    PC17    PC18    PC19    PC20    PC21    PC22
Standard deviation     0.11649 0.10691 0.09923 0.09519 0.08928 0.07987 0.07222
Proportion of Variance 0.00057 0.00048 0.00041 0.00038 0.00033 0.00027 0.00022
Cumulative Proportion  0.99764 0.99812 0.99853 0.99891 0.99924 0.99950 0.99972
                          PC23    PC24
Standard deviation     0.05985 0.05588
Proportion of Variance 0.00015 0.00013
Cumulative Proportion  0.99987 1.00000
              PC1    PC2      PC3    PC4     PC5
ARS         0.215 -0.121  0.19832  0.181 -0.2010
AUD         0.234  0.013  0.11466  0.018  0.0346
BRL         0.229 -0.108  0.10513  0.093 -0.0526
CAD         0.235 -0.025 -0.02659 -0.037  0.0337
CHF        -0.065  0.505 -0.33521 -0.188 -0.0047
CNY         0.144  0.237 -0.45337 -0.238 -0.5131
EUR         0.088  0.495  0.24474  0.245 -0.1416
FJD         0.234  0.055  0.04470  0.028  0.0330
GBP         0.219  0.116 -0.00915 -0.073  0.3059
IDR         0.218 -0.022 -0.24905 -0.117  0.2362
INR         0.223 -0.147 -0.00734 -0.014  0.0279
ISK         0.230 -0.016  0.10979  0.093  0.1295
JPY        -0.022  0.515  0.14722  0.234  0.3388
KRW         0.214  0.063  0.17488  0.059 -0.3404
KZT         0.217  0.013 -0.23244 -0.119  0.3304
MXN         0.229 -0.059 -0.13804 -0.102  0.2048
MYR         0.227  0.040 -0.13970 -0.115 -0.2009
NZD         0.230  0.061  0.04289 -0.056 -0.0354
QAR        -0.013  0.111  0.55283 -0.807  0.0078
RUB         0.233 -0.102 -0.05863 -0.042  0.0063
SEK         0.205  0.240  0.07570  0.085  0.0982
SGD         0.227  0.057  0.14225  0.115 -0.2424
UYU         0.231 -0.101  0.00064 -0.053  0.0957
ZAR         0.232 -0.070 -0.00328  0.042 -0.0443
Variance   17.582  2.820  1.19502  0.908  0.5413
Proportion  0.733  0.118  0.04979  0.038  0.0226
Cum. prop   0.733  0.850  0.89989  0.938  0.9603
  • The first two principal components explain 85% of the total variation.
  • PC1 is a combination of all of the currencies except for CHF, EUR, JPY, QAR.
  • PC2 is a combination of CHF, EUR, JPY.
  c. Make a biplot of the first two PCs. Explain what you learn.
Biplot code
library(ggfortify)
autoplot(rates_pca, loadings = TRUE, 
         loadings.label = TRUE) 

  • Most of the currencies contribute substantially to PC1. Only three contribute strongly to PC2: CHF, JPY, EUR. This is similar to what we learned from the summary table (made in b).
  • The pattern of the points is most unusual! It has a curious S shape. If the data were multivariate normal, the principal components would show a random scattering of points with no obvious structure, so this is a very strong pattern.
  d. Make a time series plot of PC1 and PC2. Explain why this is useful to do for this data.
Code to plot PCs
rates_pca$x |>
  as.data.frame() |>
  mutate(date = rates_std$date) |>
  ggplot(aes(x=date, y=PC1)) + geom_line()

rates_pca$x |>
  as.data.frame() |>
  mutate(date = rates_std$date) |>
  ggplot(aes(x=date, y=PC2)) + geom_line()

  • Because there is a strong pattern in the first two PCs, it could be useful to understand if this is related to the temporal context of the data.
  • Here we might expect that the PCs extract the main temporal patterns. We see this is the case.
  • PC1 reflects the large group of currencies that greatly increase in mid-March.
  • PC2 reflects the few currencies that decrease at the start of March.

Note: an increase here means that the currency declines in value relative to the USD, and a decrease means that it strengthens relative to the USD. Is this correct?
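One way to check, sketched below: plot a raw (unstandardised) cross-rate with a well-known history, such as AUD, and see which way it moves in March 2020.

Code to check the direction of quotation
# If the series is quoted as AUD per USD, it should rise sharply in
# March 2020, when the AUD weakened against the USD
ggplot(rates, aes(x = date, y = AUD)) + geom_line()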

  e. You’ll want to drill down deeper to understand what the PCA tells us about the movement of the various currencies, relative to the USD, over the volatile period of the COVID pandemic. Plot the first two PCs again, but connect the dots in order of time. Make it interactive with plotly, where the dates are the labels. What does following the dates tell us about the variation captured in the first two principal components?
Code to use interaction of the PC plot
library(plotly)
p2 <- rates_pca$x |>
  as.data.frame() |>
  mutate(date = rates_std$date) |>
  ggplot(aes(x=PC1, y=PC2, label=date)) +
    geom_point() +
    geom_path()
ggplotly(p2, width=400, height=400)

The pattern in PC1 vs PC2 follows time. Prior to the pandemic there is a tangle of values on the left. Towards the end of February, when the world was starting to realise that COVID was a major health threat, there is a dramatic reaction from the world currencies, at least in relation to the USD. Currencies such as EUR, JPY, CHF reacted first, gaining strength relative to USD, and then they lost that strength. Most other currencies reacted later, losing value relative to the USD.

6. Write a simple question about the week’s material and test your neighbour or your tutor.

👋 Finishing up

Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.