```r
# Load libraries used everywhere
library(tidyverse)
library(tidymodels)
library(conflicted)
library(colorspace)
library(patchwork)
library(MASS)
library(randomForest)
library(gridExtra)
library(GGally)
library(geozoo)
library(mulgar)
conflicts_prefer(dplyr::filter)
conflicts_prefer(dplyr::select)
conflicts_prefer(dplyr::slice)
conflicts_prefer(palmerpenguins::penguins)
conflicts_prefer(tourr::flea)
```
🎯 Objectives
The goal for this week is for you to learn and practice visualising high-dimensional data.
🔧 Preparation
Complete the quiz
Do the reading related to week 2
Exercises:
Open your project for this unit called iml.Rproj.
1. The sparseness of high dimensions
Randomly generate data points that are uniformly distributed in a hypercube of 3, 5 and 10 dimensions, with 500 points in each sample, using the cube.solid.random() function from the geozoo package. What differences do you expect to see? Now visualise each set in a grand tour and describe how they differ. Did this match your expectations?
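A minimal sketch of the simulation and tours, assuming tourr is available for animate_xy(); cube.solid.random() returns an object whose $points element holds the sample:

```r
library(geozoo)
library(tourr)
set.seed(114)
# Uniform samples in solid hypercubes of increasing dimension
cube3 <- cube.solid.random(p = 3, n = 500)$points
cube5 <- cube.solid.random(p = 5, n = 500)$points
cube10 <- cube.solid.random(p = 10, n = 500)$points
# View each in a grand tour
animate_xy(cube3)
animate_xy(cube5)
animate_xy(cube10)
```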
Each of the projections has a boxy shape, which gets less distinct as the dimension increases.
As the dimension increases, the points tend to concentrate in the centre of the plot window, with a smattering of points near the edges.
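One way to see why the points pile up in the centre: a 1-D projection shown by the tour is a linear combination of the coordinates,

$$
y = \sum_{j=1}^{p} a_j x_j, \qquad \sum_{j=1}^{p} a_j^2 = 1,
$$

and when the $x_j$ are independent uniforms the central limit theorem makes $y$ look increasingly normal as $p$ grows, even though the joint distribution is uniform on the cube.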
2. Detecting clusters
For the data sets c1 and c3 from the mulgar package, use the grand tour to view them, and try to identify structure (outliers, clusters, non-linear relationships).
Code to show the data in a tour:

```r
animate_xy(c1)
animate_xy(c3)
```
Solution
The first data set c1 has six clusters: four small ones and two big ones. The two big ones look like planes because they have no variation in some dimensions.
The second data set c3 has a triangular prism shape, which itself is divided into several smaller triangular prisms. It also has several dimensions with no variation, because the points collapse into a line in some projections.
3. Effect of covariance
Examine 5D multivariate normal samples drawn from populations with a range of variance-covariance matrices. (You can use the mvtnorm package to do the sampling, for example.) Examine the data using a grand tour. What changes when you change the correlation from close to zero to close to 1? Can you see a difference between strong positive correlation and strong negative correlation?
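A sketch of how such samples might be generated; the names d1, d2 and d3 match those used in the solution below, but the exact covariance values are assumptions chosen to produce the structure described there:

```r
library(mvtnorm)
set.seed(209)
# d1: identity covariance, correlations all zero
d1 <- rmvnorm(500, mean = rep(0, 5), sigma = diag(5))
# d2: strong positive correlation between variables 3 and 4
vc2 <- diag(5)
vc2[3, 4] <- vc2[4, 3] <- 0.95
d2 <- rmvnorm(500, mean = rep(0, 5), sigma = vc2)
# d3: strong correlation within the pairs (1, 2) and (3, 4)
vc3 <- diag(5)
vc3[1, 2] <- vc3[2, 1] <- 0.95
vc3[3, 4] <- vc3[4, 3] <- 0.95
d3 <- rmvnorm(500, mean = rep(0, 5), sigma = vc3)
# To compare negative correlation, flip the sign, e.g. use -0.95
animate_xy(d1)
animate_xy(d2)
animate_xy(d3)
```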
The points in d1 are spread fairly evenly in every projection. Data sets d2 and d3 have some projections where the points are concentrated along a line. This happens when variables 3 and 4 are contributing to the projection in d2, and when variables 1, 2, 3 and 4 are contributing to the projection in d3.
4. Principal components analysis on the simulated data
🧐 For data sets d2 and d3, what would you expect to be the number of PCs suggested by PCA?
👨🏽‍💻👩‍💻 Conduct the PCA. Report the variances (eigenvalues) and cumulative proportions of total variance, make a scree plot, and report the PC coefficients.
🤯 Often the selected number of PCs is used in future work. For d2 and d3, think about the pros and cons of using 4 PCs and 3 PCs, respectively.
Solution
Thinking about it: in d2 there is strong correlation between variables 3 and 4, which means probably only 4 PCs would be needed. In d3 there is also strong correlation between variables 1 and 2, which would mean that only 3 PCs would be needed.
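A sketch of the computation, assuming d2 and d3 are in the workspace (for example, simulated as in the previous exercise); ggscree() is from the mulgar package:

```r
# PCA on each simulated data set; standardising is harmless here
d2_pca <- prcomp(d2, scale. = TRUE)
d2_pca$sdev^2           # variances (eigenvalues)
summary(d2_pca)         # proportions and cumulative proportions
ggscree(d2_pca, q = 5)  # scree plot
d2_pca$rotation         # PC coefficients

d3_pca <- prcomp(d3, scale. = TRUE)
summary(d3_pca)
ggscree(d3_pca, q = 5)
d3_pca$rotation
```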
For d3, three PCs explain 88% of the variation, and the last two PCs have much smaller variance than the others. PCs 1 and 2 are combinations of variables 1, 2, 3 and 4, which captures this reduced dimension, and PC 3 is primarily variable 5.
The PCs are awkward combinations of the original variables. For d2, it would make sense to use PC1 (or equivalently an equal combination of V3 and V4), and then keep the original variables V1, V2 and V5.
For d3 it’s harder to make this call because the first two PCs are combinations of four variables. It’s hard to see from this that the ideal solution would be to use an equal combination of V1 and V2, an equal combination of V3 and V4, and V5 on its own.
Often the combinations of variables captured by the PCs are hard to interpret.
5. PCA on cross-currency time series
The rates.csv data has 152 currencies relative to the USD for the period of Nov 1, 2019 through to Mar 31, 2020. Treating the dates as variables, conduct a PCA to examine how the cross-currencies vary, focusing on this subset: ARS, AUD, BRL, CAD, CHF, CNY, EUR, FJD, GBP, IDR, INR, ISK, JPY, KRW, KZT, MXN, MYR, NZD, QAR, RUB, SEK, SGD, UYU, ZAR.
Standardise the currency columns to each have mean 0 and variance 1. Explain why this is necessary prior to doing the PCA, or is it? Use this data to make a time series plot overlaying all of the cross-currencies.
It isn’t necessary to standardise the variables before using the prcomp() function, because we can set scale. = TRUE to have it done as part of the PCA computation. However, it is useful to standardise the variables to make the time series plot where all the currencies are drawn, and this is also useful for interpreting the principal components.
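A sketch of the standardisation and overlaid time series plot; the file path and the date column name are assumptions:

```r
# Read the data; the path is an assumption
rates <- read_csv("rates.csv")
cur_sub <- c("ARS", "AUD", "BRL", "CAD", "CHF", "CNY", "EUR", "FJD",
             "GBP", "IDR", "INR", "ISK", "JPY", "KRW", "KZT", "MXN",
             "MYR", "NZD", "QAR", "RUB", "SEK", "SGD", "UYU", "ZAR")
rates_sub <- rates |>
  select(date, all_of(cur_sub)) |>
  mutate(across(all_of(cur_sub), ~ (.x - mean(.x)) / sd(.x)))
# Overlay all standardised cross-currencies
rates_sub |>
  pivot_longer(cols = all_of(cur_sub),
               names_to = "currency", values_to = "rate") |>
  ggplot(aes(x = date, y = rate, group = currency)) +
  geom_line(alpha = 0.4)
```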
Conduct a PCA. Make a scree plot, and summarise the proportion of total variance explained. Report these values and the coefficients for the first five PCs, nicely.
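A sketch, continuing from the rates_sub object above; since the columns are already standardised, no further scaling is needed:

```r
rates_pca <- prcomp(rates_sub |> select(-date))
ggscree(rates_pca, q = 24)             # scree plot
summary(rates_pca)$importance[, 1:5]   # variance summaries, first five PCs
round(rates_pca$rotation[, 1:5], 2)    # coefficients, first five PCs
```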
Most of the currencies contribute substantially to PC1. Only three contribute strongly to PC2: CHF, JPY and EUR. This is similar to what is learned from the summary table (made in b).
The pattern of the points is most unusual! It has a curious S shape. Principal component scores are supposed to be a random scattering of values, with no obvious structure, so this is a very strong pattern.
Make a time series plot of PC1 and PC2. Explain why this is useful to do for this data.
Because there is a strong pattern in the first two PCs, it could be useful to understand if this is related to the temporal context of the data.
Here we might expect that the PCs extract the main temporal patterns. We see this is the case.
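The plot can be made from the PC scores indexed by date; a sketch, assuming the rates_pca object computed above:

```r
rates_scores <- tibble(date = rates_sub$date,
                       PC1 = rates_pca$x[, 1],
                       PC2 = rates_pca$x[, 2])
rates_scores |>
  pivot_longer(cols = c(PC1, PC2), names_to = "PC", values_to = "score") |>
  ggplot(aes(x = date, y = score)) +
  geom_line() +
  facet_wrap(~ PC, ncol = 1, scales = "free_y")
```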
PC1 reflects the large group of currencies that greatly increase in mid-March.
PC2 reflects the few currencies that decrease at the start of March.
Note: an increase here means that the value of the currency declines relative to the USD, and a decrease indicates it is strengthening relative to the USD. Is this correct?
You’ll want to drill down deeper to understand what the PCA tells us about the movement of the various currencies, relative to the USD, over the volatile period of the COVID pandemic. Plot the first two PCs again, but connect the dots in order of time. Make it interactive with plotly, where the dates are the labels. What does following the dates tell us about the variation captured in the first two principal components?
The pattern in PC1 vs PC2 follows time. Prior to the pandemic there is a tangle of values on the left. Towards the end of February, when the world was starting to realise that COVID was a major health threat, there is a dramatic reaction from the world currencies, at least in relation to the USD. Currencies such as EUR, JPY, CHF reacted first, gaining strength relative to USD, and then they lost that strength. Most other currencies reacted later, losing value relative to the USD.
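A sketch of the interactive plot, using plotly’s ggplotly() with the dates as tooltip labels; geom_path() connects the points in date order:

```r
library(plotly)
pc_path <- rates_scores |>
  arrange(date) |>
  ggplot(aes(x = PC1, y = PC2, label = date)) +
  geom_path() +
  geom_point(size = 1)
ggplotly(pc_path, tooltip = "label")
```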
6. Write a simple question about the week’s material and test your neighbour or your tutor.
👋 Finishing up
Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.