ETC3250/5250 Tutorial 4

Re-sampling and regularisation

Author

Prof. Di Cook

Published

18 March 2024

Load the libraries and avoid conflicts
# Load libraries used everywhere
library(tidyverse)
library(tidymodels)
library(conflicted)
library(patchwork)
library(mulgar)
library(mvtnorm)
library(boot)
library(nullabor)
library(palmerpenguins)
library(GGally)
conflicts_prefer(dplyr::filter)
conflicts_prefer(dplyr::select)
conflicts_prefer(dplyr::slice)
conflicts_prefer(palmerpenguins::penguins)

options(digits=2)
p_tidy <- penguins |>
  select(species, bill_length_mm:body_mass_g) |>
  rename(bl=bill_length_mm,
         bd=bill_depth_mm,
         fl=flipper_length_mm,
         bm=body_mass_g) |>
  filter(!is.na(bl)) |>
  arrange(species)

🎯 Objectives

The goal for this week is for you to practice resampling methods, in order to tune models, assess model variance, and determine the importance of variables.

🔧 Preparation

  • Complete the quiz
  • Do the reading related to week 3

Exercises:

Open your project for this unit called iml.Rproj.

1. Assess the significance of PC coefficients using bootstrap

In the lecture, we used the bootstrap to examine the significance of the coefficients for the second principal component from the women’s track PCA. Do this computation for PC1. The question for you to answer is: Can we consider all of the coefficients to be equal?

The data can be read using:

track <- read_csv("https://raw.githubusercontent.com/numbats/iml/master/data/womens_track.csv")
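
One possible way to organise the bootstrap is sketched below, using base R only. It is a starting point under stated assumptions, not the lecture's code: `boot_pc1()` and its arguments are hypothetical names, the PCA is assumed to be on standardised variables, and the sign of each bootstrap PC1 is fixed by making the coefficients sum to a positive value (reasonable only when PC1 is an overall average-like component). The toy data stand in for the track data so that the block runs on its own.

```r
# Bootstrap the PC1 coefficients of a numeric data set (base R only).
# `boot_pc1` and its arguments are hypothetical names used for illustration.
boot_pc1 <- function(x, n_boot = 1000, level = 0.95) {
  x <- as.matrix(x)
  n <- nrow(x)
  p <- ncol(x)
  coefs <- matrix(NA_real_, nrow = n_boot, ncol = p,
                  dimnames = list(NULL, colnames(x)))
  for (b in seq_len(n_boot)) {
    idx <- sample(n, replace = TRUE)        # resample rows with replacement
    v <- prcomp(x[idx, ], scale. = TRUE)$rotation[, 1]
    if (sum(v) < 0) v <- -v                 # sign of a PC is arbitrary, so align it
    coefs[b, ] <- v
  }
  alpha <- (1 - level) / 2
  data.frame(
    variable = colnames(x),
    lower = apply(coefs, 2, quantile, probs = alpha),
    median = apply(coefs, 2, median),
    upper = apply(coefs, 2, quantile, probs = 1 - alpha),
    equal_value = 1 / sqrt(p)               # value of each coefficient if all were equal
  )
}

# Toy data (NOT the track data): 7 variables sharing one common factor
set.seed(1)
f <- rnorm(55)
toy <- sapply(1:7, function(j) f + rnorm(55, sd = 0.5))
colnames(toy) <- paste0("v", 1:7)

res <- boot_pc1(toy, n_boot = 200)
res
```

Because the coefficient vector has length 1, equal coefficients on \(p\) variables would each be \(1/\sqrt{p}\). If every interval in `res` covers `equal_value`, the data are consistent with equal coefficients. For the exercise, the same function could be applied to the numeric columns of `track`, for example `boot_pc1(track[sapply(track, is.numeric)])`.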

2. Using simulation to assess results when there is no structure

The ggscree function in the mulgar package computes PCA on multivariate standard normal samples, to learn what the largest eigenvalue might be when the covariance between variables is 0.

  1. What is the mean and covariance matrix of a multivariate standard normal distribution?
  2. Simulate a sample of 55 observations from a 7D standard multivariate normal distribution. Compute the sample mean and covariance. (Question: Why 55 observations? Why 7D?)
  3. Compute PCA on your sample, and note the variance of the first PC. How does this compare with the variance of the first PC of the women’s track data?
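
A standard multivariate normal distribution has a zero mean vector and the identity matrix as its covariance. The steps above might be sketched as follows. The seed is arbitrary, and independent `rnorm()` columns are used because they are equivalent to a standard multivariate normal sample; `mvtnorm::rmvnorm()` with a zero mean and identity `sigma` would be an alternative. The PCA is assumed to be on standardised variables, so that the variances sum to the number of variables.

```r
set.seed(1234)                 # arbitrary seed, only for reproducibility
n <- 55
p <- 7
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(x) <- paste0("x", seq_len(p))

xbar <- colMeans(x)            # sample mean: close to, but not exactly, zero
s <- cov(x)                    # sample covariance: close to the identity
round(xbar, 2)
round(s, 2)

pca <- prcomp(x, scale. = TRUE)
pc_var <- pca$sdev^2           # PC variances (eigenvalues); they sum to p
pc_var[1]                      # above 1 purely through sampling variability
pc_var[1] / sum(pc_var)        # proportion of total variance on PC1
```

The first PC variance is always above 1 here, even though the variables are unrelated, which gives a baseline for judging the first PC variance from the track data.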

3. Making a lineup plot to assess the dependence between variables

Permutation samples are used to assess the significance of relationships and the importance of variables. Here we will use them to assess the strength of a non-linear relationship.

  1. Generate a sample of data that has a strong non-linear relationship but no correlation, as follows:
set.seed(908)
n <- 205
df <- tibble(x1 = runif(n)-0.5, x2 = x1^2 + rnorm(n)*0.01)

and then use permutation to generate another 19 plots where x1 is permuted. You can do this with the nullabor package as follows:

set.seed(912)
df_l <- lineup(null_permute('x1'), df)

and make all 20 plots as follows:

ggplot(df_l, aes(x=x1, y=x2)) + 
  geom_point() + 
  facet_wrap(~.sample)

Is the data plot recognisably different from the plots of permuted data?

  2. Repeat this with a sample simulated with no relationship between the two variables. Can the data be distinguished from the permuted data?
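
To see what the permutation is doing, the lineup can also be built by hand. The sketch below is an illustration in base R, not nullabor's implementation: `make_lineup()`, `df_none` and `df_none_l` are hypothetical names, and the no-relationship sample is one arbitrary choice (two independently generated variables).

```r
# Build a 20-panel lineup by hand: 19 panels with one variable permuted,
# and 1 panel holding the observed data.
# `make_lineup` is a hypothetical helper, not part of nullabor.
make_lineup <- function(d, var, n_panels = 20) {
  pos <- sample(n_panels, 1)                       # panel that holds the real data
  panels <- lapply(seq_len(n_panels), function(i) {
    out <- d
    if (i != pos) out[[var]] <- sample(d[[var]])   # permuting breaks any association
    out$.sample <- i
    out
  })
  res <- do.call(rbind, panels)
  attr(res, "pos") <- pos                          # remember where the data is
  res
}

set.seed(908)
n <- 205
# x1 and x2 are generated independently, so there is no relationship to find
df_none <- data.frame(x1 = runif(n) - 0.5, x2 = rnorm(n) * 0.01)

set.seed(912)
df_none_l <- make_lineup(df_none, "x1")
table(df_none_l$.sample)       # 205 rows in each of the 20 panels
attr(df_none_l, "pos")         # location of the real data
```

`df_none_l` has the same layout as `df_l`, so the same `ggplot()` code with `facet_wrap(~.sample)` can be used to draw it. With nullabor itself, `lineup(null_permute('x1'), df_none)` does the equivalent job.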

4. Computing \(k\)-folds for cross-validation

For the penguins data, compute 5-fold cross-validation sets, stratified by species.

  1. List the observations in each sample, so that you can see there is no overlap.
  2. Make a scatterplot matrix for each fold, coloured by species. Do the samples look similar?
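
What stratification does can be seen by assigning folds by hand. This is a minimal base R sketch, not the tidymodels approach: `stratified_folds()` is a hypothetical helper, and the labels are toy data with made-up group sizes standing in for `p_tidy$species`.

```r
# Assign each observation to one of v folds, balancing the folds within strata.
# `stratified_folds` is a hypothetical helper written for this sketch.
stratified_folds <- function(strata, v = 5) {
  fold <- integer(length(strata))
  for (idx in split(seq_along(strata), strata)) {
    # recycle the labels 1..v over the stratum, then shuffle them
    fold[idx] <- sample(rep_len(seq_len(v), length(idx)))
  }
  fold
}

set.seed(2024)
# Toy labels with made-up group sizes, standing in for p_tidy$species
species <- factor(rep(c("A", "B", "C"), times = c(150, 70, 120)))
fold <- stratified_folds(species, v = 5)

table(species, fold)                       # each group is spread evenly over folds
held_out <- split(seq_along(fold), fold)   # row numbers held out in each fold
lengths(held_out)
anyDuplicated(unlist(held_out))            # 0 means no row is in two folds
```

In tidymodels, `vfold_cv(p_tidy, v = 5, strata = species)` from rsample is the usual route, with `assessment()` extracting the held-out rows of each split.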

5. What was the easiest part of this tutorial to understand, and what was the hardest?

👋 Finishing up

Make sure you say thanks and good-bye to your tutor. This is also a time to report what you enjoyed and what you found difficult.