ETC3250/5250 Tutorial 4

Re-sampling and regularisation

Author

Prof. Di Cook

Published

18 March 2024

Load the libraries and avoid conflicts

# Load libraries used everywhere
library(tidyverse)
library(tidymodels)
library(conflicted)
library(patchwork)
library(mulgar)
library(mvtnorm)
library(boot)
library(nullabor)
library(palmerpenguins)
library(GGally)
conflicts_prefer(dplyr::filter)
conflicts_prefer(dplyr::select)
conflicts_prefer(dplyr::slice)
conflicts_prefer(palmerpenguins::penguins)

options(digits=2)
p_tidy <- penguins |>
  select(species, bill_length_mm:body_mass_g) |>
  rename(bl=bill_length_mm,
         bd=bill_depth_mm,
         fl=flipper_length_mm,
         bm=body_mass_g) |>
  filter(!is.na(bl)) |>
  arrange(species)

🎯 Objectives

The goal for this week is for you to practice resampling methods, in order to tune models, assess model variance, and determine importance of variables.

🔧 Preparation

Complete the quiz
Do the reading related to week 3

Exercises:

Open your project for this unit called iml.Rproj.

1. Assess the significance of PC coefficients using bootstrap

In the lecture, we used bootstrap to examine the significance of the coefficients for the second principal component from the womens’ track PCA. Do this computation for PC1. The question for you to answer is: Can we consider all of the coefficients to be equal?

The data can be read using:

track <- read_csv("https://raw.githubusercontent.com/numbats/iml/master/data/womens_track.csv")

Solution

compute_PC1 <- function(data, index) {
  pc1 <- prcomp(data[index,], center=TRUE, scale=TRUE)$rotation[,1]
  # Coordinate signs
  if (sign(pc1[1]) < 0) 
    pc1 <- -pc1 
  return(pc1)
}
# Make sure sign of first PC element is positive
PC1_boot <- boot(data=track[,1:7], compute_PC1, R=1000)
colnames(PC1_boot$t) <- colnames(track[,1:7])
PC1_boot_ci <- as_tibble(PC1_boot$t) %>%
  gather(var, coef) %>% 
  mutate(var = factor(var, levels=c("m100", "m200", "m400", "m800", "m1500", "m3000", "marathon"))) %>%
  group_by(var) %>%
  summarise(q2.5 = quantile(coef, 0.025), 
            q5 = median(coef),
            q97.5 = quantile(coef, 0.975)) %>%
  mutate(t0 = PC1_boot$t0) 
  
# The red horizontal line indicates the null value 
# of the coefficient when all are equal.
ggplot(PC1_boot_ci, aes(x=var, y=t0)) + 
  geom_hline(yintercept=1/sqrt(7), linetype=2, colour="red") +
  geom_point() +
  geom_errorbar(aes(ymin=q2.5, ymax=q97.5), width=0.1) +
  #geom_hline(yintercept=0, linewidth=3, colour="white") +
  xlab("") + ylab("coefficient")

2. Using simulation to assess results when there is no structure

The ggscree function in the mulgar package computes PCA on multivariate standard normal samples, to learn what the largest eigenvalue might be when there the covariance between variables is 0.

What is the mean and covariance matrix of a multivariate standard normal distribution?

Solution

The mean is a \(p\)-dimensional vector of 0, and the covariance is a \(p\)-dimensional variance-covariance matrix.

Simulate a sample of 55 observations from a 7D standard multivariate normal distribution. Compute the sample mean and covariance. (Question: Why 55 observations? Why 7D?)

Solution

set.seed(854)
d <- rmvnorm(55, mean = rep(0, 7), sigma = diag(7))
apply(d, 2, mean)

[1]  0.271  0.125  0.054 -0.076 -0.012 -0.141 -0.055

cov(d)

        [,1]   [,2]    [,3]   [,4]   [,5]    [,6]    [,7]
[1,]  0.8162 -0.126  0.0102 -0.030  0.244 -0.0932  0.0097
[2,] -0.1263  0.915 -0.0050 -0.051 -0.092 -0.1128 -0.0242
[3,]  0.0102 -0.005  1.1710  0.077  0.387 -0.0019  0.1609
[4,] -0.0298 -0.051  0.0766  0.659  0.027  0.1862  0.0463
[5,]  0.2438 -0.092  0.3872  0.027  0.917 -0.1307  0.0143
[6,] -0.0932 -0.113 -0.0019  0.186 -0.131  0.8257  0.0120
[7,]  0.0097 -0.024  0.1609  0.046  0.014  0.0120  0.8046

Compute PCA on your sample, and note the variance of the first PC. How does this compare with variance of the first PC of the women’s track data?

Solution

d_pca <- prcomp(d, center=FALSE, scale=FALSE)
d_pca$sdev^2

[1] 1.55 1.15 1.04 0.77 0.68 0.56 0.48

The variance of the first PC of the womens’ track data is 5.8, which is much higher than that from this sample. It says that there is substantially more variance explained by PC 1 of the womens’s track data than would be expected if there was no association between any variables.

You should repeat generating the multivariate normal samples and computing the variance of PC 1 a few more times to learn what is the largest that would be observed.

3. Making a lineup plot to assess the dependence between variables

Permutation samples is used to significance assess relationships and importance of variables. Here we will use it to assess the strength of a non-linear relationship.

Generate a sample of data that has a strong non-linear relationship but no correlation, as follows:

set.seed(908)
n <- 205
df <- tibble(x1 = runif(n)-0.5, x2 = x1^2 + rnorm(n)*0.01)

and then use permutation to generate another 19 plots where x1 is permuted. You can do this with the nullabor package as follows:

set.seed(912)
df_l <- lineup(null_permute('x1'), df)

and make all 20 plots as follows:

ggplot(df_l, aes(x=x1, y=x2)) + 
  geom_point() + 
  facet_wrap(~.sample)

Is the data plot recognisably different from the plots of permuted data?

Solution

The data and the permuted data are very different. The permutation breaks any relationship between the two variables, so we know that there is NO relationship in any of the permuted data examples. This says that the relationship seen in the data is strongly statistically significant.

Repeat this with a sample simulated with no relationship between the two variables. Can the data be distinguished from the permuted data?

Solution

set.seed(916)
n <- 205
df <- tibble(x1 = runif(n)-0.5, x2 = rnorm(n)*0.1)
df_l <- lineup(null_permute('x1'), df)
ggplot(df_l, aes(x=x1, y=x2)) + 
  geom_point() + 
  facet_wrap(~.sample)

The data cannot be distinguished from the permuted data, so there is no statistically significant relatiomship between the two variables.

4. Computing \(k\)-folds for cross-validation

For the penguins data, compute 5-fold cross-validation sets, stratified by species.

List the observations in each sample, so that you can see there is no overlap.

Solution

set.seed(929)
p_folds <- vfold_cv(p_tidy, 5, strata=species)
c(1:nrow(p_tidy))[-p_folds$splits[[1]]$in_id]

 [1]   3   6  31  36  42  44  51  53  59  62  65  66  67  79  85  88  93  96 103
[20] 104 105 107 108 113 114 118 122 128 141 143 144 155 157 158 163 170 177 179
[39] 182 194 195 202 204 211 213 221 222 224 226 239 246 248 256 258 264 265 275
[58] 280 287 292 295 296 297 307 322 327 328 335 336 339

c(1:nrow(p_tidy))[-p_folds$splits[[2]]$in_id]

 [1]   1   8  13  17  19  21  24  29  41  50  54  56  78  86  87  89  97 100 101
[20] 112 117 121 123 129 130 132 133 139 149 150 152 159 166 167 168 169 171 189
[39] 190 191 193 198 212 215 225 228 231 241 244 249 250 259 260 262 266 268 269
[58] 270 271 272 276 282 283 284 288 321 331 337 342

c(1:nrow(p_tidy))[-p_folds$splits[[3]]$in_id]

 [1]   4   9  10  15  25  30  32  35  37  39  43  47  48  55  57  64  69  71  80
[20]  82  91 109 111 116 124 127 134 136 140 147 162 176 178 180 186 199 200 203
[39] 207 208 210 216 218 219 220 229 232 236 240 243 247 252 254 261 267 277 279
[58] 286 290 299 300 303 306 308 312 320 325 326 329

c(1:nrow(p_tidy))[-p_folds$splits[[4]]$in_id]

 [1]   5  11  18  20  22  23  27  28  33  34  52  70  72  73  75  77  81  90  92
[20]  94  95 106 110 119 125 137 138 142 145 151 154 156 160 161 165 174 181 183
[39] 187 192 196 206 214 223 227 234 237 238 245 255 257 274 281 285 289 293 294
[58] 298 302 313 314 315 317 324 330 332 338

c(1:nrow(p_tidy))[-p_folds$splits[[5]]$in_id]

 [1]   2   7  12  14  16  26  38  40  45  46  49  58  60  61  63  68  74  76  83
[20]  84  98  99 102 115 120 126 131 135 146 148 153 164 172 173 175 184 185 188
[39] 197 201 205 209 217 230 233 235 242 251 253 263 273 278 291 301 304 305 309
[58] 310 311 316 318 319 323 333 334 340 341

Make a scatterplot matrix for each fold, coloured by species. Do the samples look similar?

Solution

p_sub <- p_tidy[-p_folds$splits[[1]]$in_id, ]
ggscatmat(p_sub, columns=2:5, color="species") +
  theme(legend.position="none",
        axis.text = element_blank())

p_sub <- p_tidy[-p_folds$splits[[2]]$in_id, ]
ggscatmat(p_sub, columns=2:5, color="species") +
  theme(legend.position="none",
        axis.text = element_blank())

p_sub <- p_tidy[-p_folds$splits[[3]]$in_id, ]
ggscatmat(p_sub, columns=2:5, color="species") +
  theme(legend.position="none",
        axis.text = element_blank())

p_sub <- p_tidy[-p_folds$splits[[4]]$in_id, ]
ggscatmat(p_sub, columns=2:5, color="species") +
  theme(legend.position="none",
        axis.text = element_blank())

p_sub <- p_tidy[-p_folds$splits[[5]]$in_id, ]
ggscatmat(p_sub, columns=2:5, color="species") +
  theme(legend.position="none",
        axis.text = element_blank())

The folds are similar but there are some noticeable differences that might lead to variation in the statistics that are calculated from each other. However, one should consider this variation something that might generally occur if we had different samples.

5. What was the easiest part of this tutorial to understand, and what was the hardest?

👋 Finishing up

Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.