The goal for this week is for you to practice resampling methods, in order to tune models, assess model variance, and determine the importance of variables.
🔧 Preparation
Complete the quiz
Do the reading related to week 3
Exercises:
Open your project for this unit called iml.Rproj.
1. Assess the significance of PC coefficients using bootstrap
In the lecture, we used the bootstrap to examine the significance of the coefficients of the second principal component from the women's track PCA. Do this computation for PC1. The question for you to answer is: can we consider all of the coefficients to be equal?
# boot and tidyverse are assumed to be loaded in the setup chunk;
# they are included here so the chunk runs on its own.
library(boot)
library(tidyverse)

compute_PC1 <- function(data, index) {
  pc1 <- prcomp(data[index,], center=TRUE, scale=TRUE)$rotation[,1]
  # Coordinate signs: make sure sign of first PC element is positive
  if (sign(pc1[1]) < 0)
    pc1 <- -pc1
  return(pc1)
}

# track: the women's track records data, assumed loaded earlier in the tutorial
PC1_boot <- boot(data=track[,1:7], compute_PC1, R=1000)
colnames(PC1_boot$t) <- colnames(track[,1:7])

PC1_boot_ci <- as_tibble(PC1_boot$t) %>%
  gather(var, coef) %>%
  mutate(var = factor(var, levels=c("m100", "m200", "m400",
                                    "m800", "m1500", "m3000", "marathon"))) %>%
  group_by(var) %>%
  summarise(q2.5 = quantile(coef, 0.025),
            q5 = median(coef),
            q97.5 = quantile(coef, 0.975)) %>%
  mutate(t0 = PC1_boot$t0)

# The red horizontal line indicates the null value of the coefficient when
# all are equal: a unit-length vector with seven equal entries has each
# entry equal to 1/sqrt(7).
ggplot(PC1_boot_ci, aes(x=var, y=t0)) +
  geom_hline(yintercept=1/sqrt(7), linetype=2, colour="red") +
  geom_point() +
  geom_errorbar(aes(ymin=q2.5, ymax=q97.5), width=0.1) +
  #geom_hline(yintercept=0, linewidth=3, colour="white") +
  xlab("") +
  ylab("coefficient")
2. Using simulation to assess results when there is no structure
The ggscree function in the mulgar package computes PCA on multivariate standard normal samples, to learn what the largest eigenvalue might be when the covariance between variables is 0.
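As an illustration (not part of the exercise itself), a call along these lines overlays that simulated guide on the scree plot of the track PCA; the argument names follow the mulgar documentation, so check ?ggscree if your version differs:

library(mulgar)
# PCA of the women's track data (assumed already loaded as `track`)
track_pca <- prcomp(track[,1:7], center = TRUE, scale = TRUE)
# Scree plot with a guide showing the eigenvalues expected when
# there is no association between the variables
ggscree(track_pca, q = 7, guide = TRUE)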
What is the mean and covariance matrix of a multivariate standard normal distribution?
Solution
The mean is a \(p\)-dimensional vector of 0s, and the covariance is the \(p\times p\) identity matrix: variances of 1 on the diagonal and covariances of 0 off the diagonal.
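Written out, a standard multivariate normal is \(X \sim N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})\) with

\[
\boldsymbol{\mu} = (0, 0, \dots, 0)^\top, \qquad
\boldsymbol{\Sigma} = I_p =
\begin{pmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1
\end{pmatrix},
\]

so each variable has mean 0 and variance 1, and every pair of variables has covariance 0.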
Simulate a sample of 55 observations from a 7D standard multivariate normal distribution. Compute the sample mean and covariance. (Question: Why 55 observations? Why 7D?)
Solution
library(mvtnorm)

set.seed(854)
# Simulate 55 observations from a 7D standard multivariate normal
d <- rmvnorm(55, mean = rep(0, 7), sigma = diag(7))
# Sample mean of each variable
apply(d, 2, mean)
# Sample variance-covariance matrix
cov(d)
The variance of the first PC of the women's track data is 5.8, which is much higher than that from this sample. This says that there is substantially more variance explained by PC 1 of the women's track data than would be expected if there were no association between any of the variables.
You should repeat generating the multivariate normal samples and computing the variance of PC 1 a few more times, to get a sense of the largest value that would be observed when there is no association between the variables.
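A minimal sketch of this check, assuming the mvtnorm package is loaded as above (the seed and the 100 repetitions are arbitrary choices):

set.seed(1168)
# Variance of PC 1 across repeated samples with no association
# between the variables
pc1_var <- replicate(100, {
  d <- rmvnorm(55, mean = rep(0, 7), sigma = diag(7))
  prcomp(d, scale = TRUE)$sdev[1]^2
})
summary(pc1_var)
max(pc1_var)

Even the largest of these values should fall well below the 5.8 observed for PC 1 of the women's track data.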
3. Making a lineup plot to assess the dependence between variables
Permutation samples are used to assess the significance of relationships and the importance of variables. Here we will use them to assess the strength of a non-linear relationship.
Generate a sample of data that has a strong non-linear relationship but no correlation, as follows:
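One way to simulate such data and build the lineup, assuming the tidyverse and nullabor packages (the sample size, noise level and quadratic form are illustrative choices, not necessarily those used in the tutorial):

library(tidyverse)
library(nullabor)

set.seed(212)
# Quadratic relationship: strong dependence, but correlation close to 0
df <- tibble(x = runif(200, -1, 1),
             y = x^2 + rnorm(200, sd = 0.05))
cor(df$x, df$y)

# Lineup: the data plot hidden among plots of permuted copies of x
ggplot(lineup(null_permute("x"), df),
       aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ .sample)

lineup() prints an encrypted message; running the decrypt() call it provides reveals which panel holds the real data, after you have made your guess.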
Is the data plot recognisably different from the plots of permuted data?
Solution
The data plot and the permuted data plots are very different. The permutation breaks any relationship between the two variables, so we know that there is NO relationship in any of the permuted data examples. This indicates that the relationship seen in the data is strongly statistically significant.
Repeat this with a sample simulated with no relationship between the two variables. Can the data be distinguished from the permuted data?
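As a sketch of this variation (again assuming the tidyverse and nullabor, with arbitrary simulation settings), generate y independently of x and build the same kind of lineup:

library(tidyverse)
library(nullabor)

set.seed(329)
# No relationship: x and y are generated independently
df_null <- tibble(x = runif(200, -1, 1),
                  y = rnorm(200))

ggplot(lineup(null_permute("x"), df_null),
       aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ .sample)

Here the real data should not be distinguishable from the permuted panels.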
The folds are similar, but there are some noticeable differences that might lead to variation in the statistics calculated from each of them. However, this variation should be considered the kind that would generally occur if we had different samples.
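A minimal sketch of how this fold-to-fold variation could be examined, assuming the rsample package and the track data used earlier (the statistic, the mean 100m record, is chosen purely for illustration):

library(rsample)

set.seed(1245)
# Split the track data into five cross-validation folds
folds <- vfold_cv(track[,1:7], v = 5)
# Compute the same statistic on the analysis set of each fold;
# the values differ a little from fold to fold
sapply(folds$splits, function(s) mean(analysis(s)$m100))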
5. What was the easiest part of this tutorial to understand, and what was the hardest?
👋 Finishing up
Make sure you say thanks and good-bye to your tutor. This is also a time to report what you enjoyed and what you found difficult.