---
title: "ETC3250/5250 Tutorial 5 Instructions"
subtitle: "Visualising high-dimensional data"
author: "prepared by Professor Di Cook"
date: "Week 5"
output:
  html_document:
    after_body: tutorial-footer.html
    css: tutorial.css
---
```{r, echo = FALSE, message = FALSE, warning = FALSE}
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  error = FALSE,
  collapse = TRUE,
  comment = "#",
  fig.height = 4,
  fig.width = 8,
  fig.align = "center",
  cache = FALSE
)
library(emo)
```
## `r emo::ji("target")` Objective
The objectives for this week are to

- learn to use the tour to develop intuition about multiple dimensions
- recognise features in high dimensions including multivariate outliers, clustering, and linear and nonlinear dependence
- use the manual tour for interpretation, to assess variable importance

## `r emo::ji("wrench")` Preparation
Make sure you have these packages installed:

```
install.packages(c("tidyverse", "tourr", "MASS", "spinifex", "mvtnorm", "tidymodels", "discrim", "broom", "GGally"))
```
### `r emo::ji("book")` Reading
- Reading material on high-dimensional data visualisation on Moodle
## `r emo::ji("waving_hand")` Getting started
If you are in a zoom tutorial, say hello in the chat. If in person, do say hello to your tutor and to your neighbours.
## 💬 Class discussion exercises
- In tour 1, do you see any clustering?
```{r eval=FALSE, echo=TRUE}
# This is code that YOU CAN RUN YOURSELF to see the tour
library(tidyverse)
olive <- read_csv("http://www.ggobi.org/book/data/olive.csv") %>%
  dplyr::select(-`...1`)
library(tourr)
animate_xy(olive[, 3:10], axes="off")
```
- In tour 2, the three classes have been coloured. Can you see differences between the three groups? Do any of these groups break into more clusters?
```{r eval=FALSE, echo=TRUE}
# This is code that YOU CAN RUN YOURSELF to see the tour,
# but it's not necessary to run in order to do the exercise
animate_xy(olive[,3:10], axes="off",
           col=olive$region)
```
## `r emo::ji("gear")` Exercises
### 1. Data with different variance-covariances
Take a look at the data from tutorial 5 using a grand tour. Can you see the difference between the two sets better now? In one set, the spread of observations is roughly the same for every group, in any combination of the variables. In the other set, there are combinations of the variables where the spread differs between the groups.
**Set A** has equal variance-covariance between groups, $\Sigma$:
$$\Sigma = \begin{bmatrix} 3.0&0.2&-1.2&0.9\\
0.2&2.5&-1.4&0.3\\
-1.2&-1.4&2.0&1.0\\
0.9&0.3&1.0&3.0\\
\end{bmatrix}$$
and **set B** has different variance-covariances between groups, $\Sigma_1, \Sigma_2, \Sigma_3$:
$\Sigma_1 = \Sigma$
$$\Sigma_2 = \begin{bmatrix}3.0&-0.8&1.2&0.3\\
-0.8&2.5&1.4&0.3\\
1.2&1.4&2.0&1.0\\
0.3&0.3&1.0&3.0\\
\end{bmatrix}$$
$$\Sigma_3 = \begin{bmatrix}2.0&-1.0&1.2&0.3\\
-1.0&2.5&1.4&0.3\\
1.2&1.4&4.0&-1.2\\
0.3&0.3&-1.2&3.0\\
\end{bmatrix}$$
Make a scatterplot matrix to match each variance-covariance matrix with the spread of the observations. In set A, the spread is (approximately) **homogeneous** between groups. In set B, the spread is clearly **heterogeneous** between groups.
This code is used to simulate the data:
```{r echo=TRUE, eval=FALSE}
set.seed(20200416)
library(mvtnorm)
vc1 <- matrix(c(3.0, 0.2, -1.2, 0.9,
                0.2, 2.5, -1.4, 0.3,
                -1.2, -1.4, 2.0, 1.0,
                0.9, 0.3, 1.0, 3.0), ncol=4, byrow=TRUE)
vc2 <- matrix(c(3.0, -0.8, 1.2, 0.3,
                -0.8, 2.5, 1.4, 0.3,
                1.2, 1.4, 2.0, 1.0,
                0.3, 0.3, 1.0, 3.0), ncol=4, byrow=TRUE)
vc3 <- matrix(c(2.0, -1.0, 1.2, 0.3,
                -1.0, 2.5, 1.4, 0.3,
                1.2, 1.4, 4.0, -1.2,
                0.3, 0.3, -1.2, 3.0), ncol=4, byrow=TRUE)
m1 <- c(0, 0, 3, 0)
m2 <- c(0, 3, -3, 0)
m3 <- c(-3, 0, 3, 3)
n1 <- 85
n2 <- 104
n3 <- 48
setA <- rbind(rmvnorm(n1, m1, vc1),
              rmvnorm(n2, m2, vc1),
              rmvnorm(n3, m3, vc1))
setA <- data.frame(setA)
setA$class <- c(rep("1", n1), rep("2", n2), rep("3", n3))
setB <- rbind(rmvnorm(n1, m1, vc1),
              rmvnorm(n2, m2, vc2),
              rmvnorm(n3, m3, vc3))
setB <- data.frame(setB)
setB$class <- c(rep("1", n1), rep("2", n2), rep("3", n3))
```
This code can be used to make the scatterplot matrix:
```{r eval=FALSE}
library(GGally)
ggscatmat(setA, columns=1:4, color="class")
ggscatmat(setB, columns=1:4, color="class")
```
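The matching can also be checked numerically: compute the sample variance-covariance matrix of each group and compare it with the $\Sigma$s above (a minimal sketch, assuming `setA` and `setB` from the simulation chunk):
```{r eval=FALSE}
# Sample variance-covariance matrix of each group in set A:
# all three should be close to the common Sigma
by(setA[, 1:4], setA$class, cov)
# In set B the three groups give visibly different matrices
by(setB[, 1:4], setB$class, cov)
```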
You can use this code to run the tour (RUN CODE ONE LINE AT A TIME!):
```{r eval=FALSE}
library(tourr)
animate_xy(setA[,1:3], col=setA$class)
animate_xy(setB[,1:3], col=setB$class)
```
Also note that if you include the fourth variable (`1:4` in the code above), the variance-covariance of set A is (nearly) collinear: in some projections all of the points align along a line. That means the spread or variation of each group is effectively only 3D. This is something a tour can help you see, but which was not at all visible from the scatterplot matrix. It can also be detected with PCA, which would return a near-zero eigenvalue for the fourth PC, with the first three PCs together explaining essentially 100% of the total variance.
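This can be confirmed with a quick check (a sketch, assuming `vc1` and `setA` from the simulation chunk above): the smallest eigenvalue of $\Sigma$ is close to zero, so each group's spread is essentially confined to a 3D subspace.
```{r eval=FALSE}
# Eigenvalues of the common variance-covariance matrix:
# the smallest one is near zero
eigen(vc1)$values
# Equivalently, PCA on the observations of a single group:
# the fourth PC has near-zero variance
prcomp(setA[setA$class == "1", 1:4])$sdev^2
```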
### 2. Exploring for class separations, heterogeneous variance-covariance and outliers
Remember the chocolates data? The chocolates data was compiled by students in a previous class of Prof Cook, by collecting nutrition information on the chocolates as listed on their internet sites. All numbers were normalised to be equivalent to a 100g serving. Units of measurement are listed in the variable name. You are interested in answering "How do milk and dark chocolates differ on nutritional values?"
a. Examine all of the nutritional variables, relative to the chocolate type, using a grand tour (`tourr::animate_xy()`) and a guided tour (look up the help for `tourr::guided_tour` to see example of how to use the `lda_pp` index). Explain what you see in terms of differences between groups.
b. From the tour, should you assume that the variance-covariance matrices of the two types of chocolates are the same? Regardless of your answer, conduct a linear discriminant analysis on the standardised chocolates data. Because the variables are standardised, the magnitudes of the coefficients of the linear discriminant can be used to determine the most important variables. What are the three most important variables? What are the four least important variables? Look at the data with a grand tour of only the three important variables, and discuss the differences between the groups.
```{r echo=TRUE, eval=FALSE}
library(tidyverse)
library(tourr)
choc <- read_csv("http://iml.numbat.space/data/chocolates.csv")
choc <- choc %>%
  rename(Cl = Calories,
         CF = CalFat,
         TF = TotFat_g,
         SF = SatFat_g,
         Ch = Chol_mg,
         Na = Na_mg,
         Cb = Carbs_g,
         Fb = Fiber_g,
         Sg = Sugars_g,
         Pr = Protein_g)
std <- function(x) (x-mean(x))/sd(x)
choc_std <- choc %>%
  mutate_if(is.numeric, std)
animate_xy(choc_std[,5:14], col=choc$Type)
animate_xy(choc_std[,5:14],
           guided_tour(lda_pp(choc_std$Type)),
           col = choc_std$Type)
# Conduct LDA
library(discrim)
library(parsnip)
library(MASS)
choc_std <- choc_std %>%
  mutate(Type = as.factor(Type))
lda_mod <- discrim_linear() %>%
  set_engine("MASS", prior = c(0.5, 0.5)) %>%
  translate()
choc_lda_fit <- lda_mod %>%
  fit(Type ~ ., data = choc_std[, 4:14])
choc_lda_fit
animate_xy(choc_std[,c(7,10,12)], col=choc$Type)
```
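To rank the variables rather than eyeballing the printed output, you can sort the discriminant coefficients by magnitude (a sketch, assuming `choc_lda_fit` from the chunk above; the MASS engine stores the coefficients in `$fit$scaling`):
```{r eval=FALSE}
# Coefficients of the single linear discriminant, largest magnitude first;
# because the variables are standardised, magnitude reflects importance
lda_coef <- choc_lda_fit$fit$scaling
sort(abs(lda_coef[, 1]), decreasing = TRUE)
```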
### 3. Assessing variable importance with the manual tour
This example uses the olive oils data.
a. Read in the data. Keep `region` and the fatty acid content variables. Standardise the variables to have mean 0 and variance 1.
b. Fit a linear discriminant analysis model to a training set of data. This will produce a 2D discriminant space because there are three classes (regions). Based on the coefficients, which variable(s) are important for the first direction, and which are important for the second direction?
c. Using a manual tour from the `spinifex` package, starting from the projection given by the discriminant space, explore the importance of (i) `eicosenoic` for separating region 1, and (ii) `oleic`, `linoleic` and `arachidic` for separating regions 2 and 3, and (iii) check that `stearic` is not important for any of the separations.
```{r echo=TRUE, eval=FALSE}
library(tidyverse)
library(rsample)
library(yardstick)
library(spinifex)
olive <- read_csv("http://ggobi.org/book/data/olive.csv") %>%
  dplyr::select(-`...1`, -area) %>%
  mutate(region = factor(region))
# Standardise variables
std <- function(x) (x-mean(x))/sd(x)
olive_std <- olive %>%
  mutate_if(is.numeric, std)
set.seed(775)
olive_split <- initial_split(olive_std, 2/3, strata = region)
olive_train <- analysis(olive_split)
olive_test <- assessment(olive_split)
lda_mod <- discrim_linear() %>%
  set_engine("MASS", prior = c(1/3, 1/3, 1/3)) %>%
  translate()
olive_lda_fit <- lda_mod %>%
  fit(region ~ ., data = olive_train)
olive_lda_fit
# Collect the projection giving the best separation,
# to use with the manual tour for assessing importance
lda_proj <- olive_lda_fit$fit$scaling
# Generate the manual path examining the importance of eicosenoic (variable 8)
path1 <- manual_tour(basis = lda_proj, manip_var = 8)
ggt <- ggtour(path1, olive_std[,2:9], angle = 0.05) +
  proto_point(aes_args = list(color = olive_std$region)) +
  proto_basis()
animate_plotly(ggt)
# To examine the other variables, change manip_var = 8 to:
# oleic (4), linoleic (5), arachidic (7), stearic (3)
```
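The test set created above is not otherwise used in this chunk; one natural check (a sketch, assuming the objects from the chunk above) is to compute the accuracy of the fitted model on it with `yardstick`:
```{r eval=FALSE}
# Predict region for the held-out observations and compute the accuracy
olive_test_pred <- olive_test %>%
  mutate(pred_region = predict(olive_lda_fit, olive_test)$.pred_class)
accuracy(olive_test_pred, truth = region, estimate = pred_region)
```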
##### © Copyright 2022 Monash University