1. Write a projection matrix which would generate a 2D projection where the first data projection has variables 1 and 4 combined equally, and the second data projection has one third of variable 2 and two thirds of variable 5.
Solution
Taking the weights as given, the columns are orthogonal (their dot product is 0, since they involve different variables) but they are not orthonormal, because neither column has length 1. Each column must be normalised to unit length, giving (1/√2, 0, 0, 1/√2, 0) and (0, 1/√5, 0, 0, 2/√5).
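A sketch in R of one valid answer, checking orthonormality (the layout of the matrix is an assumption consistent with the question; other orderings are possible):

```r
# Projection matrix: column 1 combines variables 1 and 4 equally,
# column 2 weights variables 2 and 5 in a 1:2 ratio; both columns
# are normalised to unit length so the matrix is orthonormal.
A <- matrix(c(1/sqrt(2), 0,         0, 1/sqrt(2), 0,
              0,         1/sqrt(5), 0, 0,         2/sqrt(5)),
            ncol = 2)
t(A) %*% A  # the 2x2 identity matrix confirms the columns are orthonormal
```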
2. Which of these statements is the most accurate? And which is the most precise?
A. It is almost certain to rain in the next week.
B. It is 90% likely to get at least 10mm of rain tomorrow.
Solution
A is more accurate, but B is more precise.
3. For the following data, make an appropriate training/test split of 60:40. The response variable is cause. Demonstrate that you have made an appropriate split.
# A tibble: 4 × 2
cause n
<chr> <int>
1 accident 138
2 arson 37
3 burning_off 9
4 lightning 838
Solution
The data is unbalanced, so it is especially important to stratify the sampling by the response variable. Without stratification, the test set would likely be missing observations in the burning_off category.
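One way to make the stratified split is with rsample (a sketch; the seed value and the `bushfires` data frame name are assumptions, while `bushfires_split` and `cause` come from the question):

```r
library(rsample)
library(dplyr)
set.seed(1156)  # illustrative seed; any value gives a reproducible split
# 60:40 split, stratified by the response so each class appears in both sets
bushfires_split <- initial_split(bushfires, prop = 0.6, strata = cause)
training(bushfires_split) |> count(cause)
```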
# A tibble: 4 × 2
cause n
<chr> <int>
1 accident 84
2 arson 21
3 burning_off 5
4 lightning 502
testing(bushfires_split) |> count(cause)
# A tibble: 4 × 2
cause n
<chr> <int>
1 accident 54
2 arson 16
3 burning_off 4
4 lightning 336
4. In the lecture slides from week 1 on bias vs variance, these four images were shown.
Mark the images with the labels “true model”, “fitted model”, “bias”. Then explain in your own words why the different model shown in each has (potentially) large bias or small bias, and small variance or large variance.
Solution
The linear model will be very similar regardless of the training sample, so it has small variance. But because it misses the curved nature of the true model, it has large bias, missing critical parts of the two classes that are different.
The non-parametric model captures the curves and thus has small bias, but the fitted model might vary a lot from one training sample to another, so it has large variance.
5. The following data contains true class and predictive probabilities for a model fit. Answer the questions below for this data.
Compute the accuracy, and accuracy if all observations were classified as Adelie. Why is the accuracy almost as good when all observations are predicted to be the majority class?
Solution
Accuracy = 33/35 = 0.94
Accuracy when all predicted to be Adelie = 30/35 = 0.86
There are only 5 observations in the Chinstrap class, so accuracy remains high even if this class is effectively ignored.
Compute the balanced accuracy, by averaging the class errors. Why is it lower than the overall accuracy? Which is the better accuracy to use to reflect the ability to classify this data?
Solution
The balanced accuracy is (30/30 + 3/5)/2 = 0.8; both misclassified observations are Chinstrap. This is a better reflection of the predictive ability of the model for this data because it accounts for the difficulty of predicting the small Chinstrap group.
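The same numbers can be recovered from the confusion matrix implied by the counts above (a sketch; the matrix entries follow from 30 correct Adelie and 3 of 5 correct Chinstrap):

```r
# Confusion matrix: rows = true class, columns = predicted class
cm <- matrix(c(30, 0,
                2, 3),
             nrow = 2, byrow = TRUE,
             dimnames = list(true = c("Adelie", "Chinstrap"),
                             pred = c("Adelie", "Chinstrap")))
sum(diag(cm)) / sum(cm)       # overall accuracy: 33/35 = 0.94
mean(diag(cm) / rowSums(cm))  # balanced accuracy: (1 + 0.6)/2 = 0.8
```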
6. This question relates to feature engineering, creating better variables on which to build your model.
The following spam data has a heavily skewed distribution for the size of the email message. How would you transform this variable to better see differences between spam and ham emails?
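A log transformation is the usual remedy for this kind of right skew, spreading out the many small messages so the spam and ham distributions can be compared. A sketch in R, where the data frame and column names (`spam`, `size`, `type`) are assumptions:

```r
library(ggplot2)
# log10(size + 1): the +1 guards against zero-length messages
ggplot(spam, aes(x = log10(size + 1), fill = type)) +
  geom_density(alpha = 0.5)
```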
7. Discuss with your neighbour, what you found the most difficult part of last week’s content. Find some material (from resources or googling) together that gives alternative explanations that make it clearer.
👋 Finishing up
Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.