ETC3250/5250 Tutorial 2

Basics of machine learning

Author

Prof. Di Cook

Published

4 March 2024

🎯 Objectives

The goal for this week is for you to learn and practise some of the basics of machine learning.

🔧 Preparation

  • Complete the quiz
  • Do the reading related to week 1

Exercises:

Open your project for this unit called iml.Rproj.

1. Answer the following questions for this data matrix,

\[\begin{align*} {\mathbf X} = \left[\begin{array}{rrrrr} 2 & -2 & -8 & 6 & -7 \\ 6 & 6 & -4 & 9 & 6 \\ 5 & 4 & 3 & -7 & 8 \\ 1 & -7 & 6 & 7 & -1 \end{array}\right] \end{align*}\]

  1. What is \(X_1\) (variable 1)?
  1. What is observation 3?
  1. What is \(n\)?
  1. What is \(p\)?
  1. What is \(X^\top\)?
  1. Write a projection matrix which would generate a 2D projection where the first data projection has variables 1 and 4 combined equally, and the second data projection has one third of variable 2 and two thirds of variable 5.
  1. Why can’t the following matrix be considered a projection matrix?

\[\begin{align*} {\mathbf A} = \left[\begin{array}{rr} -1/\sqrt{2} & 1/\sqrt{3} \\ 0 & 0 \\ 1/\sqrt{2} & 0 \\ 0 & \sqrt{2}/\sqrt{3} \\ \end{array}\right] \end{align*}\]
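You can check your answers to the first few parts, and test a candidate projection matrix, in R. For an orthonormal projection the columns must each have length 1 and be mutually orthogonal, so \(A^\top A\) should equal the identity matrix. A quick sketch:

```r
# Enter the data matrix X (each row is an observation)
X <- matrix(c( 2, -2, -8,  6, -7,
               6,  6, -4,  9,  6,
               5,  4,  3, -7,  8,
               1, -7,  6,  7, -1),
            nrow = 4, byrow = TRUE)

dim(X)   # gives n (rows) and p (columns)
t(X)     # the transpose, X^T

# The candidate projection matrix from the last part
A <- matrix(c(-1/sqrt(2), 1/sqrt(3),
               0,         0,
               1/sqrt(2), 0,
               0,         sqrt(2)/sqrt(3)),
            nrow = 4, byrow = TRUE)

# For a valid projection matrix this should be the 2x2 identity;
# a non-zero off-diagonal entry means the columns are not orthogonal.
round(t(A) %*% A, 3)
```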

2. Which of these statements is the most accurate? And which is the most precise?

A. It is almost certain to rain in the next week.

B. It is 90% likely to get at least 10mm of rain tomorrow.

3. For the following data, make an appropriate training/test split of 60:40. The response variable is cause. Demonstrate that you have made an appropriate split.

library(readr)
library(dplyr)
library(rsample)

bushfires <- read_csv("https://raw.githubusercontent.com/dicook/mulgar_book/pdf/data/bushfires_2019-2020.csv")
bushfires |> count(cause)
# A tibble: 4 × 2
  cause           n
  <chr>       <int>
1 accident      138
2 arson          37
3 burning_off     9
4 lightning     838
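Because the classes are so unbalanced, a stratified split is appropriate, so that each class keeps roughly the 60:40 ratio. One way is `initial_split()` from rsample with the `strata` argument. This sketch uses a synthetic stand-in with the same class counts as the output above, so it runs without downloading the data:

```r
library(dplyr)
library(rsample)

# Stand-in with the same class frequencies as the bushfires data above
bushfires <- tibble(
  cause = rep(c("accident", "arson", "burning_off", "lightning"),
              times = c(138, 37, 9, 838))
)

set.seed(2024)
# Stratify on the response; rsample may warn about pooling the
# very small burning_off class into a neighbouring stratum
split <- initial_split(bushfires, prop = 0.6, strata = cause)
b_train <- training(split)
b_test  <- testing(split)

# Demonstrate the split kept the class proportions similar
bind_rows(train = count(b_train, cause),
          test  = count(b_test, cause),
          .id = "set") |>
  group_by(set) |>
  mutate(prop = n / sum(n))
```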

4. In the lecture slides from week 1 on bias vs variance, these four images were shown.

Mark the images with the labels “true model”, “fitted model”, “bias”. Then explain in your own words why the fitted model shown in each has (potentially) large or small bias, and large or small variance.

5. The following data contains true class and predictive probabilities for a model fit. Answer the questions below for this data.

pred_data <- read_csv("https://raw.githubusercontent.com/numbats/iml/master/data/tutorial_pred_data.csv") |>
  mutate(true = factor(true))
  1. How many classes?
  1. Compute the confusion table, using the maximum predictive probability to label the observation.
  1. Compute the accuracy, and accuracy if all observations were classified as Adelie. Why is the accuracy almost as good when all observations are predicted to be the majority class?
  1. Compute the balanced accuracy, by averaging the class errors. Why is it lower than the overall accuracy? Which is the better accuracy to use to reflect the ability to classify this data?
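To see why accuracy and balanced accuracy can disagree, here is the calculation on a small made-up two-class example (the class names and counts are placeholders, not the tutorial data):

```r
# Toy predictions: class "A" dominates, mimicking an unbalanced problem
true <- factor(c(rep("A", 90), rep("B", 10)))
pred <- factor(c(rep("A", 88), rep("B", 2),   # 88 of 90 A's correct
                 rep("A", 7),  rep("B", 3)),  # only 3 of 10 B's correct
               levels = levels(true))

# Confusion table: rows are the true class, columns the prediction
cm <- table(true, pred)
cm

accuracy <- sum(diag(cm)) / sum(cm)          # overall accuracy
majority <- max(table(true)) / length(true)  # accuracy if everything is "A"
balanced <- mean(diag(cm) / rowSums(cm))     # mean of per-class accuracies
c(accuracy = accuracy, majority = majority, balanced = balanced)
```

The overall accuracy (0.91) barely beats always predicting the majority class (0.90), while the balanced accuracy (about 0.64) exposes how badly the minority class is classified.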

6. This question relates to feature engineering, creating better variables on which to build your model.

  1. The following spam data has a heavily skewed distribution for the size of the email message. How would you transform this variable to better see differences between spam and ham emails?
library(ggplot2)
library(ggbeeswarm)
spam <- read_csv("http://ggobi.org/book/data/spam.csv")
ggplot(spam, aes(x=spam, y=size.kb, colour=spam)) +
  geom_quasirandom() +
  scale_color_brewer("", palette = "Dark2") + 
  coord_flip() +
  theme(legend.position="none")
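A log scale is one common transformation for a heavily right-skewed, positive-valued size variable. This sketch redraws the plot above on a log10 scale, using a synthetic skewed stand-in so it runs without the spam data; with the real data you would keep the original `ggplot(spam, ...)` call and just add the scale:

```r
library(ggplot2)
library(ggbeeswarm)

set.seed(1)
# Synthetic stand-in: two groups with right-skewed sizes (not the spam data)
d <- data.frame(
  spam    = rep(c("ham", "spam"), each = 200),
  size.kb = c(rexp(200, rate = 1/5), rexp(200, rate = 1/20))
)

# Same plot as above, with a log10 y scale so the long right
# tail no longer squashes the bulk of the observations together
ggplot(d, aes(x = spam, y = size.kb, colour = spam)) +
  geom_quasirandom() +
  scale_y_log10() +
  scale_color_brewer("", palette = "Dark2") +
  coord_flip() +
  theme(legend.position = "none")
```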

  1. For the following data, how would you construct a new single variable which would capture the difference between the two classes using a linear model?
olive <- read_csv("http://ggobi.org/book/data/olive.csv") |>
  dplyr::filter(region != 1) |>
  dplyr::select(region, arachidic, linoleic) |>
  mutate(region = factor(region))
ggplot(olive, aes(x=linoleic, 
                  y=arachidic, 
                  colour=region)) +
  geom_point() +
  scale_color_brewer("", palette = "Dark2") + 
   theme(legend.position="none", 
        aspect.ratio=1)
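One approach is to fit a linear classifier and keep its linear predictor as the single new variable, since it is exactly a linear combination of the two measurements. This sketch uses synthetic two-class data with a rough shape like the plot above (the means and spreads are made up, not taken from the olive data):

```r
set.seed(5250)
# Synthetic stand-in: two classes separated along a linear combination
n <- 100
region    <- factor(rep(c("2", "3"), each = n))
linoleic  <- c(rnorm(n, 900, 80), rnorm(n, 750, 80))
arachidic <- c(rnorm(n,  60, 15), rnorm(n,  70, 15))
d <- data.frame(region, linoleic, arachidic)

# Logistic regression gives coefficients b1, b2; the linear predictor
# b0 + b1*linoleic + b2*arachidic is the single new feature
fit <- glm(region ~ linoleic + arachidic, data = d, family = binomial)
d$new_var <- predict(fit, type = "link")

# The one combined variable separates the classes
by(d$new_var, d$region, summary)
```

With the real olive data you would fit the same model to `region`, `linoleic` and `arachidic`, and use the fitted linear combination as the engineered feature.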

7. Discuss with your neighbour, what you found the most difficult part of last week’s content. Find some material (from resources or googling) together that gives alternative explanations that make it clearer.

đź‘‹ Finishing up

Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.