ETC3250/5250 Assignment 1

Author

Prof. Di Cook

Published

March 8, 2024

🏆 Goal

This assignment will assess your understanding of these topics:

  • basic math and calculations for ML
  • ML concepts
  • visualisation
  • dimension reduction

🔑 Instructions

  • This is an open book assignment, and you are allowed to use any resources that you find helpful. However, every resource used needs to be appropriately cited. You can add the citations to the end of the report; the particular style is not important. Lack of citation of resources will incur up to a 50% reduction in the final score.
  • You are encouraged to use Generative AI, so that you become accustomed to where it is helpful and where it is problematic on topics related to machine learning. You are expected to include the full script of your conversation at the end of your report.
  • This is an individual assignment. You are expected to complete it on your own, which means that only the tutors or instructors can be consulted. You are not permitted to discuss the questions or answers with other people, including students in this unit, or to post questions to help sites. You can either send a message to the class email address or send a private message to the teaching team on the discussion forum ED.
  • You need to follow the rules detailed in the Maintain academic integrity information for students. If you are concerned about possible breaches, you can report these to the chief examiner.
  • For any reason, but especially if there is suspicion of violation of academic integrity, the chief examiner can request that you attend an oral exam to explain any of your answers, or to answer related questions on the assignment. Your score will be adjusted based on answers provided during the oral exam.
  • The assignment needs to be turned in to Moodle as (1) quarto (.qmd) and (2) html. That is, two files need to be submitted, ideally zipped into a single file. No other formats will be marked. It is expected that knitting the qmd will produce the html file submitted. If the qmd file does not knit, then the score for the assignment will be reduced by 25%.
  • R code should be hidden in the final report, unless it is specifically requested.
  • A skeleton assignment file zip is provided to get you started and to help you understand what to turn in.

🏃🏿‍♀️🏃🏽 Exercises

1. Basic math and computing (5pts)

For the following matrices:

\[S = \left[ \begin{array}{cc} 3 & 1 \\ 1 & 2 \end{array} \right] \]

\[ X = \left[ \begin{array}{c} 4 \\ 2 \end{array} \right] \]
and

\[ A = \left[ \begin{array}{c} 1 \\ -1 \end{array} \right] \]

compute the quantity \((X-A)^\top S^{-1} (X-A)\) both algebraically (by hand) and numerically (using R). Be sure to show:

  • the steps of your calculations.
  • your code to compute it numerically.
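As a rough illustration of how a quadratic form of this kind can be evaluated in R, here is a minimal sketch. The values of `S`, `x` and `a` below are hypothetical toy values chosen for illustration, deliberately different from the matrices in the question. It shows three equivalent routes using base R only: explicit inversion with `solve()`, solving the linear system without forming the inverse, and `mahalanobis()`, which returns this same squared distance.

```r
# Hypothetical toy values (NOT the matrices from the question)
S <- matrix(c(2, 1,
              1, 2), nrow = 2, byrow = TRUE)
x <- c(3, 3)
a <- c(2, 1)

d <- x - a  # difference vector (x - a)

# Route 1: explicit inverse, then the matrix products
q_direct <- drop(t(d) %*% solve(S) %*% d)

# Route 2: solve S z = d for z, then take d'z (avoids forming the inverse)
q_solve <- drop(crossprod(d, solve(S, d)))

# Route 3: stats::mahalanobis() returns the squared distance (x-a)' S^{-1} (x-a)
q_maha <- mahalanobis(x, center = a, cov = S)

c(direct = q_direct, solve = q_solve, mahalanobis = q_maha)
```

By hand for these toy values, \(S^{-1} = \frac{1}{3}\left[\begin{array}{cc} 2 & -1 \\ -1 & 2 \end{array}\right]\) and \(x - a = (1, 2)^\top\), so the quadratic form equals 2, which is what all three routes return.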

2. ML concepts (8pts)

Consider the data in assign01_pred.csv:

library(readr)  # provides read_csv()
library(dplyr)  # provides slice_head()
d_pred <- read_csv("https://raw.githubusercontent.com/numbats/iml/master/data/pred_data.csv")
d_pred |> slice_head(n=3)
# A tibble: 3 × 3
  true   adelie chinstrap
  <chr>   <dbl>     <dbl>
1 Adelie   1.00  0.000154
2 Adelie   1.00  0.000236
3 Adelie   1.00  0.000208

The variable y is the true class, pred1 contains the class predictions for model 1, and pred2 contains the class predictions for model 2. The columns bilby1 and quokka1 are predictive probabilities for model 1, and similarly bilby2 and quokka2 are predictive probabilities for model 2.

  1. Compute the accuracy and balanced accuracy for each model.
  2. The class predictions were made by using 0.5 and above as the value at which to predict the observation to be a bilby. Compute sensitivity and 1-specificity if (i) 0.3 and (ii) 0.4 were used instead of 0.5.
  3. Make the ROC curves for both models, where bilby is considered the positive class, and explain which is the better of the two.
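The quantities named in the list above can be sketched in base R as follows. This is only an illustration under stated assumptions: the vectors `y` and `p_bilby` are hypothetical toy data (not the assignment data), `class_metrics()` is a helper written for this sketch, and "bilby" is treated as the positive class. For two classes, balanced accuracy is the mean of sensitivity and specificity.

```r
# Hypothetical toy data (NOT the assignment data): true class and the
# predicted probability of the positive class "bilby"
y <- c("bilby", "bilby", "bilby",
       "quokka", "quokka", "quokka", "quokka", "quokka")
p_bilby <- c(0.90, 0.60, 0.35, 0.45, 0.20, 0.10, 0.05, 0.32)

# Hypothetical helper: confusion-matrix summaries at a given threshold
class_metrics <- function(y, p, threshold = 0.5, positive = "bilby") {
  pred_pos <- p >= threshold   # predicted positive at or above the threshold
  is_pos <- y == positive
  tp <- sum(pred_pos & is_pos)
  fn <- sum(!pred_pos & is_pos)
  fp <- sum(pred_pos & !is_pos)
  tn <- sum(!pred_pos & !is_pos)
  sens <- tp / (tp + fn)
  spec <- tn / (tn + fp)
  c(accuracy = (tp + tn) / length(y),
    balanced_accuracy = (sens + spec) / 2,
    sensitivity = sens,
    one_minus_specificity = 1 - spec)
}

m50 <- class_metrics(y, p_bilby, threshold = 0.5)
m30 <- class_metrics(y, p_bilby, threshold = 0.3)

# ROC points: sweep the threshold from above 1 (nothing predicted positive)
# down to 0 (everything predicted positive)
thresholds <- sort(unique(c(0, p_bilby, 1.01)), decreasing = TRUE)
roc <- t(sapply(thresholds, function(th) {
  class_metrics(y, p_bilby, th)[c("one_minus_specificity", "sensitivity")]
}))

plot(roc, type = "s", xlab = "1 - specificity", ylab = "Sensitivity")
abline(0, 1, lty = 2)  # reference line for a random classifier
```

Lowering the threshold can only move observations from predicted negative to predicted positive, so sensitivity and 1-specificity both stay the same or increase; this is why the ROC curve runs from (0, 0) to (1, 1).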

3. Visualisation (8pts)

In the mulgar package, there is a data set called c7. It has 6 variables. Your job is to explore the data structure and report as accurately as possible. Particularly comment on:

  • number and shape of clusters
  • dimensionality/linear relationships
  • outliers

You will want to use a scatterplot matrix, a tour, and a 2D view provided by UMAP, to view different aspects of the data.
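The three kinds of view mentioned above might be set up along the following lines. This is a sketch on simulated data, not on c7: the matrix `toy` is hypothetical, the scatterplot matrix uses base R `pairs()`, and the tour and UMAP steps assume the tourr and uwot packages are installed (they are skipped otherwise). The parameter values passed to `uwot::umap()` are arbitrary choices for illustration.

```r
set.seed(2024)

# Hypothetical toy data: two Gaussian clusters in 4D (NOT the c7 data)
n <- 100
toy <- rbind(
  matrix(rnorm(n * 4, mean = 0), ncol = 4),
  matrix(rnorm(n * 4, mean = 4), ncol = 4)
)
colnames(toy) <- paste0("x", 1:4)

# 1. Scatterplot matrix: all pairwise 2D views of the variables
pairs(toy, pch = 16, cex = 0.5)

# 2. Grand tour: animated sequence of 2D projections
#    (assumes tourr is installed; only run in an interactive session)
if (interactive() && requireNamespace("tourr", quietly = TRUE)) {
  tourr::animate_xy(toy)
}

# 3. UMAP: a single non-linear 2D embedding (assumes uwot is installed)
umap_2d <- NULL
if (requireNamespace("uwot", quietly = TRUE)) {
  umap_2d <- uwot::umap(toy, n_neighbors = 15, min_dist = 0.1)
  plot(umap_2d, xlab = "UMAP 1", ylab = "UMAP 2", pch = 16)
}
```

The views complement each other: the scatterplot matrix and the tour show linear projections, so shapes and linear relationships are preserved, whereas UMAP distorts distances and is best read only for which points group together.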

4. Dimension reduction (15pts)

The cricketdata package allows you to download player statistics from matches across the globe. The statistics for Australian women have been collected and made available in the file auswt20.csv. Your task is to conduct a principal component analysis of this data, and explain what is learned. Particularly, your answer needs these components:

  • Summary of the PCA
  • Biplot of the first two PCs, and an explanation of the structure and variable contributions
  • Decision on appropriate number of PCs to use, as supported by proportion of total variance, and a scree plot.
  • Interpretations of the PCs
  • Discussion of the main patterns discovered by PCA, including notes about particular players.
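The components listed above map onto a handful of base R calls. The following is a minimal sketch that uses the built-in `USArrests` data purely as a stand-in (it is not the cricket data), and it assumes that standardising the variables with `scale. = TRUE` is appropriate, which is a common choice when variables are measured on different scales.

```r
# USArrests is a built-in R data set, used here only as a stand-in
pca <- prcomp(USArrests, scale. = TRUE)  # PCA on standardised variables

# Summary: standard deviation, proportion and cumulative proportion of variance
summary(pca)

# Proportion of total variance explained by each PC, computed by hand
var_prop <- pca$sdev^2 / sum(pca$sdev^2)
cumsum(var_prop)

# Scree plot: look for the "elbow" where the variances level off
screeplot(pca, type = "lines", main = "Scree plot")

# Biplot of the first two PCs: points are observations, arrows are variables
biplot(pca, cex = 0.6)

# Loadings (coefficients) of the first two PCs, used for interpretation
pca$rotation[, 1:2]
```

The sign of each PC is arbitrary, so interpretations should rest on which variables have large coefficients and whether they contrast with each other, not on the direction of the axis.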

If you are unfamiliar with women's cricket, the Wikipedia page might be good to read.

You can read the data with:

library(readr)  # provides read_csv()
auswt20 <- read_csv("https://raw.githubusercontent.com/numbats/iml/master/data/auswt20.csv")

⚖️ Marking guide

  • Total: 36pts (scaled back to 9pts)
  • Answers should be written in complete sentences, when explanations are required.
  • Correct answers will score full points. Partial credit will be given where possible.
  • Readability is important, and up to 4 points will be deducted for spelling errors and poor organisation.
  • Deductions apply for lack of reproducibility, lack of citations, and lack of supporting material.