ETC3250/5250 Assignment 2
🏆 Goal
This assignment will assess your understanding of these topics:
- re-sampling methods
- simple classification models
and assumes knowledge of content in the first three weeks.
🔑 Instructions
- This is an open book assignment, and you are allowed to use any resources that you find helpful. However, every resource used needs to be appropriately cited. You can add the citations to the end of the report; the particular style is not important. Lack of citation of resources will incur up to a 50% reduction in your final score.
- You are encouraged to use Generative AI, so that you become accustomed to where it is helpful and where it is problematic on topics related to machine learning. You are expected to include the full script of your conversation at the end of your report.
- This is an individual assignment. You are expected to complete this assignment individually, which means that only the tutors or instructors can be consulted. You are not permitted to discuss the questions or answers with other people, including students in this unit, or to post questions to help sites. You can either send a message to the class email address or send a private message to the teaching team on the discussion forum ED.
- You need to follow the rules detailed at Maintain academic integrity information for students. If you are concerned about possible breaches, you can report these to the chief examiner.
- For any reason, but especially if there is suspicion of violation of academic integrity, the chief examiner can request that you attend an oral exam to explain any of your answers, or to answer related questions on the assignment. Your score will be adjusted based on answers provided during the oral exam.
- The assignment needs to be turned in as (1) quarto (.qmd), and (2) as html, to Moodle. That is, two files need to be submitted, ideally zipped into a single file. No other formats will be marked. It is expected that knitting the qmd will produce the html file submitted. If the qmd file does not knit, then the score for the assignment will be reduced by 25%.
- R code should be hidden in the final report, unless it is specifically requested.
- A skeleton assignment file zip is provided to get you started and to help you understand what to turn in.
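As an illustration of the code-hiding instruction above, a Quarto document header along the following lines hides R code by default. This is a minimal sketch: the title is a placeholder, and the skeleton file provided may already set these options differently.

```yaml
---
title: "ETC3250/5250 Assignment 2"   # placeholder title, an assumption
format: html
execute:
  echo: false   # hide R code in the rendered report by default
---
```

Where a question asks you to show your code, adding the chunk option `#| echo: true` at the top of that chunk overrides the default for that chunk only.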
🏃🏿♀️🏃🏽 Exercises
1. Bootstrapping your way to provide evidence (8pts)
On the women's cricket data set, using just the variables NotOuts:FiveWickets (columns 6 to 22), compute the PCA, and use the bootstrap to determine which statistics primarily contribute to PC1 and to PC2.
- Include numerical and visual summaries of your results.
- Write a paragraph with a simple descriptive interpretation of these two PCs.
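One way to set up the bootstrap for this exercise, offered as a sketch under stated assumptions rather than the required method: draw \(B\) samples of the \(n\) players with replacement, recompute the PCA on each, and record the coefficient \(v^{*(b)}_{jk}\) of statistic \(j\) on PC \(k\). The symbols \(B\), \(v\), \(q_\alpha\) and the 95% level are illustrative choices, not taken from the assignment. Because the sign of a principal component is arbitrary, each bootstrap vector is first aligned with the full-data vector \(\hat{v}_k\):
\[ v^{*(b)}_k \leftarrow \text{sign}\left( v^{*(b)\top}_k \hat{v}_k \right) v^{*(b)}_k, \qquad b = 1, \dots, B \]
\[ \text{CI}_{jk} = \left[ q_{0.025}\left(v^{*(1)}_{jk}, \dots, v^{*(B)}_{jk}\right),~ q_{0.975}\left(v^{*(1)}_{jk}, \dots, v^{*(B)}_{jk}\right) \right] \]
A common reference value is \(\pm 1/\sqrt{p}\), the size each coefficient would have if all \(p\) variables contributed equally (here \(p=17\)). Statistics whose intervals lie beyond that value, or at least exclude 0, are candidates for primary contributors.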
2. How is your thinking about simple classifiers? (10pts)
- For each of the steps on slide 7 of the week 4 lecture slides that take
\[y = \frac{e^{\beta_0+\beta_1x}}{1+e^{\beta_0+\beta_1x}}\] to get
\[\log_e\frac{y}{1 - y} = \beta_0+\beta_1x\]
explain what change is made to get to the next line.
If \(\beta_0=0.5, \beta_1=-2\), and an observation has the value \(x=-1\), what would the logistic regression class prediction be?
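As a reminder of how a fitted probability becomes a class label, and assuming the usual 0.5 threshold (an assumption here; the lecture slides may state the rule differently), the prediction can be written in terms of the linear predictor:
\[ y = \frac{e^{\beta_0+\beta_1x}}{1+e^{\beta_0+\beta_1x}} > 0.5 \iff e^{\beta_0+\beta_1x} > 1 \iff \beta_0+\beta_1x > 0 \]
so under this threshold the class coded as 1 is predicted exactly when the linear predictor is positive.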
Now look at the equations for linear discriminant analysis. For these sample statistics:
\[ S = \left( \begin{array}{cc} 2 & 1 \\ 1 & 2 \end{array} \right)~~~ \bar{x}_A = \left( \begin{array}{r} -2\\ 2 \end{array} \right)~~~ \bar{x}_B = \left( \begin{array}{r} 2\\ 2 \end{array} \right) \] where \(S\) is the pooled variance-covariance matrix, predict the class of \(x_0=\left( \begin{array}{c} -3\\ -2\end{array} \right)\).
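For reference, one standard form of the two-class linear discriminant rule is sketched below. It assumes equal prior probabilities for classes A and B, which is an assumption here because the class sizes are not given; check it against the form used in the lecture slides.
\[ \text{assign } x_0 \text{ to class A if}~~~ (\bar{x}_A-\bar{x}_B)^\top S^{-1}\left(x_0 - \frac{\bar{x}_A+\bar{x}_B}{2}\right) \ge 0, ~~~\text{and to class B otherwise.} \]
The rule compares \(x_0\) with the midpoint of the two class means, after rescaling by the pooled variance-covariance matrix.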
- For the following data, without using logistic regression or linear discriminant analysis, come up with a rule that would separate these two classes. Be sure to explain your thinking.
3. How well can you build a simple classifier? (18pts)
Can you write a classifier to distinguish between financial time series and audio tracks of birds?
The data to use is in the file finance_and_birds.csv. This contains records for 974 time series, with the four variables trend, linearity, entropy and x_acf1, computed using features of the series, and the class variable type. You can read more about the way features are calculated at http://pkg.robjhyndman.com/tsfeatures/.
Your task is to use both linear discriminant analysis and logistic regression to arrive at the best model for predicting the type of series. The expected components of your solution are:
- Visual summaries to understand whether and how the two classes are distinguishable, and whether assumptions hold.
- Breaking the data into training and test samples, appropriately. Show your code.
- Fit both types of models, even if the assumptions don’t hold.
- Summarise the fits of both models, and your fit statistics. Include what you learn about variable importance from these summaries, and also confusion matrices.
- Diagnose the models. Where are they making mistakes? Are they making mistakes for the same observations?
- Make an ROC curve to decide which of the two is the better model.
- Write a short paragraph describing how time series for financial data and birdsongs typically differ.
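For the ROC curve step in the list above, the plotted quantities can be summarised as follows. This is a sketch in generic notation, where the threshold \(t\) and the counts \(TP, FP, TN, FN\) are symbols introduced here for illustration. At each threshold \(t\) on the predicted probability, the confusion matrix gives
\[ \text{TPR}(t) = \frac{TP(t)}{TP(t)+FN(t)}, \qquad \text{FPR}(t) = \frac{FP(t)}{FP(t)+TN(t)} \]
and the ROC curve traces \(\text{TPR}(t)\) against \(\text{FPR}(t)\) as \(t\) varies from 1 down to 0. A curve that sits closer to the top-left corner generally indicates the better classifier.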
⚖️ Marking guide
- Total: 36pts (scaled back to 9pts)
- Answers should be written in complete sentences, when explanations are required.
- Correct answers will score full points. Partial credit will be given where possible.
- Readability is important, and up to 4 points will be deducted for spelling errors and poor organisation.
- Deductions apply for lack of reproducibility, lack of citations, or lack of supporting material.