ETC3250/5250 Assignment 3
🏆 Goal
This assignment will assess your understanding of these topics:
- tree and forest models
- tuning a model
- model choice
and assumes knowledge of content in the first seven weeks.
🔑 Instructions
- This is an open book assignment, and you are allowed to use any resources that you find helpful. However, every resource used needs to be appropriately cited. You can add the citations to the end of the report; the particular style is not important. Lack of citation of resources will incur a reduction of up to 50% in the final score.
- You are encouraged to use Generative AI, so that you become accustomed to where it is helpful and where it is problematic on topics related to machine learning. You are expected to include the full script of your conversation at the end of your report.
- This is an individual assignment. You are expected to complete this assignment individually, which means that only tutors or instructors can be consulted. You are not permitted to discuss the questions or answers with other people, including students in this unit, or to post questions to help sites. You can either send a message to the class email address or send a private message to the teaching team on the discussion forum ED.
- You need to follow the rules detailed at Maintain academic integrity information for students. If you are concerned about any violations, you can report these to the chief examiner.
- For any reason, but especially if there is suspicion of violation of academic integrity, the chief examiner can request that you attend an oral exam to explain any of your answers, or to answer related questions on the assignment. Your score will be adjusted based on answers provided during the oral exam.
- The assignment needs to be turned in to Moodle as (1) quarto (.qmd), (2) html, and any other supporting files, such as images and css. The collection of files needs to be zipped into a single file for submission. No other formats will be marked. It is expected that knitting the qmd will produce the html file submitted. If the qmd file does not knit, then the score for the assignment will be reduced by 25%.
- R code should be hidden in the final report, unless it is specifically requested.
- Conciseness is important. Overly long answers containing irrelevant information may result in a reduced score.
- A zip file containing a skeleton assignment is provided to get you started and to help you understand what to turn in.
🏃🏿♀️🏃🏽 Exercises
1. Basics of trees and forests (9pts)
- For the following tree, predict the class of this observation: x1=1.53, x2=1.96, x3=1.36, x4=-0.346.
n= 192

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 192 100 B (0.224 0.458 0.318)
   2) x3< 1.3 126 48 B (0.341 0.619 0.040)
     4) x3< 0.19 61 18 A (0.705 0.295 0.000)
       8) x1>=-0.0071 48 5 A (0.896 0.104 0.000) *
       9) x1< -0.0071 13 0 B (0.000 1.000 0.000) *
     5) x3>=0.19 65 5 B (0.000 0.923 0.077) *
   3) x3>=1.3 66 10 C (0.000 0.152 0.848)
     6) x1>=1.2 9 1 B (0.000 0.889 0.111) *
     7) x1< 1.2 57 2 C (0.000 0.035 0.965) *
- For this forest of trees, predict the three observations below on each tree and then, using majority rule, give the final prediction for each.
       x1       x2       x3       x4
1  1.53     1.96     1.36    -0.346
2  1.12     0.741    1.47     0.926
3  0.0899  -0.139   -0.0951   1.87
- Of the three observations, which was the model most uncertain about? Explain your reasoning. (Note that these three observations were out-of-bag for each of the tree models.)
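As a reminder of how majority rule works, here is a minimal sketch in base R. The vote matrix is entirely made up (five hypothetical trees, three hypothetical observations) and is not the forest from this question; the names votes, vote_prop and winning_share are illustrative only. It tabulates the share of trees voting for each class, takes the class with the largest share as the prediction, and treats a small winning share as one simple indication of uncertainty.

```r
# Hypothetical class votes (rows = observations, columns = trees).
# These values are invented for illustration, not taken from the assignment.
votes <- matrix(
  c("B", "C", "C", "B", "C",
    "C", "C", "C", "C", "C",
    "A", "A", "B", "A", "A"),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("obs", 1:3), paste0("tree", 1:5))
)

classes <- c("A", "B", "C")

# Share of trees voting for each class, per observation
# (result has one row per observation and one column per class)
vote_prop <- sapply(classes, function(k) rowMeans(votes == k))

# Majority-rule prediction: the class with the largest share of votes
majority <- classes[max.col(vote_prop, ties.method = "first")]
names(majority) <- rownames(votes)

# A simple uncertainty summary: the smaller the winning share,
# the less the trees agree on that observation
winning_share <- apply(vote_prop, 1, max)

print(vote_prop)
print(majority)
print(winning_share)
```

Other summaries of the vote proportions (for example, the gap between the top two classes) could be used in the same way; the winning share is just one reasonable choice.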
2. Tuning a model (12pts)
This question uses the same data as Assignment 2 Q3 and the same training/test split.
Reminder: The data is in the file finance_and_birds.csv. This contains records for 974 time series, with four variables (trend, linearity, entropy, x_acf1) computed using features of the series, and the class variable type. You can read more about the way features are calculated at http://pkg.robjhyndman.com/tsfeatures/.
- Fit a default tree to the training data, using the rpart package. Report and summarise the tree fit, and summarise the fit using the test set. (Be sure to use the tidymodels style of coding.)
- Using the capabilities in tidymodels, tune the tree on the parameters tree_depth, min_n and cost_complexity. Include your code, summarise the results, and report the parameters that will lead to the best model.
- Fit this best model to the training data. Assess and summarise the fit as done in part a. Write a sentence or two on whether and how this improved the model.
3. Which is the better classifier? (15pts)
Using the same data as Assignment 2 Q3 and the same training/test split, build a random forest fit.
- Fit, summarise and assess a boosted tree model.
- Fit, summarise and assess a random forest model.
- Make an ROC curve to help decide which of the five models fitted is the best model.
- Write a short paragraph describing your choice of best model, and what you have learned about how the time series for financial data and birdsongs typically differ.
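One way to overlay ROC curves for several models with yardstick is sketched below. Everything here is simulated for illustration: the class labels pos and neg, the model names model_1 and model_2, and the helper sim_probs are all hypothetical, and the predicted probabilities are random draws rather than output from any fitted model. With real fits you would instead stack the test-set predictions from each model, with a column identifying the model.

```r
library(tidymodels)

set.seed(2)
n <- 200

# Simulated two-class truth; the first factor level is treated as the event
truth <- factor(sample(c("pos", "neg"), n, replace = TRUE),
                levels = c("pos", "neg"))

# Hypothetical helper: simulated probabilities of "pos";
# larger sharpness separates the two classes more cleanly
sim_probs <- function(sharpness) {
  plogis(rnorm(n, mean = ifelse(truth == "pos", sharpness, -sharpness)))
}

# Stack predictions from the (simulated) models, one row per prediction
preds <- bind_rows(
  tibble(model = "model_1", truth = truth, .pred_pos = sim_probs(0.5)),
  tibble(model = "model_2", truth = truth, .pred_pos = sim_probs(1.5))
)

# One ROC curve and one AUC value per model
roc_df <- preds |> group_by(model) |> roc_curve(truth, .pred_pos)
auc_df <- preds |> group_by(model) |> roc_auc(truth, .pred_pos)

print(auc_df)

# Overlaid ROC curves, one per model
roc_plot <- autoplot(roc_df)
print(roc_plot)
```

A curve closer to the top-left corner, or a larger area under the curve, indicates better separation of the two classes on the data used to compute it.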
⚖️ Marking guide
- Total: 36pts (scaled back to 9pts)
- Answers should be written in complete sentences when explanations are required.
- Correct answers will score full points. Partial credit will be given where possible.
- Readability is important, and up to 4 points will be deducted for spelling errors and poor organisation.
- Deductions apply for lack of reproducibility, lack of citations, and lack of supporting material.