## Instructions

• You need to turn into moodle, the two files, your Rmd and html files.
• List your team members as authors on the report. For each of the four assignments, one team member needs to be nominated as the leader, and is responsible for coordinating the efforts of other team members, and submitting the assignment.
• The expected mode of working is that you individually complete the assignment, and then compare your answers and explanations with your team mates. Each student will have the opportunity to report on other team member’s efforts on the assignment, and also report on their own lack of contribution. If a member does not substantially contribute to the team submission they may get a reduced mark. If a team member fails to report their lack of contribution they will receive a zero mark.
• R code should be hidden in the final report, unless it is specifically requested.
• Original work is expected. Any material used from external sources needs to be acknowledged and cited appropriately.
• To make it a little easier for you, a skeleton of R code is provided in the Rmd file. Where you see ??? means that something is missing and you will need to fill it in with the appropriate function, argument or operator. You will also need to rearrange the code as necessary to do the calculations needed.

## Marks

• Total mark will be out or 25
• 5 points will be reserved for readability, and appropriate citing of external sources
• 5 points will be reserved for reproducibility, that the report can be re-generated from the submitted Rmarkdown.
• Accuracy and completeness of answers, and clarity of explanations will be the basis for the remaining 15 points.

## Exercises

1. This question explores bias-variance trade-off. Read in the simulated data cuddly_koalas.rds. This data is generated using the following function:

$y = -4x + 6x^2 - 100sin(x) + \varepsilon, ~~\text{where}~~x\in [-10, 20], ~~\varepsilon\sim N(0, 50^2)$

1. (1)Make a plot of the data, overlaying the true model.

2. (1)Break the data into a $$2/3$$ training and a $$1/3$$ test set. (Hint: You can use the function createDataPartition from the caret package.) Fit a linear model, using the training set. Compute the training MSE and test MSE. Overlay the linear model fit on a plot of the data and true model.

3. Now examine the behaviour of the training and test MSE, for a loess fit.

1. (1)Look up the loess model fit, and write a paragraph explaining how this fitting procedure works. In particular, explain what the span argument does. Add a (hand) sketch illustrating the method.

2. (1)Compute the training and test MSE for a range of span values, 2, 1, 0.5, 0.3, 0.2, 0.1, 0.05. Plot the training and test MSE against the span parameter. For each model, also make a plot of the data and fitted model. Include just the plot of the fit of the model that you think best captures the relationship between x and y.)

3. (2)Write a paragraph explaining the effect of increasing the flexibility of the fit has on the training and test MSE. Indicate what you think is the optimal span value for this data. Make a plot of this optimal fit.

4. (2)Make a sketch indicating observed data, the true model, fitted model, and indicate what the bias, variance and MSE refer to. Remember that to understand bias and variance, you need to think about taking multiple (and actually all possible) samples. Your illustration would have predictor ($$x$$) on the horizontal axis and response on the vertical axis. Represent and observed value with a dot, and use curves for fitted models and the true model.

1. The current COVID-19 health crisis worries us all. John Hopkins University has been carefully documenting incidence, recoveries and deaths around the globe at https://github.com/CSSEGISandData/COVID-19. Read the incidence data from https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv, into R.
1. (2)The data shows cumulative counts by date for many countries. Extract the data for Australia. It is currently multiple rows corresponding to counts in different states. Pivot the data into long tidy form, and convert the text date into a date variable. Difference the days, so that you have the incidence for each day. Make a bar chart of incidence by date. Add a loess smooth to the plot.

2. (3)Fit an appropriate linear model, using glm to the data. (Hint: ) Make a summary of the model fit, write down the model equation and a plot of the data with the model overlaid. Compute the ratio of the deviance relative to the null deviance. What does this say about the model fit? Is it a good summary of the variation in counts?

3. (1)Would the glm model be considered a flexible or inflexible model?

4. (1)Use your model to predict the count for Apr 6.