Instructions

Marks

Exercises

  1. This question explores the bias-variance trade-off. Read in the simulated data cuddly_koalas.rds. The data were generated from the following model:

\[ y = -4x + 6x^2 - 100\sin(x) + \varepsilon, \quad \text{where } x \in [-10, 20], \quad \varepsilon \sim N(0, 50^2) \]
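To make the model above concrete, here is a minimal sketch that simulates data of the same form and overlays the true curve. It does not reproduce cuddly_koalas.rds: the sample size `n`, the seed, and the uniform distribution of \(x\) on \([-10, 20]\) are assumptions made only for illustration, and `true_f` and `sim` are hypothetical names.

```r
set.seed(2020)  # arbitrary seed (assumption), only for reproducibility
n <- 200        # hypothetical sample size; the real file may differ

# Mean function of the stated model, without the noise term
true_f <- function(x) -4 * x + 6 * x^2 - 100 * sin(x)

x <- runif(n, min = -10, max = 20)            # assumes x is uniform on [-10, 20]
y <- true_f(x) + rnorm(n, mean = 0, sd = 50)  # sd = 50, so the variance is 50^2
sim <- data.frame(x = x, y = y)

# Scatterplot of the simulated points with the true curve overlaid
plot(sim$x, sim$y, pch = 16, col = "grey50", xlab = "x", ylab = "y")
curve(true_f, from = -10, to = 20, add = TRUE, col = "red", lwd = 2)
```

The same `plot()` and `curve()` calls should work on the supplied data once it has been read in with `readRDS()`, provided its columns are named `x` and `y` (also an assumption).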

  1. (1) Make a plot of the data, overlaying the true model.

  2. (1) Break the data into a \(2/3\) training and a \(1/3\) test set. (Hint: You can use the function createDataPartition from the caret package.) Fit a linear model, using the training set. Compute the training MSE and test MSE. Overlay the linear model fit on a plot of the data and true model.

  3. Now examine the behaviour of the training and test MSE, for a loess fit.

    1. (1) Look up the loess model fit, and write a paragraph explaining how this fitting procedure works. In particular, explain what the span argument does. Add a (hand) sketch illustrating the method.

    2. (1) Compute the training and test MSE for a range of span values: 2, 1, 0.5, 0.3, 0.2, 0.1, 0.05. Plot the training and test MSE against the span parameter. For each model, also make a plot of the data and fitted model. Include just the plot of the fit of the model that you think best captures the relationship between x and y.

    3. (2) Write a paragraph explaining the effect that increasing the flexibility of the fit has on the training and test MSE. Indicate what you think is the optimal span value for this data. Make a plot of this optimal fit.

  4. (2) Make a sketch indicating observed data, the true model, fitted model, and indicate what the bias, variance and MSE refer to. Remember that to understand bias and variance, you need to think about taking multiple (and actually all possible) samples. Your illustration would have predictor (\(x\)) on the horizontal axis and response on the vertical axis. Represent an observed value with a dot, and use curves for fitted models and the true model.
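The mechanics of the training/test comparison in this question can be sketched as follows. This is an illustration under stated assumptions, not a model answer: it runs on freshly simulated data rather than cuddly_koalas.rds, uses base R `sample()` for the split instead of caret's createDataPartition to stay dependency-free, and the names `d`, `mse` and `res` are hypothetical.

```r
set.seed(1)  # arbitrary seed (assumption)
true_f <- function(x) -4 * x + 6 * x^2 - 100 * sin(x)

# Simulated stand-in for the assignment data (n = 600 is an assumption)
n <- 600
d <- data.frame(x = runif(n, -10, 20))
d$y <- true_f(d$x) + rnorm(n, sd = 50)

# 2/3 training, 1/3 test split using base R
train_idx <- sample(n, size = round(2 / 3 * n))
train <- d[train_idx, ]
test <- d[-train_idx, ]

# Mean squared error; na.rm = TRUE because loess does not extrapolate by
# default, so test points outside the training x range are predicted as NA
mse <- function(obs, pred) mean((obs - pred)^2, na.rm = TRUE)

# Inflexible baseline: straight-line fit
fit_lm <- lm(y ~ x, data = train)
lm_train_mse <- mse(train$y, predict(fit_lm))
lm_test_mse <- mse(test$y, predict(fit_lm, newdata = test))

# Loess fits over the span values listed in the question
spans <- c(2, 1, 0.5, 0.3, 0.2, 0.1, 0.05)
res <- data.frame(span = spans, train_mse = NA_real_, test_mse = NA_real_)
for (i in seq_along(spans)) {
  fit <- loess(y ~ x, data = train, span = spans[i])
  res$train_mse[i] <- mse(train$y, predict(fit))
  res$test_mse[i] <- mse(test$y, predict(fit, newdata = test))
}
print(res)

# Training (solid) and test (dashed) MSE against span, on a log x axis
matplot(res$span, res[, c("train_mse", "test_mse")], type = "b", log = "x",
        pch = 1:2, lty = 1:2, col = 1:2, xlab = "span", ylab = "MSE")
```

Smaller span values give a more flexible fit, so reading the plot from right to left corresponds to increasing flexibility.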

  1. The current COVID-19 health crisis worries us all. Johns Hopkins University has been carefully documenting incidence, recoveries and deaths around the globe at https://github.com/CSSEGISandData/COVID-19. Read the incidence data from https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv into R.

  1. (2) The data show cumulative counts by date for many countries. Extract the data for Australia. It currently spans multiple rows, corresponding to counts in different states. Pivot the data into long tidy form, and convert the text date into a date variable. Difference the cumulative counts, so that you have the incidence for each day. Make a bar chart of incidence by date. Add a loess smooth to the plot.

  2. (3) Fit an appropriate linear model to the data, using glm. (Hint: ) Make a summary of the model fit, write down the model equation, and make a plot of the data with the model overlaid. Compute the ratio of the deviance relative to the null deviance. What does this say about the model fit? Is it a good summary of the variation in counts?

  3. (1) Would the glm model be considered a flexible or inflexible model?

  4. (1) Use your model to predict the count for Apr 6.
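The data-wrangling and modelling steps in this question can be sketched on a toy table. Everything here is an assumption made for illustration: the numbers in `wide` are made up and are not real counts, the column layout (`Province/State`, `Country/Region`, `Lat`, `Long`, then one column per date in month/day/year form) is assumed to mirror the file, and a Poisson family with a log link is just one plausible choice for count data, not necessarily the intended one. Base R is used to avoid dependencies; `tidyr::pivot_longer()` would be the tidyverse route to the long form.

```r
# Toy stand-in for the wide table (made-up numbers, NOT real counts)
wide <- data.frame(
  `Province/State` = c("New South Wales", "Victoria", NA),
  `Country/Region` = c("Australia", "Australia", "Austria"),
  Lat = c(-33.9, -37.8, 47.5),
  Long = c(151.2, 145.0, 14.6),
  `3/1/20` = c(1, 2, 5),
  `3/2/20` = c(3, 2, 9),
  `3/3/20` = c(6, 5, 14),
  `3/4/20` = c(10, 9, 20),
  `3/5/20` = c(17, 14, 30),
  check.names = FALSE  # keep the original column names unmangled
)

# Keep the Australian rows and sum the state rows for each date column
aus <- wide[wide[["Country/Region"]] == "Australia", ]
date_cols <- setdiff(names(aus),
                     c("Province/State", "Country/Region", "Lat", "Long"))
cumulative <- colSums(aus[, date_cols])

# Long form: one row per date, with the text date parsed into a Date
daily <- data.frame(
  date = as.Date(date_cols, format = "%m/%d/%y"),
  cumulative = unname(cumulative)
)

# Daily incidence is the first difference of the cumulative counts;
# the first day has no previous value, so it is left as NA
daily$incidence <- c(NA, diff(daily$cumulative))

# Poisson regression of incidence on days since the first date (assumption);
# glm drops the NA row by default
daily$day <- as.numeric(daily$date - min(daily$date))
fit <- glm(incidence ~ day, family = poisson(link = "log"), data = daily)

# Ratio of residual deviance to null deviance
deviance_ratio <- fit$deviance / fit$null.deviance

# Predicted count for the day after the toy series ends
next_day <- predict(fit, newdata = data.frame(day = 5), type = "response")
```

With the real file, `read.csv(..., check.names = FALSE)` should keep the date column names in the form parsed above, and a prediction for a specific date needs that date converted to the same `day` scale.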