---
title: "ETC3250/5250 Assignment 1"
date: "DUE: Friday Apr 10, 5pm"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE,
warning = FALSE,
message = FALSE,
eval = FALSE)
```
## Instructions
- You need to turn into moodle, the two files, your `Rmd` and `html` files.
- **List your team members as authors** on the report. For each of the four assignments, one team member needs to be nominated as the leader, and is responsible for coordinating the efforts of other team members, and submitting the assignment.
- The expected mode of working is that **you individually complete the assignment**, and then compare your answers and explanations with your team mates. Each student will have the opportunity to report on other team member's efforts on the assignment, and also report on their own lack of contribution. If a member does not substantially contribute to the team submission they may get a reduced mark. If a team member fails to report their lack of contribution they will receive a zero mark.
- R code should be hidden in the final report, unless it is specifically requested.
- Original work is expected. Any material used from external sources needs to be acknowledged and cited appropriately.
- To make it a little easier for you, a skeleton of R code is provided in the `Rmd` file. Where you see `???` means that something is missing and you will need to fill it in with the appropriate function, argument or operator. You will also need to rearrange the code as necessary to do the calculations needed.
## Marks
- Total mark will be out or 25
- 5 points will be reserved for readability, and appropriate citing of external sources
- 5 points will be reserved for reproducibility, that the report can be re-generated from the submitted Rmarkdown.
- Accuracy and completeness of answers, and clarity of explanations will be the basis for the remaining 15 points.
```{r}
# Load libraries
library(caret)
library(broom)
library(tidyverse)
```
## Exercises
1. This question explores bias-variance trade-off. Read in the simulated data `cuddly_koalas.rds`. This data is generated using the following function:
$$ y = -4x + 6x^2 - 100sin(x) + \varepsilon, ~~\text{where}~~x\in [-10, 20], ~~\varepsilon\sim N(0, 50^2)$$
a. (1)Make a plot of the data, overlaying the true model.
```{r}
# Read data
df <- readRDS(???)
# Compute the true model values
df <- df %>% mutate(true=???)
# Plot data and true model
ggplot(df, aes(x=x, y=y)) + geom_???() +
geom_???(aes(y=???), colour="blue")
```
b. (1)Break the data into a $2/3$ training and a $1/3$ test set. (Hint: You can use the function `createDataPartition` from the `caret` package.) Fit a linear model, using the training set. Compute the training MSE and test MSE. Overlay the linear model fit on a plot of the data and true model.
```{r}
# Create training and test sets
set.seed(20200318)
tr_indx <- createDataPartition(df$y, p=???)$Resample1
tr <- df[???,]
ts <- df[-tr_indx,]
# Fit linear model
fit1 <- lm(???, data=tr)
tr_aug <- augment(???, tr)
ts_aug <- augment(???, newdata=ts)
ts_aug$.resid <- ts_aug$y - ???
tr_mse <- sum(???^2)/length(???)
ts_mse <- sum(???^2)/length(???)
# Plot the data, true model and fitted model
ggplot(???, aes(x=x, y=y)) + geom_???() +
geom_???(aes(y=???)) + geom_???(data=???, aes(x=x, y=.fitted), colour="orange")
```
c. Now examine the behaviour of the training and test MSE, for a `loess` fit.
i. (1)Look up the `loess` model fit, and write a paragraph explaining how this fitting procedure works. In particular, explain what the `span` argument does. Add a (hand) sketch illustrating the method.
ii. (1)Compute the training and test MSE for a range of `span` values, 2, 1, 0.5, 0.3, 0.2, 0.1, 0.05. Plot the training and test MSE against the span parameter. For each model, also make a plot of the data and fitted model. Include just the plot of the fit of the model that you think best captures the relationship between x and y.)
```{r}
span <- c(2, 1, ???)
tr_mse2 <- NULL
ts_mse2 <- NULL
# Fit a loess model and compute MSEs
for (i in 1:length(???)) {
fit2 <- loess(???, data=tr, span=span[i])
tr_aug2 <- augment(???, tr)
ts_aug2 <- augment(???, newdata=ts)
ts_aug2$.resid <- ts_aug2$y - ???
trm <- sum(???^2)/length(???)
tsm <- sum(???^2, na.rm=TRUE)/
length(???)
tr_mse2 <- c(tr_mse2, ???)
ts_mse2 <- c(ts_mse2, ???)
}
mse_df <- tibble(span, `train MSE`=???, `test MSE`=???)
mse_df <- mse_df %>%
pivot_longer(cols = ???, names_to = "type", values_to="mse")
ggplot(mse_df, aes(x=???, y=???, colour=???)) +
geom_???() +
geom_???() +
scale_x_reverse() +
ylab("MSE") +
scale_colour_brewer("", palette="Dark2")
```
iii. (2)Write a paragraph explaining the effect of increasing the flexibility of the fit has on the training and test MSE. Indicate what you think is the optimal span value for this data. Make a plot of this optimal fit.
```{r}
fit_all <- loess(???, data=df, span=???)
df_all <- augment(fit_all, ???)
ggplot(df, aes(x=???, y=???)) + geom_???() +
geom_???(data=df_all, aes(x=???, y=???), colour="orange")
```
d. (2)Make a sketch indicating observed data, the true model, fitted model, and indicate what the bias, variance and MSE refer to. Remember that to understand bias and variance, you need to think about taking multiple (and actually all possible) samples. Your illustration would have predictor ($x$) on the horizontal axis and response on the vertical axis. Represent and observed value with a dot, and use curves for fitted models and the true model.
2. The current COVID-19 health crisis worries us all. John Hopkins University has been carefully documenting incidence, recoveries and deaths around the globe at https://github.com/CSSEGISandData/COVID-19. Read the incidence data from https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv, into R.
a. (2)The data shows cumulative counts by date for many countries. Extract the data for Australia. It is currently multiple rows corresponding to counts in different states. Pivot the data into long tidy form, and convert the text date into a date variable. Difference the days, so that you have the incidence for each day. Make a bar chart of incidence by date. Add a loess smooth to the plot.
```{r}
library(tidyverse)
library(lubridate)
library(broom)
library(tsibble)
covid_jh <- ???("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv")
covid_jh_oz <- covid_jh %>%
filter(`Country/Region` == ???) %>%
pivot_longer(cols=???, names_to = "date") %>%
mutate(date = ???(date)) %>%
group_by(???) %>%
summarise(count = ???) %>%
mutate(dif = ???(count))
covid_jh_oz %>%
ggplot(aes(x=???, y=???)) +
geom_???() +
geom_???(se=FALSE) +
ylab("count") + xlab("")
```
b. (3)Fit an appropriate linear model, using `glm` to the data. (Hint: ) Make a summary of the model fit, write down the model equation and a plot of the data with the model overlaid. Compute the ratio of the deviance relative to the null deviance. What does this say about the model fit? Is it a good summary of the variation in counts?
```{r}
covid_jh_oz_lm <- glm(???, data=covid_jh_oz, family=???)
covid_jh_oz_lm
covid_jh_oz <- augment(covid_jh_oz_lm, ???) %>%
mutate(.fitted = ???)
covid_jh_oz %>%
ggplot(aes(x=???, y=???)) +
geom_???() +
geom_???(se=FALSE) +
geom_line(aes(y=???), color="red") +
xlab("") + ylab("daily count")
```
d. (1)Would the `glm` model be considered a flexible or inflexible model?
e. (1)Use your model to predict the count for Apr 6.
```{r}
predict(covid_jh_oz_lm, newdata=???, type=???)
```