---
title: "Lab 9"
subtitle: "Monash University, Econ & Bus Stat, ETC3250/5250"
author: "prepared by Professor Di Cook"
date: "Materials for Week 9"
output:
html_document: default
editor_options:
chunk_output_type: console
---
```{r, echo = FALSE, message = FALSE, warning = FALSE, warning = FALSE}
knitr::opts_chunk$set(
echo = FALSE,
message = FALSE,
warning = FALSE,
error = FALSE,
collapse = TRUE,
comment = "#",
fig.height = 4,
fig.width = 6,
fig.align = "center",
cache = FALSE
)
```
# Objective
The objectives for this week are:
- Begin thinking about high dimension, low sample size issues
- Learn about fitting a neural network model
- Practice fitting a deep learning model
# Class discussion
Chapter 6, exercise 1.
# Theory (this is a refresher from ETC2420/ETC5242)
Chapter 6, exercise 7a. Take a look at the solution at http://www.jeffreylimbacher.me/ISLR-exercises/chap6.html
a. What's a density function, and what is it used for?
b. What's a likelihood function (assuming a sample of data), and what is it used for?
c. If $\epsilon_i \sim N(0, \sigma^2)$ write down the density function, from the model $y_i = \beta_0 + \sum_{j=1}^{p} x_{ij}\beta_j + \epsilon_i$.
d. Given c, what distribution does $y_i|x_i$ have? Write it down.
e. Using your density function for $y_i$ write down the likelihood function.
f. What values for $\beta_j$ maximize this likelihood?
g. Make a simple example: set $p=1, \beta_0, \beta_2, \sigma=1$, as set in the code below, and the sample using
```{r echo=TRUE, eval=FALSE}
library(tidyverse)
set.seed(20200515)
n <- 1000
b <- c(-2,-1)
x <- rnorm(n)
y <- b[1]+b[2]*x+rnorm(n)
df <- data.frame(x,y)
ggplot(df, aes(x, y)) + geom_point()
```
What would you expect the values for $\beta_0, \beta_1$ be (or close to) from maximum likelihood?
What are the least squares estimates for these quantities?
```{r eval=FALSE}
LL <- function(b) {
b0 <- rep(b[1], nrow(df))
b1 <- rep(b[2], nrow(df))
#d <- exp(-0.5*sum((df$y - (b0+b1*df$x))^2))
d <- dnorm(df$y, mean=b0+b1*df$x)
sum(log(d))
}
grid <- expand.grid(b0=seq(-3, 3, 0.1), b1=seq(-3, 3, 0.1))
grid <- grid %>% as_tibble() %>%
rowwise() %>%
mutate(l = LL(b=c(b0, b1)))
library(viridis)
ggplot(data=grid) +
geom_tile(aes(x=b0, y=b1, fill=l)) +
scale_fill_viridis_c("") + theme(aspect.ratio=1)
grid[which.max(grid$l),]
#par <- c(0,1)
#optim(par, LL)
```
Change the values for $\beta_0, \beta_1$ (within the range of -3, 3 only) and the sample size, and examine the resulting maxima.
# Exercises
This exercise is investigating neural network model fitting. A neural network model was fitted to the `wiggle.csv` using the `nnet` package in R, from many random starts, and using 2-4 nodes in the hidden layer. The best model is in the data object `nnet_best.rda`, and the full set of model fits are stored in `nnet_many.rda`. We are going investigate the best fit, and the complete set of fits. The data set is actually a test set, it was not used to fit the model, but it was simulated from the same process as the training set, which we don't have.
## 1.
In this problem we will replicate what was done in Section 6, of https://vita.had.co.nz/papers/model-vis.pdf. Please read this first! It documents a struggle with getting the neural network to fit to a reasonably simple problem.
a. Read in the data, and make an appropriate plot, that communicates the relationship between the two variables and the two groups.
```{r eval=FALSE}
library(tidyverse)
w <- read_csv(???)
ggplot(w, aes(x=???, y=???, colour=???, shape=???)) +
geom_point() +
scale_color_brewer("", palette="Dark2") +
scale_shape("") +
theme(aspect.ratio=1)
```
b. Read in the best model. Take a look at the object. There are three components: `hidden`, `output` and `nnet`. The best model uses $s=4$. The `nnet` component has the estimated model coefficients, fitted values and residuals. The `hidden` component has information related to the models used in the 4 nodes of the hidden layer, and the `output` has the same information for the second layer. These latter two contain a grid of values for the predictors, $x$, $y$ and the predicted values for each grid point.
(i) Plot the grid of predicted values for the second layer, using node 1. Overlay the data. How well has the model captured the class structure?
```{r eval=FALSE}
load("data/nnet_many.rda")
load("data/nnet_best.rda")
ggplot(???, aes(???, ???)) +
geom_raster(aes(fill = pred)) +
geom_???(aes(shape = class), data = w) +
scale_fill_gradient2(low="#1B9E77", high="#D95F02", mid = "white", midpoint = 0.5) +
theme(aspect.ratio=1)
```
(ii) Plot the grid of predicted values for each node in the hidden layer, with the data overlaid. Explain how the models at each node would combine to make the final model predictions, which we have already seen are extremely good.
```{r eval=FALSE}
ggplot(???, aes(???, ???)) +
geom_raster(aes(fill = ???)) +
geom_point(aes(shape = class), data = w) +
scale_fill_gradient2(low="#1B9E77", high="#D95F02", mid = "white", midpoint = 0.5) +
facet_grid(. ~ ???) +
theme(aspect.ratio=1)
```
(iii) How many parameters are there in this model? Check that your answer matches the number of values in the `wgts` element of the `nnet` component.
(iv) Write down the equation corresponding to the model at first node of the hidden layer. You need to look at the `wgts` element of the `nnet` component. There are 6 sets of linear model coefficients.
(v) OPTIONAL ADVANCED: See if you can compute the combination of the prediction on each hidden node, to get final prediction.
c. Read in the complete set of models fitted. There were 600 models fitted, 200 random starts for each $s = 2, 3, 4$. The `nnet` function has its own measure of the goodness of fit, which is used to determine when to stop minimising RSS, which is called `value` in this data. (You can think of this like it is training error.) Plot the predictive accuracy against function's returned value of model fit. Explain how the change in $s$ affects the predictive accuracy.
```{r eval=FALSE}
qual <- many %>% dplyr::select(value, accuracy, nodes, id) %>%
distinct()
ggplot(data = ???, aes(???, ???)) +
geom_point() +
xlab("Predictive accuracy") + ylab("Value of fitting criterion") +
facet_wrap(. ~ ???)
```
## 2.
a. Based on the lecture notes, explain whether the keras model can or can't be applied to the paintings data from the past labs.
b. Fit the keras model to the handwritten digit data, following the code in the lecture notes. Read section 13.7 of https://bradleyboehmke.github.io/HOML/deep-learning.html#why-dl. Change your fitting code to use:
(i) change the number of nodes in each hidden layer, like Table 13.1 small and medium
(ii) 2 hidden layer only (as well as the 1 layer from notes)
(iii) add a batch normalisation step
(iv) add a regularisation component to each layer
(v) try fitting the model with just 100 images, instead of 60000.
Examine the validation error vs the training error, like the plots in the chapter to compare the results for each of the changes in model structure.