---
author: "Professor Di Cook, Econometrics and Business Statistics, Monash University"
date: "Week 1 (b)"
output:
  xaringan::moon_reader:
    css: ["kunoichi", "ninjutsu", "mystyle.css"]
    lib_dir: libs
    nature:
      ratio: '16:9'
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---

```{r setup, include=FALSE}
library(knitr)
knitr::opts_chunk$set(tidy = FALSE, message = FALSE, warning = FALSE,
                      echo = FALSE, fig.retina = 4)
options(htmltools.dir.version = FALSE)
library(magick)
```

## Learning from Data

- .orange[Better understand] or .orange[make predictions] about a certain phenomenon under study

- .orange[Construct a model] of that phenomenon by finding relations between several variables

- If the phenomenon is complex or depends on a large number of variables, an .orange[analytical solution] might not be available

- However, we can .orange[collect data] and learn a model that .orange[approximates] the true underlying phenomenon

---
# Learning from Data

```{r fig.width=10, fig.height=4, fig.align='center'}
library(tidyverse)
library(gapminder)
library(gridExtra)
p1 <- gapminder %>%
  filter(country == "Australia") %>%
  ggplot(aes(x = year, y = lifeExp)) +
  geom_point() +
  geom_smooth() +
  xlab("predictor") + ylab("response") +
  ggtitle("Regression") +
  theme(aspect.ratio = 1)
flea <- read_csv("http://www.ggobi.org/book/data/flea.csv")
p2 <- ggplot(flea, aes(x = tars1, y = aede1, colour = species)) +
  geom_point() +
  scale_colour_brewer(palette = "Dark2") +
  xlab("Var 1") + ylab("Var 2") +
  ggtitle("Classification") +
  theme(aspect.ratio = 1, legend.position = "None")
p3 <- ggplot(flea, aes(x = tars1, y = aede1)) +
  geom_point() +
  xlab("Var 1") + ylab("Var 2") +
  ggtitle("Clustering") +
  theme(aspect.ratio = 1)
grid.arrange(p1, p2, p3, ncol = 3)
```

.tip[**Statistical learning** provides a framework for constructing models from the data.]

---
## Different Learning Problems

- .green[Supervised] learning, $y_i$ .orange[available] for all $x_i$
    - Regression (or prediction)
    - Classification
- .green[Unsupervised] learning, $y_i$ .orange[unavailable] for all $x_i$
- .green[Semi-supervised] learning, $y_i$ available for only a few $x_i$
- Other types of learning: reinforcement learning, online learning, active learning, etc.

.tip[Being able to .green[**identify**] which type of learning problem you have is important in practice.]

---
## Supervised learning

$$\mathcal{D} = \{(y_i, x_i)\}_{i = 1}^N \quad \mbox{where } (y_i, x_i) \sim P(Y, X) = P(X)\, P(Y|X)$$

$P(Y, X)$ means that the observations arise from some joint probability distribution, and $\sim$ means "distributed as", i.e. "arises from". Typically, we are only interested in $P(Y|X)$, the distribution of $Y$ conditional on $X$.
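The factorization $P(Y, X) = P(X)\,P(Y|X)$ can be illustrated by simulation — a minimal sketch added here, not from the original slides; the particular distributions and the linear $f$ are illustrative assumptions:

```{r joint-sim, echo=TRUE}
# Sketch: sample (y, x) ~ P(Y, X) by first drawing x ~ P(X),
# then y | x ~ P(Y | X). Here P(X) = N(0, 1) and
# Y | X = x ~ N(f(x), 1) with f(x) = 1 + 2x (an assumed example).
set.seed(1)
n <- 100
x <- rnorm(n)              # x ~ P(X)
y <- 1 + 2 * x + rnorm(n)  # y | x ~ P(Y | X)
head(cbind(y, x))
```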
---
## Supervised learning

- $Y = (Y_1, \dots, Y_q)$: response (output) (could be multivariate; $q=1$ for us)
- $X = (X_1, \dots, X_p)$: set of $p$ predictors (input)

We seek a function $h(X)$ for predicting $Y$ given values of the input $X$. This function is computed using $\mathcal{D}$.

---
## Supervised learning

$$\mathcal{D} = \{(y_i, x_i)\}_{i = 1}^N \mbox{ where } (y_i, x_i) \sim P(Y, X)$$

We are interested in minimizing the expected .orange[out-of-sample] prediction error:

$$\mbox{Err}_{\mbox{out}}(h) = E[L(Y, h(X))]$$

where $L(y, \hat{y})$ is a non-negative real-valued .orange[loss function], such as the squared error loss $L(y, \hat{y}) = (y - \hat{y})^2$ or the 0-1 loss $L(y, \hat{y}) = I(y \neq \hat{y})$.

.tip[The goal is that the predictions from the model are accurate for future samples.]

---
# Regression vs Classification Problems

- **Prediction**:
    - $\hat{y}_{*} = \hat f(x_{*})$ for a new observation $x_{*}$
- **Inference (or explanation)**:
    - Which predictors are associated with the response?
    - What is the relationship between the response and each predictor?

---
## Estimation

.row[.content[
.split-two[
.column[.pad10px[
.orange[.center[Parametric methods]]

`r set.seed(1000); emo::ji("math")` Assumption about the form of $f$, e.g. linear

`r set.seed(1000); emo::ji("smile")` The problem of estimating $f$ reduces to estimating a set of parameters

`r set.seed(1000); emo::ji("smile")` Usually a good starting point for many learning problems

`r set.seed(1000); emo::ji("frown")` Poor performance if the model assumption (such as linearity) is wrong
]]
.column[.pad10px[
.orange[.center[Non-parametric methods]]

`r set.seed(1000); emo::ji("smile")` No *explicit* assumptions about the form of $f$, e.g. nearest neighbours: $\hat Y(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$

`r set.seed(1000); emo::ji("smile")` High flexibility: it can potentially fit a range of shapes

`r set.seed(1000); emo::ji("frown")` A large number of observations is required to estimate $f$ with good accuracy
]]
]]]

---
## Measures of accuracy

Suppose we have a regression model $y=f(x)+\varepsilon$. .orange[Estimate] $\hat{f}$ from some .orange[training data], $Tr=\{x_i,y_i\}_{i=1}^n$.

One common measure of accuracy is the .orange[training Mean Squared Error]:

$$MSE_{Tr} = \mathop{\mbox{Ave}}\limits_{i\in Tr}[y_i-\hat{f}(x_i)]^2 = \frac{1}{n}\sum_{i=1}^n (y_i-\hat{f}(x_i))^2$$

---
## Measures of accuracy

Suppose we have a regression model $y=f(x)+\varepsilon$. .orange[Estimate] $\hat{f}$ from some .orange[training data], $Tr=\{x_i,y_i\}_{i=1}^n$.

Measure the .orange[real accuracy] using .orange[test data] $Te=\{x_j,y_j\}_{j=1}^m$, via the .orange[test Mean Squared Error]:

$$MSE_{Te} = \mathop{\mbox{Ave}}\limits_{j\in Te}[y_j-\hat{f}(x_j)]^2 = \frac{1}{m}\sum_{j=1}^m (y_j-\hat{f}(x_j))^2$$

---
## Training vs Test MSEs

- In general, the more .orange[flexible] a method is, the .orange[lower] its .orange[training MSE] will be, i.e. it will "fit" the training data very well.
- However, the .orange[test MSE] may be .orange[higher] for a more .orange[flexible] method than for a simple approach like linear regression.
- Flexibility also makes interpretation more difficult. There is a trade-off between .orange[flexibility] and .orange[model interpretability].

---
class: split-two

.column[.pad50px[
## Interpretability vs Flexibility

Simplistic overview of methods on the flexibility vs interpretability scale.

Interpretability is when it is clear how the explanatory variable is related to the response, e.g. a linear model.

.orange[Poor interpretability] is often called a .orange["black box"] method.

.font_tiny[(Chapter2/2.7.pdf)]
]]
.column[.content.vmiddle.center[
]]

---
class: split-70

.row.bg-main5[.content.vmiddle.center[
]]
.row[.content[
.split-two[
.column[.content.vmiddle.center[
.black[LEFT:] .orange[Linear regression]

.green[Smoothing splines]

.black[True curve]
]]
.column[.content.vmiddle.center[
.black[RIGHT:] .gray[Training MSE], .red[Test MSE], .black[Dashed: Minimum test MSE]

.font_tiny[(Chapter2/2.9.pdf)]]
]]
]]]

---
class: split-70

.row.bg-main5[.content.vmiddle.center[
]]
.row[.content[
.split-two[
.column[.content.vmiddle.center[
.orange[Linear regression]

.green[Smoothing splines]

.black[True curve]
]]
.column[.content.vmiddle.center[
.gray[Training MSE], .red[Test MSE]

.black[Dashed: Minimum test MSE]

.font_tiny[(Chapter2/2.9.pdf)]]
]]
]]]

---
## Bias - variance tradeoff

.font_tiny[Source: Statistical Statistics Memes]

---
### ... as well as high bias (underfitted) models!

.font_tiny[Source: Statistical Statistics Memes]

---
## The Bias Variance Tradeoff

As you may have guessed, there is a trade-off: increasing flexibility decreases bias but increases variance, while simpler models have lower variance but higher bias.
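A minimal simulation sketch of this trade-off (an addition, not from the original slides; the sine curve, noise level, and polynomial degree are illustrative assumptions): fit a rigid model and a very flexible one to the same training sample and compare their test MSEs.

```{r bias-variance-sketch, echo=TRUE}
# Sketch: a simple (high bias) vs a flexible (high variance) fit
# on data simulated from y = sin(2x) + noise.
set.seed(2)
n <- 100
dat <- data.frame(x = runif(n, 0, 3))
dat$y <- sin(2 * dat$x) + rnorm(n, sd = 0.3)
train <- sample(n, n / 2)
fit_lin  <- lm(y ~ x, data = dat[train, ])           # rigid: straight line
fit_poly <- lm(y ~ poly(x, 10), data = dat[train, ]) # flexible: degree-10 polynomial
test <- dat[-train, ]
mse <- function(fit) mean((test$y - predict(fit, newdata = test))^2)
c(linear = mse(fit_lin), poly10 = mse(fit_poly))     # compare test MSEs
```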

.font_tiny[Source: Statistical Statistics Memes]

---
## KNN
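The earlier nearest-neighbours formula, $\hat Y(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$, can be coded directly — a minimal sketch added here, not from the original slides, for a one-dimensional predictor with Euclidean distance:

```{r knn-sketch, echo=TRUE}
# Sketch: kNN regression — average the responses of the k training
# points closest to the query point x0.
knn_predict <- function(x0, x, y, k = 5) {
  nn <- order(abs(x - x0))[1:k]  # indices of the k nearest neighbours
  mean(y[nn])                    # average their responses
}
set.seed(3)
x <- runif(50, 0, 3)
y <- sin(2 * x) + rnorm(50, sd = 0.3)
knn_predict(1.5, x, y, k = 5)
```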

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.