--- title: "ETC3250/5250: Supervised Learning" subtitle: "Semester 1, 2020" author: "
Professor Di Cook

Monash University" date: "Week 1 (b)" output: xaringan::moon_reader: css: ["kunoichi", "ninjutsu", "mystyle.css"] lib_dir: libs nature: ratio: '16:9' highlightStyle: github highlightLines: true countIncrementalSlides: false --- {r setup, include=FALSE} library(knitr) knitr::opts_chunkset(tidy = FALSE, message = FALSE, warning = FALSE, echo = FALSE, fig.retina = 4) options(htmltools.dir.version = FALSE) library(magick)  ## Learning from Data - .orange[Better understand] or .orange[make predictions] about a certain phenomenon under study - .orange[Construct a model] of that phenomenon by finding relations between several variables - If phenomenon is complex or depends on a large number of variables, an .orange[analytical solution] might not be available - However, we can .orange[collect data] and learn a model that .orange[approximates] the true underlying phenomenon --- # Learning from Data {r fig.width=10, fig.height=4, fig.align='center'} library(tidyverse) library(gapminder) library(gridExtra) p1 <- gapminder %>% filter(country == "Australia") %>% ggplot(aes(x=year, y=lifeExp)) + geom_point() + geom_smooth() + xlab("predictor") + ylab("response") + ggtitle("Regression") + theme(aspect.ratio=1) flea <- read_csv("http://www.ggobi.org/book/data/flea.csv") p2 <- ggplot(flea, aes(x=tars1, y=aede1, colour = species)) + geom_point() + scale_colour_brewer(palette = "Dark2") + xlab("Var 1") + ylab("Var 2") + ggtitle("Classification") + theme(aspect.ratio=1, legend.position="None") p3 <- ggplot(flea, aes(x=tars1, y=aede1)) + geom_point() + xlab("Var 1") + ylab("Var 2") + ggtitle("Clustering") + theme(aspect.ratio=1) grid.arrange(p1, p2, p3, ncol=3)  .tip[**Statistical learning** provides a framework for constructing models from the data.] --- ## Different Learning Problems - .green[Supervised] learning,y_i$.orange[available] for all$x_i$- Regression (or prediction) - Classification - .green[Unsupervised] learning,$y_i$.orange[unavailable] for all$x_i$- .green[Semi-supervised] learning,$y_i$available only for few$x_i$- Other types of learning: reinforcement learning, online learning, active learning, etc. .tip[Being able to .green[**identify**] which is the type of learning problem you have is important in practice] --- ## Supervised learning$\mathcal{D} = \{(y_i, x_i)\}_{i = 1}^N$where$(y_i, x_i) \sim P(Y, X) = P(X) \underbrace{P(Y|X)}_{}$where$P(Y, X)$means that these arise from some probability distribution.$\sim"$means distributed as, arise from. Typically, we only are interested in$P(Y|X)$, the distribution of$Y$conditional on$X$. --- ## Supervised learning -$Y = (Y_1, \dots, Y_q)$: response (output) (could be multivariate,$q=1$for us) -$X = (X_1, \dots, X_p)$: set of$p$predictors (input) We seek a function$h(X)$for predicting$Y$given values of the input$X$. This function is computed using$\mathcal{D}$. --- ## Supervised learning$\mathcal{D} = \{(y_i, x_i)\}_{i = 1}^N \mbox{ where } (y_i, x_i) \sim P(Y, X)$We are interested in minimizing the expected .orange[out-of-sample] prediction error:$\mbox{Err}_{\mbox{out}}(h) = E[L(Y, h(X))]$where$L(y, {\hat{y}})$is a non-negative real-valued .orange[loss function], such as$L(y, \hat{y}) = (y - \hat{y})^2$and$L(y, \hat{y}) = I(y \neq \hat{y})$. .tip[The goal is that the predictions from the model are accurate for future samples.] --- # Regression vs Classification Problems .center[.font_tiny[Source: [Taylor Fogarty](https://towardsdatascience.com/regression-or-classification-linear-or-logistic-f093e8757b9c) ]] --- background-image: url(https://upload.wikimedia.org/wikipedia/commons/thumb/d/d5/Rainbow_after_storm.jpg/1599px-Rainbow_after_storm.jpg) class: middle center # Regression --- ## Regression We often assume that our data arose from a statistical model$Y=f(X) + \varepsilon,$where$f$is the true unknown function,$\varepsilon$is the random error term with$E[\varepsilon] = 0$and is independent of$X$. - The additive error model is a useful approximation to the truth -$f(x) = E[Y|X = x]$- Not a deterministic relationship:$Y \neq f(X)$--- .light-blue[Blue curve] is$f(x)$, the true functional relationship. .font_tiny[(Chapter2/2.2.pdf)] --- .light-blue[Blue surface] is$f(x)$, the true functional relationship. .font_tiny[(Chapter2/2.3.pdf)] --- ## Goals of a regression analysis - **Prediction**: -$\hat{y}_{*} = \hat f(x_{*})$for a new observation$x_{*}$- **Inference (or explanation)**: - Which predictors are associated with the response? - What is the relationship between the response and each predictor? --- ## Estimation  Linear model:$\hat f(\mbox{education}, \mbox{seniority}) =~~~~~~~~\hat \beta_0 + \hat \beta_1 \times \mbox{education} + \hat \beta_2 \times \mbox{seniority}$.green[Why would we ever choose to use a **more restrictive method** instead of a **very flexible approach**?] .font_tiny[(Chapter2/2.3.pdf, 2.4.pdf)] --- class: split-20 .row[.content.vmiddle[ ## Methods {r} library(emo)  ]] .row[.content[ .split-two[ .column[.pad10px[ .orange[.center[Parametric methods]] r set.seed(1000); emo::ji("math") Assumption about the form of$f$, e.g. linear r set.seed(1000); emo::ji("smile") The problem of estimating$f$reduces to estimating a set of parameters r set.seed(1000); emo::ji("smile") Usually a good starting point for many learning problems r set.seed(1000); emo::ji("frown") Poor performance if model assumption (such as linearity) is wrong ]] .column[.pad10px[ .orange[.center[Non-parametric methods]] r set.seed(1000); emo::ji("smile") No *explicit* assumptions about the form of$f$, e.g. nearest neighbours:$\hat Y(x) = \frac1k \sum_{x_i \in N_k(x)} y_i$r set.seed(1000); emo::ji("smile") High flexibility: it can potentially fit a range of shapes r set.seed(1000); emo::ji("frown") A large number of observations is required to estimate$f$with good accuracy ]] ]]] --- ## Measures of accuracy Suppose we have a regression model$y=f(x)+\varepsilon$. .orange[Estimate]$\hat{f}$from some .orange[training data],$Tr=\{x_i,y_i\}_{i=1}^n$. One common measure of accuracy is: .orange[Training Mean Squared Error]$MSE_{Tr} = \mathop{\mbox{Ave}}\limits_{i\in Tr}[y_i-\hat{f}(x_i)]^2 = \frac{1}{n}\sum_{i=1}^n [(y_i-\hat{f}(x_i)]^2$--- ## Measures of accuracy Suppose we have a regression model$y=f(x)+\varepsilon$. .orange[Estimate]$\hat{f}$from some .orange[training data],$Tr=\{x_i,y_i\}_{i=1}^n$. Measure .orange[real accuracy] using .orange[test data]$Te=\{x_j,y_j\}_{j=1}^m$, .orange[Test Mean Squared Error]$MSE_{Te} = \mathop{\mbox{Ave}}\limits_{j\in Te}[y_j-\hat{f}(x_j)]^2 = \frac{1}{m}\sum_{j=1}^m [(y_j-\hat{f}(x_j)]^2$--- ## Training vs Test MSEs - In general, the more .orange[flexible] a method is, the .orange[lower] its .orange[training MSE] will be. i.e. it will “fit” the training data very well. - However, the .orange[test MSE] may be .orange[higher] for a more .orange[flexible] method than for a simple approach like linear regression. - Flexibility also makes interpretation more difficult. There is a trade-off between .orange[flexibility] and .orange[model interpretability]. --- class: split-two .column[.pad50px[ ## Interpretability vs Flexibility Simplistic overview of methods on the flexibility vs interpretability scale. Interpretability is when it is clear how the explanatory variable is related to the response, e.g. linear model. .orange[Poor interpretability] is often called a .orange["black box"] method. .font_tiny[(Chapter2/2.7.pdf)] ]] .column[.content.vmiddle.center[ ]] --- class: split-70 .row.bg-main5[.content.vmiddle.center[ ]] .row[.content[ .split-two[ .column[.content.vmiddle.center[ .black[LEFT:] .orange[Linear regression] .green[Smoothing splines] .black[True curve] ]] .column[.content.vmiddle.center[ .black[RIGHT:] .gray[Training MSE], .red[Test MSE], .black[Dashed: Minimum test MSE] .font_tiny[(Chapter2/2.9.pdf)]] ]] ]]] --- class: split-70 .row.bg-main5[.content.vmiddle.center[ ]] .row[.content[ .split-two[ .column[.content.vmiddle.center[ .orange[Linear regression] .green[Smoothing] splines .black[True curve] ]] .column[.content.vmiddle.center[ .gray[Training MSE], .red[Test MSE] .black[Dashed: Minimum test MSE] .font_tiny[(Chapter2/2.9.pdf)]] ]] ]]] --- class: split-70 .row.bg-main5[.content.vmiddle.center[ ]] .row[.content[ .split-two[ .column[.content.vmiddle.center[ .orange[Linear regression] .green[Smoothing] splines .black[True curve] ]] .column[.content.vmiddle.center[ .gray[Training MSE], .red[Test MSE] .black[Dashed: Minimum test MSE] .font_tiny[(Chapter2/2.9.pdf)]] ]] ]]] --- ## Bias - variance tradeoff .tip[There are two competing forces that govern the choice of learning method: .orange[bias] and .orange[variance].] .orange[Bias] is the error that is introduced by modeling a complicated problem by a simpler problem. - For example, linear regression assumes a linear relationship when few real relationships are exactly linear. - In general, the .orange[more flexible] a method is, the .orange[less bias] it will have. [This site](https://degreesofbelief.roryquinn.com/bias-variance-tradeoff) has a lovely explanation, if you don't like mine. --- ## Bias - variance tradeoff .tip[There are two competing forces that govern the choice of learning method: .orange[bias] and .orange[variance].] .orange[Variance] refers to how much your estimate would change if you had different training data. - In general, the .orange[more flexible] a method is, the .orange[more variance] it has. - The .orange[size] of the training data has an impact on the variance. --- ### High variance (over fitted) models can be problematic... .font_tiny[Source: Statistical Statistics Memes] --- ### ... as well as high bias (under fitted) models! .font_tiny[Source: Statistical Statistics Memes] --- ## The Bias Variance Tradeoff As you may have guessed, there is a trade off between increasing variance (flexibility) and decreasing bias (simplicity) and vice versa. .green[Decomposing the MSE can give us a mathematical intuition as to why this occurs.] --- ## MSE decomposition If$Y = f(x) + \varepsilon$and$f(x)=\mbox{E}[Y\mid X=x]$, then the expected **test** MSE for a new$Y$at$x_0$will be equal to$E[(Y-\hat{f}(x_0))^2] = [\mbox{Bias}(\hat{f}(x_0))]^2 + \mbox{Var}(\hat{f}(x_0)) + \mbox{Var}(\varepsilon)$Test MSE = Bias$^2$+ Variance + Irreducible variance - The expectation averages over the variability of$Y$as well as the variability in the training data. - As the flexibility of$\hat{f}$increases, its variance increases and its bias decreases. - **Choosing the flexibility based on average test MSE amounts to the .orange[bias-variance trade-off]** --- .nopadding[ .blue[Squared bias], .orange[variance], .black[Var(ε) (dashed line)], and .red[test MSE] for the three data sets shown earlier. The vertical dotted line indicates the flexibility level corresponding to the smallest test MSE. .font_tiny[(Chapter2/2.12.pdf)] ] --- ## Optimal Prediction The optimal MSE is obtained when$\hat{f}=f = \mbox{E}[Y\mid X=x].$Then .orange[bias=variance=0] and$\mbox{MSE} = \mbox{irreducible variance}$This is called the .orange[oracle predictor] because it is not achievable in practice. --- background-image: url(https://upload.wikimedia.org/wikipedia/commons/thumb/5/58/Cat_and_Dog_Game.jpg/2560px-Cat_and_Dog_Game.jpg) background-size: cover class: middle center # Classification --- ## Classification Here the response variable$Y$is .orange[qualitative]. - e.g., email is one of${\cal C} = (\mbox{spam},\mbox{ham})$- e.g., voters are one of${\cal C} = (\mbox{Liberal},\mbox{Labor},\mbox{Green},\mbox{National},\mbox{Other})$--- ## Classification - goals .green[Our goals are:] 1. Build a classifier$C(x)$that assigns a class label from${\cal C} = \{\mathcal{C}_1, \dots, \mathcal{C}_K\}$to a future unlabeled observation$x$. 2. Such a classifier will divide the input space into regions$\mathcal{R}_k$called decision regions, one for each class, such that all points in$\mathcal{R}_k$are assigned to class$\mathcal{C}_k$3. Assess the uncertainty in each classification (i.e., the probability of misclassification). 4. Understand the roles of the different predictors among$X = (X_1,X_2,\dots,X_p)$. --- ## Classification Recall that we want to minimize the expected prediction error$E_{(Y, X)}[L(Y, C(X))]$where$L(y, {\hat{y}})$is a non-negative real-valued .orange[loss function]. In classification, the output$Y$is a .orange[categorical variable], and our loss function can be represented by a$K \times K$matrix$L$, where$K = \mbox{card}(\mathcal{C})$.$L(k, l)$is the cost of classifying$C_k$as$C_l$. With zero-one loss, i.e.$L(y, \hat{y}) = I(y \neq \hat{y})$the cost is equal for mistakes between any class. --- ## Error rates .orange[Compute]$\hat{C}$from some .orange[training data],$Tr=\{x_i,y_i\}_1^n$. In place of MSE, we now use the error rate (fraction of misclassifications). .green[Training Error Rate]$\text{Error rate}_{Tr} = \frac{1}{n}\sum_{i=1}^n I(y_i \ne \hat{C}(x_i))$Measure .orange[real accuracy] using .orange[test data]$Te=\{x_j,y_j\}_1^m$.green[Test Error Rate]$\text{Error rate}_{Te} = \frac{1}{m}\sum_{j=1}^m I(y_j \ne \hat{C}(x_j))$--- ## Bayes Classifier Let${\cal C} = \{\mathcal{C}_1, \dots, \mathcal{C}_K\}$, and let$p_k(x) = \text{P}(Y = C_k\mid X = x),\qquad k = 1, 2, \dots , K.$These are the .orange[conditional class probabilities] at$x$. Then the .orange[Bayes] classifier at$x$is$C(x) = C_j \quad \mbox{ if } p_j(x) = \max\{p_1(x), p_2(x), \dots, p_K(x)\}$*This gives the minimum average test error rate.* --- ## Bayes Classifier$1-\text{E}\left(\max_j \text{P}(Y = C_j | X)\right)$- The .orange[Bayes error rate] is the lowest possible error rate that could be achieved *if we knew* exactly the .orange[true] probability distribution of the data. - It is analogous to the irreducible error in regression. - On test data, no classifier can get lower error rates than the Bayes error rate. .tip[In reality, the Bayes error rate is not known exactly!!] --- ## Bayes Classifier .font_tiny[(Chapter2/2.13.pdf)] --- ## K Nearest Neighbours (KNN) One of the simplest classifiers. Given a test observation$x_0$, - Find the$K$nearest points to$x_0$in the training data:${\cal N}_0$. - Estimate conditional probabilities$=P(Y = C_j \mid X=x_0) = \frac{1}{K}\sum_{i\in {\cal N}_0} I(y_i = C_j).$- Classify$x_0\$ to class with largest probability. --- ## KNN .font_tiny[Source: Statistical Statistics Memes] --- ## KNN .font_tiny[(Chapter2/2.14.pdf)] --- ## KNN .font_tiny[(Chapter2/2.16.pdf)] --- ## KNN .font_tiny[(Chapter2/2.15.pdf)] --- ## KNN .font_tiny[(Chapter2/2.17.pdf)] --- ## A fundamental picture --- layout: false # r set.seed(2019); emo::ji("technologist") Made by a human with a computer ### Slides at [https://iml.numbat.space](https://iml.numbat.space). ### Code and data at [https://github.com/numbats/iml](https://github.com/numbats/iml). ### Created using [R Markdown](https://rmarkdown.rstudio.com) with flair by [**xaringan**](https://github.com/yihui/xaringan), and [**kunoichi** (female ninja) style](https://github.com/emitanaka/ninja-theme). 