--- title: "ETC3250: Support Vector Machines" subtitle: "Semester 1, 2020" author: "
Professor Di Cook

Monash University" date: "Week 7 (b)" output: xaringan::moon_reader: css: ["kunoichi", "ninjutsu", "mystyle.css", "libs/animate.css"] lib_dir: libs nature: ratio: '16:9' highlightStyle: github highlightLines: true countIncrementalSlides: false editor_options: chunk_output_type: console --- {r setup, include=FALSE} library(knitr) knitr::opts_chunkset(tidy = FALSE, message = FALSE, warning = FALSE, echo = FALSE, fig.width=8, fig.height=6, fig.align = "center", fig.retina = 4) options(htmltools.dir.version = FALSE) library(magick)  ## Separating hyperplanes In ap$-dimensional space, a .orange[hyperplane] is a flat affine subspace of dimension$p - 1$. .font_tiny[(ISLR: Fig 9.1)] --- ## Separating hyperplanes The equation of$p$-dimensional hyperplane is given by $$\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p = 0$$ If$x_i \in \Re^p$and$y_i \in \{-1, 1\}$for$i = 1, \dots, n$, then $$\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} > 0 \mbox{ if } y_i = 1,$$ $$\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} < 0 \mbox{ if } y_i = -1$$ Equivalently, $$y_i (\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}) > 0$$ --- ## Separating hyperplanes - A new observation is assigned a class depending on .orange[which side] of the hyperplane it is located - Classify the test observation$x^*$based on the .orange[sign] of $$s(x^*) = \beta_0 + \beta_1 x_1^* + \dots + \beta_p x_p^*$$ - If$s(x^*) > 0$, class$1$, and if$s(x^*) < 0$, class$-1$, i.e.$h(x^*) = \mbox{sign}(s(x^*)).$--- ## Separating hyperplanes What about the .orange[magnitude] of$s(x^*)$? -$s(x^*) \mbox{ far from zero } \rightarrowx^*$lies far from the hyperplane + **more confident** about our classification -$s(x^*) \mbox{ close to zero } \rightarrowx^*$near the hyperplane + **less confident** about our classification --- ## Separating hyperplane classifiers We will explore .green[*three*] different types of hyperplane classifiers, with each method generalising the one before it. - .orange[Maximal marginal classifier] for when the data is perfectly separated by a hyperplane - .orange[Support vector classifier/soft margin classifier] for when data is NOT perfectly separated by a hyperplane but still has a linear decision boundary, and - .orange[Support vector machines] used for when the data has non-linear decision boundaries. In practice, SVMs are used to refer to all three methods, however we will distinguish between the three notions in this lecture. --- ## Maximal margin classifier All are separating hyperplanes. .orange[Which is best?] {r} library(countdown) countdown(minutes = 0, seconds = 30)  --- ## Maximal margin classifier .font_tiny[Source: Machine Learning Memes for Convolutional Teens] --- ## From LDA to SVM - Linear discriminant analysis uses the difference between means to set the separating hyperplane. - Support vector machines uses .orange[gaps] between points on the outer edge of clusters to set the separating hyperplane. 
{r out.width="70%", fig.width=6, fig.height=3} library(tidyverse) library(MASS) library(e1071) library(gridExtra) olive <- read_csv("data/olive.csv") %>% rename(name=X1) %>% mutate(region = factor(region, levels=1:3, labels=c("south", "sardinia", "north"))) olive_p <- as_tibble(expand.grid(eicosenoic = seq(0, 60, 1), linoleic = seq(440, 1500, 10))) olive_lda <- lda(region~eicosenoic+linoleic, data=olive, prior=c(1/3, 1/3, 1/3)) olive_p$region <- predict(olive_lda, olive_p)$class p1 <- ggplot() + geom_point(data=olive_p, aes(x=eicosenoic, y=linoleic, color=region), alpha=0.1) + geom_point(data=olive, aes(x=eicosenoic, y=linoleic, color=region, shape=region)) + scale_color_brewer("", palette="Dark2") + theme_bw() + theme(aspect.ratio=1, legend.position="none") + ggtitle("LDA") olive_svm <- svm(region~eicosenoic+linoleic, data=olive, kernel="linear") olive_p$region <- predict(olive_svm, olive_p) p2 <- ggplot() + geom_point(data=olive_p, aes(x=eicosenoic, y=linoleic, color=region), alpha=0.1) + geom_point(data=olive, aes(x=eicosenoic, y=linoleic, color=region, shape=region)) + scale_color_brewer("", palette="Dark2") + theme_bw() + theme(aspect.ratio=1, legend.position="none") + ggtitle("SVM") grid.arrange(p1, p2, ncol=2)  --- ## SVM - If our data can be perfectly separated using a hyperplane, then there will in fact exist an **infinite number of such hyperplanes**. - We compute the (perpendicular) distance from each training observation to a given separating hyperplane. The .orange[smallest] such distance is known as the .orange[margin]. - The .orange[optimal separating hyperplane] (or maximal margin hyperplane) is the separating hyperplane for which the margin is .orange[largest]. - We can then classify a test observation based on which side of the maximal margin hyperplane it lies. This is known as the .orange[maximal margin classifier]. --- ## Support vectors See more detailed explanations [here](https://math.stackexchange.com/questions/1305925/why-is-the-svm-margin-equal-to-frac2-mathbfw). --- ## Support vectors - The .orange[support vectors] are equidistant from the maximal margin hyperplane and lie along the dashed lines indicating the width of the margin. 
- They .orange[support] the maximal margin hyperplane, in the sense that if these points were moved slightly, then the maximal margin hyperplane would move as well.

.center[**The maximal margin hyperplane depends directly on the support vectors, but .orange[not on the other observations]**]

---

## Support vectors

---

### Example: Support vectors

```{r out.width="100%", fig.height=3, fig.width=6, fig.align="center"}
indx <- olive_svm$index[abs(olive_svm$coefs[,1])<1 & abs(olive_svm$coefs[,2])<1]
svs <- olive[indx,]
p1 <- ggplot() +
  geom_point(data=olive_p, aes(x=eicosenoic, y=linoleic, color=region), alpha=0.01) +
  geom_point(data=olive, aes(x=eicosenoic, y=linoleic, color=region, shape=region)) +
  geom_point(data=svs, aes(x=eicosenoic, y=linoleic), color="black", shape=1, size=3) +
  scale_color_brewer("", palette="Dark2") +
  theme_bw() +
  theme(aspect.ratio=1, legend.position="none") +
  ggtitle("Non-bounded")
svs <- olive[olive_svm$index,]
p2 <- ggplot() +
  geom_point(data=olive_p, aes(x=eicosenoic, y=linoleic, color=region), alpha=0.01) +
  geom_point(data=olive, aes(x=eicosenoic, y=linoleic, color=region, shape=region)) +
  geom_point(data=svs, aes(x=eicosenoic, y=linoleic), color="black", shape=1, size=3) +
  scale_color_brewer("", palette="Dark2") +
  theme_bw() +
  theme(aspect.ratio=1, legend.position="none") +
  ggtitle("All")
grid.arrange(p1, p2, ncol=2)
```

---

## Maximal margin classifier

If $x_i \in \mathbb{R}^p$ and $y_i \in \{-1, 1\}$ for $i = 1, \dots, n$, the separating hyperplane is defined as

$$\{x:\beta_0+x^T\beta=0\}$$

where $\beta=\sum_{i=1}^s (\alpha_iy_i)x_i$ and $s$ is the number of support vectors. Then the .orange[maximal margin hyperplane] is found by *maximising* $M$, subject to $\sum_{j=1}^p\beta_j^2=1$, and $y_i(x_i^T\beta+\beta_0)\geq M, i=1, \dots, n$.
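
As a hedged sketch of this optimisation in practice: `e1071::svm()` solves the equivalent Lagrangian formulation, so with a linear kernel and a very large `cost` (approximating the hard margin) we can recover $\beta$ from the stored support vectors and compute the margin. This assumes e1071's scaling convention, where support vectors satisfy $y_i(x_i^T\beta + \beta_0) = 1$, so the half-margin is $1/\|\beta\|$ (rather than the $M$ with $\|\beta\| = 1$ above).

```{r margin-sketch, echo=TRUE, eval=FALSE}
# Sketch only: approximate the maximal margin hyperplane with a large cost
library(e1071)
set.seed(2020)
df <- data.frame(x1 = c(rnorm(20, -3), rnorm(20, 3)),
                 x2 = c(rnorm(20, -3), rnorm(20, 3)),
                 y  = factor(rep(c(-1, 1), each = 20)))
fit <- svm(y ~ x1 + x2, data = df, kernel = "linear", cost = 1e5, scale = FALSE)
beta  <- t(fit$SV) %*% fit$coefs    # beta = sum_i (alpha_i y_i) x_i over support vectors
beta0 <- -fit$rho                   # intercept
1 / sqrt(sum(beta^2))               # half-width of the margin in this parameterisation
```
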
---
class: split-60
layout: false

.column[.pad50px[

## Non-separable case

The maximal margin classifier only works when we have perfect separability in our data.

.green[What do we do if the data is not perfectly separable by a hyperplane?]

.orange[**The support vector classifier**] allows points to either lie on the wrong side of the margin, or on the wrong side of the hyperplane altogether.

.font_tiny[Right: ISLR Fig 9.6]

]]

.column[.content.vmiddle.center[

]]

---

## Support vector classifier - optimisation

*Maximise* $M$, subject to $\sum_{j=1}^p\beta_j^2=1$, and $y_i(x_i^T\beta+\beta_0)\geq M(1-\varepsilon_i), i=1, \dots, n$, AND $\varepsilon_i\geq 0, \sum_{i=1}^n\varepsilon_i\leq C$.

$\varepsilon_i$ tells us where the $i$th observation is located, and $C$ is a nonnegative .orange[tuning parameter].

- $\varepsilon_i = 0$: correct side of the margin,
- $\varepsilon_i > 0$: wrong side of the margin (violation of the margin),
- $\varepsilon_i > 1$: wrong side of the hyperplane.

---

## Non-separable case

.orange[Tuning parameter]: decreasing the value of *C*

---

## Non-linear boundaries

The support vector classifier doesn't work well for non-linear boundaries.

.green[What solution do we have?]

---

## Enlarging the feature space

Consider the following 2D non-linear classification problem. We can transform this to a linear problem, separated by a maximal margin hyperplane, by introducing an additional third dimension.

.font_tiny[Source: Grace Zhang @zxr.nju]

---

## The inner product

Consider two $p$-vectors

\begin{align*}
\mathbf{x} & = (x_1, x_2, \dots, x_p) \in \mathbb{R}^p \\
\mbox{and} \quad \mathbf{y} & = (y_1, y_2, \dots, y_p) \in \mathbb{R}^p.
\end{align*}

The inner product is defined as

$$\langle \mathbf{x}, \mathbf{y}\rangle = x_1y_1 + x_2y_2 + \dots + x_py_p = \sum_{j=1}^{p} x_jy_j$$

.orange[A linear measure of similarity, which allows geometric constructions such as the maximal margin hyperplane.]

---

## Kernel functions

A kernel function is an inner product of vectors mapped to a (higher dimensional) feature space $\mathcal{H} = \mathbb{R}^d, d > p$.

$$\mathcal{K}(\mathbf{x}, \mathbf{y}) = \langle \psi(\mathbf{x}), \psi(\mathbf{y}) \rangle$$

$$\psi: \mathbb{R}^p \rightarrow \mathcal{H}$$

.orange[A non-linear measure of similarity, which allows geometric constructions in high-dimensional space.]

---

## Examples of kernels

Standard kernels include:

\begin{align*}
\mbox{Linear} \quad \mathcal{K}(\mathbf{x}, \mathbf{y}) & = \langle\mathbf{x}, \mathbf{y} \rangle \\
\mbox{Polynomial} \quad \mathcal{K}(\mathbf{x}, \mathbf{y}) & = (\langle\mathbf{x}, \mathbf{y} \rangle + 1)^d \\
\mbox{Radial} \quad \mathcal{K}(\mathbf{x}, \mathbf{y}) & = \exp(-\gamma\|\mathbf{x}-\mathbf{y}\|^2)
\end{align*}

---
class: split-60
layout: false

.column[.pad50px[

## Support Vector Machines

.orange[The kernel trick]

The linear support vector classifier can be represented as follows:

$$f(x) = \beta_0 + \sum_{i \in \mathcal{S}} \alpha_i \langle x, x_i \rangle.$$

We can generalise this by replacing the inner product with the kernel function as follows:

$$f(x) = \beta_0 + \sum_{i \in \mathcal{S}} \alpha_i \mathcal{K}( x, x_i ).$$

]]

.column[.content.vmiddle.center[

]]
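
---

### Example: the kernel trick numerically

The following sketch (not from ISLR; it assumes `e1071`'s stored parameterisation, where `coefs` holds $\alpha_iy_i$ and `rho` is $-\beta_0$) rebuilds $f(x) = \beta_0 + \sum_{i \in \mathcal{S}} \alpha_i \mathcal{K}(x, x_i)$ from a fitted radial-kernel `svm()` and checks it against `predict()`.

```{r kernel-trick-sketch, echo=TRUE, eval=FALSE}
library(e1071)
set.seed(2020)
df <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
df$y <- factor(ifelse(df$x1^2 + df$x2^2 > 1, 1, -1))
fit <- svm(y ~ x1 + x2, data = df, kernel = "radial", scale = FALSE)

rbf <- function(u, v, gamma) exp(-gamma * sum((u - v)^2))   # radial kernel
x_new <- c(x1 = 0.5, x2 = -0.2)
k_vec <- apply(fit$SV, 1, rbf, v = x_new, gamma = fit$gamma)

f_manual <- sum(fit$coefs * k_vec) - fit$rho
f_pkg <- attr(predict(fit, data.frame(x1 = 0.5, x2 = -0.2),
                      decision.values = TRUE), "decision.values")
c(manual = f_manual, e1071 = as.numeric(f_pkg))   # should agree
```
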
---

## Your turn

Let $\mathbf{x}$ and $\mathbf{y}$ be vectors in $\mathbb{R}^2$. By expanding $\mathcal{K}(\mathbf{x}, \mathbf{y}) = (1 + \langle \mathbf{x}, \mathbf{y}\rangle)^2$ show that this is equivalent to an inner product in $\mathcal{H} = \mathbb{R}^6$.

.green[Remember:] $\langle \mathbf{x}, \mathbf{y}\rangle =\sum_{j=1}^{p} x_jy_j$.

```{r}
library(countdown)
countdown(minutes = 3, seconds = 0)
```

---

## Solution

\begin{align*}
\mathcal{K}(\mathbf{x}, \mathbf{y}) & = (1 + \langle \mathbf{x}, \mathbf{y}\rangle)^2 \\
& = \left(1 + \sum_{j = 1}^2 x_jy_j \right)^2 \\
& = (1 + x_1y_1 + x_2y_2)^2 \\
& = (1 + x_1^2y_1^2 + x_2^2y_2^2 + 2x_1y_1 + 2x_2y_2 + 2x_1x_2y_1y_2) \\
& = \langle \psi(\mathbf{x}), \psi(\mathbf{y}) \rangle
\end{align*}
where $\psi(\mathbf{x}) = (1, x_1^2, x_2^2, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_1x_2)$.
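
A quick numerical check of this identity (a sketch; any two vectors in $\mathbb{R}^2$ will do):

```{r kernel-expansion-check, echo=TRUE, eval=FALSE}
x <- c(0.5, -1.2)
y <- c(2.0, 0.3)
psi <- function(v) c(1, v[1]^2, v[2]^2, sqrt(2)*v[1], sqrt(2)*v[2], sqrt(2)*v[1]*v[2])
c(kernel = (1 + sum(x * y))^2,            # K(x, y) = (1 + <x, y>)^2
  feature_space = sum(psi(x) * psi(y)))   # <psi(x), psi(y)> -- identical value
```
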
---

## The kernel trick - why is it a trick?

We do not need to know what the high-dimensional enlarged feature space $\mathcal{H}$ really looks like. We just need to know which kernel function is most appropriate as a measure of similarity.

.tip[The Support Vector Machine (SVM) is a maximal margin hyperplane in $\mathcal{H}$ built by using a kernel function in the low-dimensional feature space $\mathbb{R}^p$.]

---

## Non-linear boundaries

.orange[Polynomial] and .orange[radial] kernel SVMs

```{r}
if (!file.exists("images/9.9.png"))
  image_write(image_read("http://www-bcf.usc.edu/~gareth/ISL/Chapter9/9.9.pdf",
                         density = 300),
              "images/9.9.png", format = "png", density = 300)
```

---

## Non-linear boundaries

Italian olive oils: Regions 2, 3 (North and Sardinia)

```{r out.width="70%"}
olive_sub <- olive %>% filter(region != "south")
olive_svm <- svm(region~linoleic + arachidic, data=olive_sub,
                 kernel="polynomial", degree=2)
olive_p <- data.frame(expand.grid(linoleic = seq(440, 1500, 10),
                                  arachidic = seq(0, 105, 2)))
olive_p$region <- predict(olive_svm, olive_p)
p1 <- ggplot() +
  geom_point(data=olive_p, aes(x=linoleic, y=arachidic, color=region), alpha=0.1) +
  geom_point(data=olive_sub, aes(x=linoleic, y=arachidic, color=region, shape=region), alpha=0.5) +
  scale_color_brewer("", palette="Dark2") +
  theme_bw() +
  theme(aspect.ratio=1, legend.position="none") +
  ggtitle("Polynomial kernel")
olive_svm <- svm(region~linoleic + arachidic, data=olive_sub, kernel="radial")
olive_p$region <- predict(olive_svm, olive_p)
p2 <- ggplot() +
  geom_point(data=olive_p, aes(x=linoleic, y=arachidic, color=region), alpha=0.1) +
  geom_point(data=olive_sub, aes(x=linoleic, y=arachidic, color=region, shape=region), alpha=0.5) +
  scale_color_brewer("", palette="Dark2") +
  theme_bw() +
  theme(aspect.ratio=1, legend.position="none") +
  ggtitle("Radial kernel")
grid.arrange(p1, p2, ncol=2)
```
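
---

### Example: choosing the kernel parameters

Both the polynomial degree and the radial kernel's $\gamma$ (along with the cost) are tuning parameters. A hedged sketch of selecting them by 10-fold cross-validation with `e1071::tune.svm()`, reusing `olive_sub` from the previous chunk (the parameter grids are illustrative only):

```{r tune-sketch, echo=TRUE, eval=FALSE}
tuned <- tune.svm(region ~ linoleic + arachidic, data = olive_sub,
                  kernel = "radial",
                  gamma = 10^(-3:1), cost = 10^(0:3))
tuned$best.parameters    # gamma/cost combination with lowest CV error
tuned$best.performance   # the corresponding cross-validated error
```
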
---

## SVM in high dimensions

Examining misclassifications and which points are selected to be support vectors

```{r eval=FALSE}
# Independent code block, so all is here
library(tidyverse)
library(e1071)
library(tourr)
library(RColorBrewer)
olive <- read_csv("data/olive.csv") %>%
  rename(name=X1) %>%
  mutate(region = factor(region, levels=1:3,
                         labels=c("south", "sardinia", "north")))
olive_sub <- olive %>%
  filter(region == "sardinia") %>%
  mutate(area=factor(area))
olive_svm <- svm(area~., data=olive_sub[,-c(1,2)], kernel="linear")
pch <- rep(1, nrow(olive_sub))
pch[olive_svm$index] <- 16
pal <- brewer.pal(3, "Dark2")
col <- pal[as.numeric(olive_sub$area)]
quartz()
animate_xy(olive_sub[,4:10], axes="bottomleft", col=col, pch=pch)
```

---

## SVM in high dimensions

[Examining boundaries](http://www.ggobi.org/book/chap-class/classifly.mov)

---

## SVM in high dimensions

[Boundaries of a radial kernel in 3D](https://vimeo.com/125405961)

---

## SVM in high dimensions

[Boundaries of a polynomial kernel in 5D](https://vimeo.com/125405962)

---
class: center

## Comparing decision boundaries

```{r out.width="100%", fig.width=6, fig.height=3}
library(tidyverse)
library(MASS)
library(e1071)
library(gridExtra)
library(randomForest)
library(mlbench)
# Generate the data
set.seed(1)
spirals <- mlbench.spirals(300, 1.5, 0.05)
data <- as.data.frame(cbind(spirals$x, spirals$classes))
colnames(data) <- c("X1", "X2", "class")
data$class <- as.factor(data$class)
data_p <- as.data.frame(expand.grid(X1 = seq(-1.5, 1.5, 0.05),
                                    X2 = seq(-1.5, 1.5, 0.05)))
# Generate RF plot
data_RF <- randomForest(class ~ X1 + X2, data = data)
data_p$region_RF <- predict(data_RF, data_p)
p1 <- ggplot() +
  geom_point(data = data_p, aes(x = X1, y = X2, color = region_RF), alpha = 0.1) +
  geom_point(data = data, aes(x = X1, y = X2, color = class, shape = class), size = 1) +
  geom_contour(data = data_p, aes(x = X1, y = X2, z = as.numeric(region_RF)),
               breaks = c(1.5), color = "black", size = 0.8) +
  scale_color_brewer("", palette = "Dark2") +
  theme_minimal() +
  theme(aspect.ratio = 1, legend.position = "none") +
  ggtitle("Random Forest")
# Generate SVM plot
data_SVM <- svm(class ~ X1 + X2, data = data, kernel = "radial", cost = 10)
data_p$region_SVM <- predict(data_SVM, data_p)
p2 <- ggplot() +
  geom_point(data = data_p, aes(x = X1, y = X2, color = region_SVM), alpha = 0.1) +
  geom_point(data = data, aes(x = X1, y = X2, color = class, shape = class), size = 1) +
  geom_contour(data = data_p, aes(x = X1, y = X2, z = as.numeric(region_SVM)),
               breaks = c(1.5), color = "black", size = 0.8) +
  scale_color_brewer("", palette = "Dark2") +
  theme_minimal() +
  theme(aspect.ratio = 1, legend.position = "none") +
  ggtitle("SVM")
# Generate LDA plot
data_lda <- lda(class ~ X1 + X2, data = data)
data_p$region_lda <- predict(data_lda, data_p)$class
p3 <- ggplot() +
  geom_point(data = data_p, aes(x = X1, y = X2, color = region_lda), alpha = 0.1) +
  geom_point(data = data, aes(x = X1, y = X2, color = class, shape = class), size = 1) +
  geom_contour(data = data_p, aes(x = X1, y = X2, z = as.numeric(region_lda)),
               breaks = c(1.5), color = "black", size = 0.8) +
  scale_color_brewer("", palette = "Dark2") +
  theme_minimal() +
  theme(aspect.ratio = 1, legend.position = "none") +
  ggtitle("LDA")
grid.arrange(p1, p2, p3, ncol = 3)
```

---

## Increasing the value of cost in svm

```{r out.width="80%"}
library(tidyverse)
library(MASS)
library(e1071)
library(gridExtra)
library(randomForest)
library(mlbench)
# Generate the data
set.seed(1)
spirals <- mlbench.spirals(300, 1.5, 0.05)
data <- as.data.frame(cbind(spirals$x, spirals$classes))
colnames(data) <- c("X1", "X2", "class")
data$class <- as.factor(data$class)
data_p <- as.data.frame(expand.grid(X1 = seq(-1.5, 1.5, 0.05),
                                    X2 = seq(-1.5, 1.5, 0.05)))
# Generate models
data_SVM <- svm(class ~ X1 + X2, data = data, kernel = "radial", cost = 1)
data_p$region_1 <- predict(data_SVM, data_p)
data_SVM <- svm(class ~ X1 + X2, data = data, kernel = "radial", cost = 2)
data_p$region_2 <- predict(data_SVM, data_p)
data_SVM <- svm(class ~ X1 + X2, data = data, kernel = "radial", cost = 5)
data_p$region_5 <- predict(data_SVM, data_p)
data_SVM <- svm(class ~ X1 + X2, data = data, kernel = "radial", cost = 10)
data_p$region_10 <- predict(data_SVM, data_p)
# Cost = 1
g1 <- ggplot() +
  geom_point(data = data_p, aes(x = X1, y = X2, color = region_1), alpha = 0.1) +
  geom_point(data = data, aes(x = X1, y = X2, color = class, shape = class), size = 1) +
  geom_contour(data = data_p, aes(x = X1, y = X2, z = as.numeric(region_1)),
               breaks = c(1.5), color = "black", size = 0.8) +
  scale_color_brewer("", palette = "Dark2") +
  theme_minimal() +
  theme(aspect.ratio = 1, legend.position = "none") +
  ggtitle("Cost = 1")
# Cost = 2
g2 <- ggplot() +
  geom_point(data = data_p, aes(x = X1, y = X2, color = region_2), alpha = 0.1) +
  geom_point(data = data, aes(x = X1, y = X2, color = class, shape = class), size = 1) +
  geom_contour(data = data_p, aes(x = X1, y = X2, z = as.numeric(region_2)),
               breaks = c(1.5), color = "black", size = 0.8) +
  scale_color_brewer("", palette = "Dark2") +
  theme_minimal() +
  theme(aspect.ratio = 1, legend.position = "none") +
  ggtitle("Cost = 2")
# Cost = 5
g3 <- ggplot() +
  geom_point(data = data_p, aes(x = X1, y = X2, color = region_5), alpha = 0.1) +
  geom_point(data = data, aes(x = X1, y = X2, color = class, shape = class), size = 1) +
  geom_contour(data = data_p, aes(x = X1, y = X2, z = as.numeric(region_5)),
               breaks = c(1.5), color = "black", size = 0.8) +
  scale_color_brewer("", palette = "Dark2") +
  theme_minimal() +
  theme(aspect.ratio = 1, legend.position = "none") +
  ggtitle("Cost = 5")
# Cost = 10
g4 <- ggplot() +
  geom_point(data = data_p, aes(x = X1, y = X2, color = region_10), alpha = 0.1) +
  geom_point(data = data, aes(x = X1, y = X2, color = class, shape = class), size = 1) +
  geom_contour(data = data_p, aes(x = X1, y = X2, z = as.numeric(region_10)),
               breaks = c(1.5), color = "black", size = 0.8) +
  scale_color_brewer("", palette = "Dark2") +
  theme_minimal() +
  theme(aspect.ratio = 1, legend.position = "none") +
  ggtitle("Cost = 10")
# Arrange the four panels in a 2 x 2 grid with matched row heights and column widths
library(gtable)
library(grid)
pl <- lapply(list(g1, g2, g3, g4), ggplotGrob)
g12 <- cbind(pl[[1]], pl[[2]], size = "first")
g12$heights <- unit.pmax(pl[[1]][["heights"]], pl[[2]][["heights"]])
g34 <- cbind(pl[[3]], pl[[4]], size = "first")
g34$heights <- unit.pmax(pl[[3]][["heights"]], pl[[4]][["heights"]])
g1234 <- rbind(g12, g34, size = "first")
g1234$widths <- unit.pmax(g12[["widths"]], g34[["widths"]])
grid.newpage()
grid.draw(g1234)
```
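
---

### Example: cost and the support vectors

Note that `e1071::svm()`'s `cost` argument penalises margin violations, so it behaves roughly like the *inverse* of the budget $C$ in the formulation earlier: increasing `cost` narrows the margin and leaves fewer support vectors. A sketch, reusing the spirals `data` from the previous chunk:

```{r cost-sv-sketch, echo=TRUE, eval=FALSE}
sapply(c(0.1, 1, 10, 100), function(cost_val) {
  fit <- svm(class ~ X1 + X2, data = data, kernel = "radial", cost = cost_val)
  fit$tot.nSV    # total number of support vectors
})
```
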
---
layout: false

# `r set.seed(2020); emo::ji("technologist")` Made by a human with a computer

### Slides at [https://iml.numbat.space](https://iml.numbat.space).

### Code and data at [https://github.com/numbats/iml](https://github.com/numbats/iml).

### Created using [R Markdown](https://rmarkdown.rstudio.com) with flair by [**xaringan**](https://github.com/yihui/xaringan), and [**kunoichi** (female ninja) style](https://github.com/emitanaka/ninja-theme).