class: middle center hide-slide-number monash-bg-gray80

.info-box.w-50.bg-white[
These slides are best viewed in Chrome or Firefox and occasionally need to be refreshed if elements did not load properly. See <a href=lecture-04b.pdf>here for the PDF <i class="fas fa-file-pdf"></i></a>.
]

<br>

.white[Press the **right arrow** to progress to the next slide!]

---

class: title-slide
count: false
background-image: url("images/bg-02.png")

# .monash-blue[ETC3250/5250: Introduction to Machine Learning]

<h1 class="monash-blue" style="font-size: 30pt!important;"></h1>

<br>
<h2 style="font-weight:900!important;">Dimension reduction</h2>

.bottom_abs.width100[
Lecturer: *Professor Di Cook*

Department of Econometrics and Business Statistics

<i class="fas fa-envelope"></i> ETC3250.Clayton-x@monash.edu

<i class="fas fa-calendar-alt"></i> Week 4b
<br>
]

---

class: middle
background-image: url(https://upload.wikimedia.org/wikipedia/commons/9/98/Andromeda_Galaxy_%28with_h-alpha%29.jpg)
background-position: 50% 50%
class: center, bottom, inverse

.white[Space is big. You just won't believe how vastly, hugely, mind-bogglingly big it is. I mean, you may think it's a long way down the road to the chemist's, but that's just peanuts to space.]

**.white[Douglas Adams, Hitchhiker's Guide to the Galaxy]**

---

# High Dimensional Data

<br>
<br>

Remember, our data can be denoted as:

`\(\mathcal{D} = \{(x_i, y_i)\}_{i = 1}^n, ~~~ \mbox{where}~ x_i = (x_{i1}, \dots, x_{ip})^{T}\)`

<br>
<br>

then

.info-box[.monash-orange2[.content[Dimension ]] .content[of the data is *p*, ] .monash-orange2[.content[ the number of variables.]]]

---

# Cubes and Spheres

Space expands exponentially with dimension:

<img src="images/lecture-04b/hypercube.png" style="width: 50%; align: center" />
<img src="images/lecture-04b/cube_sphere.png" style="width: 30%; align: center" />

As dimension increases, the .monash-orange2[volume of a sphere] with the same radius as the cube's side length becomes much .monash-orange2[smaller than the volume of the cube].

---

.flex[
.w-45[

# Multivariate data

Mostly, though, we're working on problems where `\(n>>p\)` and `\(p>1\)`. This would more commonly be referred to as .monash-orange2[multivariate data].

]
.w-10.white[
]
.w-45[

# Sub-spaces

<br>
<br>

Data will often be confined to a region of the space having lower .monash-orange2[intrinsic dimensionality]. The data lives in a low-dimensional subspace.

Analyse the data by .monash-orange2[reducing the dimensionality] to the subspace containing the data.

]
]

---

# Principal Component Analysis (PCA)

<br>

.info-box[Principal component analysis (PCA) produces a low-dimensional representation of a dataset. It finds a sequence of linear combinations of the variables that have .monash-orange2[maximal variance], and are .monash-orange2[mutually uncorrelated]. It is an unsupervised learning method.
]

<br>

# Why use PCA?

- We may have too many predictors for a regression. Instead, we can use the first few principal components.
- Understanding relationships between variables.
- Data visualisation. We can plot a small number of variables more easily than a large number of variables.

---

# First principal component

The first principal component of a set of variables `\(x_1, x_2, \dots, x_p\)` is the linear combination

`$$z_1 = \phi_{11}x_1 + \phi_{21} x_2 + \dots + \phi_{p1} x_p$$`

that has the largest variance, subject to

`$$\displaystyle\sum_{j=1}^p \phi^2_{j1} = 1$$`

The elements `\(\phi_{11},\dots,\phi_{p1}\)` are the .monash-orange2[loadings] of the first principal component.
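
---

# First PC: a quick computational sketch

A minimal sketch of this idea in R, using a hypothetical numeric matrix `X` as a stand-in for real data: `prcomp()` returns the loadings `\(\phi_{11},\dots,\phi_{p1}\)` in the first column of `$rotation` and the scores `\(z_{11},\dots,z_{n1}\)` in the first column of `$x`.

```r
set.seed(2021)
X <- matrix(rnorm(100 * 5), ncol = 5)    # hypothetical stand-in for a real data matrix
pca <- prcomp(X, center = TRUE, scale. = TRUE)

phi1 <- pca$rotation[, 1]   # loadings of the first PC
z1   <- pca$x[, 1]          # scores: the data projected onto phi1

sum(phi1^2)   # loadings satisfy the unit-length constraint (equals 1)
var(z1)       # the largest variance achievable by any unit-length combination
```

The same quantities are computed for real data, with `prcomp()`, later in these slides.
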
---

# Geometry

- The loading vector `\(\phi_1 = [\phi_{11},\dots,\phi_{p1}]^T\)` defines the direction in feature space along which the data vary most.
- If we project the `\(n\)` data points `\({x}_1,\dots,{x}_n\)` onto this direction, the projected values are the principal component scores `\(z_{11},\dots,z_{n1}\)`.
- The second principal component is the linear combination `\(z_{i2} = \phi_{12}x_{i1} + \phi_{22}x_{i2} + \dots + \phi_{p2}x_{ip}\)` that has maximal variance among all linear combinations that are *uncorrelated* with `\(z_1\)`.
- Equivalent to constraining `\(\phi_2\)` to be orthogonal (perpendicular) to `\(\phi_1\)`. And so on.
- There are at most `\(\min(n - 1, p)\)` PCs.

---

# Example

<center>
<a href="http://www-bcf.usc.edu/~gareth/ISL/Chapter6/6.14.pdf" target="_BLANK"> <img src="images/lecture-04b/6.14.png" style="width: 60%; align: center"/> </a>
</center>

.monash-green2[First PC]; .blue[second PC]

.font_smaller2[(Chapter6/6.14.pdf)]

---

# Example

<center>
<a href="http://www-bcf.usc.edu/~gareth/ISL/Chapter6/6.15.pdf" target="_BLANK"> <img src="images/lecture-04b/6.15.png" style="width: 80%; align: center"/> </a>
</center>

If you think of the first few PCs as a linear model fit, and the remaining PCs as the error, this is like regression, except that the errors are orthogonal to the model.

.font_smaller2[(Chapter6/6.15.pdf)]

---

# Computation

PCA can be thought of as fitting a `\(p\)`-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component. The new variables produced by principal components correspond to .monash-orange2[rotating] and .monash-orange2[scaling] the ellipse .monash-orange2[into a circle].

<center>
<img src="images/lecture-04b/pc-demo.gif" style="width: 40%; align: center" />
</center>

---

# Computation

Suppose we have an `\(n\times p\)` data set `\(X = [x_{ij}]\)`.

1. Centre each of the variables to have mean zero (i.e., the column means of `\({X}\)` are zero).
2. Let `\(z_{i1} = \phi_{11}x_{i1} + \phi_{21} x_{i2} + \dots + \phi_{p1} x_{ip}\)`
3. The sample variance of `\(z_{i1}\)` is `\(\displaystyle\frac1n\sum_{i=1}^n z_{i1}^2\)`.
4. Estimate `\(\phi_{j1}\)` by solving

`$$\mathop{\text{maximize}}_{\phi_{11},\dots,\phi_{p1}} \frac{1}{n}\sum_{i=1}^n \left(\sum_{j=1}^p \phi_{j1}x_{ij}\right)^{\!\!\!2} \text{ subject to } \sum_{j=1}^p \phi^2_{j1} = 1$$`

Repeat the optimisation to estimate `\(\phi_{jk}\)`, with the additional constraint that `\(\sum_{j=1}^p \phi_{jk}\phi_{jk'} = 0\)` for all `\(k' < k\)` (each new loading vector is orthogonal to all previous ones).

---

.flex[
.w-45[

# Eigen-decomposition

1. Compute the covariance matrix (after centering the columns of `\({X}\)`; the `\(1/(n-1)\)` scaling is ignored here, since it does not change the eigenvectors)

`$$S = {X}^T{X}$$`

2. Find eigenvalues (diagonal elements of `\(D\)`) and eigenvectors ( `\(V\)` ):

`$${S}={V}{D}{V}^T$$`

where columns of `\({V}\)` are orthonormal (i.e., `\({V}^T{V}={I}\)`)

]
.w-10.white[
white space]
.w-45[

# Singular Value Decomposition

`$$X = U\Lambda V^T$$`

- `\(X\)` is an `\(n\times p\)` matrix
- `\(U\)` is an `\(n \times r\)` matrix with orthonormal columns ( `\(U^TU=I\)` )
- `\(\Lambda\)` is an `\(r \times r\)` diagonal matrix with non-negative elements (the square roots of the eigenvalues of `\(X^TX\)`).
- `\(V\)` is a `\(p \times r\)` matrix with orthonormal columns (these are the eigenvectors of `\(X^TX\)`, and `\(V^TV=I\)` ).

Such a decomposition always exists; when the singular values are distinct it is unique up to sign changes of paired columns of `\(U\)` and `\(V\)`.
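
A quick numerical check of this connection (a sketch, using a small simulated matrix): the right singular vectors from `svd()` match the `prcomp()` loadings, up to sign.

```r
# small simulated stand-in data, centred
X  <- scale(matrix(rnorm(200), ncol = 4),
            center = TRUE, scale = FALSE)
sv <- svd(X)                      # X = U Lambda V^T
pc <- prcomp(X, center = FALSE)   # already centred
all.equal(abs(sv$v),
          abs(unname(pc$rotation)))   # TRUE, up to sign
```
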
]
]

---

# Total variance

.monash-orange2[Total variance] in the data (assuming variables are centered at 0):

`$$\text{TV} = \sum_{j=1}^p \text{Var}(x_j) = \sum_{j=1}^p \frac{1}{n}\sum_{i=1}^n x_{ij}^2$$`

.info-box[**If variables are standardised, TV = number of variables!**]

.monash-orange2[Variance explained] by the *m*'th PC: `\(V_m = \text{Var}(z_m) = \frac{1}{n}\sum_{i=1}^n z_{im}^2\)`

`$$\text{TV} = \sum_{m=1}^M V_m \text{ where }M=\min(n-1,p).$$`

---

# How to choose `\(k\)`?

<br>

PCA is a useful dimension reduction technique for large datasets, but deciding how many dimensions to keep often isn't clear. 🤔

<center>
.think-box[How do we know how many principal components to choose?]
</center>

---

# How to choose `\(k\)`?

<center>
.info-box[.monash-orange2[Proportion] of variance explained:
`$$\text{PVE}_m = \frac{V_m}{TV}$$`
]
</center>

Choosing the number of PCs that adequately summarises the variation in `\(X\)` is achieved by examining the cumulative proportion of variance explained.

<center>
.info-box[ .monash-orange2[Cumulative proportion] of variance explained:
`$$\text{CPVE}_k = \sum_{m=1}^k\frac{V_m}{TV}$$`
]
</center>

---

class: split-two
layout: false

.column[.pad50px[

# How to choose `\(k\)`?

<br>

.info-box[.monash-orange2[Scree plot: ].content[Plot of the variance explained by each component against the component number.]]

]]
.column[.content.vmiddle.center[

<img src="images/lecture-04b/unnamed-chunk-6-1.png" width="100%" style="display: block; margin: auto;" />

]]

---

class: split-two
layout: false

.column[.pad50px[

# How to choose `\(k\)`?

<br>

.info-box[.monash-orange2[Scree plot: ].content[Plot of the variance explained by each component against the component number.]]

]]
.column[.content.vmiddle.center[

<img src="images/lecture-04b/unnamed-chunk-7-1.png" width="100%" style="display: block; margin: auto;" />

]]

---

# Example - track records

The data on national track records for women (as at 1984).
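
A sketch of how the data might be read and inspected (assuming the file `data/womens_track.csv`, the same file used for the MDS example later in these slides):

```r
library(tidyverse)
# read the track records data and take a quick look
track <- read_csv(here::here("data/womens_track.csv"))
glimpse(track)
```
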
```
## Rows: 55
## Columns: 8
## $ m100     <dbl> 11.61, 11.20, 11.43, 11.41, 11.46, 11.31, 12.14, 11.00, 12.00…
## $ m200     <dbl> 22.94, 22.35, 23.09, 23.04, 23.05, 23.17, 24.47, 22.25, 24.52…
## $ m400     <dbl> 54.50, 51.08, 50.62, 52.00, 53.30, 52.80, 55.00, 50.06, 54.90…
## $ m800     <dbl> 2.15, 1.98, 1.99, 2.00, 2.16, 2.10, 2.18, 2.00, 2.05, 2.08, 2…
## $ m1500    <dbl> 4.43, 4.13, 4.22, 4.14, 4.58, 4.49, 4.45, 4.06, 4.23, 4.33, 4…
## $ m3000    <dbl> 9.79, 9.08, 9.34, 8.88, 9.81, 9.77, 9.51, 8.81, 9.37, 9.31, 9…
## $ marathon <dbl> 178.52, 152.37, 159.37, 157.85, 169.98, 168.75, 191.02, 149.4…
## $ country  <chr> "argentin", "australi", "austria", "belgium", "bermuda", "bra…
```

.font_smaller2[*Source*: Johnson and Wichern, Applied multivariate analysis]

---

.flex[

# Explore the data

<img src="images/lecture-04b/unnamed-chunk-9-1.png" width="60%" style="display: block; margin: auto;" />

]

---

# Compute PCA

```r
track_pca <- prcomp(track[,1:7], center=TRUE, scale=TRUE)
track_pca
```

```
## Standard deviations (1, .., p=7):
## [1] 2.41 0.81 0.55 0.35 0.23 0.20 0.15
## 
## Rotation (n x k) = (7 x 7):
##           PC1   PC2    PC3    PC4    PC5     PC6    PC7
## m100     0.37  0.49 -0.286  0.319  0.231  0.6198  0.052
## m200     0.37  0.54 -0.230 -0.083  0.041 -0.7108 -0.109
## m400     0.38  0.25  0.515 -0.347 -0.572  0.1909  0.208
## m800     0.38 -0.16  0.585 -0.042  0.620 -0.0191 -0.315
## m1500    0.39 -0.36  0.013  0.430  0.030 -0.2312  0.693
## m3000    0.39 -0.35 -0.153  0.363 -0.463  0.0093 -0.598
## marathon 0.37 -0.37 -0.484 -0.672  0.131  0.1423  0.070
```

---

# Assess

Summary of the principal components:

<table class="table" style="margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:left;color: white !important;background-color: #7570b3 !important;"> </th>
<th style="text-align:right;color: white !important;background-color: #7570b3 !important;"> PC1 </th>
<th style="text-align:right;color: white !important;background-color: #7570b3 !important;"> PC2 </th>
<th style="text-align:right;color: white !important;background-color: #7570b3 !important;"> PC3 </th>
<th style="text-align:right;color: white !important;background-color: #7570b3 !important;"> PC4 </th>
<th style="text-align:right;color: white !important;background-color: #7570b3 !important;"> PC5 </th>
<th style="text-align:right;color: white !important;background-color: #7570b3 !important;"> PC6 </th>
<th style="text-align:right;color: white !important;background-color: #7570b3 !important;"> PC7 </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left;width: 2.5em; color: white !important;background-color: #7570b3 !important;"> Variance </td>
<td style="text-align:right;width: 2.5em; "> 5.81 </td>
<td style="text-align:right;width: 2.5em; "> 0.65 </td>
<td style="text-align:right;width: 2.5em; "> 0.30 </td>
<td style="text-align:right;width: 2.5em; "> 0.13 </td>
<td style="text-align:right;width: 2.5em; "> 0.05 </td>
<td style="text-align:right;width: 2.5em; "> 0.04 </td>
<td style="text-align:right;width: 2.5em; "> 0.02 </td>
</tr>
<tr>
<td style="text-align:left;width: 2.5em; color: white !important;background-color: #7570b3 !important;"> Proportion </td>
<td style="text-align:right;width: 2.5em; "> 0.83 </td>
<td style="text-align:right;width: 2.5em; "> 0.09 </td>
<td style="text-align:right;width: 2.5em; "> 0.04 </td>
<td style="text-align:right;width: 2.5em; "> 0.02 </td>
<td style="text-align:right;width: 2.5em; "> 0.01 </td>
<td style="text-align:right;width: 2.5em; "> 0.01 </td>
<td style="text-align:right;width: 2.5em; "> 0.00 </td>
</tr>
<tr>
<td style="text-align:left;width: 2.5em; color: white !important;background-color: #CA6627 !important;"> Cum. prop </td>
<td style="text-align:right;width: 2.5em; color: white !important;background-color: #CA6627 !important;"> 0.83 </td>
<td style="text-align:right;width: 2.5em; color: white !important;background-color: #CA6627 !important;"> 0.92 </td>
<td style="text-align:right;width: 2.5em; color: white !important;background-color: #CA6627 !important;"> 0.97 </td>
<td style="text-align:right;width: 2.5em; color: white !important;background-color: #CA6627 !important;"> 0.98 </td>
<td style="text-align:right;width: 2.5em; color: white !important;background-color: #CA6627 !important;"> 0.99 </td>
<td style="text-align:right;width: 2.5em; color: white !important;background-color: #CA6627 !important;"> 1.00 </td>
<td style="text-align:right;width: 2.5em; color: white !important;background-color: #CA6627 !important;"> 1.00 </td>
</tr>
</tbody>
</table>

The increase in variance explained is large until `\(k=3\)` PCs, and then tapers off. A choice of .monash-orange2[3 PCs] would explain 97% of the total variance.

---

class: split-two
layout: false

.column[.pad50px[

# Assess

<br>

.monash-green2[Scree plot: Where is the elbow?]

<br>

The elbow is at `\(k=2\)`, thus the scree plot suggests 2 PCs would be sufficient to explain the variability.

]]
.column[.content.vmiddle.center[

<img src="images/lecture-04b/unnamed-chunk-13-1.png" width="100%" style="display: block; margin: auto;" />

]]

---

class: split-two
layout: false

.column[.pad50px[

# Assess

<br>

.info-box[.monash-orange2[Visualise the model using a biplot]: Plot the principal component scores, and also the contribution of the original variables to the principal components.
]

]]
.column[.content.vmiddle.center[

<img src="images/lecture-04b/unnamed-chunk-14-1.png" width="100%" style="display: block; margin: auto;" />

]]

---

# Significance of loadings

The bootstrap can be used to assess whether the coefficients of a PC are significantly different from 0. The 95% bootstrap confidence intervals can be computed by:

1. Generating `\(B\)` bootstrap samples of the data
2. Computing the PCA for each sample, and recording the loadings
3. Re-orienting the loadings, by choosing one variable with a large coefficient to be the direction base
4. If `\(B=1000\)`, taking the 25th and 975th sorted values as the lower and upper bounds of the confidence interval for each coefficient.

<img src="images/lecture-04b/unnamed-chunk-15-1.png" width="50%" style="display: block; margin: auto;" />

---

# Loadings for PC1

<img src="images/lecture-04b/unnamed-chunk-16-1.png" width="60%" style="display: block; margin: auto;" />

All of the coefficients on PC1 are significantly different from 0, positive, and approximately equal (.monash-orange2[not significantly different from being equal]).

---

# Loadings for PC2

<img src="images/lecture-04b/unnamed-chunk-17-1.png" width="60%" style="display: block; margin: auto;" />

On PC2, m100 and m200 contrast with m1500 and m3000 (and possibly marathon). These coefficients are significantly different from 0.

---

# Loadings for PC3

<img src="images/lecture-04b/unnamed-chunk-18-1.png" width="60%" style="display: block; margin: auto;" />

On PC3, the coefficients for m400 and m800 (and possibly marathon) are significantly different from 0.

---

# Interpretation

- PC1 measures overall magnitude, the strength of the athletics program. High positive values indicate .monash-orange2[poor] programs with generally slow times across events.
- PC2 measures the .monash-orange2[contrast] in the program between .monash-orange2[short and long distance] events. Some countries have relatively stronger long distance athletes, while others have relatively stronger short distance athletes.
- There are several .monash-orange2[outliers] visible in this plot: `wsamoa`, `cookis`, `dpkorea`. PCA, because it is computed using the variance in the data, can be affected by outliers. It may be better to remove these countries, and re-run the PCA.
- PC3 may or may not be useful to keep. The interpretation would be that this variable summarises differences between countries in middle distance performance.

---

class: transition

# Other techniques

---

# Projection pursuit (PP) generalises PCA

PCA:

`$$\mathop{\text{maximize}}_{\phi_{11},\dots,\phi_{p1}} \frac{1}{n}\sum_{i=1}^n \left(\sum_{j=1}^p \phi_{j1}x_{ij}\right)^{\!\!\!2} \text{ subject to } \sum_{j=1}^p \phi^2_{j1} = 1$$`

PP:

`$$\mathop{\text{maximize}}_{\phi_{11},\dots,\phi_{p1}} ~~f\left(\sum_{j=1}^p \phi_{j1}x_{ij}\right) \text{ subject to } \sum_{j=1}^p \phi^2_{j1} = 1$$`

where `\(f\)` is an index function computed on the `\(n\)` projected values, measuring how interesting the projection is. Taking `\(f\)` to be the sample variance gives PCA.

---

# MDS generalises PCA

.tip[.orange[Multidimensional scaling (MDS)] finds a low-dimensional layout of points that minimises the difference between distances computed in the *p*-dimensional space, and those computed in the low-dimensional space.
]

`$$\mbox{Stress}_D(x_1, ..., x_N) = \left(\sum_{i, j=1; i\neq j}^N (d_{ij} - d_k(i,j))^2\right)^{1/2}$$`

where `\(D\)` is an `\(N\times N\)` matrix of distances `\((d_{ij})\)` between all pairs of points, and `\(d_k(i,j)\)` is the distance between the points in the low-dimensional space.

---

class: split-two

.column[.pad50px[

# MDS can do nonlinear dimension reduction

- Classical MDS gives similar results to PCA
- Metric MDS incorporates power transformations on the distances, `\(d_{ij}^r\)`.
- Non-metric MDS incorporates a monotonic transformation of the distances, e.g. ranks

```r
track <- read_csv(here::here("data/womens_track.csv"))
track_mds <-
*  cmdscale(dist(track[,1:7])) %>%
  as_tibble() %>%
  mutate(country = track$country)
```

]]
.column[.pad50px[

<img src="images/lecture-04b/unnamed-chunk-20-1.png" width="60%" style="display: block; margin: auto;" /><img src="images/lecture-04b/unnamed-chunk-20-2.png" width="60%" style="display: block; margin: auto;" />

]]

---

# Challenge

For each of these distance matrices, find a layout in 1 or 2D that accurately reflects the full distances.

```
## # A tibble: 3 × 4
##   name      A     B     C
##   <chr> <dbl> <dbl> <dbl>
## 1 A       0.1   3.2   3.9
## 2 B       3.2  -0.1   5.1
## 3 C       3.9   5.1   0
```

```
## # A tibble: 4 × 5
##   name      A     B     C     D
##   <chr> <dbl> <dbl> <dbl> <dbl>
## 1 A       0.1   0.9   2.1   3
## 2 B       0.9   0     1.1   1.9
## 3 C       2.1   1.1   0.1   1.1
## 4 D       3     1.9   1.1  -0.1
```

---

# Non-linear dimension reduction

<br>

- .orange[t-distributed Stochastic Neighbor Embedding (t-SNE)]: similar to MDS, except emphasis is placed on grouping observations into clusters. Observations within a cluster are placed close together in the low-dimensional representation, but clusters themselves are placed far apart.

- .orange[Local linear embedding (LLE)]: Finds nearest neighbours of points, defines interpoint distances relative to neighbours, and preserves these proximities in the low-dimensional mapping. The low-dimensional layout is obtained from an eigen-decomposition based on this nearest-neighbour construction.

- .orange[Self-organising maps (SOM)]: First clusters the observations into `\(k \times k\)` groups. Uses the mean of each group, laid out in a constrained 2D grid, to create a 2D projection.
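
---

# t-SNE: a sketch in R

As an illustration of one of these methods, a minimal t-SNE sketch for the track data, assuming the `Rtsne` package is available. The layout differs from run to run unless the seed is fixed.

```r
library(Rtsne)   # assumes the Rtsne package is installed
set.seed(2021)
track_scaled <- scale(track[, 1:7])
# perplexity must be less than (n - 1) / 3; n = 55 here
tsne <- Rtsne(track_scaled, dims = 2, perplexity = 10)
track_tsne <- tibble(tsne1 = tsne$Y[, 1],
                     tsne2 = tsne$Y[, 2],
                     country = track$country)
ggplot(track_tsne, aes(x = tsne1, y = tsne2, label = country)) +
  geom_text(size = 3)
```

Countries with similar profiles across events should land near each other; distances between well-separated clusters are not directly interpretable.
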
---

background-size: cover
class: title-slide
background-image: url("images/bg-02.png")

<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.

.bottom_abs.width100[
Lecturer: *Professor Di Cook*

Department of Econometrics and Business Statistics

<i class="fas fa-envelope"></i> ETC3250.Clayton-x@monash.edu

<i class="fas fa-calendar-alt"></i> Week 4b
<br>
]