class: middle center hide-slide-number monash-bg-gray80

.info-box.w-50.bg-white[
These slides are best viewed in Chrome or Firefox, and occasionally need to be refreshed if elements do not load properly. See <a href=lecture-07a.pdf>here for the PDF <i class="fas fa-file-pdf"></i></a>.
]

<br>

.white[Press the **right arrow** to progress to the next slide!]

---

class: title-slide
count: false
background-image: url("images/bg-02.png")

# .monash-blue[ETC3250/5250: Introduction to Machine Learning]

<h1 class="monash-blue" style="font-size: 30pt!important;"></h1>

<br>

<h2 style="font-weight:900!important;">Random forests</h2>

.bottom_abs.width100[
Lecturer: *Professor Di Cook*

Department of Econometrics and Business Statistics

<i class="fas fa-envelope"></i> ETC3250.Clayton-x@monash.edu

<i class="fas fa-calendar-alt"></i> Week 7a

<br>
]

---
# What's wrong with a single tree?

<br>

<center>
<img src="images/lecture-07a/overfitting.jpg" style="width: 70%"/>
</center>

.font_tiny[Source: Hugh MacLeod / Statistical Statistics Memes]

---
# Solution? Ensemble methods

Ensemble methods use multiple learning algorithms to obtain better predictive performance than any of the constituent models could achieve alone.

<center>
<img src="images/lecture-07a/seagulls.jpg" style="width: 70%"/>
</center>

---
# Roadmap

We will learn about different ensembles, increasing in complexity (but also potentially in predictive performance) as we go. These methods are:

- .monash-blue2[Bagging]: combine the predictions of multiple trees, each fitted to a bootstrap sample of the training data.
- .monash-blue2[Random forests]: combine predictions from bagged trees, where each split also uses a random sample of the predictors.
- .monash-blue2[Boosted trees]: combine predictions from trees fitted sequentially, each to the residuals from the previous fit.

---
# Bootstrap aggregation

- Take `\(B\)` different *bootstrapped* training sets:
`$$D_1, D_2, \dots, D_B$$`
- Build a separate prediction model using each `\(D_{(\cdot)}\)`:
`$$\hat{f}_1(x), \hat{f}_2(x), \dots, \hat{f}_B(x)$$`
- Combine the resulting predictions, e.g. by averaging:
`$$\hat{f}_{\text{avg}}(x) = \frac1B \sum_{b = 1}^{B} \hat{f}_b(x)$$`

---
# Bagging trees

.monash-orange2[Bagged trees]

- Construct `\(B\)` regression trees using `\(B\)` bootstrapped training sets, and average the resulting predictions.
- Each individual tree has .monash-orange2[high variance, but low bias].
- Averaging these `\(B\)` trees .monash-orange2[reduces the variance].
- For classification trees, there are several possible aggregation methods, but the simplest is the **majority vote**.

---
# Bagged trees - construction

<center>
<img src="images/lecture-07a/bagging-1.png" style="width: 100%"/>
</center>

---
# Bagged trees - construction

<center>
<img src="images/lecture-07a/bagging-2.png" style="width: 100%"/>
</center>

---
# Bagged trees - construction

<center>
<img src="images/lecture-07a/bagging-3.png" style="width: 100%"/>
</center>

---
# Bagged trees - construction

<center>
<img src="images/lecture-07a/bagging-4.png" style="width: 100%"/>
</center>

---
# Bagged trees - construction

<center>
<img src="images/lecture-07a/bagging-5.png" style="width: 100%"/>
</center>
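---
# Bagging by hand - a code sketch

To make the procedure concrete, here is a minimal sketch of bagging classification trees written out directly. This is not the implementation used later in these slides: it assumes a hypothetical data frame `df` with a factor response `y`, fits `rpart` trees to bootstrap samples, and aggregates by majority vote.

```r
library(rpart)

# df: hypothetical data frame with a factor response column y
# fit B trees, each to a bootstrap sample of the rows of df
bag_trees <- function(df, B = 100) {
  n <- nrow(df)
  lapply(1:B, function(b) {
    idx <- sample(n, n, replace = TRUE)
    rpart(y ~ ., data = df[idx, ], method = "class")
  })
}

# predict by majority vote across the B trees
predict_bagged <- function(trees, newdata) {
  votes <- sapply(trees, function(t)
    as.character(predict(t, newdata, type = "class")))
  apply(votes, 1, function(v) names(which.max(table(v))))
}
```

The rows left out of each bootstrap sample are the out-of-bag cases discussed on the next slide.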
---
# Out of bag error

- .monash-orange2[No need to use (cross-)validation] to **estimate the test error** of a bagged model (**debatable by some**).
- On average, each bagged tree makes use of around .monash-orange2[two-thirds of the observations]. (Check the textbook exercise.)
- The remaining observations, not used to fit a given bagged tree, are referred to as the .monash-orange2[out-of-bag (OOB)] observations.
- We can predict the response for the `\(i^{th}\)` observation using each of the trees for which that observation was OOB. This yields around .monash-orange2[B/3 predictions] for the `\(i^{th}\)` observation.
- To obtain a single prediction for the `\(i^{th}\)` observation, average these predicted responses (regression) or take a majority vote (classification).

---
# From bagging to random forests

However, when bagging trees, a problem remains. Although the model building steps are independent, the bagged trees are not completely independent of each other, because all of the original features are considered at every split of every tree. Trees grown on different bootstrap samples therefore tend to have a similar structure (especially at the top of the tree) whenever a few strong relationships dominate.

.monash-orange2[Random forests] overcome this by sampling the predictors as well as the observations!

---
# Random forests - the algorithm

1. Input: `\(L = \{(x_i, y_i), i=1, ..., n\}\)`, `\(y_i \in \{1, ..., k\}\)`, `\(m < p\)`, the number of variables sampled at each split, and `\(B\)`, the number of bootstrap samples.
2. For `\(b =1, 2, ..., B\)`:

    i. Draw a bootstrap sample, `\(L^{*b}\)` of size `\(n^{*b}\)` from `\(L\)`.

    ii. Grow tree classifier, `\(T^{*b}\)`. .monash-blue2[At each node use a random selection of `\\(m\\)` variables, and grow to maximum depth without pruning.]

    iii. Predict the class of each case not drawn in `\(L^{*b}\)`.
3. Combine the predictions for each case, by majority vote, to give the predicted class.

---
# Random forest - diagnostics

- Error is computed automatically on the out-of-bag cases.
- .monash-orange2[Variable importance]: more complicated than one might think.
- .monash-orange2[Vote matrix], `\(n\times k\)`: Proportion of trees in which each case is predicted to be class `\(k\)`.
- .monash-orange2[Proximities], `\(n\times n\)`: Closeness of cases measured by how often they are in the same terminal node.

---
layout: true

# Variable importance

.row[.content[
1. For every tree predict the oob cases and count the number of votes .monash-orange2[cast for the correct class].
]]

.row[.content[
2. .monash-orange2[Randomly permute] the values of a variable in the oob cases and predict the class for these cases.
]]

.row[.content[
3. Take the difference between the votes for the correct class in the variable-permuted oob cases and the original oob cases. Average this number over all trees in the forest. If the .monash-orange2[value is large, then the variable is very important].
]]

.row[.content[
Alternatively, .monash-orange2[Gini importance] adds up the decrease in impurity from the parent node to its descendant nodes, over all splits on the variable. Quick to compute.
]]

<br><br><br>

Read a fun explanation here by [Harriet Mason](https://numbat.space/post/permutation_variable_importance/permutationvariableimportance/)

---
class: fade-row2 fade-row3 fade-row4
count: false

---
class: fade-row3 fade-row4
count: false

---
class: fade-row4
count: false

---
count: false

---
layout: false

# Vote Matrix

- .monash-orange2[Proportion of trees] in which the case is predicted to be each class; ranges between 0 and 1.
- Can be used to .monash-orange2[identify troublesome] cases.
- Used together with plots of the actual data, it can help determine whether the record itself is the problem, or whether the method is biased.
- Helps to understand the differences in prediction accuracy across classes.
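---
# Permutation importance - a code sketch

The permutation variable importance calculation from a few slides back can be written out directly. This is a simplified sketch, not the randomForest implementation: it permutes one variable at a time over the whole data set rather than over the out-of-bag cases of each tree, and it assumes a hypothetical fitted classifier `fit` whose `predict(fit, newdata)` method returns class labels, and a data frame `df` with response column `y`.

```r
# fit, df, response: hypothetical names, not from the slides
permutation_importance <- function(fit, df, response = "y") {
  truth <- df[[response]]
  # accuracy with the data as observed
  base_acc <- mean(predict(fit, df) == truth)
  vars <- setdiff(names(df), response)
  sapply(vars, function(v) {
    df_perm <- df
    # randomly permute the values of one variable
    df_perm[[v]] <- sample(df_perm[[v]])
    perm_acc <- mean(predict(fit, df_perm) == truth)
    # drop in accuracy: a large value means the variable is important
    base_acc - perm_acc
  })
}
```

Variables whose permutation barely changes the accuracy contribute little to the fit.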
---
layout: false

# Proximities

- A measure of how close each pair of observations is, based on where the cases land in the forest.
- Run both in- and out-of-bag cases down each tree, and increase the proximity value of cases `\(i, j\)` by 1 each time they are in the same terminal node.
- Normalize by dividing by `\(B\)`.

---
class: split-two

.column[.pad50px[
# Example - Olive oil data

Distinguish the region where oils were produced by their fatty acid signature. Important in quality control and in determining fraudulent marketing.

**Areas in the south:**

1. North-Apulia <br>
2. Calabria <br>
3. South-Apulia <br>
4. Sicily

]]
.column[.content.vmiddle.center[

<img src="images/lecture-07a/Italian-olive-oils-map.png" style="width: 60%"/>

]]

---
# Example - Olive oil data

Classifying the olive oils in the south of Italy is a difficult classification task.

<center>
<iframe src="olive1.html" width="800" height="500" scrolling="yes" seamless="seamless" frameBorder="0"> </iframe>
</center>

---
# Example - random forest fit

```r
set.seed(2021)
olive <- olive %>%
  mutate(area = factor(area)) %>%
  dplyr::select(area:arachidic)
olive_split <- initial_split(olive, 2/3, strata = area)
olive_tr <- training(olive_split)
olive_ts <- testing(olive_split)
*olive_rf <- rand_forest() %>%
*  set_engine("randomForest",
*             importance=TRUE, proximity=TRUE) %>%
*  set_mode("classification") %>%
*  fit(area~., data=olive_tr)
```

---

```
## parsnip model object
## 
## 
## Call:
##  randomForest(x = maybe_data_frame(x), y = y, importance = ~TRUE, proximity = ~TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 7.94%
## Confusion matrix:
##   1  2   3  4 class.error
## 1 9  1   0  2 0.250000000
## 2 1 33   2  1 0.108108108
## 3 0  0 141  1 0.007042254
## 4 0  4   5 14 0.391304348
```

---
# Test set confusion and accuracy

```r
olive_ts_pred <- olive_ts %>%
  mutate(.pred = predict(olive_rf, olive_ts)$.pred_class)
conf_mat(olive_ts_pred, area, .pred)$table %>% addmargins()
```

```
##           Truth
## Prediction  1  2  3  4 Sum
##        1   11  0  0  0  11
##        2    0 19  0  1  20
##        3    0  0 64  3  67
##        4    2  0  0  9  11
##        Sum 13 19 64 13 109
```

```r
bal_accuracy(olive_ts_pred, area, .pred)$.estimate
```

```
## [1] 0.9299813
```

---
# Diagnostics - variable importance

```
##                  1     2      3     4
## palmitic     0.237 0.025 0.0097 0.052
## palmitoleic  0.228 0.121 0.1194 0.158
## stearic      0.010 0.093 0.0256 0.117
## oleic        0.232 0.089 0.0574 0.060
## linoleic     0.168 0.226 0.1237 0.065
## linolenic   -0.017 0.152 0.0112 0.062
## arachidic    0.076 0.015 0.0063 0.089
```

```
##             MeanDecreaseAccuracy MeanDecreaseGini
## palmitic                   0.029             10.4
## palmitoleic                0.129             25.1
## stearic                    0.045             10.6
## oleic                      0.071             20.9
## linoleic                   0.136             24.5
## linolenic                  0.039              9.9
## arachidic                  0.020              8.5
```

---
# Important variables

<center>
<img src="images/lecture-07a/unnamed-chunk-10-1.png" width="70%" style="display: block; margin: auto;" />
</center>
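---
# Pulling out the diagnostics

The vote matrix and proximities examined on the following slides can be extracted from the fitted model. A short sketch, assuming the `olive_rf` fit from earlier: the underlying randomForest object is stored in `$fit`, and `proximity = TRUE` was set when fitting.

```r
library(ggplot2)

# vote matrix: proportion of trees voting for each class, one row per case
votes <- olive_rf$fit$votes
head(votes)

# proximity matrix: how often pairs of cases share a terminal node
prox <- olive_rf$fit$proximity

# view the proximities in 2D using classical multidimensional scaling
prox_mds <- as.data.frame(cmdscale(1 - prox, k = 2))
prox_mds$area <- olive_tr$area
ggplot(prox_mds, aes(V1, V2, colour = area)) + geom_point()
```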
---
class: split-40

.column[.pad50px[
# Diagnostics - vote matrix

<br>

Examining the vote matrix allows us to see which samples the algorithm had trouble classifying.

.monash-orange2[Look at rows 3 and 5. How confident would you be in the classifications of these two observations?]

]]
.column[.content.vmiddle[

```
## # A tibble: 10 × 4
##         `1`      `2`      `3`      `4`
##    <matrix> <matrix> <matrix> <matrix>
##  1 0.704433  0.01970  0.00000  0.27586
##  2 0.761628  0.03488  0.00000  0.20349
##  3 0.615789  0.01579  0.00000  0.36842
##  4 0.010638  0.73404  0.18617  0.06915
##  5 0.059140  0.48387  0.34409  0.11290
##  6 0.000000  0.77540  0.10160  0.12299
##  7 0.011364  0.92614  0.02273  0.03977
##  8 0.005556  0.69444  0.08889  0.21111
##  9 0.033708  0.56180  0.02247  0.38202
## 10 0.039106  0.33520  0.00000  0.62570
```

]]

---
class: split-50
layout: false

.column[.pad10px[
<img src="images/lecture-07a/unnamed-chunk-12-1.png" width="100%" style="display: block; margin: auto;" />
]]
.column[.top50px[
<img src="images/lecture-07a/unnamed-chunk-13-1.png" width="100%" style="display: block; margin: auto;" />
]]

---
# From random forests to boosting

Whereas random forests build an ensemble of .monash-blue2[deep independent trees], .monash-orange2[boosted trees] build an ensemble of .monash-orange2[shallow trees in sequence], with each tree learning from and improving on the previous one.

<center>
<img src="https://bradleyboehmke.github.io/HOML/images/boosted-trees-process.png" style="width: 90%"/>
</center>

.font_tiny[Source: [Boehmke (2020) Hands on Machine Learning with R](https://bradleyboehmke.github.io/HOML/)]

---
# Boosted trees - the algorithm

Boosting iteratively fits multiple trees, sequentially putting .monash-orange2[more weight] on observations that have been predicted inaccurately.

1. Set `\(\hat{f}(x)=0\)` and `\(r_i=y_i\)` for all `\(i\)` in the training set
2. For b=1, 2, ..., B, repeat:<br>
  a. Fit a tree `\(\hat{f}^b\)` with `\(d\)` splits ( `\(d+1\)` terminal nodes)<br>
  b. Update `\(\hat{f}\)` by adding a shrunken version of the new tree, `\(\hat{f}(x) = \hat{f}(x)+\lambda\hat{f}^b(x)\)` <br>
  c. Update the residuals `\(r_i = r_i - \lambda\hat{f}^b(x_i)\)` <br>
3. Output the boosted model, `\(\hat{f}(x) = \sum_{b=1}^B\lambda\hat{f}^b(x)\)`

<br><br>
Read a fun explanation of boosting here by [Harriet Mason](https://numbat.space/post/boosting/).
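---
# Boosting by hand - a code sketch

The algorithm on the previous slide can be written out directly for a regression problem. This is a minimal sketch, assuming a hypothetical data frame `df` with a numeric response `y`; it repeatedly fits a shallow `rpart` tree to the current residuals and adds a shrunken (`lambda`) version of it. In practice a package such as `gbm` or `xgboost` would be used.

```r
library(rpart)

# df: hypothetical data frame with a numeric response column y
boost_trees <- function(df, B = 100, d = 2, lambda = 0.01) {
  r <- df$y                         # residuals start as the response itself
  trees <- vector("list", B)
  for (b in 1:B) {
    df_b <- df
    df_b$y <- r
    # fit a shallow tree to the current residuals
    trees[[b]] <- rpart(y ~ ., data = df_b,
                        control = rpart.control(maxdepth = d, cp = 0))
    # update the residuals with a shrunken version of this tree
    r <- r - lambda * predict(trees[[b]], df_b)
  }
  trees
}

# the boosted prediction is the sum of the shrunken trees
predict_boosted <- function(trees, newdata, lambda = 0.01) {
  rowSums(sapply(trees, function(t) lambda * predict(t, newdata)))
}
```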
---
# Boosting a regression tree - watch this!

StatQuest by Josh Starmer

<br>
<p align="center">
<iframe width="750" height="421" src="https://www.youtube.com/embed/3CC4N4z3GJc" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</p>

---
# Boosting a classification tree - watch this!

StatQuest by Josh Starmer

<p align="center">
<iframe width="708" height="398" src="https://www.youtube.com/embed/jxuNLH5dXCs" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</p>

---
# More resources

Cook & Swayne (2007) "Interactive and Dynamic Graphics for Data Analysis: With Examples Using R and GGobi" have several videos illustrating techniques for exploring high-dimensional data in association with trees and forest classifiers:

- [Trees video](http://www.ggobi.org/book/chap-class/Trees.mov)
- [Forests video](http://www.ggobi.org/book/chap-class/Forests.mov)

---
background-size: cover
class: title-slide
background-image: url("images/bg-02.png")

<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.

.bottom_abs.width100[
Lecturer: *Professor Di Cook*

Department of Econometrics and Business Statistics

<i class="fas fa-envelope"></i> ETC3250.Clayton-x@monash.edu

<i class="fas fa-calendar-alt"></i> Week 7a

<br>
]