class: middle center hide-slide-number monash-bg-gray80

.info-box.w-50.bg-white[
These slides are best viewed in Chrome or Firefox, and occasionally need to be refreshed if elements did not load properly. See <a href="lecture-06b.pdf">here for the PDF <i class="fas fa-file-pdf"></i></a>.
]

<br>
.white[Press the **right arrow** to progress to the next slide!]

---
class: title-slide
count: false
background-image: url("images/bg-02.png")

# .monash-blue[ETC3250/5250: Introduction to Machine Learning]

<h1 class="monash-blue" style="font-size: 30pt!important;"></h1>

<br>

<h2 style="font-weight:900!important;">Regression Trees</h2>

.bottom_abs.width100[
Lecturer: *Professor Di Cook*

Department of Econometrics and Business Statistics

<i class="fas fa-envelope"></i> ETC3250.Clayton-x@monash.edu

<i class="fas fa-calendar-alt"></i> Week 6b

<br>
]

---
# Difference with classification tree

The split criterion needs to use a quantitative response instead of a categorical one:

`$$\mbox{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$`

Split the data where the combined MSE of the left bucket (`\(\mbox{MSE}_L\)`) and the right bucket (`\(\mbox{MSE}_R\)`) gives the biggest reduction from the overall MSE.

<br>

Note that in regression trees `\(\hat{y} = \bar{y}\)`, the mean of the observations in the node.

---
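## Computing a split - sketch

A minimal sketch of the split criterion above, for a single numeric predictor: every midpoint between consecutive distinct predictor values is a candidate split, and we keep the one with the smallest combined MSE (equivalently, the biggest drop from the overall MSE). The function, the data frame `d` and the columns `x`, `y` are hypothetical names, not part of the lecture code.

```r
# Find the split point on x that minimizes the combined MSE of the
# left and right buckets
best_split <- function(d) {
  xs <- sort(unique(d$x))
  # candidate split points: midpoints between consecutive x values
  cand <- (xs[-1] + xs[-length(xs)]) / 2
  mse <- function(y) mean((y - mean(y))^2)
  combined <- sapply(cand, function(s) {
    left  <- d$y[d$x <  s]
    right <- d$y[d$x >= s]
    # observation-weighted combination of MSE_L and MSE_R
    (length(left) * mse(left) + length(right) * mse(right)) / nrow(d)
  })
  cand[which.min(combined)]
}
```

Applied to `Years` and the log salary from the `Hitters` data used on the following slides, this should recover the `Years < 4.5` split that `rpart` finds first.

---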
## Predicting Salary

A regression tree to predict the log salary (`lSalary`) of a baseball player, given their `Years` of playing and number of `Hits`.

```
## parsnip model object
## 
## n= 263 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 263 39.071620 2.574160
##   2) Years< 4.5 90 7.988302 2.217851 *
##   3) Years>=4.5 173 13.713070 2.759523
##     6) Hits< 117.5 90 5.298802 2.605063 *
##     7) Hits>=117.5 83 3.938792 2.927009 *
```

---
class: split-two

.column[.pad50px[

## Predicting Salary

<br>

Using the function `rpart`, we can build a regression tree to predict the log salary (`lSalary`) of a baseball player, given their `Years` of playing and number of `Hits`.

]]

.column[.content.vmiddle.center[

<img src="images/lecture-06b/unnamed-chunk-5-1.png" width="100%" style="display: block; margin: auto;" />

]]

---
## Regions of the decision tree

<img src="images/lecture-06b/unnamed-chunk-6-1.png" width="60%" style="display: block; margin: auto;" />

---
class: split-60

.column[.pad50px[

## Deeper trees

<br>

By decreasing the value of the complexity parameter `cp`, we can build deeper trees.

```r
# Fit a regression tree
rpart_mod2 <- decision_tree(cost_complexity = 0.012) %>%
  set_engine("rpart") %>%
  set_mode("regression") %>%
  translate()
hitters_fit2 <- rpart_mod2 %>%
  fit(lSalary ~ Hits + Years, data = Hitters)
```

]]

.column[.content.vmiddle[

<img src="images/lecture-06b/unnamed-chunk-9-1.png" width="90%" style="display: block; margin: auto;" />

]]

---
## Regions

<img src="images/lecture-06b/unnamed-chunk-10-1.png" width="60%" style="display: block; margin: auto;" />

---
## Regression trees - construction

- We divide the predictor space - that is, the set of possible values for `\(X_1, X_2, \dots, X_p\)` - into `\(M\)` .monash-orange2[distinct] and .monash-orange2[non-overlapping] regions, `\(R_1, R_2, \dots, R_M\)`.
- The regions could have any shape. However, for simplicity and for ease of interpretation, we divide the predictor space into high-dimensional .monash-orange2[rectangles].
- We model the response as a constant `\(c_m\)` in each region, `\(f(x) = \sum_{m = 1}^M c_m ~ I(x \in R_m)\)`, e.g.

`$${R_1} = \{X | \mbox{Years} < 4.5 \}$$`
`$${R_2} = \{X | \mbox{Years} \geq 4.5, \mbox{Hits} < 117.5 \}$$`
`$${R_3} = \{X | \mbox{Years} \geq 4.5, \mbox{Hits} \geq 117.5 \}$$`

---
class: split-two

.column[.pad50px[

## Leaves and Branches

<br>

- `\(R_1\)`, `\(R_2\)`, `\(R_3\)` are .monash-orange2[terminal nodes] or .monash-orange2[leaves].
- The points where we split are .monash-orange2[internal nodes].
- The segments that connect the nodes are .monash-orange2[branches].

]]

.column[.content.vmiddle[

<img src="images/lecture-06b/unnamed-chunk-11-1.png" width="90%" style="display: block; margin: auto;" />

]]

---
class: split-two

.column[.pad50px[

### Linear regression

`$$\small{f(X) = \beta_0 + \sum_{j = 1}^p X_j \beta_j}$$`

<br>

<a href="http://www-bcf.usc.edu/~gareth/ISL/Chapter2/2.4.pdf" target="_BLANK"> <img src="images/lecture-06b/2.4.png" style="width: 80%; align: center"/></a>

.font_tiny[(Chapter 2/2.4)]

]]

.column[.pad50px[

### Regression trees

`$$\small{f(X) = \sum_{m = 1}^M c_m ~ I(X \in R_m)}$$`

<br>

<a href="http://www-bcf.usc.edu/~gareth/ISL/Chapter8/8.3.pdf" target="_BLANK"> <img src="images/lecture-06b/8.3a.png" style="width: 80%; align: center"/> </a>

]]

---
## Strategy for finding good splits

<br>

- .monash-orange2[Top-down]: it begins at the top of the tree (all observations belong to a single region) and then successively splits the predictor space; each split is indicated via two new branches further down the tree.
- .monash-orange2[Greedy]: at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.

---
## Algorithm

<br>

1. Start with a single region `\(R_1\)` (the entire input space), and iterate:

    a. Select a region `\(R_m\)`, a predictor `\(X_j\)`, and a splitting point `\(s\)`, such that splitting `\(R_m\)` with the criterion `\(X_j < s\)` produces the largest decrease in RSS.

    b. Redefine the regions with this additional split.

2. Continue until a stopping criterion is reached.

---
## Stopping criterion

<br>

- `\(N_m < a\)`: the number of observations in `\(R_m\)` is too small for further splitting (`minsplit`). (There is usually another control criterion: even if `\(N_m\)` is large enough, you can't split off a very small number of observations, e.g. 1 and `\(N_m-1\)`; this is `minbucket`.)
- RSS `\(< tol\)`: the reduction in error is too small to bother splitting further. (The `cp` parameter in `rpart` measures this as a proportional drop - see the earlier examples displaying the change in this parameter.)

---
## Model fit

<br>

.monash-blue2[Residual Sum of Squared Error]

`$$\mbox{RSS}(T) = \sum_{m = 1}^{|T|} \sum_{x_i \in R_m} (y_i - \hat{y}_m)^2$$`

where `\(|T|\)` is the number of terminal nodes in `\(T\)`, and remember `\(\hat{y}=\bar{y}\)`. The MSE is obtained by dividing by `\(n\)`, and the RMSE by taking the square root.

---
## Size of tree

<br>

- It is possible to produce good predictions on the **training set**, but this is likely to .monash-orange2[overfit] the data (trees are very flexible).
- A smaller tree with fewer splits (that is, fewer regions) might lead to .monash-orange2[lower variance] and better interpretation at the cost of a .monash-orange2[little bias].
- Tree size is a tuning parameter governing the **model’s complexity**, and the optimal tree size should be adaptively chosen from the data.
- Producing splits only if the RSS decrease exceeds some **(high) threshold** can stop the fitting too early: a low-gain split early on might be followed by a very good split later.

---
## Pruning

Grow a big tree, `\(T_0\)`, and then **prune** it back. The *pruning* procedure is:

- Starting with the initial full tree `\(T_0\)`, replace a subtree with a leaf node to obtain a new tree `\(T_1\)`. Select the subtree to prune by minimizing `$$\frac{ \text{RSS}(T_1) - \text{RSS}(T_0) }{|T_0| - |T_1| }$$`
- Iteratively prune to obtain a sequence `\(T_0, T_1, T_2, \dots, T_{R}\)` where `\(T_{R}\)` is the tree with a single leaf node.
- Select the optimal tree `\(T_m\)` by cross-validation.

---
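## Choosing `cp` by cross-validation - sketch

The complexity parameter and the minimum node size can be tuned by cross-validation, which is how the model selection on the next slide is done with the `tune` package. Below is a minimal sketch of how such a tuning might be set up; the seed, number of folds and grid size are placeholders rather than the exact values used for these slides, and it assumes the `Hitters` data with `lSalary` as prepared for the earlier fits.

```r
library(tidymodels)

set.seed(1110)                              # placeholder seed
hitters_folds <- vfold_cv(Hitters, v = 10)  # 10-fold CV splits

tune_spec <- decision_tree(cost_complexity = tune(), min_n = tune()) %>%
  set_engine("rpart") %>%
  set_mode("regression")

# Regular grid over the two tuning parameters
tree_grid <- grid_regular(cost_complexity(), min_n(), levels = 4)

tree_res <- tune_grid(tune_spec, lSalary ~ Hits + Years,
                      resamples = hitters_folds, grid = tree_grid)

select_best(tree_res, metric = "rmse")      # best (cost_complexity, min_n)
```

The selected values are then plugged back into `decision_tree()` and the tree is refitted to the full training data.

---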
## Model selection

Using the `tune` package in tidymodels.

<img src="images/lecture-06b/unnamed-chunk-12-1.png" width="50%" style="display: block; margin: auto;" />

```
## # A tibble: 1 × 3
##   cost_complexity min_n .config
##             <dbl> <int> <chr>
## 1    0.0000000001    30 Preprocessor1_Model16
```

---

Yielding this model:

<img src="images/lecture-06b/unnamed-chunk-15-1.png" width="80%" style="display: block; margin: auto;" />

---
# Reminder of what the training data looks like

<img src="images/lecture-06b/unnamed-chunk-16-1.png" width="70%" style="display: block; margin: auto;" />

---
class: transition middle center

# Summary

Regression trees can provide a very flexible model.

---
background-size: cover
class: title-slide
background-image: url("images/bg-02.png")

<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.

.bottom_abs.width100[
Lecturer: *Professor Di Cook*

Department of Econometrics and Business Statistics

<i class="fas fa-envelope"></i> ETC3250.Clayton-x@monash.edu

<i class="fas fa-calendar-alt"></i> Week 6b

<br>
]