ETC3250/5250 Introduction to Machine Learning

Week 5: Trees and forests

Professor Di Cook

etc3250.clayton-x@monash.edu

Department of Econometrics and Business Statistics

Overview

We will cover:

Classification trees, algorithm, stopping rules
Difference between algorithm and parametric methods, especially trees vs LDA
Forests: ensembles of bagged trees
Diagnostics: vote matrix, variable importance, proximity
Boosted trees

Trees

Nice explanation of trees, forests, boosted trees

Algorithm: growing a tree

All observations in a single set
Sort values on first variable
Compute the chosen split criteria for all possible splits into two sets
Choose the best split on this variable. Save this info.
Repeat 2-4 for all other variables
Choose the best variable to split on, based on the best split. Your data is now in two sets.
Repeat 1-6 on each subset.
Stop when stopping rule that decides that the best classification model is achieved.

Pros and cons:

Trees are a very flexible way to fit a classifier.
They can
- utilise different types of predictor variables
- ignore missing values
- handle different units or scales on variables
- capture intricate patterns
However, they operate on a per variable basis, and do not effectively model separation when a combination of variables is needed.

Common split criteria

Classification

The Gini index measures is defined as: \[G = \sum_{k =1}^K \widehat{p}_{mk}(1 - \widehat{p}_{mk})\]
Entropy is defined as \[D = - \sum_{k =1}^K \widehat{p}_{mk} log(\widehat{p}_{mk})\] What corresponds to a high value, and what corresponds to a low value?

Regression

Define

\[\mbox{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \widehat{y}_i)^2\]

Split the data where combining MSE for left bucket (MSE_L) and right bucket (MSE_R), makes the biggest reduction from the overall MSE.

Illustration (1/2)

x	cl
11	A
33	A
39	B
44	A
50	A
56	B
70	B

Note: x is sorted from lowest to highest!

All possible splits shown by vertical lines

What do you think is the best split? 2, 3 or 5??

Illustration (2/2)

Calculate the impurity for split 5

The left bucket is

x	cl
11	A
33	A
39	B
44	A
50	A

and the right bucket is

x	cl
56	B
70	B

Using Gini $G = \sum_{k =1}^K \widehat{p}_{mk}(1 - \widehat{p}_{mk})$

Left bucket:

\[\widehat{p}_{LA} = 4/5, \widehat{p}_{LB} = 1/5, ~~ p_L = 5/7\]

\[G_L=0.8(1-0.8)+0.2(1-0.2) = 0.32\]

Right bucket:

\[\widehat{p}_{RA} = 0/2, \widehat{p}_{RB} = 2/2, ~~ p_R = 2/7\]

\[G_R=0(1-0)+1(1-1) = 0\] Combine with weighted sum to get impurity for the split:

\[5/7G_L + 2/7G_R=0.23\]

Your turn: Compute the impurity for split 2.

Splits on categorical variables

Possible best split would be if koala then assign to Vic else assign to WA, because Vic has more koalas but and WA has more emus and roos.

Dealing with missing values on predictors

x1	x2	x3	x4	y
19	-8	22	-24	A
NA	-10	26	-26	A
15	NA	32	-27	B
17	-6	27	-25	A
18	-5	NA	-23	A
13	-3	37	NA	B
12	-1	35	-30	B
11	-7	24	-31	B

50% of cases have missing values. Trees ignore missings only on a single variable.

Every other method ignores a full observation if missing on any variable. That is, would only be able to use half the data.

Example: penguins 1/3

set.seed(1156)
p_split <- initial_split(p_sub, 2/3, strata=species)
p_tr <- training(p_split)
p_ts <- testing(p_split)

tree_spec <- decision_tree() |>
  set_mode("classification") |>
  set_engine("rpart")

p_fit_tree <- tree_spec |>
  fit(species~., data=p_tr)

p_fit_tree

parsnip model object

n= 145 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 145 45 Adelie (0.690 0.310)  
  2) bl< 43 99  2 Adelie (0.980 0.020) *
  3) bl>=43 46  3 Chinstrap (0.065 0.935) *

Can you draw the tree?

Stopping rules

Minimum split: number of observations in a node, in order for a split to be made
Minimum bucket: Minimum number of observations allowed in a terminal node
Complexity parameter: minimum difference between impurity values required to continue splitting

Example: penguins 2/3

Defaults for rpart are:

rpart.control(minsplit = 20, 
  minbucket = round(minsplit/3), 
  cp = 0.01, 
  maxcompete = 4, 
  maxsurrogate = 5, 
  usesurrogate = 2, 
  xval = 10,
  surrogatestyle = 0, maxdepth = 30, 
  ...)

tree_spec <- decision_tree() |>
  set_mode("classification") |>
  set_engine("rpart",
             control = rpart.control(minsplit = 10), 
             model=TRUE)

p_fit_tree <- tree_spec |>
  fit(species~., data=p_tr)

p_fit_tree

parsnip model object

n= 145 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 145 45 Adelie (0.690 0.310)  
   2) bl< 43 99  2 Adelie (0.980 0.020)  
     4) bl< 41 75  0 Adelie (1.000 0.000) *
     5) bl>=41 24  2 Adelie (0.917 0.083)  
      10) bm>=3.4e+03 21  0 Adelie (1.000 0.000) *
      11) bm< 3.4e+03 3  1 Chinstrap (0.333 0.667) *
   3) bl>=43 46  3 Chinstrap (0.065 0.935)  
     6) bl< 46 10  3 Chinstrap (0.300 0.700)  
      12) bm>=3.8e+03 3  0 Adelie (1.000 0.000) *
      13) bm< 3.8e+03 7  0 Chinstrap (0.000 1.000) *
     7) bl>=46 36  0 Chinstrap (0.000 1.000) *

Example: penguins 3/3

p_fit_tree |>
  extract_fit_engine() |>
  rpart.plot(type=3, extra=1)

Example: penguins 3/4

Model fit summary

# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.946

# A tibble: 2 × 4
# Groups:   species [2]
  species   Adelie Chinstrap Accuracy
  <fct>      <int>     <int>    <dbl>
1 Adelie        50         1    0.980
2 Chinstrap      3        20    0.870

# A tibble: 1 × 3
  .metric      .estimator .estimate
  <chr>        <chr>          <dbl>
1 bal_accuracy binary         0.925

Can you see the misclassified test cases?

Model-in-the-data-space

Comparison with LDA

Tree model

Data-driven, only split on single variables

LDA model

Assume normal, equal VC, oblique splits

Random forests

Overview

A random forest is an ensemble classifier, built from fitting multiple trees to different subsets of the training data.

Bagging and variable sampling

Take $B$ different bootstrapped training sets: $D_1, D_2, \dots, D_B$, each using a sample of variables.
Build a separate prediction model using each $D_{(\cdot)}$: \[\widehat{f}_1(x), \widehat{f}_2(x), \dots, \widehat{f}_B(x)\]
Predict the out-of-bag cases for each tree, compute proportion of trees a case was predicted to be each class.
Predicted value for each observation is the class with the highest proportion.

Each individual tree has high variance.
Aggregating the results from $B$ trees reduces the variance.

Comparison with a single tree and LDA

Tree model

Data-driven, only split on single variables

Random forest

Data-driven, multiple trees gives non-linear fit

LDA model

Assume normal, equal VC, oblique splits

Random forest fit and predicted values

Fit

rf_spec <- rand_forest(mtry=2, trees=1000) |>
  set_mode("classification") |>
  set_engine("randomForest")
p_fit_rf <- rf_spec |> 
  fit(species ~ ., data = p_tr)

parsnip model object


Call:
 randomForest(x = maybe_data_frame(x), y = y, ntree = ~1000, mtry = min_cols(~2,      x)) 
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 2

        OOB estimate of  error rate: 4.8%
Confusion matrix:
          Adelie Chinstrap class.error
Adelie        96         4       0.040
Chinstrap      3        42       0.067

Predicted values

# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.973

# A tibble: 2 × 4
# Groups:   species [2]
  species   Adelie Chinstrap Accuracy
  <fct>      <int>     <int>    <dbl>
1 Adelie        51         0    1    
2 Chinstrap      2        21    0.913

# A tibble: 1 × 3
  .metric      .estimator .estimate
  <chr>        <chr>          <dbl>
1 bal_accuracy binary         0.957

Warning: Don’t use the predict() on the training set, you’ll always get 0 error. The object p_fit_rf$fit$predict has the fitted values.

Diagnostics

Error is computed automatically on the out-of-bag cases.
Vote matrix, $n\times K$: Proportion of times a case is predicted to the class $k$. Also consider these to be predictive probabilities.
Variable importance: uses permutation!
Proximities, $n\times n$: Closeness of cases measured by how often they are in the same terminal node.

Vote Matrix

Proportion of trees the case is predicted to be each class, ranges between 0-1
Can be used to identify troublesome cases.
Used with plots of the actual data can help determine if it is the record itself that is the problem, or if method is biased.
Understand the difference in accuracy of prediction for different classes.

p_fit_rf$fit$votes

    Adelie Chinstrap
1   1.0000    0.0000
2   1.0000    0.0000
3   0.9807    0.0193
4   1.0000    0.0000
5   1.0000    0.0000
6   1.0000    0.0000
7   1.0000    0.0000
8   0.3982    0.6018
9   1.0000    0.0000
10  1.0000    0.0000
11  1.0000    0.0000
12  0.8274    0.1726
13  0.3425    0.6575
14  1.0000    0.0000
15  1.0000    0.0000
16  0.7931    0.2069
17  1.0000    0.0000
18  1.0000    0.0000
19  0.9973    0.0027
20  1.0000    0.0000
21  0.7622    0.2378
22  1.0000    0.0000
23  0.9459    0.0541
24  1.0000    0.0000
25  1.0000    0.0000
26  0.8568    0.1432
27  1.0000    0.0000
28  1.0000    0.0000
29  1.0000    0.0000
30  1.0000    0.0000
31  1.0000    0.0000
32  1.0000    0.0000
33  1.0000    0.0000
34  1.0000    0.0000
35  1.0000    0.0000
36  1.0000    0.0000
37  1.0000    0.0000
38  1.0000    0.0000
39  1.0000    0.0000
40  1.0000    0.0000
41  1.0000    0.0000
42  1.0000    0.0000
43  1.0000    0.0000
44  1.0000    0.0000
45  0.2773    0.7227
46  1.0000    0.0000
47  0.9821    0.0179
48  1.0000    0.0000
49  0.9973    0.0027
50  1.0000    0.0000
51  1.0000    0.0000
52  1.0000    0.0000
53  1.0000    0.0000
54  1.0000    0.0000
55  1.0000    0.0000
56  1.0000    0.0000
57  1.0000    0.0000
58  1.0000    0.0000
59  1.0000    0.0000
60  1.0000    0.0000
61  0.9833    0.0167
62  1.0000    0.0000
63  0.9113    0.0887
64  1.0000    0.0000
65  1.0000    0.0000
66  1.0000    0.0000
67  1.0000    0.0000
68  1.0000    0.0000
69  0.9912    0.0088
70  1.0000    0.0000
71  0.9535    0.0465
72  0.9914    0.0086
73  1.0000    0.0000
74  0.9676    0.0324
75  1.0000    0.0000
76  1.0000    0.0000
77  1.0000    0.0000
78  1.0000    0.0000
79  1.0000    0.0000
80  1.0000    0.0000
81  1.0000    0.0000
82  0.9973    0.0027
83  1.0000    0.0000
84  1.0000    0.0000
85  1.0000    0.0000
86  0.4624    0.5376
87  0.6160    0.3840
88  1.0000    0.0000
89  1.0000    0.0000
90  1.0000    0.0000
91  1.0000    0.0000
92  0.9948    0.0052
93  1.0000    0.0000
94  0.9972    0.0028
95  1.0000    0.0000
96  1.0000    0.0000
97  1.0000    0.0000
98  1.0000    0.0000
99  1.0000    0.0000
100 1.0000    0.0000
101 0.0000    1.0000
102 0.0000    1.0000
103 0.0055    0.9945
104 0.0653    0.9347
105 0.0000    1.0000
106 0.0000    1.0000
107 0.0000    1.0000
108 0.0000    1.0000
109 0.0000    1.0000
110 0.1935    0.8065
111 0.0000    1.0000
112 0.0000    1.0000
113 0.0159    0.9841
114 0.0000    1.0000
115 0.0000    1.0000
116 0.2074    0.7926
117 0.0000    1.0000
118 0.0000    1.0000
119 0.0117    0.9883
120 1.0000    0.0000
121 0.0529    0.9471
122 0.9536    0.0464
123 0.0027    0.9973
124 0.0000    1.0000
125 0.0000    1.0000
126 0.0163    0.9837
127 0.0000    1.0000
128 0.0000    1.0000
129 0.0111    0.9889
130 0.0000    1.0000
131 0.0694    0.9306
132 0.0000    1.0000
133 0.0137    0.9863
134 0.0000    1.0000
135 0.0000    1.0000
136 0.0052    0.9948
137 0.0000    1.0000
138 0.0000    1.0000
139 0.0135    0.9865
140 0.0000    1.0000
141 0.0140    0.9860
142 0.0000    1.0000
143 0.6325    0.3675
144 0.0000    1.0000
145 0.0000    1.0000
attr(,"class")
[1] "matrix" "array"  "votes"

Curious

Where are the Adelie penguins in the training set that are misclassified?

parsnip model object


Call:
 randomForest(x = maybe_data_frame(x), y = y, ntree = ~1000, mtry = min_cols(~2,      x)) 
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 2

        OOB estimate of  error rate: 4.8%
Confusion matrix:
          Adelie Chinstrap class.error
Adelie        96         4       0.040
Chinstrap      3        42       0.067

Join data containing true, predicted and predictive probabilities, to diagnose.

# A tibble: 7 × 6
  species      bl    bm pspecies  Adelie Chinstrap
  <fct>     <dbl> <int> <fct>      <dbl>     <dbl>
1 Adelie     41.1  3200 Chinstrap  0.398    0.602 
2 Adelie     46    4200 Chinstrap  0.342    0.658 
3 Adelie     45.8  4150 Chinstrap  0.277    0.723 
4 Adelie     44.1  4000 Chinstrap  0.462    0.538 
5 Chinstrap  40.9  3200 Adelie     1        0     
6 Chinstrap  42.5  3350 Adelie     0.954    0.0464
7 Chinstrap  43.5  3400 Adelie     0.632    0.368

Variable importance (1/2)

For every tree predict the oob cases and count the number of votes cast for the correct class.

Randomly permute the values on a variable in the oob cases and predict the class for these cases.

3.Difference the votes for the correct class in the variable-permuted oob cases and the real oob cases. Average this number over all trees in the forest. If the value is large, then the variable is very important.

Alternatively, Gini importance adds up the difference in impurity value of the descendant nodes with the parent node. Quick to compute.

Read a fun explanation by Harriet Mason

Variable importance (2/2)

p_fit_rf$fit$importance

   MeanDecreaseGini
bl             57.2
bm              4.5

p_tr_perm <- p_tr |>
  mutate(bl = sample(bl))
ggplot(p_tr_perm, aes(x=bl, y=bm, colour=species)) +
  geom_point()  +
  scale_color_discrete_divergingx(palette = "Zissou 1") +
  ggtitle("Permuted bl") +
  theme(legend.position="none")

Votes will be close to 0.5 for both classes.

Proximities

Measure how each pair of observations land in the forest
Run both in- and out-of-bag cases down the tree, and increase proximity value of cases $i, j$ by 1 each time they are in the same terminal node.
Normalize by dividing by $B$.

This creates a similarity matrix between all pairs of observations.

Use this for cluster analysis of the data for further diagnosing unusual observations, and model inadequacies.

Utilising diagnostics (1/3)

The votes matrix yields more information than the confusion matrix, about the confidence that the model has in the prediction for each observation, in the training set.

It is a $K$-D object, but lives in $(K-1)$-D because the rows add to 1.

Let’s re-fit the random forest model to the three species of the penguins.

p_ternary

Utilising diagnostics (2/3)

DEMO: Use interactivity to investigate the uncertainty in the predictions.

library(detourr)
library(crosstalk)
library(plotly)
library(viridis)
p_tr2_std <- p_tr2 |>
  mutate_if(is.numeric, function(x) (x-mean(x))/sd(x))
p_tr2_v <- bind_cols(p_tr2_std, p_rf_v_p[,1:2]) 
p_tr2_v_shared <- SharedData$new(p_tr2_v)

detour_plot <- detour(p_tr2_v_shared, tour_aes(
  projection = bl:bm,
  colour = species)) |>
    tour_path(grand_tour(2), 
                    max_bases=50, fps = 60) |>
       show_scatter(alpha = 0.9, axes = FALSE,
                    width = "100%", 
                    height = "450px",
                    palette = hcl.colors(3,
                      palette="Zissou 1"))

vot_mat <- plot_ly(p_tr2_v_shared, 
                    x = ~x1,
                    y = ~x2,
                    color = ~species,
                    colors = hcl.colors(3,
                        palette="Zissou 1"),
                    height = 450) |>
  highlight(on = "plotly_selected", 
              off = "plotly_doubleclick") %>%
    add_trace(type = "scatter", 
              mode = "markers")
  
bscols(
     detour_plot, vot_mat,
     widths = c(5, 6)
 )

Utilising diagnostics (3/3)

Variable importance can help with variable selection.

p_fit_rf2$fit$importance

   MeanDecreaseGini
bl               58
bd               28
fl               45
bm               12

Top two variables are bl and fl.

Especially useful when you have many more variables.

Boosted trees (1/3)

Random forests build an ensemble of independent trees, while boosted trees build an ensemble from shallow trees in a sequence with each tree learning and improving on the previous one, by re-weighting observations to give mistakes more importance.

Source: Boehmke (2020) Hands on Machine Learning with R

Boosted trees (2/3)

Boosting iteratively fits multiple trees, sequentially putting more weight on observations that have predicted inaccurately.

Set weights (probabilities) for all observations in training set ( according to class sample sizes using log odds ratio). Fit a tree with fixed $d$ splits ( $d+1$ terminal nodes).
For b=1, 2, …, B, repeat:
1. Compute fitted values
2. Compute pseudo-residuals
3. Fit the tree to the residuals
4. Compute new weights (probabilities)
Aggregate predictions from all trees.

This StatQuest video by Josh Starmer, is the best explanation!

And this is a fun explanation of boosting by Harriet Mason.

Boosted trees (3/3)

set.seed(1110)
bt_spec <- boost_tree() |>
  set_mode("classification") |>
  set_engine("xgboost")
p_fit_bt <- bt_spec |> 
  fit(species ~ ., data = p_tr2)

# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy multiclass     0.991

# A tibble: 3 × 4
# Groups:   species [3]
  species   Adelie Chinstrap Accuracy
  <fct>      <int>     <int>    <dbl>
1 Adelie        50         1    0.980
2 Chinstrap      0        23    1    
3 Gentoo         0         0    1

Limitations of trees

Most implementations only splits on a single variable, not combinations.
There are versions that build trees on combinations, eg PPTreeViz and PPforest, but you lose interpretability, and fitting is more difficult.
Sees only splits, but not gaps. (See support vector machines, in a few weeks.)
Algorithm takes variables in order, and splits in order, and will use first as best.
Need tuning and cross-validation.

x1	x2	x3	x4	y
19	-8	22	-24	A
NA	-10	26	-26	A
15	NA	32	-27	B
17	-6	27	-25	A
18	-5	NA	-23	A
13	-3	37	NA	B
12	-1	35	-30	B
11	-7	24	-31	B

x1	x2	x3	x4	y
19	-8	22	-24	A
NA	-10	26	-26	A
15	NA	32	-27	B
17	-6	27	-25	A
18	-5	NA	-23	A
13	-3	37	NA	B
12	-1	35	-30	B
11	-7	24	-31	B

ETC3250/5250 Introduction to Machine Learning

Overview

Trees

Algorithm: growing a tree

Common split criteria

Classification

Regression

Illustration (1/2)

Illustration (2/2)

Calculate the impurity for split 5

Example: penguins 1/3

Stopping rules

Example: penguins 2/3

Example: penguins 3/3

Example: penguins 3/4

Comparison with LDA

Random forests

Overview

Bagging and variable sampling

Comparison with a single tree and LDA

Random forest fit and predicted values

Diagnostics

Vote Matrix

Curious

Variable importance (1/2)

Variable importance (2/2)

Proximities

Utilising diagnostics (1/3)

Utilising diagnostics (2/3)

Utilising diagnostics (3/3)

Boosted trees (1/3)

Boosted trees (2/3)

Boosted trees (3/3)

Limitations of trees

Next: Neural networks and deep learning

x1	x2	x3	x4	y
19	-8	22	-24	A
NA	-10	26	-26	A
15	NA	32	-27	B
17	-6	27	-25	A
18	-5	NA	-23	A
13	-3	37	NA	B
12	-1	35	-30	B
11	-7	24	-31	B