ETC3250/5250 Introduction to Machine Learning

Week 5: Trees and forests

Professor Di Cook

Department of Econometrics and Business Statistics

Overview

We will cover:

  • Classification trees, algorithm, stopping rules
  • Difference between algorithm and parametric methods, especially trees vs LDA
  • Forests: ensembles of bagged trees
  • Diagnostics: vote matrix, variable importance, proximity
  • Boosted trees

Trees

Nice explanation of trees, forests, boosted trees

Algorithm: growing a tree

  1. All observations in a single set
  2. Sort values on first variable
  3. Compute the chosen split criteria for all possible splits into two sets
  4. Choose the best split on this variable. Save this info.
  5. Repeat 2-4 for all other variables
  6. Choose the best variable to split on, based on the best split. Your data is now in two sets.
  7. Repeat 1-6 on each subset.
  8. Stop when stopping rule that decides that the best classification model is achieved.

Pros and cons:

  • Trees are a very flexible way to fit a classifier.
  • They can
    • utilise different types of predictor variables
    • ignore missing values
    • handle different units or scales on variables
    • capture intricate patterns
  • However, they operate on a per variable basis, and do not effectively model separation when a combination of variables is needed.

Common split criteria

Classification

  • The Gini index measures is defined as: \[G = \sum_{k =1}^K \widehat{p}_{mk}(1 - \widehat{p}_{mk})\]
  • Entropy is defined as \[D = - \sum_{k =1}^K \widehat{p}_{mk} log(\widehat{p}_{mk})\] What corresponds to a high value, and what corresponds to a low value?

Regression

Define

\[\mbox{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \widehat{y}_i)^2\]

Split the data where combining MSE for left bucket (MSE_L) and right bucket (MSE_R), makes the biggest reduction from the overall MSE.

Illustration (1/2)

x cl
11 A
33 A
39 B
44 A
50 A
56 B
70 B

Note: x is sorted from lowest to highest!

All possible splits shown by vertical lines