---
title: "ETC3250/5250 Tutorial 6 Instructions"
subtitle: "Tree models"
author: "prepared by Professor Di Cook"
date: "Week 6"
output:
  html_document:
    after_body: tutorial-footer.html
    css: tutorial.css
---
```{r, echo = FALSE, message = FALSE, warning = FALSE}
knitr::opts_chunk$set(
message = FALSE,
warning = FALSE,
error = FALSE,
eval = FALSE,
collapse = TRUE,
comment = "#",
fig.height = 4,
fig.width = 8,
fig.align = "center",
cache = FALSE
)
library(emo)
```
## `r emo::ji("target")` Objective
The objectives for this week are to
- Practice fitting a classification tree model
- Understand the way the tree model is fitted based on impurity
- Learn about the relationship between fitting parameters, bias and variance
## `r emo::ji("wrench")` Preparation
Make sure you have these packages installed:
```
install.packages(c("tidyverse", "tidymodels", "tourr", "rpart.plot", "discrim"))
```
### `r emo::ji("book")` Reading
- Textbook section 8.1
## `r emo::ji("waving_hand")` Getting started
If you are in a Zoom tutorial, say hello in the chat. If in person, do say hello to your tutor and to your neighbours.
```{r eval=TRUE}
library(tidyverse)
library(knitr)
library(kableExtra)
library(tidymodels)
library(rpart.plot)
library(discrim)
library(tourr)
```
## `r emo::ji("gear")` Exercises
### 1. This question is about entropy as an impurity metric for a classification tree.
a. Write down the formula for entropy as an impurity measure for two groups.
b. Establish that the worst-case split has 50% in one group and 50% in the other, in whatever way you would like (algebraically or graphically).
c. Extend the entropy formula so that it can be used to describe the impurity for a possible split of the data into two subsets. That is, it needs to be the sum of the impurity for both left and right subsets of data.
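As a check on your answers to a and c, one way to write the two-group entropy and its split-weighted extension (using natural logs, as in the code in this tutorial) is:

$$D(p) = -p\log p - (1-p)\log(1-p)$$

$$D_{\text{split}} = \frac{n_L}{n}\,D(p_L) + \frac{n_R}{n}\,D(p_R)$$

where $p_L$ and $p_R$ are the proportions of one class in the left and right subsets, and $n_L$, $n_R$ are the subset sizes.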
```{r out.width="50%"}
p <- seq(0.01, 0.99, 0.01)
y <- -p*log(p)-(1-p)*log(1-p)
df <- tibble(p, y)
ggplot(df, aes(x=p, y=y)) + ???
```
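The `???` above is left for you to fill in. One possible completion (the choice of `geom_line`, and the dashed line marking the maximum, are illustrative, not the only answer):

```r
# One possible completion of the chunk above: draw the two-class entropy
# curve and mark its maximum at p = 0.5 (illustrative, not the only answer)
library(tidyverse)

p <- seq(0.01, 0.99, 0.01)
y <- -p*log(p) - (1 - p)*log(1 - p)
df <- tibble(p, y)
ggplot(df, aes(x = p, y = y)) +
  geom_line() +
  geom_vline(xintercept = 0.5, linetype = "dashed") +
  labs(x = "proportion in one class", y = "entropy")
```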
### 2. Computing impurity
For this sample of data,
```{r eval=TRUE, echo=FALSE}
df <- tibble(x=c(1,3,4,5,7), y=c("A", "B", "A", "B", "B"))
kable(df) %>% kable_styling()
```
a. Compute the entropy impurity metric for all possible splits.
```{r}
splits <- tibble(split=c(2, 3.5, 4.5, 6),
impurity = c(4/5*(-1/4*log(1/4)-3/4*log(3/4)),
2/5*(-2*1/2*log(1/2))+3/5*(-1/3*log(1/3)-2/3*log(2/3)),
3/5*(-2/3*log(2/3)-1/3*log(1/3)),
4/5*(-2*1/2*log(1/2))) )
splits %>% kable() %>%
kable_styling(full_width = FALSE)
```
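The hand calculations above can be verified with a small helper function. The names `entropy` and `split_impurity` are illustrative and not part of the tutorial code:

```r
# Entropy of a vector of class labels (natural log; 0*log(0) taken as 0)
# Illustrative helper, not part of the tutorial code
entropy <- function(cl) {
  p <- table(cl) / length(cl)
  p <- p[p > 0]
  -sum(p * log(p))
}

# Weighted impurity of splitting the labels cl on x < split
split_impurity <- function(x, cl, split) {
  left <- x < split
  mean(left) * entropy(cl[left]) + mean(!left) * entropy(cl[!left])
}

x <- c(1, 3, 4, 5, 7)
cl <- c("A", "B", "A", "B", "B")
imp <- sapply(c(2, 3.5, 4.5, 6), function(s) split_impurity(x, cl, s))
round(imp, 3)  # the split at 4.5 gives the lowest impurity
```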
b. Write down the classification rule for the tree that would be formed for the best split.
### 3. Write out the decision tree model
For each of the following data sets, fit the default classification tree. Write out the tree rules, and also sketch the boundary between classes.
a. olive oils, for three regions
```{r}
olive <- read_csv("http://www.ggobi.org/book/data/olive.csv") %>%
rename(name=`...1`) %>%
dplyr::select(-name, -area) %>%
mutate(region = factor(region))
```
```{r echo=FALSE}
tree_spec <- decision_tree() %>%
set_engine("rpart")
class_tree_spec <- tree_spec %>%
set_mode("classification")
olive_rp <- class_tree_spec %>%
fit(region~., data=olive)
olive_rp
olive_rp %>%
extract_fit_engine() %>%
rpart.plot()
ggplot(olive, aes(x=eicosenoic,
y=linoleic,
colour=region)) +
geom_point() +
scale_color_brewer("", palette="Dark2") +
geom_vline(xintercept=6.5) +
annotate("line", x=c(0, 6.5), y=c(1053.5, 1053.5)) +
theme(aspect.ratio = 1)
```
b. chocolates, for type
```{r}
choc <- read_csv(here::here("data/chocolates.csv")) %>%
select(Type:Protein_g) %>%
mutate(Type = factor(Type))
```
```{r echo=FALSE}
choc_rp <- class_tree_spec %>%
fit(Type~., data=choc)
choc_rp
choc_rp %>%
extract_fit_engine() %>%
rpart.plot()
ggplot(choc, aes(x=Fiber_g, y=CalFat, colour=Type)) +
geom_point() +
scale_color_brewer("", palette="Dark2") +
geom_vline(xintercept=4.83) +
annotate("line", x=c(0, 4.83), y=c(337.7, 337.7)) +
theme(aspect.ratio = 1)
```
c. flea, for species
```{r}
data(flea)
```
```{r echo=FALSE}
flea_rp <- class_tree_spec %>%
fit(species~., data=flea)
flea_rp
flea_rp %>%
extract_fit_engine() %>%
rpart.plot()
ggplot(flea, aes(x=aede3, y=tars1, colour=species)) +
geom_point() +
scale_color_brewer("", palette="Dark2") +
geom_vline(xintercept=93.5) +
annotate("line", x=c(93.5, 123), y=c(159, 159)) +
theme(aspect.ratio = 1)
```
### 4. Which model should perform best?
For the crabs data, make a new variable combining species and gender into one class variable.
a. Use the grand and guided tour with the LDA index to examine the data. Describe the shape. Between LDA and a classification tree, which do you expect to perform better on this data?
```{r}
crabs <- read_csv("http://www.ggobi.org/book/data/australian-crabs.csv") %>%
mutate(class = interaction(species, sex)) %>%
dplyr::select(-index, -species, -sex)
```
b. Use 10-fold cross-validation to determine the best choice of minsplit, for the training set of an 80:20 training:test split of the original data. (Check the code from the lecture 6a/b notes to use as an example.)
c. Fit the classification tree with the recommended minsplit. Compute the test accuracy, using your 20% test set. Explain why the tree is so complicated. Compare with the accuracy from an LDA. Is this consistent with what you thought would be the best model?
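The solution chunk that follows relies on objects (`crabs_tr`, `crabs_ts`, `tree_wf`, `tree_res`) built following the lecture 6a/b notes. A minimal sketch of that setup is below; the seed and grid values are illustrative choices, and `min_n` is the tidymodels name that maps to rpart's `minsplit`:

```r
library(tidyverse)
library(tidymodels)

# Data as in part a; object names match the solution chunk that follows
crabs <- read_csv("http://www.ggobi.org/book/data/australian-crabs.csv") %>%
  mutate(class = interaction(species, sex)) %>%
  dplyr::select(-index, -species, -sex)

# 80:20 training:test split, stratified by class
set.seed(2022)                       # illustrative seed
crabs_split <- initial_split(crabs, prop = 0.8, strata = class)
crabs_tr <- training(crabs_split)
crabs_ts <- testing(crabs_split)

# Tree specification with min_n (rpart's minsplit) marked for tuning
tune_spec <- decision_tree(min_n = tune()) %>%
  set_engine("rpart") %>%
  set_mode("classification")
tree_wf <- workflow() %>%
  add_model(tune_spec) %>%
  add_formula(class ~ .)

# 10-fold cross-validation over an illustrative grid of min_n values
crabs_folds <- vfold_cv(crabs_tr, v = 10)
tree_grid <- tibble(min_n = c(5, 10, 20, 30, 50))
tree_res <- tree_wf %>%
  tune_grid(resamples = crabs_folds, grid = tree_grid)
collect_metrics(tree_res)
```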
```{r echo=FALSE, out.width = "80%"}
best_tree <- tree_res %>%
select_best()
final_wf <-
tree_wf %>%
finalize_workflow(best_tree)
# Fit best
final_tree <-
final_wf %>%
fit(data = crabs_tr)
# Plot tree
final_tree %>%
extract_fit_engine() %>%
rpart.plot()
# This is a nicer tree diagram
final_tree %>%
extract_fit_engine() %>%
prp(type = 3, ni = TRUE,
nn = TRUE, extra = 2, box.palette = "RdBu")
# Assessing model
crabs_ts_pred <- augment(final_tree, crabs_ts)
conf_mat(crabs_ts_pred, class, .pred_class)
metrics(crabs_ts_pred, truth = class, estimate = .pred_class)
# LDA
lda_mod <- discrim_linear() %>%
set_engine("MASS") %>%
translate()
crabs_lda_fit <-
lda_mod %>%
fit(class ~ .,
data = crabs_tr)
crabs_lda_pred <- augment(crabs_lda_fit, crabs_ts)
metrics(crabs_lda_pred, truth = class,
estimate = .pred_class)
```
##### © Copyright 2022 Monash University