---
title: "ETC3250/5250 Tutorial 7 Solution"
subtitle: "Tree models"
author: "prepared by Professor Di Cook"
date: "Week 7"
output:
  html_document:
    after_body: tutorial-footer.html
    css: tutorial.css
---
```{r, echo = FALSE, message = FALSE, warning = FALSE}
knitr::opts_chunk$set(
message = FALSE,
warning = FALSE,
error = FALSE,
eval = TRUE,
collapse = TRUE,
comment = "#",
fig.height = 4,
fig.width = 8,
fig.align = "center",
cache = FALSE
)
library(emo)
```
```{r}
library(tidyverse)
library(knitr)
library(kableExtra)
library(tidymodels)
library(rpart.plot)
library(discrim)
library(tourr)
```
## `r emo::ji("gear")` Exercises
### 1. This question is about entropy as an impurity metric for a classification tree.
a. Write down the formula for entropy as an impurity measure for two groups.
$$- \hat{p}_{1} \log(\hat{p}_{1}) - \hat{p}_{2} \log(\hat{p}_{2})$$
b. Establish that the worst case split has 50% one group and 50% the other group, in whatever way you would like (algebraically or graphically).
```{r out.width="50%"}
p <- seq(0.01, 0.99, 0.01)
y <- -p*log(p)-(1-p)*log(1-p)
df <- tibble(p, y)
ggplot(df, aes(x=p, y=y)) + geom_line() + ylab("entropy")
```
**The entropy is maximised at $p=0.5$: a 50:50 split is the most impure possible, making it the worst-case split.**
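The same conclusion follows algebraically. Differentiating the entropy with respect to $\hat{p}_1 = p$,

$$\frac{d}{dp}\left[-p\log(p)-(1-p)\log(1-p)\right] = \log(1-p) - \log(p),$$

which is zero only at $p=0.5$; since the second derivative $-1/p - 1/(1-p)$ is negative there, this is the maximum of the impurity.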
c. Extend the entropy formula so that it can be used to describe the impurity for a possible split of the data into two subsets. That is, it needs to be the sum of the impurity for both left and right subsets of data.
**Let $L$ indicate the subset of observations to the left of the split, and $R$ indicate those to the right.**
$$p_L(- \hat{p}_{L1} \log(\hat{p}_{L1}) - \hat{p}_{L2} \log(\hat{p}_{L2})) + p_R(-\hat{p}_{R1} \log(\hat{p}_{R1}) - \hat{p}_{R2} \log(\hat{p}_{R2}))$$
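As a sketch of how this formula could be computed, here is a small R implementation (the helper names `entropy` and `split_entropy` are ours, not from the lecture code):

```{r}
# Entropy of a vector of class proportions, treating 0*log(0) as 0
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log(p))
}

# Weighted entropy impurity of splitting numeric x at `split`,
# given class labels y: p_L * entropy(left) + p_R * entropy(right)
split_entropy <- function(x, y, split) {
  left  <- y[x < split]
  right <- y[x >= split]
  pL <- length(left) / length(y)
  pL * entropy(table(left) / length(left)) +
    (1 - pL) * entropy(table(right) / length(right))
}

split_entropy(c(1, 3, 4, 5, 7), c("A", "B", "A", "B", "B"), 4.5)
```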
### 2. Computing impurity
For this sample of data,
```{r}
df <- tibble(x=c(1,3,4,5,7), y=c("A", "B", "A", "B", "B"))
kable(df) %>% kable_styling()
```
a. Compute the entropy impurity metric for all possible splits.
```{r}
splits <- tibble(split=c(2, 3.5, 4.5, 6),
impurity = c(4/5*(-1/4*log(1/4)-3/4*log(3/4)),
2/5*(-2*1/2*log(1/2))+3/5*(-1/3*log(1/3)-2/3*log(2/3)),
3/5*(-2/3*log(2/3)-1/3*log(1/3)),
4/5*(-2*1/2*log(1/2))) )
splits %>% kable() %>%
kable_styling(full_width = F)
```
b. Write down the classification rule for the tree that would be formed for the best split.
**The best split (lowest impurity, 0.382) is at $x=4.5$: classify a new observation with $x>4.5$ to group B, otherwise to group A.**
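As a cross-check, `rpart` finds the same split point on this toy sample once `minsplit` is relaxed (note that `rpart` defaults to the Gini index, which happens to choose the same split here):

```{r}
# Refit the toy data with rpart, allowing splits on very small samples
library(rpart)
toy <- data.frame(x = c(1, 3, 4, 5, 7),
                  y = c("A", "B", "A", "B", "B"))
toy_rp <- rpart(y ~ x, data = toy, method = "class",
                control = rpart.control(minsplit = 2, minbucket = 1, cp = 0))
toy_rp$splits[1, "index"]  # location of the first (root) split
```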
### 3. Write decision tree model
For the following data set, compute the default classification tree. Write out the tree rules, and also sketch the boundary between classes.
a. olive oils, for three regions
```{r out.width="60%"}
olive <- read_csv("http://www.ggobi.org/book/data/olive.csv") %>%
rename(name=`...1`) %>%
dplyr::select(-name, -area) %>%
mutate(region = factor(region))
tree_spec <- decision_tree() %>%
set_engine("rpart")
class_tree_spec <- tree_spec %>%
set_mode("classification")
olive_rp <- class_tree_spec %>%
fit(region~., data=olive)
olive_rp
olive_rp %>%
extract_fit_engine() %>%
rpart.plot()
ggplot(olive, aes(x=eicosenoic,
y=linoleic,
colour=region)) +
geom_point() +
scale_color_brewer("", palette="Dark2") +
geom_vline(xintercept=6.5) +
annotate("line", x=c(0, 6.5), y=c(1053.5, 1053.5)) +
theme(aspect.ratio = 1)
```
b. chocolates, for type
```{r out.width="60%"}
choc <- read_csv(here::here("data/chocolates.csv")) %>%
select(Type:Protein_g) %>%
mutate(Type = factor(Type))
choc_rp <- class_tree_spec %>%
fit(Type~., data=choc)
choc_rp
choc_rp %>%
extract_fit_engine() %>%
rpart.plot()
ggplot(choc, aes(x=Fiber_g, y=CalFat, colour=Type)) +
geom_point() +
scale_color_brewer("", palette="Dark2") +
geom_vline(xintercept=4.83) +
annotate("line", x=c(0, 4.83), y=c(337.7, 337.7)) +
theme(aspect.ratio = 1)
```
c. flea, for species
```{r out.width="60%"}
data(flea)
flea_rp <- class_tree_spec %>%
fit(species~., data=flea)
flea_rp
flea_rp %>%
extract_fit_engine() %>%
rpart.plot()
ggplot(flea, aes(x=aede3, y=tars1, colour=species)) +
geom_point() +
scale_color_brewer("", palette="Dark2") +
geom_vline(xintercept=93.5) +
annotate("line", x=c(93.5, 123), y=c(159, 159)) +
theme(aspect.ratio = 1)
```
### 4. Which model should perform best
For the crabs data, make a new variable combining species and gender into one class variable.
a. Use the grand and guided tour with the LDA index to examine the data. Describe the shape. Between LDA and a classification tree, which do you expect to perform better on this data?
```{r}
crabs <- read_csv("http://www.ggobi.org/book/data/australian-crabs.csv") %>%
mutate(class = interaction(species, sex)) %>%
dplyr::select(-index, -species,-sex)
```
**The variables are highly correlated, and the difference between groups uses a combination of variables. Trees will have a difficult time with this data. LDA should perform better.**
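The description above comes from viewing the tours; a sketch of the tour calls is below, using the `tourr` API loaded earlier. These run interactively, so the chunk is not evaluated here.

```{r eval=FALSE}
# Grand tour of the five measurements, coloured by the combined class
animate_xy(crabs[, 1:5], col = crabs$class)
# Guided tour, optimising the LDA projection pursuit index
animate_xy(crabs[, 1:5],
           tour_path = guided_tour(lda_pp(crabs$class)),
           col = crabs$class)
```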
b. Use 10-fold cross-validation to determine the best choice of minsplit, for the training set of an 80:20 training:test split of the original data. (Check the code from the lecture 6a/b notes to use as an example.)
```{r out.width = "80%"}
set.seed(20220405)
crabs_split <- initial_split(crabs, prop = 0.8)
crabs_tr <- training(crabs_split)
crabs_ts <- testing(crabs_split)
tune_spec <-
decision_tree(
cost_complexity = tune(),
min_n = tune()
) %>%
set_engine("rpart") %>%
set_mode("classification")
crabs_folds <- vfold_cv(crabs_tr, 10)
# Grid of candidate parameter values
tree_grid <- expand.grid(min_n = 5:20,
                         cost_complexity = c(0.01, 0.005))
tree_wf <- workflow() %>%
add_model(tune_spec) %>%
add_formula(class~.)
tree_res <-
tree_wf %>%
tune_grid(
resamples = crabs_folds,
grid = tree_grid,
metrics = metric_set(accuracy)
)
autoplot(tree_res) +
scale_color_brewer("cost-complex", palette="Dark2")
# Show the cross-validation means with one-standard-error bars
crabs_tune <- tree_res %>%
  collect_metrics() %>%
  mutate(cost_complexity = as.factor(cost_complexity))
crabs_tune %>% ggplot() +
geom_line(aes(x=min_n, y=mean,
colour = cost_complexity,
group = cost_complexity),
size = 1.5, alpha = 0.6) +
scale_color_brewer("cost-complex", palette="Dark2") +
geom_point(aes(x=min_n, y=mean,
colour = cost_complexity), size = 2) +
geom_errorbar(aes(x=min_n,
ymin = mean-std_err,
ymax = mean+std_err))
```
**The plot above shows the 10-fold cross-validation results for different choices of cost_complexity and min_n (minsplit). The different folds give multiple values, and hence the mean and a confidence interval can be computed for each parameter combination. Regardless of the choice of parameter, the accuracy is fairly similar, especially for smaller values of min_n.**
```{r out.width = "80%"}
best_tree <- tree_res %>%
  select_best(metric = "accuracy")
final_wf <-
tree_wf %>%
finalize_workflow(best_tree)
# Fit best
final_tree <-
final_wf %>%
fit(data = crabs_tr)
# Plot tree
final_tree %>%
extract_fit_engine() %>%
rpart.plot()
# This is a nicer tree diagram
final_tree %>%
extract_fit_engine() %>%
prp(type = 3, ni = TRUE,
nn = TRUE, extra = 2, box.palette = "RdBu")
```
c. Fit the classification tree with the recommended minsplit. Compute the test accuracy, using your 20% test set. Explain why the tree is so complicated. Compare with the accuracy from an LDA. Is this consistent with what you thought would be the best model?
```{r}
crabs_ts_pred <- augment(final_tree, crabs_ts)
conf_mat(crabs_ts_pred, class, .pred_class)
metrics(crabs_ts_pred, truth = class, estimate = .pred_class)
```
```{r}
lda_mod <- discrim_linear() %>%
set_engine("MASS") %>%
translate()
crabs_lda_fit <-
lda_mod %>%
fit(class ~ .,
data = crabs_tr)
# crabs_lda_fit
crabs_lda_pred <- augment(crabs_lda_fit, crabs_ts)
metrics(crabs_lda_pred, truth = class,
estimate = .pred_class)
```
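To see where the two models differ, the LDA confusion matrix can be compared with the tree's (the same `conf_mat` call as used for the tree above):

```{r}
conf_mat(crabs_lda_pred, truth = class, estimate = .pred_class)
```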
**The LDA model substantially outperforms the tree. This is consistent with the expectation from part a: the group differences lie in linear combinations of highly correlated variables, so the tree must approximate each oblique boundary with many axis-parallel splits, which is why the fitted tree is so complicated.**
##### © Copyright 2022 Monash University