The goal for this week is to learn to fit, diagnose, assess assumptions of, and predict from classification tree and random forest models.
🔧 Preparation
Make sure you have all the necessary libraries installed. There are a few new ones this week!
Exercises:
Open your project for this unit called iml.Rproj. For all the work we will use the penguins data. Start by splitting it into a training and test set, as follows.
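One possible version of the split step (a sketch assuming the palmerpenguins data and a tidymodels-style split; the object names, the seed, and the renamed columns are placeholders, not the official solution):

```r
library(tidyverse)
library(tidymodels)
library(palmerpenguins)

# Keep complete cases and rename the measurements to short names
# (bl, bd, fl, bm), matching the abbreviations used later in this tutorial.
p_sub <- penguins |>
  drop_na() |>
  rename(bl = bill_length_mm, bd = bill_depth_mm,
         fl = flipper_length_mm, bm = body_mass_g)

set.seed(1148)
p_split <- initial_split(p_sub, prop = 2/3, strata = species)
p_tr <- training(p_split)
p_ts <- testing(p_split)
```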
1. Becoming a car mechanic - looking under the hood at the tree algorithm
Write down the equation for the Gini measure of impurity, for two groups, in terms of the parameter \(p\), the proportion of observations in class 1. Specify the domain of the function, determine the value of \(p\) that gives the maximum, and report that maximum function value.
For two groups, how would the impurity of a split be measured? Give the equation.
Below is an R function to compute the Gini impurity for a particular split on a single variable. Work through the code of the function, and document what each step does. Make sure to include a note on how the minsplit parameter prevents splitting near the edges of the data, where one side would have fewer than the specified number of observations.
```r
# This works for two classes, and one variable
mygini <- function(p) {
  g <- 0
  if (p > 0 && p < 1) {
    g <- 2 * p * (1 - p)
  }
  return(g)
}

mysplit <- function(x, spl, cl, minsplit = 5) {
  # Assumes x is sorted
  # Count number of observations
  n <- length(x)
  # Check number of classes
  cl_unique <- unique(cl)
  # Split into two subsets on the given value
  left <- x[x < spl]
  cl_left <- cl[x < spl]
  n_l <- length(left)
  right <- x[x >= spl]
  cl_right <- cl[x >= spl]
  n_r <- length(right)
  # Don't calculate if either side has fewer than minsplit observations
  if ((n_l < minsplit) | (n_r < minsplit))
    impurity <- NA
  else {
    # Compute the Gini value for the split
    p_l <- length(cl_left[cl_left == cl_unique[1]]) / n_l
    p_r <- length(cl_right[cl_right == cl_unique[1]]) / n_r
    if (is.na(p_l)) p_l <- 0.5
    if (is.na(p_r)) p_r <- 0.5
    impurity <- (n_l / n) * mygini(p_l) + (n_r / n) * mygini(p_r)
  }
  return(impurity)
}
```
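A quick sanity check of these functions on hypothetical toy data (assuming the definitions above have been run):

```r
# Toy example: two classes separated perfectly at 3.5
x  <- c(1, 2, 3, 4, 5, 6)                        # already sorted
cl <- c("a", "a", "a", "b", "b", "b")
mygini(0.5)                                      # maximum impurity, 0.5
mysplit(x, spl = 3.5, cl = cl, minsplit = 1)     # pure split, impurity 0
mysplit(x, spl = 1.5, cl = cl, minsplit = 2)     # left side too small, NA
```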
Apply the function to compute the value for all possible splits for the body mass (bm), setting minsplit to be 1, so that all possible splits will be evaluated. Make a plot of these values vs the variable.
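A sketch of this computation, assuming a training set p_tr with columns bm and species, and restricting to two species since mysplit() handles two classes; the object names are placeholders:

```r
# Evaluate the impurity of every candidate split on body mass (bm).
# Assumes mygini() and mysplit() are defined as above.
p_ac <- p_tr |>
  filter(species %in% c("Adelie", "Chinstrap")) |>
  arrange(bm)

split_vals <- tibble(
  bm = p_ac$bm,
  impurity = purrr::map_dbl(p_ac$bm, \(s)
    mysplit(p_ac$bm, spl = s, cl = p_ac$species, minsplit = 1))
)

ggplot(split_vals, aes(x = bm, y = impurity)) +
  geom_point() +
  geom_vline(xintercept = split_vals$bm[which.min(split_vals$impurity)],
             linetype = "dashed")
```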
Use your function to compute the first two steps of a classification tree model for separating Adelie from Chinstrap penguins, after setting minsplit to be 5. Make a scatterplot of the two variables that would be used in the splits, with points coloured by species, and the splits as line segments.
2. Digging deeper into diagnosing an error
Fit the random forest model to the full penguins data.
Report the confusion matrix.
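These two steps might look like the following, in the tidymodels style used elsewhere in this tutorial (a sketch; the object names, the seed, and the randomForest engine choice are assumptions):

```r
# Fit a random forest to the cleaned penguins data and report the
# out-of-bag confusion matrix from the underlying engine.
library(tidymodels)
set.seed(923)
p_rf_spec <- rand_forest() |>
  set_mode("classification") |>
  set_engine("randomForest")
p_rf_fit <- p_rf_spec |>
  fit(species ~ ., data = p_sub)
p_rf_fit$fit$confusion
```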
When we looked at the data in a tour, there was one Gentoo penguin that was an outlier, appearing to be away from the other Gentoos and closer to the Chinstrap group. We would expect this to be the penguin that the forest model is confused about. Use linked brushing to identify which Gentoo penguin the model misclassified. Is it the outlier?
Have a look at the other misclassifications, to understand whether they are ones we’d expect to misclassify, or whether the model is not well constructed.
Fit a random forest to the bushfire data. You can read more about the bushfire data at https://dicook.github.io/mulgar_book/A2-data.html. Examine the votes matrix using a tour. What do you learn about the confusion between fire causes?
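One way to look at the votes matrix with a tour (a sketch; bf_rf is assumed to be the underlying randomForest fit and bf_tr the training data):

```r
library(tourr)
library(tidyverse)
# Each row of the votes matrix holds the proportion of trees voting for
# each cause, so the rows sum to 1 and the points lie in a simplex.
bf_votes <- as_tibble(bf_rf$votes)
animate_xy(bf_votes, col = bf_tr$cause)
```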
Fit a boosted tree model using xgboost to the bushfire data. You can use the code below. Compute the confusion tables and the balanced accuracy for the test data for both the forest model and the boosted tree model, to make the comparison.
```r
set.seed(121)
bf_spec2 <- boost_tree() |>
  set_mode("classification") |>
  set_engine("xgboost")
bf_fit_bt <- bf_spec2 |>
  fit(cause ~ ., data = bf_tr)
```
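The comparison could be computed along these lines (a sketch; bf_ts is the assumed test set and bf_fit_rf the assumed name of the forest fit):

```r
# Test-set confusion tables and balanced accuracy for both models.
library(tidymodels)
bf_ts_pred <- bf_ts |>
  mutate(
    pcause_rf = predict(bf_fit_rf, bf_ts)$.pred_class,
    pcause_bt = predict(bf_fit_bt, bf_ts)$.pred_class
  )
conf_mat(bf_ts_pred, cause, pcause_rf)
conf_mat(bf_ts_pred, cause, pcause_bt)
bal_accuracy(bf_ts_pred, cause, pcause_rf)
bal_accuracy(bf_ts_pred, cause, pcause_bt)
```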
👋 Finishing up
Make sure you say thanks and good-bye to your tutor. This is also a good time to report what you enjoyed and what you found difficult.