Load the libraries, avoid conflicts, and prepare the data
# Load libraries used everywhere
library(tidyverse)
library(tidymodels)
library(patchwork)
library(mulgar)
library(GGally)
library(tourr)
library(plotly)
library(randomForest)
library(colorspace)
library(ggthemes)
library(conflicted)
library(DALEXtra)
# devtools::install_github("dandls/counterfactuals")
# You need the GitHub version
library(counterfactuals)
library(kernelshap)
library(shapviz)
library(lime)
library(palmerpenguins)
conflicts_prefer(dplyr::filter)
conflicts_prefer(dplyr::select)
conflicts_prefer(dplyr::slice)
conflicts_prefer(palmerpenguins::penguins)

p_tidy <- penguins |>
  select(species, bill_length_mm:body_mass_g) |>
  rename(bl = bill_length_mm,
         bd = bill_depth_mm,
         fl = flipper_length_mm,
         bm = body_mass_g) |>
  na.omit()

# `id` variable added to ensure we know which case
# when investigating the models
p_std <- p_tidy |>
  mutate_if(is.numeric, function(x) (x - mean(x)) / sd(x)) |>
  mutate(id = 1:nrow(p_tidy))

# Only use Adelie and Chinstrap, because explainers are
# easy to calculate with only two groups
p_sub <- p_std |>
  filter(species != "Gentoo") |>
  mutate(species = factor(species)) # Fix factor

# Split into training and test sets
set.seed(821)
p_split <- p_sub |>
  select(species:id) |>
  initial_split(prop = 2/3, strata = species)
p_train <- training(p_split)
p_test <- testing(p_split)
🎯 Objectives
The goal for this week is to learn to diagnose a model, and to understand variable importance and local explainers.
🔧 Preparation
Make sure you have all the necessary libraries installed. There are a few new ones this week!
Exercises:
Open your project for this unit called iml.Rproj.
CHALLENGE QUESTION: In the penguins data, find an observation where you think the various models might differ in their prediction. Try to base your choice on the structure of the various models, not on that observation lying in an overlap area between class clusters. (Code like that below will help you identify observations by their row number.)
A scatterplot matrix is useful for an overall look at the data. Then plot two variables; if you render the plot with plotly, mousing over a point shows its row number, which helps to pin down particular observations.
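Here is a minimal sketch of both steps. The choice of bl and bd for the two-variable plot is only an illustration, and the use of the text aesthetic as the tooltip is one of several ways to do this.

# Scatterplot matrix of the four standardised measurements, coloured by species
ggpairs(p_std, mapping = aes(colour = species), columns = 2:5)

# Plot two variables, carrying `id` along as a tooltip, then mouse over points
p_bl_bd <- ggplot(p_std, aes(x = bl, y = bd, colour = species, text = id)) +
  geom_point()
ggplotly(p_bl_bd, tooltip = "text")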
These are the ones that I selected to investigate below:
19, 28, 37, 111, 122, 129 Adelie
185, 189, 250, 253 Gentoo
281, 292, 295, 305 Chinstrap
Most of these were chosen because they are outliers in their group, and different model fits could place boundaries differently in these regions: orthogonal to the axes, as in a tree or forest, rather than oblique, as in LDA, logistic regression or a neural network.
Did you find any others?
1. Create and build - construct the (non-linear) model
Fit a random forest model to a subset of the penguins data containing only Adelie and Chinstrap. Report the summaries, and which variable(s) are globally important.
set.seed(857)
p_rf <- randomForest(species ~ ., data = p_train[,-6])
Solution
p_rf
Call:
randomForest(formula = species ~ ., data = p_train[, -6])
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2
OOB estimate of error rate: 4.83%
Confusion matrix:
Adelie Chinstrap class.error
Adelie 96 4 0.04000000
Chinstrap 3 42 0.06666667
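The test set confusion and accuracy shown next could be computed along these lines. This is a sketch; the use of count() with pivot_wider(), and the final column order, are assumptions about how the original output was produced.

# Predict the test set, then tabulate confusion and per-class accuracy
p_test |>
  mutate(pspecies = predict(p_rf, p_test)) |>
  count(species, pspecies) |>
  group_by(species) |>
  mutate(Accuracy = sum(n[species == pspecies]) / sum(n)) |>
  pivot_wider(names_from = pspecies, values_from = n, values_fill = 0) |>
  select(species, Adelie, Chinstrap, Accuracy)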
# A tibble: 2 × 4
# Groups: species [2]
species Adelie Chinstrap Accuracy
<fct> <int> <int> <dbl>
1 Adelie 49 2 0.961
2 Chinstrap 0 23 1
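Global variable importance for the forest can be read from the fitted object, for example:

# Mean decrease in Gini impurity for each variable, and a dotchart of the same
p_rf$importance
varImpPlot(p_rf)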
2. How does your model affect individuals?
Compute LIME, counterfactuals and SHAP for these cases: 19, 28, 37, 111, 122, 129, 281, 292, 295, 305. Report these values. (You can use this code to compute these.)
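The code to use was provided with the tutorial. As one possible route for the SHAP piece, here is a minimal sketch using kernelshap and shapviz; explaining the Chinstrap probability and using the training predictors as the background data are assumptions. The LIME and counterfactual calculations would use the lime and counterfactuals packages loaded at the top, and are not sketched here.

# Cases to explain, pulled out by their id
p_cases <- p_sub |>
  filter(id %in% c(19, 28, 37, 111, 122, 129, 281, 292, 295, 305))

# Kernel SHAP on the forest's Chinstrap probability, with the training
# predictors as background data
p_ks <- kernelshap(p_rf,
  X = p_cases |> select(bl:bm),
  bg_X = p_train |> select(bl:bm),
  pred_fun = function(m, x) predict(m, x, type = "prob")[, "Chinstrap"])

# Convert for plotting, then look at one case at a time
p_shap <- shapviz(p_ks)
sv_waterfall(p_shap, row_id = 1)   # first row of p_cases, i.e. id 19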
SHAP values suggest bl is most important for most of these observations.
For observations 1 (19) and 10 (305), bl, bd and bm are most important.
For observation 4 (111) the pattern is similar, but bm is more important than bl and bd.
For observations 7 (281) and 9 (295), bm is important, along with bl.
3. Putting the pieces back together
Explain what you learn about the fitted model by studying the local explainers for the selected cases. (You will want to compare the variable importance suggested by the local explainers for each observation, and then make plots of those variables with the observation of interest marked.)
Solution
The model has non-linear boundaries, because the local variable importance values differ from the global variable importance. So we have something to investigate locally!
Here’s a summary of what is learned about each observation from the local explainers.
id  | species   | predicted | LIME   | CF     | SHAP
19  | Adelie    | Chinstrap | bd     |        | bd, bm
28  | Adelie    | Adelie    |        |        |
37  | Adelie    | Adelie    |        |        |
111 | Adelie    | Chinstrap | bd, bm | bd, fl | bm
122 | Adelie    | Adelie    |        |        |
129 | Adelie    | Chinstrap | bd     |        |
281 | Chinstrap | Chinstrap |        |        | bm
292 | Chinstrap | Chinstrap |        |        |
295 | Chinstrap | Adelie    | bd     | fl     | bm
305 | Chinstrap | Adelie    |        |        | bd, bm
Plotting two variables and identifying the observation can help to understand the model at this point.
Focus on the misclassified observations, because this helps to understand where the model is going wrong. We choose 19 here, which is an observation in the training set, so it was used to build the model.
Plot the variables that are important, with the observation of interest marked. bl was always important. For observation 19, two explainers suggested bd, with one also suggesting bm.
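A minimal sketch of the intended plot for penguin 19 follows; the highlighting style and colour palette are just choices.

# bl vs bd with penguin 19 circled
ggplot(p_sub, aes(x = bl, y = bd, colour = species)) +
  geom_point(alpha = 0.8) +
  geom_point(data = filter(p_sub, id == 19),
             shape = 1, size = 5, colour = "black") +
  scale_colour_brewer(palette = "Dark2") +
  theme(aspect.ratio = 1)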
Penguin 19 is slightly unusual. It has a high value of bd, slightly higher than all other Adelie (blue) penguins. It also has slightly higher bm compared to other Adelie penguins with similar bl values. We suspect that the model, with the “boxy” boundaries it makes, carved out this region of bd vs bl as a prediction region for Chinstrap (red). (Note, bd and bm together show nothing interesting.)
Penguin 129 is in the test set. Only LIME suggested any variables other than bl were important, and these were bd and bm. When each of these variables is plotted against bl, we can see that this penguin lies in the confusion region between Adelie and Chinstrap. The error is likely because the training set had no other Adelie with similar bl, bd and bm characteristics; all the penguins in this region of the training set were Chinstrap.
Penguin 295 is a Chinstrap and was in the training set. It is unusual in both fl and bm relative to bl. The error most likely occurs because of the “boxy” boundaries of forests. Most of the penguins with these characteristics are Adelie, hence the misclassification.
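A similar sketch for penguin 295, putting fl and bm against bl side by side with patchwork; again, the styling is a choice, not the original code.

# fl vs bl and bm vs bl, with penguin 295 circled in each panel
p1 <- ggplot(p_sub, aes(x = bl, y = fl, colour = species)) +
  geom_point(alpha = 0.8) +
  geom_point(data = filter(p_sub, id == 295),
             shape = 1, size = 5, colour = "black") +
  scale_colour_brewer(palette = "Dark2")
p2 <- ggplot(p_sub, aes(x = bl, y = bm, colour = species)) +
  geom_point(alpha = 0.8) +
  geom_point(data = filter(p_sub, id == 295),
             shape = 1, size = 5, colour = "black") +
  scale_colour_brewer(palette = "Dark2")
p1 + p2 + plot_layout(guides = "collect")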
👋 Finishing up
Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.