Load the libraries, resolve conflicts, and prepare the data
# Load libraries used everywhere
library(tidyverse)
library(tidymodels)
library(patchwork)
library(mulgar)
library(GGally)
library(tourr)
library(plotly)
library(randomForest)
library(colorspace)
library(ggthemes)
library(conflicted)
library(DALEXtra)
# You need the GitHub version of counterfactuals:
# devtools::install_github("dandls/counterfactuals")
library(counterfactuals)
library(kernelshap)
library(shapviz)
library(lime)
library(palmerpenguins)

# Resolve function-name conflicts between packages
conflicts_prefer(dplyr::filter)
conflicts_prefer(dplyr::select)
conflicts_prefer(dplyr::slice)
conflicts_prefer(palmerpenguins::penguins)

# Keep species and the four measurements, with short names
p_tidy <- penguins |>
  select(species, bill_length_mm:body_mass_g) |>
  rename(bl = bill_length_mm,
         bd = bill_depth_mm,
         fl = flipper_length_mm,
         bm = body_mass_g) |>
  na.omit()

# Standardise the numeric variables.
# `id` variable added to ensure we know which case
# when investigating the models
p_std <- p_tidy |>
  mutate_if(is.numeric, function(x) (x - mean(x)) / sd(x)) |>
  mutate(id = 1:nrow(p_tidy))

# Only use Adelie and Chinstrap, because explainers are
# easy to calculate with only two groups
p_sub <- p_std |>
  filter(species != "Gentoo") |>
  mutate(species = factor(species)) # Drop the unused Gentoo level

# Split into training and test sets
set.seed(821)
p_split <- p_sub |>
  select(species:id) |>
  initial_split(prop = 2/3, strata = species)
p_train <- training(p_split)
p_test <- testing(p_split)
🎯 Objectives
The goal for this week is to learn how to diagnose a model, and to understand variable importance and local explainers.
🔧 Preparation
Make sure you have all the necessary libraries installed. There are a few new ones this week!
Exercises:
Open your project for this unit called iml.Rproj.
CHALLENGE QUESTION: In the penguins data, find an observation where you think various models might differ in their prediction. Try to base your choice on the structure of the various models, not on the observation lying in an overlap area between class clusters. (Code like that below will help you identify observations by their row number.)
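One way to read off case numbers is an interactive scatterplot whose tooltip shows the `id` variable. This is a minimal sketch, assuming `p_sub` from the setup code is available; the choice of `bl` and `bd` as the plotted pair and the name `p_id` are illustrative only.

```r
library(ggplot2)
library(plotly)

# Assumes p_sub from the setup code; `id` records each case's row in p_std.
# The `label` aesthetic is not drawn by geom_point(), but ggplotly()
# includes it in the hover tooltip.
p_id <- ggplot(p_sub, aes(x = bl, y = bd, colour = species, label = id)) +
  geom_point(alpha = 0.8) +
  theme(aspect.ratio = 1)

# Hover over a point to see its id
ggplotly(p_id)
```

Swapping in other pairs of variables can help when looking for a case that, for example, a tree-based model (axis-parallel splits) and a linear model (splits on combinations of variables) might treat differently.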
1. Create and build - construct the (non-linear) model
Fit a random forest model to a subset of the penguins data containing only Adelie and Chinstrap. Report the summaries, and which variable(s) are globally important.
set.seed(857)
# Column 6 is the `id` variable, which must not be used as a predictor
p_rf <- randomForest(species ~ ., data = p_train[, -6])
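The fitted object already holds the summaries asked for. The sketch below assumes `p_train` and `p_rf` exist as created above. By default `randomForest()` reports only the mean decrease in Gini impurity, so a second fit with `importance = TRUE` is shown to obtain permutation importance as well. The name `p_rf_imp` is introduced here for illustration, and this refit is not guaranteed to grow exactly the same trees as `p_rf`.

```r
library(randomForest)

# Assumes p_train and p_rf from the code above
p_rf              # prints the OOB error estimate and confusion matrix
p_rf$confusion    # confusion matrix with per-class error
p_rf$importance   # mean decrease in Gini (the default importance measure)

# Refit with importance = TRUE to also get permutation importance
# (p_rf_imp is an illustrative name, not part of the original code)
set.seed(857)
p_rf_imp <- randomForest(species ~ ., data = p_train[, -6],
                         importance = TRUE)
randomForest::importance(p_rf_imp)
varImpPlot(p_rf_imp)
```

Variables with the largest values in either importance column are the ones the forest relies on most globally.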
2. How does your model affect individuals?
Compute LIME, counterfactual and SHAP explanations for these cases: 19, 28, 37, 111, 122, 129, 281, 292, 295, 305. Report these values. (You can use this code to compute them.)
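A possible way to compute the three explainers is sketched below. It assumes `p_sub`, `p_train` and `p_rf` from the code above, and it assumes that the case numbers refer to the `id` column (several of them exceed the number of rows in `p_sub`). LIME is computed through `DALEXtra::predict_surrogate()`, SHAP through `kernelshap()`, and counterfactuals through `WhatIfClassif`. All object names (`x_vars`, `bg`, `p_lime`, and so on) are illustrative, and only the first case is explained for LIME and counterfactuals; the same calls can be repeated for the other rows.

```r
library(dplyr)
library(DALEXtra)
library(lime)
library(kernelshap)
library(counterfactuals)

# Assumption: the case numbers are values of the `id` column
case_ids <- c(19, 28, 37, 111, 122, 129, 281, 292, 295, 305)
x_vars <- p_sub |>
  filter(id %in% case_ids) |>
  select(bl:bm) |>
  as.data.frame()
bg <- p_train |> select(bl:bm) |> as.data.frame()  # background data

# --- LIME, via a DALEX explainer ---
# These assignments make the lime generics work on DALEX explainers
model_type.dalex_explainer <- DALEXtra::model_type.dalex_explainer
predict_model.dalex_explainer <- DALEXtra::predict_model.dalex_explainer
p_explainer <- DALEX::explain(
  model = p_rf, data = bg,
  y = as.numeric(p_train$species == "Chinstrap"),
  label = "Random forest", verbose = FALSE
)
p_lime <- predict_surrogate(
  explainer = p_explainer, new_observation = x_vars[1, ],
  n_features = 4, n_permutations = 1000, type = "lime"
)
p_lime

# --- SHAP, for all selected cases at once ---
# pred_fun returns the class probabilities as a plain numeric matrix
p_shap <- kernelshap(
  p_rf, X = x_vars, bg_X = bg,
  pred_fun = function(m, X) unclass(predict(m, X, type = "prob")),
  verbose = FALSE
)
p_shap$S  # SHAP values, one matrix per class

# --- Counterfactuals: nearest cases predicted as the other class ---
pred_class <- as.character(predict(p_rf, x_vars[1, ]))
other_class <- setdiff(levels(p_sub$species), pred_class)
p_pred <- iml::Predictor$new(p_rf, data = bg, y = p_train$species,
                             type = "prob")
p_whatif <- WhatIfClassif$new(p_pred, n_counterfactuals = 5L)
p_cf <- p_whatif$find_counterfactuals(
  x_interest = x_vars[1, ],
  desired_class = other_class, desired_prob = c(0.5, 1)
)
p_cf$data
```

The desired class for the counterfactual is set to whichever class the forest does not currently predict, so the search always asks what would need to change for the prediction to flip.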
Explain what you learn about the fitted model by studying the local explainers for the selected cases. (For each observation, compare the variable importance suggested by the different local explainers, then make plots of those variables with the observation of interest marked.)
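Marking the observation of interest can be done by adding a second point layer that contains only that case. This is a minimal sketch, assuming `p_sub` from the setup code; case 19 and the pair `bl`, `bd` are placeholders for whichever case and variables your explainers point to, and `p_marked` is an illustrative name.

```r
library(dplyr)
library(ggplot2)

# Placeholder case; replace with the id you are studying
case_id <- 19

p_marked <- ggplot(p_sub, aes(x = bl, y = bd, colour = species)) +
  geom_point(alpha = 0.5) +
  # Overlay the case of interest as a large black cross
  geom_point(data = filter(p_sub, id == case_id),
             shape = 4, size = 4, stroke = 2, colour = "black") +
  theme(aspect.ratio = 1)
p_marked
```

If the marked point sits well inside its own class on the variables an explainer ranks highly, but near the other class on a variable it ranks low, that suggests the explainers and the model disagree about what matters for this case.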
👋 Finishing up
Make sure you say thanks and good-bye to your tutor. This is a time to also report what you enjoyed and what you found difficult.