For a trained model \(\widehat{f}\), fitted to data \({\mathbf X}\) to predict response \(y\), with loss function \(L(y, \widehat{f})\) (e.g. misclassification rate, prediction error):
Compute \(L(y, \widehat{f})\) on the original data, \(L^{\text{orig}}\).
For each variable \(j \in \{1, \dots, p\}\):
Generate a data matrix \({\mathbf X}^{\text{perm}}\) by permuting variable \(j\). This breaks the association between variable \(j\) and the observed \(y\).
Compute \(L(y, \widehat{f})\) on the permuted data, \(L^{\text{perm}}\).
Compare \(L^{\text{orig}}\) and \(L^{\text{perm}}\), e.g. \(|L^{\text{orig}} - L^{\text{perm}}|\).
The most important variables have the larger values.
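A minimal from-scratch sketch of this loop, assuming a fitted classifier fit with a predict() method, a data frame train of predictors, a response vector y, and misclassification rate as the loss; all object names here are illustrative.
set.seed(923)
miscl <- function(obs, pred) mean(obs != pred)     # loss: misclassification rate
L_orig <- miscl(y, predict(fit, train))            # loss on the original data
perm_imp <- sapply(colnames(train), function(j) {
  x_perm <- train
  x_perm[[j]] <- sample(x_perm[[j]])               # permute variable j
  L_perm <- miscl(y, predict(fit, x_perm))         # loss on the permuted data
  abs(L_orig - L_perm)                             # importance of variable j
})
sort(perm_imp, decreasing = TRUE)                  # most important variables first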
Permutation variable importance (2/2)
Random forests have this baked into the model fitting (using the out-of-bag cases); see the sketch below.
Generally, it should be computed on the test set.
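For example, the randomForest package reports permutation importance from the out-of-bag cases when importance = TRUE (a sketch; train and y are placeholder names for the predictors and a factor response).
library(randomForest)
# OOB permutation importance is computed during fitting
rf_fit <- randomForest(x = train, y = y, importance = TRUE)
importance(rf_fit, type = 1)   # type 1: mean decrease in accuracy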
# Using DALEX with tidymodels
# https://www.tmwr.org/explain
# https://ema.drwhy.ai/featureImportance.html
library(DALEX)
library(DALEXtra)
vip_features <- colnames(p_std)[2:5]
vip_train <- p_std |> select(all_of(vip_features))
explainer_lda <- explain_tidymodels(
  lda_fit,
  data = vip_train,
  y = p_std$species,
  verbose = FALSE)
vip_lda <- model_parts(explainer_lda, B = 100)
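The resulting model_parts object can be passed to plot() to display the importance of each variable (a usage sketch):
plot(vip_lda)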
Data with additional correlated variables
Correlated variables can still affect the results.
They can mask the importance of other variables.
Partial dependence profiles (1/2)
Partial dependence profiles show how the model prediction changes across different values of an explanatory variable.
# With DALEX
pdp_lda <- model_profile(explainer_lda, N = 100)
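The profiles can then be displayed with the default plot method (a usage sketch):
plot(pdp_lda)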
Shows what the model sees.
Partial dependence profiles (2/2)
What the PDP suggests LDA sees
What do we see?
Local explainability
Linear vs non-linear separation
When the difference between classes is non-linear, variable importance changes locally.
Mark a point where x1 is most important in distinguishing the classes.
Mark a point where x2 is most important in distinguishing the classes.
Why should I know about local explainers?
If you deploy a complex model, you may need to be able to explain any decision made from it.
If the decisions affect people or organisations, they might be challenged in court. As the analyst, you may be expected to justify the decision: that it was made fairly, without bias, and based on the specific measurements collected for the model.
Selected points to use for illustration
Which variable is most important?
obs  expect
1    x1
2    x2
3    x2 ?
4    x1, x2
5    x1, x2
6    x2
LIME
Fit a linear regression in the local neighbourhood of the observation of interest; the local coefficients indicate which variables matter for that prediction.
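A minimal from-scratch sketch of the idea (not the lime package, and without LIME's feature selection step), assuming standardised predictors x1 and x2, a classifier fit whose predict() returns class probabilities via type = "prob", and a one-row data frame x0 for the observation of interest; names are illustrative.
set.seed(235)
n_pert <- 500
# Perturb the observation of interest to generate a local neighbourhood
nbhd <- data.frame(x1 = rnorm(n_pert, x0$x1, 0.3),
                   x2 = rnorm(n_pert, x0$x2, 0.3))
nbhd$p_A <- predict(fit, nbhd, type = "prob")[, "A"]    # black-box predictions
# Weight neighbours by their closeness to x0
d <- sqrt((nbhd$x1 - x0$x1)^2 + (nbhd$x2 - x0$x2)^2)
w <- exp(-d^2 / 0.5)
# Weighted linear surrogate: coefficients give the local importance of x1 and x2
local_fit <- lm(p_A ~ x1 + x2, data = nbhd, weights = w)
coef(local_fit)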
Counterfactuals
Find the closest observation (the counterfactual) that has a different class. Which values of the variables would need to change to turn the observation of interest into the counterfactual? (A search sketch follows the table below.)
(x1o, x2o, clo: the observation; x1, x2, cl: its counterfactual)
   x1o   x2o clo      x1    x2 cl
1 -0.5 -0.25   A -0.5000 -0.31  B
2  0.0  0.00   B -0.0057  0.00  A
3  0.2 -0.50   B  0.1358 -0.50  A
4 -0.8  0.80   A -0.1785  0.51  B
5  0.8 -0.80   B  0.1358 -0.52  A
6  0.8  0.50   A  0.8249  0.50  B
Note: if a case is misclassified, the desired class for the counterfactual is its true class.
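A minimal sketch of this search, assuming a data frame train of standardised predictors, a vector pred_cl of predicted (or, for misclassified cases, true) classes, a one-row data frame x0 for the observation of interest, and its class cl0; Euclidean distance is used and the names are illustrative.
# Distance from x0 to every case, ignoring cases of the same class
d <- sqrt(rowSums(sweep(as.matrix(train), 2, as.numeric(x0))^2))
d[pred_cl == cl0] <- Inf
counterfactual <- train[which.min(d), ]   # the closest case with a different class
counterfactual - x0                       # the change needed in each variable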
Anchors
How far can you extend from the observation's values in each direction and still have all observations in that region be predicted as the same class?
Note: No working R package to calculate these.
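Since there is no ready-made package, here is a rough grid-based sketch of the idea for two standardised variables, assuming a classifier fit whose predict() returns class labels, a one-row data frame x0, and its predicted class cl0; it only checks a square box of half-width r, not the full anchors algorithm.
# Does every grid point within distance r of x0 get the same class as x0?
box_pure <- function(r, n_grid = 20) {
  g <- expand.grid(x1 = seq(x0$x1 - r, x0$x1 + r, length.out = n_grid),
                   x2 = seq(x0$x2 - r, x0$x2 + r, length.out = n_grid))
  all(predict(fit, g) == cl0)
}
radii <- seq(0.05, 1, by = 0.05)
radii[sapply(radii, box_pure)]   # the largest of these is the (square) anchor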
Shapley values
A Shapley value measures a variable's contribution as the average change in prediction over all combinations of presence or absence of the other variables. For each combination, the prediction is computed by substituting the absent variables with their average values. (A DALEX-based sketch follows the table below.)
(shapAx1, shapAx2: Shapley values of x1 and x2 towards class A)
    x1    x2 cl shapAx1 shapAx2
1 -0.5 -0.25  A   0.358    0.15
2  0.0  0.00  B  -0.236   -0.25
3  0.2 -0.50  B  -0.164   -0.32
4 -0.8  0.80  A   0.255    0.26
5  0.8 -0.80  B  -0.215   -0.27
6  0.8  0.50  A  -0.059    0.57
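In practice, Shapley values for a single observation can be computed with DALEX from the same explainer used earlier (a sketch; new_obs is a placeholder for one row of predictors):
shap_lda <- predict_parts(explainer_lda,
                          new_observation = new_obs,
                          type = "shap",
                          B = 25)
plot(shap_lda)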
Summary
Which variable is most important?
obs  expect  LIME    CF      SHAP
1    x1      x1      x2      x1
2    x2      x1      x1      x1, x2
3    x2 ?    x2      x1      x2
4    x1, x2  x1, x2  x1, x2  x1, x2
5    x1, x2  x2      x1, x2  x1, x2
6    x2      x2      x1      x2
They don’t all agree.
You need good visualisation of the model in the data space to fully digest the importance of the variables.
NOTE: We can compare magnitudes when interpreting the local explainers because we used standardised data. Otherwise the interpretation is more complicated.