library(arrow)
library(tidyverse)
library(splines)
library(lightgbm)
library(hstats) # H statistics, ICE, (stratified) PDP
library(effectplots) # PDP, ALE, M plots
library(patchwork)
set.seed(1)AI TOOLS FOR ACTUARIES
Chapter 9: Explainability - Part A
Chapter 9: Explainability - Part A
1 Introduction
Building a strong GLM is a complex and time consuming task:
How should one engineer the covariates?
Which interactions should be included?
Machine learning (ML) alternatives, such as GBMs and FNNs, take care of these difficulties internally by covariate engineering.
Their ability to represent complicated interactions is a main reason for their typically superior performance over GLMs.
This comes at the cost of interpretability.
While a GLM can be exactly described by its coefficients, the behavior of a complex machine learning algorithm can only be described approximately by methods from eXplainable AI (XAI).

This leads to the modeler’s dilemma depicted in the above graph:
Do we prefer an oversimplified model that can be interpreted exactly,
or do we rather want a more realistic one that can only be interpreted approximately?
This chapter guides through the most important methods to explain machine learning models and their predictions.
The main focus is on methods that can be applied to any model, i.e., model-agnostic methods.
The methods are illustrated on the French MTPL insurance dataset introduced in the GLM chapter using both a boosted tree model – called black box model – and a GLM – called glass box model.
The boundary between glass and black box models is blurry:
A GLM can contain many interactions or features derived by complex transformations of multiple covariates, e.g., risk scores of vehicle types. This makes a GLM as hard to interpret as a black box model.
On the other hand, black box models can be designed to be intrinsically interpretable, an example is the additive boosted tree of Lou, Caruana and Gehrke (2012).
Generally, the more important explainability in a specific situation, the more efforts need to be put into covariate engineering:
E.g., some explainability methods that we present are only useful for weakly correlated/dependent covariates.
Decorrelation of covariates (e.g., by taking differences or ratios) is a common pre-processing step in this context.
Additionally, to make visualizations more informative, it helps to
clip (censor) extreme covariate values,
use log-transformations of positive, right-skewed covariates, or
collapse rare levels of high-cardinality categorical covariates.
Such transformations might not necessarily improve the model performance, but they typically also do not harm.
The model-agnostic methods presented in this chapter only require an (estimated) prediction function \(\mu(\cdot)\) and \(\widehat{\mu}(\cdot)\), respectively, and a dataset \({\cal D} = (\boldsymbol{X}_i, v_i)_{i=1}^n\), where \(v_i>0\) are optional case weights.
Permutation importance additionally requires observed responses \((Y_i)_{i=1}^n\) (because it is based on predictive performance).
Explainability is a vivid research area, and new methods are developed continuously. We present standard methods presented in this chapter.
This chapter builds up the ETH lecture notes of Mayer and Lorentzen (2025) and the book of Molnar (2019).
2 Data and models
For the theory of GLMs, we refer to Chapter 3.
For the theory of LightGBMs, we refer to Chapter 7.
- We start by loading the necessary libraries.
- Load and pre-process the French MTPL dataset.
freMTPL2freq <- read_parquet("../../Data/freMTPL2freq.parquet")
# Pre-processing of features
prep <- freMTPL2freq |>
mutate(
frequency = ClaimNb / Exposure, # compute frequencies
severity = pmin(5e4, ClaimTotal / ClaimNb), # average claim sizes
Region = fct_lump(Region, 6), # cluster less frequent levels
VehBrand = fct_lump(VehBrand, 5), # cluster less frequent levels
VehAge = pmin(25, VehAge), # censor vehicle age
lDensity = log(Density) # this log-transformation is not necessary for boosting but it gives nicer SHAP plots
)
#summary(prep)
# Partition learning-test sample
learn <- subset(prep, LearnTest == "L")
test <- subset(prep, LearnTest == "T")- Fit the glass box GLM (we do not use all covariates).
# covariates used
features <- c("DrivAge", "lDensity", "VehAge", "VehPower", "Region", "VehBrand", "VehGas")
# GLM with natural cubic spline for Driver's Age
glm_freq <- glm(frequency ~ ns(DrivAge, df = 5) + lDensity + VehAge + VehPower + Region + VehBrand + VehGas,
data = learn, weight = Exposure, family = quasipoisson())
summary(glm_freq)We model the frequencies with the
quasipoissonmodel.We use a natural cubic spline for
DrivAgewith 5 degrees of freedom.
Call:
glm(formula = frequency ~ ns(DrivAge, df = 5) + lDensity + VehAge +
VehPower + Region + VehBrand + VehGas, family = quasipoisson(),
data = learn, weights = Exposure)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.8623728 0.0795397 -23.414 < 2e-16 ***
ns(DrivAge, df = 5)1 -1.3187238 0.0503481 -26.192 < 2e-16 ***
ns(DrivAge, df = 5)2 -1.2454967 0.0599051 -20.791 < 2e-16 ***
ns(DrivAge, df = 5)3 -1.0840892 0.0659222 -16.445 < 2e-16 ***
ns(DrivAge, df = 5)4 -2.8455592 0.1317756 -21.594 < 2e-16 ***
ns(DrivAge, df = 5)5 -0.1958242 0.1404238 -1.395 0.163160
lDensity 0.1012029 0.0053987 18.746 < 2e-16 ***
VehAge -0.0125858 0.0017698 -7.111 1.15e-12 ***
VehPower 0.0250924 0.0043233 5.804 6.48e-09 ***
RegionR24 -0.0383095 0.0378984 -1.011 0.312090
RegionR52 -0.0062894 0.0460457 -0.137 0.891354
RegionR53 -0.0583939 0.0447300 -1.305 0.191732
RegionR82 0.1844766 0.0358023 5.153 2.57e-07 ***
RegionR93 0.0937505 0.0379739 2.469 0.013556 *
RegionOther 0.0005398 0.0345525 0.016 0.987534
VehBrandB2 0.0083529 0.0234681 0.356 0.721895
VehBrandB3 0.0586970 0.0326057 1.800 0.071829 .
VehBrandB5 0.1238978 0.0373757 3.315 0.000917 ***
VehBrandB12 -0.2479413 0.0299750 -8.272 < 2e-16 ***
VehBrandOther 0.0381488 0.0268881 1.419 0.155958
VehGasRegular -0.1772723 0.0174566 -10.155 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasipoisson family taken to be 1.681809)
Null deviance: 153852 on 610205 degrees of freedom
Residual deviance: 151032 on 610185 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 6
- Fit the black box LightGBM (we do not use all covariates).
# use the same covariates as in the GLM
X_learn <- data.matrix(learn[features])
X_test <- data.matrix(test[features])
#
dfreq <- lgb.Dataset(
X_learn,
label = learn$frequency,
weight = learn$Exposure,
categorical_feature = c("Region", "VehBrand"),
params = list(feature_pre_filter = FALSE)
)- The GLM and the LightGBM use the same covariates.
Next, we select the hyper-parameters to fit the LightGBM.
These hyper-parameters are identical to the ones in the boosting Chapter 7 (obtained by randomized 5-fold CV).
params_freq <- list(
objective = "poisson",
learning_rate = 0.02,
num_leaves = 31,
reg_lambda = 0,
reg_alpha = 2,
colsample_bynode = 0.8,
subsample = 0.8,
min_child_samples = 20,
min_split_gain = 0.1,
poisson_max_delta_step = 0.1
)
nrounds_freq <- 241
# fit LightGBM
lgb_freq <- lgb.train(params = params_freq, data = dfreq, nrounds = nrounds_freq)- Test performance (Poisson deviance loss):
# Poisson deviance loss (scaled by 10^2 for better visibility)
mean_poisson_deviance <- function(y, pred, weights = rep(1, length(y))){
10^2 * sum(stats::poisson()$dev.resids(y, pred, weights)) / sum(weights)
}
# out-of-sample predictions of null model, GLM, LightGBM
pred_null_freq <- weighted.mean(learn$frequency, learn$Exposure)
pred_glm_freq <- predict(glm_freq, test, type = "response")
pred_lgb_freq <- predict(lgb_freq, X_test)This computes the out-of-sample predictions of the null model (without covariates), the GLM and the LightGBM.
Note that our models do not consider all covariates. Therefore, they are not comparable to the most competitive models on this dataset.
# Null / intercept-only model: 47.967
round(mean_poisson_deviance(test$frequency, pred_null_freq, test$Exposure),3)[1] 47.967
# GLM: 46.974
round(mean_poisson_deviance(test$frequency, pred_glm_freq, test$Exposure),3)[1] 46.974
# LightGBM: 46.891
round(mean_poisson_deviance(test$frequency, pred_lgb_freq, test$Exposure),3)[1] 46.891
For this dataset and the selected covariates, the test performance of the GLM and the LightGBM are similar. This may indicate that the GLM is well specified and there are no strong interactions.
Can we confirm this guesswork by applying explainability methods?
3 Variable importance
3.1 Variable importance
Answering the previous questions is helpful in different ways:
It provides valuable information.
It helps to debug the model and the data pre-processing.
E.g., if a variable that should be important is not, then this could indicate a problem in the data preparation.
On the other hand, if a (potentially) unimportant variable is one of the top predictors, then this may indicate an information leakage from the response to that covariate.
In models with many covariates, variable importance helps to select covariates for a more detailed inspection, e.g., for partial dependence plots or pairwise interaction investigations.
In favor of a simpler and parsimonious model, unimportant covariates should be removed altogether.
Different modeling techniques offer different ways to assess variable importance.
For GLMs, we can study likelihood ratio tests (LRTs), Wald tests or scaled coefficients; see Chapter 3 on GLMs.
For tree-based methods, we have met split gain importance (VI-score); see Chapters 6 and 7 on regression trees and GBMs.
The two most frequently used model-agnostic methods are permutation (variable) importance (PVI) and SHAP importance.
We are going to present the following methods in this chapter:
- split gain importance is recalled in this notebook;
- permutation importance is presented in this notebook;
- SHAP importance is presented in the next notebook.
3.2 Split gain importance – tree-based models
We first recall from Chapters 6 and 7 (regression trees and GBMs) split gain importance (VI-score) that ranks the covariates by how much they contribute to the decrease of loss (during model fitting).
This measure is specific to tree-based methods, but we can also use a tree-based surrogate model to approximate a black box model; see next notebook for surrogate models.
# LightGBM: Split gain importance
lgb.importance(lgb_freq) |>
ggplot(aes(Gain, reorder(Feature, Gain))) +
geom_bar(stat = "identity", fill = "orange") +
xlab("relative split gain") +
ylab(element_blank()) +
ggtitle("LightGBM split gain importance (tree-based)")
3.3 Permutation importance
Definition. Permutation variable importance (PVI) of a fitted model \(\widehat{\mu}(\cdot)\), a covariate subset \(\boldsymbol{X}_{\cal S}=(X_s)_{s \in {\cal S}}\), indexed by \({\cal S} \subset \{1, \dots, q\}\), a dataset \({\cal D}= (Y_i,\boldsymbol{X}_i, v_i)_{i=1}^n\), and a loss function \(L\) is defined as \[ \operatorname{PVI}({\cal S}, {\cal D}) = L(\widehat \mu, {\cal D}^{({\cal S})}) - L(\widehat \mu, {\cal D}),\] with \({\cal D}^{({\cal S})}\) being a version of \({\cal D}\), where the covariates \(\boldsymbol{X}_{i,{\cal S}}=(X_{i,s})_{s \in {\cal S}}\) are (jointly) randomly permuted across all instances \(1\le i \le n\).
In case of a single covariate, we generally write \(j\) instead of \({\cal S}=\{j\}\).
Interpretation. The more the model performance deteriorates when predicting on a dataset with shuffled rows in columns \(\boldsymbol{X}_{\cal S}\), the more important the covariates \(\boldsymbol{X}_{\cal S}\) in \(\widehat{\mu}(\cdot)\) for predicting \(Y\).
One often repeats the random permutation process multiple times with different seeds, leading to multiple PVI values \(\{\operatorname{PVI}^{(k)}({\cal S}, {\cal D})\}_{k=1}^K\).
These values are then averaged to the more robust statistic \[ \widehat{\operatorname{PVI}}({\cal S}, {\cal D}) = \frac{1}{K} \sum_{k=1}^K \operatorname{PVI}^{(k)}({\cal S}, {\cal D}).\]
Resulting standard errors and confidence intervals reflect uncertainty.
Note, the model is never refitted during this process.
PVI is calculated on (a subset of) the test set \({\cal T}\) to control for over-fitting. This is an advantage over in-sample variable importance measures such as split gain importance for tree-based models.
PVI can be misleading in the presence of strong dependence between covariates, because in that case the model is likely applied to rare or even impossible covariate combinations, e.g., a car driver aged 20 cannot have a driving experience of 10 years.
By permuting rows of groups of strongly correlated covariates – i.e., jointly included in \({\cal S}\) – one can circumvent this problem by providing their joint importance.
Case weights, e.g., time exposures, should be taken into account by the loss function \(L\); this is not indicated in our notation above.
There is no unique mathematical definition of variable importance. Inconsistent results across different versions of variable importance are common and somewhat expected.
- Permutation importance PVI for GLM (including error bars).
glm_imp <- perm_importance(
glm_freq,
X = test[features],
y = test$frequency,
w = test$Exposure,
m_rep = 5,
loss = "poisson",
n_max = Inf, # use all observations
type = "response" # passed to predict()
)
p0 <- plot(glm_imp) +
ggtitle("GLM - permutation importance with standard errors") +
xlab("increase in mean Poisson deviance loss")
- Permutation importance PVI for LightGBM (including error bars).
lgb_imp <- perm_importance(
lgb_freq,
X = X_test,
y = test$frequency,
w = test$Exposure,
m_rep = 5,
loss = "poisson",
n_max = Inf # use all observations
)
p0 <- plot(lgb_imp) +
ggtitle("LightGBM - permutation importance with standard errors") +
xlab("increase in mean Poisson deviance loss")
- Group covariates to be permuted jointly.
lgb_imp_grouped <- perm_importance(
lgb_freq,
X = X_test,
v = list(Location = c("lDensity", "Region"),
Vehicle = c("VehAge", "VehPower", "VehBrand", "VehGas"),
DrivAge = c("DrivAge")),
y = test$frequency,
w = test$Exposure,
m_rep = 5,
loss = "poisson",
n_max = Inf # use all observations
)
p0 <- plot(lgb_imp_grouped) +
ggtitle("LightGBM - permutation importance with grouped covariates") +
theme(plot.title = element_text(size = 20),
axis.title = element_text(size = 18),
axis.text = element_text(size = 16)) +
xlab("increase in mean Poisson deviance loss")
The location related covariates are about equally (PVI) important as the driver’s age variable.
The vehicle related covariates seem less important.
4 Effects
4.1 Effects: the additive case
The study of main effects is particularly easy in an additive linear regression model \(\boldsymbol{X} \mapsto \mu(\boldsymbol{X})\) \[ \mu(\boldsymbol{X}) = \vartheta_0 + \vartheta_1 X_1 + \dots + \vartheta_q X_q. \]
The effect of covariate \(X_j\) on \(\mu(\boldsymbol{X})\) is given by the coefficient \(\vartheta_j \in \mathbb{R}\), namely, a one-unit increase in \(X_j\) gives an increase of \(\vartheta_j\) in \(\mu(\boldsymbol{X})\).
Similarly, when a model \(\mu\) is additive in \(X_j\), i.e., if \(\mu\) can be decomposed into a function \(\mu_j\), depending only on \(X_j\), and a function \(\mu_{\setminus j}\), depending on all other covariate components \(\boldsymbol{X}_{\setminus j}\) (excluding \(X_j\)) \[ \mu(\boldsymbol{X}) = \mu_j(X_j) + \mu_{\setminus j} (\boldsymbol{X}_{\setminus j}),\] then, \(\mu_j\) gives an accurate description of the effect of \(X_j\) on \(\mu\).
4.2 Additive recovery property
Due to many interactions, the above simplicity is typically lost in a black box model. However, the effects of covariates can still be described approximately by different methods.
A desirable property of such methods is the additive recovery property of Apley and Zhu (2020).
In its simplest form it says that if a model \(\boldsymbol{X} \mapsto \mu(\boldsymbol{X})\) is additive in \(X_j\), as defined above, then the method should be able to recover the additive effect \(\mu_j(X_j)\) up to an additive constant.
If a method satisfies this additive recovery property, the estimated effect of an additive component can be interpreted ceteris paribus.
4.3 Individual conditional expectation
A black box model often contains many complex interactions.
This implies that the effect of a covariate component differs from instance to instance.
We first present individual conditional expectation (ICE) of Goldstein et al. (2015).
ICE visualizes how the predictions of multiple observations change when sliding the value of covariate \(X_j\) over its range, keeping the values of all other covariates fixed.
Warning: ICE does not reflect a proper conditional expectation because it does not consider the dependence structure between \(X_j\) and \(\boldsymbol{X}_{\setminus j}\).
Definition. The ICE function for model \(\boldsymbol{X}\mapsto \mu(\boldsymbol{X})\), the (real-valued) covariate \(X_j\) and the \(i\)-th instance with covariates \(\boldsymbol{X}_i\) is given by \[ z \in \mathbb{R} ~\mapsto~ \mu_{j, \operatorname{ICE}}^{(i)}(z) = \mu(z, \boldsymbol{X}_{i,\setminus j}), \] where \((z,\boldsymbol{X}_{i, \setminus j})\) are the covariates \(\boldsymbol{X}_i\) with the \(j\)-th component replaced by \(z\).
The ICE curve represents the graph \((z_k, \mu_{j, \operatorname{ICE}}^{(i)}(z_k))_{k = 1}^K\) for a grid of values \(z_1 < z_2 < \dots < z_K\).
The ICE plot shows ICE curves of a sample of instances \(i\) (typically drawn at random from the learning data).
Besides giving an impression on the main effect of \(X_j\), an ICE plot visualizes how the effects vary across the instances \(i\), and it indicates the presence of interaction effects.
Property. ICE curves satisfy the additive recovery property:
If a model \(\mu\) is additive in \(X_j\), we get \[ z ~\mapsto~ \mu_{j, \operatorname{ICE}}^{(i)}(z) = \mu_j(z) + \mu_{\setminus j} (\boldsymbol{X}_{i,\setminus j}). \] Thus, an ICE curve of an additively modeled feature \(X_j\) gives an accurate description of the additive effect \(\mu_j(X_j)\).If a model \(\mu\) is additive in \(X_j\), all ICE curves of feature \(X_j\) are parallel.
- ICE curves are often vertically shifted to meet at some value of \(X_j\).
- Such centered ICE plots (called c-ICE plots) visualize the presence of interaction effects.
- If the model involves a link function, ICE curves are often plotted on the link scale to better reveal the interaction effects.
Similar to permutation importance, for strongly dependent covariates, the model \(\mu\) may be applied to rare or even impossible covariate combinations in the ICE plot which can produce misleading ICE interpretations.
- Example c-ICE plot of variable
DrivAge.
p1 <- ice(glm_freq,
v = "DrivAge",
X = learn,
n_max = 200,
trim = c(0, 0.99)) |>
plot(center = TRUE) +
ggtitle("c-ICE on log-scale (GLM)"
)
p2 <- ice(lgb_freq,
v = "DrivAge",
X = X_learn,
n_max = 200,
type = "raw",
trim = c(0, 0.99)) |>
plot(center = TRUE) +
ggtitle("c-ICE on log-scale (LightGBM)"
)
Left: we have an additive GAM component in
DrivAge(cubic spline without interactions).Right: we have interacting covariates in a LightGBM for
DrivAge.
4.4 Partial dependence plot
The point-wise average (over \(i\)) of the ICE curves of a covariate \(X_j\) gives the famous partial dependence plot (PDP) of Friedman (2001).
Thus, the PDP of \(X_j\) shows how the average prediction reacts when sliding the value of \(X_j\) over its range, ceteris paribus.
Definition. The partial dependence function of a model \(\boldsymbol{X} \mapsto \mu(\boldsymbol{X})\) and a covariate subset \({\cal S} \subset \{1, \dots, q\}\) is defined as \[ \boldsymbol{x}_{\cal S}~\mapsto~ \mu_{{\cal S}, \operatorname{PD}}(\boldsymbol{x}_{\cal S}) = \mathbb{E}\left[\mu\left(\boldsymbol{x}_{\cal S}, \boldsymbol{X}_{\setminus {\cal S}}\right)\right],\] where the expectation is over the marginal joint distribution of all variables \(\boldsymbol{X}_{\setminus {\cal S}}\) not represented in \(\boldsymbol{x}_{\cal S}\).
This does not reflect a proper conditional expectation.
Again we write \(j\) in case of a single covariate \({\cal S}=\{j\}\).
Given a dataset \({\cal D} = (\boldsymbol{X}_i, v_i)_{i = 1}^n\), the partial dependence function is estimated empirically by (we omit to write hats) \[ \mu_{{\cal S}, \operatorname{PD}}(\boldsymbol{x}_{\cal S}) = \frac{\sum_{i=1}^n v_i \, \mu(\boldsymbol{x}_{\cal S}, \boldsymbol{X}_{i, \setminus {\cal S}})}{\sum_{i=1}^n v_i},\] where \(\boldsymbol{X}_{i, \setminus {\cal S}}\) denotes the covariate vector \(\boldsymbol{X}_i\) of the \(i\)-th instance without the components given by the set \({\cal S}\).
A PDP visualizes the estimated partial dependence function over a discrete grid.
As “background” dataset \({\cal D}\), one typically uses a subset of the learning data (for computational reasons), e.g., 1000 randomly drawn instances.
The additive recovery property also holds for PDPs.
The c-ICE plot of an additively modeled covariate is a single curve which agrees with the corresponding PDP up to an additive constant.
By definition, the PDP of a covariate \(X_j\) equals the (possibly weighted) average of all ICE curves of the background dataset \({\cal D}\).
In contrast to an ICE plot, the PDP of \(X_j\) does not reveal interaction effects. In fact, it can be seen as an estimate of the main effect of the feature \(X_j\), averaged over all its interaction effects.
The above empirical PDP estimate is sometimes called the “brute force” method. For tree-based models, Friedman (2001) suggested a fast alternative that does not require a background dataset \({\cal D}\), but uses properties of the tree structure.
Like ICE plots, PDPs can be misleading in the presence of strong dependence between covariate components (as they do not consider proper conditional expectations).
- Example PDP of variable
DrivAge.
p3 <- feature_effects(
glm_freq,
data = learn,
v = "DrivAge",
w = "Exposure") |>
plot(title = "PDP and ALE for GLM/GAM", stats = c("pd", "ale"))
p4 <- feature_effects(
lgb_freq,
data = X_learn,
v = "DrivAge",
w = learn$Exposure,
type = "raw") |>
plot(title = "PDP and ALE for LightGBM", stats = c("pd", "ale"))- Remark: ALE will be explained next.

Left: additive GAM component (gray columns show the exposures).
Right: LightGBM.
Due to comparably weak interactions, these two plots look quite similar, a main difference is the extrapolation for large
DrivAge.
4.5 Accumulated local effects
Accumulated local effects (ALE) plots were introduced in Apley and Zhu (2020).
They aim at overcoming some of the problems of PDPs.
Idea. To estimate the effect of a covariate \(X_j\) at value \(z\), calculate the slope of \(\mu_{j, \operatorname{PD}}(z)\), but use only instances with \(X_j \approx z\) from the (local) background dataset in this calculation.
This ensures that the model does not need to extrapolate to rare or even impossible covariate combinations, i.e., it respects the dependence structure within \(\boldsymbol{X}\).
The next algorithm explains how to calculate \(\mu_{j, \operatorname{ALE}}(z_k)\) for a grid of values \(z_0 < z_1 < \dots < z_K\) (often quantiles) and data \({\cal D} = (\boldsymbol{X}_i, v_i)_{i=1}^n\).
Values of \(\mu_{j, \operatorname{ALE}}\) between the grid points are linearly interpolated.
Algorithm ALE.
Input: Model \(\mu\), data \({\cal D}=(\boldsymbol{X}_i, v_i)_{i=1}^n\), continuous covariate \(X_j\), breaks \(z_0 < z_1 < \dots < z_K\).
Initialize: \(\mu_{j, \operatorname{ALE}}(z_0) = 0\).
Compute ALE: For \(k = 1, \ldots, K\), set
- \({\cal N}_k = \{i \in \{1, \dots, n\} : X_{i,j} \in (z_{k-1}, z_k]\}\),
- \(n_k\) the (possibly weighted) number of observations in \({\cal N}_k\).
Calculate the local effects \[ \Delta_k = \frac{1}{n_k} \sum_{i \in {\cal N}_k} v_i \left[ \mu(z_k, \boldsymbol{X}_{i,\setminus j}) - \mu(z_{k-1}, \boldsymbol{X}_{i,\setminus j})\right].\]
Accumulate \[ \mu_{j, \operatorname{ALE}}(z_k) = \mu_{j, \operatorname{ALE}}(z_{k-1}) + \Delta_k.\]
Return: \(\{\mu_{j, \operatorname{ALE}}(z_k)\}_{k=0}^K\).
Apley and Zhu (2020) suggest to center the values of \(\mu_{j, \operatorname{ALE}}\) to have a (weighted) mean of zero. Our alternative is to shift the values to approximately the same level as the corresponding PDP.
The data \({\cal D}\) is often the full learning set, or a large subset of it.
ALE plots satisfy the additive recovery property.
The interpretation of ALE plots is not as straightforward as for PDPs, because the ceteris paribus interpretation does not hold. However, thanks to the additive recovery property, it holds for additive features.
Again thanks to the additive recovery property, the ALE plot of an additively modeled feature \(X_j\) is parallel to the corresponding PDP.
If the ALE plot is similar to the corresponding PDP over a certain data range, this indicates that the ceteris paribus interpretation of the PDP is reliable over that range.
It is straightforward to calculate sample standard deviations of local effects. They quantify local interaction strength, and can also be used to derive standard errors or approximate confidence intervals.
Some implementations offer 2D ALE plots for two continuous covariate components, and also algorithms to deal with categorical covariates, which is non-trivial.

4.6 Marginal plots
A next possibility to describe the effect of a covariate \(X_j\) is the marginal plot (M plot); see Apley and Zhu (2020).
This simple method visualizes the average predictions of a model \(\mu\) against the values of a (binned) covariate \(X_j\).
It is very closely related to the ‘actual vs. predicted’ plot of Section 4.
The quantity considered by an M plot for model \(\mu\), covariate \(X_j\) and value \(x_j\) is the conditional expectation \[ \mu_{j, M}(x_j) = \mathbb{E}[\mu(\boldsymbol{X}) \mid X_j = x_j].\]
This conditional expectation is estimated from data \({\cal D} = (\boldsymbol{X}_i, v_i)_{i=1}^n\).
For a discrete \(X_j\), a straightforward way to empirically estimate this quantity is (we omit hats in the notation) \[ \mu_{j, M}(x_j) = \frac{1}{n(x_j)}\sum_{i \in {\cal N}(x_j)} v_i \, \mu(x_j, \boldsymbol{X}_{i,\setminus j}),\] where
- \({\cal N}(x_j) = \{i \in \{1, \dots, n\} : X_{i,j} = x_j\}\), and
- \(n(x_j)\) is the (possibly weighted) number of observations in \({\cal N}(x_j)\).
For a continuous covariate \(X_j\), one approach is to partition the range of \(X_j\) into bins \(I_k = (z_k, z_{k+1}]\), for \(1 \le k \le K\) and \(z_0 < z_1 < \dots < z_K\), and compute weighted bin averages – this is called histogram binning.
The main disadvantage of M plots is that they do not show the effect of \(X_j\) alone, but rather the combined effect of \(X_j\) and all other (dependent) covariate components \(\boldsymbol{X}_{\setminus j}\).
As such, they also do not satisfy the additive recovery property.
However, they are easy to interpret, and they can directly be compared to the average responses to assess miscalibration/bias per feature.
M plots can be calculated for the learning and/or the test dataset, highlighting different aspects.
Comparison with PDPs or ALE plots gives an idea of how much effect comes from the covariate component \(X_j\) alone, and how much from dependent other components.
Standard errors and approximate confidence intervals are straightforward to calculate.
When M plots are to be compared with average responses, one usually works on the scale of the response (not the link scale).
p5 <- feature_effects(glm_freq,
data = test,
v = "DrivAge",
y = test$frequency,
w = test$Exposure,
type = "response" # response scale so that it compares to frequencies
) |>
plot(title = "GLM/GAM - average obs/pred", stats = c("y_mean", "pred_mean"))
p6 <- feature_effects(lgb_freq,
data = X_test,
v = "DrivAge",
y = test$frequency,
w = test$Exposure
) |>
plot(title = "LightGBM - average obs/pred", stats = c("y_mean", "pred_mean"))
The above M plots show good calibration, only low-exposure parts need some care in interpretation (these graphs are on the response scale).
In low-exposure parts, the M plot of a covariate may be dominated by other covariates.
- Showing all different plots of the main effects in one graph.
p7 <- feature_effects(glm_freq,
data = test,
v = "DrivAge",
y = test$frequency,
w = test$Exposure,
type = "response") |> # response scale
plot(title = "GLM/GAM")
p8 <- feature_effects(lgb_freq,
data = X_test,
v = "DrivAge",
y = test$frequency,
w = test$Exposure) |>
plot(title = "LightGBM")
All graphs show very similar shapes which indicates that
DrivAgedoes not have strong interactions with the other considered covariates.Legend: obs (observations, actuals); pred (average predictions, M plot); pd (partial dependence plot); ale (accumulated local effects).
5 Interaction effects
Typically, one only tries to describe the strongest interactions, as the number of possible interactions grows quadratically with the number of covariate components.
The most common model-agnostic method for measuring interaction strength is due to Friedman and Popescu (2008).
Their \(H\) statistics is based on partial dependence functions.
If there are no interaction effects between \(X_j\) and \(X_k\), their two-dimensional partial dependence function \(\mu_{jk, \operatorname{PD}}\) can be written as the sum of the univariate partial dependence functions, i.e., \[ \mu_{jk, \operatorname{PD}}(x_j, x_k) = \mu_{j, \operatorname{PD}}(x_j) + \mu_{k, \operatorname{PD}}(x_k), \] where all partial dependence functions are centered to have a mean of zero; we abbreviate \(jk\) for \({\cal S}=\{j,k\}\).
Given data \({\cal D} = (\boldsymbol{X}_i, v_i)_{i=1}^n\), for each instance \(i\), the deviation from the above equality can be empirically measured by \[ \delta_{i,jk} = \mu_{jk, \operatorname{PD}}(X_{i,j}, X_{i,k}) - \left( \mu_{j, \operatorname{PD}}(X_{i,j}) + \mu_{k, \operatorname{PD}}(X_{i,k})\right).\]
The weighted sum of the squared deviations \[ \Delta_{jk}^2 = \sum_{i=1}^n v_i\, \delta_{i,jk}^2, \] may serve as a measure of interaction strength between \(X_j\) and \(X_k\).
To turn \(\Delta_{jk}^2\) into a relative measure, Friedman and Popescu (2008) use the normalization \[ H_{jk}^2 = \frac{\Delta^2_{jk}}{\sum_{i=1}^n v_i \left[\mu_{jk, \operatorname{PD}}(X_{i,j}, X_{i,k})\right]^2}. \]
A value of zero means there is no interaction, and the bigger the value of \(H_{jk}^2\), the stronger the interaction.
For weak joint effects of two covariate components, the nominator is small. In this case, even a weak interaction can lead to a large value of \(H_{jk}^2\).
To overcome this problem, a good strategy is to focus on the \(\tilde q \ll q\) most important covariate components.
Another strategy to overcome this issue is to use an absolute measure of interaction strength, e.g., \[ \tilde H_{jk} = \frac{\sqrt{\Delta_{jk}^2}}{\sqrt{\sum_{i=1}^n v_i}}, \] as suggested in Mayer and Lorentzen (2025).
\(H\) statistics are based on partial dependence estimates and, thus, are as good or bad as these.
- The following code gives the two versions \(H_{jk}^2\) and \(\tilde H_{jk}\).
#
# interaction strengths will be measured among the following covariates
some_features <- c("DrivAge", "lDensity", "VehAge", "Region")
H <- hstats(
lgb_freq,
X = X_learn,
v = some_features,
w = learn$Exposure
)
p1 <- plot(h2_pairwise(H))
p2 <- plot(h2_pairwise(H, normalize = FALSE, squared = FALSE))p1 + p2
- These figures show \(H^2_{jk}\) and \(\tilde H_{jk}\) for the four selected covariates for the LightGBM based on a random sample of 500 instances from the learning dataset.
The left-hand side shows that even the strongest relative interaction explains less than \(3\%\) of the variability of the joint pairwise effect.
Thus, all interactions are very weak. This partly explains why the LightGBM does not perform significantly better than the GLM.
The right-hand side shows that the strongest absolute interaction occurs between the two most important covariates
DrivAgeandlDensity.Since the statistics have been evaluated on the link scale, the corresponding values for the GLM are all zero (we have an additive structure in the GLM on the link scale).
Once having identified the strongest interactions, one can visualize them in different ways:
By two-dimensional PDPs of the two covariate components, either as a heatmap, a 3D wireframe (landscape), or as a grouped line plot with one component on the \(x\)-axis and the other (binned) as grouping factor.
When interactions are weak, it might be difficult to distinguish the pure interaction effect from the main effects of the two covariates. In this case, we suggest to subtract the values of the two univariate PDPs from the two-dimensional PDP and center the resulting values to mean zero, see definition of \(\delta_{i,jk}\).
As a PDP of one covariate component, with the background dataset for its construction being stratified by the values of the other (binned) covariate.
As an ICE plot of one covariate component, using the values of the other one to colorize the ICE curves.
- Interaction effect as a heatmap and a grouped line plot.
v1 <- "DrivAge"
v2 <- "lDensity"
pdp_2d <- partial_dep(
lgb_freq,
v = c(v1, v2),
X = X_learn,
w = learn$Exposure,
type = "raw", # passed to predict()
grid_size = 50^2,
trim = c(0, 0.99))
p1 <- plot(pdp_2d) +
ggtitle("2D PDP as heatmap")
p2 <- plot(pdp_2d, d2_geom = "line", show_points = FALSE) +
ggtitle("2D PDP as grouped line plot")
Two-dimensional PDP of the two covariates with the strongest interaction
DrivAgeandlDensity.Left: heatmap; right: grouped line plot by
lDensity
- Interaction effect as stratified PDP and colorized c-ICE plot.
p3 <- partial_dep(lgb_freq,
v = v1,
X = X_learn,
BY = v2,
w = learn$Exposure,
type = "raw", # passed to predict()
trim = c(0, 0.99)) |>
plot(show_points = FALSE) +
ggtitle("Stratified PDP")
p4 <- ice(lgb_freq,
v = v1,
X = X_learn,
BY = v2,
type = "raw", # passed to predict()
trim = c(0, 0.99)) |>
plot(center = TRUE) +
ggtitle("Colorized c-ICE plot")
Left: PDP of
DrivAge, background dataset stratified bylDensitybins.Right: c-ICE plot of
DrivAgecolored bylDensity.
Copyright
© The Authors
This notebook and these slides are part of the project “AI Tools for Actuaries”. The lecture notes can be downloaded from:
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5162304
\(\,\)
- This material is provided to reusers to distribute, remix, adapt, and build upon the material in any medium or format for noncommercial purposes only, and only so long as attribution and credit is given to the original authors and source, and if you indicate if changes were made. This aligns with the Creative Commons Attribution 4.0 International License CC BY-NC.