AI TOOLS FOR ACTUARIES
Chapter 4: Interlude

Authors

Marco Maggi, Ronald Richman, Salvatore Scognamiglio, Mario V. Wüthrich

Published

March 2, 2026

Abstract
This chapter presents different properties and tools, such as unbiasedness, the balance property and auto-calibration, which all refer to model calibration. Next, it presents two general-purpose regression tools, local regression and isotonic regression. Finally, it discusses the Gini score for model selection, lift charts and Murphy’s score decomposition. This chapter is not strictly necessary for diving into the machine learning and AI tools, and the fast reader can come back to it at a later stage.

1 Statistical bias and the balance property

Overview
  • It is crucial for insurance pricing models (schemes) to be unbiased, meaning that the average price level over the entire portfolio needs to be correctly specified.

  • This section introduces the notion of bias, and it presents the balance property as a feasible tool to assess the average price level.

1.1 Biases in a broader sense

Generally, unbiasedness is an important property in actuarial pricing, regardless of the specific meaning one attaches to unbiasedness.

  • An in-sample bias needs to be avoided in model selection, otherwise the predictor generalizes poorly to new data.

  • An estimated regression model should be void of a statistical bias to ensure that the average price level is correctly specified.

  • Most regression models include an intercept (see GLM chapter). This intercept is called the bias term in the machine learning literature.

  • There is some concern about unfair discrimination in insurance pricing. Any kind of unfair treatment of individuals or groups with similar features is related to a bias, coined unfair discrimination bias.

1.2 Statistical bias

  • The regression function \(\mu:\boldsymbol{X}\mapsto \mu(\boldsymbol{X})\) is (globally) statistically unbiased for \((Y,\boldsymbol{X}, v)\), if \[ {\mathbb E}[v\mu(\boldsymbol{X})] = {\mathbb E}[vY].\]

  • If we work with an estimated model \(\widehat{\mu}_{\cal L}\), being fitted on a learning sample \({\cal L}=(Y_i,\boldsymbol{X}_i, v_i)_{i=1}^n\), we typically also average over \({\cal L}\) for a global unbiasedness assessment \[ {\mathbb E}\left[v\widehat{\mu}_{\cal L}(\boldsymbol{X})\right] = {\mathbb E}[vY],\] and we generally assume that the test sample \((Y,\boldsymbol{X},v)\) is independent of the learning sample \({\cal L}\).

  • Difficulty: (Empirical) verification of unbiasedness needs knowledge of the mean \({\mathbb E}[vY]\) and the possibility to re-sample \({\cal L}\) and \((\boldsymbol{X},v)\).

1.3 Remarks on the statistical bias

  • For global unbiasedness, one re-samples \((\boldsymbol{X},v)\). An insurer may (only) want to be unbiased for its specific portfolio \({\cal T}=(Y_t,\boldsymbol{X}_t, v_t)_{t=1}^m\).

  • Conditional global unbiasedness considers \[ \sum_{t=1}^m v_t{\mathbb E}\left[\left.\widehat{\mu}_{\cal L}(\boldsymbol{X}_t)\right| \boldsymbol{X}_t, v_t \right] = \sum_{t=1}^m v_t {\mathbb E}\left[\left.Y_t \right|\boldsymbol{X}_t, v_t \right].\]

  • There are more similar versions of conditional global unbiasedness.

  • Generally, global unbiasedness is difficult to verify, because the average claims level on the right-hand side is unknown.

1.4 The balance property

  • The balance property is an in-sample property that is easy to verify.

  • The balance property was introduced by Bühlmann and Gisler (2005) in the context of credibility; see Lindholm and Wüthrich (2025).

  • Definition. A regression model fitting procedure \({\cal L} \mapsto \widehat{\mu}_{\cal L}\) satisfies the balance property if for almost every (a.e.) realization of the learning sample \({\cal L}=(Y_i,\boldsymbol{X}_i, v_i)_{i=1}^n\) the following identity holds \[ \sum_{i=1}^n v_i\, \widehat{\mu}_{\cal L}(\boldsymbol{X}_i) = \sum_{i=1}^n v_iY_i.\]

  • The balance property is an in-sample property that does not require the knowledge of the true expected (mean) level \({\mathbb E}[vY]\).

1.5 Remarks on the balance property

  • The balance property is a re-allocation of the total (aggregate, portfolio) claim \(\sum_{i=1}^n v_iY_i\) to all insurance policyholders \(1\le i \le n\), such that this collective bears the entire aggregate claim.

  • Generally, in actuarial science, model fitting procedures that have this balance property are preferable.

  • Theorem. MLE-estimated GLMs using the canonical link fulfill the balance property; for non-canonical link GLMs the balance property fails to hold; see Lindholm and Wüthrich (2025).

  • If the balance property is not fulfilled, a correction should be applied. In most cases, one adjusts the bias term/intercept correspondingly.

  • A correction may also be necessary if the future claims level changes, e.g., because of inflation (non-stationarity) – we forecast a future calendar year.

1.6 French MTPL GLM example: balance property

  • We revisit the Poisson log-link GLM on the French MTPL claims data.
## fit Poisson log-link GLM on learning sample
d.glm  <- glm(ClaimNb ~ DrivAgeGLM + VehBrand + VehGas + DensityGLM + AreaGLM, 
              data=learn, offset=log(Exposure), family=poisson())

## predict in-sample and out-of-sample
learn$GLM <- fitted(d.glm)
test$GLM  <- predict(d.glm, newdata=test, type="response")

## verify the (in-sample) balance property
c(sum(learn$GLM)/sum(learn$Exposure), sum(learn$ClaimNb)/sum(learn$Exposure))
[1] 0.07363081 0.07363081
  • Note: The log-link is the canonical link in the Poisson GLM.

  • Otherwise, adjust the bias parameter d.glm$coefficients[1].
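
For illustration, a minimal sketch of such a correction under the log-link (a multiplicative re-scaling of the fitted means, which is equivalent to shifting the intercept; for the canonical log-link the correction factor equals 1 up to rounding):

## balance correction sketch (log-link): re-scale the fitted means such that
## the in-sample balance property holds exactly
correction <- sum(learn$ClaimNb) / sum(learn$GLM)   # equals 1 for the canonical log-link
learn$GLM  <- learn$GLM * correction
test$GLM   <- test$GLM  * correction
## equivalently, shift the intercept:
## d.glm$coefficients[1] <- d.glm$coefficients[1] + log(correction)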

2 Auto-calibration

Overview
  • Auto-calibration is an important property that insurance pricing schemes should fulfill.

  • The above notion of bias is a global property, whereas auto-calibration is a local property.

  • Auto-calibration implies that there is no systematic cross-financing between the different price cohorts in the selected tariff scheme.

2.1 Definition of auto-calibration

  • Definition. A regression function \(\mu:{\cal X}\to {\mathbb R}\) is auto-calibrated for the tuple \((Y,\boldsymbol{X})\) if, a.s., \[ \mu(\boldsymbol{X}) = {\mathbb E} \left[ \left. Y \right| \mu(\boldsymbol{X}) \right].\]

  • Auto-calibration implies that every price cohort \(\mu(\boldsymbol{X})\) is on average self-financing for its corresponding claims \(Y\).

  • Auto-calibrated pricing schemes avoid systematic cross-financing.

  • The following graph shows a violation of auto-calibration.


  • Price cohort 1 is systematically cross-financed by price cohort 3.

2.2 Properties of auto-calibration

  • The true regression function \(\mu^*\) is auto-calibrated. This implies that any non-auto-calibrated regression function cannot be the true one.

  • The global mean \(\mu_0= {\mathbb E}[Y]\) is auto-calibrated.

  • Typically, there are infinitely many auto-calibrated regression models.

  • Consider any regression function \(\mu:{\cal X}\to {\mathbb R}\). The re-calibration step gives an auto-calibrated regression function \[ \mu_{\rm rc}(\boldsymbol{X}) :={\mathbb E} \left[ \left. Y \right| \mu(\boldsymbol{X}) \right].\]

  • This re-calibration step can be performed empirically by an isotonic regression; see section below.

2.3 MTPL GLM example, continued: auto-calibration

  • Consider decile binning to verify/reject auto-calibration.
# we construct an actual vs. predicted plot - using decile binning 
library(tidyverse)

# decile binning approach (out-of-sample on test data)
test$freq <- test$GLM/test$Exposure  # GLM predictor from above
qq        <- quantile(test$freq, probs = c(0:10)/10)
test$qq   <- 1

for (t0 in 2:10){test$qq <- test$qq + as.integer(test$freq>qq[t0])}
dd <- data.frame(test %>%  group_by(qq) %>%
                           summarize(yy = sum(ClaimNb),
                                     mm = sum(GLM),
                                     vv = sum(Exposure)))
#
dd$yy <- dd$yy/dd$vv     # bin averages    -> y-axis of next graph (actuals)
dd$mm <- dd$mm/dd$vv     # bin barycenters -> x-axis (average predictor)
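
The following is a minimal base R plotting sketch (the plotting style is an assumption) showing the binned actuals dd$yy against the binned predictors dd$mm, with the diagonal as calibration reference:

## actuals vs. predicted plot sketch (decile binning)
plot(dd$mm, dd$yy, col="blue", pch=19,
     xlab="predicted frequency (bin barycenters)", ylab="observed frequency (bin averages)",
     main="actuals vs. predicted plot (decile binning)")
abline(a=0, b=1, col="orange", lwd=2)   # orange diagonal = perfect calibration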

  • Are the blue dots on the orange diagonal line?
  • Observe the non-monotonicity.
  • Is this auto-calibrated?

2.4 Actuals vs. predicted plot

  • The above plot is known as the actuals vs. predicted plot.

  • It shows the observations (actuals) \(Y\) on the \(y\)-axis against the predictors \(\widehat{\mu}(\boldsymbol{X})\) on the \(x\)-axis.

  • The quantile binning reduces volatility, and one builds empirical means on these bins for actuals and predictors.

  • The rectangles show the decile bounds. The \(x\)-locations of the blue dots correspond to the barycenters of the predictors in these decile bins.

  • Often, for illustration, one plots the axes on the log-scale.

  • If the regression function \(\widehat{\mu}_{\cal L}(\cdot)\) has been estimated on a learning sample \({\cal L}\), this plot should be performed on the test sample \({\cal T}\).

3 Local regression

Overview
  • Local regression is a non-parametric regression method.

  • It is a general purpose tool that is often used to smooth graphs.

\(\,\)

Literature: Loader (1999); author of the R library locfit.

\(\,\)

  • Goal: Locally fit a polynomial regression function to a sample \((Y_i,X_i, v_i)_{i=1}^n\) with one-dimensional covariates \(X_i \in {\mathbb R}\).

3.1 Smoothing window

  • Fit a regression value \(\widehat{\mu}^{\rm loc}(X)\) at a fixed covariate value \(X\in {\mathbb R}\) by only considering the instances \(i\) with covariates \(X_i\) in the neighborhood of \(X\).

  • Select bandwidth \(\delta(X)>0\) to define the smoothing window around \(X\) \[\Delta(X)=\Big(X-\delta(X),\, X+\delta(X)\Big).\]

  • Only instances with \(X_i \in \Delta(X)\) are considered for estimating \(\widehat{\mu}^{\rm loc}(X)\).

  • Select a weighting function; often tricube weighting is used on \(u\in [-1,1]\) \[ w(u)=(1-|u|^3)^3.\]

  • This weighs the instances \(i\) within the smoothing window w.r.t. their relative distances \(u_i:=(X_i-X)/\delta(X)\) to \(X\).

3.2 Local spline regression

  • Fit a spline to the weighted observations in this smoothing window.

  • For illustration, select a quadratic spline \[\begin{equation*} x ~\mapsto~ \mu_{\vartheta}(x;X)=\vartheta_0 + \vartheta_1(x-X) + \vartheta_2 (x-X)^2, \end{equation*}\] with regression parameter \(\vartheta=(\vartheta_j)_{j=0}^2 \in {\mathbb R}^3\).

  • Consider the weighted local regression least squares problem \[ \widehat{\vartheta} = \underset{\vartheta}{\arg\min}\, \sum_{i=1}^n v_i \,\mathbf{1}_{\{X_i \in \Delta(X)\}}\, w\left(\frac{X_i-X}{\delta(X)}\right) \Big( Y_i - \mu_{\vartheta}(X_i;X)\Big)^2.\]

  • The fitted local regression value in \(X\) is obtained by setting \[ \widehat{\mu}^{\rm loc}(X):= \mu_{\widehat{\vartheta}}(X;X)= \widehat{\vartheta}_0.\]
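
To make this concrete, the weighted local least squares problem can be solved with lm(); the following is a minimal sketch for a single evaluation point (the names local_fit, x0 and delta0 are hypothetical):

## local quadratic fit at a single point x0 with tricube weights (a sketch);
## Y, X, v denote the response, covariate and weight vectors of the sample
local_fit <- function(x0, delta0, X, Y, v) {
  u   <- (X - x0) / delta0                            # relative distances
  wt  <- v * ifelse(abs(u) < 1, (1 - abs(u)^3)^3, 0)  # tricube weights in the window
  fit <- lm(Y ~ I(X - x0) + I((X - x0)^2), weights = wt)
  unname(coef(fit)[1])                                # fitted value = intercept theta_0
}

Looping local_fit over a grid of evaluation points x0 traces out the local regression curve \(\widehat{\mu}^{\rm loc}(\cdot)\).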

3.3 MTPL GLM example: auto-calibration with local regression

  • We continue the French MTPL Poisson GLM example from above (using the log-link).

  • Above we used decile binning to analyze auto-calibration.

  • We now test for auto-calibration of this GLM by using a local regression plot.

  • For the local regression \(Y_i \sim X_i:=\widehat{\mu}^{\rm GLM}(\boldsymbol{X}_i)\) we use quadratic splines. The bandwidth \(\delta(X_i)\) is chosen such that the smoothing window \(\Delta(X_i)\) contains a nearest neighbor fraction of \(\alpha=10\%\) of the data.

3.4 Auto-calibration - local regression

## we construct an actual vs. predicted plot - using local regression
library(locfit)

# select a random subsample of 1,000 instances for the following plot (illustration)
set.seed(100)
kk <- sample(c(1:nrow(test)), size=1000)

# local regression fit (quadratic splines, nearest neighbor fraction alpha=10%)
spline0 <- predict(locfit(test$ClaimNb/test$Exposure ~ test$freq,
                          weights=test$Exposure, alpha=0.1, deg=2),
                   newdata=test[kk,]$freq)
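
A minimal plotting sketch for the fitted local regression over the subsample (the plotting style is an assumption):

## local regression auto-calibration plot sketch: blue dots = local regression of
## the actuals, orange diagonal = perfect calibration
plot(test[kk,]$freq, spline0, col="blue", pch=19, cex=0.5,
     xlab="predicted frequency (GLM)", ylab="local regression of actuals",
     main="actuals vs. predicted plot (local regression)")
abline(a=0, b=1, col="orange", lwd=2)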

  • The hyper-parameters of the nearest neighbor fraction \(\alpha \in (0,1]\) and the degree of the selected splines have a crucial influence on the results.

  • The tails seem fine, but the fluctuations in the middle seem large.

4 Isotonic regression

Overview
  • Isotonic regression is a non-parametric regression.

  • An isotonic regression does not involve any hyper-parameters, but is based on the assumption of isotonicity.

  • An isotonic regression solution is (empirically) auto-calibrated.

Isotonic regression goes back to Ayer et al. (1955), Brunk, Ewing and Utz (1957), Miles (1959), Kruskal (1964), Barlow et al. (1972), Barlow and Brunk (1972).

\(\,\)

Tip

The fast reader can skip this section and come back to it later.

4.1 Isotonic regression

  • A disadvantage of the previous local regression is that it is very sensitive to the hyper-parameter choice.

  • Isotonic regression is a non-parametric regression method that does not involve any hyper-parameters.

  • It is based on the assumption of isotonicity between the one-dimensional real-valued covariates \(X_i\) and the regression values \(\mu^*(X_i):={\mathbb E}[Y_i|X_i]\).

  • That is, for an isotonic regression we assume for all \(1 \le i, i' \le n\) \[ X_i \le X_{i'} \quad \Longleftrightarrow \quad \mu^*(X_i) \le \mu^*(X_{i'}).\]

  • Thus, the isotonic regression is rank-preserving for a given ranking \((X_i)_{i=1}^n\).

4.2 Preliminary remark on isotonic regression

  • For performing an isotonic regression on a sample \((Y_i,X_i, v_i)_{i=1}^n\) with one-dimensional real-valued covariates \(X_i \in {\mathbb R}\), we assume w.l.o.g. that there are no ties in the ranks \(X_1 < X_2 < \ldots < X_n\); otherwise the corresponding instances are merged in the sense of sufficient statistics \[\begin{eqnarray*} v_{i} &\leftarrow& v_i+\ldots +v_{i+k}, \\ Y_{i} &\leftarrow& \frac{v_i Y_i+\ldots +v_{i+k} Y_{i+k}}{v_i+\ldots + v_{i+k}}, \end{eqnarray*}\] in case of a tie \(X_{i}=\ldots = X_{i+k}\).

  • This is w.l.o.g. because the algorithm to solve the following isotonic regression problem precisely uses this type of merging (binning).
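
A minimal dplyr sketch of this tie-merging step (assuming a data frame df with hypothetical column names X, Y and v):

## merge tied covariate values into weight-averaged observations (a sketch)
library(dplyr)
df_merged <- df %>%
  group_by(X) %>%
  summarize(Y = weighted.mean(Y, w = v),   # weighted response average per tie group
            v = sum(v),                    # aggregated weight
            .groups = "drop") %>%
  arrange(X)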

4.3 Isotonic regression solution

  • The isotonic regression estimate \(\widehat{\boldsymbol{\mu}}^{\rm iso}\in {\mathbb R}^n\) under the ranking assumption \(X_1 < X_2 < \ldots < X_n\) is the constraint solution to \[ \widehat{\boldsymbol{\mu}}^{\rm iso} ~=~\underset{\boldsymbol{\mu} \in {\mathbb R}^n}{\arg\min}~ \sum_{i=1}^n v_i \left(Y_i - \mu_i\right)^2, \qquad \text{subject to $\mu_1 \le \ldots \le \mu_n$.}\]

  • Remarks.

    • Isotonic regression uses the (strictly consistent) square loss. But, any strictly consistent loss function gives the identical solution \(\widehat{\boldsymbol{\mu}}^{\rm iso}\).

    • The pool adjacent violators (PAV) algorithm gives a fast implementation to solve this isotonic regression; see Leeuw, Hornik and Mair (2009); a minimal sketch follows after these remarks.

    • From this PAV algorithm one sees that the solution \(\widehat{\boldsymbol{\mu}}^{\rm iso}\) is empirically auto-calibrated; this is the isotonic re-calibration step suggested in Wüthrich and Ziegel (2024).

    • Interpolate \(\widehat{\mu}^{\rm iso}(X_i):=\widehat{\mu}^{\rm iso}_i\), \(1\le i \le n\), by a step-function.
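
As mentioned above, the PAV algorithm repeatedly pools adjacent blocks that violate monotonicity into their weighted means; the following is a minimal R sketch for illustration only (the monotone package used below provides a fast implementation):

## pool adjacent violators (PAV), a minimal sketch; y and w are assumed to be
## ordered according to the ranking X_1 < X_2 < ... < X_n
pav <- function(y, w = rep(1, length(y))) {
  n  <- length(y)
  m  <- numeric(n); wt <- numeric(n); sz <- integer(n)   # block means, weights, sizes
  nb <- 0L                                               # number of blocks
  for (i in 1:n) {
    nb <- nb + 1L; m[nb] <- y[i]; wt[nb] <- w[i]; sz[nb] <- 1L
    while (nb > 1L && m[nb - 1L] > m[nb]) {
      # pool the two adjacent violating blocks into their weighted mean
      m[nb - 1L]  <- (wt[nb - 1L] * m[nb - 1L] + wt[nb] * m[nb]) / (wt[nb - 1L] + wt[nb])
      wt[nb - 1L] <- wt[nb - 1L] + wt[nb]
      sz[nb - 1L] <- sz[nb - 1L] + sz[nb]
      nb <- nb - 1L
    }
  }
  rep(m[1:nb], times = sz[1:nb])   # expand the block means back to all n instances
}

Within each final block, the fitted value is the weighted average of the corresponding actuals, which is exactly the empirical auto-calibration property mentioned above.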

4.4 MTPL GLM example: isotonic regression

  • We continue the French MTPL Poisson GLM example from above (using the log-link).

  • We now want to test for auto-calibration of this GLM using an isotonic regression plot.

## we construct an actual vs. predicted plot - using isotonic regression
library(monotone)

## the following algorithm requires ordered samples
test$kk       <- 0     # sub-sample for plots
test[kk,"kk"] <- 1     # sub-sample for plots (same as local regression)
test          <- test[order(test$freq),]

## isotonic regression
isotonic  <- monotone(x=test$ClaimNb/test$Exposure, w=test$Exposure)
## prediction (include 0)
fit       <- stepfun(test$freq, c(0, isotonic))
test$iso  <- fit(test$freq)
## isotonic regression isn't strictly positive -> merge 3 smallest classes
(mm <- sort(unique(test$iso))[1:5])
[1] 0.00000000 0.02180905 0.03647219 0.04187979 0.04317795
k1 <- 3
vv <- sum(test[which(test$iso <= mm[k1]),]$ClaimNb) / sum(test[which(test$iso <= mm[k1]),]$Exposure)
test[which(test$iso <= mm[k1]),]$iso <- vv

Over-fitting at the upper end may require the same procedure; we refrain from doing so here, although the following plot supports merging the largest values, too.


  • This is a monotone regression function; no hyper-parameters are involved.

  • Over-fitting at both ends should be taken care of; see the previous slide.

  • The graph seems to support auto-calibration, except maybe in the tails.

4.5 Isotonic re-calibration

  • Isotonic regression gives an empirically auto-calibrated regression.

  • This motivates the following two-step fitting procedure:

    1. Fit a regression model to find the estimated regression function \(\widehat{\mu}(\cdot)\).

    2. Apply an isotonic regression using the real-valued covariates (ranks) \(X_i:=\widehat{\mu}(\boldsymbol{X}_i)\) to obtain an auto-calibrated regression model.

  • The first step needs to get the ranking correct, and the second step lifts these ranks to the right level.

  • To find the correct ranking in the first step, one can also work on logged data (e.g., for claim sizes).

  • The second fitting step is called isotonic re-calibration, and it provides auto-calibration; see Wüthrich and Ziegel (2024).

5 Gini score

Overview
  • The Gini score is a very popular method for model selection frequently used in the machine learning community.

  • The Gini score is a purely rank-based statistic.

  • The Gini score needs some care because it is only a valid model selection tool among auto-calibrated regression functions.

\(\,\)

The Gini (1936) score goes back to a concept in economics trying to measure the disparity of the wealth distribution in a given population.

It is based on the Lorenz (1905) curve.

5.1 Lorenz curve

  • We start from an observed sample \((Y_i,\boldsymbol{X}_i, v_i \equiv 1)_{i=1}^n\) with ordering \[\begin{equation*} \mu(\boldsymbol{X}_{1})< \mu(\boldsymbol{X}_{2}) < \ldots < \mu(\boldsymbol{X}_{n}), \end{equation*}\] for the given regression function \(\mu:{\cal X}\to {\mathbb R}\). For simplicity, assume there are no ties in this ordering.

  • The empirical Lorenz curve is for \(\alpha \in (0,1)\) given by\[ \widehat{L}_\mu\left(\alpha\right)= \frac{1}{\frac{1}{n}\sum_{i=1}^n \mu(\boldsymbol{X}_i)} \, \frac{1}{n} \sum_{i=\lceil (1- \alpha) n \rceil + 1}^n \mu(\boldsymbol{X}_{i});\] note that this is a mirrored version of the classical definition.

  • Interpretation: \(\widehat{L}_\mu(\alpha)\) quantifies the contribution of the largest regression values \((\mu(\boldsymbol{X}_{i}))_{i=\lceil (1- \alpha) n \rceil + 1}^n\) to the portfolio average \(n^{-1}\sum_{i=1}^n \mu(\boldsymbol{X}_i)\).

5.2 Cumulative accuracy profile

  • The cumulative accuracy profile is for \(\alpha \in (0,1)\) given by \[ \widehat{C}_\mu (\alpha)~=~ \frac{1}{\frac{1}{n}\sum_{i=1}^n Y_i} \, \frac{1}{n} \sum_{i=\lceil (1- \alpha) n \rceil + 1}^{n} Y_i.\]

  • Compared to the Lorenz curve \(\widehat{L}_\mu\left(\alpha\right)\), we replace the predictors \(\mu(\boldsymbol{X}_i)\) by the actuals \(Y_i\), but we keep the order of the regression values \(\mu(\boldsymbol{X}_{1})< \mu(\boldsymbol{X}_{2}) < \ldots < \mu(\boldsymbol{X}_{n})\), i.e., the labeling \(1\le i \le n\).

  • Idea: The better the discrimination of the selected regression model \(\mu\) for the responses \(Y\), the bigger the area under the curve (AUC) implied by the function \[ \alpha~\mapsto~\widehat{C}_\mu (\alpha);\] see next plot for the AUC.

5.3 Area under the curve

5.4 Definition Gini score

  • The orange dotted line shows the cumulative accuracy profile of perfectly aligned actuals \((Y_i)_{i=1}^n\) and predictors \((\mu^\dagger(\boldsymbol{X}_i))_{i=1}^n\): this gives the maximal \(\text{area}({\rm A}+{\rm B})\).

  • The red line shows the cumulative accuracy profile \(\alpha\mapsto\widehat{C}_\mu (\alpha)\) w.r.t. the selected predictors \((\mu(\boldsymbol{X}_i))_{i=1}^n\); this gives the \(\text{area}({\rm A})\).

  • The dotted blue line corresponds to the null model \(\widehat{\mu}_0=n^{-1}\sum_{i=1}^n Y_i\), not considering any covariates, but the global empirical mean \(\widehat{\mu}_0\) instead; this gives the minimal area of zero.

  • The Gini score of the selected regression model \(\mu\) is defined by \[ {\rm Gini}({\mu}) ~=~ \frac{\text{area}({\rm A})}{\text{area}({\rm A}+{\rm B})} ~\le ~1.\]
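
For unit case weights and no ties, the Gini score can be computed empirically as follows; this is a minimal sketch (the helper name gini_score is hypothetical):

## empirical Gini score, a minimal sketch for unit case weights and no ties
gini_score <- function(y, mu) {
  n <- length(y)
  cap_area <- function(ranking) {                  # area between the CAP and the diagonal
    ys  <- y[order(ranking, decreasing = TRUE)]    # largest ranked values first
    cap <- cumsum(ys) / sum(ys)                    # cumulative accuracy profile
    mean(cap - (1:n) / n)
  }
  cap_area(mu) / cap_area(y)                       # area(A) / area(A + B)
}
## example (ignoring exposures and ties; see Brauer and Wüthrich (2025) for those):
## gini_score(test$ClaimNb / test$Exposure, test$freq)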

5.5 Interpretation Gini score

  • Generally, a bigger Gini score is interpreted as a better discrimination of the selected regression model \(\mu(\cdot)\) for the actuals \(Y\).

  • The Gini score (defined by the AUC) is equivalent to the so-called receiver operating characteristic (ROC) curve method for binary responses \(Y\); see Tasche (2006).

  • Careful: The Gini score is not a strictly consistent model selection tool. It is fully rank based, but it does not consider whether the model lives on the right scale. However, if we can additionally ascertain that all considered regression functions \(\mu(\cdot)\) are auto-calibrated for \((Y,\boldsymbol{X})\), the Gini score is a sensible model selection tool; see Wüthrich (2023).

  • For integration of case weights and dealing with ties in the Gini score see Brauer and Wüthrich (2025).

6 Lift charts

Overview
  • Often, modified versions of the actuals vs. predicted plot are used for model selection.

  • This section presents lift charts and double lift charts, which are popular actuarial tools for model selection.

\(\,\)

To be able to perform a model comparison, we fit a second GLM to the French MTPL claims data. This second GLM uses different covariates than the first one, so that we have two different GLMs, d.glm and d.glm2.

6.1 Second regression model

## second Poisson log-link GLM with different covariates (we drop AreaGLM, and we add BonusMalusGLM and VehAgeGLM compared to the previous GLM)

d.glm2  <- glm(ClaimNb ~ DrivAgeGLM + BonusMalusGLM + VehBrand + VehGas + DensityGLM + VehAgeGLM, 
               data=learn, offset=log(Exposure), family=poisson())

## predict in-sample and out-of-sample
learn$GLM2 <- predict(d.glm2, newdata=learn, type="response")
test$GLM2  <- predict(d.glm2, newdata=test,  type="response")
  • We now have two different regression models d.glm and d.glm2 that we can compare.

6.2 In-sample and out-of-sample Poisson deviance losses

Poisson.Deviance <- function(pred, obs, weights){  # scaled with 10^2
  10^2 * 2*(sum(pred)-sum(obs)+sum(log((obs/pred)^(obs))))/sum(weights)}

# first GLM (in- and out-of-sample losses)
round(c(Poisson.Deviance(learn$GLM, learn$ClaimNb, learn$Exposure), Poisson.Deviance(test$GLM, test$ClaimNb, test$Exposure)), 3)
[1] 46.954 47.179
# second GLM (in- and out-of-sample losses)
round(c(Poisson.Deviance(learn$GLM2, learn$ClaimNb, learn$Exposure), Poisson.Deviance(test$GLM2, test$ClaimNb, test$Exposure)), 3)
[1] 45.706 45.669
  • The second GLM seems better. Can we verify this?

6.3 Decile binning of the second GLM

# decile binning for the second GLM (out-of-sample on test data)

test$freq2 <- test$GLM2/test$Exposure  # predictor GLM2
qq2 <- quantile(test$freq2, probs = c(0:10)/10)

test$qq2 <- 1
for (t0 in 2:10){test$qq2 <- test$qq2 + as.integer(test$freq2>qq2[t0])}

dd2 <- data.frame(test %>%  group_by(qq2) %>%
                           summarize(yy = sum(ClaimNb),
                                     mm = sum(GLM2),
                                     vv = sum(Exposure)))
#
dd2$yy <- dd2$yy/dd2$vv     # bin averages    -> y-axis (actuals)
dd2$mm <- dd2$mm/dd2$vv     # bin barycenters -> x-axis (predictor averages)

6.4 Actuals vs. predicted plot using decile binning

From these plots it is difficult to draw conclusions about model selection.

6.5 Lift chart

A lift chart shows the same statistics but in a different graph: it plots the binned prediction averages and the binned response averages both on the \(y\)-axis, and on the \(x\)-axis it shows the bin labels.

There is the following interpretation; see Goldburd et al. (2020).

  • Auto-calibration. Under auto-calibration the predictors (orange circles) and the actuals (blue triangles) will approximately coincide; see next graph.

  • Monotonicity. The actuals (blue triangles) should be monotone under auto-calibration (up to the pure noise).

  • Discrimination. The better the regression model can discriminate the claims, the bigger the difference between the lowest and the highest quantiles; this is called the lift.
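
For illustration, the following is a minimal base R sketch of such a lift chart for the first GLM (the plotting style is an assumption), reusing the decile-binned data dd from Section 2.3:

## lift chart sketch: binned predictions (orange circles) and binned actuals
## (blue triangles) on the y-axis, decile bin labels on the x-axis
plot(dd$qq, dd$mm, type="b", pch=19, col="orange", ylim=range(c(dd$mm, dd$yy)),
     xlab="decile bin (ordered by predictor)", ylab="frequency", main="lift chart, Model 1")
lines(dd$qq, dd$yy, type="b", pch=17, col="blue")
legend("topleft", legend=c("predicted", "actuals"), col=c("orange", "blue"), pch=c(19, 17))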

6.6 Lift chart example

  • These two charts use the same scale on the \(y\)-axis.

  • Auto-calibration. Model 1 seems fairly well auto-calibrated, Model 2 seems to have a bias in the lower tail (small predictions subsidize large prediction values).

  • Monotonicity. Model 2 provides monotonicity, Model 1 does not.

  • Discrimination. Model 2 has a clearly bigger lift than Model 1.

In conclusion, we give preference to Model 2 because it better discriminates the claims (has a bigger lift), and we should still take care of the bias (miscalibration) in the lower and upper tails.

This result also indicates that there needs to exist a (clearly) better regression model that mitigates the weaknesses mentioned.

6.7 Double lift chart

  • A double lift chart is a lift plot that considers both regression models in the same graph.

  • Assume that all predictions are strictly positive.

  • Denote the out-of-sample predictions of Model 1 by \((\widehat{\mu}_1(\boldsymbol{X}_t))_{t=1}^m\), and the ones of Model 2 by \((\widehat{\mu}_2(\boldsymbol{X}_t))_{t=1}^m\).

  • For the double lift chart, compute the ratios \[ \kappa_t = \frac{\widehat{\mu}_2(\boldsymbol{X}_t)}{\widehat{\mu}_1(\boldsymbol{X}_t)} \qquad \text{ for $1\le t \le m$.}\]

  • Use these ratios \((\kappa_t)_{t=1}^m\) for quantile binning:

    • Smallest bin: Model 2 judges the risks more positively than Model 1.
    • Largest bin: Model 2 judges the risks more negatively than Model 1.

  • For the double lift chart, we use this binning w.r.t. the ratios \((\kappa_t)_{t=1}^m\) to compute:

    • the weighted bin averages of the predictions \((\widehat{\mu}_1(\boldsymbol{X}_t))_{t=1}^m\) of Model 1,

    • the weighted bin averages of the predictions \((\widehat{\mu}_2(\boldsymbol{X}_t))_{t=1}^m\) of Model 2,

    • and the weighted bin averages of the actuals \((Y_t)_{t=1}^m\).

  • For model selection, we prefer the regression model whose binned prediction averages follow the ones of the actuals more closely.

6.8 Double lift chart example

# decile binning for the double lift plot (out-of-sample on test data)
test$kappa <- test$GLM2/test$GLM  
qq <- quantile(test$kappa, probs = c(0:10)/10)

test$qq <- 1
for (t0 in 2:10){test$qq <- test$qq + as.integer(test$kappa>qq[t0])}
dd <- data.frame(test %>%  group_by(qq) %>%
                           summarize(yy = sum(ClaimNb),
                                     mm1 = sum(GLM),
                                     mm2 = sum(GLM2),
                                     vv = sum(Exposure)))
#
dd$yy  <- dd$yy/dd$vv         # bin averages actuals
dd$mm1 <- dd$mm1/dd$vv        # bin predictor GLM1
dd$mm2 <- dd$mm2/dd$vv        # bin predictor GLM2
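
A minimal base R sketch of the corresponding double lift chart (the plotting style is an assumption):

## double lift chart sketch: actuals and both model predictions per kappa-bin
plot(dd$qq, dd$yy, type="b", pch=17, col="blue", ylim=range(c(dd$yy, dd$mm1, dd$mm2)),
     xlab="decile bin of the ratios kappa", ylab="frequency", main="double lift chart")
lines(dd$qq, dd$mm1, type="b", pch=19, col="orange")     # Model 1 (d.glm)
lines(dd$qq, dd$mm2, type="b", pch=15, col="darkgreen")  # Model 2 (d.glm2)
legend("topleft", legend=c("actuals", "Model 1", "Model 2"),
       col=c("blue", "orange", "darkgreen"), pch=c(17, 19, 15))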

  • We give a preference to Model 2, but it still needs a correction for auto-calibration.

7 Murphy’s score decomposition

Overview
  • Murphy’s score decomposition allows one to decompose the total prediction uncertainty into different (interpretable) parts.

  • Though useful, Murphy’s score decomposition is not used very frequently.

  • Murphy’s score decomposition can be seen as a generalization of the sums of squares to strictly consistent loss functions for mean estimation (and additionally assessing auto-calibration).

7.1 Murphy’s score decomposition

  • Choose a strictly consistent loss function \(L\) for mean estimation.

  • The score decomposition of Murphy (1973) is given by \[ {\mathbb E} \left[ L(Y, {\mu}(\boldsymbol{X}))\right] ~=~ {\sf UNC}_L - {\sf DSC}_L +{\sf MSC}_L,\] with uncertainty, discrimination (resolution) and miscalibration defined by, respectively, \[\begin{eqnarray*} {\sf UNC}_L &= & {\mathbb E} \left[ L\left(Y, \mu_0 \right)\right] ~\ge~0, \\ {\sf DSC}_L &= & {\mathbb E} \left[ L\left(Y, \mu_0 \right)\right] -{\mathbb E} \left[ L\left(Y , {\mu}_{\rm rc}(\boldsymbol{X}) \right)\right]~\ge~0, \\ {\sf MSC}_L &= & {\mathbb E} \left[ L \left(Y, {\mu}(\boldsymbol{X}) \right)\right] -{\mathbb E} \left[L \left(Y, {\mu}_{\rm rc}(\boldsymbol{X}) \right)\right]~\ge~0. \end{eqnarray*}\]

7.2 Uncertainty, discrimination (resolution) and miscalibration

  • The uncertainty term \({\sf UNC}_L\) quantifies the total prediction uncertainty when no covariates are used, i.e., when predicting with the null model \(\mu_0={\mathbb E}[Y]\).

  • The discrimination (resolution) term \({\sf DSC}_L\) quantifies the reduction in prediction uncertainty if we use the auto-calibrated regression function \(\boldsymbol{X}\mapsto{\mu}_{\rm rc}(\boldsymbol{X})={\mathbb E}[Y|\mu(\boldsymbol{X})]\).

  • The miscalibration term \({\sf MSC}_L\) quantifies the auto-calibration misspecification (see re-calibration step in previous item).

  • In applications, we need to compute these quantities empirically on the out-of-sample test data \({\cal T}\); the auto-calibrated model \(\mu_{\rm rc}\) can be determined by an isotonic re-calibration step from \(\mu\).

  • There are other decompositions of a similar nature.

7.3 French MTPL example, revisited

  • We revisit the previously fitted two GLMs d.glm and d.glm2.

  • We empirically compute Murphy’s score decompositions for these two GLMs on the test sample \({\cal T}\).

  • We could implement this ourselves, which requires the isotonic regression for the re-calibration step (basically all steps have already been presented above); a minimal R sketch is given below.

  • There is the Python package model_diagnostics that provides an implementation of this empirical Murphy’s score decomposition.
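
Before turning to that package, the ‘implement it ourselves’ route can be sketched in a few lines of R; this is a minimal sketch for the first GLM, reusing Poisson.Deviance(), the predictor test$GLM and the isotonically re-calibrated test$iso from Section 4.4 (where the smallest classes were merged, a slight approximation):

## empirical Murphy score decomposition for the first GLM on the test sample (sketch)
mu0  <- sum(test$ClaimNb) / sum(test$Exposure)                          # null model
S_mu <- Poisson.Deviance(test$GLM,                 test$ClaimNb, test$Exposure)
S_rc <- Poisson.Deviance(test$iso * test$Exposure, test$ClaimNb, test$Exposure)
S_0  <- Poisson.Deviance(mu0 * test$Exposure,      test$ClaimNb, test$Exposure)
c(score = S_mu, UNC = S_0, DSC = S_0 - S_rc, MSC = S_mu - S_rc)         # score = UNC - DSC + MSC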


# Python code

import pandas as pd

from model_diagnostics.scoring import (
    PoissonDeviance,
    decompose,
    HomogeneousExpectileScore,
)
from model_diagnostics.calibration import compute_bias

scoring_function = PoissonDeviance()
# the next line overwrites this choice by a homogeneous expectile score with
# degree close to 1 and level=0.5 (mean estimation), approximating the Poisson deviance
scoring_function = HomogeneousExpectileScore(degree=1.000000001, level=0.5)
df_pred_test = pd.DataFrame({"GLM_2": test["GLM_2"], "GLM": test["GLM"]})
y_test = test["ClaimNb"] / test["Exposure"]

df = decompose(
    y_obs=y_test,
    y_pred=df_pred_test,
    scoring_function=scoring_function,
    weights=test["Exposure"],
)

df.sort(["score"])

  • Results of Murphy’s score decomposition using the Poisson deviance loss as scoring function.

  • We again scale the above results by \(10^2\) to make them comparable to all previous results.

  • d.glm2 has a (much) better discrimination 2.4886 vs. 0.8966, giving a clear preference to the second GLM.

  • d.glm has the smaller miscalibration 0.1078 vs. 0.1904, indicating that the second GLM should be corrected for auto-calibration.

References

Ayer, M. et al. (1955) “An empirical distribution function for sampling with incomplete information,” The Annals of Mathematical Statistics, 26(4), pp. 641–647. Available at: https://doi.org/10.1214/aoms/1177728423.
Barlow, R.E. et al. (1972) Statistical inference under order restrictions. John Wiley & Sons.
Barlow, R.E. and Brunk, H.D. (1972) “The isotonic regression problem and its dual,” Journal of the American Statistical Association, 67(337), pp. 140–147. Available at: http://www.jstor.org/stable/2284712.
Brauer, A. and Wüthrich, M.V. (2025) “Gini score under ties and case weights,” arXiv preprint arXiv:2511.15446 [Preprint]. Available at: https://arxiv.org/abs/2511.15446.
Brunk, H.D., Ewing, G.M. and Utz, W.R. (1957) “Minimizing integrals in certain classes of monotone functions,” Pacific Journal of Mathematics, 7(1), pp. 833–847.
Bühlmann, H. and Gisler, A. (2005) A course in credibility theory and its applications. Springer. Available at: https://doi.org/10.1007/3-540-29273-X.
Gini, C. (1936) “On the Measure of Concentration with Special Reference to Income and Statistics,” Colorado College Publication, 208, pp. 73–79.
Goldburd, M. et al. (2020) Generalized linear models for insurance rating. 2nd ed. Casualty Actuarial Society (CAS monograph series, 5). Available at: https://www.casact.org/sites/default/files/2021-01/05-Goldburd-Khare-Tevet.pdf.
Kruskal, J.B. (1964) “Nonmetric multidimensional scaling,” Psychometrika, 29, pp. 115–129. Available at: https://doi.org/10.1007/BF02289694.
Leeuw, J. de, Hornik, K. and Mair, P. (2009) “Isotone optimization in R: Pool-adjacent-violators algorithm (PAVA) and active set methods,” Journal of Statistical Software, 32(5), pp. 1–24. Available at: https://www.jstatsoft.org/index.php/jss/article/view/v032i05.
Lindholm, M. and Wüthrich, M.V. (2025) “The balance property in insurance pricing,” Scandinavian Actuarial Journal, 2025. Available at: https://ssrn.com/abstract=4925165.
Loader, C. (1999) Local regression and likelihood. Springer. Available at: https://doi.org/10.1007/b98858.
Lorenz, M.O. (1905) “Methods of measuring the concentration of wealth,” Publications of the American Statistical Association, 9(70), pp. 209–219. Available at: https://doi.org/10.2307/2276207.
Miles, R.E. (1959) “The complete amalgamation into blocks, by weighted means, of a finite set of real numbers,” Biometrika, 46(3/4), pp. 317–327. Available at: http://www.jstor.org/stable/2333529.
Murphy, A.H. (1973) “A new vector partition of the probability score,” Journal of Applied Meteorology and Climatology, 12(4), pp. 595–600. Available at: https://journals.ametsoc.org/view/journals/apme/12/4/1520-0450_1973_012_0595_anvpot_2_0_co_2.xml.
Tasche, D. (2006) “Validation of internal rating systems and PD estimates.” Available at: https://arxiv.org/abs/physics/0606071.
Wüthrich, M.V. (2023) “Model selection with Gini indices under auto-calibration,” European Actuarial Journal, 13(1), pp. 469–477. Available at: https://doi.org/10.1007/s13385-022-00339-9.
Wüthrich, M.V. and Ziegel, J. (2024) “Isotonic recalibration under a low signal-to-noise ratio,” Scandinavian Actuarial Journal, 2024(3), pp. 279–299. Available at: https://doi.org/10.1080/03461238.2023.2246743.