AI TOOLS FOR ACTUARIES
Chapter 2: Regression Models

Author

Ronald Richman, Salvatore Scognamiglio, Mario V. Wüthrich

Published

February 23, 2026

Abstract
This chapter gives a general introduction to regression modeling. It introduces the exponential dispersion family (EDF), the main family of distributions considered for regression modeling, which defines the deviance losses used for mean estimation. Moreover, the chapter discusses covariate pre-processing and regularization for model fitting.

1 Introduction

1.1 Desirable properties of predictive models

  • Desirable characteristics of predictive models in insurance:

    1. provide accurate forecasts;
    2. smoothness properties, so that forecasts do not change drastically when the inputs are slightly perturbed;
    3. sparsity and simplicity, i.e., one aims for a parsimonious model;
    4. inner functioning of the model should be intuitive and explainable;
    5. good finite sample properties with credible parameter estimation;
    6. quantifiable prediction uncertainty;
    7. (manually) adaptable to expert knowledge;
    8. compliant with regulation, and one should be able to verify this.

    \(\rhd\) Typically, one needs to compromise among these requirements.

  • Distributions play a surprisingly marginal role in regression modeling, but they are important for obtaining good forecasts on finite samples.

2 Exponential dispersion family

Overview

The exponential dispersion family (EDF) is the main family of distributions for regression modeling:

  1. It contains many popular examples, e.g., the Gaussian, Poisson, binomial, negative binomial, gamma, inverse Gaussian and Tweedie’s compound Poisson models.

  2. The EDF provides a unified parametrization for MLE and regression modeling.

  3. Each member of the EDF translates into a strictly consistent loss function for mean estimation, inheriting the properties of the corresponding EDF distribution.

EDF literature: Jørgensen (1997) and Barndorff-Nielsen (2014). Our notation is taken from Wüthrich and Merz (2023).

2.1 Definition of the exponential dispersion family

A random variable \(Y \sim {\rm EDF}(\textcolor{blue}{\theta}, \varphi/v; \textcolor{red}{\kappa})\) belongs to the EDF if its density/probability weights take the form \[ Y ~\sim~ f_\theta(y) = \exp\left(\frac{y\textcolor{blue}{\theta}-\textcolor{red}{\kappa}(\textcolor{blue}{\theta})}{\varphi/v}+c(y, \varphi/v)\right),\] with

  • cumulant function \(\textcolor{red}{\kappa} : \boldsymbol{\Theta} \to {\mathbb R}\) on the effective domain \(\boldsymbol{\Theta} \subseteq {\mathbb R}\),

  • canonical parameter \(\textcolor{blue}{\theta} \in \boldsymbol{\Theta}\),

  • weight/volume \(v>0\),

  • dispersion parameter \(\varphi>0\), and

  • a normalizing function \(c(\cdot, \cdot)\).
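As a sanity check (our own Python sketch, not part of the chapter), the Poisson model can be written in exactly this EDF form with cumulant \(\kappa(\theta)=e^\theta\), canonical parameter \(\theta=\log \mu\), dispersion \(\varphi/v=1\), and normalizer \(c(y)=-\log(y!)\):

```python
import math

def poisson_pmf(y, mu):
    # Standard Poisson probability mass function.
    return math.exp(-mu) * mu**y / math.factorial(y)

def poisson_edf(y, mu):
    # EDF form: canonical parameter theta = log(mu), cumulant
    # kappa(theta) = exp(theta), phi/v = 1, normalizer c(y) = -log(y!).
    theta = math.log(mu)
    kappa = math.exp(theta)
    c = -math.log(math.factorial(y))
    return math.exp(y * theta - kappa + c)

# The two parametrizations agree for all counts y.
for y in range(6):
    assert abs(poisson_pmf(y, 2.5) - poisson_edf(y, 2.5)) < 1e-12
```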


Remarks on the EDF

The EDF definition looks a bit complicated, but only the following items are important for the sequel.

  • The cumulant function \(\textcolor{red}{\kappa}\) defines the type of selected distribution for \(Y\), e.g., Poisson or gamma model; see table below.

  • The cumulant function \(\textcolor{red}{\kappa}\) determines the mean and variance structure of \(Y\); see next slide.

  • The cumulant function \(\textcolor{red}{\kappa}\) can directly be mapped to a strictly consistent loss function for mean estimation; see below.

  • The canonical parameter \(\textcolor{blue}{\theta}\) is the model parameter of interest; the regression structure will enter this canonical parameter.

  • The EDF contains many popular statistical models, e.g., the Bernoulli, Gaussian, gamma, inverse Gaussian, Poisson or negative binomial models. We give \(\textcolor{red}{\kappa}\) for these examples below.

2.2 Mean and variance within the EDF

  • In any non-trivial setting we have the following properties:

    • the effective domain \(\boldsymbol{\Theta} \subseteq {\mathbb R}\) is a (possibly infinite) interval with a non-empty interior \(\mathring{\boldsymbol{\Theta}}\), and

    • the cumulant function \(\kappa\) is smooth and strictly convex on \(\mathring{\boldsymbol{\Theta}}\).

  • The mean and the variance for given \(\theta \in \mathring{\boldsymbol{\Theta}}\) are \[\begin{eqnarray*} \mu_0~=~{\mathbb E}\left[ Y\right]&=&\kappa'(\theta), \\ {\rm Var}\left( Y\right)&=&\frac{\varphi}{v}\kappa''(\theta)~>~0 .\end{eqnarray*}\]

  • This highlights the crucial role played by the cumulant function \(\kappa\) and the canonical (model) parameter \(\theta\).

  • E.g., the Poisson model has cumulant function \(\kappa(\theta)=\exp(\theta)\) and the gamma model \(\kappa(\theta)=-\log(-\theta)\).
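A small numerical illustration (our own sketch, using finite differences) confirms the mean and variance structure for the gamma cumulant \(\kappa(\theta)=-\log(-\theta)\), where \(\kappa'(\theta)=-1/\theta\) and \(\kappa''(\theta)=1/\theta^2\):

```python
import math

def kappa(theta):
    # Gamma cumulant function, defined for theta < 0.
    return -math.log(-theta)

def num_diff(f, x, h=1e-6):
    # Central finite difference approximation of f'(x).
    return (f(x + h) - f(x - h)) / (2 * h)

theta = -0.5  # canonical parameter, must be negative for the gamma model

mu = num_diff(kappa, theta)  # mean mu_0 = kappa'(theta) = -1/theta
var_unit = num_diff(lambda t: num_diff(kappa, t), theta)  # kappa''(theta) = 1/theta^2

assert abs(mu - (-1 / theta)) < 1e-4       # mu_0 = 2.0
assert abs(var_unit - 1 / theta**2) < 1e-2  # kappa''(theta) = 4.0, scaled by phi/v
```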

2.4 Maximum likelihood estimation (MLE)

  • The log-likelihood is given by \[ \log( f_\theta(Y)) = \frac{v}{\varphi}\left( Y\theta-\kappa(\theta)\right)+c(Y, \varphi/v).\]

  • The EDF parametrization allows one to easily compute the MLE of \(\theta\). For a single observation \(Y\) it is \[ \widehat{\theta}^{\rm MLE} = h(Y) \quad \Longleftrightarrow \quad \widehat{\mu}_0^{\rm MLE} = Y,\] with canonical link \(h=(\kappa')^{-1}\).

  • Mathematical small print: The above statement requires that the observation \(Y\) lies in the interior of the mean parameter space. For steep cumulant functions, this holds for all observations apart from those on the boundary of the support. We omit these technical details and refer to Barndorff-Nielsen (2014) and Wüthrich and Merz (2023).
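For illustration, in the Poisson model the MLE solves the score equation \(y - \kappa'(\theta) = y - e^\theta = 0\), giving \(\widehat{\theta}^{\rm MLE} = \log y\). A minimal Python sketch (our own, solving the score equation by bisection) recovers this:

```python
import math

y = 3.0  # a single Poisson observation, phi/v = 1

# Score equation: d/dtheta (y*theta - kappa(theta)) = y - exp(theta) = 0,
# using the Poisson cumulant kappa(theta) = exp(theta).
score = lambda theta: y - math.exp(theta)

# The score is strictly decreasing in theta, so bisection applies.
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if score(mid) > 0:
        lo = mid
    else:
        hi = mid
theta_hat = (lo + hi) / 2

# theta_hat equals h(y) = log(y), equivalently mu_hat = kappa'(theta_hat) = y.
assert abs(theta_hat - math.log(y)) < 1e-8
assert abs(math.exp(theta_hat) - y) < 1e-7
```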

2.5 Deviance loss function

  • The deviance loss function of the selected EDF is defined by \[ L(y,m) = 2\, \frac{\varphi}{v}\left( \log \left(f_{h(y)}(y)\right) - \log \left(f_{h(m)}(y)\right)\right) \ge 0,\] with \(m \in \kappa'(\mathring{\boldsymbol{\Theta}})\) in the mean parameter space and \(y\) in the interior of the convex closure of the support of \(Y\).

Deviance losses and MLE

  • Minimizing the deviance loss in \(m\) is equivalent to MLE of \(\theta\).

  • Deviance losses are strictly consistent losses for mean estimation.

  • Interesting examples for practice are given in the table below.

2.6 Best asymptotically normal

The deviance loss of \(Y \sim {\rm EDF}(\theta, \varphi/v; \kappa)\) aligns with the properties of \(Y\).

  • Minimizing the deviance loss of \({\rm EDF}(\theta, \varphi/v; \kappa)\) implies that we estimate the EDF mean \(\mu_0\) with MLE.

  • Gourieroux, Monfort and Trognon (1984) proved that this estimation procedure is on average optimal on finite samples, provided that \(Y \sim {\rm EDF}(\theta, \varphi/v; \kappa)\) is correctly specified.

  • Gourieroux, Monfort and Trognon (1984) call this (optimal) estimation procedure best asymptotically normal.

Finite sample estimation

Select the deviance loss for model fitting that aligns with the EDF properties of \(Y\) (i.e., select the correct cumulant function \(\kappa\) for \(Y\)).

2.7 Examples of deviance losses

EDF distribution | cumulant \(\kappa(\theta)\) | deviance loss \(L(y,m)\)
Gaussian | \(\theta^2/2\) | \((y-m)^2\)
gamma | \(-\log(-\theta)\) | \(2\left((y-m)/m+\log(m/y)\right)\)
inverse Gaussian | \(-\sqrt{-2\theta}\) | \((y-m)^2/(m^2y)\)
Poisson | \(e^\theta\) | \(2\left(m-y-y\log(m/y)\right)\)
Tweedie \(p\in (1,2)\) | \(\frac{((1-p)\theta)^{\frac{2-p}{1-p}}}{2-p}\) | \(2\left(y\frac{y^{1-p}-m^{1-p}}{1-p}-\frac{y^{2-p}-m^{2-p}}{2-p}\right)\)
Bernoulli | \(\log(1+e^\theta)\) | \(2\left(-y \log (m) - (1-y)\log(1-m)\right)\)

Some of these deviance losses have support restrictions, see next slide.
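The deviance losses in the table can be implemented directly. The following sketch (our own illustration, unit weight \(v/\varphi=1\), positive \(y\) and \(m\)) checks that each loss vanishes at \(m=y\) and is positive otherwise:

```python
import math

# Deviance losses from the table above, for a single observation, v/phi = 1.
def dev_gaussian(y, m):  return (y - m)**2
def dev_gamma(y, m):     return 2 * ((y - m) / m + math.log(m / y))
def dev_poisson(y, m):   return 2 * (m - y - y * math.log(m / y))
def dev_inv_gauss(y, m): return (y - m)**2 / (m**2 * y)

for L in (dev_gaussian, dev_gamma, dev_poisson, dev_inv_gauss):
    assert abs(L(1.7, 1.7)) < 1e-12              # zero at m = y ...
    assert L(1.7, 2.3) > 0 and L(1.7, 0.9) > 0   # ... positive otherwise
```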

2.8 Supports of Tweedie’s family

\(p\) | EDF distribution | support \(Y\) | \(\boldsymbol{\Theta}\) | \(\kappa'(\mathring{\boldsymbol{\Theta}})\)
\(p=0\) | Gaussian | \({\mathbb R}\) | \({\mathbb R}\) | \({\mathbb R}\)
\(p=1\) | Poisson | \({\mathbb N}_0\) | \({\mathbb R}\) | \((0,\infty)\)
\(1<p<2\) | Tweedie’s CP | \([0,\infty)\) | \((-\infty, 0)\) | \((0,\infty)\)
\(p=2\) | gamma | \((0,\infty)\) | \((-\infty, 0)\) | \((0,\infty)\)
\(p>2\) | generated by positive stable | \((0,\infty)\) | \((-\infty,0]\) | \((0,\infty)\)
\(p=3\) | inverse Gaussian | \((0,\infty)\) | \((-\infty,0]\) | \((0,\infty)\)
  • Tweedie’s family is characterized by the power variance property \[ {\rm Var}(Y) = \frac{\varphi}{v}\, \mu_0^p \qquad \text{ with } \mu_0 = \kappa'(\theta);\] see Jørgensen (1997).
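As a numerical illustration (our own sketch, not from the chapter), the Tweedie deviance loss from the table approaches the Poisson deviance as \(p \downarrow 1\):

```python
import math

# Tweedie deviance loss for power parameter p in (1,2), as in the table above.
def dev_tweedie(y, m, p):
    return 2 * (y * (y**(1 - p) - m**(1 - p)) / (1 - p)
                - (y**(2 - p) - m**(2 - p)) / (2 - p))

def dev_poisson(y, m):
    return 2 * (m - y - y * math.log(m / y))

# As p -> 1, the Tweedie deviance converges to the Poisson deviance.
y, m = 2.0, 3.0
assert abs(dev_tweedie(y, m, 1.0001) - dev_poisson(y, m)) < 1e-3
```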

3 Regression model

Overview
  • We are now ready to discuss regression models.

  • Their fitting procedure will use the strictly consistent loss function for mean estimation that reflects the properties of the observed responses. In particular, we select deviance losses.

  • This section gives a general (generic) introduction to regression modeling, and the remaining notebooks are dedicated to different types of regression models.

3.1 Regression model fitting

  • Consider a candidate model class \({\cal M}=\{\mu\}\) of sufficiently nice regression functions \(\mu:{\cal X} \to {\mathbb R}\).

  • Based on an i.i.d. learning sample \({\cal L}=(Y_i,\boldsymbol{X}_i, v_i)_{i=1}^n\), minimize \[ \widehat{\mu} ~\in~ \underset{ \mu \in {\cal M}}{\arg\min}\, \sum_{i=1}^n \frac{v_i}{\varphi}~L(Y_i,\mu(\boldsymbol{X}_i)),\] for a strictly consistent loss function \(L\) for mean estimation and for given weights \(v_i/\varphi\).

  • For a parametrized model class \({\cal M}=\{\mu_{\vartheta}\}_\vartheta\), we rather solve \[ \widehat{\vartheta} ~\in~ \underset{ \vartheta}{\arg\min}\, \sum_{i=1}^n \frac{v_i}{\varphi}~L(Y_i,\mu_\vartheta(\boldsymbol{X}_i)).\]
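A minimal sketch of this fitting procedure (our own toy example: unit weights, Poisson deviance, log-linear mean, plain gradient descent; all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated learning sample: one covariate, log-linear Poisson mean, v_i = 1.
n = 2000
x = rng.uniform(-1.0, 1.0, n)
y = rng.poisson(np.exp(0.5 + 1.0 * x))

# Minimize sum_i L(Y_i, mu_b(X_i)) over b = (b0, b1) with mu_b = exp(X b).
# The gradient of the Poisson deviance w.r.t. the linear predictor is 2(mu - y).
X = np.column_stack([np.ones(n), x])
b = np.zeros(2)
for _ in range(5000):
    mu = np.exp(X @ b)
    b -= 0.1 * (2 * X.T @ (mu - y) / n)

# The fitted parameters are close to the true values (0.5, 1.0).
assert abs(b[0] - 0.5) < 0.15 and abs(b[1] - 1.0) < 0.15
```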

4 Covariate pre-processing

Overview

An important part of regression modeling concerns covariate pre-processing.

  • First, all covariates need to be in a numerical form. This requires transforming the many categorical covariates in actuarial problems into real-valued representations.

  • Second, continuous covariates may not be in the right functional form, or they may be heavily skewed. In such cases, continuous covariates need to be transformed as well.

4.1 Notation and design matrix

  • The \(q\)-dimensional real-valued covariate information of instance \(i\) is denoted by \(\boldsymbol{X}_i\). It takes values in the covariate space \({\cal X}\subseteq {\mathbb R}^q\).

  • We add an intercept/bias component being equal to 1 \[\boldsymbol{X}_i = (1, X_{i,1}, \ldots, X_{i,q})^\top \in {\mathbb R}^{q+1};\] this uses a slight abuse of notation not indicated in \(\boldsymbol{X}_i\). From the context it will always be clear whether \(\boldsymbol{X}_i\) includes the bias or not.

  • The design matrix is defined by \[\mathfrak{X}=\begin{pmatrix}1 & X_{1,1} & \cdots & X_{1,q} \\\vdots & \vdots & \ddots & \vdots \\1 & X_{n,1} & \cdots & X_{n,q} \\\end{pmatrix} \in {\mathbb R}^{n\times(q+1)}.\]

  • Note: the different fonts \(\boldsymbol{X}\), \(X\), \({\cal X}\) and \({\mathfrak X}\) have different meanings.

4.2 Categorical covariates

  • Categorical covariates (nominal or ordinal) need pre-processing to bring them into a numerical form. This is done by an entity embedding.

  • Consider a categorical covariate \(X\) that takes values in a finite set \({\cal A}=\{a_1, \ldots, a_K \}\) having \(K\) levels.

  • The running example in this section has \(K=6\) levels: \[ {\cal A} = \Big\{\text{accountant},~ \text{actuary},~ \text{economist},~ \text{quant},~ \text{statistician},~ \text{underwriter}\Big\}. \]

  • We need to bring this into a numerical form.


4.2.1 Ordinal categorical covariates

For ordinal (ordered) levels \((a_k)_{k=1}^K\), use a 1-dimensional entity embedding \[ X \in {\cal A} ~\mapsto ~\sum_{k=1}^K k \, \mathbf{1}_{\{X=a_k\}}.\]

Our running example has an alphabetical ordering.

level embedding
accountant 1
actuary 2
economist 3
quant 4
statistician 5
underwriter 6

One may argue that the alphabetical order is risk insensitive (not useful).
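In code, the ordinal embedding of the running example is a simple lookup (a minimal Python sketch):

```python
# 1-dimensional ordinal embedding of the running example, alphabetical order.
levels = ["accountant", "actuary", "economist", "quant",
          "statistician", "underwriter"]
embed = {a_k: k for k, a_k in enumerate(levels, start=1)}

assert embed["actuary"] == 2 and embed["underwriter"] == 6
```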


4.2.2 One-hot encoding

One-hot encoding maps each level \(a_k\) to a basis vector in \({\mathbb R}^K\) \[ X \in {\cal A} ~\mapsto ~ \left(\mathbf{1}_{\{X=a_1\}}, \ldots, \mathbf{1}_{\{X=a_K\}}\right)^\top \in {\mathbb R}^K.\]

level | one-hot encoding in \({\mathbb R}^6\)
accountant | \((1,0,0,0,0,0)^\top\)
actuary | \((0,1,0,0,0,0)^\top\)
economist | \((0,0,1,0,0,0)^\top\)
quant | \((0,0,0,1,0,0)^\top\)
statistician | \((0,0,0,0,1,0)^\top\)
underwriter | \((0,0,0,0,0,1)^\top\)

One-hot encoding does not lead to full rank design matrices \({\mathfrak X}\) because there is a redundancy (the first \(K-1\) components are sufficient).


4.2.3 Dummy coding

Dummy coding selects a reference level, e.g., \(a_2=\text{actuary}\). Based on this selection, all other levels are measured relative to this reference level \[ X \in {\cal A} ~\mapsto ~ \left(\mathbf{1}_{\{X=a_1\}}, \mathbf{1}_{\{X=a_3\}}, \mathbf{1}_{\{X=a_4\}},\ldots, \mathbf{1}_{\{X=a_K\}}\right)^\top \in {\mathbb R}^{K-1}. \]

level | dummy coding in \({\mathbb R}^5\) (reference level: actuary)
accountant | \((1,0,0,0,0)^\top\)
actuary | \((0,0,0,0,0)^\top\)
economist | \((0,1,0,0,0)^\top\)
quant | \((0,0,1,0,0)^\top\)
statistician | \((0,0,0,1,0)^\top\)
underwriter | \((0,0,0,0,1)^\top\)

This leads to a full rank design matrix \({\mathfrak X}\), but also to sparsity, i.e., most of the entries are zero (for \(K\) large). This may be problematic numerically.
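The rank difference between one-hot encoding and dummy coding can be checked numerically; a Python sketch (our own, on the running example with every level observed):

```python
import numpy as np

levels = ["accountant", "actuary", "economist", "quant",
          "statistician", "underwriter"]
K = len(levels)
X = levels * 3  # a sample in which every level occurs

def one_hot(x):
    return [1.0 if x == a else 0.0 for a in levels]

def dummy(x, reference="actuary"):
    return [1.0 if x == a else 0.0 for a in levels if a != reference]

# Design matrices with an intercept column prepended.
X_onehot = np.array([[1.0] + one_hot(x) for x in X])  # n x (K+1)
X_dummy = np.array([[1.0] + dummy(x) for x in X])     # n x K

# One-hot + intercept is rank deficient (the K indicator columns sum to the
# intercept column); dummy coding has full column rank.
assert np.linalg.matrix_rank(X_onehot) == K   # rank K with K+1 columns
assert np.linalg.matrix_rank(X_dummy) == K    # full rank
```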


4.2.4 Entity embedding

  • Borrowing ideas from natural language processing (NLP), one uses low-dimensional entity embeddings, with proximity related to similarity; see Brébisson et al. (2015), Guo and Berkhahn (2016), Richman (2021).

  • Choose an embedding dimension \(b \in {\mathbb N}\) – this is a hyper-parameter selected by the modeler – typically \(b\ll K\).

  • An entity embedding (EE) is defined by \[ \boldsymbol{e}^{\rm EE}: {\cal A} \to {\mathbb R}^b, \qquad X \mapsto \boldsymbol{e}^{\rm EE}(X).\]

  • In total this entity embedding involves \(b\cdot K\) embedding weights (parameters).

  • Embedding weights need to be determined either by the modeler (manually) or during the model fitting procedure (algorithmically), and proximity in embedding should reflect similarity in (risk) behavior.


  • Manually chosen example with \(b\cdot K=24\) embedding weights for embedding dimension \(b=4\).
level | finance | maths | stats | liability
accountant | 0.5 | 0 | 0 | 0
actuary | 0.5 | 0.3 | 0.5 | 0.5
economist | 0.5 | 0.2 | 0.5 | 0
quant | 0.7 | 0.3 | 0.3 | 0
statistician | 0 | 0.5 | 0.8 | 0
underwriter | 0 | 0.1 | 0.1 | 0.8
  • Note: the weights do not need to be normalized, i.e., only proximity in \({\mathbb R}^b\) is relevant.
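The proximity structure of the manually chosen embedding can be inspected directly; a Python sketch (weights copied from the table above):

```python
import math

# Manually chosen 4-dimensional embedding weights from the table above.
ee = {
    "accountant":   (0.5, 0.0, 0.0, 0.0),
    "actuary":      (0.5, 0.3, 0.5, 0.5),
    "economist":    (0.5, 0.2, 0.5, 0.0),
    "quant":        (0.7, 0.3, 0.3, 0.0),
    "statistician": (0.0, 0.5, 0.8, 0.0),
    "underwriter":  (0.0, 0.1, 0.1, 0.8),
}

def dist(a, b):
    # Euclidean distance in the embedding space R^b.
    return math.sqrt(sum((u - v)**2 for u, v in zip(ee[a], ee[b])))

# Proximity encodes similarity: economist and quant are close in R^4,
# while economist and underwriter are far apart.
assert dist("economist", "quant") < dist("economist", "underwriter")
```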

4.2.5 Target encoding

  • Especially for regression trees, one uses target encoding.

  • Consider a sample \((Y_i, X_i, v_i)_{i=1}^n\) with categorical covariates \(X_i \in {\cal A}\), real-valued responses \(Y_i\) and weights \(v_i>0\).

  • Compute the weighted sample means on all levels \(a_k \in {\cal A}\) by \[ \overline{y}_k = \frac{\sum_{i=1}^n v_iY_i\, \mathbf{1}_{\{X_i=a_k\}}} {\sum_{i=1}^n v_i \mathbf{1}_{\{X_i=a_k\}}}.\]

  • These weighted sample means \((\overline{y}_k)_{k=1}^K\) are used like ordinal levels \[ X \in {\cal A} ~\mapsto ~\sum_{k=1}^K \overline{y}_k \, \mathbf{1}_{\{X=a_k\}}.\]

  • Though convincing at first sight, this does not consider interactions within the covariates, e.g., for scarce levels it may happen that a high or low value is mainly implied by another covariate component.

  • Be careful: target encoding may lead to information leakage.
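The weighted sample means defining the target encoding are easy to compute; a Python sketch on a toy sample \((Y_i, X_i, v_i)\) with hypothetical numbers:

```python
from collections import defaultdict

# Toy sample of (Y_i, X_i, v_i); all numbers are hypothetical.
sample = [(0.8, "actuary", 2.0), (1.2, "actuary", 1.0),
          (2.0, "quant", 1.0), (4.0, "quant", 3.0)]

# Weighted sample means per level a_k (target encoding).
num, den = defaultdict(float), defaultdict(float)
for y, x, v in sample:
    num[x] += v * y
    den[x] += v

ybar = {a: num[a] / den[a] for a in num}

assert abs(ybar["actuary"] - (2.0 * 0.8 + 1.0 * 1.2) / 3.0) < 1e-12
assert abs(ybar["quant"] - (1.0 * 2.0 + 3.0 * 4.0) / 4.0) < 1e-12  # = 3.5
```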


4.2.6 Credibilitized target encoding

  • For scarce levels, one should credibilitize the target encoding; see Micci-Barreca (2001) and Bühlmann (1967).

  • Goal: Assess how credible the individual estimates \(\overline{y}_k\) are.

  • Improve unreliable ones by mixing them with the global weighted empirical mean \(\overline{y}=\sum_{i=1}^n v_iY_i/\sum_{i=1}^n v_i\), providing \[ \overline{y}^{\rm cred}_k = \alpha_k\,\overline{y}_k + \left(1-\alpha_k\right) \overline{y},\] with credibility weights for \(1\le k \le K\) \[ \alpha_k = \frac{\sum_{i=1}^n v_i\mathbf{1}_{\{X_i=a_k\}}}{\sum_{i=1}^n v_i\mathbf{1}_{\{X_i=a_k\}}+ \tau}~\in~[0,1].\]

  • The shrinkage parameter \(\tau \ge 0\) is a hyper-parameter selected by the modeler; it is also called credibility coefficient.
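A sketch of the credibility formula on a toy sample (hypothetical numbers; \(\tau\) chosen arbitrarily), showing how a scarce level is shrunk towards the global mean:

```python
# Toy sample of (Y_i, X_i, v_i) and a hypothetical shrinkage parameter tau.
tau = 5.0
sample = [(0.8, "actuary", 2.0), (1.2, "actuary", 1.0), (4.0, "quant", 1.0)]

global_mean = sum(v * y for y, x, v in sample) / sum(v for y, x, v in sample)

def cred_mean(level):
    # Credibility-weighted mean: alpha_k * ybar_k + (1 - alpha_k) * ybar.
    w = sum(v for y, x, v in sample if x == level)      # total weight of level
    ybar = sum(v * y for y, x, v in sample if x == level) / w
    alpha = w / (w + tau)                               # credibility weight
    return alpha * ybar + (1 - alpha) * global_mean

# The scarce level "quant" (weight 1) is shrunk strongly towards the
# global mean 1.7: alpha = 1/6 gives 4/6 + (5/6)*1.7.
assert abs(cred_mean("quant") - global_mean) < abs(4.0 - global_mean)
```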

4.3 Continuous covariates

  • In theory, continuous covariates do not need any pre-processing.

  • In practice, it might be that the continuous covariates do not provide the right functional form, or they may live on the wrong scale.

  • E.g., we may replace a positive covariate \(X>0\) by a 4-dimensional pre-processed covariate \[ X ~\mapsto ~ (X, \log(X), \exp\{X\}, (X-10)^2)^\top.\] This has a linear, a logarithmic, an exponential and a quadratic term.

  • In GLMs, one often discretizes continuous covariates, e.g., one builds age classes. For a finite partition \((I_k)_{k=1}^K\) of the support of the continuous covariate \(X\), assign the categorical labels \(a_k \in {\cal A}\) to \(X\) by \[ X ~\mapsto ~ \sum_{k=1}^K a_k\, \mathbf{1}_{\{X \in I_k\}}.\]
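The discretization step can be sketched as follows (our own illustration; the age-class boundaries are hypothetical):

```python
import numpy as np

# Discretize a continuous age covariate into age-class labels,
# using a hypothetical partition (I_k) of its support.
bins = [25, 35, 50, 65]                        # interval boundaries
labels = ["18-24", "25-34", "35-49", "50-64", "65+"]

ages = np.array([19, 27, 42, 65, 80])
classes = [labels[k] for k in np.digitize(ages, bins)]

assert classes == ["18-24", "25-34", "35-49", "65+", "65+"]
```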


4.3.1 Standardization and MinMaxScaler

  • Gradient descent fitting methods require that all covariate components live on the same scale.

  • Assume we have \(n\) instances with a continuous covariate \((X_i)_{i=1}^n\).

  • Standardization considers the transformation \[ X ~\mapsto ~ \frac{X - \widehat{m}}{\widehat{s}},\] with empirical mean \(\widehat{m}\) and empirical standard deviation \(\widehat{s}>0\) of \((X_i)_{i=1}^n\).

  • The MinMaxScaler is given by the transformation \[ X ~\mapsto ~ 2\,\frac{X - \min_{1\le i \le n}X_i}{\max_{1\le i \le n}X_i-\min_{1\le i \le n}X_i} -1,\] which maps the covariate to the interval \([-1,1]\).

  • Important: The identical transformation needs to be applied to all samples (learning data, test data, training data, validation data,…)! This requires storage of the transformation parameters.
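The last point can be sketched as follows (our own illustration with simulated data): the transformation parameters are fitted on the training data only and then stored, so that the identical map is applied to every sample:

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = rng.normal(40.0, 12.0, 1000)   # e.g. an age covariate
x_test = rng.normal(40.0, 12.0, 200)

# Fit the transformation parameters on the training data ONLY ...
m_hat, s_hat = x_train.mean(), x_train.std()

# ... and store them to apply the identical transformation everywhere.
z_train = (x_train - m_hat) / s_hat
z_test = (x_test - m_hat) / s_hat        # same m_hat, s_hat as for training

# The training data is standardized exactly; the test data only approximately.
assert abs(z_train.mean()) < 1e-9 and abs(z_train.std() - 1.0) < 1e-9
```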

5 Regularization

Overview

Regularization is a model fitting technique that aims at parameter shrinkage and variable selection during the estimation procedure. This is achieved by penalizing extreme parameter values, which generally makes the fitted models smoother. This section discusses different regularization methods.

\(\,\)

Tip

The fast reader can skip this section and come back to it later.

5.1 Regularization: Introduction

  • Generally speaking, a regression model \(\boldsymbol{X} \mapsto \mu(\boldsymbol{X})\) is correctly specified if it integrates all relevant covariates in the correct form, and if there is no redundant, missing or wrong covariate in the regression function:

    • redundant means that we include multiple covariates that essentially represent the same information (e.g., being collinear);

    • missing means that we have forgotten important covariate information that would be available; and

    • wrong means that the covariate does not impact the response.

  • In real world problems, with unknown regression functions, this is part of covariate pre-processing, model selection and covariate selection.


  • One might be tempted to include any available information into the regression model, to ensure that nothing gets forgotten. However, when fitting on finite samples, this is usually not a good strategy, because one may easily run into over-fitting, collinearity and identifiability issues, and, generally, into a poor predictive model because model fitting has not been successful.

  • Model complexity control is crucial in a successful model fitting and predictive modeling procedure. Typically, one aims for a sparse model, meaning that one aims for parsimony having the fewest necessary variables but still providing accurate predictions.

  • In practice, neither the true model nor the (causal) regression structure and the factors that impact the responses are known. One therefore often starts with slightly too large models, which one then tries to shrink by regularization. Regularization penalizes model complexity and/or extreme regression coefficients.

5.2 Regularization techniques

We describe the most popular regularization techniques in this section; see Hastie, Tibshirani and Wainwright (2015).

  • Ridge regularization (also known as Tikhonov (1943) regularization or \(L^2\)-regularization);

  • LASSO regularization of Tibshirani (1996) (\(L^1\)-regularization);

  • Elastic net regularization of Zou and Hastie (2005);

  • Group LASSO regularization of Yuan and Lin (2006);

  • Fused LASSO regularization of Tibshirani et al. (2005);

  • Smoothly clipped absolute deviation (SCAD) regularization of Fan and Li (2001); we skip the details of SCAD in this notebook.

5.3 Regularization preparation

  • Consider a parametric regression estimator \(\mu_{\widehat{\vartheta}}(\cdot)\), obtained by selecting from a class of candidate models \({\cal M}=\{\mu_\vartheta\}_\vartheta\) parametrized by \(\vartheta\).

  • The parameter \(\vartheta\) is assumed to be a \((r+1)\)-dimensional vector \[ \vartheta = (\vartheta_0, \vartheta_1, \ldots, \vartheta_r)^\top \in {\mathbb R}^{r+1}.\] Typically, \((\vartheta_{k})_{k=1}^r\) parametrizes the terms in \(\mu_{\vartheta}(\boldsymbol{X})\) involving the covariates \(\boldsymbol{X}\), and \(\vartheta_0\) is a parameter for the covariate-free part of the regression function determining the overall level of the regression function; \(\vartheta_0\) is the intercept/bias term, see GLM above.

  • This bias term \(\vartheta_0\) needs to be excluded from regularization to be able to calibrate the overall level of the model.

  • Denote by \(\vartheta_{\setminus 0}= (\vartheta_1, \ldots, \vartheta_r)^\top \in {\mathbb R}^{r}\) the parameter vector excluding the bias term \(\vartheta_0\).

5.4 Regularization: penalty function

  • Regularized parameter estimation is achieved by\[ \widehat{\vartheta} ~\in~ \underset{ \vartheta}{\arg\min}\, \left( \sum_{i=1}^n \frac{v_i}{\varphi}~L(Y_i,\mu_\vartheta(\boldsymbol{X}_i)) + \eta \, R(\vartheta_{\setminus 0})\right),\] where \(R:{\mathbb R}^r\to {\mathbb R}_+\) is a penalty function and \(\eta \ge 0\) is the regularization parameter.

  • We give the most common examples of penalty functions, below.

  • Note that we do not include any scaling \(1/n\) in front of the sum. Such a scaling would not affect the optimal solution in the non-regularized case, but it would make the regularization parameter \(\eta\) independent of the sample size.

5.5 Ridge regularization/regression

  • Ridge regression selects a squared \(L^2\)-norm penalty function for \(R\) \[ \widehat{\vartheta}^{\rm ridge} ~\in~ \underset{ \vartheta}{\arg\min}\, \left( \sum_{i=1}^n \frac{v_i}{\varphi}~L(Y_i,\mu_\vartheta(\boldsymbol{X}_i)) + \eta \, \|\vartheta_{\setminus 0}\|^2_2 \right).\]

  • Ridge regularization generally penalizes large values in \((\vartheta_{k})_{k=1}^r\); this is called shrinkage.

  • The level of shrinkage is determined by the regularization parameter \(\eta \ge 0\), and an optimal choice is determined, e.g., by cross-validation.

  • This intuition of shrinkage indicates why we exclude the bias term \(\vartheta_0\) from regularization: otherwise, we would obtain a statistically biased model.
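For the Gaussian deviance (squared error), ridge regression has a closed-form solution; a Python sketch (our own simulated example) excluding the bias from the penalty:

```python
import numpy as np

rng = np.random.default_rng(2)

# Gaussian deviance: ridge solves (X'X + eta P) beta = X'y, where P is the
# identity except for a zero in the bias entry (no penalty on the intercept).
n, q = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, q))])
y = X @ np.array([5.0, 1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=n)

eta = 50.0
P = np.eye(q + 1)
P[0, 0] = 0.0  # exclude the bias term from regularization

beta_ridge = np.linalg.solve(X.T @ X + eta * P, X.T @ y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Shrinkage: the penalized slope coefficients are pulled towards zero.
assert np.linalg.norm(beta_ridge[1:]) < np.linalg.norm(beta_ols[1:])
```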

5.6 LASSO regularization/regression

  • LASSO regularization (least absolute shrinkage and selection operator regularization) selects an \(L^1\)-norm penalty function for \(R\) \[ \widehat{\vartheta}^{\rm LASSO} ~\in~ \underset{ \vartheta}{\arg\min}\, \left( \sum_{i=1}^n \frac{v_i}{\varphi}~L(Y_i,\mu_\vartheta(\boldsymbol{X}_i)) + \eta \, \|\vartheta_{\setminus 0}\|_1\right).\]

  • The behavior of LASSO regularization is fundamentally different from ridge regularization. This difference results from the fact that the squared \(L^2\)-norm is differentiable at the origin, while the \(L^1\)-norm is not.



  • As a consequence of this non-differentiability of the \(L^1\)-norm at the origin, some components of \((\vartheta_{k})_{k=1}^r\) may be exactly equal to zero at the LASSO optimum.

  • The larger we choose the regularization parameter \(\eta\), the more components of \((\vartheta_{k})_{k=1}^r\) are set exactly to zero at the optimum.

  • Thus, increasing the regularization parameter \(\eta\) leads to sparsity in \((\vartheta_{k})_{k=1}^r\), and a sparse regression model results for large regularization parameters \(\eta \gg 0\).

  • This can be interpreted as variable importance, i.e., for large \(\eta\) only important covariate components survive.

  • Often, one performs variable selection with LASSO regularization in a first step, and in a second step one fits a non-regularized model on the selected components.
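The exact zeros produced by the \(L^1\)-penalty can be seen in a small experiment; a sketch (our own, Gaussian deviance, solved by proximal gradient descent / ISTA with soft-thresholding; a standard algorithm, not the only choice):

```python
import numpy as np

rng = np.random.default_rng(3)

# Sparse ground truth and simulated data, Gaussian deviance (squared error).
n, q = 300, 5
X = rng.normal(size=(n, q))
beta_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# ISTA: gradient step on 0.5*||y - X beta||^2, then soft-thresholding,
# which sets small coefficients exactly to zero.
eta = 100.0                                   # regularization parameter
step = 1.0 / np.linalg.eigvalsh(X.T @ X).max()

beta = np.zeros(q)
for _ in range(2000):
    b = beta - step * (X.T @ (X @ beta - y))
    beta = np.sign(b) * np.maximum(np.abs(b) - step * eta, 0.0)

# The zero components of the truth are estimated as exactly zero,
# while the true signals survive (shrunk towards zero).
assert beta[1] == 0.0 and beta[2] == 0.0 and beta[4] == 0.0
assert abs(beta[0]) > 0.5 and abs(beta[3]) > 0.5
```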

5.7 Geometric interpretation of ridge vs. LASSO

  • Generally, these regularized regression problems can be interpreted geometrically using the method of Lagrange:

    • the feasible set of solutions in the ridge regression case is an \(L^2\)-ball, and

    • in the LASSO regression case an \(L^1\)-ball (a cross-polytope, having corners);

    we refer to the next graph.

  • This geometric difference (having corners or not) precisely distinguishes the sparsity of LASSO from the shrinkage of ridge regression, i.e., it comes down to whether the boundary of the convex constraint set is smooth or has corners.

  • In fact, one considers orthogonal projections onto these constraint sets, and sparsity in LASSO results from solutions sitting at corners of the \(L^1\)-constraint set.


5.8 Best-subset selection regularization

  • Best-subset selection regularization selects an \(L^0\)-“norm” penalty function for \(R\), counting the number of non-zero components, \[ \widehat{\vartheta}^{\rm BSS} ~\in~ \underset{ \vartheta}{\arg\min}\, \left( \sum_{i=1}^n \frac{v_i}{\varphi}~L(Y_i,\mu_\vartheta(\boldsymbol{X}_i)) + \eta \, \sum_{k=1}^r \mathbf{1}_{\{\vartheta_k \neq 0 \}}\right).\]

  • This version is rarely used in applications, mainly because the above optimization problem is combinatorial and hard to solve. In fact, LASSO regularization is considered as its tractable (convex) surrogate.

5.9 Elastic net regularization

  • Elastic net regularization combines LASSO and ridge regularization \[ \widehat{\vartheta}^{\rm net} \in \underset{ \vartheta}{\arg\min}\, \left( \sum_{i=1}^n \frac{v_i}{\varphi}L(Y_i,\mu_\vartheta(\boldsymbol{X}_i)) + \eta \left( \alpha \|\vartheta_{\setminus 0}\|_1 +(1-\alpha) \|\vartheta_{\setminus 0}\|^2_2\right)\right),\] with \(\alpha \in [0,1]\).

  • Elastic net regularization overcomes some issues of LASSO: e.g., among highly correlated covariate components, LASSO tends to select one component and drop the others, whereas the elastic net has a grouping effect that assigns similar weights to correlated covariate components.

5.10 Grouped covariates

  • The previous regularization proposals focus on shrinking the regression parameters \((\vartheta_{k})_{k=1}^r\) and/or on making regression models sparse by setting some of these parameters \(\vartheta_k\) exactly to zero.

  • In these regularization methods all terms and parameters \((\vartheta_{k})_{k=1}^r\) are considered individually.

  • If we want to treat some of them jointly in a group, e.g., for a dummy coded categorical covariate, regularization should act simultaneously on all parameters that belong to that group.

  • We group the (categorical) covariates. Assume we have \(G\) groups \[ \vartheta_{\setminus 0} = (\vartheta^{(1)}, \ldots, \vartheta^{(G)})^\top \in {\mathbb R}^{d_1 + \ldots + d_G}={\mathbb R}^{r},\] where each group \(\vartheta^{(k)} \in {\mathbb R}^{d_k}\) contains \(d_k\) components of \(\vartheta_{\setminus 0}\).

5.11 Group LASSO regularization

  • Group LASSO regularization is obtained by solving \[ \widehat{\vartheta} ~\in~ \underset{ \vartheta}{\arg\min}\, \left( \sum_{i=1}^n \frac{v_i}{\varphi}~L(Y_i,\mu_\vartheta(\boldsymbol{X}_i)) + \sum_{k=1}^G \eta_k \, \|\vartheta^{(k)}\|_2 \right),\] for regularization parameters \(\eta_k\ge 0\).

  • For increasing regularization parameters \(\eta_k\), group LASSO regularization leads to sparsity by setting entire parameter blocks \(\vartheta^{(k)} \in {\mathbb R}^{d_k}\) simultaneously equal to zero.
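The blockwise zeroing can be seen in the proximal operator of the group penalty \(\eta_k\|\vartheta^{(k)}\|_2\) (block soft-thresholding, a standard result, shown here as our own sketch):

```python
import numpy as np

# Block soft-thresholding: the proximal operator of eta_k * ||theta^(k)||_2.
# It shrinks the whole block, and sets it exactly to zero once its norm
# falls below the threshold eta_k.
def group_soft_threshold(block, eta_k):
    norm = np.linalg.norm(block)
    if norm <= eta_k:
        return np.zeros_like(block)
    return (1.0 - eta_k / norm) * block

small = np.array([0.3, 0.4])   # norm 0.5 <= 1.0: entire block zeroed
large = np.array([3.0, 4.0])   # norm 5.0 >  1.0: shrunk by factor (1 - 1/5)

assert np.all(group_soft_threshold(small, 1.0) == 0.0)
assert np.allclose(group_soft_threshold(large, 1.0), [2.4, 3.2])
```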

5.12 Fused LASSO regularization

  • Assume there is an adjacency relation on the covariate components, i.e., the covariate component \(X_j\) is naturally embedded between the components \(X_{j-1}\) and \(X_{j+1}\), and we try to achieve a certain smoothness of the regression parameters for these covariates.

  • Fused LASSO regularization is obtained by solving \[ \widehat{\vartheta} ~\in~ \underset{ \vartheta}{\arg\min}\, \left( \sum_{i=1}^n \frac{v_i}{\varphi}~L(Y_i,\mu_\vartheta(\boldsymbol{X}_i)) + \sum_{j=2}^r \eta_j \, |\vartheta_j-\vartheta_{j-1}|\right),\] for regularization parameters \(\eta_j \ge 0\).

  • The fused LASSO proposal enforces sparsity in the differences of regression parameters of adjacent variables, i.e., neighboring parameters are encouraged to take identical values; in its original form, it additionally penalizes the parameters themselves, also enforcing sparsity in the non-zero components. It considers first order differences, and one could also consider second (or higher) order differences.

5.13 Positivity and monotonicity regularization

  • One may want to enforce that the parameters are positive \[ \widehat{\vartheta} ~\in~ \underset{ \vartheta}{\arg\min}\, \left( \sum_{i=1}^n \frac{v_i}{\varphi}~L(Y_i,\mu_\vartheta(\boldsymbol{X}_i)) + \sum_{j=1}^r \eta_j \, (0-\vartheta_j)_+\right).\]

  • Enforcement of a monotonically increasing parameter sequence \[ \widehat{\vartheta} ~\in~ \underset{ \vartheta}{\arg\min}\, \left( \sum_{i=1}^n \frac{v_i}{\varphi}~L(Y_i,\mu_\vartheta(\boldsymbol{X}_i)) + \sum_{j=2}^r \eta_j \, (\vartheta_{j-1}-\vartheta_{j})_+\right),\] with rectified linear unit (ReLU) function \[ u \mapsto (u)_+ =\max\{ u, 0 \} = u\, \mathbf{1}_{\{u>0\}}.\]

  • The above regularizes parameters \(\vartheta_{\setminus 0}\), but one could also enforce smoothness and monotonicity in the regression function components (e.g., mortality graduation); see Richman and Wüthrich (2024).
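The ReLU-based monotonicity penalty is straightforward to evaluate; a minimal sketch (our own illustration, uniform regularization parameter \(\eta\)):

```python
# ReLU function (u)_+ = max(u, 0).
relu = lambda u: max(u, 0.0)

def monotone_penalty(theta, eta=1.0):
    # Penalizes every decrease theta_j < theta_{j-1} between consecutive
    # parameters; zero exactly for monotonically increasing sequences.
    return eta * sum(relu(theta[j - 1] - theta[j])
                     for j in range(1, len(theta)))

assert monotone_penalty([0.1, 0.2, 0.5]) == 0.0              # monotone
assert abs(monotone_penalty([0.1, 0.5, 0.2]) - 0.3) < 1e-12  # one violation
```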

References

Barndorff-Nielsen, O. (2014) Information and exponential families: In statistical theory. John Wiley & Sons. Available at: https://onlinelibrary.wiley.com/doi/book/10.1002/9781118857281.
Brébisson, A. de et al. (2015) “Artificial neural networks applied to taxi destination prediction.” Available at: https://arxiv.org/abs/1508.00021.
Bühlmann, H. (1967) “Experience rating and credibility,” ASTIN Bulletin - The Journal of the IAA, 4(3), pp. 199–207. Available at: https://www.cambridge.org/core/product/B9A879CE4A73397C653963B474A7F954.
Fan, J. and Li, R. (2001) “Variable selection via nonconcave penalized likelihood and its oracle properties,” Journal of the American Statistical Association, 96(456), pp. 1348–1360. Available at: https://doi.org/10.1198/016214501753382273.
Gourieroux, C., Monfort, A. and Trognon, A. (1984) “Pseudo maximum likelihood methods: theory,” Econometrica, 52(3), pp. 681–700. Available at: https://www.jstor.org/stable/1913471?seq=1.
Guo, C. and Berkhahn, F. (2016) “Entity embeddings of categorical variables.” Available at: https://arxiv.org/abs/1604.06737.
Hastie, T., Tibshirani, R. and Wainwright, M. (2015) Statistical learning with sparsity: The LASSO and generalizations. Chapman & Hall/CRC. Available at: https://doi.org/10.1201/b18401.
Jørgensen, B. (1997) The theory of dispersion models. Chapman & Hall.
Micci-Barreca, D. (2001) “A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems,” SIGKDD Explor. Newsl., 3(1), pp. 27–32. Available at: https://doi.org/10.1145/507533.507538.
Richman, R. (2021) “AI in actuarial science – a review of recent advances – part 1,” Annals of Actuarial Science, 15(2), pp. 207–229. Available at: https://www.cambridge.org/core/product/65D135D28505261F431EBEC0220DF0B0.
Richman, R. and Wüthrich, M.V. (2024) “Smoothness and monotonicity constraints for neural networks using ICEnet,” Annals of Actuarial Science, 18(3), pp. 712–739. Available at: https://www.cambridge.org/core/product/465BBBA65B70D62FEE9CE838C6E88B58.
Tibshirani, R. (1996) “Regression shrinkage and selection via the LASSO,” Journal of the Royal Statistical Society. Series B (Methodological), 58(1), pp. 267–288. Available at: http://www.jstor.org/stable/2346178.
Tibshirani, R. et al. (2005) “Sparsity and smoothness via the fused LASSO,” Journal of the Royal Statistical Society Series B, 67(1), pp. 91–108. Available at: https://doi.org/10.1111/j.1467-9868.2005.00490.x.
Tikhonov, A.N. (1943) “On the stability of inverse problems,” Doklady Akademii Nauk SSSR, 39(5), pp. 195–198.
Wüthrich, M.V. and Merz, M. (2023) Statistical foundations of actuarial learning and its applications. Springer. Available at: https://doi.org/10.1007/978-3-031-12409-9.
Yuan, M. and Lin, Y. (2006) “Model selection and estimation in regression with grouped variables,” Journal of the Royal Statistical Society Series B, 68(1), pp. 49–67. Available at: https://doi.org/10.1111/j.1467-9868.2005.00532.x.
Zou, H. and Hastie, T. (2005) “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society Series B: Statistical Methodology, 67(2), pp. 301–320. Available at: https://doi.org/10.1111/j.1467-9868.2005.00503.x.