
AI TOOLS FOR ACTUARIES
Chapter 2: Regression Models
1 Introduction
1.1 Desirable properties of predictive models
Desirable characteristics of predictive models in insurance:
- provide accurate forecasts;
- smoothness properties so that forecasts do not drastically change by slightly perturbing the inputs;
- sparsity and simplicity, i.e., one aims for a parsimonious model;
- inner functioning of the model should be intuitive and explainable;
- good finite sample properties with credible parameter estimation;
- quantifiable prediction uncertainty;
- (manually) adaptable to expert knowledge;
- compliant with regulation, and one should be able to verify this.
\(\rhd\) Typically, one needs to compromise among these requirements.
Distributions play a surprisingly marginal role in regression modeling, but they are important for obtaining good forecasts from finite samples.
2 Exponential dispersion family
EDF literature: Jørgensen (1997) and Barndorff-Nielsen (2014). Our notation is taken from Wüthrich and Merz (2023).
2.1 Definition of the exponential dispersion family
A random variable \(Y \sim {\rm EDF}(\textcolor{blue}{\theta}, \varphi/v; \textcolor{red}{\kappa})\) belongs to the EDF if its density/probability weights take the form \[ Y ~\sim~ f_\theta(y) = \exp\left(\frac{y\textcolor{blue}{\theta}-\textcolor{red}{\kappa}(\textcolor{blue}{\theta})}{\varphi/v}+c(y, \varphi/v)\right),\] with
cumulant function \(\textcolor{red}{\kappa} : \boldsymbol{\Theta} \to {\mathbb R}\) on the effective domain \(\boldsymbol{\Theta} \subseteq {\mathbb R}\),
canonical parameter \(\textcolor{blue}{\theta} \in \boldsymbol{\Theta}\),
weight/volume \(v>0\),
dispersion parameter \(\varphi>0\), and
a normalizing function \(c(\cdot, \cdot)\).
- The EDF contains many popular statistical models, e.g., the Bernoulli, Gaussian, gamma, inverse Gaussian, Poisson or negative binomial models. We give \(\textcolor{red}{\kappa}\) for these examples below.
2.2 Mean and variance within the EDF
In any non-trivial setting we have the following properties:
the effective domain \(\boldsymbol{\Theta} \subseteq {\mathbb R}\) is a (possibly infinite) interval with a non-empty interior \(\mathring{\boldsymbol{\Theta}}\), and
the cumulant function \(\kappa\) is smooth and strictly convex on \(\mathring{\boldsymbol{\Theta}}\).
The mean and the variance for given \(\theta \in \mathring{\boldsymbol{\Theta}}\) are \[\begin{eqnarray*} \mu_0~=~{\mathbb E}\left[ Y\right]&=&\kappa'(\theta), \\ {\rm Var}\left( Y\right)&=&\frac{\varphi}{v}\kappa''(\theta)~>~0 .\end{eqnarray*}\]
This highlights the crucial role played by the cumulant function \(\kappa\) and the canonical (model) parameter \(\theta\).
E.g., the Poisson model has cumulant function \(\kappa(\theta)=\exp(\theta)\) and the gamma model \(\kappa(\theta)=-\log(-\theta)\).
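As a small sanity check, the following sketch recovers these mean and variance formulas for the Poisson and gamma examples by symbolic differentiation of the cumulant function (a minimal sketch, assuming sympy is available; the variable names are ours).

```python
import sympy as sp

theta, varphi, v = sp.symbols("theta varphi v")

# Poisson: kappa(theta) = exp(theta)
kappa_pois = sp.exp(theta)
print(sp.diff(kappa_pois, theta))                     # mean mu_0 = exp(theta)
print(varphi / v * sp.diff(kappa_pois, theta, 2))     # variance (varphi/v) * exp(theta)

# gamma: kappa(theta) = -log(-theta), theta < 0
kappa_gamma = -sp.log(-theta)
print(sp.simplify(sp.diff(kappa_gamma, theta)))                  # mean mu_0 = -1/theta
print(sp.simplify(varphi / v * sp.diff(kappa_gamma, theta, 2)))  # variance (varphi/v)/theta^2 = (varphi/v)*mu_0^2
```

In particular, the gamma variance equals \((\varphi/v)\,\mu_0^2\), consistent with Tweedie's power variance property with \(p=2\) discussed below.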
2.3 Link function and mean parameter space
The canonical link of the chosen EDF is defined by \(h=(\kappa')^{-1}\).
The canonical link \(h\) identifies the mean \(\mu_0\) and the canonical parameter \(\theta\) of the selected EDF by \[ \mu_0={\mathbb E}\left[ Y\right]=\kappa'(\theta)\quad\Longleftrightarrow\quad h(\mu_0)=h\left({\mathbb E}\left[ Y\right]\right)=\theta.\]
This motivates the definition of the mean parameter space \[ \kappa'(\mathring{\boldsymbol{\Theta}})=\{\kappa'(\theta);\, \theta \in \mathring{\boldsymbol{\Theta}}\}.\]
There is a one-to-one correspondence between the mean parameter space \(\kappa'(\mathring{\boldsymbol{\Theta}})\) and the interior of the effective domain \(\mathring{\boldsymbol{\Theta}}\).
\(\Longrightarrow\) Thus, the EDF can either be parametrized by \(\theta\) or by \(\mu_0\).
2.4 Maximum likelihood estimation (MLE)
The log-likelihood is given by \[ \log( f_\theta(Y)) = \frac{v}{\varphi}\left( Y\theta-\kappa(\theta)\right)+c(Y, \varphi/v).\]
The EDF parametrization allows one to easily compute the MLE of \(\theta\). For a single observation \(Y\) it is \[ \widehat{\theta}^{\rm MLE} = h(Y) \quad \Longleftrightarrow \quad \widehat{\mu}_0^{\rm MLE} = Y.\]
- Mathematical small print: The above statement requires that the observation \(Y\) lies in the interior of the mean parameter space; for steep cumulant functions this holds whenever \(Y\) does not lie on the boundary of its support. We omit these technical details and refer to Barndorff-Nielsen (2014) and Wüthrich and Merz (2023).
2.5 Deviance loss function
- The deviance loss function of the selected EDF is defined by \[ L(y,m) = 2\, \frac{\varphi}{v}\left( \log \left(f_{h(y)}(y)\right) - \log \left(f_{h(m)}(y)\right)\right) \ge 0,\] with \(m \in \kappa'(\mathring{\boldsymbol{\Theta}})\) in the mean parameter space and \(y\) in the interior of the convex closure of the support of \(Y\).
- Interesting examples for practice are given in the table below.
2.6 Best asymptotically normal
The deviance loss of \(Y \sim {\rm EDF}(\theta, \varphi/v; \kappa)\) is tailored to the distributional properties of \(Y\).
Minimizing the deviance loss of \({\rm EDF}(\theta, \varphi/v; \kappa)\) is equivalent to estimating the EDF mean \(\mu_0\) by MLE.
Gourieroux, Monfort and Trognon (1984) proved that this estimation procedure is on average optimal on finite samples, provided that \(Y \sim {\rm EDF}(\theta, \varphi/v; \kappa)\) is correctly specified.
Gourieroux, Monfort and Trognon (1984) call this (optimal) estimation procedure best asymptotically normal.
2.7 Examples of deviance losses
| EDF distribution | cumulant \(\kappa(\theta)\) | deviance loss \(L(y,m)\) |
|---|---|---|
| Gaussian | \(\theta^2/2\) | \((y-m)^2\) |
| gamma | \(-\log(-\theta)\) | \(2\left((y-m)/m+\log(m/y)\right)\) |
| inverse Gaussian | \(-\sqrt{-2\theta}\) | \((y-m)^2/(m^2y)\) |
| Poisson | \(e^\theta\) | \(2\left(m-y-y\log(m/y)\right)\) |
| Tweedie \(p\in (1,2)\) | \(\frac{((1-p)\theta)^{\frac{2-p}{1-p}}}{2-p}\) | \(2\left(y\frac{y^{1-p}-m^{1-p}}{1-p}-\frac{y^{2-p}-m^{2-p}}{2-p}\right)\) |
| Bernoulli | \(\log(1+e^\theta)\) | \(2\left(-y \log (m) - (1-y)\log(1-m)\right)\) |
Some of these deviance losses have support restrictions, see next slide.
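For concreteness, here is a minimal numerical sketch of three of these deviance losses (assuming numpy is available; the function names are ours). Each deviance loss is zero at \(y=m\) and positive otherwise.

```python
import numpy as np

def gaussian_deviance(y, m):
    return (y - m) ** 2

def poisson_deviance(y, m):
    # uses the convention 0 * log(0) = 0 for observations y = 0
    y, m = np.asarray(y, dtype=float), np.asarray(m, dtype=float)
    ratio = np.divide(y, m, out=np.ones_like(y), where=(y > 0))
    return 2.0 * (m - y + y * np.log(ratio))

def gamma_deviance(y, m):
    return 2.0 * ((y - m) / m + np.log(m / y))

print(poisson_deviance([0.0, 1.0, 2.0], [0.5, 1.0, 1.5]))  # [1.0, 0.0, 0.1507]
print(gamma_deviance(2.0, 1.5))                            # approx. 0.0913
```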
2.8 Supports of Tweedie’s family
| \(p\) | EDF distribution | support \(Y\) | \(\boldsymbol{\Theta}\) | \(\kappa'(\mathring{\boldsymbol{\Theta}})\) |
|---|---|---|---|---|
| \(p=0\) | Gaussian | \({\mathbb R}\) | \({\mathbb R}\) | \({\mathbb R}\) |
| \(p=1\) | Poisson | \({\mathbb N}_0\) | \({\mathbb R}\) | \((0,\infty)\) |
| \(1<p<2\) | Tweedie’s CP | \([0,\infty)\) | \((-\infty, 0)\) | \((0,\infty)\) |
| \(p=2\) | gamma | \((0,\infty)\) | \((-\infty, 0)\) | \((0,\infty)\) |
| \(p>2\) | generated by positive stable | \((0,\infty)\) | \((-\infty,0]\) | \((0,\infty)\) |
| \(p=3\) | inverse Gaussian | \((0,\infty)\) | \((-\infty,0]\) | \((0,\infty)\) |
- Tweedie’s family is characterized by the power variance property \[ {\rm Var}(Y) = \frac{\varphi}{v}\, \mu_0^p \qquad \text{ with } \mu_0 = \kappa'(\theta);\] see Jørgensen (1997).
3 Regression model
3.1 Regression model fitting
Consider a candidate model class \({\cal M}=\{\mu\}\) of sufficiently nice regression functions \(\mu:{\cal X} \to {\mathbb R}\).
Based on an i.i.d. learning sample \({\cal L}=(Y_i,\boldsymbol{X}_i, v_i)_{i=1}^n\), minimize \[ \widehat{\mu} ~\in~ \underset{ \mu \in {\cal M}}{\arg\min}\, \sum_{i=1}^n \frac{v_i}{\varphi}~L(Y_i,\mu(\boldsymbol{X}_i)),\] for a strictly consistent loss function \(L\) for mean estimation and for given weights \(v_i/\varphi\).
For a parametrized model class \({\cal M}=\{\mu_{\vartheta}\}_\vartheta\), we rather solve \[ \widehat{\vartheta} ~\in~ \underset{ \vartheta}{\arg\min}\, \sum_{i=1}^n \frac{v_i}{\varphi}~L(Y_i,\mu_\vartheta(\boldsymbol{X}_i)).\]
3.2 Example: Poisson log-link GLM
For \(Y_i\) conditionally Poisson, choose the Poisson deviance loss for \(L\).
Select a log-link GLM regression function with parameters \(\vartheta \in {\mathbb R}^{q+1}\) \[ \boldsymbol{X}~\mapsto~ \log(\mu_{\vartheta}(\boldsymbol{X})) = \vartheta_0 +\sum_{j=1}^q \vartheta_j X_j.\]
Solve the (MLE-) Poisson deviance loss minimization \[\begin{eqnarray*}\nonumber \widehat{\vartheta}^{\rm MLE} & = &\underset{\vartheta \in {\mathbb R}^{q+1}}{\arg\min}~\sum_{i=1}^n 2v_i\left(\mu_{\vartheta}(\boldsymbol{X}_i)-Y_i-Y_i\,\log\left(\frac{\mu_{\vartheta}(\boldsymbol{X}_i)}{Y_i}\right)\right). \end{eqnarray*}\]
- \(v_i>0\) are the time exposures, and \(Y_i=N_i/v_i\) are the claim frequencies of the claim counts \(N_i \in {\mathbb N}_0\).
- In the Poisson model the dispersion is \(\varphi=1\).
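The following is a minimal fitting sketch with scikit-learn's `PoissonRegressor` (one possible implementation; the simulated data and variable names are ours). Passing the frequencies \(Y_i=N_i/v_i\) as responses and the exposures \(v_i\) as sample weights reproduces the weighted Poisson deviance minimization above.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(0)

# simulated portfolio (illustrative only): q = 2 covariates, time exposures v_i
n = 5_000
X = rng.normal(size=(n, 2))
v = rng.uniform(0.2, 1.0, size=n)                        # exposures v_i
true_mu = np.exp(-2.0 + 0.3 * X[:, 0] - 0.5 * X[:, 1])   # true expected frequency
N = rng.poisson(v * true_mu)                             # claim counts N_i
Y = N / v                                                # claim frequencies Y_i = N_i / v_i

# weighted Poisson deviance minimization with log-link; alpha = 0 switches off the
# default ridge penalty, and sample_weight plays the role of v_i (with varphi = 1)
glm = PoissonRegressor(alpha=0.0)
glm.fit(X, Y, sample_weight=v)

print("intercept:", glm.intercept_)       # estimate of vartheta_0
print("coefficients:", glm.coef_)         # estimates of (vartheta_1, vartheta_2)
print("fitted frequencies:", glm.predict(X[:3]))
```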
4 Covariate pre-processing
4.1 Notation and design matrix
The \(q\)-dimensional real-valued covariate information of instance \(i\) is denoted by \(\boldsymbol{X}_i\). It takes values in the covariate space \({\cal X}\subseteq {\mathbb R}^q\).
We add an intercept/bias component being equal to 1 \[\boldsymbol{X}_i = (1, X_{i,1}, \ldots, X_{i,q})^\top \in {\mathbb R}^{q+1};\] this uses a slight abuse of notation not indicated in \(\boldsymbol{X}_i\). From the context it will always be clear whether \(\boldsymbol{X}_i\) includes the bias or not.
The design matrix is defined by \[\mathfrak{X}=\begin{pmatrix}1 & X_{1,1} & \cdots & X_{1,q} \\\vdots & \vdots & \ddots & \vdots \\1 & X_{n,1} & \cdots & X_{n,q} \\\end{pmatrix} \in {\mathbb R}^{n\times(q+1)}.\]
Note: the different fonts \(\boldsymbol{X}\), \(X\), \({\cal X}\) and \({\mathfrak X}\) have different meanings.
4.2 Categorical covariates
Categorical covariates (nominal or ordinal) need pre-processing to bring them into a numerical form. This is done by an entity embedding.
Consider a categorical covariate \(X\) that takes values in a finite set \({\cal A}=\{a_1, \ldots, a_K \}\) having \(K\) levels.
The running example in this section has \(K=6\) levels: \[ {\cal A} = \left\{\text{accountant}, \text{ actuary}, \text{ economist}, \text{ quant}, \text{ statistician}, \text{ underwriter}\right\}. \]
We need to bring this into a numerical form.
4.2.1 Ordinal categorical covariates
For ordinal (ordered) levels \((a_k)_{k=1}^K\), use a 1-dimensional entity embedding \[ X \in {\cal A} ~\mapsto ~\sum_{k=1}^K k \, \mathbf{1}_{\{X=a_k\}}.\]
Our running example has an alphabetical ordering.
| level | embedding |
|---|---|
| accountant | 1 |
| actuary | 2 |
| economist | 3 |
| quant | 4 |
| statistician | 5 |
| underwriter | 6 |
One may argue that the alphabetical order is risk insensitive (not useful).
4.2.2 One-hot encoding
One-hot encoding maps each level \(a_k\) to a basis vector in \({\mathbb R}^K\) \[ X \in {\cal A} ~\mapsto ~ \left(\mathbf{1}_{\{X=a_1\}}, \ldots, \mathbf{1}_{\{X=a_K\}}\right)^\top \in {\mathbb R}^K.\]
| level | \(\mathbf{1}_{\{X=a_1\}}\) | \(\mathbf{1}_{\{X=a_2\}}\) | \(\mathbf{1}_{\{X=a_3\}}\) | \(\mathbf{1}_{\{X=a_4\}}\) | \(\mathbf{1}_{\{X=a_5\}}\) | \(\mathbf{1}_{\{X=a_6\}}\) |
|---|---|---|---|---|---|---|
| accountant | 1 | 0 | 0 | 0 | 0 | 0 |
| actuary | 0 | 1 | 0 | 0 | 0 | 0 |
| economist | 0 | 0 | 1 | 0 | 0 | 0 |
| quant | 0 | 0 | 0 | 1 | 0 | 0 |
| statistician | 0 | 0 | 0 | 0 | 1 | 0 |
| underwriter | 0 | 0 | 0 | 0 | 0 | 1 |
One-hot encoding does not lead to full rank design matrices \({\mathfrak X}\) because there is a redundancy: the \(K\) indicator columns sum to the intercept column, so the first \(K-1\) components are sufficient.
4.2.3 Dummy coding
Dummy coding selects a reference level, e.g., \(a_2=\text{actuary}\). Based on this selection, all other levels are measured relative to this reference level \[ X \in {\cal A} ~\mapsto ~ \left(\mathbf{1}_{\{X=a_1\}}, \mathbf{1}_{\{X=a_3\}}, \mathbf{1}_{\{X=a_4\}},\ldots, \mathbf{1}_{\{X=a_K\}}\right)^\top \in {\mathbb R}^{K-1}. \]
| level | \(\mathbf{1}_{\{X=a_1\}}\) | \(\mathbf{1}_{\{X=a_3\}}\) | \(\mathbf{1}_{\{X=a_4\}}\) | \(\mathbf{1}_{\{X=a_5\}}\) | \(\mathbf{1}_{\{X=a_6\}}\) |
|---|---|---|---|---|---|
| accountant | 1 | 0 | 0 | 0 | 0 |
| actuary | 0 | 0 | 0 | 0 | 0 |
| economist | 0 | 1 | 0 | 0 | 0 |
| quant | 0 | 0 | 1 | 0 | 0 |
| statistician | 0 | 0 | 0 | 1 | 0 |
| underwriter | 0 | 0 | 0 | 0 | 1 |
This leads to a full rank design matrix \({\mathfrak X}\), but also to sparsity, i.e., most of the entries are zero (for \(K\) large). This may be problematic numerically.
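A small pandas sketch of one-hot encoding and dummy coding for the running example (assuming pandas is available; note that `pd.get_dummies` drops the first level as the reference for dummy coding, whereas the slides use actuary).

```python
import pandas as pd

levels = ["accountant", "actuary", "economist", "quant", "statistician", "underwriter"]
X = pd.Series(pd.Categorical(["actuary", "quant", "accountant", "actuary"],
                             categories=levels), name="profession")

# one-hot encoding: K = 6 indicator columns (redundant together with an intercept)
one_hot = pd.get_dummies(X)

# dummy coding: drop one reference level, giving K - 1 = 5 indicator columns
dummy = pd.get_dummies(X, drop_first=True)

print(one_hot.astype(int))
print(dummy.astype(int))
```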
4.2.4 Entity embedding
Borrowing ideas from natural language processing (NLP), use low dimensional entity embeddings, with proximity related to similarity; see Brébisson et al. (2015), Guo and Berkhahn (2016), Richman (2021).
Choose an embedding dimension \(b \in {\mathbb N}\) – this is a hyper-parameter selected by the modeler – typically \(b\ll K\).
An entity embedding (EE) is defined by \[ \boldsymbol{e}^{\rm EE}: {\cal A} \to {\mathbb R}^b, \qquad X \mapsto \boldsymbol{e}^{\rm EE}(X).\]
In total this entity embedding involves \(b\cdot K\) embedding weights (parameters).
Embedding weights need to be determined either by the modeler (manually) or during the model fitting procedure (algorithmically), and proximity in embedding should reflect similarity in (risk) behavior.
- Manually chosen example with \(b\cdot K=24\) embedding weights for embedding dimension \(b=4\).
| level | finance | maths | stats | liability |
|---|---|---|---|---|
| accountant | 0.5 | 0 | 0 | 0 |
| actuary | 0.5 | 0.3 | 0.5 | 0.5 |
| economist | 0.5 | 0.2 | 0.5 | 0 |
| quant | 0.7 | 0.3 | 0.3 | 0 |
| statistician | 0 | 0.5 | 0.8 | 0 |
| underwriter | 0 | 0.1 | 0.1 | 0.8 |
- Note: the weights do not need to be normalized, i.e., only proximity in \({\mathbb R}^b\) is relevant.
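The manually chosen embedding from the table corresponds to a simple lookup table; a numpy sketch is given below (in practice, the \(b\cdot K\) embedding weights are typically learned during model fitting, e.g., via an embedding layer of a neural network).

```python
import numpy as np

# embedding dimension b = 4: components (finance, maths, stats, liability)
embedding = {
    "accountant":   np.array([0.5, 0.0, 0.0, 0.0]),
    "actuary":      np.array([0.5, 0.3, 0.5, 0.5]),
    "economist":    np.array([0.5, 0.2, 0.5, 0.0]),
    "quant":        np.array([0.7, 0.3, 0.3, 0.0]),
    "statistician": np.array([0.0, 0.5, 0.8, 0.0]),
    "underwriter":  np.array([0.0, 0.1, 0.1, 0.8]),
}

def entity_embed(x: str) -> np.ndarray:
    """Map a categorical level to its b-dimensional embedding vector e^EE(x)."""
    return embedding[x]

# proximity in R^b should reflect similarity in (risk) behavior
print(np.linalg.norm(entity_embed("actuary") - entity_embed("statistician")))
print(np.linalg.norm(entity_embed("actuary") - entity_embed("underwriter")))
```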
4.2.5 Target encoding
Especially for regression trees, one uses target encoding.
Consider a sample \((Y_i, X_i, v_i)_{i=1}^n\) with categorical covariates \(X_i \in {\cal A}\), real-valued responses \(Y_i\) and weights \(v_i>0\).
Compute the weighted sample means on all levels \(a_k \in {\cal A}\) by \[ \overline{y}_k = \frac{\sum_{i=1}^n v_iY_i\, \mathbf{1}_{\{X_i=a_k\}}} {\sum_{i=1}^n v_i \mathbf{1}_{\{X_i=a_k\}}}.\]
These weighted sample means \((\overline{y}_k)_{k=1}^K\) are used like ordinal levels \[ X \in {\cal A} ~\mapsto ~\sum_{k=1}^K \overline{y}_k \, \mathbf{1}_{\{X=a_k\}}.\]
Though convincing at first sight, this does not consider interactions within the covariates, e.g., for scarce levels it may happen that a high or low value is mainly implied by another covariate component.
Be careful: target encoding may lead to information leakage.
4.2.6 Credibilitized target encoding
For scarce levels, one should credibilitize the target encoding; see Micci-Barreca (2001) and Bühlmann (1967).
Goal: Assess how credible the individual estimates \(\overline{y}_k\) are.
Improve unreliable ones by mixing them with the global weighted empirical mean \(\overline{y}=\sum_{i=1}^n v_iY_i/\sum_{i=1}^n v_i\), providing \[ \overline{y}^{\rm cred}_k = \alpha_k\,\overline{y}_k + \left(1-\alpha_k\right) \overline{y},\] with credibility weights for \(1\le k \le K\) \[ \alpha_k = \frac{\sum_{i=1}^n v_i\mathbf{1}_{\{X_i=a_k\}}}{\sum_{i=1}^n v_i\mathbf{1}_{\{X_i=a_k\}}+ \tau}~\in~[0,1].\]
The shrinkage parameter \(\tau \ge 0\) is a hyper-parameter selected by the modeler; it is also called credibility coefficient.
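A pandas sketch of (credibilitized) target encoding (the function and column names are ours; \(\tau=0\) recovers plain target encoding). In view of the leakage warning above, in practice the encoding would be computed on the training data (or out-of-fold) only.

```python
import pandas as pd

def credibilitized_target_encoding(df, level_col, y_col, v_col, tau=50.0):
    """Weighted level means, shrunk towards the global mean with credibility weights."""
    d = pd.DataFrame({
        "level": df[level_col],
        "vy": df[v_col] * df[y_col],
        "v": df[v_col],
    })
    g = d.groupby("level")[["vy", "v"]].sum()
    y_bar_global = d["vy"].sum() / d["v"].sum()   # global weighted mean
    y_bar = g["vy"] / g["v"]                      # weighted level means
    alpha = g["v"] / (g["v"] + tau)               # credibility weights alpha_k
    y_cred = alpha * y_bar + (1 - alpha) * y_bar_global
    return df[level_col].map(y_cred)              # encode each instance by its level mean

df = pd.DataFrame({
    "profession": ["actuary", "actuary", "quant", "underwriter"],
    "Y": [0.10, 0.20, 0.05, 0.40],
    "v": [1.0, 0.5, 0.8, 0.1],
})
print(credibilitized_target_encoding(df, "profession", "Y", "v", tau=1.0))
```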
4.3 Continuous covariates
In theory, continuous covariates do not need any pre-processing.
In practice, it might be that the continuous covariates do not provide the right functional form, or they may live on the wrong scale.
E.g., we may replace a positive covariate \(X>0\) by a 4-dimensional pre-processed covariate \[ X ~\mapsto ~ (X, \log(X), \exp\{X\}, (X-10)^2)^\top.\] This has a linear, a logarithmic, an exponential and a quadratic term.
In GLMs, one often discretizes continuous covariates, e.g., one builds age classes. For a finite partition \((I_k)_{k=1}^K\) of the support of the continuous covariate \(X\), assign the categorical labels \(a_k \in {\cal A}\) to \(X\) by \[ X ~\mapsto ~ \sum_{k=1}^K a_k\, \mathbf{1}_{\{X \in I_k\}}.\]
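For example, age classes can be built with `pd.cut` (a sketch; the breakpoints and labels are illustrative only).

```python
import pandas as pd

age = pd.Series([19, 23, 35, 47, 62, 78])
# finite partition (I_k) of the age range with categorical labels a_k
age_class = pd.cut(
    age,
    bins=[18, 25, 35, 45, 55, 65, 120],
    labels=["19-25", "26-35", "36-45", "46-55", "56-65", "66+"],
)
print(age_class)
```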
4.3.1 Standardization and MinMaxScaler
Gradient descent fitting methods require that all covariate components live on the same scale.
Assume we have \(n\) instances with a continuous covariate \((X_i)_{i=1}^n\).
Standardization considers the transformation \[ X ~\mapsto ~ \frac{X - \widehat{m}}{\widehat{s}},\] with empirical mean \(\widehat{m}\) and empirical standard deviation \(\widehat{s}>0\) of \((X_i)_{i=1}^n\).
The MinMaxScaler is given by the transformation \[ X ~\mapsto ~ 2\,\frac{X - \min_{1\le i \le n}X_i}{\max_{1\le i \le n}X_i-\min_{1\le i \le n}X_i} -1.\]
Important: The identical transformation needs to be applied to all samples (learning data, test data, training data, validation data,…)! This requires storage of the transformation parameters.
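A scikit-learn sketch (one possible implementation): the transformation parameters are estimated on the learning data only, stored in the fitted scaler, and then applied unchanged to every other sample.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_learn = np.array([[18.0], [25.0], [40.0], [65.0]])   # learning data, single covariate
X_test = np.array([[30.0], [70.0]])                    # test data

# standardization: (X - mean) / std, with parameters estimated on the learning data
std = StandardScaler().fit(X_learn)
print(std.transform(X_test))

# MinMaxScaler mapped to [-1, 1], as in the formula above
mm = MinMaxScaler(feature_range=(-1, 1)).fit(X_learn)
print(mm.transform(X_test))   # test values outside the learning range fall outside [-1, 1]
```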
5 Regularization
The fast reader can skip this section and come back to it later.
5.1 Regularization: Introduction
Generally speaking, a regression model \(\boldsymbol{X} \mapsto \mu(\boldsymbol{X})\) is correctly specified if it integrates all relevant covariates in the correct form, and if there is no redundant, missing or wrong covariate in the regression function:
redundant means that we include multiple covariates that essentially present the same information (e.g., being collinear);
missing means that we have forgotten important covariate information that would be available; and
wrong means that the covariate does not impact the response.
In real world problems, with unknown regression functions, this is part of covariate pre-processing, model selection and covariate selection.
One might be tempted to include any available information in the regression model, to ensure that nothing gets forgotten. However, when fitting on finite samples, this is usually not a good strategy, because one may easily run into over-fitting, collinearity and identifiability issues, and, generally, into a poor predictive model because model fitting has not been successful.
Model complexity control is crucial for a successful model fitting and predictive modeling procedure. Typically, one aims for a sparse, parsimonious model that uses the fewest necessary variables while still providing accurate predictions.
In practice, knowing neither the true model nor the (causal) regression structure and the factors that impact the responses, one often starts with slightly too large models, which one then tries to shrink by regularization. Regularization penalizes model complexity and/or extreme regression coefficients.
5.2 Regularization techniques
We describe the most popular regularization techniques in this section; see Hastie, Tibshirani and Wainwright (2015).
Ridge regularization (also known as Tikhonov (1943) regularization or \(L^2\)-regularization);
LASSO regularization of Tibshirani (1996) (\(L^1\)-regularization);
Elastic net regularization of Zou and Hastie (2005);
Group LASSO regularization of Yuan and Lin (2006);
Fused LASSO regularization of Tibshirani et al. (2005);
Smoothly clipped absolute deviation (SCAD) regularization of Fan and Li (2001); we skip the details of SCAD in this notebook.
5.3 Regularization preparation
Consider parametric regression estimation \(\mu_{\widehat{\vartheta}}(\cdot)\), selecting from a class of candidate models \({\cal M}=\{\mu_\vartheta\}_\vartheta\) parametrized by \(\vartheta\).
The parameter \(\vartheta\) is assumed to be a \((r+1)\)-dimensional vector \[ \vartheta = (\vartheta_0, \vartheta_1, \ldots, \vartheta_r)^\top \in {\mathbb R}^{r+1}.\] Typically, \((\vartheta_{k})_{k=1}^r\) parametrizes the terms in \(\mu_{\vartheta}(\boldsymbol{X})\) involving the covariates \(\boldsymbol{X}\), and \(\vartheta_0\) is a parameter for the covariate-free part of the regression function determining the overall level of the regression function; \(\vartheta_0\) is the intercept/bias term, see GLM above.
This bias term \(\vartheta_0\) needs to be excluded from regularization to be able to calibrate the overall level of the model.
Denote by \(\vartheta_{\setminus 0}= (\vartheta_1, \ldots, \vartheta_r)^\top \in {\mathbb R}^{r}\) the parameter vector excluding the bias term \(\vartheta_0\).
5.4 Regularization: penalty function
Regularized parameter estimation is achieved by\[ \widehat{\vartheta} ~\in~ \underset{ \vartheta}{\arg\min}\, \left( \sum_{i=1}^n \frac{v_i}{\varphi}~L(Y_i,\mu_\vartheta(\boldsymbol{X}_i)) + \eta \, R(\vartheta_{\setminus 0})\right),\] where \(R:{\mathbb R}^r\to {\mathbb R}_+\) is a penalty function and \(\eta \ge 0\) is the regularization parameter.
We give the most common examples of penalty functions, below.
Noteworthy, we do not consider any scaling \(1/n\) in front of the sum. Such a scaling would not affect the optimal solution in the non-regularized case, but it would make the regularization parameter \(\eta\) independent of the sample size.
5.5 Ridge regularization/regression
Ridge regression selects a squared \(L^2\)-norm penalty function for \(R\) \[ \widehat{\vartheta}^{\rm ridge} ~\in~ \underset{ \vartheta}{\arg\min}\, \left( \sum_{i=1}^n \frac{v_i}{\varphi}~L(Y_i,\mu_\vartheta(\boldsymbol{X}_i)) + \eta \, \|\vartheta_{\setminus 0}\|^2_2 \right).\]
Ridge regularization generally punishes large values in \((\vartheta_{k})_{k=1}^r\); this is called shrinkage.
The level of shrinkage is determined by the regularization parameter \(\eta \ge 0\), and an optimal choice is determined, e.g., by cross-validation.
This intuition of shrinkage indicates why we exclude the bias term \(\vartheta_0\) from regularization: otherwise, the overall level of the model would be shrunk, resulting in a statistically biased model.
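A sketch of ridge-regularized Poisson GLM fitting with scikit-learn, where the `alpha` parameter of `PoissonRegressor` plays the role of \(\eta\) and the intercept is not penalized; the regularization parameter is chosen by cross-validation (the simulated data are ours).

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
n = 5_000
X = rng.normal(size=(n, 2))
v = rng.uniform(0.2, 1.0, size=n)
Y = rng.poisson(v * np.exp(-2.0 + 0.3 * X[:, 0] - 0.5 * X[:, 1])) / v

# ridge-regularized Poisson GLM: alpha plays the role of eta; the intercept is not penalized
grid = GridSearchCV(
    PoissonRegressor(),
    param_grid={"alpha": np.logspace(-4, 2, 13)},
    scoring="neg_mean_poisson_deviance",   # deviance-based cross-validation
    cv=5,
)
grid.fit(X, Y, sample_weight=v)
print("selected regularization parameter:", grid.best_params_)
```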
5.6 LASSO regularization/regression
LASSO regularization (least absolute shrinkage and selection operator regularization) selects an \(L^1\)-norm penalty function for \(R\) \[ \widehat{\vartheta}^{\rm LASSO} ~\in~ \underset{ \vartheta}{\arg\min}\, \left( \sum_{i=1}^n \frac{v_i}{\varphi}~L(Y_i,\mu_\vartheta(\boldsymbol{X}_i)) + \eta \, \|\vartheta_{\setminus 0}\|_1\right).\]
The behavior of LASSO regularization is fundamentally different from ridge regularization. This difference results from the fact that the squared \(L^2\)-norm is differentiable at the origin, whereas the \(L^1\)-norm is not.
As a consequence of this non-differentiability of the \(L^1\)-norm at the origin, the LASSO optimum may set some of the components of \((\vartheta_{k})_{k=1}^r\) exactly to zero.
The more we increase the regularization parameter \(\eta \uparrow \infty\), the more components of \((\vartheta_{k})_{k=1}^r\) are set exactly to zero at the optimum.
Thus, increasing the regularization parameter \(\eta\) leads to sparsity in \((\vartheta_{k})_{k=1}^r\), and, consequently, to a sparse regression model for large regularization parameters \(\eta \gg 0\).
This can be interpreted as variable importance, i.e., for large \(\eta\) only important covariate components survive.
Often, one performs variable selection with LASSO regularization in a first step, and in a second step one fits a non-regularized model on the selected components.
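A sketch of this two-step procedure in the Gaussian (squared loss) case with scikit-learn's `Lasso` (the simulated data are ours; for deviance losses other than the Gaussian one, an \(L^1\)-penalized GLM solver would be used instead).

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(2)
n, q = 1_000, 10
X = rng.normal(size=(n, q))
# only the first two covariate components carry signal in this illustrative example
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

# step 1: LASSO (L1) regularization; larger alpha (our eta) sets more coefficients to zero
lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_ != 0.0)
print("selected components:", selected)

# step 2: refit a non-regularized model on the selected components only
refit = LinearRegression().fit(X[:, selected], y)
print("refitted coefficients:", refit.coef_)
```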
5.7 Geometric interpretation of ridge vs. LASSO
Generally, these regularized regression problems can be interpreted geometrically using the method of Lagrange:
the feasible set of solutions in the ridge regression case is an \(L^2\)-ball, and
in the LASSO regression case an \(L^1\)-ball (a cross-polytope with corners);
we refer to the next graph.
This geometric difference (having corners or not) precisely distinguishes the sparsity of LASSO from the shrinkage of ridge regression; it is related to whether the boundaries of these convex sets are differentiable or not.
In fact, one considers orthogonal projections onto the ball and the cross-polytope, respectively, and sparsity in LASSO results from corner solutions in the cross-polytope case.

5.8 Best-subset selection regularization
Best-subset selection regularization selects an \(L^0\)-norm penalty function for \(R\) \[ \widehat{\vartheta}^{\rm BSS} ~\in~ \underset{ \vartheta}{\arg\min}\, \left( \sum_{i=1}^n \frac{v_i}{\varphi}~L(Y_i,\mu_\vartheta(\boldsymbol{X}_i)) + \eta \, \sum_{k=1}^r \mathbf{1}_{\{\vartheta_k \neq 0 \}}\right).\]
This version is not used very often in applications, mainly because the resulting optimization problem is combinatorial and difficult to solve. In fact, LASSO regularization is considered as a tractable (convex) surrogate instead.
5.9 Elastic net regularization
Elastic net regularization combines LASSO and ridge regularization \[ \widehat{\vartheta}^{\rm net} \in \underset{ \vartheta}{\arg\min}\, \left( \sum_{i=1}^n \frac{v_i}{\varphi}L(Y_i,\mu_\vartheta(\boldsymbol{X}_i)) + \eta \left( \alpha \|\vartheta_{\setminus 0}\|_1 +(1-\alpha) \|\vartheta_{\setminus 0}\|^2_2\right)\right),\] with \(\alpha \in [0,1]\).
Elastic net regularization overcomes some issues of LASSO: e.g., LASSO tends to select only one out of a group of correlated covariate components, whereas the elastic net exhibits a grouping effect, assigning similar weights to correlated covariate components.
5.10 Grouped covariates
The previous regularization proposals have been focusing on shrinking the regression parameters \((\vartheta_{k})_{k=1}^r\) and/or on making regression models sparse by setting some of these parameters \(\vartheta_k\) exactly to zero.
In these regularization methods all terms and parameters \((\vartheta_{k})_{k=1}^r\) are considered individually.
If we want to treat some of them jointly in a group, e.g., for a dummy coded categorical covariate, regularization should act simultaneously on all parameters that belong to that group.
We group the (categorical) covariates. Assume we have \(G\) groups \[ \vartheta_{\setminus 0} = (\vartheta^{(1)}, \ldots, \vartheta^{(G)})^\top \in {\mathbb R}^{d_1 + \ldots + d_G}={\mathbb R}^{r},\] where each group \(\vartheta^{(k)} \in {\mathbb R}^{d_k}\) contains \(d_k\) components of \(\vartheta_{\setminus 0}\).
5.11 Group LASSO regularization
Group LASSO regularization is obtained by solving \[ \widehat{\vartheta} ~\in~ \underset{ \vartheta}{\arg\min}\, \left( \sum_{i=1}^n \frac{v_i}{\varphi}~L(Y_i,\mu_\vartheta(\boldsymbol{X}_i)) + \sum_{k=1}^G \eta_k \, \|\vartheta^{(k)}\|_2 \right),\] for regularization parameters \(\eta_k\ge 0\).
For increasing regularization parameters \(\eta_k\), group LASSO regularization leads to sparsity in setting the entire block of covariates \(\vartheta^{(k)} \in {\mathbb R}^{d_k}\) simultaneously equal to zero.
5.12 Fused LASSO regularization
Assume there is an adjacency relation on the covariate components, i.e., the covariate component \(X_j\) is naturally embedded between the components \(X_{j-1}\) and \(X_{j+1}\), and we try to achieve a certain smoothness of the regression parameters for these covariates.
Fused LASSO regularization is obtained by solving \[ \widehat{\vartheta} ~\in~ \underset{ \vartheta}{\arg\min}\, \left( \sum_{i=1}^n \frac{v_i}{\varphi}~L(Y_i,\mu_\vartheta(\boldsymbol{X}_i)) + \sum_{j=2}^r \eta_j \, |\vartheta_j-\vartheta_{j-1}|\right),\] for regularization parameters \(\eta_j \ge 0\).
The fused LASSO proposal enforces sparsity in the differences of regression parameter values for adjacent variables, i.e., it favors piecewise constant parameters; the original proposal of Tibshirani et al. (2005) additionally includes an ordinary \(L^1\) penalty enforcing sparsity in the non-zero components themselves. The version above considers first order differences, and one could also consider second (or higher) order differences.
5.13 Positivity and monotonicity regularization
One may want to enforce that the parameters are positive \[ \widehat{\vartheta} ~\in~ \underset{ \vartheta}{\arg\min}\, \left( \sum_{i=1}^n \frac{v_i}{\varphi}~L(Y_i,\mu_\vartheta(\boldsymbol{X}_i)) + \sum_{j=1}^r \eta_j \, (0-\vartheta_j)_+\right).\]
Enforcement of a monotone increasing property \[ \widehat{\vartheta} ~\in~ \underset{ \vartheta}{\arg\min}\, \left( \sum_{i=1}^n \frac{v_i}{\varphi}~L(Y_i,\mu_\vartheta(\boldsymbol{X}_i)) + \sum_{j=2}^r \eta_j \, (\vartheta_{j-1}-\vartheta_{j})_+\right),\] with rectified linear unit (ReLU) function \[ u \mapsto (u)_+ =\max\{ u, 0 \} = u\, \mathbf{1}_{\{u>0\}}.\]
The above regularizes parameters \(\vartheta_{\setminus 0}\), but one could also enforce smoothness and monotonicity in the regression function components (e.g., mortality graduation); see Richman and Wüthrich (2024).
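A sketch of the monotonicity penalty in a Gaussian toy example, solving the penalized problem numerically with scipy (the simulated data, the common choice \(\eta_j \equiv \eta\), and the derivative-free optimizer are illustrative assumptions).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n, r = 500, 5
X = rng.normal(size=(n, r))
vartheta_true = np.array([0.2, 0.4, 0.6, 0.8, 1.0])    # monotone increasing parameters
y = 1.0 + X @ vartheta_true + rng.normal(scale=0.3, size=n)

eta = 100.0   # common regularization parameter eta_j = eta

def objective(params):
    b0, b = params[0], params[1:]
    squared_loss = np.sum((y - b0 - X @ b) ** 2)              # Gaussian deviance loss
    mono_penalty = np.sum(np.maximum(b[:-1] - b[1:], 0.0))    # sum_j (vartheta_{j-1} - vartheta_j)_+
    return squared_loss + eta * mono_penalty

res = minimize(objective, x0=np.zeros(r + 1), method="Nelder-Mead",
               options={"maxiter": 20_000, "xatol": 1e-6, "fatol": 1e-6})
print(res.x)   # fitted (vartheta_0, vartheta_1, ..., vartheta_r)
```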
Copyright
© The Authors
This notebook and these slides are part of the project “AI Tools for Actuaries”. The lecture notes can be downloaded from:
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5162304
- This material is provided to reusers to distribute, remix, adapt, and build upon in any medium or format, for noncommercial purposes only, and only so long as attribution and credit are given to the original authors and source, and you indicate whether changes were made. This aligns with the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).