options(keras.view_metrics = FALSE)
# Load libraries
library(data.table)
library(dplyr)
library(ggplot2)
library(reshape2)
library(HMDHFDplus)
library(forecast)
library(patchwork)AI TOOLS FOR ACTUARIES
Chapter 8: Example Mortality Rate Forecasting with CNNs, RNNs and Transformers
Chapter 8: Example Mortality Rate Forecasting with CNNs, RNNs and Transformers
1 Overview
This notebook applies deep learning to mortality rate forecasting.
We consider the Swiss Male population taken from the Human Mortality Database (HMD) (2024).
We forecast mortality rates using convolutional neural networks (CNNs), recurrent neural networks (RNNs) and Transformers. These deep learning architectures were introduced in previous notebooks.
The results are benchmarked against the classical Lee and Carter (1992) mortality model.
The numerical results in this notebook can clearly be improved. Our main goal is to explain the functioning of these deep learning architectures, and we do not put too much emphasis on trying to find the best possible predictive model.
Out-of-time forecasting set-up
We perform a proper out-of-time forecast:
Model fitting is done on calendar years \({\cal L}=\{1900,\ldots, 1999\}\).
Forecasting is done on calendar years \({\cal T}=\{2000,\ldots, 2021\}\).
Since there are observations available for the calendar years \({\cal T}\), we can benchmark the forecasts against their true outcomes.
Notation
Let \({\cal X} = \{0, \dots, 89\}\) denote the set of ages under consideration. Set for the age range \(q=|{\cal X}|=90\).
\({\cal L} = \{1900, \dots, 1999\}\) is the set of calendar years used for learning.
\({\cal T} = \{2000, \dots, 2021\}\) is the set of calendar years to be forecast.
Let \(m_{x,t}\) be the observed mortality rate of age group \(x \in {\cal X}\) in calendar years \(t \in {\cal L}\cup {\cal T}\).
We denote by \(\widehat{m}_{x,t}\) the estimates of the expected mortality rates for the calendar years \(t \in {\cal L}\) and the forecasts of the mortality rates for the calendar years \(t \in {\cal T}\):
there is a distinct difference between estimating an expected value (in our case past expected mortality); and
using an estimated expected value for forecasting an event in the future (random mortality rate).
2 Exploratory data analysis
We select the Swiss male population for our example.
We typically study mortality data on the log-scale: \(\log m_{x,t}\).
We start by loading the necessary libraries.
- Next, we load and pre-process the observed mortalities.
mort_data <- readHMD("../Data/CHE.Mx_1x1.txt") %>% data.table()
mort_data <- mort_data[Age < 90 & Year > 1899 & Year < 2022]
mort_data <- melt.data.table(mort_data, id.vars = c("Year", "Age"), measure.vars = c("Female", "Male"), variable.name = "Sex", value.name = "mx")
# log-mortality rates (add a small number for zero deaths)
mort_data[, logmx := log(pmax(mx, exp(-11)))]
# plot data
p0 = ggplot(mort_data, aes(x = Age, y = logmx, group = Year, color = Year)) + geom_line(alpha = 0.6) + facet_wrap(~ Sex) + scale_color_viridis_c(option = "D") +
labs(x = "age", y = "log-mortality rates", color = "year") +
theme_minimal()- The next plots illustrate the log-mortality rates at different ages and in different calendar years of the Swiss female (lhs) and the Swiss male (rhs) populations.


These heatmaps show the observed log-mortality rates \(\log m_{x,t}\) of the Swiss female (lhs) and the Swiss male (rhs) populations.
Especially, the Swiss male population shows non-linear structure in log-mortality rates in both year and age axes; e.g., there is the HIV increase before the millennium. Naturally, this poses some challenges in statistical modeling.
3 Lee-Carter (LC) model
The Lee and Carter (1992) model describes how mortality rates vary by age \(x\) and evolve over calendar time \(t\).
It assumes the following model for the log-mortality rates \(\log m_{x,t}\) \[ \log m_{x,t} = a_x + b_x k_t + \varepsilon_{x,t},\] where
- \(a_x\) is the average log-mortality rate at age \(x\),
- \(k_t\) captures the time trend in mortality developments across all ages,
- \(b_x\) represents how responsive mortality rates at age \(x\) are to changes in the overall mortality level,
- \(\varepsilon_{x,t}\) is the (random) error term.
The parameters are estimated with principal component analysis (PCA); PCA is discussed in the notebooks on unsupervised learning.
3.1 Lee-Carter model fitting
- We use the learning-test split as described above: \[\begin{eqnarray*} {\cal L} &=& \{t_0,\ldots, t'\}= \{1900, \dots, 1999\}, \\ {\cal T} &=& \{t'+1, \ldots, t_1\}= \{2000, \dots, 2021\}. \end{eqnarray*}\]
# Define the last (learning) observation year t'=1999
obs.year <- 1999
subset_data <- mort_data[Sex == "Male"] # select the Swiss male population
learn <- subset_data[Year <= obs.year]
test <- subset_data[Year > obs.year]
first.year <- min(subset_data$Year)
last.year <- max(subset_data$Year)- We guide through the fitting procedure of the Lee and Carter (1992) model step-by-step.
Center the log-mortality rates: Compute the empirical mean of \(\log m_{x,t}\) over time for each age cohort \(x \in {\cal X}\) \[ \widehat{a}_x = \frac{1}{t'-t_0+1} \sum_{t=t_0}^{t'} \log m_{x,t}.\]
Compute the centered log-mortality rates \[ \widetilde{m}_{x,t} := \log m_{x,t} - \widehat{a}_x.\]
learn[, ax := mean(logmx), by = Age]
learn[, mx_adj := logmx - ax]
ax <- unique(learn$ax)
# bring centered log-mortality rates into 2D tensor structure
rates_mat <- dcast(learn, Age ~ Year, value.var = "mx_adj")[, -1] %>% as.matrix()Apply the PCA to the centered log-mortality tensor \(\widetilde{M} = (\widetilde{m}_{x,t})_{x,t}\) to extract the first principal component. Technically, this is done by the singular value decomposition (SVD); see notebook on PCA.
The SVD gives \[ \widetilde{M} = U \Lambda V^\top, \] with left-singular and right-singular matrices \(U\) and \(V\), respectively, and diagonal matrix \(\Lambda\) containing the (ordered) singular values \(\lambda_j\).
The first singular vectors of \(U\) and \(V\), called \(u^{(1)}\) and \(v^{(1)}\), are extracted to receive the least-squares estimation, for \(x \in {\cal X}\) and \(t \in {\cal L}\), \[ \widehat{b}_x = u^{(1)}_x \, \lambda_1 \qquad \text{ and } \qquad \widehat{k}_t = v^{(1)}_t. \]
svd_fit <- svd(rates_mat)
bx <- svd_fit$u[,1] * svd_fit$d[1]
kt <- svd_fit$v[,1]- This approximates \(\widetilde{M}=(\widetilde{m}_{x,t})_{x,t}\) by \((\widehat{b}_x \widehat{k}_t)_{x,t}=\lambda_1 u^{(1)}(v^{(1)})^\top\).
For identifiability reasons, one usually uses the normalization \[ \sum_{x \in {\cal X}} \widehat{b}_x = 1 \quad \text{ and } \quad \sum_{t \in {\cal L}} \widehat{k}_t = 0. \]
This centers \(\widehat{k}_t\) and scales \(\widehat{b}_x\) accordingly.
#Normalize components for identifiability reasons
c1 <- mean(kt)
c2 <- sum(bx)
ax <- ax + c1 * bx # translate to center the kt's
bx <- bx / c2 # scale to normalize the bx's
kt <- (kt - c1) * c2 # center the kt's\((k_t)_{t \in {\cal L}}\) can only be estimated on the learning set \({\cal L}\). In fact, because every calendar year \(t\) has its own parameter \(k_t\), calendar years are treated as categorical covariates in the LC model.
For forecasting future mortality rates, one needs to extrapolate \((k_t)_{t \in {\cal L}}\) to the future calendar years \(t \in {\cal T}\). For this, one typically assumes that \((k_t)_t\) follows a random walk with drift, i.e., for \(h>0\) we assume \[ k_{t+h} = k_{t+h-1} + \theta + \eta_h, \] with drift \(\theta\) and i.i.d. centered error terms \(\eta_h\).
This gives a recursive extrapolation method beyond the latest observed calendar year \(t'\).
# Random walk (with drift) extrapolation
t_forecast <- length(unique(test$Year))
k_forecast <- rwf(kt, h = t_forecast, drift = TRUE)$mean- Estimating and forecasting the log-mortality rates \[ \log \widehat{m}_{x,t} = \widehat{a}_x + \widehat{b}_x \widehat{k}_{t}.\]
learn$LC <- ((ax + bx %*% t(kt)) %>% melt())$value # 1900-1999
test$LC <- ((ax + bx %*% t(k_forecast)) %>% melt())$value # 2000-2021
4 1D convolutional neural network
- There are two different ways in using CNN architectures for mortality rate prediction:
Interpret the mortality surface \((m_{x,t})_{x,t}\) as an image. This suggests to use a 3D mortality tensor with \((x,t)\) describing the spatial components. The channel(s) either describe the mortality intensity (\(q=1\)) or (as in the figure above) the RGB color channels (\(q=3\)). This 3D tensor approach was studied in Wang, Zhang and Zhu (2021) using a 2D CNN architecture.
Alternatively, understand \((m_{x,t})_{x,t}\) as a 2D tensor with time axis \(t\), and the ages \(x \in {\cal X}\) reflect the different channels (\(q=|{\cal X}|\)). This 2D tensor approach was studied in Perla et al. (2021) using a 1D CNN architecture.
- This notebook applies the second approach of a 1D CNN architecture, because the same 2D input tensor can also be used for RNN and Transformer architectures.
We present a simple 1D CNN architecture for mortality rate prediction.
This simple 1D CNN only uses linear layers (activation functions) on the log-mortality rate data.
We will conclude that this (first) linear 1D CNN approach is not competitive. To build a more powerful CNN, we need to study non-linear variants. The main purpose of this simple 1D CNN is to explain the functioning and implementation of CNNs.
For most powerful architectures, one should apply multi-population learning.
Generally, deep learning architectures need some care when used for extrapolation. One should carefully analyze the slope and curvature at the observation boundary for the extrapolation (resulting from the non-linear activation functions).
We start by introducing the 1D CNN architecture.
This 1D CNN architecture takes as input a 2D tensor \(\boldsymbol{X}_{1:T} \in \mathbb{R}^{T\times q}\); \(q=|{\cal X}|=90\) denotes the age range considered.
The 1D CNN layer runs \(q_1\) filters with kernel size \(K\) and stride \(\delta=1\) along the time axis of this 2D input tensor.
This results in a new 2D tensor of dimension \((T-K+1) \times q_1\).
This new 2D tensor is flattened and mapped (decoded) by a FNN layer to an output vector of dimension \(q=|{\cal X}|\). Thus, we predict the mortality on all \(q\) ages in one go.
The next code gives this 1D CNN architecture.
library(tensorflow); library(keras)
# A simple 1D CNN model
CNN1D_Model <- function(seed, input_size, filters, kernel_size, stride = 1, output_size) {
k_clear_session(); set.seed(seed); set_random_seed(seed)
#
Design <- layer_input(shape = input_size, dtype = 'float32')
#
Response <- Design %>%
layer_conv_1d(filters = filters, stride = stride, kernel_size = kernel_size, activation = 'linear') %>%
layer_flatten() %>%
layer_dense(output_size, activation = "linear")
#
keras_model(inputs = Design, outputs = Response)
}
# we will set output_size = input_size[2]The first modeling step involves feature pre-processing.
We apply the MinMaxScaler to the log-mortality rates \[ \widetilde{m}_{x,t} := \frac{\log m_{x,t} - \min_{x,t} \log m_{x,t}}{\max_{x,t} \log m_{x,t} - \min_{x,t} \log m_{x,t}}.\]
The minimum and maximum values are stored to equally treat the test data and to back-transform the predictions to the original scale.
# MinMaxScaler pre-processing
min_mx <- min(subset_data$logmx)
max_mx <- max(subset_data$logmx)
subset_data[, logmx_scaled := (logmx - min_mx) / (max_mx - min_mx)]In the next step, we build the learning data. For this we construct 2D input tensors \(\boldsymbol{X}_{1:T}\in \mathbb{R}^{T\times q}\), which are used to predict the scaled log-mortality rates \(\boldsymbol{Y} \in \mathbb{R}^{90}\) in the next calendar year.
We use a lookback period of \(T=10\) years, this defines the first dimension of the 2D input tensor \(\boldsymbol{X}_{1:T} \in \mathbb{R}^{10\times 90}\), and the number of channels is set equal to the age range \(q=|{\cal X}|=90\).
This motivates the definition of the response-covariate pairs \[\begin{eqnarray*} \boldsymbol{Y}^{(t)}&=&(\widetilde{m}_{x,t})_{x \in {\cal X}} \in \mathbb{R}^q, \\ \boldsymbol{X}_{1:T}^{(t)}&=&\Big[(\widetilde{m}_{x,t-T})_{x \in {\cal X}}, \ldots, (\widetilde{m}_{x,t-1})_{x \in {\cal X}}\Big]^\top \in \mathbb{R}^{T\times q}, \end{eqnarray*}\]
The following code constructs all pairs \((\boldsymbol{Y}^{(t)}, \boldsymbol{X}^{(t)}_{1:T})\) for the calendar years \(t \in \{1910, \ldots, 1999\}\); this is the considered learning set for training the 1D CNN architecture.
The first available response is \(\boldsymbol{Y}^{(1910)}\) because of the required lookback of \(T=10\) years for the input tensor \(\boldsymbol{X}_{1:T}^{(1910)}\).
lookback <- 10 # lookback T
n.age <- length(unique(subset_data$Age)) # age range q
# number of pairs (Y^{(t)}, X^{(t)}_{1:T})
n.sample <- length((first.year + lookback):obs.year)
# Prepare learning data
x_learn <- array(NA, dim = c(n.sample, lookback, n.age)) # inputs X_{1:T}
y_learn <- array(NA, dim = c(n.sample, n.age)) # responses Y
tt <- 0
for (year in (first.year + lookback):obs.year) {
tt <- tt + 1
x_learn[tt,,] <- dcast(subset_data[Year > (year - lookback - 1) & Year < year], Year ~ Age, value.var = "logmx_scaled")[, -1] %>% as.matrix()
y_learn[tt, ] <- subset_data[Year == year]$logmx_scaled
}- Finally, we randomize the order of the learning sample; this is important for SGD to work properly, i.e., the inputs should not be ordered chronologically; see also FNN notebooks.
# Shuffle order in learning sample for SGD training
set.seed(100)
ll <- sample(c(1:(dim(x_learn)[1])))
x_learn0 <- x_learn[ll,,]
y_learn0 <- y_learn[ll,]- We are now ready to define the 1D CNN architecture.
model <- CNN1D_Model(seed = 100, input_size = c(lookback, n.age), filters = 3, kernel_size = 3, stride = 1, output_size = n.age)
summary(model)- This 1D CNN maps the input \(\boldsymbol{X}_{1:T} \in \mathbb{R}^{10\times 90}\) to a 2D tensor in \(\mathbb{R}^{8\times 3}\) – the kernel of size \(K=3\) and stride \(\delta=1\) reduce the time dimension from 10 to 8, and we select \(q_1=3\) filters – the flattened \(8\cdot 3=24\) dimensional encoding is then decoded by a linear FNN layer.
Model: "model"
________________________________________________________________________________
Layer (type) Output Shape Param #
================================================================================
input_1 (InputLayer) [(None, 10, 90)] 0
conv1d (Conv1D) (None, 8, 3) 813
flatten (Flatten) (None, 24) 0
dense (Dense) (None, 90) 2250
================================================================================
Total params: 3063 (11.96 KB)
Trainable params: 3063 (11.96 KB)
Non-trainable params: 0 (0.00 Byte)
________________________________________________________________________________
- This 1D CNN architecture has 3063 trainable parameters, most of them are used for the decoding back to the age range \(q=|{\cal X}|=90\) (as this predicts all ages \(x\) in one go).
# compile with mean squared error loss
model %>% compile(loss = 'mse', optimizer = optimizer_adam(learning_rate = 0.001))
## define callback for early stopping (if necessary)
if (!dir.exists("./Networks")){dir.create("./Networks")}
path1 <- paste0("./Networks/CNN1.weights.h5")
model_write = callback_model_checkpoint(path1, save_best_only = T, verbose = 1, save_weights_only = T)
learn_rate = callback_reduce_lr_on_plateau(factor = 0.9, patience = 5, cooldown = 0, verbose = 1)We use the mean squared error loss; in analogy to LC fitting.
We install a callback for retrieving the best network weights of SGD fitting (if necessary). There is not much over-fitting potential here because everything is linear in the selected 1D CNN architecture.
fit <- model %>% keras::fit(x = x_learn0,
y = y_learn0,
epochs = 1000,
batch_size = 4,
validation_split = 0.1,
callbacks = list(model_write, learn_rate),
verbose = 2)
load_model_weights_hdf5(model, path1)
plot(fit)We fit this 1D CNN architecture and the SGD performance is shown in the upper plot of the next graph.
The lower plot shows the choice of the learning rate (lr).

- In-sample estimation on \({\cal L}\) is straightforward, see next code.
## in-sample estimation
logmx_hat_learn = (model %>% predict(list(x_learn), batch_size = 10^6))
tt = 0
for (year in (first.year + lookback):obs.year) {
tt = tt + 1
results[Year == year, CNN := (logmx_hat_learn[tt,]*(max_mx-min_mx)+min_mx)]
}- The first years need to be excluded from this estimation because we need an observation history (
lookback) of \(T=10\) years to build the input tensors \(\boldsymbol{X}_{1:T}\).
- The following code performs forecasting on \({\cal T}\) recursively.
input.matrix = array(NA, dim = c(1, lookback, n.age))
input.matrix[1,,] = dcast(subset_data[Year > (obs.year + 1-lookback-1) & Year < obs.year + 1], Age ~ Year, value.var = "logmx_scaled")[,-1] %>% t
# recursive extrapolation
for (year in (obs.year+1):last.year){
y_test = as.numeric(model %>% predict(list(input.matrix))) # predict
results[Year == year, CNN := (y_test*(max_mx-min_mx)+min_mx)]
input.matrix[1,,] = rbind(input.matrix[1,-1,], y_test) # update input tensor
}- This recursive extrapolation is controversial. The inputs in the learning procedure are observations \(m_{x,t}\). In the recursion, they are successively replaced by the new predictions \(\widehat{m}_{x,t}\). Predictions are on the level of expected values, i.e., they do not contain irreducible risk as the observations \(m_{x,t}\) (they have a lower volatility). Thus, there is an inconsistency between training and forecast input features; see also binary example in Section 1 of Richman and Wüthrich (2026).

The means of the first 10 calendar years cannot be estimated because we need a
lookbackperiod of \(T=10\) years.This 1D CNN does not seem to be fully competitive, because the patterns look a bit different.
This 1D CNN does not explicitly encode calendar year \(t\). Therefore, it cannot encode special years, like the LC model through the variables \(k_t\).
- We compute the in-sample and the out-of-sample rooted mean squared errors (RMSEs).
# in-sample RMSE results
results[(Year >= (first.year + lookback)) & (Year <= obs.year), .(
LC = round(sqrt(mean((LC - logmx)^2)), 4),
CNN = round(sqrt(mean((CNN - logmx)^2)), 4))] LC CNN
1: 0.1555 0.2108
# out-of-sample RMSE results
results[Year > obs.year, .(
LC = round(sqrt(mean((LC - logmx)^2)), 4),
CNN = round(sqrt(mean((CNN - logmx)^2)), 4))] LC CNN
1: 0.3416 0.3531
The above results show that the simple CNN approach is not fully competitive, and it is outperformed by the LC forecast (out-of-sample).
Clearly, we should use non-linear CNNs to improve the forecast.
We can equip the input with additional information, e.g., the calendar year variable – this will mainly improve the in-sample description.
The LC model has a categorical calendar year variable \(k_t\), which makes in-sample estimation very accurate, but which poses some challenges in extrapolation (we used a random walk with drift). The simple 1D CNN does not have any similar mechanism, but we could change the architecture correspondingly.
CNNs can handle multiple populations, and cross-population learning significantly improves the predictive models.
In Perla et al. (2021) a non-linear 1D CNN architecture is the most powerful method of the network predictors (all networks being based on cross-population learning).
5 Recurrent neural network – LSTM
Next, we present a RNN solution using a long short-term memory (LSTM) network architecture.
We only explicitly present the code of the parts that differ from the CNN one. The remaining code is executed hidden in the background.
The data preparation to build the response-covariate pairs \((\boldsymbol{Y}^{(t)}, \boldsymbol{X}^{(t)}_{1:T})\), \(t \in \{1910, \ldots, 1999\}\), is identical to above. That is, we consider a lookback period of \(T=10\) years to build the 2D input tensors \(\boldsymbol{X}^{(t)}_{1:T} \in \mathbb{R}^{T \times q}\) over all ages, \(q=|{\cal X}|\).
The next code shows the definition of the LSTM architecture. We first compress the age range by a time-distributed FNN layer to dimension \(q_1\).
LSTM_Model <- function(seed, input_size, q1, units, return_sequences = FALSE, output_size) {
k_clear_session(); set.seed(seed); set_random_seed(seed)
#
Design <- layer_input(shape = input_size, dtype = 'float32')
#
TS.Represent <- Design %>% time_distributed(layer_dense(units = q1))
#
Features <- TS.Represent %>% layer_lstm(units = units, activation = 'linear', return_sequences = return_sequences)
#
if (return_sequences) {Features <- Features %>% layer_flatten()}
#
Response <- Features %>% layer_dense(output_size, activation = "linear")
#
keras_model(inputs = Design, outputs = Response)
}model <- LSTM_Model(seed = 100,
input_size = c(lookback, n.age),
q1 = 20,
units = 20,
return_sequences = FALSE,
output_size = n.age)
summary(model)The first time-distributed FNN layer maps the input \(\boldsymbol{X}_{1:T} \in \mathbb{R}^{10\times 90}\) to a 2D tensor in \(\mathbb{R}^{10\times 20}\). This tensor is run through a LSTM layer with 20 linear units, and we only output the last cell of the sequence (
return_sequence=FALSE). Finally, a FNN layer maps these 20 units back to the entire age range \(q=|{\cal X}|=90\).This LSTM architecture has 6990 trainable parameters; the CNN had 3063.
Model: "model"
________________________________________________________________________________
Layer (type) Output Shape Param #
================================================================================
input_1 (InputLayer) [(None, 10, 90)] 0
time_distributed (TimeDistributed (None, 10, 20) 1820
)
lstm (LSTM) (None, 20) 3280
dense_1 (Dense) (None, 90) 1890
================================================================================
Total params: 6990 (27.30 KB)
Trainable params: 6990 (27.30 KB)
Non-trainable params: 0 (0.00 Byte)
________________________________________________________________________________
# compile with mean squared error loss
model %>% compile(loss = 'mse', optimizer = optimizer_adam(learning_rate = 0.001))
## define callback for early stopping (if necessary)
path1 <- paste0("./Networks/LSTM1.weights.h5")
model_write = callback_model_checkpoint(path1, save_best_only = T, verbose = 1, save_weights_only = T)
learn_rate = callback_reduce_lr_on_plateau(factor = 0.9, patience = 5, cooldown = 0, verbose = 1)We use the mean squared error loss; in analogy to LC fitting.
We install a callback for retrieving the best network weights of SGD fitting (if necessary).
fit <- model %>% keras::fit(x = x_learn0,
y = y_learn0,
epochs = 1000,
batch_size = 4,
validation_split = 0.1,
callbacks = list(model_write, learn_rate),
verbose = 2)
load_model_weights_hdf5(model, path1)
plot(fit)- We fit this LSTM architecture and the fitting performance is shown in the next plot. Again there is no sign of over-fitting.

- In-sample estimation and out-sample prediction are computed analogously to the CNN case.

Also this LSTM does not seem to be fully competitive.
We also clearly see the difficulty in recursive extrapolation; there is some horizontal structure in the extrapolation.
# in-sample RMSE results
results[(Year >= (first.year + lookback)) & (Year <= obs.year), .(
LC = round(sqrt(mean((LC - logmx)^2)), 4),
CNN = round(sqrt(mean((CNN - logmx)^2)), 4),
LSTM = round(sqrt(mean((LSTM - logmx)^2)), 4))] LC CNN LSTM
1: 0.1555 0.2108 0.1746
# out-of-sample RSME results
results[Year > obs.year, .(
LC = round(sqrt(mean((LC - logmx)^2)), 4),
CNN = round(sqrt(mean((CNN - logmx)^2)), 4),
LSTM = round(sqrt(mean((LSTM - logmx)^2)), 4))] LC CNN LSTM
1: 0.3416 0.3531 0.3559
- The conclusions are essentially the same as for the CNN.
6 Transformer
This section presents a simple Transformer architecture.
We only explicitly present the code of the parts that differ from the CNN one. The remaining code is executed hidden in the background.
Also for the Transformer architecture we only use linear layers, in analogy to the CNN and the RNN architectures. Consequently, also this Transformer is likely not fully competitive.
TRANSF_Model <- function(seed, input_size, units, output_size) {
k_clear_session(); set.seed(seed); set_random_seed(seed)
Design <- layer_input(shape = input_size, dtype = 'float32')
#
Repr <- Design %>% time_distributed(layer_dense(units = units))
query <- Repr %>% time_distributed(layer_dense(units = units))
key <- Repr %>% time_distributed(layer_dense(units = units))
value <- Repr %>% time_distributed(layer_dense(units = units))
#
attention_output <- list(query, value, key) %>% layer_attention(use_scale = TRUE, trainable = TRUE)
#
skip_1 <- layer_add(list(attention_output, Repr))
skip_2 <- skip_1 %>% layer_layer_normalization() %>% time_distributed(layer_dense(units = units))
#
Transformer <- layer_add(list(skip_2, skip_1)) %>% layer_flatten()
#
Response <- Transformer %>% layer_dense(units = output_size, activation = "linear")
#
keras_model(inputs = Design, outputs = Response)
}model <- TRANSF_Model(seed = 100,
input_size = c(lookback, n.age),
units = 3,
output_size = n.age)
summary(model)This Transformer architecture first maps the input \(\boldsymbol{X}_{1:T} \in \mathbb{R}^{10\times 90}\) to a 2D tensor in \(\mathbb{R}^{10\times 3}\), i.e., with three units. This significantly reduces the size of the 2D tensor, and we only apply the attention mechanism to this smaller 2D tensor.
This architecture has 3118 trainable parameters.
The LSTM architecture had 6990 trainable parameters, and the CNN had 3063.
Model: "model"
________________________________________________________________________________
Layer (type) Output Shape Param Connected to
#
================================================================================
input_1 (InputLayer) [(None, 10, 90)] 0 []
time_distributed (Tim (None, 10, 3) 273 ['input_1[0][0]']
eDistributed)
time_distributed_1 (T (None, 10, 3) 12 ['time_distributed[0][0]
imeDistributed) ']
time_distributed_3 (T (None, 10, 3) 12 ['time_distributed[0][0]
imeDistributed) ']
time_distributed_2 (T (None, 10, 3) 12 ['time_distributed[0][0]
imeDistributed) ']
attention (Attention) (None, 10, 3) 1 ['time_distributed_1[0][
0]',
'time_distributed_3[0][
0]',
'time_distributed_2[0][
0]']
add (Add) (None, 10, 3) 0 ['attention[0][0]',
'time_distributed[0][0]
']
layer_normalization ( (None, 10, 3) 6 ['add[0][0]']
LayerNormalization)
time_distributed_4 (T (None, 10, 3) 12 ['layer_normalization[0]
imeDistributed) [0]']
add_1 (Add) (None, 10, 3) 0 ['time_distributed_4[0][
0]',
'add[0][0]']
flatten (Flatten) (None, 30) 0 ['add_1[0][0]']
dense_5 (Dense) (None, 90) 2790 ['flatten[0][0]']
================================================================================
Total params: 3118 (12.18 KB)
Trainable params: 3118 (12.18 KB)
Non-trainable params: 0 (0.00 Byte)
________________________________________________________________________________
# compile with mean squared error loss
model %>% compile(loss = 'mse', optimizer = optimizer_adam(learning_rate = 0.001))
## define callback for early stopping (if necessary)
path1 <- paste0("./Networks/Transformer1.weights.h5")
model_write = callback_model_checkpoint(path1, save_best_only = T, verbose = 1, save_weights_only = T)
learn_rate = callback_reduce_lr_on_plateau(factor = 0.9, patience = 5, cooldown = 0, verbose = 1)We use the mean squared error loss; in analogy to LC fitting.
We install a callback for retrieving the best network weights of SGD fitting (if necessary).
fit <- model %>% keras::fit(x = x_learn0,
y = y_learn0,
epochs = 1000,
batch_size = 4,
validation_split = 0.1,
callbacks = list(model_write, learn_rate),
verbose = 2)
load_model_weights_hdf5(model, path1)
plot(fit)
- In-sample estimation and out-sample prediction are computed analogously to the CNN case.

The Transformer does not seem to be fully competitive.
However, the color structure looks quite reasonable, except the outlier years in the 1920s cannot be properly estimated without a calendar year factor (however this is ‘only’ in-sample).
# in-sample RMSE results
results[(Year >= (first.year + lookback)) & (Year <= obs.year), .(
LC = round(sqrt(mean((LC - logmx)^2)), 4),
CNN = round(sqrt(mean((CNN - logmx)^2)), 4),
LSTM = round(sqrt(mean((LSTM - logmx)^2)), 4),
TRANS = round(sqrt(mean((TRANS - logmx)^2)), 4))] LC CNN LSTM TRANS
1: 0.1555 0.2108 0.1746 0.2277
# out-of-sample RMSE results
results[Year > obs.year, .(
LC = round(sqrt(mean((LC - logmx)^2)), 4),
CNN = round(sqrt(mean((CNN - logmx)^2)), 4),
LSTM = round(sqrt(mean((LSTM - logmx)^2)), 4),
TRANS = round(sqrt(mean((TRANS - logmx)^2)), 4))] LC CNN LSTM TRANS
1: 0.3416 0.3531 0.3559 0.3624
The above 3 examples have shown how to implement a 1D CNN, a LSTM and a Transformer architecture on 2D input tensors.
These 3 architectures only use linear layers and, not surprisingly, their predictive performance is limited.
These 3 architectures should now be lifted to non-linear layers, they could be more deep, and certainly they should consider multi-population learning to receive good predictive accuracy.
A crucial and critical issue is extrapolation – this needs to be carefully checked. Network extrapolation may not be very robust across different SGD runs, because the learned regression function may be rather different at the observation boundary (for the different SGD runs).
Having a human in the loop, it might be advisable to integrate a calendar year factor into the network architecture which can be controlled manually during the extrapolation, this is similar to the LC model using a manual random walk with drift extrapolation.
7 Non-linear CNN architectures
Just for fun, we present a non-linear network architecture to demonstrated that predictive performance can be improved by using suitable deep networks.
We still only consider one population, but this should be extended to multi-population learning; see, e.g., Perla et al. (2021).
The above reference also verifies that transfer learning can be beneficial. It shows an example of network architectures being trained on many different countries of the world, and they perform rather well (with the same parametrization/without further training) on the US states.
CNN1D_Model <- function(seed, input_size, qq, filters, kernel_size, stride, output_size) {
k_clear_session(); set.seed(seed); set_random_seed(seed)
#
Design <- layer_input(shape = input_size, dtype = 'float32')
#
TS.Represent <- Design %>% time_distributed(layer_dense(units = qq[1], activation = "tanh"))
Response <- TS.Represent %>%
layer_conv_1d(filters = filters, stride = stride, kernel_size = kernel_size, activation = 'linear') %>%
layer_flatten() %>%
layer_dense(units = qq[2], activation = "tanh") %>%
layer_dense(output_size, activation = "linear")
#
keras_model(inputs = Design, outputs = Response)
}model <- CNN1D_Model(seed = 100,
input_size = c(lookback, n.age),
qq = c(20, 50),
filters = 3,
kernel_size = 3,
stride = 1,
output_size = n.age)
summary(model)- This architecture has 7843 trainable parameters, the previous CNN had 3063.
Model: "model"
________________________________________________________________________________
Layer (type) Output Shape Param #
================================================================================
input_1 (InputLayer) [(None, 10, 90)] 0
time_distributed (TimeDistributed (None, 10, 20) 1820
)
conv1d (Conv1D) (None, 8, 3) 183
flatten (Flatten) (None, 24) 0
dense_2 (Dense) (None, 50) 1250
dense_1 (Dense) (None, 90) 4590
================================================================================
Total params: 7843 (30.64 KB)
Trainable params: 7843 (30.64 KB)
Non-trainable params: 0 (0.00 Byte)
________________________________________________________________________________

The following RMSEs show that this CNN outperforms the LC forecast on the future calendar years \(t \in {\cal T}\).
In-sample, the LC estimates are closer to the observations, this is again due to the fact that the LC model has categorical calendar year variables \(k_t\) that allow for precise calendar year mortality rate estimations (in-sample).
# in-sample RMSE results
results[(Year >= (first.year + lookback)) & (Year <= obs.year), .(
LC = round(sqrt(mean((LC - logmx)^2)), 4),
CNN = round(sqrt(mean((CNN - logmx)^2)), 4))] LC CNN
1: 0.1555 0.1866
# out-of-sample RMSE results
results[Year > obs.year, .(
LC = round(sqrt(mean((LC - logmx)^2)), 4),
CNN = round(sqrt(mean((CNN - logmx)^2)), 4))] LC CNN
1: 0.3416 0.3189
We should remark that these results are not very robust, unfortunately.
This closes this notebook on time-series mortality modeling.
Copyright
© The Authors
This notebook and these slides are part of the project “AI Tools for Actuaries”. The lecture notes can be downloaded from:
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5162304
\(\,\)
- This material is provided to reusers to distribute, remix, adapt, and build upon the material in any medium or format for noncommercial purposes only, and only so long as attribution and credit is given to the original authors and source, and if you indicate if changes were made. This aligns with the Creative Commons Attribution 4.0 International License CC BY-NC.