
AI TOOLS FOR ACTUARIES
Chapter 8: Convolutional Neural Network - Part B
1 Convolutional neural networks
- CNNs were introduced by LeCun and Bengio (1998).
1.1 Recap: tensor input data
Input data to networks is usually in tensor form.
For single instances, the input information is either:
- a vector \(\boldsymbol{X} \in \mathbb{R}^q\) (1D tensor) for tabular input data,
- a matrix \(\boldsymbol{X}_{1:t} \in \mathbb{R}^{t \times q}\) (2D tensor) for time-series and text data,
- a 3D tensor \(\boldsymbol{X}_{1:t, 1:s} \in \mathbb{R}^{t \times s \times q}\) for images and spatial data.
These input formats have been described in the previous notebook.
Typically, the last dimension denotes the different channels having \(q\) components. The previous dimensions have a time-series (2D) or a spatial (3D) structure that should be preserved by the network encoder.
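The three input formats can be illustrated with a minimal Python/NumPy sketch (the notebook's own code is in R; this fragment only shows the tensor shapes, with illustrative dimensions):

```python
import numpy as np

# Illustrative tensor shapes for a single instance with q = 5 channels
q, t, s = 5, 10, 10

x_tab = np.zeros(q)           # 1D tensor: tabular input, vector in R^q
x_ts = np.zeros((t, q))       # 2D tensor: time series with t time steps
x_img = np.zeros((t, s, q))   # 3D tensor: spatial grid of size t x s with q channels

# The channel dimension is the last one in all three cases
print(x_tab.shape, x_ts.shape, x_img.shape)  # (5,) (10, 5) (10, 10, 5)
```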
1.2 Types of network layers
When data is represented as a 2D or 3D tensor of large size, applying traditional FNN layers can be inappropriate:
- FNNs ignore time-series and/or spatial structure in the data,
- FNNs require a large number of parameters,
- FNNs cannot deal with time series whose length grows over time.
To solve these issues, specialized network architectures have been designed. This notebook introduces
- convolutional neural networks (CNNs), and
- locally-connected networks (LCNs).
In subsequent notebooks we discuss recurrent neural networks (RNNs) and Transformers.
1.3 Dimension of CNN layers
Different CNN layers are available, examples are 1D and 2D CNN layers.
These CNN layers differ in the number of dimensions over which the convolution operation (local filter) is applied:
- for 2D tensors (time-series, text data) one uses a 1D CNN layer;
- for 3D tensors (images, spatial data) one uses a 2D CNN layer.
Thus, the choice of the CNN dimension depends on the characteristics of the input data, e.g., a RGB image has a 2D spatial structure and 3 color channels resulting in a 3D tensor \[ \boldsymbol{X}_{1:t, 1:s}= (X_{u,v,j})_{1\le u \le t,\,1\le v \le s,\, 1\le j \le 3}~\in ~{\mathbb R}^{t \times s \times 3}.\] For this 3D tensor we typically use a 2D CNN to preserve the 2D spatial structure.
Similarly a 1D CNN preserves the time-series structure in a 2D tensor.
2 1D convolutional neural networks
2.1 Hyper-parameters of a 1D CNN layer
- For a 1D CNN layer the modeler needs to specify the hyper-parameters:
- Kernel size \(K \in \mathbb{N}\): defines the length (window size) of the filter used in the convolution operation.
- Stride \(\delta \in \mathbb{N}\): specifies the step size of the filter movement (translation).
- \(K\) and \(\delta\) need to balance computational cost and feature resolution:

- In the above pictures: The filter of kernel size \(K=3\) moves along the time axis \(t\ge 1\) with stride \(\delta=1\) (lhs) and stride \(\delta=3\) (rhs).
2.2 1D CNNs: single filter
Let \(\boldsymbol{X}_{1:t} \in \mathbb{R}^{t\times q}\) be the input matrix (tensor of order 2).
A 1D CNN layer with a single filter of size \(K\) and stride \(\delta\) is a mapping \[\begin{eqnarray*} \boldsymbol{z}_1^{(1)} &:& \mathbb{R}^{t\times q} \to \mathbb{R}^{t^\prime}, \\&& {\boldsymbol{X}_{1:t}} \mapsto \boldsymbol{z}_1^{(1)} ({\boldsymbol{X}_{1:t}}) = \left(z^{(1)}_{u,1}({\boldsymbol{X}_{1:t}}) \right)_{1 \leq u \leq t^\prime}, \end{eqnarray*}\] with \(t^\prime= \lfloor \frac{t - K}{\delta} + 1 \rfloor \in \mathbb{N}\) representing the number of receptive fields.
Unit \(z_{u,1}^{(1)} (\boldsymbol{X}_{1:t})\) convolves the filter with the \(u\)-th receptive field \[ z^{(1)}_{u,1}(\boldsymbol{X}_{1:t}) = \phi \left (w^{(1)}_{0,1} + \sum_{k = 1}^{K} \left\langle \boldsymbol{w}^{(1)}_{k,1}, \boldsymbol{X}_{(u-1)\delta+k} \right \rangle \right), \] with bias \(w^{(1)}_{0,1} \in \mathbb{R}\), filter weights \(\boldsymbol{w}^{(1)}_{k,1} \in \mathbb{R}^q\) and activation \(\phi: \mathbb{R} \to \mathbb{R}\).
Rewriting this in vector form for the indices \(1\le u \le t'\) gives \[\begin{eqnarray*} \boldsymbol{z}^{(1)}_{1}(\boldsymbol{X}_{1:t}) &=& \begin{pmatrix} \phi \left (w^{(1)}_{0,1} + \sum_{k = 1}^{K} \left\langle \boldsymbol{w}^{(1)}_{k,1}, \boldsymbol{X}_{k} \right \rangle \right)\\ \vdots \\ \phi \left (w^{(1)}_{0,1} + \sum_{k = 1}^{K} \left\langle \boldsymbol{w}^{(1)}_{k,1}, \boldsymbol{X}_{(t'-1)\delta+k} \right \rangle \right) \end{pmatrix}\\&~& \\&=:& \phi \left (w^{(1)}_{0,1} + W_1^{(1)} \ast \boldsymbol{X}_{1:t} \right) ~\in~\mathbb{R}^{t^\prime}, \end{eqnarray*}\] with filter weights \(W_1^{(1)}=[\boldsymbol{w}^{(1)}_{1,1},\ldots, \boldsymbol{w}^{(1)}_{K,1}]^\top \in {\mathbb R}^{K\times q}\), and where the bias \(w^{(1)}_{0,1}\) is added to every component \(1\le u \le t'\).
Thus, the filter \(W_1^{(1)} \in \mathbb{R}^{K \times q}\) moves like a rolling window along the time axis of \(\boldsymbol{X}_{1:t}\) (with stride/step size \(\delta\)), and we abbreviate this by the convolution operator \(\ast\) (having stride \(\delta\)).
This results in a new time-series \(\boldsymbol{z}^{(1)}_{1}(\boldsymbol{X}_{1:t}) \in \mathbb{R}^{t'}\) of length \(t'\), where every component \(u\) summarizes the \(u\)-th receptive field of the input time-series \(\boldsymbol{X}_{1:t}\).
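The mapping above can be sketched in plain Python/NumPy (the notebook's code is in R/Keras; the function name and the random test values here are illustrative only):

```python
import numpy as np

def conv1d_single_filter(X, W, b, delta=1, phi=np.tanh):
    """1D CNN layer with a single filter.

    X: input of shape (t, q); W: filter weights of shape (K, q); b: bias.
    Returns the feature vector of length t' = floor((t - K)/delta) + 1.
    """
    t, q = X.shape
    K = W.shape[0]
    t_prime = (t - K) // delta + 1
    z = np.empty(t_prime)
    for u in range(t_prime):
        # convolve the filter with the u-th receptive field
        receptive_field = X[u * delta : u * delta + K, :]   # shape (K, q)
        z[u] = phi(b + np.sum(W * receptive_field))
    return z

# Example: t = 10 time steps, q = 5 channels, kernel size K = 3, stride 1
rng = np.random.default_rng(100)
X = rng.normal(size=(10, 5))
W = rng.normal(size=(3, 5))
z = conv1d_single_filter(X, W, b=0.1, delta=1)
print(z.shape)  # (8,)
```

The output length \(t'=\lfloor (10-3)/1 \rfloor + 1 = 8\) matches the formula above.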
2.3 Illustration of 1D CNN filter

The filter with \(K=3\) computes the scalar products across all \(q\) channels.
The filter moves down the time axis in steps of \(\delta=1\).
2.4 1D CNNs: multiple filters
In practice, a single filter may not be sufficient to capture the complexity of the features in the input data; the number of filters plays a role analogous to the number of neurons in a FNN layer.
A 1D CNN layer with \(q_1 \in \mathbb{N}\) filters is a mapping \[\begin{eqnarray*} \boldsymbol{z}^{(1)} &:& \mathbb{R}^{t\times q} \to \mathbb{R}^{t^\prime \times q_1} \\&& {\boldsymbol{X}_{1:t}} \mapsto \boldsymbol{z}^{(1)} ({\boldsymbol{X}_{1:t}}) = \left(\boldsymbol{z}^{(1)}_{1}({\boldsymbol{X}_{1:t}}),\ldots, \boldsymbol{z}^{(1)}_{q_1}({\boldsymbol{X}_{1:t}})\right), \end{eqnarray*}\] with for \(1 \le j \le q_1\) \[ \boldsymbol{z}^{(1)}_{j}(\boldsymbol{X}_{1:t}) = \phi \left (w^{(1)}_{0,j} + W_j^{(1)} \ast \boldsymbol{X}_{1:t} \right)~\in ~ \mathbb{R}^{t'},\] for biases \(w^{(1)}_{0,j} \in \mathbb{R}\), filter weights \(W_j^{(1)} \in {\mathbb R}^{K\times q}\), and where the convolution operator \(\ast\) has stride \(\delta\).
2.5 Illustration of 1D CNN layer with multiple filters

- The output \(\boldsymbol{z}^{(1)} ({\boldsymbol{X}_{1:t}}) \in \mathbb{R}^{t'\times q_1}\) is again a 2D tensor and time-causality along the vertical axis is preserved by the rolling window mechanism.
The \(j\)-th column of \(\boldsymbol{z}^{(1)} \in {\mathbb R}^{t'\times q_1}\), containing the elements \[\left(z_{1,j}^{(1)}, z_{2,j}^{(1)}, \dots, z_{t^{\prime},j}^{(1)}\right)^\top ~\in ~\mathbb{R}^{t'},\] represents a set of features extracted by applying the same filter to the different receptive fields - this is the time-causal part.
The \(u\)-th row of \(\boldsymbol{z}^{(1)}\in {\mathbb R}^{t'\times q_1}\), containing the elements \[z_{u,1}^{(1)}, z_{u,2}^{(1)}, \dots, z_{u,q_1}^{(1)},\] represents a set of features obtained by applying \(q_1\) different filters to the same \(u\)-th receptive field - these are the different data compressions (representations) of the \(u\)-th receptive field.
The total number of parameters to be learned in a 1D CNN layer with \(q_1\) filters of size \(K\) is \((1+Kq)q_1\).
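This parameter count can be verified with a one-line Python helper (the helper name is ours; the values 16 and 64 reappear in the Keras summaries below):

```python
# Number of trainable weights in a 1D CNN layer: (1 + K*q) * q_1
# (one bias plus K filter-weight vectors of length q, per filter)
def cnn1d_params(K, q, q1):
    return (1 + K * q) * q1

print(cnn1d_params(K=3, q=5, q1=1))   # 16
print(cnn1d_params(K=3, q=5, q1=4))   # 64
```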
2.6 Flatten layer
A CNN typically outputs a tensor of the same order as the input. For time-series input data, this is a 2D tensor \(\boldsymbol{z}^{(1)}(\boldsymbol{X}_{1:t}) \in \mathbb{R}^{t' \times q_1}\).
To use \(\boldsymbol{z}^{(1)}(\boldsymbol{X}_{1:t})\) for prediction (e.g., forecasting an insurance claim), further processing is needed by decoding it, e.g., with a FNN layer.
This requires flattening \(\boldsymbol{z}^{(1)}(\boldsymbol{X}_{1:t})\) into a vector \(\boldsymbol{z}_{\rm vec}^{(1)}(\boldsymbol{X}_{1:t})\) of length \(t' q_1\) using a flatten layer.
2.7 Example of a 1D CNN
The following code illustrates a network architecture containing a single 1D CNN layer.
The CNN1D_Model function takes the following arguments:
- seed: to ensure reproducibility of results;
- input_size: input dimensions (sequence length \(t\) and number of channels \(q\));
- filters: number of CNN filters \(q_1\);
- kernel_size: size of the CNN kernel \(K\);
- stride: stride for the resolution \(\delta\);
- output_size: output dimension.
The model below includes a 1D CNN layer, followed by a flatten layer and a decoding FNN layer to compute the output.
# CNN_1D model function
library(keras)
library(tensorflow)
CNN1D_Model <- function(seed, input_size, filters, kernel_size, stride, output_size) {
  k_clear_session(); set.seed(seed); set_random_seed(seed)
  Design <- layer_input(shape = c(input_size[1], input_size[2]), dtype = 'float32')
  # 1D CNN encoder and a FNN decoder
  Response <- Design %>%
    layer_conv_1d(filters = filters, kernel_size = kernel_size, strides = stride, activation = 'linear') %>%
    layer_flatten() %>%
    layer_dense(units = output_size, activation = 'linear')
  # model definition
  keras_model(inputs = Design, outputs = Response)
}
- This example shows how to create a 1D CNN model applied to a 2D input tensor of size \((t,q)=(10, 5)\). The model uses \(q_1 = 1\) filter, a kernel size of \(K = 3\), a stride \(\delta = 1\), and 1 output unit.
# Example usage
model <- CNN1D_Model(
  seed = 100,
  input_size = c(10, 5),
  filters = 1,
  kernel_size = 3,
  stride = 1,
  output_size = 1)
- This architecture has 25 weights to be fitted, see next table below, and it uses linear activation functions (which could be replaced in the above code).
summary(model)
Model: "model"
________________________________________________________________________________
Layer (type) Output Shape Param #
================================================================================
input_1 (InputLayer) [(None, 10, 5)] 0
conv1d (Conv1D) (None, 8, 1) 16
flatten (Flatten) (None, 8) 0
dense (Dense) (None, 1) 9
================================================================================
Total params: 25 (100.00 Byte)
Trainable params: 25 (100.00 Byte)
Non-trainable params: 0 (0.00 Byte)
________________________________________________________________________________
The number of CNN weights is \((1+Kq)=16\).
A stride of \(\delta=1\) and a kernel size of \(K=3\) reduces the length \(t=10\) of the original time-series to \(t'=8\).
- This example shows how to create a 1D CNN model with multiple filters \(q_1 = 4\), a kernel size of \(K = 3\), a stride \(\delta = 1\), and 1 output unit.
# Example usage
model <- CNN1D_Model(
  seed = 100,
  input_size = c(10, 5),
  filters = 4,
  kernel_size = 3,
  stride = 1,
  output_size = 1)
This architecture has 97 weights to be fitted, see next table below, and it uses linear activation functions.
We notice that the use of weights is rather economical because the number of weights does not depend on the length \(t\) of the input sequence.
summary(model)
Model: "model"
________________________________________________________________________________
Layer (type) Output Shape Param #
================================================================================
input_1 (InputLayer) [(None, 10, 5)] 0
conv1d (Conv1D) (None, 8, 4) 64
flatten (Flatten) (None, 32) 0
dense (Dense) (None, 1) 33
================================================================================
Total params: 97 (388.00 Byte)
Trainable params: 97 (388.00 Byte)
Non-trainable params: 0 (0.00 Byte)
________________________________________________________________________________
The number of CNN weights is \((1+Kq)q_1=16\cdot 4=64\).
Again, a stride of \(\delta=1\) and a kernel size of \(K=3\) reduces the length \(t=10\) of the original time-series to \(t'=8\).
3 2D convolutional neural networks
3.1 Hyper-parameters of a 2D CNN layer
For a 2D CNN layer, the modeler needs to specify hyper-parameters along both spatial directions.
In a 2D CNN, both the kernel size and the stride are defined as pairs:
- Kernel size \(K = (K_t, K_s) \in \mathbb{N}^2\): \(K_t\) represents the height of the kernel (number of rows), and \(K_s\) the width of the kernel (number of columns).
- Stride \(\delta = (\delta_t, \delta_s) \in \mathbb{N}^2\): \(\delta_t\) denotes the vertical stride (the number of rows the filter moves), and \(\delta_s\) the horizontal stride (the number of columns the filter moves).
Based on these choices 1D CNN layers and 2D CNN layers work rather similarly.
3.2 2D CNNs: single filter
Let \(\boldsymbol{X}_{1:t,1:s} \in \mathbb{R}^{t \times s \times q}\) be the 3D input tensor.
A 2D CNN layer with a single filter of size \((K_t, K_s)\) and stride \((\delta_t, \delta_s)\) is a mapping \[\begin{eqnarray*} \boldsymbol{z}_1^{(1)} &:& \mathbb{R}^{t \times s \times q} \to \mathbb{R}^{t^\prime \times s^\prime}\\ &&\boldsymbol{X}_{1:t,1:s} \mapsto \boldsymbol{z}_1^{(1)} (\boldsymbol{X}_{1:t,1:s}) = \left( z^{(1)}_{u,v,1} (\boldsymbol{X}_{1:t,1:s}) \right)_{1 \leq u \leq t^\prime, 1 \leq v \leq s^\prime}.\end{eqnarray*}\]
The output of a 2D CNN layer with a single filter is a matrix with \(t^{\prime} = \left\lfloor \frac{t - K_t}{\delta_t} + 1 \right\rfloor\) rows and \(s^{\prime}= \left\lfloor \frac{s - K_s}{\delta_s} + 1 \right\rfloor\) columns.
Unit \(z^{(1)}_{u,v,1}(\boldsymbol{X}_{1:t,1:s})\) is extracted by convolving the filter with the receptive field \((u, v)\) \[ z^{(1)}_{u,v,1}(\boldsymbol{X}_{1:t,1:s}) = \phi \left( w^{(1)}_{0,1} + \sum_{k_t = 1}^{K_t} \sum_{k_s = 1}^{K_s} \left\langle \boldsymbol{w}^{(1)}_{k_t, k_s,1}, \boldsymbol{X}_{ (u-1)\delta_t+k_t, (v-1)\delta_s+k_s} \right\rangle \right),\] with bias \(w^{(1)}_{0,1} \in \mathbb{R}\) and filter weights \(\boldsymbol{w}^{(1)}_{k_t, k_s,1} \in \mathbb{R}^q\), for \(1 \leq k_t \leq K_t\) and \(1\leq k_s \leq K_s\).
We abbreviate this by the convolution form (similarly to above) \[ \boldsymbol{z}^{(1)}_{1}(\boldsymbol{X}_{1:t,1:s}) = \phi \left( w^{(1)}_{0,1} + W_1^{(1)} \ast \boldsymbol{X}_{1:t,1:s} \right) ~\in~{\mathbb R}^{t'\times s'},\] with filter weights \[ W_1^{(1)}=\left(\boldsymbol{w}^{(1)}_{k_t,k_s,1}\right)_{1\le k_t \le K_t,1\le k_s \le K_s} \in {\mathbb R}^{K_t \times K_s\times q},\] where the convolution operation \(\ast\), properly interpreted, has a stride \((\delta_t,\delta_s)\), and where the bias is added to every component.
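Analogously to the 1D case, the 2D convolution can be sketched in Python/NumPy (the function name and random test values are illustrative, not part of the notebook's R code):

```python
import numpy as np

def conv2d_single_filter(X, W, b, delta=(1, 1), phi=np.tanh):
    """2D CNN layer with a single filter.

    X: input of shape (t, s, q); W: filter of shape (K_t, K_s, q); b: bias.
    Returns a feature map of shape (t', s').
    """
    t, s, q = X.shape
    Kt, Ks, _ = W.shape
    dt, ds = delta
    t_prime = (t - Kt) // dt + 1
    s_prime = (s - Ks) // ds + 1
    z = np.empty((t_prime, s_prime))
    for u in range(t_prime):
        for v in range(s_prime):
            # convolve the filter with the (u, v)-th receptive field
            field = X[u * dt : u * dt + Kt, v * ds : v * ds + Ks, :]
            z[u, v] = phi(b + np.sum(W * field))
    return z

rng = np.random.default_rng(100)
X = rng.normal(size=(10, 10, 5))     # (t, s, q) = (10, 10, 5)
W = rng.normal(size=(3, 3, 5))       # kernel (K_t, K_s) = (3, 3)
print(conv2d_single_filter(X, W, b=0.0).shape)  # (8, 8)
```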
3.3 Illustration of 2D CNN filter

The rolling window (filter) now moves across both spatial axes.
This preserves the spatial topology.
3.4 Example of a 2D CNN
The following CNN2D_Model function takes the arguments:
- seed: to ensure reproducibility of results;
- input_size: input dimensions (height \(t\), width \(s\), number of channels \(q\));
- filters: number of convolutional filters \(q_1\);
- kernel_size: size of the convolutional kernel, specified as a bivariate vector \((K_t,K_s)=\text{(height, width)}\);
- stride: stride of the convolution, specified as a bivariate vector \((\delta_t,\delta_s)=\text{(height, width)}\);
- output_size: number of neurons in the output layer.
The following model includes a 2D CNN layer, followed by a flatten layer and a decoding FNN layer to compute the output.
# CNN_2D model function
CNN2D_Model <- function(seed, input_size, filters, kernel_size, stride, output_size) {
  k_clear_session(); set.seed(seed); set_random_seed(seed)
  # inputs: input_size should be c(height, width, channels)
  Design <- layer_input(shape = input_size, dtype = 'float32')
  # CNN 2D network body
  Response <- Design %>%
    layer_conv_2d(filters = filters, kernel_size = kernel_size, strides = stride, activation = 'linear') %>%
    layer_flatten() %>%
    layer_dense(units = output_size, activation = 'linear')
  # model definition
  keras_model(inputs = Design, outputs = Response)
}
- This example shows how to create a 2D CNN model applied to an input tensor of size \((t,s,q)=(10, 10, 5)\). The model uses \(q_1 = 1\) filter, a kernel size of \(K = (3,3)\), a stride \(\delta = (1,1)\), and 1 output unit.
# CNN_2D model definition code
model <- CNN2D_Model(
  seed = 100,
  input_size = c(10, 10, 5),
  filters = 1,
  kernel_size = c(3,3),
  stride = c(1,1),
  output_size = 1)
- This single-filter 2D CNN has 111 weights to be fitted, see next table.
summary(model)
Model: "model"
________________________________________________________________________________
Layer (type) Output Shape Param #
================================================================================
input_1 (InputLayer) [(None, 10, 10, 5)] 0
conv2d (Conv2D) (None, 8, 8, 1) 46
flatten (Flatten) (None, 64) 0
dense (Dense) (None, 1) 65
================================================================================
Total params: 111 (444.00 Byte)
Trainable params: 111 (444.00 Byte)
Non-trainable params: 0 (0.00 Byte)
________________________________________________________________________________
The number of CNN weights is \((1+K_t K_s q)=46\).
A stride of \(\delta=(1,1)\) and a kernel size of \(K=(3,3)\) reduce the input dimensions \((t,s)=(10,10)\) of the original spatial grid to \((t',s')=(8,8)\).
3.5 2D CNN layer: multiple filters
A 2D CNN layer with multiple filters is obtained by the intuitive extension \[\begin{eqnarray*} \boldsymbol{z}^{(1)} &:& \mathbb{R}^{t \times s \times q} \to \mathbb{R}^{t^\prime \times s^\prime \times q_1}\\ &&\boldsymbol{X}_{1:t,1:s} \mapsto \boldsymbol{z}^{(1)} (\boldsymbol{X}_{1:t,1:s}) = \left( \boldsymbol{z}^{(1)}_{j} (\boldsymbol{X}_{1:t,1:s}) \right)_{1\le j \le q_1}.\end{eqnarray*}\]
We refrain from giving more technical details.
A 2D CNN model with \(q_1 = 4\) filters can be defined using the code:
# CNN_2D model definition code
model <- CNN2D_Model(
  seed = 100,
  input_size = c(10, 10, 5),
  filters = 4,
  kernel_size = c(3,3),
  stride = c(1,1),
  output_size = 1)
- This 2D CNN has 441 weights to be fitted, see next table.
summary(model)
Model: "model"
________________________________________________________________________________
Layer (type) Output Shape Param #
================================================================================
input_1 (InputLayer) [(None, 10, 10, 5)] 0
conv2d (Conv2D) (None, 8, 8, 4) 184
flatten (Flatten) (None, 256) 0
dense (Dense) (None, 1) 257
================================================================================
Total params: 441 (1.72 KB)
Trainable params: 441 (1.72 KB)
Non-trainable params: 0 (0.00 Byte)
________________________________________________________________________________
4 Deep convolutional neural networks
- W.l.o.g., we present the case of a deep 2D CNN architecture.
4.1 Deep CNN architectures
Select a first 2D CNN layer \[ \boldsymbol{z}^{(1)} : \mathbb{R}^{t \times s \times q} \to \mathbb{R}^{t^{\prime}\times s^{\prime} \times q_1}. \] This is exactly as defined above with kernel size \(K^{(1)}\) and stride \(\delta^{(1)}\).
Select a second 2D CNN layer \[ \boldsymbol{z}^{(2)} : \mathbb{R}^{t^\prime \times s^\prime \times q_1} \to \mathbb{R}^{t^{\prime \prime}\times s^{\prime \prime} \times q_2}, \] with spatial dimensions \[ t^{\prime \prime} = \left\lfloor \frac{t^{\prime} - K_t^{(2)}}{\delta_t^{(2)}} + 1 \right\rfloor \in \mathbb{N} \quad \text{ and } \quad s^{\prime \prime} = \left\lfloor \frac{s^{\prime} - K_s^{(2)}}{\delta_s^{(2)}} + 1 \right\rfloor \in \mathbb{N}, \] for kernel size \(K^{(2)}\) and stride \(\delta^{(2)}\).
- Compose these two 2D CNN layers to a deep 2D CNN architecture \[ \boldsymbol{z}^{(2:1)} : \mathbb{R}^{t \times s \times q} \to \mathbb{R}^{t^{\prime \prime}\times s^{\prime \prime} \times q_2} \qquad \text{ by } \quad \boldsymbol{z}^{(2:1)}=\boldsymbol{z}^{(2)}\circ \boldsymbol{z}^{(1)}.\]

This can be generalized to any depth \(d\ge 2\).
There are various ways to map a 3D tensor to a 2D tensor by shrinking one of the dimensions to 1, i.e., by passing from 2D CNN layers to 1D CNN layers; this reflects a marginal projection.
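The shape propagation through such a composition can be traced with a small Python helper (our own illustrative helper, using the kernel sizes \((3,3)\) and \((4,4)\) from the example in the next subsection):

```python
def conv_out(n, K, delta):
    # output length of one spatial dimension: floor((n - K)/delta) + 1
    return (n - K) // delta + 1

# deep 2D CNN: input (10, 10, 5), layer 1 with K = (3,3), delta = (1,1)
t1, s1 = conv_out(10, 3, 1), conv_out(10, 3, 1)   # -> (8, 8)
# layer 2 with K = (4,4), delta = (1,1)
t2, s2 = conv_out(t1, 4, 1), conv_out(s1, 4, 1)   # -> (5, 5)
print((t1, s1), (t2, s2))  # (8, 8) (5, 5)
```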
4.2 Example: deep CNN architecture
The following code gives a function Deep_CNN2D_Model which builds a 2D CNN architecture of depth 2.
The most important arguments of this function are:
- filters: a list specifying the number of filters for each of the two CNN layers;
- kernel_size: a list specifying the kernel sizes for each CNN layer;
- stride: a list indicating the stride values for each CNN layer.
Deep_CNN2D_Model <- function(seed, input_size, filters, kernel_size, stride, output_size) {
  k_clear_session(); set.seed(seed); set_random_seed(seed)
  # inputs: input_size should be c(height, width, channels)
  Design <- layer_input(shape = input_size, dtype = 'float32')
  # deep 2D CNN encoder
  Response <- Design %>%
    layer_conv_2d(filters = filters[[1]], kernel_size = kernel_size[[1]], strides = stride[[1]], activation = 'relu') %>%
    layer_conv_2d(filters = filters[[2]], kernel_size = kernel_size[[2]], strides = stride[[2]], activation = 'relu') %>%
    # FNN decoder
    layer_flatten() %>%
    layer_dense(units = output_size, activation = 'linear')
  # model definition
  keras_model(inputs = Design, outputs = Response)
}
- The following code creates a deep architecture containing two 2D CNN layers with filters \(q_1 = 4\) and \(q_2 = 2\), kernel sizes \(K^{(1)} = (3,3)\) and \(K^{(2)} = (4,4)\) and stride values \(\delta^{(1)} = \delta^{(2)} = (1,1)\):
# Example usage
model <- Deep_CNN2D_Model(
  seed = 100,
  input_size = c(10, 10, 5),
  filters = list(4, 2),
  kernel_size = list(c(3, 3), c(4, 4)),
  stride = list(c(1, 1), c(1, 1)),
  output_size = 1)
- This deep 2D CNN architecture has 365 weights to be fitted.
summary(model)
Model: "model"
________________________________________________________________________________
Layer (type) Output Shape Param #
================================================================================
input_1 (InputLayer) [(None, 10, 10, 5)] 0
conv2d_1 (Conv2D) (None, 8, 8, 4) 184
conv2d (Conv2D) (None, 5, 5, 2) 130
flatten (Flatten) (None, 50) 0
dense (Dense) (None, 1) 51
================================================================================
Total params: 365 (1.43 KB)
Trainable params: 365 (1.43 KB)
Non-trainable params: 0 (0.00 Byte)
________________________________________________________________________________
- The number of CNN weights in the two CNN layers is \((1+K^{(1)}_t K^{(1)}_s q)q_1+(1+K^{(2)}_t K^{(2)}_s q_1)q_2=46\cdot4+65\cdot2=314\).
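These counts can be checked against the Keras summary with a short Python sketch (our own illustrative helper; the dense decoder adds \((5\cdot 5\cdot 2+1)\cdot 1 = 51\) weights after flattening):

```python
# Weights of one 2D CNN layer: (1 + K_t * K_s * q_in) * q_out
def conv2d_params(Kt, Ks, q_in, q_out):
    return (1 + Kt * Ks * q_in) * q_out

layer1 = conv2d_params(3, 3, 5, 4)   # 46 * 4 = 184
layer2 = conv2d_params(4, 4, 4, 2)   # 65 * 2 = 130
dense = (5 * 5 * 2 + 1) * 1          # flattened (5, 5, 2) output plus bias = 51
print(layer1, layer2, layer1 + layer2 + dense)  # 184 130 365
```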
4.3 Pre-trained 2D CNN architectures
For image classification there are many pre-trained 2D CNN architectures that have been trained on huge image repositories from the internet to learn to extract local spatial structure.
We mention AlexNet of Krizhevsky, Sutskever and Hinton (2017), developed in 2012. AlexNet contains 60 million weights that have been pre-trained to extract local structure from images.
These pre-trained versions can directly be used to extract local structure from images, and a decoder can be fine-tuned on this structure to the specific task at hand; see, e.g., Zhu and Wüthrich (2021).
5 Appendix: General purpose layers
5.1 Padding layer
Padding can be applied to various types of input tensors.
- In 1D CNNs (used for time-series data), padding may be applied at the start and the end of the sequence.
- In 2D CNNs, padding is applied along both dimensions, and so on for higher-order CNNs.

The amount of padding is typically determined by the filter size \(K\) of the CNN layer; we assume stride \(\delta=1\) for the moment.
For instance, in the case of a 2D CNN layer with a filter of size \(3 \times 3\), padding of 1 pixel is commonly added to each side of the input to preserve its original dimension.
We consider the rows only, the columns are analogous. We add one pixel at both ends giving length \(1+t+1\). This gives \[ t^{\prime} = \left\lfloor \frac{1+t+1 - K_t}{\delta_t} + 1 \right\rfloor = \left\lfloor \frac{1+t+1 - 3}{1} + 1 \right\rfloor =t.\] This preserves the spatial shape of the tensor.
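This calculation can be verified numerically in Python/NumPy (an illustrative check, independent of the R/Keras code):

```python
import numpy as np

# Zero-padding one pixel on each side preserves the length for K = 3, delta = 1
t, K, delta = 10, 3, 1
x = np.arange(t)
x_padded = np.pad(x, pad_width=1)            # length 1 + t + 1 = 12
t_prime = (len(x_padded) - K) // delta + 1   # floor((12 - 3)/1) + 1 = 10
print(len(x_padded), t_prime)  # 12 10
```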
5.2 Pooling layer
For illustrative purposes, we consider the 1D pooling case, and the 2D pooling case is completely similar.
For this we select a pooling size \(K \in \mathbb{N}\) and a stride \(\delta \in \mathbb{N}\) giving the mapping \[\begin{eqnarray*} \boldsymbol{z}^{\rm pool} &:& \mathbb{R}^{t \times q} \to \mathbb{R}^{t' \times q}, \\ && \boldsymbol{X}_{1:t} \mapsto \boldsymbol{z}^{\rm pool}(\boldsymbol{X}_{1:t}) = \left( z^{\rm pool}_{u,j}(\boldsymbol{X}_{1:t}) \right)_{1 \leq u \leq t', 1 \leq j \leq q}. \end{eqnarray*}\]
We note that the 1D pooling layer reduces the first dimension of the data from \(t\) to \(t'= \lfloor \frac{t - K}{\delta} + 1 \rfloor\), while preserving the original covariate dimension \(q\).
The standard choice is \(K=\delta\) which gives disjoint sets, resulting in the new tensor length \[t'= \lfloor t/K \rfloor.\]
The elements of the output depend on the type of pooling used. Among the most popular pooling layers are the following two.
MaxPooling selects the maximum value from a set of input values in the selected pooling windows (\(K=\delta\)) for \(1\le j \le q\) \[ z^{\rm pool}_{u,j}(\boldsymbol{X}_{1:t}) = \max_{(u-1)K+1 \leq k \leq u K} X_{k,j}. \]
AveragePooling computes the average of the values within the given pooling windows (\(K=\delta\)) for \(1\le j \le q\) \[ z^{\rm pool}_{u,j}(\boldsymbol{X}_{1:t}) = \frac{1}{K}\sum_{k = (u-1)K+1}^{uK} X_{k,j}. \]
This selects either the maximum or the average in each channel component \(1\le j \le q\) within the disjoint pooling windows labelled by \(1\le u \le t'\).
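Both pooling variants can be sketched in Python/NumPy for the standard case \(K=\delta\) (the function name and test values are illustrative only):

```python
import numpy as np

def pool1d(X, K, kind="max"):
    """1D pooling with pool size K and stride delta = K (disjoint windows).

    X: input of shape (t, q); returns a tensor of shape (floor(t/K), q).
    """
    t, q = X.shape
    t_prime = t // K
    # group the first axis into t' disjoint windows of length K
    windows = X[: t_prime * K].reshape(t_prime, K, q)
    return windows.max(axis=1) if kind == "max" else windows.mean(axis=1)

X = np.arange(8.0).reshape(8, 1)            # t = 8 time steps, q = 1 channel
print(pool1d(X, K=2, kind="max").ravel())   # [1. 3. 5. 7.]
print(pool1d(X, K=2, kind="avg").ravel())   # [0.5 2.5 4.5 6.5]
```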
5.3 Example of MaxPooling
- The following function, CNN1Dmax_Model, defines a 1D CNN architecture that includes a MaxPooling layer. This allows the model to down-sample the feature maps, reducing their size and controlling for over-fitting.
Compared to the simpler CNN architecture definition, CNN1Dmax_Model requires an additional hyper-parameter:
- pooling_size: size of the max-pooling window; the stride is usually set to the same value.
# CNN_1D model with max-pooling layer
CNN1Dmax_Model <- function(seed, input_size, filters, kernel_size, stride, pooling_size, output_size) {
  k_clear_session(); set.seed(seed); set_random_seed(seed)
  Design <- layer_input(shape = c(input_size[1], input_size[2]), dtype = 'float32')
  # 1D CNN encoder
  Response <- Design %>%
    layer_conv_1d(filters = filters, kernel_size = kernel_size, strides = stride, activation = 'tanh') %>%
    # max-pooling layer
    layer_max_pooling_1d(pool_size = pooling_size, padding = "same") %>%
    # FNN decoder
    layer_flatten() %>%
    layer_dense(units = output_size, activation = "sigmoid")
  # model definition
  keras_model(inputs = Design, outputs = Response)
}
- The following code creates an architecture containing a 1D CNN layer with a single filter, followed by a MaxPooling layer with pooling size equal to \(2\):
model <- CNN1Dmax_Model(
  seed = 100,
  input_size = c(10, 5),
  filters = 1,
  kernel_size = 3,
  stride = 1,
  pooling_size = 2,
  output_size = 1)
- This MaxPooling layer shrinks down the size of the decoder FNN layer, and we only have 21 weights instead of 25 weights.
summary(model)
Model: "model"
________________________________________________________________________________
Layer (type) Output Shape Param #
================================================================================
input_1 (InputLayer) [(None, 10, 5)] 0
conv1d (Conv1D) (None, 8, 1) 16
max_pooling1d (MaxPooling1D) (None, 4, 1) 0
flatten (Flatten) (None, 4) 0
dense (Dense) (None, 1) 5
================================================================================
Total params: 21 (84.00 Byte)
Trainable params: 21 (84.00 Byte)
Non-trainable params: 0 (0.00 Byte)
________________________________________________________________________________
5.4 Locally-connected networks
5.5 Definition of a locally-connected network layer
For simplicity, we use the 1D case for explaining an LCN layer; the 2D case is completely analogous.
Formally, a 1D LCN layer with \(q_1 \in \mathbb{N}\) filters is a mapping \[\begin{eqnarray*} \boldsymbol{z}^{(1)} &:& \mathbb{R}^{t\times q} \to \mathbb{R}^{t^\prime \times q_1},\\ && {\boldsymbol{X}_{1:t}} \mapsto \boldsymbol{z}^{(1)}({\boldsymbol{X}_{1:t}})= \big(z^{(1)}_{u,j} ({\boldsymbol{X}_{1:t}})\big)_{1 \leq u \leq t^\prime, 1 \leq j \leq q_1},\end{eqnarray*}\] with units \[ z^{(1)}_{u,j}(\boldsymbol{X}_{1:t}) = \phi \left (w^{(1)}_{0,\textcolor{red}{u},j} + \sum_{k = 1}^{K} \left\langle \boldsymbol{w}^{(1)}_{k,\textcolor{red}{u},j}, \boldsymbol{X}_{(u-1)\delta+k} \right \rangle \right),\] with bias terms \(w^{(1)}_{0,\textcolor{red}{u},j} \in \mathbb{R}\) and filter weights \(\boldsymbol{w}^{(1)}_{k,\textcolor{red}{u},j} \in \mathbb{R}^q\), \(1 \leq k \leq K\), related to the \(u\)-th receptive field.
5.6 Remarks on LCN layers
In contrast to a 1D CNN layer, the bias terms and the filter weights have lower indices \(\textcolor{red}{u}\) corresponding to the \(u\)-th receptive field considered. This increases the size of the weights by a factor \(t^\prime\) compared to the 1D CNN case because these weights are not shared across \(1\le u \le t'\).
A FNN layer corresponds to a filter covering the entire input dimension; up to flattening, this turns an LCN layer into a fully-connected FNN layer.
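The factor \(t'\) in the weight count can be checked with a short Python sketch (illustrative only; the values 64 and 512 reappear in the Keras summaries of this notebook):

```python
# Weight counts: 1D CNN vs 1D LCN with the same hyper-parameters
t, q, K, delta, q1 = 10, 5, 3, 1, 4
t_prime = (t - K) // delta + 1            # 8 receptive fields

cnn_weights = (1 + K * q) * q1            # weights shared across receptive fields
lcn_weights = t_prime * (1 + K * q) * q1  # separate weights per receptive field
print(cnn_weights, lcn_weights)  # 64 512
```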
5.7 Example of a LCN architecture
- The LCN1D_Model function takes the same arguments as the function CNN1D_Model.
LCN1D_Model <- function(seed, input_size, filters, kernel_size, stride, output_size) {
  k_clear_session(); set.seed(seed); set_random_seed(seed)
  Design <- layer_input(shape = c(input_size[1], input_size[2]), dtype = 'float32')
  # 1D LCN
  Response <- Design %>%
    layer_locally_connected_1d(filters = filters, kernel_size = kernel_size, strides = stride, activation = 'tanh') %>%
    layer_flatten() %>%
    layer_dense(units = output_size, activation = "tanh")
  # model definition
  keras_model(inputs = Design, outputs = Response)
}
- The following example shows how to create a 1D LCN model with multiple filters \(q_1 = 4\), a kernel size of \(K = 3\), a stride \(\delta = 1\), and 1 output unit.
# Example usage
model <- LCN1D_Model(
  seed = 100,
  input_size = c(10, 5),
  filters = 4,
  kernel_size = 3,
  stride = 1,
  output_size = 1)
- Compared to the 1D CNN, this LCN has an increased number of weights, from 97 to 545, see next table.
summary(model)
Model: "model"
________________________________________________________________________________
Layer (type) Output Shape Param #
================================================================================
input_1 (InputLayer) [(None, 10, 5)] 0
locally_connected1d (LocallyConnected1D) (None, 8, 4) 512
flatten (Flatten) (None, 32) 0
dense (Dense) (None, 1) 33
================================================================================
Total params: 545 (2.13 KB)
Trainable params: 545 (2.13 KB)
Non-trainable params: 0 (0.00 Byte)
________________________________________________________________________________
Copyright
© The Authors
This notebook and these slides are part of the project “AI Tools for Actuaries”. The lecture notes can be downloaded from:
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5162304
- This material is provided to reusers to distribute, remix, adapt, and build upon the material in any medium or format for noncommercial purposes only, and only so long as attribution and credit is given to the original authors and source, and if you indicate if changes were made. This aligns with the Creative Commons Attribution 4.0 International License CC BY-NC.