
AI TOOLS FOR ACTUARIES
Chapter 10: Unsupervised Learning - Part A
1 Introduction to unsupervised learning
1.1 Unsupervised learning
Unsupervised learning covers the topic of learning the structure in the covariates \(\boldsymbol{X}\) without considering a response variable \(Y\).
This means that unsupervised learning aims at understanding the population distribution of the covariates \(\boldsymbol{X} \sim \mathbb{P}\).
This can be achieved by learning the inherent pattern in \(\boldsymbol{X}\) from an i.i.d. learning sample \({\cal L}=(\boldsymbol{X}_i)_{i=1}^n\).
This learning sample \({\cal L}\) is called unlabeled because it does not include any responses.
This chapter can be seen as a first step towards generative AI, which aims at sampling new data \((Y,\boldsymbol{X})\), including the covariates.
1.2 Main problems solved by unsupervised learning
- Dimension reduction. Starting from \(q\)-dimensional real-valued covariates \(\boldsymbol{X} \in \mathbb{R}^q\), find a lower dimensional representation of \(\boldsymbol{X}\) that avoids a significant loss of information (i.e., has a small reconstruction error). The most commonly used dimension reduction techniques are principal component analysis (PCA), which is based on the singular value decomposition (SVD), and the bottleneck neural network (BNN) auto-encoder.
- Clustering. Clustering techniques classify (group) similar covariates \(\boldsymbol{X}\) into (homogeneous) classes (bins, clusters). This leads to a segmentation of a heterogeneous population into homogeneous classes. Popular methods are hierarchical clustering methods, \(K\)-means clustering, \(K\)-medoids clustering, distribution-based Gaussian mixture model (GMM) clustering, or density-based spatial clustering of applications with noise (DBSCAN).
- Low-dimensional visualization. Low-dimensional visualization of high-dimensional data has some similarity with the dimension reduction problem from the first item, but instead of trying to minimize a loss of information (reconstruction error), these methods rather aim at preserving the local structure (topology), so that a certain adjacency relation in a neighborhood of a given instance is kept. Popular methods are \(t\)-distributed stochastic neighbor embedding (\(t\)-SNE), uniform manifold approximation and projection (UMAP) or self-organizing maps (SOM) also called Kohonen map.
These unsupervised learning methods are mainly used to understand, simplify and illustrate the covariate distribution \(\boldsymbol{X} \sim \mathbb{P}\). In insurance, these methods have applications in covariate selection, in the presentation of results, and in anomaly detection, e.g., indicating potentially fraudulent behavior.
2 Dimension reduction techniques
2.1 A first example
Having a learning sample \({\cal L}=(\boldsymbol{X}_i)_{i=1}^n\) with \(q\)-dimensional real-valued features \(\boldsymbol{X}_i \in \mathbb{R}^q\), one tries to understand whether \({\cal L}\) is contained in a lower dimensional manifold (object, surface).
The figure below gives an example of a learning sample \({\cal L}=(\boldsymbol{X}_i)_{i=1}^n\) in \(\mathbb{R}^2\) that forms a noisy unit circle \[B_1=\{\boldsymbol{x} \in \mathbb{R}^2;\,x^2_1+x^2_2=1\}.\] In this case, it is sufficient to record the angle in the polar coordinate system (to reconstruct the unit circle).
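Such a noisy unit circle is easy to simulate; the following minimal R sketch generates one and recovers the one-dimensional polar angle (the sample size, noise level and variable names are our own illustrative choices).

```r
# simulate a noisy unit circle: uniform angles plus small radial noise
set.seed(100)
n <- 500
phi <- runif(n, min=0, max=2*pi)               # polar angles
radius <- 1 + rnorm(n, mean=0, sd=0.05)        # noisy radius around 1
X <- cbind(radius*cos(phi), radius*sin(phi))   # learning sample in R^2
# the one-dimensional encoding: recover the polar angle from Cartesian coordinates
phi_hat <- atan2(X[,2], X[,1]) %% (2*pi)       # angle in [0, 2*pi)
plot(X, asp=1, xlab="X1", ylab="X2", main="noisy unit circle")
```

Since the radius is (almost surely) positive, `atan2` recovers the original angle exactly, which illustrates that one coordinate suffices for data living on the circle.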
In general, we work in high-dimensional spaces \(\mathbb{R}^q\), \(q\gg 1\), and there is no simple graphical solution:
First, finding such a manifold in high-dimensional spaces is difficult.
Second, though looking simple, already the noisy unit circle may pose some challenges because it is a non-linear object in the Euclidean coordinate system.
2.2 Standardization
Throughout this chapter, we assume to work with standardized data, meaning the following.
Consider the design matrix from a learning sample \({\cal L}=(\boldsymbol{X}_i)_{i=1}^n \subset \mathbb{R}^q\) \[ \mathfrak{X}=\left[\boldsymbol{X}_1,\ldots, \boldsymbol{X}_n\right]^\top =\begin{pmatrix} X_{1,1} & \cdots & X_{1,q} \\ \vdots & \ddots & \vdots \\ X_{n,1} & \cdots & X_{n,q} \\\end{pmatrix} ~\in~ \mathbb{R}^{n\times q}.\]
The columns of this design matrix \({\mathfrak X}\) describe different quantities, e.g., the first column may describe the weight of the car, the second one the age of the policyholder, the third one the price of the car, etc.
Since the following methods should be unit-free, we standardize all columns of \({\mathfrak X}\) to be centered and to have unit variance.
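As a small illustration (with made-up numbers), this column standardization can be done with the base R function `scale()`; note that `scale()` divides by the sample standard deviation (denominator \(n-1\)), a minor variant of dividing by the square root of the empirical second moment.

```r
# standardize all columns of a design matrix to zero mean and unit variance;
# the numbers below are purely illustrative
X <- cbind(weight=c(620, 755, 585, 680), age=c(41, 25, 21, 25))
X_std <- scale(X, center=TRUE, scale=TRUE)   # columnwise (x - mean)/sd
round(colMeans(X_std), 10)                   # zero column means
apply(X_std, 2, sd)                          # unit column standard deviations
```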
2.3 Categorical covariates
A general difficulty in most of the unsupervised learning methods is the treatment of categorical variables.
They can be embedded using one-hot encoding, but if we have many levels, the resulting design matrix \({\mathfrak X}\) will be sparse, i.e., with lots of zero entries (before standardization).
Most unsupervised learning methods struggle with this sparsity, and the design matrix \({\mathfrak X}\) will not be (very) well-conditioned.
Therefore, it is advantageous if there is a low-dimensional entity embedding for categorical variables, though, it is not always clear where to get it from.
3 Auto-encoder
3.1 Dissimilarity function
The general principle of finding a lower dimensional object to encode the learning sample \({\cal L}=(\boldsymbol{X}_i)_{i=1}^n\) can be described by an auto-encoder.
An auto-encoder maps a high-dimensional object \(\boldsymbol{X} \in \mathbb{R}^q\) to a low-dimensional representation in \(\mathbb{R}^p\), \(p<q\), so that one can still (almost perfectly) reconstruct the original data \(\boldsymbol{X}\) from the smaller object.
The loss of information is measured by a dissimilarity function \[ L(\cdot,\cdot):\mathbb{R}^q\times \mathbb{R}^q \to \mathbb{R}_+, \qquad (\boldsymbol{X}, \boldsymbol{X}') \mapsto L(\boldsymbol{X}, \boldsymbol{X}')\ge 0,\] with the property that \(L(\boldsymbol{X},\boldsymbol{X}')=0\) if and only if \(\boldsymbol{X}=\boldsymbol{X}'\).
3.2 Auto-encoder
Select dimensions \(p<q\).
An auto-encoder is a pair \((\Phi,\Psi)\) of mappings, called encoder and decoder, respectively,\[ \Phi: \mathbb{R}^q \to \mathbb{R}^p \qquad \text{ and } \qquad \Psi:\mathbb{R}^p \to \mathbb{R}^q,\] such that their composition \(\Psi\circ \Phi\) has a small reconstruction error w.r.t. the chosen dissimilarity function \(L(\cdot,\cdot)\), that is,\[ \boldsymbol{X} ~ \mapsto ~ L\left( \boldsymbol{X}, \Psi\circ \Phi(\boldsymbol{X})\right) \text{~ is small for all instances $\boldsymbol{X}$ of interest.}\]
Strictly speaking, the above is not a proper mathematical definition, because the clause in ‘small … of interest’ is not a well-defined mathematical term.
This sloppiness will not harm any mathematical arguments below.
Naturally, a small reconstruction error cannot be achieved for all \(\boldsymbol{X} \in \mathbb{R}^q\), because the dimension reduction \(p<q\) always leads to a loss of information on the entire space \(\mathbb{R}^q\).
However, if all \(\boldsymbol{X}\) ‘of interest’ live in a lower dimensional object (manifold) \(B_1\) of dimension \(p\), then it is possible to find an encoder (called coordinate mapping) \(\Phi: \mathbb{R}^q \to \mathbb{R}^p\) so that one can perfectly reconstruct the original object \(B_1\) by a suitable decoder \(\Psi:\mathbb{R}^p \to \mathbb{R}^q\).
We revisit our example of the (noisy) unit circle from above.
3.3 Example, unit circle revisited
Assume all covariates of interest live in the unit circle, \[\boldsymbol{X}\in B_1 =\{\boldsymbol{x} \in \mathbb{R}^2;\,x^2_1+x^2_2=1\}.\]
The encoder \(\Phi: \mathbb{R}^2 \to [0,2\pi)\) is obtained by taking the angle in the polar coordinate system.
The decoder \(\Psi:[0,2\pi) \to B_1\subset \mathbb{R}^2\) maps these angles to the unit circle.
This auto-encoder preserves the information of the polar angle, and it loses the information about the size of the radius. If all covariates \(\boldsymbol{X} \in B_1\) ‘of interest’ lie in the unit circle, the information about the radius is not necessary, i.e., \(B_1\) is a one-dimensional (\(p=1\)) object.
We conclude that the interval \([0,2\pi)\) is a low-dimensional representation of the unit circle \(B_1\).
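This unit-circle auto-encoder can be sketched in a few lines of R; the function names `Phi` and `Psi` follow the text, everything else is an illustrative choice.

```r
# unit-circle auto-encoder: Phi maps a point to its polar angle,
# Psi maps the angle back to the unit circle B_1
Phi <- function(x) atan2(x[2], x[1]) %% (2*pi)   # encoder R^2 -> [0, 2*pi)
Psi <- function(z) c(cos(z), sin(z))             # decoder [0, 2*pi) -> B_1
x <- c(cos(2.3), sin(2.3))                       # a point on the unit circle
x_rec <- Psi(Phi(x))                             # perfect reconstruction on B_1
sqrt(sum((x - x_rec)^2))                         # reconstruction error (zero)
y <- c(2, 0)                                     # a point off the circle
Psi(Phi(y))                                      # mapped back to radius 1
```

The last two lines show the loss of the radius information: any point off the circle is reconstructed at radius one.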
3.4 Typical dissimilarity functions
Working in Euclidean spaces \(\mathbb{R}^q\), one can take the classical \(L^k\)-norms, \(k \in [1,\infty]\), as dissimilarity measures \[ L(\boldsymbol{X}, \boldsymbol{X}') = \left\| \boldsymbol{X} - \boldsymbol{X}' \right\|_k =\left(\sum_{j=1}^q \left|X_j-X_j'\right|^k\right)^{1/k}.\]
For \(k=2\), we have the Euclidean distance; \(k=1\) corresponds to the \(L^1\)-distance (Manhattan distance); and the limiting case \(k=\infty\) is the supremum norm.
The Euclidean distance is a special case of the Mahalanobis distance \[ L(\boldsymbol{X}, \boldsymbol{X}') = \sqrt{(\boldsymbol{X} - \boldsymbol{X}')^\top A \,(\boldsymbol{X} - \boldsymbol{X}')},\] for a positive definite matrix \(A \in \mathbb{R}^{q\times q}\).
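These dissimilarities can be coded in a few lines of R; the function names `Lk_dist` and `maha_dist` are our own choices for illustration.

```r
# L^k-norm dissimilarity of two vectors; k=Inf gives the supremum norm
Lk_dist <- function(x, y, k=2){
  if (is.infinite(k)) return(max(abs(x - y)))
  sum(abs(x - y)^k)^(1/k)
}
# Mahalanobis distance for a positive definite matrix A; A = Id gives the Euclidean case
maha_dist <- function(x, y, A) sqrt(as.numeric(t(x - y) %*% A %*% (x - y)))
x <- c(1, 2, 3); y <- c(2, 0, 3)
Lk_dist(x, y, k=1)        # 3 (Manhattan distance)
Lk_dist(x, y, k=2)        # sqrt(5) (Euclidean distance)
Lk_dist(x, y, k=Inf)      # 2 (supremum norm)
maha_dist(x, y, diag(3))  # equals the Euclidean distance
```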
3.5 Categorical covariates
Categorical covariates are more difficult.
Assume \(X\) is categorical taking \(K\) levels. One-hot encoding embeds \(X \mapsto \boldsymbol{X} \in \{0,1\}^K\) into the \(K\)-dimensional Euclidean space \(\mathbb{R}^K\).
The Hamming distance counts the number of positions where two one-hot encoded vectors \(\boldsymbol{X}\) and \(\boldsymbol{X}'\) differ. The one-hot encoded Hamming distance is equal to \[ L(X, X') = 2 \cdot \mathbf{1}_{\{X \neq X'\}}.\]
A disadvantage of this approach is that all mutual distances between the levels are equal, and there is no notion of similarity between different levels. This is precisely the step where a low-dimensional entity embedding may be useful.
When there are multiple categorical covariates, one can scale these dissimilarities to bring them to the same units. To account for the number of levels \(K\), the scaled (probability) version is given by \[ L(X, X') = \frac{1}{K^2}\,\mathbf{1}_{\{X \neq X'\}}.\]
This is inspired by the fact that if \(X\) and \(X'\) are independent and uniform on \(\{a_1,\ldots, a_K\}\), every pair \((X,X')\) has probability \(1/K^2\).
For non-uniform i.i.d. categorical variables it may be adapted to a weighted version \[ L(X, X') = p_X \, p_{X'}\,\mathbf{1}_{\{X \neq X'\}},\] with (positive) categorical probabilities \((p_{a_k})_{k=1}^K\) summing to one.
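A small R sketch of these categorical dissimilarities; the level names `a1,...,aK`, the probabilities and the function names are illustrative choices.

```r
# dissimilarities for a categorical covariate with K levels
K <- 4
levels_ <- paste0("a", 1:K)
onehot <- function(x) as.numeric(levels_ == x)              # embedding into {0,1}^K
hamming <- function(x, y) sum(onehot(x) != onehot(y))       # equals 2 * 1{x != y}
scaled <- function(x, y) as.numeric(x != y) / K^2           # scaled (probability) version
weighted <- function(x, y, p) p[x] * p[y] * (x != y)        # non-uniform weighted version
p <- c(a1=0.4, a2=0.3, a3=0.2, a4=0.1)                      # categorical probabilities
hamming("a1", "a3")      # 2
hamming("a2", "a2")      # 0
scaled("a1", "a3")       # 1/16
weighted("a1", "a3", p)  # 0.4 * 0.2 = 0.08
```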
4 Principal component analysis
4.1 Principal component analysis
The PCA is a linear dimension reduction technique that has different interpretations, one of them is a linear auto-encoder interpretation.
We start from the design matrix \({\mathfrak X}=[\boldsymbol{X}_1,\ldots, \boldsymbol{X}_n]^\top\).
The rows of \({\mathfrak X}\) contain the transposed covariates \(\boldsymbol{X}_i \in \mathbb{R}^q\) of all instances \(1\le i \le n\), \[ \mathfrak{X}=\left[\boldsymbol{X}_1,\ldots, \boldsymbol{X}_n\right]^\top =\begin{pmatrix} X_{1,1} & \cdots & X_{1,q} \\ \vdots & \ddots & \vdots \\ X_{n,1} & \cdots & X_{n,q} \\\end{pmatrix} ~\in~ \mathbb{R}^{n\times q}.\]
4.2 Principal components auto-encoder
Select the standard unit basis \((\boldsymbol{e}_{j})_{j=1}^q\) of the Euclidean space \(\mathbb{R}^q\).
The covariates \(\boldsymbol{X}_i\) can be written as \[ \boldsymbol{X}_i ~=~ X_{i,1}\, \boldsymbol{e}_1 + \ldots + X_{i,q}\, \boldsymbol{e}_q ~=~\sum_{j=1}^q X_{i,j}\, \boldsymbol{e}_j ~\in~\mathbb{R}^q.\]
The PCA represents \(\boldsymbol{X}_i\) in a different orthonormal basis \((\boldsymbol{v}_{j})_{j=1}^q\).
By a linear transformation, one can rewrite the covariates \(\boldsymbol{X}_i\) as \[ \boldsymbol{X}_i ~=~ a_{i,1}\, \boldsymbol{v}_1 + \ldots + a_{i,q}\, \boldsymbol{v}_q ~=~\sum_{j=1}^q \langle \boldsymbol{X}_i, \boldsymbol{v}_j \rangle\, \boldsymbol{v}_j ~\in~\mathbb{R}^q,\] with the new coefficients \(a_{i,j}=\langle \boldsymbol{X}_i, \boldsymbol{v}_j \rangle \in \mathbb{R}\).
The two representations are equivalent, but they use different bases of the (same) Euclidean space \(\mathbb{R}^q\).
The PCA tries to find an orthonormal basis \((\boldsymbol{v}_{j})_{j=1}^q\) that makes the reconstruction error of the entire design matrix \({\mathfrak X}\) small, when dropping a few coordinate components.
That is, for all instances \(1\le i \le n\) and given \(\textcolor{magenta}{p}<q\) \[ \boldsymbol{X}_i =\sum_{j=1}^q a_{i,j} \boldsymbol{v}_j ~\approx~\sum_{j=1}^{\textcolor{magenta}{p}} a_{i,j} \boldsymbol{v}_j.\]
This motivates the following encoder \(\Phi_{\textcolor{magenta}{p}}:\mathbb{R}^q \to \mathbb{R}^{\textcolor{magenta}{p}}\), \(\textcolor{magenta}{p}<q\), \[ \boldsymbol{X} ~\mapsto ~ \Phi_{\textcolor{magenta}{p}}(\boldsymbol{X}) = \left(\langle \boldsymbol{X}, \boldsymbol{v}_1 \rangle, \ldots, \langle \boldsymbol{X}, \boldsymbol{v}_{\textcolor{magenta}{p}} \rangle\right)^\top ~\in~\mathbb{R}^{\textcolor{magenta}{p}}.\] This only keeps the first \(\textcolor{magenta}{p}\) principal components in the alternative orthonormal basis \((\boldsymbol{v}_{j})_{j=1}^q\) representation.
For the decoder \(\Psi_p:\mathbb{R}^p \to \mathbb{R}^q\), we use a simple embedding \[ \boldsymbol{Z} ~\mapsto ~ \Psi_p(\boldsymbol{Z}) = \sum_{j=1}^p Z_j \boldsymbol{v}_j = \sum_{j=1}^p Z_j \boldsymbol{v}_j + \sum_{j=p+1}^q 0\cdot \boldsymbol{v}_j ~\in~\mathbb{R}^q.\]
Composing the encoder \(\Phi_p\) and the decoder \(\Psi_p\) gives the auto-encoder \[ \boldsymbol{X} ~\mapsto ~ \Psi_p \circ \Phi_p(\boldsymbol{X}) = \sum_{j=1}^{\textcolor{magenta}{p}} \langle \boldsymbol{X}, \boldsymbol{v}_j \rangle\, \boldsymbol{v}_j ~\in ~\mathbb{R}^q.\]
This describes an orthogonal projection of \(\boldsymbol{X}\) to the sub-space spanned by \((\boldsymbol{v}_{j})_{j=1}^{\textcolor{magenta}{p}}\).
This auto-encoder has a reconstruction error of \[ \boldsymbol{X} - \Psi_p \circ \Phi_p(\boldsymbol{X}) = \sum_{j=\textcolor{magenta}{p}+1}^q \langle \boldsymbol{X}, \boldsymbol{v}_j \rangle\, \boldsymbol{v}_j.\]
The PCA selects the orthonormal basis \((\boldsymbol{v}_{j})_{j=1}^q\) such that it makes this reconstruction error minimal for the \(L^2\)-distance dissimilarity measure over the entire learning sample \({\cal L}=(\boldsymbol{X}_i)_{i=1}^n\).
For fixed \(p\), minimize the reconstruction error \[\begin{eqnarray*} \sum_{i=1}^n L\left(\boldsymbol{X}_i, \Psi_p \circ \Phi_p(\boldsymbol{X}_i)\right) &=& \sum_{i=1}^n\left\| \boldsymbol{X}_i - \Psi_p \circ \Phi_p(\boldsymbol{X}_i)\right\|_2^2 \\&=&\sum_{j=\textcolor{magenta}{p}+1}^q \left\| {\mathfrak X} \boldsymbol{v}_j \right\|_2^2. \end{eqnarray*}\]
The latter expression shows that the orthonormal basis \((\boldsymbol{v}_{j})_{j=1}^q\) should be selected such that the terms \(\|{\mathfrak X} \boldsymbol{v}_j \|_2^2\) are maximally decreasing in \(j\).
This can be done simultaneously for all encoding dimensions \(1\le p \le q\) by solving recursive Lagrange problems.
The first orthonormal basis vector \(\boldsymbol{v}_1 \in \mathbb{R}^q\) is a solution of \[ \boldsymbol{v}_1 ~\in~\underset{\|\boldsymbol{v}\|_2=1}{\arg \max}~ \|{\mathfrak X} \boldsymbol{v}\|_2^2.\]
The \(j\)-th orthonormal basis vector \(\boldsymbol{v}_j \in \mathbb{R}^q\), \(2\le j \le q\), is computed recursively by \[ \boldsymbol{v}_j ~\in~\underset{\|\boldsymbol{v}\|_2=1}{\arg \max}~ \|{\mathfrak X} \boldsymbol{v}\|_2^2 \qquad \text{ subject to $\langle \boldsymbol{v}_k,\boldsymbol{v}\rangle=0$ for all $1\le k \le j-1$.}\]
This solves the total dissimilarity minimization problem for the linear auto-encoders simultaneously for all \(1\le p \le q\).
The single terms \(\langle \boldsymbol{X}, \boldsymbol{v}_j \rangle\, \boldsymbol{v}_j\) are the principal components in the lower dimensional representations.
In principle, this fully solves the PCA auto-encoder problem.
4.3 Singular value decomposition
At this stage, we could close the chapter on the PCA, because the problem is fully solved.
However, there is an alternative way of computing the PCA than by recursively solving convex Lagrange problems.
This alternative way of computing the orthonormal basis \((\boldsymbol{v}_{j})_{j=1}^q\) uses the singular value decomposition (SVD) and the algorithm of Golub and Van Loan (1983); see Section 14.5.1 in Hastie, Tibshirani and Friedman (2009).
The SVD uses the following nice mathematical result.
Recall that a matrix \(U\in \mathbb{R}^{n\times q}\) is orthogonal (has orthonormal columns) if \(U^\top U ={\rm Id}_q\) gives the \(q\)-dimensional identity matrix.
There exist orthogonal matrices \(U\in \mathbb{R}^{n\times q}\) and \(V \in \mathbb{R}^{q\times q}\), and a diagonal matrix \(\Lambda = {\rm diag}(\lambda_1,\ldots, \lambda_q)\in \mathbb{R}^{q\times q}\) with singular values \(\lambda_1 \ge \ldots \ge \lambda_q \ge 0\) providing the SVD of \({\mathfrak X}\) \[ {\mathfrak X} = U \Lambda V^\top.\]
\(V\) is called right-singular matrix of \({\mathfrak X}\).
Mathematical result: the column vectors of the right-singular matrix \(V=[\boldsymbol{v}_1,\ldots, \boldsymbol{v}_q]\in \mathbb{R}^{q\times q}\) give the orthonormal basis \((\boldsymbol{v}_{j})_{j=1}^q\) that we are looking for.
This is also justified by the computation \[ \|{\mathfrak X} \boldsymbol{v}_j\|_2^2 = \boldsymbol{v}_j^\top {\mathfrak X}^\top{\mathfrak X} \boldsymbol{v}_j = \boldsymbol{v}_j^\top V \Lambda^2 V^\top \boldsymbol{v}_j = \lambda_j^2.\]
The singular values are ordered \(\lambda_1 \ge \ldots \ge \lambda_q \ge 0\); the orthonormal basis \((\boldsymbol{v}_{j})_{j=1}^q\) gives the eigenvectors of \({\mathfrak X}^\top{\mathfrak X}\) with eigenvalues \(\lambda_j^2\).
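These statements can be checked numerically; the following R sketch (on simulated, illustrative data) verifies the eigenvector relation, the identity \(\|{\mathfrak X}\boldsymbol{v}_j\|_2^2=\lambda_j^2\), and the reconstruction error formula of the rank-\(p\) PCA auto-encoder from above.

```r
# numerical check on simulated data: the columns of V are eigenvectors of
# t(X) %*% X with eigenvalues lambda_j^2, ||X v_j||_2^2 = lambda_j^2, and the
# rank-p reconstruction error is the sum of the remaining squared singular values
set.seed(1)
X <- scale(matrix(rnorm(200*4), nrow=200, ncol=4))  # standardized design matrix
SVD <- svd(X)
eig <- eigen(t(X) %*% X, symmetric=TRUE)
max(abs(SVD$d^2 - eig$values))                      # numerically zero
v1 <- SVD$v[, 1]
abs(sum((X %*% v1)^2) - SVD$d[1]^2)                 # numerically zero
p <- 2                                              # rank-p PCA auto-encoder
X_rec <- X %*% SVD$v[, 1:p] %*% t(SVD$v[, 1:p])     # projection onto first p vectors
abs(sum((X - X_rec)^2) - sum(SVD$d[(p+1):4]^2))     # numerically zero
```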
4.4 Two-dimensional example: multivariate Gaussian covariates
# construct a design matrix X
n1 <- 1000
set.seed(200)
X <- cbind(rnorm(n=n1, mean=0, sd=1), rnorm(n=n1, mean=0, sd=1))
A <- array( c(2,1,0,.5), dim=c(2,2))
X <- X %*% A
# singular value decomposition of X
svd_fit <- svd(X)
svd_fit$v # right-singular matrix (columns containing v)
          [,1]       [,2]
[1,] -0.9940315  0.1090935
[2,] -0.1090935 -0.9940315
svd_fit$d # singular values
[1] 71.36817 14.21831
- Plotting this two-dimensional example.
# plot the design matrix X
plot(x=X[,1], y=X[,2], xlim=range(X), ylim=range(X), xlab="X1", ylab="X2", main="PCA: Gaussian example")
# first right-singular vector
abline(a=0, b=svd_fit$v[2,1]/svd_fit$v[1,1], col="red", lwd=2)
# second right-singular vector
abline(a=0, b=svd_fit$v[2,2]/svd_fit$v[1,2], col="orange", lwd=2)
abline(h=0)
abline(v=0)
legend(x="topright", pch=c(1,-1,-1), lwd=c(-1,2,2), col=c("black","red","orange"), legend=c("observations X", "1st principal comp.", "2nd principal comp."), bg="white")
- The first right-singular vector \(\boldsymbol{v}_1\) determines the direction of the biggest scatter/diameter of the observations.

4.5 Example: Belgium Sports Cars
We revisit the Belgium Sports Cars example considered in Rentzmann and Wüthrich (2019).
This example was originally studied in Ingenbleek and Lemaire (1988).
A Belgian insurance company computed a variable \(\tau\) using a proprietary code. If \(\tau<17\), the corresponding vehicle was declared to be a Sports Car, and otherwise a regular car.
Ingenbleek and Lemaire (1988) tried to reconstruct this Sports Cars encoding by a PCA on different vehicle information.
We present this PCA.
load(file="../../Data/SportsCars.rda")
dat <- SportsCars
str(dat)
'data.frame': 475 obs. of 13 variables:
$ brand : chr "Austin" "Citroen" "Citroen" "Fiat" ...
$ type : chr "Rovermini" "Visa" "2CV" "Panda" ...
$ model : chr "Ehlemayair" "Baseclub" "Specialton" "34" ...
$ cubic_capacity : int 998 652 602 850 1598 845 956 1588 1596 992 ...
$ max_power : int 31 25 21 25 41 21 31 40 40 37 ...
$ max_torque : num 67 49 39 60 96 56 65 100 100 98 ...
$ seats : int 4 5 4 5 5 4 5 5 5 5 ...
$ weight : int 620 755 585 680 1015 695 695 900 1030 920 ...
$ max_engine_speed: int 5000 5500 5750 5250 4600 4500 5750 4500 4800 4250 ...
$ seconds_to_100 : num 19.5 26.2 NA 32.3 21 NA 19.3 18.7 20 NA ...
$ top_speed : int 129 125 115 125 143 115 137 148 140 130 ...
$ sports_car : int 0 0 0 0 0 0 0 0 0 0 ...
$ tau : num 23.3 34.1 28.6 32.8 35 ...
- The last line shows the proprietary \(\tau\), with \(\tau<17\) being Sports Cars.
- We pre-process the data as in Ingenbleek and Lemaire (1988).
# construct the covariates of Ingenbleek and Lemaire (1988)
dat$x1 <- log(dat$weight/dat$max_power)
dat$x2 <- log(dat$max_power/dat$cubic_capacity)
dat$x3 <- log(dat$max_torque)
dat$x4 <- log(dat$max_engine_speed)
dat$x5 <- log(dat$cubic_capacity)
X <- dat[, c("x1","x2","x3","x4","x5")]
# standardization of design matrix
X <- X-colMeans(X)[col(X)]
X <- as.matrix(X/sqrt(colMeans(X^2))[col(X)])
- The resulting standardized design matrix \({\mathfrak X}\) has \(n=475\) rows and \(q=5\) columns for the five covariates.
round(cor(X, method="pearson"), 4)
        x1      x2      x3      x4      x5
x1 1.0000 -0.7484 -0.8173 -0.3074 -0.6690
x2 -0.7484 1.0000 0.4552 0.6100 0.1531
x3 -0.8173 0.4552 1.0000 -0.1076 0.9317
x4 -0.3074 0.6100 -0.1076 1.0000 -0.2533
x5 -0.6690 0.1531 0.9317 -0.2533 1.0000
- Some covariates are highly (linearly) correlated. This indicates that this data may have a lower-dimensional representation.
SVD <- svd(X)
round(SVD$d, 2) # singular values
[1] 37.53 28.07 11.48  6.48  2.12
- Basically, the first two principal components should be sufficient to explain this data.
- We compute the reconstruction error for the different dimensions \(1\le p \le q\).
# L2 reconstruction error
Frobenius.loss <- function(X1,X2){sqrt(sum(as.matrix((X1-X2)^2))/nrow(X1))}
reconstruction.error <- array(NA, c(5))
for (p in 1:5){
Xp <- SVD$v[,1:p] %*% t(SVD$v[,1:p]) %*% t(X)
Xp <- t(Xp)
reconstruction.error[p] <- Frobenius.loss(X,Xp)
}
# average reconstruction error
round(reconstruction.error, 2)
[1] 1.43 0.61 0.31 0.10 0.00
# projection to the first two right-singular vectors v1 and v2
dat$PCA1 <- X %*% SVD$v[,1]
dat$PCA2 <- X %*% SVD$v[,2]
# we give a two-dimensional representation/graph
rr <- max(abs(dat$PCA1), abs(dat$PCA2))
#
plot(x=dat$PCA1, y=dat$PCA2, col="blue",pch=20, ylim=c(-rr,rr), xlim=c(-rr,rr), ylab="2nd principal component", xlab="1st principal component", main="principal component analysis")
# sports car coloring
dat0 <- dat[which(dat$tau<21),]
points(x=dat0$PCA1, y=dat0$PCA2, col="green",pch=20)
dat0 <- dat[which(dat$tau<17),]
points(x=dat0$PCA1, y=dat0$PCA2, col="red",pch=20)
legend("bottomleft", c("tau>=21", "17<=tau<21", "tau<17 (sports car)"), col=c("blue", "green", "red"), lty=c(-1,-1,-1), lwd=c(-1,-1,-1), pch=c(20,20,20))
The previous plot gives a two-dimensional representation of \({\mathfrak X}\) using the first two principal components.
For this we compute the projections \(a_{i,j}= \langle \boldsymbol{X}_i, \boldsymbol{v}_j \rangle\) for the two principal components \(j=1,2\) of all instances \(1\le i \le n\).
This gives us the two-dimensional (encoded) representation \((a_{i,1}, a_{i,2})_{i=1}^n\) of all instances.
The coloring in the graph uses the Belgium Sports Cars definition through the variable \(\tau\) (with red for Sports Cars).
A conclusion from this picture is that the Sports Cars definition essentially uses the first principal component: basically, a first principal component bigger than 2 gives a Sports Car, and further fine-tuning can be done with the second principal component. Note that the true Belgium Sports Cars definition for computing \(\tau\) does not use the same variables, and it is not linear in our covariates.
round(SVD$v, 2)
      [,1]  [,2]  [,3]  [,4]  [,5]
[1,] -0.56 0.10 0.06 0.82 -0.08
[2,] 0.41 -0.48 -0.60 0.35 -0.35
[3,] 0.54 0.27 0.00 0.40 0.69
[4,] 0.13 -0.70 0.68 0.13 0.10
[5,] 0.46 0.43 0.43 0.16 -0.62
The first right-singular vector reads as \[ \boldsymbol{v}_1 = - 0.56\, \boldsymbol{e}_1 + 0.41\, \boldsymbol{e}_2 + 0.54\, \boldsymbol{e}_3 + 0.13\, \boldsymbol{e}_4 + 0.46\, \boldsymbol{e}_5,\] and the first principal component is obtained by projecting the covariates to this vector \(\boldsymbol{v}_1\).

5 Bottleneck neural network
5.1 Bottleneck neural network
The PCA presented above is a linear auto-encoder.
A natural attempt to obtain a non-linear auto-encoder is to set up a FNN architecture that propagates all information through a small intermediate layer (giving the encoder).
This encoder should be such that a suitable decoder leads to a small reconstruction error.
This has been introduced by Kramer (1991) and Hinton and Salakhutdinov (2006).
The next graph shows a BNN with a bottleneck of size two; this leads to a two-dimensional encoding.

5.2 Formalizing the bottleneck neural network
Select a deep FNN architecture \(\boldsymbol{z}^{(d:1)}=\boldsymbol{z}^{(d)}\circ \cdots \circ \boldsymbol{z}^{(1)}\) with the following two special features:
The output dimension is equal to the input dimension, \(q_d=q_0=q\).
One FNN layer \(\boldsymbol{z}^{(m)}\), \(1\le m <d\), has a small dimension (number of units) \(q_m =p < q\). This dimension corresponds to the number of principal components in the PCA, and \(\boldsymbol{z}^{(m)}\) is called the bottleneck of the FNN.
The BNN encoder is for \(\boldsymbol{X} \in \mathbb{R}^q\) given by \[ \boldsymbol{X} ~\mapsto \Phi_p(\boldsymbol{X}) := \boldsymbol{z}^{(m:1)}(\boldsymbol{X})=\left(\boldsymbol{z}^{(m)}\circ \cdots \circ \boldsymbol{z}^{(1)}\right)(\boldsymbol{X}) ~\in~\mathbb{R}^p,\] and the BNN decoder for \(\boldsymbol{Z} \in \mathbb{R}^p\) by \[ \boldsymbol{Z} ~\mapsto \Psi_p(\boldsymbol{Z}) := \left(\boldsymbol{z}^{(d)}\circ \cdots \circ \boldsymbol{z}^{(m+1)}\right)(\boldsymbol{Z}) ~\in~\mathbb{R}^q.\]
Composing the BNN encoder and the BNN decoder gives the BNN auto-encoder \(\Psi_p\circ \Phi_p:\mathbb{R}^q \to \mathbb{R}^q\) \[ \boldsymbol{X}~\mapsto ~ \Psi_p\circ \Phi_p(\boldsymbol{X}) =\boldsymbol{z}^{(d:1)}(\boldsymbol{X}).\]
The above figure illustrates a BNN auto-encoder of depth \(d=4\), input dimension \(q_0=q_4=5\), and with a bottleneck dimension \(q_2=p=2\).
This BNN auto-encoder is trained on the learning sample \({\cal L}=(\boldsymbol{X}_i)_{i=1}^n\) using the SGD algorithm as described in the network chapter.
As loss function one selects a dissimilarity function \(L\), and the SGD algorithm tries to minimize the resulting reconstruction error w.r.t. \(L\).
The low-dimensional representation of the data is then obtained by evaluating the bottleneck of the trained FNN \[ \left\{\Phi_p(\boldsymbol{X}_i)\right\}_{i=1}^n = \left\{\boldsymbol{z}^{(m:1)}(\boldsymbol{X}_i)\right\}_{i=1}^n ~\subset ~\mathbb{R}^p.\]
The advantage of the BNN auto-encoder over the PCA is that the BNN auto-encoder can deal with non-linear structures in the data, provided one selects a non-linear activation function \(\phi\).
The disadvantage clearly is that the BNN auto-encoder treats the bottleneck dimension \(p\) as a hyper-parameter that needs to be selected before training the BNN.
Changing this hyper-parameter requires refitting a new BNN architecture, i.e., in contrast to the PCA, one does not simultaneously obtain the results for all \(1\le p \le q\).
There is no notion of singular values \((\lambda_j)_{j=1}^q\) quantifying the significance of the principal components; instead, one has to evaluate the reconstruction error for every bottleneck dimension \(p\) individually to find a suitable size of the bottleneck, i.e., an acceptable level of reconstruction error.
5.3 Example: Belgium Sports Cars, revisited
We revisit the Belgium Sports Cars example from above and continue the dimension reduction analysis.
We use a BNN of depth \(d=4\) similar to the above graph, having a bottleneck of dimension \(p=2\). This gives us a two-dimensional encoder \(\Phi_2: {\mathbb R}^5 \to {\mathbb R}^2\) for the data \(\mathfrak{X}\).
library(tensorflow)
library(keras) # Keras version 2
- The following code defines a BNN auto-encoder architecture.
bottleneck <- function(seed, qq, acti){
tf$keras$backend$clear_session()
set.seed(seed)
set_random_seed(seed)
Input <- layer_input(shape=c(qq[1]), dtype='float32')
#
Encoder = Input %>%
layer_dense(units=qq[2], activation=acti, use_bias=FALSE) %>%
layer_dense(units=qq[3], activation=acti, use_bias=FALSE, name='Encoder')
#
Decoder = Encoder %>%
layer_dense(units=qq[2], activation=acti, use_bias=FALSE) %>%
layer_dense(units=qq[1], activation='linear', use_bias=FALSE)
#
keras_model(inputs=Input, outputs=Decoder)
}
BN <- 2 # size of bottleneck
qq <- c(ncol(X), 9, BN) # X is the design matrix
(model <- bottleneck(1234, qq, 'tanh')) # tanh activation function
Model: "model"
________________________________________________________________________________
Layer (type) Output Shape Param #
================================================================================
input_1 (InputLayer) [(None, 5)] 0
dense (Dense) (None, 9) 45
Encoder (Dense) (None, 2) 18
dense_2 (Dense) (None, 9) 18
dense_1 (Dense) (None, 5) 45
================================================================================
Total params: 126 (504.00 Byte)
Trainable params: 126 (504.00 Byte)
Non-trainable params: 0 (0.00 Byte)
________________________________________________________________________________
- Training the BNN: for simplicity, we do not track over-fitting since our example is small; generally, this should be done (e.g., by cross-validation).
model %>% compile(optimizer = optimizer_nadam(), loss = 'mean_squared_error')
epochs <- 20000
fit <- model %>% fit(as.matrix(X), as.matrix(X), epochs = epochs, batch_size = nrow(X), verbose = 0)
- Comparing the reconstruction error of the BNN and the PCA auto-encoders for dimension \(p=2\).
BNN <- model %>% predict(as.matrix(X), verbose = 0)
c(round(Frobenius.loss(X, BNN), 4), round(reconstruction.error[2], 4))
[1] 0.5220 0.6124
- We observe a clearly smaller reconstruction error of the BNN compared to the PCA; however, on bigger datasets one should carefully assess over-fitting, e.g., using cross-validation, to ensure that only the structure is extracted and not the noisy part.
- We extract the encodings \(\Phi_p(\boldsymbol{X}_i)\) of all instances \(1\le i \le n\).
#
# extract the BNN encoding
Encoder <- keras_model(inputs=model$input, outputs=get_layer(model, 'Encoder')$output)
XX <- data.frame(Encoder %>% predict(as.matrix(X), verbose=0))
- The figure below shows the results and compares them to the PCA encoder.

There is similarity between the two encoders, but the BNN encoder seems to be a slightly rotated version of the PCA. Note that the BNN does not label the BNN neurons w.r.t. any importance (singular values), nor does it consider any notion of orthogonality between the BNN neurons.
5.4 Closing remarks on BNNs
- Hinton and Salakhutdinov (2006) noticed that gradient descent training of BNN architectures can be difficult, and it can likely result in poorly trained BNNs. Therefore, Hinton and Salakhutdinov (2006) propose to use a BNN architecture that is symmetric in the bottleneck layer with a special training procedure; for more details, see Section 7.5.5 in Wüthrich and Merz (2023).
- For a linear activation function \(\phi\), the bottleneck represents a linear subspace of dimension \(p\) (provided that all other FNN layers have more units), and the trained BNN gives a linear data compression. The result is the same as that of the PCA, with the essential difference that the BNN auto-encoder does not give a representation in an orthonormal basis \((\boldsymbol{v}_j)_{j=1}^p\), but it gives the results in the same linear subspace in a different parametrization (that is not explicitly specified), because the BNN auto-encoder does not have any notion of orthonormality.
6 Kernel principal component analysis
The PCA is a linear dimension reduction technique, and one might want to look for a non-linear variant in the spirit of the PCA.
To arrive at a non-linear PCA, the idea is to consider a feature map \[ F : \mathbb{R}^q \to \mathbb{R}^d, \qquad \boldsymbol{X} \mapsto F(\boldsymbol{X}),\] where we typically think of a higher dimensional feature embedding, \(d>q\). The PCA is then applied to \((F(\boldsymbol{X}_i))_{i=1}^n\).
The feature map \(F\) is the first part of the idea behind the kernel PCA, i.e., this higher dimensional embedding gives the necessary modeling flexibility.
Main problem: How can we find a suitable feature map \(F\)?
Part of the answer to this question is: It is not necessary to explicitly select the feature map \(F\), but for a PCA it suffices to know the (implied) kernel \[ K : {\cal X} \times {\cal X} \to \mathbb{R}, \qquad (\boldsymbol{X}, \boldsymbol{X}') \mapsto K(\boldsymbol{X}, \boldsymbol{X}') =\left\langle F(\boldsymbol{X}), F(\boldsymbol{X}')\right\rangle, \] because the non-linear PCA is fully determined by this kernel \(K\).
This is called the kernel trick, which means that for such types of problems, it is sufficient to know the kernel \(K\).
Popular kernels are of the form \[\begin{eqnarray*} K(\boldsymbol{X}, \boldsymbol{X}') &=& (b+\langle \boldsymbol{X}, \boldsymbol{X}'\rangle)^k,\\ K(\boldsymbol{X}, \boldsymbol{X}') &=& \exp \left\{- \gamma \|\boldsymbol{X}- \boldsymbol{X}'\|_2^2 \right\}. \end{eqnarray*}\]
For the most popular kernel choices \(K\), there is a mathematical theory that makes this concept work: there exists a reproducing kernel Hilbert space (RKHS) that has the selected kernel \(K\) as a proper kernel of a feature map \(F\) on a suitable Hilbert space.
More explicitly, when \(K\) is a Mercer kernel, the Aronszajn (1950) theorem says that there exists such a RKHS.
As the last ingredient of the kernel PCA, Schölkopf, Smola and Müller (1998) show that the knowledge of the kernel \(K\) is sufficient to perform the (kernel) PCA.
We do not further discuss the kernel PCA. Generally, it is difficult to select a suitable kernel \(K\), and computationally this method can be challenging.
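Nevertheless, the mechanics can be sketched in a few lines of R: only the kernel matrix is used, the feature map \(F\) is never constructed explicitly. The Gaussian (RBF) kernel, the bandwidth `gamma` and the simulated data below are our own illustrative choices.

```r
# minimal kernel PCA sketch with the Gaussian (RBF) kernel
set.seed(3)
X <- scale(matrix(rnorm(50*3), nrow=50, ncol=3))  # standardized design matrix
gamma <- 0.5
K <- exp(-gamma * as.matrix(dist(X))^2)           # kernel matrix K(X_i, X_j)
n <- nrow(K)
H <- diag(n) - matrix(1/n, n, n)                  # centering matrix
Kc <- H %*% K %*% H                               # kernel matrix centered in feature space
eig <- eigen(Kc, symmetric=TRUE)
# kernel principal component scores of the first two components
scores <- eig$vectors[, 1:2] %*% diag(sqrt(pmax(eig$values[1:2], 0)))
```

The double-centering step `H %*% K %*% H` corresponds to centering the (implicit) features \(F(\boldsymbol{X}_i)\) in the RKHS before applying the PCA there.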
Copyright
© The Authors
This notebook and these slides are part of the project “AI Tools for Actuaries”. The lecture notes can be downloaded from:
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5162304
- This material is provided to reusers to distribute, remix, adapt, and build upon in any medium or format for noncommercial purposes only, so long as attribution and credit are given to the original authors and source, and any changes are indicated. This aligns with the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).