AI TOOLS FOR ACTUARIES
Chapter 10: Unsupervised Learning - Part C
1 Unsupervised learning
The previous two notebooks focused on dimension reduction and clustering. This notebook considers low-dimensional visualizations of high-dimensional data.
2 Visualization methods
2.1 Low-dimensional visualization methods
Low-dimensional visualization methods are of a topological nature, trying to preserve the neighborhoods of the original data.
Popular methods are:
- the \(t\)-SNE method,
- the UMAP method,
- Kohonen maps (self-organizing maps).
We present the \(t\)-SNE method in detail here.
2.2 Topological visualization methods
Visualization methods are based on finding instances \((\boldsymbol{Z}_i)_{i=1}^n\) in a lower dimensional space, say, \(\mathbb{R}^2\), such that there is similarity in the distance behavior of the learning sample \({\cal L}=(\boldsymbol{X}_i)_{i=1}^n \subset \mathbb{R}^q\) and the lower dimensional sample \((\boldsymbol{Z}_i)_{i=1}^n \subset \mathbb{R}^2\).
This translates to minimization problems of type \[ \underset{(\boldsymbol{Z}_i)_{i=1}^n\subset \mathbb{R}^2}{\arg \min}~ \sum_{1\le i,i' \le n} \bigg( L(\boldsymbol{X}_i,\boldsymbol{X}_{i'}) - \|\boldsymbol{Z}_i-\boldsymbol{Z}_{i'}\|_2\bigg)^2.\]
Thus, the sample \((\boldsymbol{Z}_i)_{i=1}^n\) in the two-dimensional Euclidean space should be such that the original adjacency matrix \[ \left(L(\boldsymbol{X}_i,\boldsymbol{X}_{i'})\right)_{1\le i,i' \le n} ~\in ~\mathbb{R}^{n\times n}_+,\] is preserved as well as possible.
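To make this concrete, the following is a minimal R sketch of such a stress minimization on a small simulated toy sample, using Euclidean distances for \(L\) and generic numerical optimization; the toy data and the use of optim are for illustration only and not part of the example below.
# minimal sketch: stress minimization for a small simulated toy sample
set.seed(1)
X_toy <- matrix(rnorm(20*5), nrow=20)            # 20 instances in R^5
D <- as.matrix(dist(X_toy))                      # dissimilarities L(X_i, X_i')
stress <- function(z){                           # squared mismatch of mutual distances
   Z <- matrix(z, ncol=2)
   sum((D - as.matrix(dist(Z)))^2)
   }
fit <- optim(par=rnorm(20*2), fn=stress, method="BFGS", control=list(maxit=500))
Z <- matrix(fit$par, ncol=2)                     # two-dimensional representation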
3 \(t\)-SNE method
3.1 \(t\)-SNE method
The \(t\)-SNE method of Maaten and Hinton (2008) considers an embedding that modifies the above idea.
Map the adjacency matrix \((L(\boldsymbol{X}_i,\boldsymbol{X}_{i'}))_{i,i'}\) to a categorical distribution \(\boldsymbol{q}=(q_j)_{j=1}^J\), and try to find \(t\)-distributed instances \((\boldsymbol{Z}_i)_{i=1}^n\) with dissimilarities resulting in a similar categorical distribution \(\boldsymbol{p}=(p_j)_{j=1}^J\).
This results in trying to minimize the KL divergence from \(\boldsymbol{p}\) to \(\boldsymbol{q}\) \[ D_{\rm KL}(\boldsymbol{q}||\boldsymbol{p}) = \sum_{j=1}^J {q_j} \log \left(\frac{q_j}{p_j}\right)~\ge ~0.\]
Remark: the KL divergence is zero if and only if \(\boldsymbol{q}=\boldsymbol{p}\); this can be proved with Jensen's inequality.
Original sample
For \(\boldsymbol{X}_i, \boldsymbol{X}_{i'} \in {\cal L}\), define the conditional probability weight \[ q_{i'|i} = \frac{ \exp \left\{-\frac{1}{2 \sigma_i^2}\|\boldsymbol{X}_{i'} - \boldsymbol{X}_{i}\|^2_2 \right\}} { \sum_{k\neq i}\exp \left\{-\frac{1}{2 \sigma_i^2}\|\boldsymbol{X}_k - \boldsymbol{X}_i\|^2_2 \right\}}~\in~ (0,1), \qquad \text{ for $i\neq i'$;}\] the choice of the bandwidth \(\sigma_i>0\) is discussed below.
\(q_{i'|i}\) describes the Gaussian probability of selecting \(\boldsymbol{X}_{i'}\) as the neighbor of \(\boldsymbol{X}_i\) from all instances \(i'\neq i\).
We symmetrize \(q_{i'|i}\) by setting (the diagonal is excluded) \[ q_{i,i'} = \left(q_{i'|i} + q_{i|i'}\right)/(2n)~\in~ (0,1), \qquad \text{ for $i\neq i'$.}\]
We have \(\sum_{i=1}^n \sum_{i'\neq i} q_{i,i'}=1\). Hence, \(\boldsymbol{q}=(q_{i,i'})_{i\neq i'}\) is a categorical distribution with \(J=n^2-n\) components.
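The following minimal R sketch computes these symmetrized weights for a given vector of bandwidths \((\sigma_i)_{i=1}^n\) (their data-driven choice via the perplexity is discussed below); the function name is ours and serves illustration only.
# minimal sketch: symmetrized weights q_{i,i'} for a given bandwidth vector sigma
q_matrix <- function(X, sigma){
   n  <- nrow(X)
   D2 <- as.matrix(dist(X))^2                    # squared Euclidean distances
   Q  <- exp(-D2/(2*sigma^2))                    # row i is scaled with bandwidth sigma_i
   diag(Q) <- 0
   Qcond <- Q/rowSums(Q)                         # conditional weights q_{i'|i}
   (Qcond + t(Qcond))/(2*n)                      # symmetrized weights q_{i,i'}
   }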
Visualization sample
Select a fixed dimension \(p<q\).
Find a visualization sample \((\boldsymbol{Z}_i)_{i=1}^n \subset \mathbb{R}^p\) that has Student-\(t\) probabilities \(\boldsymbol{p}=(p_{i,i'})_{i\neq i'}\) with one degree of freedom \[ p_{i,i'} = \frac{ \left(1+\|\boldsymbol{Z}_i - \boldsymbol{Z}_{i'}\|^2_2 \right)^{-1}} { \sum_{k\neq l}\left(1+\|\boldsymbol{Z}_k - \boldsymbol{Z}_l\|^2_2 \right)^{-1}}~\in~ (0,1), \qquad \text{ for $i\neq i'$,}\] such that \(\boldsymbol{p}\) is close to \(\boldsymbol{q}=(q_{i,i'})_{i\neq i'}\).
This motivates the following minimization problem \[ (\boldsymbol{Z}^*_i)_{i=1}^n ~\in~ \underset{(\boldsymbol{Z}_i)_{i=1}^n}{\arg \min}~ D_{\rm KL}(\boldsymbol{q}||\boldsymbol{p}) ~=~\underset{(\boldsymbol{Z}_i)_{i=1}^n}{\arg \min}~ \sum_{i\neq i'} {q_{i,i'}} \log \left(\frac{q_{i,i'}}{p_{i,i'}}\right).\]
This is solved with gradient descent.
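For illustration, a minimal R sketch of this objective as a function of a candidate visualization sample \((\boldsymbol{Z}_i)_{i=1}^n\); the actual gradient descent is left to the tsne package used in the example below.
# minimal sketch: Student-t weights p_{i,i'} and KL objective for a candidate sample Z
kl_objective <- function(Z, Qsym){               # Qsym as returned by q_matrix() above
   W <- 1/(1 + as.matrix(dist(Z))^2)             # (1 + ||Z_i - Z_i'||_2^2)^(-1)
   diag(W) <- 0
   P <- W/sum(W)                                 # normalization over all pairs k != l
   keep <- Qsym > 0                              # off-diagonal entries
   sum(Qsym[keep] * log(Qsym[keep]/P[keep]))     # KL divergence D_KL(q||p)
   }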
3.2 Remarks
There is some discrepancy in the definition of \(\boldsymbol{q}\) and \(\boldsymbol{p}\). For the high-dimensional case, the definition of \(\boldsymbol{q}\) via the conditional probabilities leads to more robustness towards outliers. In the low-dimensional case, this is not necessary.
The Student-\(t\) distribution is heavy-tailed (regularly varying), and for one degree of freedom (Cauchy distribution), it has a quadratic asymptotic decay \(p_{i,i'} \approx \|\boldsymbol{Z}_i - \boldsymbol{Z}_{i'}\|^{-2}_2\) for \(\|\boldsymbol{Z}_i - \boldsymbol{Z}_{i'}\|_2 \to \infty\).
There remains the choice of the bandwidth \(\sigma_i>0\). A smaller value for \(\sigma_i>0\) gives a denser clustering.
For \(\boldsymbol{q}_{\bullet|i}=(q_{i'|i})_{i'\neq i}\), one defines the so-called perplexity \[ {\rm Perp}(\boldsymbol{q}_{\bullet|i}) = 2^{H(\boldsymbol{q}_{\bullet|i})}= 2^{ - \sum_{i'\neq i} q_{i'|i} \log_2 (q_{i'|i}) }, \] with \(H(\boldsymbol{q}_{\bullet|i})\) being the Shannon entropy (measured in bits).
According to Maaten and Hinton (2008), a good choice of the bandwidths \(\sigma_i>0\) is obtained by keeping the perplexity \({\rm Perp}(\boldsymbol{q}_{\bullet|i})\) constant in \(i\).
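The following minimal R sketch chooses \(\sigma_i\) by a bisection search such that the perplexity of \(\boldsymbol{q}_{\bullet|i}\) matches a given target value; the function name and the bisection bounds are our own choices, this is not the implementation of the tsne package. Row \(i\) of the squared distance matrix (with the diagonal entry removed) can be passed as d2_i, and the target corresponds to the perplexity argument of tsne() used below.
# minimal sketch: bandwidth sigma_i matching a target perplexity (bisection search)
sigma_for_perplexity <- function(d2_i, target=30, iter=50){
   perp <- function(sigma){                      # perplexity for a candidate bandwidth
      q <- exp(-(d2_i - min(d2_i))/(2*sigma^2))  # shift by the minimum for numerical stability
      q <- q/sum(q)
      q <- q[q > 0]
      2^(-sum(q * log2(q)))                      # perplexity = 2^(Shannon entropy in bits)
      }
   lo <- 1e-6; hi <- 1e4*sqrt(max(d2_i))         # perplexity is increasing in sigma
   for (it in 1:iter){
      mid <- (lo + hi)/2
      if (perp(mid) < target){lo <- mid}else{hi <- mid}
      }
   (lo + hi)/2
   }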
3.3 Example \(t\)-SNE
We revisit the Belgian sports car example considered in Rentzmann and Wüthrich (2019); this example was originally studied in Ingenbleek and Lemaire (1988).
A Belgian insurance company used this data to classify sports cars. This classification is expressed in the variable \(\tau>0\), see our notebook on the PCA. We are going to color the instances according to their \(\tau\) values.
Load data and necessary libraries.
library(cluster)
load(file="../../Data/SportsCars.rda")
dat <- SportsCars
Pre-processing of data as in the previous notebooks.
# pre-process data
dat$x1 <- log(dat$weight/dat$max_power)
dat$x2 <- log(dat$max_power/dat$cubic_capacity)
dat$x3 <- log(dat$max_torque)
dat$x4 <- log(dat$max_engine_speed)
dat$x5 <- log(dat$cubic_capacity)
dat1 <- dat[, c("x1","x2","x3","x4","x5")]
# normalization of design matrix
m0 <- colMeans(dat1)
X01 <- dat1-colMeans(dat1)[col(dat1)]
sds <- sqrt(colMeans(X01^2))*sqrt(nrow(dat1)/(nrow(dat1)-1))
X <- as.matrix(X01/sqrt(colMeans(X01^2))[col(X01)])
library(tsne)
# the results depend on the seed (initialization of gradient descent)
set.seed(100)
# gradient descent takes roughly 50 seconds
K_res <- tsne(X, k=2, initial_dims=ncol(X), perplexity=30)
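A minimal plotting sketch of the resulting two-dimensional representation; the column name tau for the sports car classification variable and the quartile color coding are our assumptions.
# plot the two-dimensional representation, colored by the classification variable tau
col_tau <- cut(dat$tau, breaks=quantile(dat$tau, probs=seq(0,1,0.25)),
               include.lowest=TRUE, labels=FALSE)
plot(K_res[,1], K_res[,2], col=col_tau, pch=19, xlab="dimension 1", ylab="dimension 2")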
In the above plots we observe the sensitivity of the results to the initialization (seed).
The plots below show the results for different perplexities; a larger perplexity gives a more uniform distribution.

- Note that \(\boldsymbol{p}\) is invariant under rotations and translations of the visualization sample \((\boldsymbol{Z}_i)_{i=1}^n\).
4 UMAP method
4.1 UMAP method
UMAP of McInnes, Healy and Melville (2018) provides a second method for obtaining a topology-preserving representation of the data.
UMAP is based on the assumption that the observed sample \({\cal L}=(\boldsymbol{X}_i)_{i=1}^n \subset \mathbb{R}^q\) is contained in a lower-dimensional manifold, and we try to learn its local structure.
4.2 Original sample
Select a dissimilarity measure \(L\) and a nearest neighbor number \(k \in \mathbb{N}\).
Find the \(k\) nearest neighbors of \(\boldsymbol{X}_i\) under the selected dissimilarity measure \(L\), and denote them by \(\boldsymbol{X}_{i_1},\ldots, \boldsymbol{X}_{i_k} \in {\cal L}\).
Compute the dissimilarity with the closest neighbor \[ \varrho_i = \min \left\{ L(\boldsymbol{X}_i, \boldsymbol{X}_{i_j});~1\le j \le k, ~L(\boldsymbol{X}_i, \boldsymbol{X}_{i_j})>0\right\},\] and choose the bandwidth \(\sigma_i>0\) such that \[ \sum_{j=1}^k \exp \left\{- \frac{\max\{0, L(\boldsymbol{X}_i,\boldsymbol{X}_{i_j})-\varrho_i\}}{\sigma_i}\right\}= \log_2(k).\]
For the \(k\) nearest neighbors we define the weights \[ w_{i_j|i} = \exp \left\{- \frac{\max\{0, L(\boldsymbol{X}_i,\boldsymbol{X}_{i_j})-\varrho_i\}}{\sigma_i}\right\},\] thus, we restrict to the \(k\) nearest neighbors.
This gives a weighted directed graph \({\cal G}=(V,E,w)\) with vertices \(V={\cal L}=(\boldsymbol{X}_i)_{i=1}^n\), edges \(E=(\boldsymbol{X}_i, \boldsymbol{X}_{i_j})_{1\le i \le n, 1\le j \le k}\) and weights \((w_{i_j|i})_{1\le i \le n, 1\le j \le k}\).
We denote the adjacency matrix obtained from these weights by \(A \in \mathbb{R}_+^{n\times n}\). Thus, matrix \(A\) describes a weighted directed graph with weights giving the likelihoods for an edge from \(\boldsymbol{X}_i\) to \(\boldsymbol{X}_{j}\) being present in the graph \({\cal G}\).
We transform \(A\) into a symmetric undirected version by \[ B = A + A^\top - A \circ A^\top, \] where \(\circ\) is the Hadamard product.
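To illustrate this construction, a minimal R sketch that computes the weights \(w_{i_j|i}\) for Euclidean distances \(L\) and returns the symmetrized matrix \(B\); the function name and the bisection search are our own choices, this is not the implementation used by the umap package below.
# minimal sketch: UMAP-type weights for the k nearest neighbors (Euclidean L)
umap_weights <- function(X, k=15, iter=50){
   n <- nrow(X)
   D <- as.matrix(dist(X))
   A <- matrix(0, n, n)
   for (i in 1:n){
      nn   <- order(D[i,])[2:(k+1)]              # k nearest neighbors (excluding i itself)
      d_nn <- D[i, nn]
      rho  <- min(d_nn[d_nn > 0])                # distance to the closest neighbor
      lo <- 1e-6; hi <- 1e4*max(d_nn)            # bisection for sigma_i such that
      for (it in 1:iter){                        # sum_j exp(-max(0, d-rho)/sigma) = log2(k)
         sigma <- (lo + hi)/2
         s <- sum(exp(-pmax(0, d_nn - rho)/sigma))
         if (s < log2(k)){lo <- sigma}else{hi <- sigma}
         }
      A[i, nn] <- exp(-pmax(0, d_nn - rho)/sigma)   # directed weights w_{i_j|i}
      }
   A + t(A) - A*t(A)                             # symmetric undirected version B
   }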
4.3 Visualization sample
Choose dimension \(p<q\).
The goal is to find a low-dimensional representation \((\boldsymbol{Z}_i)_{i=1}^n\subset \mathbb{R}^p\) of the graph described by the matrix \(B\).
The idea behind UMAP is to design a force-directed graph layout that has a topology similar to the original graph.
Since this step is technically challenging, we skip the details and refer to Algorithms 4 and 5 in McInnes, Healy and Melville (2018).
Remarks
The nearest neighbor hyperparameter \(k \in \mathbb{N}\) defines the local scale at which the manifold is studied.
The second parameter involved in the following algorithm is \({\tt min\_dist}\). This determines the mutual distances within \((\boldsymbol{Z}_i)_{i=1}^n\).
library(umap)
# the results depend on the seed (initialization)
seed <- 100
set.seed(seed)
kNeighbors <- 15    # nearest neighbor number k (default is 15)
min_dist <- 0.1     # minimum distance in the visualization
# set up the UMAP configuration
umap.param <- umap.defaults
umap.param$n_components <- 2    # dimension p of the visualization sample
umap.param$n_neighbors <- kNeighbors
umap.param$random_state <- seed
umap.param$min_dist <- min_dist
K_res <- umap(X, config=umap.param, method="naive")
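The fitted object stores the two-dimensional representation in K_res$layout; the plotting sketch below reuses the color coding col_tau from the t-SNE example above (same assumption on the variable tau).
# plot the two-dimensional UMAP representation
plot(K_res$layout[,1], K_res$layout[,2], col=col_tau, pch=19,
     xlab="dimension 1", ylab="dimension 2")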
The parameter \({\tt min\_dist}\) controls how tightly the points \((\boldsymbol{Z}_i)_{i=1}^n\) may be packed; a smaller value gives tighter clusters, a larger value spreads the points more evenly.

A higher nearest neighbor number \(k\) leads to more connectedness.

5 Conclusions
We have presented the \(t\)-SNE method and the UMAP method to give low-dimensional visualizations of high-dimensional data. These methods aim at preserving the local structure of the data.
Generally, these methods are computationally demanding; our examples are based on 475 instances only.
Kohonen maps are yet another such method. Their application is more sophisticated and we refrain from giving further details, but we refer to Kohonen (1982), Kohonen (2001) and Kohonen (2013).
Copyright
© The Authors
This notebook and these slides are part of the project “AI Tools for Actuaries”. The lecture notes can be downloaded from:
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5162304
- This material is provided to reusers to distribute, remix, adapt, and build upon the material in any medium or format for noncommercial purposes only, and only so long as attribution and credit is given to the original authors and source, and you indicate if changes were made. This aligns with the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC).