AI TOOLS FOR ACTUARIES
Chapter 10: Unsupervised Learning - Part C

Author

Mario V. Wüthrich

Published

November 2, 2025

Abstract
This notebook presents methods for the low-dimensional visualization of high-dimensional data.

1 Unsupervised learning

Overview

Unsupervised learning methods aim at understanding the distribution of the covariates without using any responses. There are three main types of unsupervised learning problems:

  • dimension reduction of high-dimensional data,
  • clustering of similar data, and
  • visualization of high-dimensional data.

The previous two notebooks focused on dimension reduction and clustering. This notebook considers low-dimensional visualization of high-dimensional data.

2 Visualization methods

Overview

Low-dimensional visualization of high-dimensional data is similar to dimension reduction, but instead of minimizing a reconstruction error, these methods aim at preserving the local structure (topology), so that the adjacency relations in a neighborhood of a given instance are preserved.

2.1 Low-dimensional visualization methods

  • Low-dimensional visualization methods are of a topological nature trying to preserve neighborhoods.

  • Popular methods are:

    • \(t\)-distributed stochastic neighbor embedding (\(t\)-SNE) of Maaten and Hinton (2008),
    • uniform manifold approximation and projection (UMAP) of McInnes, Healy and Melville (2018), and
    • self-organizing maps (SOM) also called Kohonen maps by Kohonen (1982), Kohonen (2001), Kohonen (2013).
  • We present the \(t\)-SNE method in detail here.

2.2 Topological visualization methods

  • Visualization methods are based on finding instances \((\boldsymbol{Z}_i)_{i=1}^n\) in a lower dimensional space, say, \(\mathbb{R}^2\), such that there is similarity in the distance behavior of the learning sample \({\cal L}=(\boldsymbol{X}_i)_{i=1}^n \subset \mathbb{R}^q\) and the lower dimensional sample \((\boldsymbol{Z}_i)_{i=1}^n \subset \mathbb{R}^2\).

  • This translates to minimization problems of type \[ \underset{(\boldsymbol{Z}_i)_{i=1}^n\subset \mathbb{R}^2}{\arg \min}~ \sum_{1\le i,i' \le n} \bigg( L(\boldsymbol{X}_i,\boldsymbol{X}_{i'}) - \|\boldsymbol{Z}_i-\boldsymbol{Z}_{i'}\|_2\bigg)^2.\]

  • Thus, the sample \((\boldsymbol{Z}_i)_{i=1}^n\) in the two-dimensional Euclidean space should be such that the original adjacency matrix \[ \left(L(\boldsymbol{X}_i,\boldsymbol{X}_{i'})\right)_{1\le i,i' \le n} ~\in ~\mathbb{R}^{n\times n}_+,\] is preserved as well as possible.
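
The following minimal R sketch (an illustration added here, not part of the original analysis) minimizes this stress criterion directly with optim() for a small simulated sample, using the Euclidean distance as dissimilarity \(L\).

# minimal sketch (illustration only): direct minimization of the above stress criterion
# for a small simulated sample, with Euclidean distance as dissimilarity L
set.seed(1)
n  <- 50; q <- 5
Xs <- matrix(rnorm(n * q), nrow=n)            # simulated learning sample in R^q
Ls <- as.matrix(dist(Xs))                     # dissimilarity matrix (L(X_i, X_i'))_{i,i'}
stress <- function(z) {                       # sum of squared differences of distances
  Z <- matrix(z, nrow=n, ncol=2)
  sum((Ls - as.matrix(dist(Z)))^2)
}
Z0  <- cmdscale(Ls, k=2)                      # classical multidimensional scaling as initialization
fit <- optim(as.vector(Z0), stress, method="BFGS", control=list(maxit=200))
Zs  <- matrix(fit$par, nrow=n, ncol=2)        # two-dimensional visualization sample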

3 \(t\)-SNE method

3.1 \(t\)-SNE method

  • The \(t\)-SNE method of Maaten and Hinton (2008) considers an embedding that modifies the above idea.

  • Map the adjacency matrix \((L(\boldsymbol{X}_i,\boldsymbol{X}_{i'}))_{i,i'}\) to a categorical distribution \(\boldsymbol{q}=(q_j)_{j=1}^J\), and try to find \(t\)-distributed instances \((\boldsymbol{Z}_i)_{i=1}^n\) with dissimilarities resulting in a similar categorical distribution \(\boldsymbol{p}=(p_j)_{j=1}^J\).

  • This results in trying to minimize the KL divergence from \(\boldsymbol{p}\) to \(\boldsymbol{q}\) \[ D_{\rm KL}(\boldsymbol{q}||\boldsymbol{p}) = \sum_{j=1}^J {q_j} \log \left(\frac{q_j}{p_j}\right)~\ge ~0.\]

  • Remark: the KL divergence is zero if and only if \(\boldsymbol{q}=\boldsymbol{p}\); this can be proved by Jensen’s inequality.
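
As a small numerical illustration (added here, assuming strictly positive probability weights), the KL divergence of two categorical distributions can be computed as follows.

# small numerical check (illustration only): the KL divergence is non-negative,
# and zero exactly when the two categorical distributions coincide
kl_div <- function(q, p) sum(q * log(q / p))  # D_KL(q||p), assumes strictly positive entries
q0 <- c(0.2, 0.3, 0.5)
p0 <- c(0.25, 0.25, 0.5)
kl_div(q0, p0)    # > 0, since q0 differs from p0
kl_div(q0, q0)    # = 0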


Original sample

  • For \(\boldsymbol{X}_i, \boldsymbol{X}_{i'} \in {\cal L}\), define the conditional probability weight \[ q_{i'|i} = \frac{ \exp \left\{-\frac{1}{2 \sigma_i^2}\|\boldsymbol{X}_{i'} - \boldsymbol{X}_{i}\|^2_2 \right\}} { \sum_{k\neq i}\exp \left\{-\frac{1}{2 \sigma_i^2}\|\boldsymbol{X}_k - \boldsymbol{X}_i\|^2_2 \right\}}~\in~ (0,1), \qquad \text{ for $i\neq i'$;}\] the choice of the bandwidth \(\sigma_i>0\) is discussed below.

  • \(q_{i'|i}\) describes the Gaussian probability of selecting \(\boldsymbol{X}_{i'}\) as the neighbor of \(\boldsymbol{X}_i\) from all instances \(i'\neq i\).

  • We symmetrize \(q_{i'|i}\) by setting (the diagonal is excluded) \[ q_{i,i'} = \left(q_{i'|i} + q_{i|i'}\right)/(2n)~\in~ (0,1), \qquad \text{ for $i\neq i'$.}\]

  • We have \(\sum_{i=1}^n \sum_{i'\neq i} q_{i,i'}=1\). Hence, \(\boldsymbol{q}=(q_{i,i'})_{i\neq i'}\) is a categorical distribution with \(J=n^2-n\) components.
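
A minimal R sketch of these probability weights (an illustration; it assumes the bandwidths \(\sigma_i\) are given, their calibration is discussed in Section 3.2):

# sketch (illustration only): conditional probabilities q_{i'|i} and their symmetrized
# version q_{i,i'} for a given vector of bandwidths sigma
cond_probs <- function(X, sigma) {
  n  <- nrow(X)
  D2 <- as.matrix(dist(X))^2              # squared Euclidean distances
  Q  <- exp(-D2 / (2 * sigma^2))          # row i uses bandwidth sigma_i (column-major recycling)
  diag(Q) <- 0                            # exclude the diagonal
  Q / rowSums(Q)                          # row-wise normalization gives q_{i'|i}
}
sym_probs <- function(Q) (Q + t(Q)) / (2 * nrow(Q))   # q_{i,i'} = (q_{i'|i} + q_{i|i'}) / (2n)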


Visualization sample

  • Select a fixed dimension \(p<q\).

  • Find a visualization sample \((\boldsymbol{Z}_i)_{i=1}^n \subset \mathbb{R}^p\) that has Student-\(t\) probabilities \(\boldsymbol{p}=(p_{i,i'})_{i\neq i'}\) with one degree of freedom \[ p_{i,i'} = \frac{ \left(1+\|\boldsymbol{Z}_i - \boldsymbol{Z}_{i'}\|^2_2 \right)^{-1}} { \sum_{k\neq l}\left(1+\|\boldsymbol{Z}_k - \boldsymbol{Z}_l\|^2_2 \right)^{-1}}~\in~ (0,1), \qquad \text{ for $i\neq i'$,}\] such that \(\boldsymbol{p}\) is close to \(\boldsymbol{q}=(q_{i,i'})_{i\neq i'}\).

  • This motivates the following minimization problem \[ (\boldsymbol{Z}^*_i)_{i=1}^n ~\in~ \underset{(\boldsymbol{Z}_i)_{i=1}^n}{\arg \min}~ D_{\rm KL}(\boldsymbol{q}||\boldsymbol{p}) ~=~\underset{(\boldsymbol{Z}_i)_{i=1}^n}{\arg \min}~ \sum_{i\neq i'} {q_{i,i'}} \log \left(\frac{q_{i,i'}}{p_{i,i'}}\right).\]

  • This is solved with gradient descent.
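
A minimal sketch of the Student-\(t\) probabilities and the KL objective (an illustration reusing cond_probs() and sym_probs() from the sketch above; a generic optimizer such as optim() is used here instead of the tailored gradient descent of the tsne package):

# sketch (illustration only): Student-t probabilities p_{i,i'} and the KL objective
t_probs <- function(Z) {
  W <- 1 / (1 + as.matrix(dist(Z))^2)     # (1 + ||Z_i - Z_i'||_2^2)^(-1)
  diag(W) <- 0
  W / sum(W)                              # p_{i,i'}
}
kl_obj <- function(z, Qsym, n, p=2) {     # Qsym: symmetrized probabilities q_{i,i'}
  P   <- t_probs(matrix(z, nrow=n, ncol=p))
  off <- row(P) != col(P)                 # off-diagonal entries only
  sum(Qsym[off] * log(Qsym[off] / P[off]))
}
# possible usage for a (small) design matrix X and calibrated bandwidths sigma:
# Qsym <- sym_probs(cond_probs(X, sigma))
# fit  <- optim(rnorm(2 * nrow(X), sd=1e-4), kl_obj, Qsym=Qsym, n=nrow(X))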

3.2 Remarks

  • There is some discrepancy in the definition of \(\boldsymbol{q}\) and \(\boldsymbol{p}\). For the high-dimensional case, the definition of \(\boldsymbol{q}\) via the conditional probabilities leads to more robustness towards outliers. In the low-dimensional case, this is not necessary.

  • The Student-\(t\) distribution is heavy-tailed (regularly varying), and for one degree of freedom (Cauchy distribution), it has a quadratic asymptotic decay \(p_{i,i'} \approx \|\boldsymbol{Z}_i - \boldsymbol{Z}_{i'}\|^{-2}_2\) for \(\|\boldsymbol{Z}_i - \boldsymbol{Z}_{i'}\|_2 \to \infty\).

  • There remains the choice of the bandwidth \(\sigma_i>0\). A smaller value for \(\sigma_i>0\) gives a denser clustering.


  • For \(\boldsymbol{q}_{\bullet|i}=(q_{i'|i})_{i'\neq i}\), one defines the so-called perplexity \[ {\rm Perp}(\boldsymbol{q}_{\bullet|i}) = 2^{H(\boldsymbol{q}_{\bullet|i})}= 2^{ - \sum_{i'\neq i} q_{i'|i} \log_2 (q_{i'|i}) }, \] with \(H(\boldsymbol{q}_{\bullet|i})\) being the Shannon entropy measured in bits.

  • According to Maaten and Hinton (2008), a good choice of the bandwidths \(\sigma_i>0\) is obtained by keeping the perplexity \({\rm Perp}(\boldsymbol{q}_{\bullet|i})\) constant in \(i\).
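
A minimal sketch of this calibration (an illustration; it assumes the target perplexity is smaller than the number of other instances, so that a root exists):

# sketch (illustration only): calibrate sigma_i so that the perplexity of q_{.|i} hits a target
perplexity_i <- function(d2_i, sigma) {   # d2_i: squared distances from instance i to all others
  w <- exp(-(d2_i - min(d2_i)) / (2 * sigma^2))   # shift by the minimum for numerical stability
  q <- w / sum(w)
  2^(-sum(q[q > 0] * log2(q[q > 0])))             # Perp = 2^{Shannon entropy in bits}
}
calibrate_sigma <- function(d2_i, target=30) {
  uniroot(function(s) perplexity_i(d2_i, s) - target, lower=1e-3, upper=1e3)$root
}
# possible usage, once the design matrix X is defined (see below):
# D2  <- as.matrix(dist(X))^2
# sig <- sapply(1:nrow(X), function(i) calibrate_sigma(D2[i, -i], target=30))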

3.3 Example \(t\)-SNE

  • We revisit the Belgian sports car example considered in Rentzmann and Wüthrich (2019); this example was originally studied in Ingenbleek and Lemaire (1988).

  • A Belgian insurance company used this data to classify sports cars. This classification is expressed by the variable \(\tau>0\), see our notebook on the PCA. We are going to color the instances according to their \(\tau\) values.

Load data and necessary libraries.

library(cluster)
load(file="../../Data/SportsCars.rda")
dat <- SportsCars

Pre-processing of data as in the previous notebooks.

# pre-process data
dat$x1 <- log(dat$weight/dat$max_power)
dat$x2 <- log(dat$max_power/dat$cubic_capacity)
dat$x3 <- log(dat$max_torque)
dat$x4 <- log(dat$max_engine_speed)
dat$x5 <- log(dat$cubic_capacity)
dat1   <- dat[, c("x1","x2","x3","x4","x5")]

# normalization of design matrix (column means m0 and standard deviations sds are stored)
m0  <- colMeans(dat1)                                            # empirical column means
X01 <- dat1 - colMeans(dat1)[col(dat1)]                          # centered design matrix
sds <- sqrt(colMeans(X01^2)) * sqrt(nrow(dat1)/(nrow(dat1)-1))   # sample standard deviations
X   <- as.matrix(X01/sqrt(colMeans(X01^2))[col(X01)])            # columns scaled to unit variance

library(tsne)
# the results depend on the seed (initialization of gradient descent)
set.seed(100)
# gradient descent takes roughly 50 seconds
K_res <- tsne(X, k=2, initial_dims=ncol(X), perplexity=30)
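
The embedding can be plotted and colored by the sports car variable, for instance as in the following sketch; the column name dat$tau is an assumption and may differ in the actual data set.

# plot sketch (illustration only): two-dimensional t-SNE embedding, colored by the sports
# car variable; the column name dat$tau is an assumption and may differ in the data
cols <- rev(rainbow(100, start=0, end=0.5))[cut(dat$tau, breaks=100, labels=FALSE)]
plot(K_res[, 1], K_res[, 2], col=cols, pch=19, cex=0.7,
     xlab="component 1", ylab="component 2", main="t-SNE (perplexity 30)")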


  • In the above plots we observe the sensitivity of the results to the initialization (seed).

  • The plots below show the results for different perplexities; a larger perplexity gives a more uniform distribution.

  • Note that \(\boldsymbol{p}\) is invariant under rotations and translations of the visualization sample \((\boldsymbol{Z}_i)_{i=1}^n\).

4 UMAP method

4.1 UMAP method

  • UMAP of McInnes, Healy and Melville (2018) provides a second method for constructing a topology-preserving representation of the data.

  • UMAP is based on the assumption that the observed sample \({\cal L}=(\boldsymbol{X}_i)_{i=1}^n \subset \mathbb{R}^q\) is contained in a lower-dimensional manifold, and we try to learn its local structure.

4.2 Original sample

  • Select a dissimilarity measure \(L\) and a nearest neighbor number \(k \in \mathbb{N}\).

  • Find the \(k\) nearest neighbors of \(\boldsymbol{X}_i\) under the selected dissimilarity measure \(L\), and denote them by \(\boldsymbol{X}_{i_1},\ldots, \boldsymbol{X}_{i_k} \in {\cal L}\).

  • Compute the dissimilarity with the closest neighbor \[ \varrho_i = \min \left\{ L(\boldsymbol{X}_i, \boldsymbol{X}_{i_j});~1\le j \le k, ~L(\boldsymbol{X}_i, \boldsymbol{X}_{i_j})>0\right\},\] and choose the bandwidth \(\sigma_i>0\) such that \[ \sum_{j=1}^k \exp \left\{- \frac{\max\{0, L(\boldsymbol{X}_i,\boldsymbol{X}_{i_j})-\varrho_i\}}{\sigma_i}\right\}= \log_2(k).\]


  • For the \(k\) nearest neighbors we define the weights \[ w_{i_j|i} = \exp \left\{- \frac{\max\{0, L(\boldsymbol{X}_i,\boldsymbol{X}_{i_j})-\varrho_i\}}{\sigma_i}\right\};\] thus, the weights are restricted to the \(k\) nearest neighbors (all other weights are zero).

  • This gives a weighted directed graph \({\cal G}=(V,E,w)\) with vertices \(V={\cal L}=(\boldsymbol{X}_i)_{i=1}^n\), edges \(E=(\boldsymbol{X}_i, \boldsymbol{X}_{i_j})_{1\le i \le n, 1\le j \le k}\) and weights \((w_{i_j|i})_{1\le i \le n, 1\le j \le k}\).

  • We denote the adjacency matrix obtained from these weights by \(A \in \mathbb{R}_+^{n\times n}\). Thus, matrix \(A\) describes a weighted directed graph with weights giving the likelihoods for an edge from \(\boldsymbol{X}_i\) to \(\boldsymbol{X}_{j}\) being present in the graph \({\cal G}\).

  • We transform \(A\) into a symmetric undirected version by \[ B = A + A^\top - A \circ A^\top, \] where \(\circ\) is the Hadamard product.
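
A minimal R sketch of this graph construction (an illustration using the Euclidean distance as dissimilarity \(L\) and taking the bandwidths \(\sigma_i\) as given; their calibration via the \(\log_2(k)\) condition above is omitted):

# sketch (illustration only): k-nearest-neighbor weights w_{i_j|i} and the symmetrized
# matrix B = A + A^T - A o A^T, with Euclidean distance as dissimilarity L
knn_graph <- function(X, k, sigma) {               # sigma: vector of bandwidths, one per instance
  n <- nrow(X)
  D <- as.matrix(dist(X))
  A <- matrix(0, n, n)
  for (i in 1:n) {
    nbr <- (1:n)[-i][order(D[i, -i])[1:k]]         # indices of the k nearest neighbors of X_i
    rho <- min(D[i, nbr][D[i, nbr] > 0])           # distance to the closest (distinct) neighbor
    A[i, nbr] <- exp(-pmax(0, D[i, nbr] - rho) / sigma[i])
  }
  A + t(A) - A * t(A)                              # "*" is the elementwise (Hadamard) product
}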

4.3 Visualization sample

  • Choose dimension \(p<q\).

  • The goal is to find a low-dimensional representation \((\boldsymbol{Z}_i)_{i=1}^n\subset \mathbb{R}^p\) of the graph described by the matrix \(B\).

  • The idea behind UMAP is to design a force directed graph that has a similar topology to the original one.

  • Since this step is technically challenging, we skip the details and refer to Algorithms 4 and 5 in McInnes, Healy and Melville (2018).

Remarks

  • The nearest neighbor hyperparameter \(k \in \mathbb{N}\) defines the local scale at which the manifold is studied.

  • The second hyperparameter involved in the following algorithm is \({\tt min\_dist}\); it determines the minimal mutual distances within \((\boldsymbol{Z}_i)_{i=1}^n\), i.e., how tightly the embedded points can be packed.


library(umap)

seed <- 100
set.seed(seed)

kNeighbors <- 15     # default is 15
min_dist   <- .1

umap.param              <- umap.defaults
umap.param$n_components <- 2
umap.param$n_neighbors  <- kNeighbors
umap.param$random_state <- seed
umap.param$min_dist     <- min_dist

K_res <- umap(X, config=umap.param, method="naive")
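
The resulting embedding K_res$layout can be plotted analogously to the \(t\)-SNE example; again, the column name dat$tau is an assumption and may differ in the actual data set.

# plot sketch (illustration only): UMAP embedding colored by the sports car variable
cols <- rev(rainbow(100, start=0, end=0.5))[cut(dat$tau, breaks=100, labels=FALSE)]
plot(K_res$layout[, 1], K_res$layout[, 2], col=cols, pch=19, cex=0.7,
     xlab="component 1", ylab="component 2",
     main=paste0("UMAP (k=", kNeighbors, ", min_dist=", min_dist, ")"))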

The parameter \({\tt min\_dist}\) determines the minimal mutual distances within \((\boldsymbol{Z}_i)_{i=1}^n\); a smaller value allows for a tighter packing of the embedded points.


A higher nearest neighbor number \(k\) leads to more connectedness.


5 Conclusions

  • We have presented the \(t\)-SNE method and the UMAP method to give low-dimensional visualizations of high-dimensional data. These methods aim at preserving the local structure.

  • Generally, these methods are computationally demanding; our examples are based on 475 instances only.

  • Kohonen maps (self-organizing maps) are yet another method. Their application is more sophisticated, and we refrain from giving further details, referring instead to Kohonen (1982), Kohonen (2001) and Kohonen (2013).

References

Ingenbleek, J.-F. and Lemaire, J. (1988) “What is a sports car?” ASTIN Bulletin - The Journal of the IAA, 18/2, pp. 175–187. Available at: https://www.cambridge.org/core/journals/astin-bulletin-journal-of-the-iaa/article/what-is-a-sports-car/E3BF2632C2517D49019C043999957B1D.
Kohonen, T. (1982) “Self-organized formation of topologically correct feature maps,” Biological Cybernetics, 43, pp. 59–69. Available at: https://link.springer.com/article/10.1007/BF00337288.
Kohonen, T. (2001) Self-organizing maps. 3rd ed. Springer. Available at: https://doi.org/10.1007/978-3-642-56927-2.
Kohonen, T. (2013) “Essentials of the self-organizing map,” Neural Networks, 37, pp. 52–65. Available at: https://dl.acm.org/doi/10.1016/j.neunet.2012.09.018.
Maaten, L.J.P. van der and Hinton, G.E. (2008) “Visualizing data using \(t\)-SNE,” Journal of Machine Learning Research, 9, pp. 2579–2605. Available at: https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf.
McInnes, L., Healy, J. and Melville, J. (2018) “UMAP: Uniform manifold approximation and projection for dimension reduction,” arXiv, 1802.03426v2. Available at: https://arxiv.org/abs/1802.03426v2.
Rentzmann, S. and Wüthrich, M.V. (2019) “Unsupervised learning: What is a sports car?” Available at: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3439358.