In this chapter, we will introduce visualization techniques for high-dimensional data. Most visualization types are suitable for low-dimensional data with a few variables. High-dimensional data is usually harder to visualize, since the associations among many variables may be complex and hard to describe with traditional visualization tools, which usually appear on paper or a screen. One alternative is to conduct dimension reduction before visualization, so that we can visualize a low-dimensional representation of the high-dimensional data. Another commonly used technique is clustering analysis, which partitions the data into groups according to their similarities, making visualization easier. In this chapter, we will discuss these two important techniques: dimension reduction and clustering.
Here is a list of R packages used in this section.
library(tidyverse)
library(plotly)
library(devtools)
library(ggbiplot)#devtools::install_github("vqv/ggbiplot")
library(gridExtra)
library(ggrepel)
library(Rtsne)
library(GGally)
library(clue)
library(factoextra)
library(mclust)
We start with dimension reduction techniques. High-dimensional data usually comes in two different forms. First, it can be represented by an \(n\) by \(p\) matrix where each row is one observation and each column is one variable. In this case, we can apply methods such as principal component analysis. Second, it can be an \(n\) by \(n\) similarity/dissimilarity matrix of \(n\) units, where the element in the \(i\)-th row and \(j\)-th column represents the similarity or dissimilarity measure between units \(i\) and \(j\). In this case, we can apply methods such as multidimensional scaling. Note that even if data comes in the first form, we can convert it to the second form by calculating its similarity/dissimilarity matrix.
Principal component analysis (PCA) approximates a high-dimensional data set, an \(n\) by \(p\) matrix, with a low-dimensional data set, an \(n\) by \(d\) matrix, so that the low-dimensional data preserves as much of the variation in the high-dimensional data as possible. Each column of the low-dimensional data set is called a principal component, which is a linear combination of the columns/variables in the high-dimensional data set. In particular, suppose the high-dimensional data set contains \(p\) columns/variables \(X_1, ..., X_p\), and the low-dimensional data set contains \(d\) principal components \(PC_1, ..., PC_d\). Then these principal components can be represented as \[ PC_1 = \beta_{11} X_1 + ... + \beta_{1p} X_p,\\ ...\\ PC_d = \beta_{d1} X_1 + ... + \beta_{dp} X_p, \] where the coefficients \(\beta_{kl}\) are called the loadings. The variances of these principal components are in descending order, i.e., \(Var(PC_1) \geq ... \geq Var(PC_d)\), meaning the first principal component represents the most variance in the original data, the second principal component the second most, and so on. Note that the coefficient vectors of different principal components are designed to be orthogonal to each other, so that different principal components represent different aspects of the original data. Also note that in theory we can calculate \(p\) principal components, but in that case the dimension is not reduced. To select an appropriate \(d\), we often look at \(Var(PC_d)\) as a function of \(d\) and find the elbow point.
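To connect the notation to R output, here is a quick sketch (using the built-in USArrests data purely for illustration): the loadings \(\beta_{kl}\) are returned by prcomp() as the columns of $rotation, and the principal components themselves as the columns of $x.
# Sketch: loadings and principal components from prcomp()
pca_toy <- prcomp(USArrests, center = TRUE, scale. = TRUE)
pca_toy$rotation # loading matrix: one column of coefficients per principal component
head(pca_toy$x) # the principal components (scores)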
Let us see an example. Suppose we are given the mtcars
data set with only three variables, miles per gallon (mpg), displacement
(disp), and 1/4 mile time (qsec). Here we use only three variables for
the ease of illustration. Real data sets usually come with many more
variables. Since it has only three variables, we can visualize it with a
3D scatterplot.
library(tidyverse)
library(plotly)
X = select(mtcars, mpg, disp, qsec)
p1 <- plot_ly(X, x = ~qsec, y = ~mpg, z = ~disp, text = rownames(mtcars), marker = list(size = 5, color = 'rgba(200, 128, 128, 1)'))
p1
The 3D plot seems OK, especially when it is interactive. However, as we mentioned previously, 3D visualization should generally be avoided because most visualizations are meant to be printed on paper or displayed on a screen, in which case the relative positions among the data points are hard to see. Besides, what if we have more than three variables, say four? There is no 3D plot suitable for 4-dimensional data. Hence, we need to visualize this data set in lower dimensions, say two dimensions.
To visualize the data in a two-dimensional space, we need to compress the data. A simple approach is to push all the data points onto a two-dimensional hyperplane, i.e., projection. But which hyperplane? Ideally, we would like the projected data points to best represent the original data. Intuitively, the hyperplane is decided according to the directions in which the original data points spread the most: we would like the projected data points to retain as much of the variance of the original data points as possible.
In the figure, the directions in which the data points’ variation is the largest, second largest, and third largest are shown in orange, green, and yellow. To preserve the variance of the data while pushing the data into a two-dimensional space, it makes sense to project the data points onto the hyperplane determined by the first and second directions. The projected data points are shown in blue. Note that these projected data points still live in three-dimensional space with three coordinates (x, y, z). When we look at only the hyperplane, each blue point can be indexed by two coordinates (PC1, PC2), which are what the principal components represent. In other words, the principal components can be understood as the coordinates of the projected data points within the hyperplane.
Mathematically, the principal components are calculated using the singular value decomposition (SVD). An arbitrary \(n\) by \(p\) matrix can be decomposed into the product of three matrices. The last matrix contains the orthogonal directions along which the variation of the data is in descending order. The middle matrix is diagonal, that is, only the elements in the diagonal positions are nonzero and all other elements are zero; these diagonal elements represent the amount of variation in the corresponding directions. The first matrix contains the standardized coordinates of the projected data points, in the sense that the variance in every direction is one. Therefore, the product of the first and second matrices gives the principal components. Here we keep the first two principal components. Note that if we use all three principal components, then we essentially just rotate the data without reducing the dimension.
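As a quick numerical sketch of this relationship, the product of the first two matrices from the SVD of the centered and scaled data reproduces the principal components returned by prcomp(), up to sign flips of the columns.
sv <- svd(scale(X)) # scale(X) = U D V'
scores_svd <- sv$u %*% diag(sv$d) # U D: coordinates of the projected points (all PCs)
pca_check <- prcomp(X, center = TRUE, scale. = TRUE)
max(abs(abs(scores_svd) - abs(pca_check$x))) # essentially zero, up to sign flips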
Now let us conduct PCA on the 3-dimensional mtcars data set to obtain
the 2-dimensional approximation. PCA is provided in the
prcomp()
function in R. It is often recommended to
standardize the data before applying PCA.
Xscaled = as_tibble(scale(X))
PCA <- prcomp(Xscaled, center = TRUE, scale. = TRUE)
summary(PCA)
## Importance of components:
## PC1 PC2 PC3
## Standard deviation 1.4699 0.8289 0.39022
## Proportion of Variance 0.7202 0.2291 0.05076
## Cumulative Proportion 0.7202 0.9492 1.00000
As we can see, the first principal component has a standard deviation of 1.46, i.e., a variance of 1.46^2, and explains 72% of the total variance of the original data. The first two principal components together account for about 95% of the total variance, which is sufficient for representing the data. We can also take a look at the principal components themselves, which are very different from the original data in value.
head(PCA$x)
## PC1 PC2 PC3
## [1,] -0.09528567 -0.919471288 0.3128131
## [2,] -0.23972494 -0.641407526 0.3075617
## [3,] -1.10073594 -0.084455508 0.3811553
## [4,] -0.40789088 0.785600653 -0.3241362
## [5,] 1.01450238 -0.006513126 -0.5719196
## [6,] -0.43396725 1.274706037 0.2427178
We can plot these principal components in a scatterplot. Comparing the blue points in this scatterplot with the blue points in the 3D plot, we see the same pattern, confirming our understanding that the principal components are simply the projection of the original data. We can also plot the directions of the three original variables in the principal component space.
p3 <- ggplot(cbind(as.data.frame(PCA$x),name = rownames(mtcars)))+
geom_hline(yintercept = 0, color="#ff9900")+
geom_vline(xintercept = 0,color="#ffcc00")+
geom_point(aes(x=PC1, y=PC2),size=2,color="blue")+
geom_text_repel(aes(x=PC1, y=PC2, label=name), size=2)
library(ggbiplot)
p4=ggbiplot(PCA)
grid.arrange(p3, p4, ncol=2)
Now let us look at a more realistic example using the entire mtcars data set (excluding the discrete variables). When conducting PCA, the first decision is \(d\), i.e., the dimension to reduce to. The selection of \(d\) is usually based on the variances of the principal components: the rule of thumb is to select the “elbow” point of the scree plot as the dimension to reduce to. In this case, two or three dimensions are appropriate choices.
mtcars_data = mtcars[,c(1:7,10,11)]
p=dim(mtcars_data)[2]
n=dim(mtcars_data)[1]
mtcars_center = as_tibble(scale(mtcars_data))
mtcars_pca <- prcomp(mtcars_center, center = TRUE,scale. = TRUE)
summary(mtcars_pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.3782 1.4429 0.71008 0.51481 0.42797 0.35184 0.32413
## Proportion of Variance 0.6284 0.2313 0.05602 0.02945 0.02035 0.01375 0.01167
## Cumulative Proportion 0.6284 0.8598 0.91581 0.94525 0.96560 0.97936 0.99103
## PC8 PC9
## Standard deviation 0.2419 0.14896
## Proportion of Variance 0.0065 0.00247
## Cumulative Proportion 0.9975 1.00000
# check: sum(abs(mtcars_pca$x - as.matrix(mtcars_center)%*%as.matrix(mtcars_pca$rotation)))
library(ggbiplot)
g1=ggbiplot(mtcars_pca)
g2=ggplot(cbind(as.data.frame(mtcars_pca$x),name = rownames(mtcars)))+
geom_point(aes(x=PC1, y=PC2),size=2)+
geom_text_repel(aes(x=PC1, y=PC2, label=name), size=2)
g3=ggplot(data.frame(dim = 1:p, var_percentage=mtcars_pca$sdev^2/sum(mtcars_pca$sdev^2)))+
geom_point(aes(x=dim, y=var_percentage))
grid.arrange(g1, g2, g3, ncol=3)
When the high-dimensional data is an \(n\) by \(n\) similarity/dissimilarity matrix, where the element in the \(i\)-th row and \(j\)-th column represents the distance between units \(i\) and \(j\), multidimensional scaling (MDS) is usually a suitable method for dimension reduction. MDS translates the pairwise distances among the \(n\) observations into coordinates of the units in a low-dimensional space, often called an embedding, such that the Euclidean distances among points in the embedding space preserve the original distances as well as possible:
\[ \arg\min_{y_{1},...,y_{n}}{\sum _{i < j}(d_{ij}-\|y_{i}-y_{j}\|)^{2}}. \]
We use the following example to demonstrate the usage of MDS. We have a dissimilarity matrix recording the number of times 15 congressmen from New Jersey voted differently in the House of Representatives on 19 environmental bills. Abstentions are not recorded. The data is in the R package HSAUR by Hothorn and Everitt (2009). We also show the party affiliation of these congressmen. We first show the dissimilarity matrix as follows.
voting=read.csv("./data/voting_NJ.csv",row.names = 1)
voting
## Hunt_R Sandman_R Howard_D Thompson_D Freylinghuysen_R
## Hunt_R 0 8 15 15 10
## Sandman_R 8 0 17 12 13
## Howard_D 15 17 0 9 16
## Thompson_D 15 12 9 0 14
## Freylinghuysen_R 10 13 16 14 0
## Forsythe_R 9 13 12 12 8
## Widnall_R 7 12 15 13 9
## Roe_D 15 16 5 10 13
## Heltoski_D 16 17 5 8 14
## Rodino_D 14 15 6 8 12
## Minish_D 15 16 5 8 12
## Rinaldo_R 16 17 4 6 12
## Maraziti_R 7 13 11 15 10
## Daniels_D 11 12 10 10 11
## Patten_D 13 16 7 7 11
## Forsythe_R Widnall_R Roe_D Heltoski_D Rodino_D Minish_D
## Hunt_R 9 7 15 16 14 15
## Sandman_R 13 12 16 17 15 16
## Howard_D 12 15 5 5 6 5
## Thompson_D 12 13 10 8 8 8
## Freylinghuysen_R 8 9 13 14 12 12
## Forsythe_R 0 7 12 11 10 9
## Widnall_R 7 0 17 16 15 14
## Roe_D 12 17 0 4 5 5
## Heltoski_D 11 16 4 0 3 2
## Rodino_D 10 15 5 3 0 1
## Minish_D 9 14 5 2 1 0
## Rinaldo_R 10 15 3 1 2 1
## Maraziti_R 6 10 12 13 11 12
## Daniels_D 6 11 7 7 4 5
## Patten_D 10 13 6 5 6 5
## Rinaldo_R Maraziti_R Daniels_D Patten_D
## Hunt_R 16 7 11 13
## Sandman_R 17 13 12 16
## Howard_D 4 11 10 7
## Thompson_D 6 15 10 7
## Freylinghuysen_R 12 10 11 11
## Forsythe_R 10 6 6 10
## Widnall_R 15 10 11 13
## Roe_D 3 12 7 6
## Heltoski_D 1 13 7 5
## Rodino_D 2 11 4 6
## Minish_D 1 12 5 5
## Rinaldo_R 0 12 6 4
## Maraziti_R 12 0 9 13
## Daniels_D 6 9 0 9
## Patten_D 4 13 9 0
heatmap(as.matrix(voting), Rowv=NA, Colv=NA)
As we can see, these 15 congressmen are quite different in their voting patterns, but the heatmap is still hard to digest, and we do not have a holistic view of how the 15 congressmen differ. To address this, we apply MDS to this dissimilarity matrix.
voting_mds <- as.data.frame(cmdscale(voting))
ggplot(voting_mds, aes(x=V1, y=V2, label=rownames(voting_mds))) +
geom_point()+
geom_text_repel()
Clearly, the Democrats vote similarly. Interestingly, the Republicans vote differently from the Democrats and also from one another, since their coordinates in the MDS embedding are far apart. One Republican, Rinaldo_R, actually votes similarly to the Democrats. These patterns cannot be easily identified from the heatmap of the dissimilarity matrix.
Lastly, PCA and classical MDS are equivalent if the distance matrix for MDS is calculated using Euclidean distances. This is shown below using the mtcars data set.
# Classical MDS
# n rows (objects) x p columns (variables)
# each row identified by a unique row name
mtcars_dist <- dist(mtcars_center) # euclidean distances between the rows
mtcars_mds <- cmdscale(mtcars_dist, k=3) # k is the number of dim
mtcars_mds <- cbind(as.data.frame(mtcars_mds),
name=rownames(mtcars))
ggplot(mtcars_mds)+
geom_point(aes(x=V1, y=V2))+
geom_text_repel(aes(x=V1, y=V2, label=name))
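As a quick numerical check of this equivalence (a sketch using the objects computed above), the MDS coordinates agree with the PCA scores from mtcars_pca up to sign flips of the coordinates.
mds_coords <- as.matrix(mtcars_mds[, 1:3]) # V1-V3 from cmdscale
max(abs(abs(mds_coords) - abs(mtcars_pca$x[, 1:3]))) # essentially zero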
Similar examples are given in the articles “An Even More Nuanced Breakdown of the Supreme Court” and “A More Nuanced Breakdown of the Supreme Court”.
Isomap (Tenenbaum, Silva, and Langford (2000)) is another dimension reduction method similar to MDS and a common technique in manifold learning. It computes a low-dimensional representation of a set of high-dimensional data points. The key difference from MDS is that Isomap uses geodesic distances, which are approximated by shortest-path distances between observations along a neighborhood graph, instead of Euclidean distances.
Consider the Swiss roll example, a two-dimensional sheet rolled up in three-dimensional space.
d = read_csv("data/swiss.csv")
swiss = d[,1:3]
swiss_label = d$y
library(plotly)
p1 <- plot_ly(d, x = ~x1, y = ~x2, z = ~x3, color = ~y)
p1
If we apply MDS with Euclidean distances, which is essentially PCA, the result is just a linear projection of the roll; it does not recover the underlying two-dimensional sheet.
swiss_dist <- dist(swiss) # euclidean distances between the rows
swiss_mds <- cmdscale(swiss_dist, k=2) # k is the number of dim
swiss_mds <- as.data.frame(swiss_mds)
ggplot(swiss_mds) +
geom_point(aes(x=V1, y=V2))
swiss_mds_lab = data.frame(swiss_mds, label = swiss_label)
ggplot(swiss_mds_lab) +
geom_point(aes(x = V1, y = V2, color = label)) +
scale_color_viridis_c()
On the other hand, we can apply Isomap, implemented from scratch below using a k-nearest-neighbor graph and an all-pairs shortest-path computation.
isomap <- function(data_csv, knn = 5) {
# Pairwise Euclidean distances as a full symmetric matrix
size <- dim(data_csv)[1]
dmat <- as.matrix(dist(data_csv, method = "euclidean"))
# Build the k-nearest-neighbor graph: for each point keep only the knn
# smallest distances; all other entries (and the diagonal, temporarily) become Inf
diag(dmat) <- Inf
nn <- matrix(0, size, knn)
for (i in 1:size) {
nn[i, ] <- order(dmat[i, ])[1:knn]
}
for (i in 1:size) {
dmat[i, -nn[i, ]] <- Inf
}
diag(dmat) <- 0
# Approximate geodesic distances by all-pairs shortest paths
# on the neighborhood graph (Floyd-Warshall)
for (k in 1:size) {
for (i in 1:size) {
for (j in 1:size) {
dmat[i, j] <- min(dmat[i, j], dmat[i, k] + dmat[k, j])
dmat[j, i] <- dmat[i, j]
}
}
}
# If the graph is disconnected, some distances remain Inf; replace them
# with random values bounded by the largest finite distance
adj_mat <- dmat
num_inf <- sum(is.infinite(adj_mat))
adj_mat[is.infinite(adj_mat)] <- max(adj_mat[is.finite(adj_mat)]) * runif(num_inf)
# Classical MDS on the (approximate) geodesic distance matrix
cmdscale(adj_mat, eig = FALSE, k = 2)
}
swiss_isomap <- isomap(swiss)
swiss_isomap_lab = data.frame(swiss_isomap, label = swiss_label)
ggplot(swiss_isomap_lab) +
geom_point(aes(x = X1, y = X2, color = label)) +
scale_color_viridis_c() +
ylim(-5,5)
t-distributed stochastic neighbor embedding (t-SNE) also aims to learn \(d\)-dimensional embeddings \(z_{1}, \dots , z_{n} \in \mathbb{R}^d\) of the original observations \(x_{1}, \dots , x_{n} \in \mathbb{R}^p\), so that the similarity matrix among the embeddings mimics the similarity matrix among the original observations as closely as possible. This is very similar to the idea of MDS; in MDS, the similarity matrix consists of the pairwise Euclidean distances, and the closeness between the two similarity matrices is measured by a matrix norm. The key parts of t-SNE are the way the similarities are defined and the way the discrepancy between the two sets of similarities is measured. In particular, the similarities are defined in the following way.
First, the similarities among the original observations are defined as
\[p_{ij} = (p_{i|j} + p_{j|i})/(2n),\\ p_{j\mid i}= \frac {\exp(-\lVert \mathbf {x} _{i}-\mathbf {x} _{j}\rVert ^{2}/2\sigma _{i}^{2})} {\sum _{k\neq i}\exp(-\lVert \mathbf {x} _{i}-\mathbf {x} _{k}\rVert ^{2}/2\sigma _{i}^{2})}.\]
Then, the similarities among the embeddings are defined as
\[q_{ij} = \frac {(1+\lVert z_i-z_j \rVert^{2})^{-1}} {\sum _{k} \sum _{l\neq k} (1+\lVert z_{k}-z_{l}\rVert ^{2})^{-1}}\]
Note that \(p_{ij}=p_{ji}\) and \(q_{ij}=q_{ji}\), \(p_{ii}=q_{ii}=0\), and \(\sum_{i,j}p_{ij} = \sum_{i,j}q_{ij}=1\).
Finally, the embeddings are determined by minimizing the (non-symmetric) Kullback–Leibler divergence of the distribution of \(p_{ij}\) from that of \(q_{ij}\),
\[\text{KL} ( P \,\|\, Q )= \sum_{i \neq j} p_{ij} \log {\frac {p_{ij}}{q_{ij}}}\]
The minimization of the Kullback–Leibler divergence with respect to the points \(z_{i}\) is performed using gradient descent. The result of this optimization is a map that reflects the similarities between the high-dimensional inputs.
library(Rtsne)
set.seed(123)
mtcars_tSNE <- Rtsne(mtcars_center, perplexity=10)
head(mtcars_tSNE$Y)
## [,1] [,2]
## [1,] 10.113945 1.682154
## [2,] 12.933098 -1.628959
## [3,] 21.068805 -15.501378
## [4,] -4.915016 -1.429881
## [5,] -17.798873 8.442781
## [6,] -7.848545 -4.316165
embedding = cbind(as.data.frame(mtcars_tSNE$Y),name=rownames(mtcars))
ggplot(embedding)+
geom_point(aes(x=V1, y=V2))+
geom_text_repel(aes(x=V1, y=V2, label=name))
We now visualize a much larger data set, MNIST. The data set consists of 28 by 28 grayscale images of handwritten digits (0-9); below we display the first ten images.
library(tidyverse)
library(Rtsne)
d = readRDS("data/mnist.rds")
X = d[,-1]
Y = d[, 1]
par(mfrow=c(2,5))
for (i in 1:10){
image(matrix(as.numeric(X[i,]), nrow=28)[,28:1])
}
We conduct t-SNE on this data set and visualize the data in a 2-dimensional space. As we can see, the observations corresponding to the same digit are close to each other.
mnist_tsne <- Rtsne(X, perplexity=5)
tsne = data.frame(mnist_tsne$Y, label = Y)
ggplot(tsne) +
geom_point(aes(x = X1, y = X2, color = as.factor(label))) +
scale_color_viridis_d()
As a comparison, we reduce the dimension using PCA. The scree plot below suggests that around 50 dimensions are needed. Due to the limit of space, we examine only the first few principal components. As we can see, PCA does not separate the observations by digit as well as t-SNE. In fact, PCA is a linear projection, so it cannot capture the nonlinear structure that t-SNE is able to exploit.
Xclean = X[, apply(X, 2, sd) != 0]
Xscaled = as_tibble(scale(Xclean))
PCA <- prcomp(Xscaled, center = TRUE, scale. = TRUE)
plot(PCA$sdev^2, pch=20)
We plot the first five principal components in a pairwise scatterplot.
PCA_df = data.frame(PCA$x, Y)
library(GGally)
ggpairs(PCA_df, # Data frame
columns = 1:5,
aes(color = as.factor(Y), alpha = 0.1))
We also apply t-SNE to the Swiss roll example.
swiss_tsne <- Rtsne(swiss, perplexity=5)
swiss_tsne_lab = data.frame(swiss_tsne$Y, label = swiss_label)
ggplot(swiss_tsne_lab) +
geom_point(aes(x = X1, y = X2, color = label)) +
scale_color_viridis_c()
An autoencoder uses a neural network to learn a low-dimensional representation of a set of high-dimensional data points. It consists of an encoder and a decoder: the encoder reduces the dimension of the original data, while the decoder uses the output of the encoder to reconstruct the original data.
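We do not fit an autoencoder in this chapter, but a minimal sketch using the keras R package is shown below; this assumes keras and a TensorFlow backend are installed. With linear activations and squared-error loss, the two-unit bottleneck learns essentially the same subspace as the first two principal components.
# Minimal autoencoder sketch (assumes the keras package and a TensorFlow backend)
library(keras)
Xmat <- as.matrix(mtcars_center) # scaled mtcars data from above
n_vars <- ncol(Xmat)
input <- layer_input(shape = n_vars)
encoded <- input %>% layer_dense(units = 2) # encoder: n_vars -> 2
decoded <- encoded %>% layer_dense(units = n_vars) # decoder: 2 -> n_vars
autoencoder <- keras_model(input, decoded)
encoder <- keras_model(input, encoded)
autoencoder %>% compile(optimizer = "adam", loss = "mse")
autoencoder %>% fit(Xmat, Xmat, epochs = 200, verbose = 0)
embedding <- predict(encoder, Xmat) # 2-dimensional representation of the data
The remaining code in this section demonstrates Radviz, a radial-coordinate visualization in which each variable is an anchor on a circle and each observation is placed inside the circle according to its values, using the mass cytometry data in the bodenmiller package.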
library(ggplot2)
library(Radviz)
library(dplyr)
library(tidyr)
library(GGally)
library(bodenmiller)
data(refPhenoMat)
head(refPhenoMat)
data(refAnnots)
ref.df <- data.frame(refAnnots,
refPhenoMat)
data(refPhenoMat)
data(refFuncMat)
data(refAnnots)
ref.df <- data.frame(refAnnots,
refPhenoMat,
refFuncMat)
trans <- function(coln)
do.L(coln,fun=function(x) quantile(x,c(0.005,0.995)))
hist(ref.df$CD3)
abline(v=quantile(ref.df$CD3,c(0.005,0.995)),col=2,lty=2)
ct.S <- make.S(dimnames(refPhenoMat)[[2]])
## take a vector of column names and return a matrix of [x,y] anchor positions
ct.sim <- cosine(as.matrix(ref.df[,row.names(ct.S)]))
in.da(ct.S,ct.sim)
rv.da(ct.S,ct.sim)
optim.ct <- do.optimRadviz(ct.S,ct.sim,iter=100,n=1000)
ct.S <- make.S(get.optim(optim.ct))
## get the optimal ordering of the anchors (dimensions)
ct.S <- recenter(ct.S,'CD3')
## keep the CD3 anchor at the reference position
ct.rv <- do.radviz(ref.df,ct.S,trans=trans)
summary(ct.rv)
## the Radviz projection
ct.rv ## print the plot
plot(ct.rv,anchors.only=FALSE) # draw the Radviz plot
plot(ct.rv) + geom_point()
plot(ct.rv) +
geom_point(data=. %>%
arrange(CD4),
## order the points by CD4
aes(color=CD4))+
scale_color_gradient(low='grey80',high="dodgerblue4") ## adjust the colors; other layers can be added
smoothRadviz(ct.rv)
smoothRadviz(ct.rv)+
geom_point(shape='.',alpha=1/5)
contour(ct.rv) ## contour plot
hexplot(ct.rv) ## hexagonal binning plot
cur.pop <- 'igm+'
sub.rv <- subset(ct.rv,refAnnots$Cells==cur.pop)
smoothRadviz(ct.rv)+
geom_density2d(data=sub.rv$proj$data,
aes(x=rx,y=ry),
color='black')
hexplot(ct.rv)
hexplot(ct.rv,color='CD4')
hexplot(ct.rv,color='pS6')
hexplot(ct.rv,color='pAkt')
When the high-dimensional data is reduced to a lower dimension, it is often useful to conduct clustering analysis to show the similarities and dissimilarities among data points. Alternatively, we could conduct the clustering on the high-dimensional data directly. Either way, clustering is useful for identifying patterns in the data.
K-means clustering is an algorithm that partitions \(n\) observations into \(k\) clusters. Given \(k\) initial centroids (centers of the clusters), in each iteration it first assigns each observation to the nearest centroid, and then recalculates the centroid of each cluster as the mean of the newly assigned observations. The algorithm stops when the assignment of observations no longer changes.
We use k-means clustering to analyze a real data set of mall customers with measurements of their annual incomes and spending scores (1-100). The spending score is determined from customer behavior and purchasing data. We would like to segment these customers into different groups so that marketing strategies can be differentiated.
customer = read.csv("./data/Mall_Customers.csv",check.names = FALSE)
head(customer)
## CustomerID Gender Age Annual_Income Spending_Score
## 1 1 Male 19 15 39
## 2 2 Male 21 15 81
## 3 3 Female 20 16 6
## 4 4 Female 23 16 77
## 5 5 Female 31 17 40
## 6 6 Female 22 17 76
customer = as.data.frame(apply(customer[,4:5], 2, scale))
customer_clust = customer
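The segmentation itself can be obtained with kmeans(); a minimal sketch follows, where the choice of five clusters is an assumption for illustration (in practice, the number of clusters would be chosen with, e.g., an elbow or silhouette plot).
set.seed(123)
km_customer <- kmeans(customer_clust, centers = 5, nstart = 25) # five clusters (assumed)
customer_clust$cluster <- factor(km_customer$cluster)
ggplot(customer_clust) +
geom_point(aes(x = Annual_Income, y = Spending_Score, color = cluster))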
We can also apply k-means to the 2-dimensional MDS embedding of the mtcars data. To illustrate how the algorithm proceeds, we rerun it from the updated centroids several times and plot the cluster assignments and centroids after each round.
mtcars_data = mtcars[,c(1:7,10,11)]
mtcars_center = as_tibble(scale(mtcars_data))
mtcars_dist <- dist(mtcars_center)
data = as.data.frame(cmdscale(mtcars_dist, k=2)) # k is the number of dim
ngrid <- 500
pred.mat <- expand.grid(
`V1` = with(data, seq(-4,4, length.out = ngrid)),
`V2` = with(data, seq(-4,4, length.out = ngrid))
)
km <- kmeans(data, centers=4, algorithm = "Forgy", iter.max = 1)
updated_centroids = aggregate(data, by=list(cluster=factor(km$cluster)), mean)
data_clust <- data.frame(data, cluster = factor(km$cluster))
pred.mat$cluster <- factor(cl_predict(km, pred.mat))
p1 <- ggplot() + labs(title = "iteration 1") +
geom_raster(data = pred.mat, aes(x = V1, y = V2, fill = cluster), alpha = 0.3) +
geom_point(data = data_clust, aes(x = V1, y = V2, color = cluster, shape=cluster) ) +
geom_point(data = updated_centroids, aes(x = V1, y = V2, color = cluster, shape=cluster), size=5) +
theme_bw() + coord_cartesian(expand = FALSE)
km <- kmeans(data, centers=updated_centroids[,-1], algorithm = "Forgy", iter.max = 2)
updated_centroids = aggregate(data, by=list(cluster=factor(km$cluster)), mean)
data_clust <- data.frame(data, cluster = factor(km$cluster))
pred.mat$cluster <- factor(cl_predict(km, pred.mat))
p2 <- ggplot() + labs(title = "iteration 2") +
geom_raster(data = pred.mat, aes(x = V1, y = V2, fill = cluster), alpha = 0.3) +
geom_point(data = data_clust, aes(x = V1, y = V2, color = cluster, shape=cluster) ) +
geom_point(data = updated_centroids, aes(x = V1, y = V2, color = cluster, shape=cluster), size=5) +
theme_bw() + coord_cartesian(expand = FALSE)
km <- kmeans(data, centers=updated_centroids[,-1], algorithm = "Forgy", iter.max = 2)
updated_centroids = aggregate(data, by=list(cluster=factor(km$cluster)), mean)
data_clust <- data.frame(data, cluster = factor(km$cluster))
pred.mat$cluster <- factor(cl_predict(km, pred.mat))
p3 <- ggplot() + labs(title = "iteration 3") +
geom_raster(data = pred.mat, aes(x = V1, y = V2, fill = cluster), alpha = 0.3) +
geom_point(data = data_clust, aes(x = V1, y = V2, color = cluster, shape=cluster) ) +
geom_point(data = updated_centroids, aes(x = V1, y = V2, color = cluster, shape=cluster), size=5) +
theme_bw() + coord_cartesian(expand = FALSE)
km <- kmeans(data, centers=updated_centroids[,-1], algorithm = "Forgy", iter.max = 2)
updated_centroids = aggregate(data, by=list(cluster=factor(km$cluster)), mean)
data_clust <- data.frame(data, cluster = factor(km$cluster))
pred.mat$cluster <- factor(cl_predict(km, pred.mat))
p4 <- ggplot() + labs(title = "iteration 4") +
geom_raster(data = pred.mat, aes(x = V1, y = V2, fill = cluster), alpha = 0.3) +
geom_point(data = data_clust, aes(x = V1, y = V2, color = cluster, shape=cluster) ) +
geom_point(data = updated_centroids, aes(x = V1, y = V2, color = cluster, shape=cluster), size=5) +
theme_bw() + coord_cartesian(expand = FALSE)
library(gridExtra)
grid.arrange(p1, p2, p3, p4, ncol=2)
As a generalization of k-means, model-based clustering assumes that the observations in each cluster are generated by a distinct distribution. Model-based clustering first estimates these distributions simultaneously and then uses the estimated distributions to segment the data points. Among all distributions, the Gaussian distribution is the most frequently assumed, in which case model-based clustering amounts to fitting a Gaussian mixture model (GMM) to the data. Note that when the Gaussian components are assumed to share an identical spherical covariance matrix, i.e., a diagonal matrix with equal diagonal elements, \(\Sigma = \lambda I\), model-based clustering essentially reduces to k-means.
Example: Gaussian mixture model (GMM). Assuming the data are generated from a mixture of Gaussian distributions, our goal is to assign each data point to the appropriate Gaussian component.
Using the same mall-customer data set, we fit a GMM with five components and visualize the resulting clusters.
set.seed(111)
library(factoextra)
library(mclust)
data = read.csv("./data/Mall_Customers.csv",check.names = FALSE)
head(data)
## CustomerID Gender Age Annual_Income Spending_Score
## 1 1 Male 19 15 39
## 2 2 Male 21 15 81
## 3 3 Female 20 16 6
## 4 4 Female 23 16 77
## 5 5 Female 31 17 40
## 6 6 Female 22 17 76
data[,3:5] = apply(data[,3:5],2,scale)
fit_gmm = Mclust(data[,4:5], G=5)
data$cluster_gmm <- fit_gmm$classification
p5 <- fviz_cluster(fit_gmm, data[,c(4,5)],
geom = "point",
ellipse.type = "norm",
main = "gmm cluster",
ggtheme = theme_bw(),
ylim = c(-2,3)
)
p5
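To illustrate the connection to k-means noted earlier, here is a sketch: constraining the Gaussian components to a common spherical covariance (mclust model "EII") produces a partition that typically agrees with the k-means partition up to a relabeling of the clusters.
fit_eii <- Mclust(data[,4:5], G = 5, modelNames = "EII") # spherical, equal-volume components
km5 <- kmeans(data[,4:5], centers = 5, nstart = 25)
table(gmm_eii = fit_eii$classification, kmeans = km5$cluster) # clusters typically agree up to relabeling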
Finally, we consider hierarchical (agglomerative) clustering. Treating each observation as a separate cluster, in each iteration the algorithm:
Identifies the two clusters that are closest together.
Merges these two most similar clusters.
The algorithm stops when all the clusters are merged into one, and the sequence of merges can be visualized as a dendrogram.
set.seed(111)
hc_result <- hclust(dist(data[,c(4,5)]))
plot(hc_result,xlab="data")
#Cut Dendrogram into 3 Clusters
rect.hclust(hc_result, k=3)
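The dendrogram only displays the hierarchy; to obtain the actual cluster labels for, say, three clusters, we can use cutree() (a short sketch):
clusters_hc <- cutree(hc_result, k = 3) # cluster label for each customer
table(clusters_hc) # cluster sizes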