Model Based Clustering

Prof. Dr. Tim Weber

Deggendorf Institute of Technology

model based clustering

traditional clustering methods …

  • … are not based on formal models
  • … require the user to specify the number of clusters

model based clustering …

  • … assumes the data come from a mixture of two or more clusters
  • … uses a soft assignment (each point has a certain probability of belonging to each cluster)

Concept of model based clustering

Each component (cluster) \(k\) is modeled by a multivariate normal distribution (Kassambara 2017)

\(\mu_k\)

mean vector

\(\Sigma_k\)

covariance matrix

\(\pi_k\)

an associated mixing probability in the mixture; each point has a probability of belonging to each cluster
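Putting these pieces together, the mixture density takes the standard form (notation is the usual one for Gaussian mixtures, not taken verbatim from the slides):

\[\begin{align} f(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) \end{align}\]

where \(\pi_k \ge 0\) are the mixing probabilities with \(\sum_{k=1}^{K} \pi_k = 1\).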

introductory example

  • 3 clusters?
  • 3 ellipses similar in terms of volume, shape and orientation
    • homogeneous covariance matrices?

Estimating model parameters

  • (E)xpectation-(M)aximization (EM), initialized by hierarchical clustering
  • geometric features (shape, volume, orientation) are determined by the covariance matrix
  • different parametrizations of \(\Sigma_k\)
  • available model options: EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV, VVV

available models:

  • 1st identifier refers to volume, 2nd to shape, 3rd to orientation
  • E for equal, V for variable, I for the identity matrix (coordinate axes)
    • EVI denotes a model with Equal volume, Variable shape and an orientation given by the Identity (axis-aligned)
    • EEE means that the clusters have Equal volume, Equal shape and Equal orientation
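As a sketch of how these identifiers are used in practice with the mclust package (the restriction to EEE and VVV below is just an illustration, not part of the lecture's workflow):

```r
library(mclust)

mclust.options("emModelNames")  # model identifiers known to mclust

# restrict the search to two covariance parametrizations:
data("diabetes")
mc_sub <- Mclust(scale(diabetes[, -1]), modelNames = c("EEE", "VVV"))
mc_sub$modelName  # the identifier with the better BIC among the two
```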

choosing the best model

  • use maximum likelihood estimation (MLE) to fit all models for a range of \(k\) components
  • model selection based on the (B)ayesian (I)nformation (C)riterion (BIC)
  • a greater BIC score is considered better in model based clustering

Important

The BIC should only be used to compare models within one method (as in model based clustering vs. model based clustering), not across different modeling methods (as in linear vs. logistic regression)

Bayesian Information Criterion

\[\begin{align} BIC = k \ln(n) - 2\ln(\hat{L}) \end{align}\]

\(\hat{L}\)

the maximized value of the likelihood function of the model

\(n\)

number of data points

\(k\)

the number of parameters estimated by the model

A model can score better simply by overfitting. The BIC therefore adds a penalty on the number of model parameters (compare \(r^2\) and \(r^2_{adjusted}\)). Note that mclust reports BIC with the opposite sign, \(2\ln(\hat{L}) - k\ln(n)\), which is why larger values are better there.
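The sign convention can be checked with base R against the diabetes fit reported further below (log-likelihood \(-169.0908\), \(n = 145\), \(df = 29\)):

```r
loglik <- -169.0908  # maximized log-likelihood from summary(mc)
n      <- 145        # number of observations
k      <- 29         # number of estimated parameters (df)

bic_classic <- k * log(n) - 2 * loglik  # classic BIC, smaller is better
bic_mclust  <- 2 * loglik - k * log(n)  # mclust's convention, larger is better

round(bic_mclust, 4)  # -482.5069, as in the summary output below
```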

data for clustering

library(mclust)

data("diabetes")

head(diabetes)
   class glucose insulin sspg
1 Normal      80     356  124
2 Normal      97     289  117
3 Normal     105     319  143
4 Normal      90     356  199
5 Normal      90     323  240
6 Normal      86     381  157
  • class: diagnosis: normal, chemically diabetic and overtly diabetic. Will be excluded
  • glucose: plasma glucose response to oral glucose
  • insulin: plasma insulin response to oral glucose
  • sspg: steady-state plasma glucose (measures insulin resistance)

model output

df <- scale(diabetes[,-1])
mc <- Mclust(df)

summary(mc)
---------------------------------------------------- 
Gaussian finite mixture model fitted by EM algorithm 
---------------------------------------------------- 

Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 3
components: 

 log-likelihood   n df       BIC       ICL
      -169.0908 145 29 -482.5069 -501.4662

Clustering table:
 1  2  3 
81 36 28 

detailed model output

mc$modelName
[1] "VVV"
mc$G
[1] 3
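The soft assignments behind these results are stored on the fitted object as well; a minimal sketch (assuming the `mc` fit from above):

```r
head(mc$z, 3)            # posterior membership probabilities, one column per cluster
head(mc$classification)  # hard assignment: the most probable cluster per point
head(mc$uncertainty)     # 1 minus the highest membership probability per point
```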

visualize cluster output

model selection

show the clustering

classification uncertainty
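All three views are available through the `plot` method for `Mclust` objects; a minimal sketch (assuming the `mc` fit from above):

```r
plot(mc, what = "BIC")             # model selection: BIC per model and number of components
plot(mc, what = "classification")  # show the clustering
plot(mc, what = "uncertainty")     # classification uncertainty
```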

geysers?

model selection

show the clustering

classification uncertainty
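The geyser slides presumably refer to the `faithful` data set shipped with base R (Old Faithful eruption and waiting times); the same sequence of plots applies:

```r
library(mclust)
mc_geyser <- Mclust(faithful)  # two variables: eruptions and waiting
plot(mc_geyser, what = "BIC")
plot(mc_geyser, what = "classification")
plot(mc_geyser, what = "uncertainty")
```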

univariate clustering

clustering

uv_clust <- densityMclust(acidity)

uv_clust <- densityMclust(acidity, modelNames = "V")

summary(uv_clust)
------------------------------------------------------- 
Density estimation via Gaussian finite mixture modeling 
------------------------------------------------------- 

Mclust V (univariate, unequal variance) model with 3 components: 

 log-likelihood   n df       BIC       ICL
      -178.7817 155  8 -397.9108 -458.8648
tmp <- tibble(acidity = acidity) |> 
  add_column(
    classification = uv_clust$classification
  ) |> 
  add_column(
    uncertainty = uv_clust$uncertainty
  )

tmp |> 
  ggplot(
    aes(
      x = classification,
      y = acidity,
      size = uncertainty
    )
  )+
  geom_jitter()+
  geom_text(
    aes(
      label = uncertainty |> round(digits = 2)
    ),
    size = 9
  )

modeling visualization

cluster visualization

more viz
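`densityMclust` objects come with their own `plot` method; a minimal sketch of these views (assuming the `uv_clust` fit from above):

```r
plot(uv_clust, what = "density", data = acidity, breaks = 15)  # fitted density over a histogram
plot(uv_clust, what = "BIC")         # model selection
plot(uv_clust, what = "diagnostic")  # CDF and Q-Q plots against the data
```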

References

Kassambara, Alboukadel. 2017. Practical Guide to Cluster Analysis in R: Unsupervised Machine Learning. Vol. 1. STHDA.