   class glucose insulin sspg
1 Normal      80     356  124
2 Normal      97     289  117
3 Normal     105     319  143
4 Normal      90     356  199
5 Normal      90     323  240
6 Normal      86     381  157
traditional clustering methods …
model based clustering …
Each component (cluster) \(k\) is modeled by the normal distribution (Kassambara 2017)
mean vector
covariance matrix
… an associated probability in the mixture. Each point has a probability of belonging to each cluster
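Written out, the mixture could take the following form. The symbols \(K\) (number of components), \(\pi_k\) (mixing probability), \(\mu_k\) (mean vector), \(\Sigma_k\) (covariance matrix) and \(\phi\) (normal density) are notation assumed here for illustration; the source does not fix them. The second line is the probability that point \(x_i\) belongs to cluster \(k\).

\[\begin{aligned} f(x) &= \sum_{k=1}^{K} \pi_k \, \phi(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1 \\ P(k \mid x_i) &= \frac{\pi_k \, \phi(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \phi(x_i \mid \mu_j, \Sigma_j)} \end{aligned}\]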
Model names consist of three letters: the 1st refers to volume, the 2nd to shape, the 3rd to orientation.
E stands for equal, V for variable, I for coordinate axes.
EVI denotes a model with equal volume, variable shape, and identity orientation (clusters aligned with the coordinate axes).
EEE means that the clusters have equal volume, equal shape, and equal orientation.
Maximum likelihood estimation (MLE) is used to fit all models for a range of \(k\) components.
Important
The BIC should only be used to compare models within one method (as in model-based clustering vs. model-based clustering), not across different modeling methods (as in linear vs. logistic regression).
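Fitting several covariance structures over a range of components could look roughly like the sketch below. It assumes the `mclust` package and its bundled `diabetes` data. The object name `X`, the range `G = 1:5` and the choice of the three model names are illustrative assumptions, not settings taken from the source.

```r
library(mclust)

# Example data shipped with mclust; the first column holds the class label
data("diabetes", package = "mclust")
X <- diabetes[, c("glucose", "insulin", "sspg")]

# Fit three covariance structures for 1 to 5 components each
# (range and model names are assumptions chosen for illustration)
fit <- Mclust(X, G = 1:5, modelNames = c("EVI", "EEE", "VVV"))

# Model name and number of components selected by BIC
fit$modelName
fit$G

# BIC for every model / component combination that was fitted
fit$BIC
```

All candidates here come from the same method, so comparing them by BIC stays within the rule stated above.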
\[\begin{aligned} \mathrm{BIC} = k \ln(n) - 2 \ln(\hat{L}) \end{aligned}\]
\(\hat{L}\): the maximized value of the likelihood function of the model
\(n\): the number of data points
\(k\): the number of parameters estimated by the model (not the number of components)
A model can appear to perform better simply by overfitting. The BIC therefore penalizes the number of model parameters, in the same way that \(r^2_{adjusted}\) corrects \(r^2\) for the number of predictors.
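The analogy with \(r^2\) and \(r^2_{adjusted}\) can be made concrete with a small simulation. This is a sketch on made-up data: the seed, the sample size and the ten noise predictors are arbitrary assumptions. It uses base R's `BIC()`, which follows the \(k \ln(n) - 2\ln(\hat{L})\) convention, so smaller values are better.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

n <- 50
x <- rnorm(n)
y <- 2 * x + rnorm(n)

# Ten pure-noise predictors that have no relation to y
noise <- as.data.frame(matrix(rnorm(n * 10), nrow = n))
dat <- cbind(data.frame(y = y, x = x), noise)

small <- lm(y ~ x, data = dat)   # the true predictor only
big   <- lm(y ~ ., data = dat)   # true predictor plus all noise columns

# r^2 cannot decrease when predictors are added
c(small = summary(small)$r.squared, big = summary(big)$r.squared)

# adjusted r^2 and BIC both charge for the extra parameters
c(small = summary(small)$adj.r.squared, big = summary(big)$adj.r.squared)
c(small = BIC(small), big = BIC(big))
```

The larger model always reaches at least the same \(r^2\), which is why an unpenalized fit measure alone cannot guard against overfitting.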
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 3 components:

 log-likelihood   n df       BIC       ICL
      -169.0908 145 29 -482.5069 -501.4662

Clustering table:
 1  2  3
81 36 28
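The BIC in this summary can be checked by hand from the reported log-likelihood, `n` and `df`. The sketch below uses only base R; the variable names are illustrative. The arithmetic reproduces the reported value only when the sign of the formula \(k \ln(n) - 2\ln(\hat{L})\) is flipped, which suggests that in this output a larger (less negative) BIC indicates the better model.

```r
# Values copied from the summary output above
loglik <- -169.0908
n      <- 145
df     <- 29   # number of estimated parameters

bic_formula <- df * log(n) - 2 * loglik   # k ln(n) - 2 ln(L-hat)
bic_mclust  <- 2 * loglik - df * log(n)   # same terms with the sign flipped

round(bic_formula, 4)
round(bic_mclust, 4)
```

Up to rounding of the printed log-likelihood, `bic_mclust` agrees with the BIC column of the summary.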


-------------------------------------------------------
Density estimation via Gaussian finite mixture modeling
-------------------------------------------------------
Mclust V (univariate, unequal variance) model with 3 components:
 log-likelihood   n df       BIC       ICL
      -178.7817 155  8 -397.9108 -458.8648
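library(tibble)   # add_column()
library(ggplot2)  # plotting

# Assumes `acidity` is a data frame with a numeric column `acidity`
# and `uv_clust` is the fitted univariate model summarized above.
tmp <- acidity |>
  add_column(
    classification = uv_clust$classification,
    uncertainty = uv_clust$uncertainty
  )

# Acidity per assigned cluster; point size encodes classification uncertainty
tmp |>
  ggplot(
    aes(
      x = classification,
      y = acidity,
      size = uncertainty
    )
  ) +
  geom_jitter() +
  # Label each observation with its rounded uncertainty (fixed text size)
  geom_text(
    aes(label = round(uncertainty, digits = 2)),
    size = 9
  )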
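The uncertainty mapped to point size can be read as the probability mass that lies outside a point's most likely cluster. A minimal sketch in base R, assuming uncertainty is defined as one minus the largest membership probability of a point; the matrix `z` is hypothetical and not taken from the fitted model.

```r
# Hypothetical membership probabilities: four points (rows), three clusters (columns)
z <- rbind(
  c(0.98, 0.01, 0.01),
  c(0.50, 0.45, 0.05),
  c(0.10, 0.85, 0.05),
  c(0.34, 0.33, 0.33)
)

classification <- apply(z, 1, which.max)  # most probable cluster per point
uncertainty    <- 1 - apply(z, 1, max)    # probability mass outside that cluster

data.frame(classification, uncertainty)
```

Points that sit clearly inside one cluster get an uncertainty near 0, while points between clusters approach the maximum of \(1 - 1/K\) for \(K\) clusters.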
Copyright Prof. Dr. Tim Weber, 2024