Agglomerative Clustering

Prof. Dr. Tim Weber

Deggendorf Institute of Technology

Agglomerative Clustering

  • Each observation is initially treated as a single cluster (a leaf)
  • The most similar clusters are successively merged until only one big cluster remains (the root)
  • The result is a tree-like representation of the objects, known as a dendrogram

The idea

What’s a dendrogram?

What is similar?

… the classification of observations into groups requires some methods for computing the distance or the (dis)similarity between each pair of observations. The result of this computation is known as a dissimilarity or distance matrix. (Kassambara 2017)

distance measures

  • Classical
    • Euclidean
    • Manhattan
  • Correlation based
    • Pearson correlation
    • Eisen cosine correlation distance
    • Spearman correlation distance
    • Kendall correlation distance

Manhattan distance

\[\begin{align} d_{man}(x,y) = \sum_{i=1}^n|x_i-y_i| \end{align}\]

Pros:

  1. Robustness to Outliers:
    • Less sensitive to outliers compared to Euclidean distance.
  2. Computational Simplicity:
    • Involves only addition and subtraction, making it computationally simpler and faster.
  3. Suitability for High-Dimensional Data:
    • Performs better in high-dimensional spaces, less affected by the “curse of dimensionality.”

Cons:

  1. Ignores Diagonal Distance:
    • Only considers movements along the coordinate axes, ignoring diagonal relationships.
  2. Dependence on Scale:
    • Affected by the scale and units of features, requiring proper normalization.
  3. Less Effective for Spherical Clusters:
    • May not capture the true geometric relationships in spherical or circular clusters as accurately as Euclidean distance.
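To make the difference concrete, both distances are available in base R's dist(); the two vectors here are made up for illustration.

```r
# Manhattan vs. Euclidean distance between two made-up vectors
x <- c(1, 2, 3)
y <- c(4, 6, 8)

# dist() computes pairwise distances between the rows of a matrix
d_man <- dist(rbind(x, y), method = "manhattan") # |1-4| + |2-6| + |3-8| = 12
d_euc <- dist(rbind(x, y), method = "euclidean") # sqrt(3^2 + 4^2 + 5^2)
```

Note the Manhattan distance sums the per-coordinate differences directly, which is why a single extreme coordinate inflates it less than the squared terms of the Euclidean distance.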

Pearson correlation distance

\[\begin{align} d_{cor}(x,y) = 1 - \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}} \end{align}\]

Spearman correlation distance

\[\begin{align} d_{spear}(x,y) = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \end{align}\]

Kendall correlation distance

\[\begin{align} d_{kend}(x,y) = 1-\frac{n_c-n_d}{\frac{1}{2}n(n-1)} \end{align}\]
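All three correlation distances can be computed from base R's cor() by subtracting the coefficient from 1; the vectors below are made up for illustration.

```r
# correlation-based distances: d = 1 - correlation coefficient
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

d_pearson  <- 1 - cor(x, y, method = "pearson")
d_spearman <- 1 - cor(x, y, method = "spearman")
d_kendall  <- 1 - cor(x, y, method = "kendall")

# each distance lies in [0, 2]:
# 0 = perfectly positively correlated, 2 = perfectly negatively correlated
```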

comparison of correlation

  1. mpg: Miles/(US) gallon
  2. cyl: Number of cylinders
  3. disp: Displacement (cu.in.)
  4. hp: Gross horsepower
  5. drat: Rear axle ratio
  6. wt: Weight (1000 lbs)
  7. qsec: 1/4 mile time
  8. vs: Engine (0 = V-shaped, 1 = straight)
  9. am: Transmission (0 = automatic, 1 = manual)
  10. gear: Number of forward gears
  11. carb: Number of carburetors
mpg cyl disp hp drat wt qsec vs am gear carb
21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

pearson

spearman

kendall

Steps to agglomerative hierarchical clustering

  1. Prepare the data
  2. Compute (dis)similarity information between every pair of objects
  3. Use a linkage function to group objects into a hierarchical cluster tree, based on the distance information generated in step 2. Objects/clusters that are in close proximity are linked together by the linkage function.
  4. Determine where to cut the hierarchical tree into clusters. This creates the partition of the data.
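The four steps above can be sketched in base R on the built-in USArrests data; the choice of Euclidean distance, complete linkage, and k = 4 are illustrative assumptions, not a recommendation.

```r
data("USArrests")
df  <- scale(USArrests)               # 1. prepare (standardize) the data
d   <- dist(df, method = "euclidean") # 2. compute the dissimilarity matrix
hc  <- hclust(d, method = "complete") # 3. build the tree with a linkage function
grp <- cutree(hc, k = 4)              # 4. cut the tree into 4 clusters
table(grp)                            # cluster sizes
```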

linkage function(s)

  • Maximum or complete linkage: The distance between two clusters is defined as the maximum value of all pairwise distances between the elements in cluster 1 and the elements in cluster 2. It tends to produce more compact clusters.
  • Minimum or single linkage: The distance between two clusters is defined as the minimum value of all pairwise distances between the elements in cluster 1 and the elements in cluster 2. It tends to produce long, “loose” clusters.

  • Mean or average linkage: The distance between two clusters is defined as the average distance between the elements in cluster 1 and the elements in cluster 2.

  • Centroid linkage: The distance between two clusters is defined as the distance between the centroid for cluster 1 (a mean vector of length \(p\) variables) and the centroid for cluster 2.

  • Ward’s minimum variance method: It minimizes the total within-cluster variance. At each step, the pair of clusters with the minimum between-cluster distance is merged.
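A quick way to see how the linkage functions differ is to fit each one to the same distance matrix and compare how faithfully each tree preserves the original distances (the cophenetic correlation, introduced below); this sketch uses the built-in USArrests data.

```r
# same distance matrix, different linkage functions
data("USArrests")
d <- dist(scale(USArrests))

for (m in c("single", "complete", "average", "centroid", "ward.D2")) {
  hc <- hclust(d, method = m)
  # cophenetic correlation: agreement between tree and distance matrix
  cat(sprintf("%-9s %.3f\n", m, cor(d, cophenetic(hc))))
}
```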

let’s get to it

data("USArrests") # get the data

df <- scale(USArrests) # scale the data

res.dist <- dist(df, method = "euclidean") # compute distance metric

res.hc <- hclust(d = res.dist, method = "ward.D2") # use the linkage function

dendrogram

verify the tree

res.coph <- cophenetic(res.hc)

cor(res.dist,res.coph)
[1] 0.6975266

so… how many groups?

… you need to find that yourself
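One common approach, a judgment call rather than a rule, is to cut the tree at a candidate k with cutree() and inspect the resulting group sizes (packages such as NbClust, or the silhouette method, can formalize the choice).

```r
data("USArrests")
res.hc <- hclust(dist(scale(USArrests)), method = "ward.D2")

grp <- cutree(res.hc, k = 4) # k = 4 is a choice, not a given
table(grp)                   # how many states fall into each cluster
```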

Comparing Dendrograms - visual

comparing dendrograms - quantitative

correlation matrix between a list of dendrograms
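Assuming the dendextend package, such a correlation matrix can be computed with cor.dendlist(); the two linkage choices here are illustrative.

```r
library(dendextend) # provides dendlist() and cor.dendlist()

data("USArrests")
d <- dist(scale(USArrests))

dend.list <- dendlist(
  as.dendrogram(hclust(d, method = "complete")),
  as.dendrogram(hclust(d, method = "ward.D2"))
)

# pairwise cophenetic correlation between the dendrograms
cor.dendlist(dend.list, method = "cophenetic")
```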

visualizing dendrograms

horizontal

phylogenic

circular

mtcars?

data("mtcars") # get the data

df <- scale(mtcars) # scale the data

library(factoextra) # get_dist() comes from the factoextra package

res.dist.spea <- get_dist(df, method = "spearman") # compute distance metric
res.dist.kend <- get_dist(df, method = "kendall") # compute distance metric


res.hc.spea <- hclust(d = res.dist.spea, method = "ward.D2") # use the linkage function
res.hc.kend <- hclust(d = res.dist.kend, method = "ward.D2") # use the linkage function

verify mtcars clustering

res.coph.spea <- cophenetic(res.hc.spea)
res.coph.kend <- cophenetic(res.hc.kend)

cor(res.dist.spea,res.coph.spea)
[1] 0.8138052
cor(res.dist.kend,res.coph.kend)
[1] 0.8177303

groups in mtcars

spearman

kendall

results spearman

results kendall

comparing spearman and kendall

Comparing Dendrograms - visual

spearman - complete vs. ward

kendall - complete vs. ward

spearman - dendrogram correlation

kendall - dendrogram correlation

complete comparison

Spearman Ward vs. Kendall Ward

final result

“Pluralitas non est ponenda sine necessitate” (“Plurality must not be posited without necessity”, Occam’s razor)

pca on data
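A sketch of the PCA step using base R's prcomp() on the scaled mtcars data:

```r
data("mtcars")

pca <- prcomp(mtcars, scale. = TRUE) # scale. = TRUE standardizes each variable
summary(pca)                         # proportion of variance per component
round(pca$rotation[, 1:4], 2)        # loadings of the first four PCs
```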

loadings of PCs

Dim1 vs. Dim2

Dim3 vs. Dim4

Cluster on Dim1 vs. Dim2

Cluster on Dim3 vs. Dim4

Cluster 1

car mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Ford Pantera L 15.8 8 351 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301 335 3.54 3.570 14.60 0 1 5 8

Cluster 2

car mpg cyl disp hp drat wt qsec vs am gear carb
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

Cluster 3

car mpg cyl disp hp drat wt qsec vs am gear carb
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1

Cluster 4

car mpg cyl disp hp drat wt qsec vs am gear carb
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2

References

Kassambara, Alboukadel. 2017. Practical Guide to Cluster Analysis in R: Unsupervised Machine Learning. Vol. 1. STHDA.