Agglomerative Clustering

Prof. Dr. Tim Weber

Deggendorf Institute of Technology

Agglomerative Clustering

  • Each observation is initially treated as a single cluster (a leaf)
  • The most similar clusters are successively merged until only one big cluster remains (the root)
  • The result is a tree-like representation of the objects, known as a dendrogram

The idea

What’s a dendrogram?

What is similar?

… the classification of observations into groups requires some methods for computing the distance or the (dis)similarity between each pair of observations. The result of this computation is known as a dissimilarity or distance matrix. (Kassambara 2017)

distance measures

  • Classical
    • Euclidean
    • Manhattan
  • Correlation based
    • Pearson correlation
    • Eisen cosine correlation distance
    • Spearman correlation distance
    • Kendall correlation distance

Manhattan distance

\[\begin{align} d_{man}(x,y) = \sum_{i=1}^n|x_i-y_i| \end{align}\]

Pros:

  1. Robustness to Outliers:
    • Less sensitive to outliers compared to Euclidean distance.
  2. Computational Simplicity:
    • Involves only addition and subtraction, making it computationally simpler and faster.
  3. Suitability for High-Dimensional Data:
    • Performs better in high-dimensional spaces, less affected by the “curse of dimensionality.”

Cons:

  1. Ignores Diagonal Distance:
    • Only considers movements along the coordinate axes, ignoring diagonal relationships.
  2. Dependence on Scale:
    • Affected by the scale and units of features, requiring proper normalization.
  3. Less Effective for Spherical Clusters:
    • May not capture the true geometric relationships in spherical or circular clusters as accurately as Euclidean distance.
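To make the difference concrete, both distances are available in base R's dist(); the two vectors here are made up for illustration.

```r
# Manhattan vs. Euclidean distance between two made-up vectors
x <- c(1, 2, 3)
y <- c(4, 6, 8)

# dist() computes pairwise distances between the rows of a matrix
d_man <- dist(rbind(x, y), method = "manhattan") # |1-4| + |2-6| + |3-8| = 12
d_euc <- dist(rbind(x, y), method = "euclidean") # sqrt(3^2 + 4^2 + 5^2)
```

Note the Manhattan distance sums the per-coordinate differences directly, which is why a single extreme coordinate inflates it less than the squared terms of the Euclidean distance.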

Pearson correlation distance

\[\begin{align} d_{cor}(x,y) = 1 - \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}} \end{align}\]

Spearman correlation distance

\[\begin{align} d_{spear}(x,y) = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \end{align}\]

Kendall correlation distance

\[\begin{align} d_{kend}(x,y) = 1-\frac{n_c-n_d}{\frac{1}{2}n(n-1)} \end{align}\]
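All three correlation distances can be computed from base R's cor() by subtracting the coefficient from 1; the vectors below are made up for illustration.

```r
# correlation-based distances: d = 1 - correlation coefficient
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

d_pearson  <- 1 - cor(x, y, method = "pearson")
d_spearman <- 1 - cor(x, y, method = "spearman")
d_kendall  <- 1 - cor(x, y, method = "kendall")

# each distance lies in [0, 2]:
# 0 = perfectly positively correlated, 2 = perfectly negatively correlated
```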

comparison of correlation

  1. mpg: Miles/(US) gallon
  2. cyl: Number of cylinders
  3. disp: Displacement (cu.in.)
  4. hp: Gross horsepower
  5. drat: Rear axle ratio
  6. wt: Weight (1000 lbs)
  7. qsec: 1/4 mile time
  8. vs: Engine (0 = V-shaped, 1 = straight)
  9. am: Transmission (0 = automatic, 1 = manual)
  10. gear: Number of forward gears
  11. carb: Number of carburetors
mpg cyl disp hp drat wt qsec vs am gear carb
21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

pearson

spearman

kendall

Steps to agglomerative hierarchical clustering

  1. Prepare the data
  2. Compute (dis)similarity information between every pair of objects
  3. Use a linkage function to group objects into a hierarchical cluster tree, based on the distance information generated in step 2. Objects/clusters that are in close proximity are linked together by the linkage function.
  4. Determine where to cut the hierarchical tree into clusters. This creates the partition of the data.
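The four steps above can be sketched in base R on the built-in USArrests data; the choice of Euclidean distance, complete linkage, and k = 4 are illustrative assumptions, not a recommendation.

```r
data("USArrests")
df  <- scale(USArrests)               # 1. prepare (standardize) the data
d   <- dist(df, method = "euclidean") # 2. compute the dissimilarity matrix
hc  <- hclust(d, method = "complete") # 3. build the tree with a linkage function
grp <- cutree(hc, k = 4)              # 4. cut the tree into 4 clusters
table(grp)                            # cluster sizes
```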

linkage function(s)

  • Maximum or complete linkage: The distance between two clusters is defined as the maximum value of all pairwise distances between the elements in cluster 1 and the elements in cluster 2. It tends to produce more compact clusters.
  • Minimum or single linkage: The distance between two clusters is defined as the minimum value of all pairwise distances between the elements in cluster 1 and the elements in cluster 2. It tends to produce long, “loose” clusters.

  • Mean or average linkage: The distance between two clusters is defined as the average distance between the elements in cluster 1 and the elements in cluster 2.

  • Centroid linkage: The distance between two clusters is defined as the distance between the centroid for cluster 1 (a mean vector of length \(p\) variables) and the centroid for cluster 2.

  • Ward’s minimum variance method: It minimizes the total within-cluster variance. At each step, the pair of clusters with the minimum between-cluster distance is merged.
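A quick way to see how the linkage functions differ is to fit each one to the same distance matrix and compare how faithfully each tree preserves the original distances (the cophenetic correlation, introduced below); this sketch uses the built-in USArrests data.

```r
# same distance matrix, different linkage functions
data("USArrests")
d <- dist(scale(USArrests))

for (m in c("single", "complete", "average", "centroid", "ward.D2")) {
  hc <- hclust(d, method = m)
  # cophenetic correlation: agreement between tree and distance matrix
  cat(sprintf("%-9s %.3f\n", m, cor(d, cophenetic(hc))))
}
```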

let’s get to it

data("USArrests") # get the data

df <- scale(USArrests) # scale the data

res.dist <- dist(df, method = "euclidean") # compute distance metric

res.hc <- hclust(d = res.dist, method = "ward.D2") # use the linkage function

dendrogram

verify the tree

res.coph <- cophenetic(res.hc)

cor(res.dist,res.coph)
[1] 0.6975266

so… how many groups?

… you need to find that yourself
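One common approach, a judgment call rather than a rule, is to cut the tree at a candidate k with cutree() and inspect the resulting group sizes (packages such as NbClust, or the silhouette method, can formalize the choice).

```r
data("USArrests")
res.hc <- hclust(dist(scale(USArrests)), method = "ward.D2")

grp <- cutree(res.hc, k = 4) # k = 4 is a choice, not a given
table(grp)                   # how many states fall into each cluster
```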

Comparing Dendrograms - visual

comparing dendrograms - quantitative

correlation matrix between a list of dendrograms
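Assuming the dendextend package, such a correlation matrix can be computed with cor.dendlist(); the two linkage choices here are illustrative.

```r
library(dendextend) # provides dendlist() and cor.dendlist()

data("USArrests")
d <- dist(scale(USArrests))

dend.list <- dendlist(
  as.dendrogram(hclust(d, method = "complete")),
  as.dendrogram(hclust(d, method = "ward.D2"))
)

# pairwise cophenetic correlation between the dendrograms
cor.dendlist(dend.list, method = "cophenetic")
```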

visualizing dendrograms

horizontal

phylogenic

circular

mtcars?

data("mtcars") # get the data

df <- scale(mtcars) # scale the data

library(factoextra) # get_dist() comes from the factoextra package

res.dist.spea <- get_dist(df, method = "spearman") # compute distance metric
res.dist.kend <- get_dist(df, method = "kendall") # compute distance metric


res.hc.spea <- hclust(d = res.dist.spea, method = "ward.D2") # use the linkage function
res.hc.kend <- hclust(d = res.dist.kend, method = "ward.D2") # use the linkage function

verify mtcars clustering

res.coph.spea <- cophenetic(res.hc.spea)
res.coph.kend <- cophenetic(res.hc.kend)

cor(res.dist.spea,res.coph.spea)
[1] 0.8138052
cor(res.dist.kend,res.coph.kend)
[1] 0.8177303

groups in mtcars

spearman

kendall

results spearman

results kendall

comparing spearman and kendall

Comparing Dendrograms - visual

spearman - complete vs. ward

kendall - complete vs. ward

spearman - dendrogram correlation

kendall - dendrogram correlation

complete comparison

Spearman Ward vs. Kendall Ward

final result

“Pluralitas non est ponenda sine necessitate” (“Plurality must not be posited without necessity”, Occam’s razor)

pca on data
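A sketch of the PCA step using base R's prcomp() on the scaled mtcars data:

```r
data("mtcars")

pca <- prcomp(mtcars, scale. = TRUE) # scale. = TRUE standardizes each variable
summary(pca)                         # proportion of variance per component
round(pca$rotation[, 1:4], 2)        # loadings of the first four PCs
```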

loadings of PCs

Dim1 vs. Dim2

Dim3 vs. Dim4

Cluster on Dim1 vs. Dim2

Cluster on Dim3 vs. Dim4

Cluster 1

car mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Ford Pantera L 15.8 8 351 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301 335 3.54 3.570 14.60 0 1 5 8

Cluster 2

car mpg cyl disp hp drat wt qsec vs am gear carb
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

Cluster 3

car mpg cyl disp hp drat wt qsec vs am gear carb
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1

Cluster 4

car mpg cyl disp hp drat wt qsec vs am gear carb
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2

References

Kassambara, Alboukadel. 2017. Practical Guide to Cluster Analysis in R: Unsupervised Machine Learning. Vol. 1. STHDA.