UML-03: Hierarchical Clustering

Summary
Master Hierarchical Clustering: the 'Digital Librarian' of machine learning. Understand how to organize data into nested trees, interpret dendrograms, and choose the right linkage method without needing to guess 'K' upfront.

Learning Objectives

After reading this post, you will be able to:

  • Understand the difference between agglomerative and divisive clustering
  • Master the four main linkage methods and when to use each
  • Read and interpret dendrograms for data exploration
  • Choose the optimal number of clusters using dendrogram analysis

Theory

The Intuition: The Digital Librarian

Imagine you have a messy hard drive with thousands of random files (images, documents, code). You don’t just dump them into 3 arbitrary buckets. Instead, you organize them hierarchically:

  1. Files go into specific sub-folders (e.g., “vacation_photos”).
  2. Sub-folders go into broader categories (e.g., “Photos”).
  3. Categories go into the root drive (e.g., “D: Drive”).

This is exactly what Hierarchical Clustering does. It organizes your data into a tree of nested clusters, giving you a complete map of how data points relate to each other at different levels of granularity.

What is Hierarchical Clustering?

Unlike K-Means, which forces you to choose $K$ clusters upfront (flat clustering), Hierarchical Clustering builds a tree structure (Dendrogram) that preserves the relationships between all points.

graph TD
    subgraph Structure ["The Hierarchy (Dendrogram)"]
        direction TB
        Root["📂 Root (All Data)"]
        Root --> GroupA["📁 Folder A"]
        Root --> GroupB["📁 Folder B"]
        GroupA --> SubA1["📂 Sub-folder A1"]
        GroupA --> SubA2["📂 Sub-folder A2"]
        SubA1 --> F1["📄 File 1"]
        SubA1 --> F2["📄 File 2"]
        GroupB --> F3["📄 File 3"]
    end
    style Root fill:#e1f5fe
    style GroupA fill:#fff9c4
    style GroupB fill:#fff9c4
    style SubA1 fill:#c8e6c9
    style F1 fill:#ffccbc

Two main approaches exist:

| Approach | Direction | Strategy |
| --- | --- | --- |
| Agglomerative (bottom-up) | ⬆️ Merge | Start with N clusters, merge similar ones |
| Divisive (top-down) | ⬇️ Split | Start with 1 cluster, split into smaller ones |

In practice: Agglomerative clustering is far more common because it’s computationally simpler. This post focuses on agglomerative methods.

Agglomerative Clustering Algorithm

The algorithm proceeds as follows (a minimal code sketch follows the steps):

graph LR
    subgraph Process ["The Agglomerative Process"]
        direction LR
        Start(("Start\n(N Clusters)")) --> Find["Find Closest Pair"]
        Find --> Merge["Merge Pairs"]
        Merge --> Check{"1 Cluster Left?"}
        Check -->|No| Find
        Check -->|Yes| End(("Stop\n(1 Giant Cluster)"))
    end
    style Start fill:#e1f5fe
    style Merge fill:#fff9c4,stroke:#fbc02d
    style End fill:#c8e6c9
  1. Initialize: Each data point is its own cluster ($N$ clusters)
  2. Find closest pair: Identify the two most similar clusters
  3. Merge: Combine them into a single cluster
  4. Repeat: Until only one cluster remains
  5. Output: A hierarchical tree (dendrogram)
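
To make these steps concrete, here is a minimal sketch using SciPy's linkage matrix, which records every merge: the two clusters joined, the merge distance, and the new cluster size. The toy data and the names `X_demo` and `Z_demo` are purely illustrative.

🐍 Python
import numpy as np
from scipy.cluster.hierarchy import linkage

# A tiny illustrative dataset (6 points, 2 features)
X_demo = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [9.0, 9.0]])

# Each row of the linkage matrix describes one merge:
# [cluster_i, cluster_j, merge_distance, new_cluster_size]
Z_demo = linkage(X_demo, method='ward')

for step, (i, j, dist, size) in enumerate(Z_demo, start=1):
    print(f"Step {step}: merge clusters {int(i)} and {int(j)} "
          f"at distance {dist:.2f} -> new size {int(size)}")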

Linkage: The Rules of Organization

The librarian needs a rule to decide which folders to merge next. How do we measure distance between clusters?

Different linkage methods define cluster distance differently:

| Linkage | Nickname | Distance Definition | Characteristics |
| --- | --- | --- | --- |
| Single | “The Optimist” | Min distance (closest points) | Creates long, chain-like clusters. Good for non-globular shapes. |
| Complete | “The Skeptic” | Max distance (farthest points) | Creates compact, spherical bubbles. Sensitive to outliers. |
| Average | “The Democrat” | Average of all pairs | Balanced, but affected by noise. |
| Ward | “The Architect” | Minimize variance increase | Creates similarly sized, spherical clusters. Default and usually best. |
Comparison of linkage methods
Linkage methods produce different cluster structures: single linkage forms long chains, while complete linkage forms compact, spherical groups

Mathematical Definitions

For clusters $A$ and $B$:

Single linkage: $$d(A, B) = \min_{a \in A,\, b \in B} \|a - b\|$$

Complete linkage: $$d(A, B) = \max_{a \in A,\, b \in B} \|a - b\|$$

Average linkage: $$d(A, B) = \frac{1}{|A||B|} \sum_{a \in A} \sum_{b \in B} \|a - b\|$$

Ward’s method: $$d(A, B) = \sqrt{\frac{2|A||B|}{|A|+|B|}}\, \|\mu_A - \mu_B\|$$
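
As a quick sanity check, the sketch below evaluates these four definitions directly with NumPy on two tiny clusters; the arrays `A` and `B` are made up for illustration.

🐍 Python
import numpy as np

# Two small illustrative clusters (rows are points)
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [5.0, 0.0]])

# Pairwise distances between every a in A and b in B
pair_d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

single   = pair_d.min()    # distance of the closest pair
complete = pair_d.max()    # distance of the farthest pair
average  = pair_d.mean()   # mean over all |A| * |B| pairs

# Ward: centroid distance scaled by the cluster sizes
nA, nB = len(A), len(B)
ward = np.sqrt(2 * nA * nB / (nA + nB)) * np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))

print(f"single={single:.2f}, complete={complete:.2f}, average={average:.2f}, ward={ward:.2f}")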

Recommendation: Start with Ward’s linkage. It tends to produce well-balanced clusters and is less sensitive to noise than single linkage.

The Dendrogram: The Map of Your Data

A dendrogram is a tree diagram showing the full history of merges or splits. It is essentially the “directory tree” of your data.

Anatomy of a dendrogram
Dendrogram structure: leaves are data points, height shows merge distance, horizontal cut gives clusters

Key elements:

  • Leaves: Individual data points (bottom)
  • Vertical lines: Clusters being merged
  • Height: Distance at which merge occurs
  • Horizontal cut: Determines number of clusters

To get $K$ clusters, draw a horizontal line that crosses exactly $K$ vertical lines.
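
In code, this horizontal cut corresponds to SciPy's `fcluster` with `criterion='distance'`: every merge above the chosen height is ignored, and the surviving sub-trees become the clusters. A minimal sketch, where the data and the cut height `cut_height` are illustrative:

🐍 Python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X_demo = np.random.RandomState(0).randn(30, 2)    # illustrative data
Z_demo = linkage(X_demo, method='ward')

cut_height = 3.0                                  # height of the horizontal line
labels = fcluster(Z_demo, t=cut_height, criterion='distance')
print("Number of clusters at this cut:", len(np.unique(labels)))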

Choosing the Number of Clusters

Unlike K-Means, hierarchical clustering doesn’t require specifying $K$ upfront. Two approaches:

1. Visual inspection of dendrogram

  • Look for large gaps in merge heights
  • Cut where there’s the biggest “jump” (see the sketch after this list)

2. Inconsistency coefficient

  • Compare each merge height to previous merges
  • High inconsistency suggests a natural boundary
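
One simple way to automate the “biggest jump” heuristic is to look at the gaps between consecutive merge heights stored in the linkage matrix (its third column). This is a rough sketch rather than a formal criterion, and the data here are illustrative:

🐍 Python
import numpy as np
from scipy.cluster.hierarchy import linkage

X_demo = np.random.RandomState(0).randn(40, 2)    # illustrative data
Z_demo = linkage(X_demo, method='ward')

# Column 2 of the linkage matrix holds merge distances in increasing order
heights = Z_demo[:, 2]
gaps = np.diff(heights)

# The largest gap marks a merge that joins clusters much farther apart
# than anything merged before it, which is a natural place to cut.
idx = int(np.argmax(gaps))
suggested_k = len(heights) - idx   # clusters remaining just below that merge
print(f"Largest gap after merge {idx + 1}; suggested number of clusters: {suggested_k}")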

Code Practice

Now, let’s hand over the job to the algorithm. We will let Python act as the digital librarian, automatically sorting our raw data into a structured hierarchy.

Agglomerative Clustering with sklearn

๐Ÿ Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# Generate sample data
np.random.seed(42)
X, y_true = make_blobs(n_samples=50, centers=3, cluster_std=0.7, random_state=42)

# Perform agglomerative clustering
# linkage='ward': minimizes the increase in within-cluster variance (good for spherical clusters)
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg.fit_predict(X)

print("=" * 50)
print("AGGLOMERATIVE CLUSTERING RESULTS")
print("=" * 50)
print(f"\n๐Ÿ“Š Cluster sizes: {np.bincount(labels)}")
print(f"๐Ÿ”— Linkage method: Ward")

Output:

==================================================
AGGLOMERATIVE CLUSTERING RESULTS
==================================================

📊 Cluster sizes: [17 16 17]
🔗 Linkage method: Ward
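
If you prefer to inspect the merge tree from the scikit-learn estimator itself (rather than recomputing it with SciPy, as we do in the next section), the fitted model exposes a `children_` array describing each merge, and passing `compute_distances=True` additionally stores the merge distances. A brief sketch reusing `X` from the block above:

🐍 Python
# Reuses X from the previous example
agg_tree = AgglomerativeClustering(n_clusters=3, linkage='ward', compute_distances=True)
agg_tree.fit(X)

# children_[i] lists the two clusters merged at step i
# (indices < n_samples are original points; larger indices refer to earlier merges)
for i, (left, right) in enumerate(agg_tree.children_[:5]):
    print(f"Merge {i}: {left} + {right} at distance {agg_tree.distances_[i]:.3f}")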

Creating and Interpreting Dendrograms

๐Ÿ Python
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

# Compute linkage matrix
Z = linkage(X, method='ward')

# Create dendrogram
fig, ax = plt.subplots(figsize=(12, 6))
dendrogram(Z, 
           truncate_mode='lastp',  # Simplify the plot: only show the last p merged clusters
           p=20,                    # Show only the top 20 merges (cleaner visualization)
           leaf_rotation=90,
           leaf_font_size=10,
           show_contracted=True)    # Show small dots for contracted clusters

ax.set_title('Hierarchical Clustering Dendrogram (Ward)', fontsize=14, fontweight='bold')
ax.set_xlabel('Sample Index or Cluster Size', fontsize=11)
ax.set_ylabel('Distance (Ward)', fontsize=11)
ax.axhline(y=10, color='r', linestyle='--', label='Cut for 3 clusters')
ax.legend()

plt.tight_layout()
plt.savefig('assets/dendrogram_example.png', dpi=150)
plt.show()

# Get cluster labels by cutting the dendrogram
labels_cut = fcluster(Z, t=3, criterion='maxclust')
print(f"\n๐Ÿ“Š Labels from dendrogram cut: {np.bincount(labels_cut)[1:]}")

Output:

📊 Labels from dendrogram cut: [16 17 17]
Dendrogram with cut line
Dendrogram with horizontal cut line: cutting at height 10 gives 3 clusters

Comparing Linkage Methods

๐Ÿ Python
from sklearn.datasets import make_moons

# Generate non-spherical data
X_moons, y_moons = make_moons(n_samples=200, noise=0.05, random_state=42)

# Compare linkage methods
linkage_methods = ['single', 'complete', 'average', 'ward']

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.flatten()

colors = ['#e74c3c', '#3498db']

for ax, method in zip(axes, linkage_methods):
    # We try 2 clusters for everything to see how they split the "moons"
    agg = AgglomerativeClustering(n_clusters=2, linkage=method)
    labels = agg.fit_predict(X_moons)
    
    for i in range(2):
        mask = labels == i
        ax.scatter(X_moons[mask, 0], X_moons[mask, 1], 
                   c=colors[i], alpha=0.6, s=40, edgecolors='white')
    
    ax.set_title(f'{method.capitalize()} Linkage', fontsize=12, fontweight='bold')
    ax.grid(True, alpha=0.3)

plt.suptitle('Linkage Methods on Moon-Shaped Data', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('assets/linkage_comparison.png', dpi=150)
plt.show()
Comparison of linkage methods on moons data
Single linkage handles non-spherical clusters; Ward struggles with elongated shapes

Key insight: Single linkage correctly identifies the moon shapes because it only needs one close pair to merge clusters. Ward linkage fails because it assumes spherical clusters.

When to Use Which Linkage?

๐Ÿ Python
from sklearn.metrics import adjusted_rand_score

print("=" * 50)
print("LINKAGE METHOD COMPARISON (Moons Data)")
print("=" * 50)

for method in linkage_methods:
    agg = AgglomerativeClustering(n_clusters=2, linkage=method)
    labels = agg.fit_predict(X_moons)
    ari = adjusted_rand_score(y_moons, labels)
    print(f"{method.capitalize():10} Linkage: ARI = {ari:.4f}")

Output:

==================================================
LINKAGE METHOD COMPARISON (Moons Data)
==================================================
Single     Linkage: ARI = 1.0000
Complete   Linkage: ARI = 0.4461
Average    Linkage: ARI = 0.4464
Ward       Linkage: ARI = 0.4464

Interpreting the Results

Notice the stark difference!

  • Single Linkage (“The Optimist”) achieved a perfect score (ARI=1.0) because it could “chain” along the curved moon shapes. It only needs one pair of close points to merge clusters.
  • Ward (“The Architect”) failed (ARI ~0.45) because it tries to minimize variance, forcing the data into spherical (ball-like) shapes. It cut right through the crescents!

Decision Guide

So, which one should you choose?

| Method | Strength | Best For… |
| --- | --- | --- |
| Ward | Balanced, cohesive clusters | General purpose. This is your default choice. Works best for noisy data or when you want clusters of similar sizes. |
| Single | Follows continuous shapes | Non-globular data (like the moons above) or detecting outliers. Warning: suffers from the “chaining” effect on noisy data. |
| Complete | Compact clusters | When you want to ensure the diameter of clusters is small (all points are somewhat similar). |
| Average | Compromise | A middle ground. Use when Ward is too strict about spherical shapes and Single is too loose. |

Deep Dive

Hierarchical vs K-Means

| Aspect | Hierarchical | K-Means |
| --- | --- | --- |
| Specify K upfront | No | Yes |
| Dendrogram | Yes | No |
| Complexity | $O(N^2 \log N)$ to $O(N^3)$ | $O(N \cdot K \cdot d \cdot I)$ |
| Cluster shapes | Flexible (with single linkage) | Spherical |
| Scalability | Poor for large N | Good |
| Reproducibility | Deterministic | Random (depends on init) |

Computational Complexity

| Operation | Complexity |
| --- | --- |
| Distance matrix | $O(N^2)$ |
| Single/Complete linkage | $O(N^2 \log N)$ |
| Average/Ward linkage | $O(N^2 \log N)$ to $O(N^3)$ |
| Memory | $O(N^2)$ |

Scalability warning: Storing the $N \times N$ distance matrix becomes impractical for large datasets (e.g., 100K+ points). For big data, consider:

  • Mini-batch approaches
  • BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
  • K-Means first, then hierarchical on cluster centers (see the sketch below)
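
Here is a minimal sketch of the last idea: compress the data with Mini-Batch K-Means into a few hundred prototypes, then run hierarchical clustering on those prototypes only. The dataset and the value of `n_prototypes` are illustrative.

🐍 Python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import MiniBatchKMeans, AgglomerativeClustering

# Pretend this is "big" data
X_big, _ = make_blobs(n_samples=100_000, centers=5, random_state=42)

# Step 1: compress the data into a manageable number of prototypes
n_prototypes = 200
mbk = MiniBatchKMeans(n_clusters=n_prototypes, random_state=42)
proto_labels = mbk.fit_predict(X_big)

# Step 2: hierarchical clustering on the prototypes only (200 points instead of 100,000)
agg = AgglomerativeClustering(n_clusters=5, linkage='ward')
center_labels = agg.fit_predict(mbk.cluster_centers_)

# Step 3: map every original point to its prototype's hierarchical cluster
final_labels = center_labels[proto_labels]
print("Final cluster sizes:", np.bincount(final_labels))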

Frequently Asked Questions

Q1: When should I use hierarchical clustering over K-Means?

Use hierarchical clustering when:

  • You don’t know how many clusters exist
  • You want to explore data structure at multiple scales
  • You need a dendrogram for interpretation
  • Dataset is small enough ($N < 10000$)

Q2: Can I use hierarchical clustering with categorical data?

Yes, but you need an appropriate distance metric (a sketch follows the list):

  • Hamming distance: Count of differing attributes
  • Jaccard distance: For binary/set data
  • Gower distance: For mixed data types
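
For instance, with binary attributes you can precompute Hamming distances with SciPy and feed the condensed distance matrix directly into `linkage` (Ward is not appropriate here because it assumes Euclidean geometry, so the sketch uses average linkage). The data are made up for illustration.

🐍 Python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up binary categorical data: 8 items, 5 yes/no attributes
X_cat = np.array([[1, 0, 1, 1, 0],
                  [1, 0, 1, 0, 0],
                  [1, 1, 1, 1, 0],
                  [0, 1, 0, 0, 1],
                  [0, 1, 0, 1, 1],
                  [0, 0, 0, 0, 1],
                  [1, 0, 0, 1, 0],
                  [0, 1, 1, 0, 1]])

# Hamming distance = fraction of differing attributes
D = pdist(X_cat, metric='hamming')

# Average linkage on the precomputed (condensed) distance matrix
Z_cat = linkage(D, method='average')
labels_cat = fcluster(Z_cat, t=2, criterion='maxclust')
print("Cluster assignments:", labels_cat)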

Summary

| Concept | Key Points |
| --- | --- |
| Hierarchical Clustering | Builds tree of clusters without specifying K |
| Agglomerative | Bottom-up: merge similar clusters |
| Linkage Methods | Single, Complete, Average, Ward |
| Dendrogram | Visual representation of cluster hierarchy |
| Choosing K | Cut dendrogram where big gaps appear |
| Trade-off | More interpretable, but $O(N^2)$ complexity |

References