UML-02: K-Means Clustering

Summary
Unlock the power of clustering. Master Lloyd's Algorithm from scratch, visualize the iterative 'Assignment-Update' loop, and solve the 'How many clusters?' dilemma with the Elbow Method.

Learning Objectives

After reading this post, you will be able to:

  • Understand the K-Means algorithm step-by-step
  • Implement Lloyd’s algorithm from scratch
  • Use the elbow method to choose the optimal number of clusters
  • Know when K-Means works well and when it fails

Theory

The Intuition: The Delivery Problem

Imagine you are a logistics manager for a city and you want to open $K$ delivery centers. Your goal is simple: minimize the average travel time from each customer to their nearest center.

  1. Initialize: You place $K$ centers at random locations.
  2. Assign: You assign every customer to their nearest center.
  3. Update: You realize the centers aren’t optimal, so you move each center to the “middle” (centroid) of its assigned customers.
  4. Repeat: You keep moving the centers and re-assigning customers until the centers stop moving.

This is exactly how K-Means works. It is an iterative algorithm that tries to find the best “centers” for your data.

What is K-Means?

K-Means partitions data into $K$ distinct, non-overlapping groups (clusters). The algorithm assigns each data point to the cluster with the nearest centroid.

graph LR
    subgraph Goal ["K-Means Goal"]
        direction LR
        A["Data Points"] --> B["K-Means"]
        B --> C["K Clusters"]
    end
    style B fill:#fff9c4
    style C fill:#c8e6c9

The K-Means Objective

Formally, K-Means minimizes the sum of squared distances from each point to its assigned centroid:

$$J = \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} \lVert x_i - \mu_k \rVert^2$$

where:

  • $N$ = number of data points
  • $K$ = number of clusters
  • $r_{ik} \in \{0, 1\}$ = 1 if point $x_i$ belongs to cluster $k$, 0 otherwise
  • $\mu_k$ = centroid of cluster $k$

This is also called the inertia or within-cluster sum of squares (WCSS).
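
To make the objective concrete, here is a minimal NumPy sketch (the toy points and labels are made up for illustration) that evaluates $J$ directly from the definition; the result is what sklearn reports as inertia_.

🐍 Python
import numpy as np

# Hypothetical toy data: 6 points already assigned to 2 clusters
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.3, 7.9], [7.8, 8.2]])
labels = np.array([0, 0, 0, 1, 1, 1])   # r_ik written as one label per point

# Centroids are the per-cluster means
centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])

# J = sum of squared distances from each point to its assigned centroid
J = np.sum(np.linalg.norm(X - centroids[labels], axis=1) ** 2)
print(f"Inertia (WCSS): J = {J:.4f}")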

Lloyd’s Algorithm

The standard K-Means algorithm, known as Lloyd’s algorithm, alternates between two steps:

graph LR
    Start(("Start")) --> Init["1. Initialize K Centroids"]
    Init --> Assignment
    subgraph Loop ["The Loop (Lloyd's Algorithm)"]
        direction TB
        Assignment["2. Assignment Step\n(Assign points to nearest)"]
        Update["3. Update Step\n(Move centroids to mean)"]
        Check{"Converged?"}
        Assignment --> Update
        Update --> Check
        Check -->|No| Assignment
    end
    Check -->|Yes| End(("Final Clusters"))
    style Loop fill:#f9fbe7,stroke:#827717,stroke-dasharray: 5 5
    style Assignment fill:#fff9c4,stroke:#fbc02d
    style Update fill:#c8e6c9,stroke:#388e3c
    style Check fill:#ffe0b2,stroke:#f57c00
| Step | Description | Formula |
|------|-------------|---------|
| Initialize | Randomly select $K$ initial centroids | - |
| Assignment | Assign each point to the nearest centroid | $r_{ik} = 1$ if $k = \arg\min_j \lVert x_i - \mu_j \rVert^2$ |
| Update | Recalculate centroids as cluster means | $\mu_k = \dfrac{\sum_i r_{ik} x_i}{\sum_i r_{ik}}$ |
| Repeat | Until centroids stop changing | - |
Convergence guarantee: Lloyd’s algorithm always converges because each step reduces (or maintains) the objective $J$. However, it may converge to a local minimum, not the global optimum.

Step-by-Step Example

Let’s trace through K-Means with $K=2$ on simple 2D data:

Iteration 0: Initialize

  • Randomly select 2 points as initial centroids

Iteration 1: Assign + Update

  • Assign each point to its nearest centroid
  • Recompute centroids as the mean of assigned points

Iteration 2+: Repeat

  • Continue until centroids stabilize
Figure: K-Means iterations (random initialization → assignment → update → convergence)
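
If you want to see the same trace numerically, here is a minimal sketch with $K=2$ (the toy points and the choice of starting centroids are made up for illustration); it prints the centroids after every assign + update pass until they stop moving.

🐍 Python
import numpy as np

# Hypothetical 2D toy data: two loose groups
X = np.array([[1.0, 2.0], [1.0, 0.0], [2.0, 1.0],
              [8.0, 8.0], [9.0, 9.0], [8.0, 10.0]])

# Iteration 0: use the first two points as initial centroids
centroids = X[[0, 1]].copy()

for iteration in range(1, 10):
    # Assignment step: index of the nearest centroid for every point
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # Update step: move each centroid to the mean of its assigned points
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    print(f"Iteration {iteration}: centroids =\n{new_centroids}")

    # Stop when the centroids no longer move
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids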

Initialization Strategies

The initial centroids strongly affect the final result. Common strategies:

| Strategy | Description | Quality |
|----------|-------------|---------|
| Random | Pick $K$ random points as centroids | ⚠️ Inconsistent |
| Random partition | Randomly assign points to clusters, then compute centroids | ⚠️ Inconsistent |
| K-Means++ | Smart initialization that spreads out the centroids | ✅ Better |
| Multiple runs | Run the algorithm $n$ times, keep the best result | ✅ Robust |
K-Means++ initialization (default in sklearn): Pick the first centroid randomly, then pick subsequent centroids with probability proportional to squared distance from existing centroids. This spreads out initial centroids and typically leads to better results.
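
To make the sampling rule concrete, here is a minimal sketch of K-Means++ seeding (a simplified illustration, not sklearn's exact implementation): after the first centroid, each new centroid is drawn with probability proportional to its squared distance from the nearest centroid chosen so far.

🐍 Python
import numpy as np

def kmeans_pp_init(X, k, seed=42):
    """Pick k initial centroids with the K-Means++ seeding rule."""
    rng = np.random.default_rng(seed)
    # First centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # Sample the next centroid with probability proportional to d^2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

sklearn applies this rule automatically via init='k-means++', the default for KMeans.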

Choosing K: The Elbow Method

A key question: how many clusters should we use?

The elbow method plots inertia (WCSS) against $K$ and looks for an “elbow” — a point where adding more clusters gives diminishing returns.

Figure: Elbow method for choosing K. Look for the "elbow" where inertia decreases sharply and then levels off.

Practical Tip:

The “elbow” isn’t always sharp! It’s often subjective. If the elbow is at K=3, but K=4 makes more business sense (e.g., 4 clothing sizes: S, M, L, XL), choose the business logic. Models are tools, not masters.

Silhouette Score

Another approach is the silhouette score, which measures how similar points are to their own cluster vs. other clusters:

$$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$$

where:

  • $a(i)$ = mean distance from point $i$ to other points in same cluster
  • $b(i)$ = mean distance to points in nearest other cluster
  • $s(i) \in [-1, 1]$: higher is better
| Score | Interpretation |
|-------|----------------|
| $\approx 1$ | Well-clustered |
| $\approx 0$ | On a cluster boundary |
| $< 0$ | Possibly assigned to the wrong cluster |
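
As a sanity check on the formula, the sketch below computes $a(i)$, $b(i)$, and $s(i)$ for a single point by hand and compares the result with sklearn's silhouette_samples (the toy data is made up for illustration).

🐍 Python
import numpy as np
from sklearn.metrics import silhouette_samples

# Hypothetical toy data: two small, well-separated clusters
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

i = 0  # compute the silhouette of the first point by hand
same = np.where(labels == labels[i])[0]
same = same[same != i]                    # exclude the point itself
other = np.where(labels != labels[i])[0]  # only one other cluster here

a = np.mean(np.linalg.norm(X[same] - X[i], axis=1))   # mean intra-cluster distance
b = np.mean(np.linalg.norm(X[other] - X[i], axis=1))  # mean distance to nearest other cluster

s = (b - a) / max(a, b)
print(f"manual  s(0) = {s:.4f}")
print(f"sklearn s(0) = {silhouette_samples(X, labels)[0]:.4f}")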

Code Practice

Now that we understand the theory and the math, let’s bring it to life with code. We’ll start by building K-Means from scratch to see the mechanics, and then use sklearn for production-ready clustering.

K-Means from Scratch

Let’s implement Lloyd’s algorithm step by step:

🐍 Python
import numpy as np
import matplotlib.pyplot as plt

class KMeansFromScratch:
    def __init__(self, n_clusters=3, max_iters=100, random_state=42):
        self.n_clusters = n_clusters
        self.max_iters = max_iters
        self.random_state = random_state
        
    def fit(self, X):
        """Step-by-step implementation of Lloyd's Algorithm"""
        np.random.seed(self.random_state)
        n_samples = X.shape[0]
        
        # Step 1: Initialize centroids randomly
        # We pick k random data points to start
        random_idx = np.random.choice(n_samples, self.n_clusters, replace=False)
        self.centroids = X[random_idx].copy()
        
        self.history = [self.centroids.copy()]
        
        for iteration in range(self.max_iters):
            # Step 2: Assign points to nearest centroid
            # We compute distance matrix and find min index for each row
            distances = self._compute_distances(X)
            self.labels = np.argmin(distances, axis=1)
            
            # Step 3: Update centroids
            # Move centroid to the mean of all points assigned to it
            new_centroids = np.array([
                X[self.labels == k].mean(axis=0) 
                for k in range(self.n_clusters)
            ])
            
            self.history.append(new_centroids.copy())
            
            # Check convergence
            # If centroids didn't move, we are done!
            if np.allclose(self.centroids, new_centroids):
                print(f"Converged at iteration {iteration + 1}")
                break
                
            self.centroids = new_centroids
            
        return self
    
    def _compute_distances(self, X):
        """
        Vectorized distance calculation.
        Returns a matrix of shape (n_samples, n_clusters)
        where [i, k] is the distance from point i to centroid k.
        """
        distances = np.zeros((X.shape[0], self.n_clusters))
        for k, centroid in enumerate(self.centroids):
            # axis=1 means we sum across the columns (x, y coordinates)
            distances[:, k] = np.linalg.norm(X - centroid, axis=1)
        return distances
    
    def predict(self, X):
        """Assign new points to the nearest cluster"""
        distances = self._compute_distances(X)
        return np.argmin(distances, axis=1)
    
    def inertia(self, X):
        """Compute within-cluster sum of squares (WCSS)"""
        distances = self._compute_distances(X)
        # For each point, take the distance to its assigned centroid (min distance)
        # Square it, then sum them up
        return np.sum(np.min(distances, axis=1) ** 2)


# Generate sample data
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# Fit our implementation
kmeans = KMeansFromScratch(n_clusters=4, random_state=42)
kmeans.fit(X)

print(f"\n📊 Final centroids:\n{kmeans.centroids}")
print(f"📏 Inertia: {kmeans.inertia(X):.2f}")

Output:

Converged at iteration 7

📊 Final centroids:
[[ 4.30482919  2.37577687]
 [-6.84180708 -6.84038791]
 [ 5.33339082  1.47653843]
 [-5.73523256  8.10177082]]
📏 Inertia: 1886.16

Visualizing K-Means Steps

🐍 Python
# Visualize the algorithm iterations
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

colors = ['#e74c3c', '#3498db', '#2ecc71', '#9b59b6']

for i, ax in enumerate(axes):
    if i >= len(kmeans.history):
        ax.axis('off')
        continue
        
    centroids = kmeans.history[i]
    
    # Compute labels for this iteration's centroids
    distances = np.array([[np.linalg.norm(x - c) for c in centroids] for x in X])
    labels = np.argmin(distances, axis=1)
    
    # Plot points
    for k in range(4):
        mask = labels == k
        ax.scatter(X[mask, 0], X[mask, 1], c=colors[k], alpha=0.5, s=30)
    
    # Plot centroids
    ax.scatter(centroids[:, 0], centroids[:, 1], 
               c=colors[:len(centroids)], marker='X', s=200, 
               edgecolors='black', linewidths=2)
    
    ax.set_title(f'Iteration {i}', fontsize=12, fontweight='bold')
    ax.grid(True, alpha=0.3)

plt.suptitle('K-Means Algorithm: Step by Step', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('assets/kmeans_steps.png', dpi=150)
plt.show()
Figure: K-Means step by step on well-separated clusters; the centroids shift with each assignment/update pass until they stop moving.

Using sklearn K-Means

In practice, use sklearn’s optimized implementation:

🐍 Python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Generate data
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# Fit K-Means
# n_init=10: Run algorithm 10 times with different random starts and keep the best one.
#            Crucial to avoid getting stuck in bad local minima!
# random_state=42: For reproducibility throughout this tutorial.
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)

print("=" * 50)
print("SKLEARN K-MEANS RESULTS")
print("=" * 50)
print(f"\n📊 Cluster sizes: {np.bincount(labels)}")
print(f"📏 Inertia: {kmeans.inertia_:.2f}")
print(f"📐 Silhouette Score: {silhouette_score(X, labels):.4f}")
print(f"🔄 Iterations: {kmeans.n_iter_}")

Output:

==================================================
SKLEARN K-MEANS RESULTS
==================================================

📊 Cluster sizes: [75 75 75 75]
📏 Inertia: 362.47
📐 Silhouette Score: 0.8335
🔄 Iterations: 3

The Elbow Method

🐍 Python
# Find optimal K using elbow method
inertias = []
silhouettes = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X)
    inertias.append(kmeans.inertia_)
    silhouettes.append(silhouette_score(X, labels))

# Plot
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Elbow plot
axes[0].plot(K_range, inertias, 'bo-', linewidth=2, markersize=8)
axes[0].axvline(x=4, color='r', linestyle='--', label='Elbow at K=4')
axes[0].set_xlabel('Number of Clusters (K)', fontsize=11)
axes[0].set_ylabel('Inertia (WCSS)', fontsize=11)
axes[0].set_title('Elbow Method', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Silhouette plot
axes[1].plot(K_range, silhouettes, 'go-', linewidth=2, markersize=8)
axes[1].axvline(x=4, color='r', linestyle='--', label='Best at K=4')
axes[1].set_xlabel('Number of Clusters (K)', fontsize=11)
axes[1].set_ylabel('Silhouette Score', fontsize=11)
axes[1].set_title('Silhouette Analysis', fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('assets/elbow_method.png', dpi=150)
plt.show()
Figure: Elbow method and silhouette analysis; both suggest K=4 is optimal.

Deep Dive

When K-Means Works Well

K-Means performs best when:

| Condition | Why |
|-----------|-----|
| ✅ Spherical clusters | K-Means uses Euclidean distance and assumes roughly circular/spherical shapes |
| ✅ Similar cluster sizes | All clusters get roughly equal weight |
| ✅ Well-separated clusters | Clear boundaries between groups |
| ✅ Known number of clusters | $K$ must be specified in advance |

When K-Means Fails

K-Means struggles with:

  • Non-spherical clusters (elongated, irregular shapes)
  • Varying cluster densities (tight vs. spread out)
  • Varying cluster sizes (small vs. large)
  • Outliers (can heavily influence centroids)
Figure: K-Means fails on non-spherical clusters; it splits the two "moons" incorrectly.
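
You can reproduce this failure mode in a few lines. The sketch below (using sklearn's make_moons, with parameters chosen only for illustration) clusters two interleaving half-circles with K-Means and prints how well the labels agree with the true moons, plus the silhouette score for comparison with the blob example above.

🐍 Python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score

# Two interleaving half-circles: clearly two groups, but not spherical
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=42)

km = KMeans(n_clusters=2, random_state=42, n_init=10)
labels_moons = km.fit_predict(X_moons)

# Agreement with the true moon labels (accounting for label swapping)
agreement = max(np.mean(labels_moons == y_moons), np.mean(labels_moons != y_moons))
print(f"Agreement with true moons: {agreement:.2%}")
print(f"Silhouette score: {silhouette_score(X_moons, labels_moons):.3f}")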

Computational Complexity

| Operation | Complexity |
|-----------|------------|
| Single iteration | $O(N \cdot K \cdot d)$ |
| Total (worst case) | $O(N \cdot K \cdot d \cdot I)$ |

Where $N$ = samples, $K$ = clusters, $d$ = dimensions, $I$ = iterations.

Practical note: K-Means is very fast, even on large datasets. For millions of points, consider Mini-Batch K-Means.
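
For very large datasets, a minimal sketch of Mini-Batch K-Means with sklearn looks like this (the dataset size and batch_size are illustrative, not tuned); it updates centroids on random mini-batches instead of the full data, trading a little cluster quality for a large speedup.

🐍 Python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# A larger synthetic dataset (size chosen only for illustration)
X_big, _ = make_blobs(n_samples=100_000, centers=4, cluster_std=0.8, random_state=42)

# batch_size controls how many points each centroid update sees
mbk = MiniBatchKMeans(n_clusters=4, batch_size=1024, random_state=42, n_init=10)
labels_big = mbk.fit_predict(X_big)

print(f"Inertia: {mbk.inertia_:.2f}")
print(f"Iterations: {mbk.n_iter_}")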

Frequently Asked Questions

Q1: How do I choose K if the elbow isn’t clear?

Options:

  1. Use silhouette score (higher is better)
  2. Apply domain knowledge (what makes sense for your problem?)
  3. Try Gap statistic (compares to null reference)
  4. Use hierarchical clustering first to guide K selection

Q2: Why do I get different results each time?

K-Means is sensitive to initialization. Solutions:

  1. Set random_state for reproducibility
  2. Set n_init=10 to run the algorithm several times and keep the best result
  3. Use K-Means++ initialization (default in sklearn)

Q3: Can K-Means handle categorical data?

No. K-Means uses Euclidean distance, which requires numerical features. Alternatives:

  • K-Modes: For categorical data
  • K-Prototypes: For mixed data
  • Convert categories to one-hot encoding (with caution)
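
If you do go the one-hot route, here is a minimal sketch (hypothetical toy columns, not a recommendation) of encoding a categorical feature before K-Means; the comments note why this needs care.

🐍 Python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one categorical column and one numeric column
colors = np.array([["red"], ["blue"], ["blue"], ["green"], ["red"], ["green"]])
spend = np.array([[10.0], [200.0], [180.0], [15.0], [12.0], [190.0]])

# One-hot encode the categories, then stack with the numeric feature
onehot = OneHotEncoder().fit_transform(colors).toarray()
X_mixed = np.hstack([onehot, spend])

# Caution: the unscaled numeric column dominates the Euclidean distances here,
# so the one-hot columns barely matter. Scale features before clustering.
labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X_mixed)
print(labels)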

Summary

| Concept | Key Points |
|---------|------------|
| K-Means | Partition data into $K$ clusters by minimizing WCSS |
| Lloyd's Algorithm | Alternate: assign points → update centroids |
| Initialization | Use K-Means++ for better results |
| Choosing K | Elbow method, silhouette score |
| Limitations | Assumes spherical, equal-sized clusters |
| Complexity | $O(N \cdot K \cdot d \cdot I)$, very fast |

References

  • sklearn KMeans documentation
  • Lloyd, S. (1982). "Least squares quantization in PCM." IEEE Transactions on Information Theory.
  • Arthur, D. & Vassilvitskii, S. (2007). "k-means++: The advantages of careful seeding." SODA.
  • Bishop, C. (2006). Pattern Recognition and Machine Learning, Chapter 9.1.