UML-02: K-Means Clustering

Summary
Unlock the power of clustering. Master Lloyd's Algorithm from scratch, visualize the iterative 'Assignment-Update' loop, and solve the 'How many clusters?' dilemma with the Elbow Method.

Learning Objectives

After reading this post, you will be able to:

  • Understand the K-Means algorithm step-by-step
  • Implement Lloyd’s algorithm from scratch
  • Use the elbow method to choose the optimal number of clusters
  • Know when K-Means works well and when it fails

Theory

The Intuition: The Delivery Problem

Imagine you are a logistics manager for a city and you want to open $K$ delivery centers. Your goal is simple: minimize the average travel time from each customer to their nearest center.

  1. Initialize: You place $K$ centers at random locations.
  2. Assign: You assign every customer to their nearest center.
  3. Update: You realize the centers aren’t optimal, so you move each center to the “middle” (centroid) of its assigned customers.
  4. Repeat: You keep moving the centers and re-assigning customers until the centers stop moving.

This is exactly how K-Means works. It is an iterative algorithm that tries to find the best “centers” for your data.

What is K-Means?

K-Means partitions data into $K$ distinct, non-overlapping groups (clusters). The algorithm assigns each data point to the cluster with the nearest centroid.

graph LR
    subgraph Goal ["K-Means Goal"]
        direction LR
        A["Data Points"] --> B["K-Means"]
        B --> C["K Clusters"]
    end
    style B fill:#fff9c4
    style C fill:#c8e6c9

The K-Means Objective

Formally, K-Means minimizes the sum of squared distances from each point to its assigned centroid:

$$J = \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} \lVert x_i - \mu_k \rVert^2$$

where:

  • $N$ = number of data points
  • $K$ = number of clusters
  • $r_{ik} \in \{0, 1\}$ = 1 if point $x_i$ belongs to cluster $k$, 0 otherwise
  • $\mu_k$ = centroid of cluster $k$

This is also called the inertia or within-cluster sum of squares (WCSS).
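
To make the objective concrete, here is a minimal NumPy sketch (the toy points and labels are made up for illustration) that evaluates $J$ directly from the definition; the result is what sklearn reports as inertia_.

🐍 Python
import numpy as np

# Hypothetical toy data: 6 points already assigned to 2 clusters
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.3, 7.9], [7.8, 8.2]])
labels = np.array([0, 0, 0, 1, 1, 1])   # r_ik written as one label per point

# Centroids are the per-cluster means
centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])

# J = sum of squared distances from each point to its assigned centroid
J = np.sum(np.linalg.norm(X - centroids[labels], axis=1) ** 2)
print(f"Inertia (WCSS): J = {J:.4f}")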

Lloyd’s Algorithm

The standard K-Means algorithm, known as Lloyd’s algorithm, alternates between two steps:

graph LR
    Start(("Start")) --> Init["1. Initialize K Centroids"]
    Init --> Assignment
    subgraph Loop ["The Loop (Lloyd's Algorithm)"]
        direction TB
        Assignment["2. Assignment Step\n(Assign points to nearest)"]
        Update["3. Update Step\n(Move centroids to mean)"]
        Check{"Converged?"}
        Assignment --> Update
        Update --> Check
        Check -->|No| Assignment
    end
    Check -->|Yes| End(("Final Clusters"))
    style Loop fill:#f9fbe7,stroke:#827717,stroke-dasharray: 5 5
    style Assignment fill:#fff9c4,stroke:#fbc02d
    style Update fill:#c8e6c9,stroke:#388e3c
    style Check fill:#ffe0b2,stroke:#f57c00
| Step | Description | Formula |
|------|-------------|---------|
| Initialize | Randomly select $K$ initial centroids | - |
| Assignment | Assign each point to the nearest centroid | $r_{ik} = 1$ if $k = \arg\min_j \lVert x_i - \mu_j \rVert^2$ |
| Update | Recalculate centroids as cluster means | $\mu_k = \dfrac{\sum_i r_{ik} x_i}{\sum_i r_{ik}}$ |
| Repeat | Until centroids stop changing | - |
Convergence guarantee: Lloyd’s algorithm always converges because each step reduces (or maintains) the objective $J$. However, it may converge to a local minimum, not the global optimum.

Step-by-Step Example

Let’s trace through K-Means with $K=2$ on simple 2D data:

Iteration 0: Initialize

  • Randomly select 2 points as initial centroids

Iteration 1: Assign + Update

  • Assign each point to its nearest centroid
  • Recompute centroids as the mean of assigned points

Iteration 2+: Repeat

  • Continue until centroids stabilize
Figure: K-Means iterations (random initialization → assignment → update → convergence)
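
If you want to see the same trace numerically, here is a minimal sketch with $K=2$ (the toy points and the choice of starting centroids are made up for illustration); it prints the centroids after every assign + update pass until they stop moving.

🐍 Python
import numpy as np

# Hypothetical 2D toy data: two loose groups
X = np.array([[1.0, 2.0], [1.0, 0.0], [2.0, 1.0],
              [8.0, 8.0], [9.0, 9.0], [8.0, 10.0]])

# Iteration 0: use the first two points as initial centroids
centroids = X[[0, 1]].copy()

for iteration in range(1, 10):
    # Assignment step: index of the nearest centroid for every point
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # Update step: move each centroid to the mean of its assigned points
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    print(f"Iteration {iteration}: centroids =\n{new_centroids}")

    # Stop when the centroids no longer move
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids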

Initialization Strategies

The initial centroids strongly affect the final result. Common strategies:

| Strategy | Description | Quality |
|----------|-------------|---------|
| Random | Pick $K$ random points as centroids | ⚠️ Inconsistent |
| Random partition | Randomly assign points to clusters, then compute centroids | ⚠️ Inconsistent |
| K-Means++ | Smart initialization that spreads out the centroids | ✅ Better |
| Multiple runs | Run the algorithm $n$ times, keep the best result | ✅ Robust |
K-Means++ initialization (default in sklearn): Pick the first centroid randomly, then pick subsequent centroids with probability proportional to squared distance from existing centroids. This spreads out initial centroids and typically leads to better results.
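
To make the sampling rule concrete, here is a minimal sketch of K-Means++ seeding (a simplified illustration, not sklearn's exact implementation): after the first centroid, each new centroid is drawn with probability proportional to its squared distance from the nearest centroid chosen so far.

🐍 Python
import numpy as np

def kmeans_pp_init(X, k, seed=42):
    """Pick k initial centroids with the K-Means++ seeding rule."""
    rng = np.random.default_rng(seed)
    # First centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # Sample the next centroid with probability proportional to d^2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

sklearn applies this rule automatically via init='k-means++', the default for KMeans.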

Choosing K: The Elbow Method

A key question: how many clusters should we use?

The elbow method plots inertia (WCSS) against $K$ and looks for an “elbow” — a point where adding more clusters gives diminishing returns.

Figure: Elbow method for choosing K. Look for the "elbow" where inertia decreases sharply and then levels off.

Practical Tip:

The “elbow” isn’t always sharp! It’s often subjective. If the elbow is at K=3, but K=4 makes more business sense (e.g., 4 clothing sizes: S, M, L, XL), choose the business logic. Models are tools, not masters.

Silhouette Score

Another approach is the silhouette score, which measures how similar points are to their own cluster vs. other clusters:

$$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$$

where:

  • $a(i)$ = mean distance from point $i$ to other points in same cluster
  • $b(i)$ = mean distance to points in nearest other cluster
  • $s(i) \in [-1, 1]$: higher is better
| Score | Interpretation |
|-------|----------------|
| $\approx 1$ | Well-clustered |
| $\approx 0$ | On a cluster boundary |
| $< 0$ | Possibly assigned to the wrong cluster |
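
As a sanity check on the formula, the sketch below computes $a(i)$, $b(i)$, and $s(i)$ for a single point by hand and compares the result with sklearn's silhouette_samples (the toy data is made up for illustration).

🐍 Python
import numpy as np
from sklearn.metrics import silhouette_samples

# Hypothetical toy data: two small, well-separated clusters
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

i = 0  # compute the silhouette of the first point by hand
same = np.where(labels == labels[i])[0]
same = same[same != i]                    # exclude the point itself
other = np.where(labels != labels[i])[0]  # only one other cluster here

a = np.mean(np.linalg.norm(X[same] - X[i], axis=1))   # mean intra-cluster distance
b = np.mean(np.linalg.norm(X[other] - X[i], axis=1))  # mean distance to nearest other cluster

s = (b - a) / max(a, b)
print(f"manual  s(0) = {s:.4f}")
print(f"sklearn s(0) = {silhouette_samples(X, labels)[0]:.4f}")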

Code Practice

Now that we understand the theory and the math, let’s bring it to life with code. We’ll start by building K-Means from scratch to see the mechanics, and then use sklearn for production-ready clustering.

K-Means from Scratch

Let’s implement Lloyd’s algorithm step by step:

🐍 Python
import numpy as np
import matplotlib.pyplot as plt

class KMeansFromScratch:
    def __init__(self, n_clusters=3, max_iters=100, random_state=42):
        self.n_clusters = n_clusters
        self.max_iters = max_iters
        self.random_state = random_state
        
    def fit(self, X):
        """Step-by-step implementation of Lloyd's Algorithm"""
        np.random.seed(self.random_state)
        n_samples = X.shape[0]
        
        # Step 1: Initialize centroids randomly
        # We pick k random data points to start
        random_idx = np.random.choice(n_samples, self.n_clusters, replace=False)
        self.centroids = X[random_idx].copy()
        
        self.history = [self.centroids.copy()]
        
        for iteration in range(self.max_iters):
            # Step 2: Assign points to nearest centroid
            # We compute distance matrix and find min index for each row
            distances = self._compute_distances(X)
            self.labels = np.argmin(distances, axis=1)
            
            # Step 3: Update centroids
            # Move centroid to the mean of all points assigned to it
            new_centroids = np.array([
                X[self.labels == k].mean(axis=0) 
                for k in range(self.n_clusters)
            ])
            
            self.history.append(new_centroids.copy())
            
            # Check convergence
            # If centroids didn't move, we are done!
            if np.allclose(self.centroids, new_centroids):
                print(f"Converged at iteration {iteration + 1}")
                break
                
            self.centroids = new_centroids
            
        return self
    
    def _compute_distances(self, X):
        """
        Vectorized distance calculation.
        Returns a matrix of shape (n_samples, n_clusters)
        where [i, k] is the distance from point i to centroid k.
        """
        distances = np.zeros((X.shape[0], self.n_clusters))
        for k, centroid in enumerate(self.centroids):
            # axis=1 means we sum across the columns (x, y coordinates)
            distances[:, k] = np.linalg.norm(X - centroid, axis=1)
        return distances
    
    def predict(self, X):
        """Assign new points to the nearest cluster"""
        distances = self._compute_distances(X)
        return np.argmin(distances, axis=1)
    
    def inertia(self, X):
        """Compute within-cluster sum of squares (WCSS)"""
        distances = self._compute_distances(X)
        # For each point, take the distance to its assigned centroid (min distance)
        # Square it, then sum them up
        return np.sum(np.min(distances, axis=1) ** 2)


# Generate sample data
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# Fit our implementation
kmeans = KMeansFromScratch(n_clusters=4, random_state=42)
kmeans.fit(X)

print(f"\n📊 Final centroids:\n{kmeans.centroids}")
print(f"📏 Inertia: {kmeans.inertia(X):.2f}")

Output:

Converged at iteration 7

📊 Final centroids:
[[ 4.30482919  2.37577687]
 [-6.84180708 -6.84038791]
 [ 5.33339082  1.47653843]
 [-5.73523256  8.10177082]]
📏 Inertia: 1886.16

Visualizing K-Means Steps

🐍 Python
# Visualize the algorithm iterations
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

colors = ['#e74c3c', '#3498db', '#2ecc71', '#9b59b6']

for i, ax in enumerate(axes):
    if i >= len(kmeans.history):
        ax.axis('off')
        continue
        
    centroids = kmeans.history[i]
    
    # Compute labels for this iteration's centroids
    distances = np.array([[np.linalg.norm(x - c) for c in centroids] for x in X])
    labels = np.argmin(distances, axis=1)
    
    # Plot points
    for k in range(4):
        mask = labels == k
        ax.scatter(X[mask, 0], X[mask, 1], c=colors[k], alpha=0.5, s=30)
    
    # Plot centroids
    ax.scatter(centroids[:, 0], centroids[:, 1], 
               c=colors[:len(centroids)], marker='X', s=200, 
               edgecolors='black', linewidths=2)
    
    ax.set_title(f'Iteration {i}', fontsize=12, fontweight='bold')
    ax.grid(True, alpha=0.3)

plt.suptitle('K-Means Algorithm: Step by Step', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('assets/kmeans_steps.png', dpi=150)
plt.show()
Figure: K-Means step by step on well-separated clusters; the centroids shift with each assignment/update pass until they stop moving.

Using sklearn K-Means

In practice, use sklearn’s optimized implementation:

🐍 Python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Generate data
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# Fit K-Means
# n_init=10: Run algorithm 10 times with different random starts and keep the best one.
#            Crucial to avoid getting stuck in bad local minima!
# random_state=42: For reproducibility throughout this tutorial.
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)

print("=" * 50)
print("SKLEARN K-MEANS RESULTS")
print("=" * 50)
print(f"\n📊 Cluster sizes: {np.bincount(labels)}")
print(f"📏 Inertia: {kmeans.inertia_:.2f}")
print(f"📐 Silhouette Score: {silhouette_score(X, labels):.4f}")
print(f"🔄 Iterations: {kmeans.n_iter_}")

Output:

==================================================
SKLEARN K-MEANS RESULTS
==================================================

📊 Cluster sizes: [75 75 75 75]
📏 Inertia: 362.47
📐 Silhouette Score: 0.8335
🔄 Iterations: 3

The Elbow Method

🐍 Python
# Find optimal K using elbow method
inertias = []
silhouettes = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X)
    inertias.append(kmeans.inertia_)
    silhouettes.append(silhouette_score(X, labels))

# Plot
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Elbow plot
axes[0].plot(K_range, inertias, 'bo-', linewidth=2, markersize=8)
axes[0].axvline(x=4, color='r', linestyle='--', label='Elbow at K=4')
axes[0].set_xlabel('Number of Clusters (K)', fontsize=11)
axes[0].set_ylabel('Inertia (WCSS)', fontsize=11)
axes[0].set_title('Elbow Method', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Silhouette plot
axes[1].plot(K_range, silhouettes, 'go-', linewidth=2, markersize=8)
axes[1].axvline(x=4, color='r', linestyle='--', label='Best at K=4')
axes[1].set_xlabel('Number of Clusters (K)', fontsize=11)
axes[1].set_ylabel('Silhouette Score', fontsize=11)
axes[1].set_title('Silhouette Analysis', fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('assets/elbow_method.png', dpi=150)
plt.show()
Figure: Elbow method and silhouette analysis; both suggest K=4 is optimal.

Deep Dive

When K-Means Works Well

K-Means performs best when:

| Condition | Why |
|-----------|-----|
| ✅ Spherical clusters | K-Means uses Euclidean distance and assumes roughly circular/spherical shapes |
| ✅ Similar cluster sizes | All clusters get roughly equal weight |
| ✅ Well-separated clusters | Clear boundaries between groups |
| ✅ Known number of clusters | $K$ must be specified in advance |

When K-Means Fails

K-Means struggles with:

  • Non-spherical clusters (elongated, irregular shapes)
  • Varying cluster densities (tight vs. spread out)
  • Varying cluster sizes (small vs. large)
  • Outliers (can heavily influence centroids)
Figure: K-Means fails on non-spherical clusters; it splits the two "moons" incorrectly.
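
You can reproduce this failure mode in a few lines. The sketch below (using sklearn's make_moons, with parameters chosen only for illustration) clusters two interleaving half-circles with K-Means and prints how well the labels agree with the true moons, plus the silhouette score for comparison with the blob example above.

🐍 Python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score

# Two interleaving half-circles: clearly two groups, but not spherical
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=42)

km = KMeans(n_clusters=2, random_state=42, n_init=10)
labels_moons = km.fit_predict(X_moons)

# Agreement with the true moon labels (accounting for label swapping)
agreement = max(np.mean(labels_moons == y_moons), np.mean(labels_moons != y_moons))
print(f"Agreement with true moons: {agreement:.2%}")
print(f"Silhouette score: {silhouette_score(X_moons, labels_moons):.3f}")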

Computational Complexity

| Operation | Complexity |
|-----------|------------|
| Single iteration | $O(N \cdot K \cdot d)$ |
| Total (worst case) | $O(N \cdot K \cdot d \cdot I)$ |

Where $N$ = samples, $K$ = clusters, $d$ = dimensions, $I$ = iterations.

Practical note: K-Means is very fast, even on large datasets. For millions of points, consider Mini-Batch K-Means.
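
For very large datasets, a minimal sketch of Mini-Batch K-Means with sklearn looks like this (the dataset size and batch_size are illustrative, not tuned); it updates centroids on random mini-batches instead of the full data, trading a little cluster quality for a large speedup.

🐍 Python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# A larger synthetic dataset (size chosen only for illustration)
X_big, _ = make_blobs(n_samples=100_000, centers=4, cluster_std=0.8, random_state=42)

# batch_size controls how many points each centroid update sees
mbk = MiniBatchKMeans(n_clusters=4, batch_size=1024, random_state=42, n_init=10)
labels_big = mbk.fit_predict(X_big)

print(f"Inertia: {mbk.inertia_:.2f}")
print(f"Iterations: {mbk.n_iter_}")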

Frequently Asked Questions

Q1: How do I choose K if the elbow isn’t clear?

Options:

  1. Use silhouette score (higher is better)
  2. Apply domain knowledge (what makes sense for your problem?)
  3. Try Gap statistic (compares to null reference)
  4. Use hierarchical clustering first to guide K selection

Q2: Why do I get different results each time?

K-Means is sensitive to initialization. Solutions:

  1. Set random_state for reproducibility
  2. Set n_init=10 to run the algorithm several times and keep the best result
  3. Use K-Means++ initialization (default in sklearn)

Q3: Can K-Means handle categorical data?

No. K-Means uses Euclidean distance, which requires numerical features. Alternatives:

  • K-Modes: For categorical data
  • K-Prototypes: For mixed data
  • Convert categories to one-hot encoding (with caution)
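
If you do go the one-hot route, here is a minimal sketch (hypothetical toy columns, not a recommendation) of encoding a categorical feature before K-Means; the comments note why this needs care.

🐍 Python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one categorical column and one numeric column
colors = np.array([["red"], ["blue"], ["blue"], ["green"], ["red"], ["green"]])
spend = np.array([[10.0], [200.0], [180.0], [15.0], [12.0], [190.0]])

# One-hot encode the categories, then stack with the numeric feature
onehot = OneHotEncoder().fit_transform(colors).toarray()
X_mixed = np.hstack([onehot, spend])

# Caution: the unscaled numeric column dominates the Euclidean distances here,
# so the one-hot columns barely matter. Scale features before clustering.
labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X_mixed)
print(labels)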

Summary

| Concept | Key Points |
|---------|------------|
| K-Means | Partition data into $K$ clusters by minimizing WCSS |
| Lloyd's Algorithm | Alternate: assign points → update centroids |
| Initialization | Use K-Means++ for better results |
| Choosing K | Elbow method, silhouette score |
| Limitations | Assumes spherical, equal-sized clusters |
| Complexity | $O(N \cdot K \cdot d \cdot I)$, very fast |

References

  • sklearn KMeans documentation
  • Lloyd, S. (1982). "Least squares quantization in PCM." IEEE Transactions on Information Theory.
  • Arthur, D. & Vassilvitskii, S. (2007). "k-means++: The advantages of careful seeding." SODA.
  • Bishop, C. (2006). Pattern Recognition and Machine Learning, Chapter 9.1.