UML-01: Introduction to Unsupervised Learning

Summary
Unlock the hidden potential of 'Dark Data'. Master Unsupervised Learning fundamentals, explore the 'Student vs. Explorer' analogy, and navigate the taxonomy of Clustering, Dimensionality Reduction, and Anomaly Detection.

Learning Objectives

After reading this post, you will be able to:

  • Understand the fundamental differences between supervised and unsupervised learning
  • Know the three major categories of unsupervised methods: clustering, dimensionality reduction, and anomaly detection
  • Identify appropriate use cases for unsupervised techniques
  • Preview the UML series roadmap and what’s coming next
Acronym Clarification: In this series, UML stands for Unsupervised Machine Learning, not the Unified Modeling Language used in software engineering. We point this out out of an abundance of caution because, ironically, “ambiguity” is a core theme of unsupervised learning!

Theory

What is Unsupervised Learning?

Imagine you are an explorer landing on an alien planet. You encounter strange plants and animals you’ve never seen before. There is no guidebook, no teacher, and no labels telling you “this is a tree” or “that is a wolf.”

What do you do?

You start observing. You notice that some creatures have wings and fly (Group A), while others have fins and swim (Group B). You notice that some plants are tall with wood (structure), while others are small and green. You are learning by observation. This is the essence of Unsupervised Learning.

In the Supervised Learning series, every training sample had a label: a “correct answer” provided by a “teacher.” But in the real world, most data is like that alien planet: vast, complex, and completely unlabeled. This is often called Dark Data.

Unsupervised learning discovers hidden patterns, structures, and relationships in this “dark” data without any labels. Instead of learning input-output mappings (like a student preparing for a test), these algorithms find the underlying structure in the data itself (like a scientist discovering natural laws).

Supervised vs Unsupervised: A Comparison

graph TD
    subgraph Supervised [🎓 Supervised Learning: The Student]
        direction TB
        S_Data[("Input Data + Correct Answers\n(Images + Labels)")]
        S_Algo["🧠 Model (Student)"]
        S_Pred["Prediction"]
        S_Teacher["👨‍🏫 Teacher (Loss Function)"]
        S_Data --> S_Algo
        S_Algo --> S_Pred
        S_Pred --> S_Teacher
        S_Data -.->|"Correct Answer"| S_Teacher
        S_Teacher --"Feedback / Correction"--> S_Algo
    end
    subgraph Unsupervised [🔍 Unsupervised Learning: The Explorer]
        direction TB
        U_Data[("Input Data Only\n(Raw Observations)")]
        U_Algo["🧠 Model (Explorer)"]
        U_Process{{"Finding Similarities"}}
        U_Structure["Hidden Structure\n(Clusters / Rules)"]
        U_Data --> U_Algo
        U_Algo --> U_Process
        U_Process --> U_Structure
    end
    style S_Teacher fill:#ffccbc,stroke:#d35400,stroke-width:2px
    style S_Algo fill:#e1f5fe
    style U_Algo fill:#fff9c4
    style U_Process fill:#e1bee7
| Aspect | Unsupervised Learning | Supervised Learning |
| --- | --- | --- |
| Data | Unlabeled (input only) | Labeled (input + output pairs) |
| Goal | Discover patterns/structure in data | Predict labels for new data |
| Evaluation | Subjective; harder to evaluate | Clear metrics (accuracy, MSE) |
| Analogy | Learning by Exploration: you figure out the rules yourself. | Learning with a Teacher: the teacher corrects your mistakes. |
| Examples | Clustering, Dimensionality Reduction | Classification, Regression |
Real-world insight: Labeling data is expensive, slow, and human-intensive. Unsupervised learning unlocks the potential of the remaining 95%+ of your data that sits unused.
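
To make the contrast concrete, here is a minimal sketch (assuming scikit-learn, which is used throughout this series; the LogisticRegression classifier is just an illustrative stand-in for any supervised model):

🐍 Python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the "student" learns from features AND labels
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)                              # needs y

# Unsupervised: the "explorer" sees only the features
km = KMeans(n_clusters=3, random_state=42, n_init=10)
cluster_ids = km.fit_predict(X)            # no y anywhere

print(clf.predict(X[:3]))                  # known class labels (0, 1, 2)
print(cluster_ids[:3])                     # arbitrary cluster IDs the model invented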

The Explorer’s Toolkit

Just as an explorer uses different tools for different terrains (maps, compasses, drills), we use three main types of unsupervised learning to navigate unknown data:

graph TD
    Root["🔍 Unsupervised Learning"]
    subgraph Cluster [📊 Clustering]
        direction TB
        C_Goal[("Goal: Find Groups")]
        C1["K-Means"]
        C2["Hierarchical"]
        C3["DBSCAN"]
        C4["GMM"]
    end
    subgraph DimRed [📉 Dimensionality Reduction]
        direction TB
        DR_Goal[("Goal: Compress")]
        DR1["PCA"]
        DR2["t-SNE"]
        DR3["UMAP"]
    end
    subgraph Anomaly [⚠️ Anomaly Detection]
        direction TB
        A_Goal[("Goal: Find Outliers")]
        A1["Isolation Forest"]
        A2["One-Class SVM"]
        A3["Local Outlier Factor"]
    end
    Root --> Cluster
    Root --> DimRed
    Root --> Anomaly
    %% Connections inside subgraphs for vertical alignment
    C_Goal ~~~ C1 --> C2 --> C3 --> C4
    DR_Goal ~~~ DR1 --> DR2 --> DR3
    A_Goal ~~~ A1 --> A2 --> A3
    %% Styling
    style Root fill:#e1f5fe,stroke:#01579b,stroke-width:2px,font-size:16px
    style Cluster fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px
    style DimRed fill:#fff3e0,stroke:#ef6c00,stroke-width:2px
    style Anomaly fill:#ffebee,stroke:#c62828,stroke-width:2px
    style C_Goal fill:#c8e6c9,stroke:none
    style DR_Goal fill:#ffe0b2,stroke:none
    style A_Goal fill:#ffcdd2,stroke:none

Clustering: Finding Groups

Clustering partitions data into groups (clusters) where:

  • Points within a cluster are similar to each other
  • Points in different clusters are dissimilar

Use cases:

  • Customer segmentation (group customers by behavior)
  • Document categorization (group similar articles)
  • Image segmentation (group similar pixels)
  • Gene expression analysis (group similar genes)
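
To make “similar within, dissimilar between” concrete, here is a minimal sketch on synthetic data (the make_blobs dataset and the distance comparison are illustrative assumptions, not part of the series code):

🐍 Python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic unlabeled data that secretly contains 3 groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X)

# Compare average distances: within cluster 0 vs. from cluster 0 to cluster 1
in_0, in_1 = X[labels == 0], X[labels == 1]
within = np.mean([np.linalg.norm(in_0 - p, axis=1).mean() for p in in_0])
between = np.mean([np.linalg.norm(in_1 - p, axis=1).mean() for p in in_0])
print(f"avg distance within cluster 0: {within:.2f}")
print(f"avg distance from cluster 0 to cluster 1: {between:.2f}")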

Dimensionality Reduction: Compressing Information

Dimensionality reduction transforms high-dimensional data into a lower-dimensional representation while preserving important information.

Use cases:

  • Visualization (plot 100D data in 2D)
  • Feature compression (reduce storage/computation)
  • Noise reduction (remove irrelevant dimensions)
  • Preprocessing for other ML algorithms
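
As a minimal sketch of compressing data, here the same Iris features used later in this post are reduced from 4 dimensions to 2 with PCA (details of PCA come in UML-06):

🐍 Python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                      # 150 samples x 4 features

pca = PCA(n_components=2)                 # compress 4D -> 2D
X_2d = pca.fit_transform(X)

print(X_2d.shape)                         # (150, 2)
print(pca.explained_variance_ratio_)      # fraction of variance each new axis preserves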

Anomaly Detection: Identifying Outliers

Anomaly detection finds data points that differ significantly from the majority: the “unusual” observations.

Use cases:

  • Fraud detection (unusual credit card transactions)
  • System monitoring (server failures, network intrusions)
  • Quality control (defective products)
  • Medical diagnosis (rare diseases)
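
Here is a minimal anomaly-detection sketch (the synthetic data and the Isolation Forest settings, including the contamination value, are illustrative assumptions):

🐍 Python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))    # typical points
outliers = rng.uniform(low=6.0, high=8.0, size=(5, 2))    # obviously unusual points
X = np.vstack([normal, outliers])

iso = IsolationForest(contamination=0.03, random_state=42)
pred = iso.fit_predict(X)                 # +1 = normal, -1 = anomaly

print("points flagged as anomalies:", int(np.sum(pred == -1)))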

When to Use Unsupervised Learning

graph LR
    A[Your Problem] --> B{Have labels?}
    B -->|Yes| C[Supervised Learning]
    B -->|No| D{Goal?}
    D -->|Find groups| E["Clustering"]
    D -->|Reduce dimensions| F["Dimensionality Reduction"]
    D -->|Find outliers| G["Anomaly Detection"]
    style E fill:#c8e6c9
    style F fill:#fff9c4
    style G fill:#ffcdd2

Pro tip: Unsupervised learning is often used as a preprocessing step for supervised learning (a short sketch follows this list):

  • Cluster data to create pseudo-labels
  • Reduce dimensions before training classifiers
  • Detect and remove anomalies from training data
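
For example, here is a minimal sketch of the second bullet, reducing dimensions before training a classifier (the specific pipeline and models are illustrative assumptions):

🐍 Python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# Unsupervised step (PCA) feeds a supervised step (classifier)
model = make_pipeline(PCA(n_components=2), LogisticRegression(max_iter=1000))

scores = cross_val_score(model, X, y, cv=5)
print(f"mean accuracy with PCA preprocessing: {scores.mean():.3f}")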

Real-World Applications

| Domain | Application | Method |
| --- | --- | --- |
| E-commerce | Customer segmentation | K-Means, GMM |
| Finance | Fraud detection | Isolation Forest |
| Healthcare | Disease subtyping | Hierarchical Clustering |
| NLP | Topic modeling | LDA, Clustering |
| Computer Vision | Image compression | PCA |
| Bioinformatics | Gene clustering | DBSCAN |
| Recommendation | User behavior analysis | t-SNE + Clustering |

Code Practice

Let’s put on our boots. In this section, we will take a raw, unlabeled dataset (Dark Data) and act as the explorer. We’ll attempt to rediscover hidden structure without any guide to help us.

Loading Unlabeled Data

In supervised learning, we always loaded data with labels. Now, let’s see what working with unlabeled data looks like:

🐍 Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
X = iris.data  # Features only (sepal/petal lengths/widths)

# CRITICAL: In unsupervised learning, we DO NOT use the target (y labels)
# We pretend they don't exist and let the data speak for itself.
# y = iris.target  <-- We ignore this!

feature_names = iris.feature_names

print("=" * 50)
print("UNLABELED DATA EXPLORATION")
print("=" * 50)
print(f"\n๐Ÿ“Š Dataset shape: {X.shape}")
print(f"๐Ÿ“ Features: {feature_names}")
print(f"\n๐Ÿ”ข Sample data (first 5 rows):")
print(X[:5])

Output:

==================================================
UNLABELED DATA EXPLORATION
==================================================

Dataset shape: (150, 4)
Features: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Sample data (first 5 rows):
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
Notice there are no labels! In supervised learning, we’d have y = iris.target with values like 0, 1, 2 for the three species. Here, we pretend we don’t know those labels: can the algorithm discover the groups on its own?

Visualizing Unlabeled Data

Before applying any algorithm, let’s visualize our data to see if natural groupings exist:

🐍 Python
# Visualize data using two features
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Plot 1: Sepal dimensions
axes[0].scatter(X[:, 0], X[:, 1], c='steelblue', alpha=0.6, s=50)
axes[0].set_xlabel('Sepal Length (cm)')
axes[0].set_ylabel('Sepal Width (cm)')
axes[0].set_title('Iris Data: Sepal Dimensions')
axes[0].grid(True, alpha=0.3)

# Plot 2: Petal dimensions
axes[1].scatter(X[:, 2], X[:, 3], c='steelblue', alpha=0.6, s=50)
axes[1].set_xlabel('Petal Length (cm)')
axes[1].set_ylabel('Petal Width (cm)')
axes[1].set_title('Iris Data: Petal Dimensions')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('assets/unlabeled_data.png', dpi=150)
plt.show()
Scatter plots of Iris dataset without labels
Visualizing unlabeled Iris data: Sepal dimensions (left) and Petal dimensions (right). Can you spot natural groupings?

Looking at the petal dimensions plot, you might already notice some natural clusters. This is exactly what clustering algorithms will help us find!

Preview: K-Means Clustering

Let’s get a sneak peek of what clustering can do. We’ll cover K-Means in detail in the next post, but here’s a quick demo:

🐍 Python
from sklearn.cluster import KMeans

# Apply K-Means clustering
# n_clusters=3: We (the humans) know there are 3 species, but in real life, 
#               you might need to use the 'Elbow Method' to find this number.
# random_state=42: Ensures reproducible results.
# n_init=10: Run the algorithm 10 times with different starting points 
#            to avoid getting stuck in a poor local optimum.
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)

# The algorithm learns the structure (fit) and assigns labels (predict)
clusters = kmeans.fit_predict(X)

# Visualize the clustering result
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Original data (no labels)
axes[0].scatter(X[:, 2], X[:, 3], c='steelblue', alpha=0.6, s=50)
axes[0].set_xlabel('Petal Length (cm)')
axes[0].set_ylabel('Petal Width (cm)')
axes[0].set_title('Before: Unlabeled Data')
axes[0].grid(True, alpha=0.3)

# Clustered data
colors = ['#e74c3c', '#3498db', '#2ecc71']
for i in range(3):
    mask = clusters == i
    axes[1].scatter(X[mask, 2], X[mask, 3], 
                    c=colors[i], alpha=0.6, s=50, 
                    label=f'Cluster {i+1}')

# Plot cluster centers
centers = kmeans.cluster_centers_
axes[1].scatter(centers[:, 2], centers[:, 3], 
                c='black', marker='X', s=200, 
                edgecolors='white', linewidths=2,
                label='Centroids')

axes[1].set_xlabel('Petal Length (cm)')
axes[1].set_ylabel('Petal Width (cm)')
axes[1].set_title('After: K-Means Clustering (k=3)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('assets/kmeans_preview.png', dpi=150)
plt.show()

print(f"\n๐Ÿ“Š Clustering Results:")
print(f"   Cluster sizes: {np.bincount(clusters)}")
K-Means clustering result on Iris data
K-Means clustering discovers 3 natural groups in the Iris data, without ever seeing the true species labels!

Output:

Clustering Results:
   Cluster sizes: [62 50 38]
Amazing! K-Means found 3 clusters that closely match the true Iris species, all without seeing any labels! The algorithm discovered the natural structure in the data purely from the feature values.

Comparing with True Labels (Cheating a Little)

Let’s peek at how well our unsupervised clustering matches the true labels:

🐍 Python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# True labels (we pretend we didn't have these!)
y_true = iris.target

# Evaluate clustering quality
# Adjusted Rand Index (ARI): Measures similarity between true and predicted labels.
# 0.0 = Random labeling
# 1.0 = Perfect match
ari = adjusted_rand_score(y_true, clusters)

# Normalized Mutual Information (NMI): Just another way to measure agreement.
nmi = normalized_mutual_info_score(y_true, clusters)

print("=" * 50)
print("CLUSTERING EVALUATION (using hidden labels)")
print("=" * 50)
print(f"\n๐Ÿ“Š Adjusted Rand Index: {ari:.4f}")
print(f"๐Ÿ“Š Normalized Mutual Info: {nmi:.4f}")
print(f"\n๐Ÿ’ก Note: In real unsupervised learning, you wouldn't have 'y_true'!")
print(f"   You would rely on business logic or internal metrics like Silhouette Score.")

Output:

==================================================
CLUSTERING EVALUATION (using hidden labels)
==================================================

Adjusted Rand Index: 0.7302
Normalized Mutual Info: 0.7582

Note: In real unsupervised learning, you wouldn't have 'y_true'!
   You would rely on business logic or internal metrics like Silhouette Score.
Evaluation note: In real unsupervised learning, you typically don’t have true labels to compare against. These metrics are used here purely for demonstration. We’ll discuss proper evaluation techniques for unsupervised learning in later posts.

Deep Dive

Frequently Asked Questions

Q1: How do you evaluate unsupervised learning without labels?

This is the hardest part of being an explorer: you don’t have an answer key. Instead of checking against “correct” labels, we measure success by the utility of the discovery. We rely on a combination of mathematical heuristics and practical validation:

| Approach | Method | Description |
| --- | --- | --- |
| Internal metrics | Silhouette Score | Measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation). |
| Internal metrics | Calinski-Harabasz | Ratio of between-cluster dispersion to within-cluster dispersion. |
| External metrics | Adjusted Rand Index | Only possible if you have a small subset of labeled data for validation. |
| Visual inspection | t-SNE / PCA | Projecting data to 2D/3D to visually verify whether the clusters “look” separated. |
| Business validation | A/B testing | The gold standard. For example, if you cluster customers, send different marketing campaigns to each cluster and see if conversion rates improve. |
The “So What?” Test: The best evaluation metric for unsupervised learning is often utility. Does the new structure help solve the business problem? If a clustering model groups customers in a way that allows the marketing team to craft better campaigns, it is a good model, regardless of its Silhouette Score.
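
As a concrete example of an internal metric, here is a minimal sketch that recomputes the K-Means clusters from the Code Practice section and scores them with the Silhouette Score, using only the features and the cluster assignments:

🐍 Python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X = load_iris().data
clusters = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X)

# Internal metric: needs only the features and the cluster assignments, never the true labels
score = silhouette_score(X, clusters)
print(f"Silhouette Score: {score:.3f}")   # closer to 1.0 means tighter, better-separated clusters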

Q2: When should I choose unsupervised over supervised learning?

Use unsupervised learning when:

  • ✅ You don’t have labeled data
  • ✅ Labeling is too expensive or impossible
  • ✅ You want to explore data structure before building models
  • ✅ You’re looking for anomalies or unusual patterns
  • ✅ You need to reduce dimensionality for visualization or efficiency

Q3: Can unsupervised learning create labels for supervised learning?

Yes! This is called semi-supervised learning or self-training:

graph LR
    A["Unlabeled Data"] --> B["Clustering"]
    B --> C["Pseudo-labels"]
    C --> D["Train Classifier"]
    D --> E["Final Model"]
    style C fill:#fff9c4

This approach can leverage large amounts of unlabeled data to improve models when labeled data is scarce.
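
A minimal sketch of this pseudo-labeling idea (the choice of RandomForestClassifier as the downstream model is an illustrative assumption):

🐍 Python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X = load_iris().data                      # pretend the true labels are unavailable

# Step 1: clustering invents pseudo-labels from the data's structure alone
pseudo_labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X)

# Step 2: train an ordinary supervised classifier on those pseudo-labels
clf = RandomForestClassifier(random_state=42)
clf.fit(X, pseudo_labels)

# The resulting model assigns new samples to the discovered groups
print(clf.predict(X[:5]))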

Q4: What’s the difference between clustering and classification?

| Aspect | Clustering | Classification |
| --- | --- | --- |
| Labels | No predefined labels | Known classes |
| Goal | Discover groups | Assign to known groups |
| Evaluation | Internal metrics | Accuracy, F1, etc. |
| Learning type | Unsupervised | Supervised |

The Challenge of Unsupervised Learning

Key insight: Unsupervised learning has no single “correct answer.” Different algorithms may produce very different results on the same data. Choosing the right number of clusters or the right algorithm requires domain knowledge and experimentation.

UML Series Roadmap

This series will cover the following topics:

| Post | Topic | Key Concepts |
| --- | --- | --- |
| UML-01 | Introduction (this post) | Overview and taxonomy |
| UML-02 | K-Means Clustering | Lloyd’s algorithm, initialization, elbow method |
| UML-03 | Hierarchical Clustering | Dendrograms, linkage methods |
| UML-04 | DBSCAN | Density-based clustering, core points |
| UML-05 | Gaussian Mixture Models | EM algorithm, soft clustering |
| UML-06 | PCA | Eigendecomposition, variance explained |
| UML-07 | t-SNE & UMAP | Visualization techniques |
| UML-08 | Anomaly Detection | Isolation Forest, LOF |
| UML-09 | Association Rules | Apriori, market basket analysis |
| UML-10 | Conclusion | Algorithm selection guide |

Summary

| Concept | Key Points |
| --- | --- |
| Unsupervised Learning | Learning from unlabeled data |
| Clustering | Grouping similar data points together |
| Dimensionality Reduction | Compressing data to fewer dimensions |
| Anomaly Detection | Finding unusual data points |
| Key Challenge | No labels means no single “correct” answer |
| Evaluation | Requires internal metrics or domain knowledge |
