UML-01: Introduction to Unsupervised Learning

Summary
Unlock the hidden potential of 'Dark Data'. Master Unsupervised Learning fundamentals, explore the 'Student vs. Explorer' analogy, and navigate the taxonomy of Clustering, Dimensionality Reduction, and Anomaly Detection.

Learning Objectives

After reading this post, you will be able to:

  • Understand the fundamental differences between supervised and unsupervised learning
  • Know the three major categories of unsupervised methods: clustering, dimensionality reduction, and anomaly detection
  • Identify appropriate use cases for unsupervised techniques
  • Preview the UML series roadmap and what’s coming next
Acronym Clarification: In this series, UML stands for Unsupervised Machine Learning, not the Unified Modeling Language used in software engineering. We point this out out of an abundance of caution because, ironically, “ambiguity” is a core theme of unsupervised learning!

Theory

What is Unsupervised Learning?

Imagine you are an explorer landing on an alien planet. You encounter strange plants and animals you’ve never seen before. There is no guidebook, no teacher, and no labels telling you “this is a tree” or “that is a wolf.”

What do you do?

You start observing. You notice that some creatures have wings and fly (Group A), while others have fins and swim (Group B). You notice that some plants are tall with wood (structure), while others are small and green. You are learning by observation. This is the essence of Unsupervised Learning.

In the Supervised Learning series, every training sample had a label: a “correct answer” provided by a “teacher.” But in the real world, most data is like that alien planet: vast, complex, and completely unlabeled. This is often called Dark Data.

Unsupervised learning discovers hidden patterns, structures, and relationships in this “dark” data without any labels. Instead of learning input-output mappings (like a student preparing for a test), these algorithms find the underlying structure in the data itself (like a scientist discovering natural laws).

Supervised vs Unsupervised: A Comparison

graph TD
    subgraph Supervised [🎓 Supervised Learning: The Student]
        direction TB
        S_Data[("Input Data + Correct Answers\n(Images + Labels)")]
        S_Algo["🧠 Model (Student)"]
        S_Pred["Prediction"]
        S_Teacher["👨‍🏫 Teacher (Loss Function)"]
        S_Data --> S_Algo
        S_Algo --> S_Pred
        S_Pred --> S_Teacher
        S_Data -.->|"Correct Answer"| S_Teacher
        S_Teacher --"Feedback / Correction"--> S_Algo
    end
    subgraph Unsupervised [🔍 Unsupervised Learning: The Explorer]
        direction TB
        U_Data[("Input Data Only\n(Raw Observations)")]
        U_Algo["🧠 Model (Explorer)"]
        U_Process{{"Finding Similarities"}}
        U_Structure["Hidden Structure\n(Clusters / Rules)"]
        U_Data --> U_Algo
        U_Algo --> U_Process
        U_Process --> U_Structure
    end
    style S_Teacher fill:#ffccbc,stroke:#d35400,stroke-width:2px
    style S_Algo fill:#e1f5fe
    style U_Algo fill:#fff9c4
    style U_Process fill:#e1bee7
| Aspect | Unsupervised Learning | Supervised Learning |
| --- | --- | --- |
| Data | Unlabeled (input only) | Labeled (input + output pairs) |
| Goal | Discover patterns/structure in data | Predict labels for new data |
| Evaluation | Subjective; harder to evaluate | Clear metrics (accuracy, MSE) |
| Analogy | Learning by Exploration: you figure out the rules yourself. | Learning with a Teacher: the teacher corrects your mistakes. |
| Examples | Clustering, Dimensionality Reduction | Classification, Regression |
Real-world insight: Labeling data is expensive, slow, and human-intensive. Unsupervised learning unlocks the potential of the remaining 95%+ of your data that sits unused.
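
To make the contrast concrete, here is a minimal sketch (assuming scikit-learn, which is used throughout this series; the LogisticRegression classifier is just an illustrative stand-in for any supervised model):

🐍 Python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the "student" learns from features AND labels
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)                              # needs y

# Unsupervised: the "explorer" sees only the features
km = KMeans(n_clusters=3, random_state=42, n_init=10)
cluster_ids = km.fit_predict(X)            # no y anywhere

print(clf.predict(X[:3]))                  # known class labels (0, 1, 2)
print(cluster_ids[:3])                     # arbitrary cluster IDs the model invented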

The Explorer’s Toolkit

Just as an explorer uses different tools for different terrains (maps, compasses, drills), we use three main types of unsupervised learning to navigate unknown data:

graph TD
    Root["🔍 Unsupervised Learning"]
    subgraph Cluster [📊 Clustering]
        direction TB
        C_Goal[("Goal: Find Groups")]
        C1["K-Means"]
        C2["Hierarchical"]
        C3["DBSCAN"]
        C4["GMM"]
    end
    subgraph DimRed [📉 Dimensionality Reduction]
        direction TB
        DR_Goal[("Goal: Compress")]
        DR1["PCA"]
        DR2["t-SNE"]
        DR3["UMAP"]
    end
    subgraph Anomaly [⚠️ Anomaly Detection]
        direction TB
        A_Goal[("Goal: Find Outliers")]
        A1["Isolation Forest"]
        A2["One-Class SVM"]
        A3["Local Outlier Factor"]
    end
    Root --> Cluster
    Root --> DimRed
    Root --> Anomaly
    %% Connections inside subgraphs for vertical alignment
    C_Goal ~~~ C1 --> C2 --> C3 --> C4
    DR_Goal ~~~ DR1 --> DR2 --> DR3
    A_Goal ~~~ A1 --> A2 --> A3
    %% Styling
    style Root fill:#e1f5fe,stroke:#01579b,stroke-width:2px,font-size:16px
    style Cluster fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px
    style DimRed fill:#fff3e0,stroke:#ef6c00,stroke-width:2px
    style Anomaly fill:#ffebee,stroke:#c62828,stroke-width:2px
    style C_Goal fill:#c8e6c9,stroke:none
    style DR_Goal fill:#ffe0b2,stroke:none
    style A_Goal fill:#ffcdd2,stroke:none

Clustering: Finding Groups

Clustering partitions data into groups (clusters) where:

  • Points within a cluster are similar to each other
  • Points in different clusters are dissimilar

Use cases:

  • Customer segmentation (group customers by behavior)
  • Document categorization (group similar articles)
  • Image segmentation (group similar pixels)
  • Gene expression analysis (group similar genes)
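
To make “similar within, dissimilar between” concrete, here is a minimal sketch on synthetic data (the make_blobs dataset and the distance comparison are illustrative assumptions, not part of the series code):

🐍 Python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic unlabeled data that secretly contains 3 groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X)

# Compare average distances: within cluster 0 vs. from cluster 0 to cluster 1
in_0, in_1 = X[labels == 0], X[labels == 1]
within = np.mean([np.linalg.norm(in_0 - p, axis=1).mean() for p in in_0])
between = np.mean([np.linalg.norm(in_1 - p, axis=1).mean() for p in in_0])
print(f"avg distance within cluster 0: {within:.2f}")
print(f"avg distance from cluster 0 to cluster 1: {between:.2f}")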

Dimensionality Reduction: Compressing Information

Dimensionality reduction transforms high-dimensional data into a lower-dimensional representation while preserving important information.

Use cases:

  • Visualization (plot 100D data in 2D)
  • Feature compression (reduce storage/computation)
  • Noise reduction (remove irrelevant dimensions)
  • Preprocessing for other ML algorithms
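
As a minimal sketch of compressing data, here the same Iris features used later in this post are reduced from 4 dimensions to 2 with PCA (details of PCA come in UML-06):

🐍 Python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                      # 150 samples x 4 features

pca = PCA(n_components=2)                 # compress 4D -> 2D
X_2d = pca.fit_transform(X)

print(X_2d.shape)                         # (150, 2)
print(pca.explained_variance_ratio_)      # fraction of variance each new axis preserves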

Anomaly Detection: Identifying Outliers

Anomaly detection finds data points that differ significantly from the majority: the “unusual” observations.

Use cases:

  • Fraud detection (unusual credit card transactions)
  • System monitoring (server failures, network intrusions)
  • Quality control (defective products)
  • Medical diagnosis (rare diseases)
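
Here is a minimal anomaly-detection sketch (the synthetic data and the Isolation Forest settings, including the contamination value, are illustrative assumptions):

🐍 Python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))    # typical points
outliers = rng.uniform(low=6.0, high=8.0, size=(5, 2))    # obviously unusual points
X = np.vstack([normal, outliers])

iso = IsolationForest(contamination=0.03, random_state=42)
pred = iso.fit_predict(X)                 # +1 = normal, -1 = anomaly

print("points flagged as anomalies:", int(np.sum(pred == -1)))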

When to Use Unsupervised Learning

graph LR
    A[Your Problem] --> B{Have labels?}
    B -->|Yes| C[Supervised Learning]
    B -->|No| D{Goal?}
    D -->|Find groups| E["Clustering"]
    D -->|Reduce dimensions| F["Dimensionality Reduction"]
    D -->|Find outliers| G["Anomaly Detection"]
    style E fill:#c8e6c9
    style F fill:#fff9c4
    style G fill:#ffcdd2

Pro tip: Unsupervised learning is often used as a preprocessing step for supervised learning (a short sketch follows this list):

  • Cluster data to create pseudo-labels
  • Reduce dimensions before training classifiers
  • Detect and remove anomalies from training data
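
For example, here is a minimal sketch of the second bullet, reducing dimensions before training a classifier (the specific pipeline and models are illustrative assumptions):

🐍 Python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# Unsupervised step (PCA) feeds a supervised step (classifier)
model = make_pipeline(PCA(n_components=2), LogisticRegression(max_iter=1000))

scores = cross_val_score(model, X, y, cv=5)
print(f"mean accuracy with PCA preprocessing: {scores.mean():.3f}")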

Real-World Applications

| Domain | Application | Method |
| --- | --- | --- |
| E-commerce | Customer segmentation | K-Means, GMM |
| Finance | Fraud detection | Isolation Forest |
| Healthcare | Disease subtyping | Hierarchical Clustering |
| NLP | Topic modeling | LDA, Clustering |
| Computer Vision | Image compression | PCA |
| Bioinformatics | Gene clustering | DBSCAN |
| Recommendation | User behavior analysis | t-SNE + Clustering |

Code Practice

Let’s put on our boots. In this section, we will take a raw, unlabeled dataset (Dark Data) and act as the explorer. We’ll attempt to rediscover hidden structure without any guide to help us.

Loading Unlabeled Data

In supervised learning, we always loaded data with labels. Now, let’s see what working with unlabeled data looks like:

🐍 Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
X = iris.data  # Features only (sepal/petal lengths/widths)

# CRITICAL: In unsupervised learning, we DO NOT use the target (y labels)
# We pretend they don't exist and let the data speak for itself.
# y = iris.target  <-- We ignore this!

feature_names = iris.feature_names

print("=" * 50)
print("UNLABELED DATA EXPLORATION")
print("=" * 50)
print(f"\n๐Ÿ“Š Dataset shape: {X.shape}")
print(f"๐Ÿ“ Features: {feature_names}")
print(f"\n๐Ÿ”ข Sample data (first 5 rows):")
print(X[:5])

Output:

==================================================
UNLABELED DATA EXPLORATION
==================================================

Dataset shape: (150, 4)
Features: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Sample data (first 5 rows):
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
Notice there are no labels! In supervised learning, we’d have y = iris.target with values like 0, 1, 2 for the three species. Here, we pretend we don’t know those labels: can the algorithm discover the groups on its own?

Visualizing Unlabeled Data

Before applying any algorithm, let’s visualize our data to see if natural groupings exist:

🐍 Python
# Visualize data using two features
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Plot 1: Sepal dimensions
axes[0].scatter(X[:, 0], X[:, 1], c='steelblue', alpha=0.6, s=50)
axes[0].set_xlabel('Sepal Length (cm)')
axes[0].set_ylabel('Sepal Width (cm)')
axes[0].set_title('Iris Data: Sepal Dimensions')
axes[0].grid(True, alpha=0.3)

# Plot 2: Petal dimensions
axes[1].scatter(X[:, 2], X[:, 3], c='steelblue', alpha=0.6, s=50)
axes[1].set_xlabel('Petal Length (cm)')
axes[1].set_ylabel('Petal Width (cm)')
axes[1].set_title('Iris Data: Petal Dimensions')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('assets/unlabeled_data.png', dpi=150)
plt.show()
Scatter plots of Iris dataset without labels
Visualizing unlabeled Iris data: Sepal dimensions (left) and Petal dimensions (right). Can you spot natural groupings?

Looking at the petal dimensions plot, you might already notice some natural clusters. This is exactly what clustering algorithms will help us find!

Preview: K-Means Clustering

Let’s get a sneak peek of what clustering can do. We’ll cover K-Means in detail in the next post, but here’s a quick demo:

🐍 Python
from sklearn.cluster import KMeans

# Apply K-Means clustering
# n_clusters=3: We (the humans) know there are 3 species, but in real life, 
#               you might need to use the 'Elbow Method' to find this number.
# random_state=42: Ensures reproducible results.
# n_init=10: Run the algorithm 10 times with different starting points 
#            to avoid getting stuck in a poor local optimum.
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)

# The algorithm learns the structure (fit) and assigns labels (predict)
clusters = kmeans.fit_predict(X)

# Visualize the clustering result
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Original data (no labels)
axes[0].scatter(X[:, 2], X[:, 3], c='steelblue', alpha=0.6, s=50)
axes[0].set_xlabel('Petal Length (cm)')
axes[0].set_ylabel('Petal Width (cm)')
axes[0].set_title('Before: Unlabeled Data')
axes[0].grid(True, alpha=0.3)

# Clustered data
colors = ['#e74c3c', '#3498db', '#2ecc71']
for i in range(3):
    mask = clusters == i
    axes[1].scatter(X[mask, 2], X[mask, 3], 
                    c=colors[i], alpha=0.6, s=50, 
                    label=f'Cluster {i+1}')

# Plot cluster centers
centers = kmeans.cluster_centers_
axes[1].scatter(centers[:, 2], centers[:, 3], 
                c='black', marker='X', s=200, 
                edgecolors='white', linewidths=2,
                label='Centroids')

axes[1].set_xlabel('Petal Length (cm)')
axes[1].set_ylabel('Petal Width (cm)')
axes[1].set_title('After: K-Means Clustering (k=3)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('assets/kmeans_preview.png', dpi=150)
plt.show()

print(f"\n๐Ÿ“Š Clustering Results:")
print(f"   Cluster sizes: {np.bincount(clusters)}")
K-Means clustering result on Iris data
K-Means clustering discovers 3 natural groups in the Iris data, without ever seeing the true species labels!

Output:

Clustering Results:
   Cluster sizes: [62 50 38]
Amazing! K-Means found 3 clusters that closely match the true Iris species, all without seeing any labels! The algorithm discovered the natural structure in the data purely from the feature values.

Comparing with True Labels (Cheating a Little)

Let’s peek at how well our unsupervised clustering matches the true labels:

🐍 Python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# True labels (we pretend we didn't have these!)
y_true = iris.target

# Evaluate clustering quality
# Adjusted Rand Index (ARI): Measures similarity between true and predicted labels.
# 0.0 = Random labeling
# 1.0 = Perfect match
ari = adjusted_rand_score(y_true, clusters)

# Normalized Mutual Information (NMI): Just another way to measure agreement.
nmi = normalized_mutual_info_score(y_true, clusters)

print("=" * 50)
print("CLUSTERING EVALUATION (using hidden labels)")
print("=" * 50)
print(f"\n๐Ÿ“Š Adjusted Rand Index: {ari:.4f}")
print(f"๐Ÿ“Š Normalized Mutual Info: {nmi:.4f}")
print(f"\n๐Ÿ’ก Note: In real unsupervised learning, you wouldn't have 'y_true'!")
print(f"   You would rely on business logic or internal metrics like Silhouette Score.")

Output:

==================================================
CLUSTERING EVALUATION (using hidden labels)
==================================================

Adjusted Rand Index: 0.7302
Normalized Mutual Info: 0.7582

Note: In real unsupervised learning, you wouldn't have 'y_true'!
   You would rely on business logic or internal metrics like Silhouette Score.
Evaluation note: In real unsupervised learning, you typically don’t have true labels to compare against. These metrics are used here purely for demonstration. We’ll discuss proper evaluation techniques for unsupervised learning in later posts.

Deep Dive

Frequently Asked Questions

Q1: How do you evaluate unsupervised learning without labels?

This is the hardest part of being an explorer: you don’t have an answer key. Instead of checking against “correct” labels, we measure success by the utility of the discovery. We rely on a combination of mathematical heuristics and practical validation:

| Approach | Method | Description |
| --- | --- | --- |
| Internal metrics | Silhouette Score | Measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation). |
| Internal metrics | Calinski-Harabasz | Ratio of between-cluster dispersion to within-cluster dispersion. |
| External metrics | Adjusted Rand Index | Only possible if you have a small subset of labeled data for validation. |
| Visual inspection | t-SNE / PCA | Projecting data to 2D/3D to visually verify whether the clusters “look” separated. |
| Business validation | A/B testing | The gold standard. For example, if you cluster customers, send different marketing campaigns to each cluster and see if conversion rates improve. |
The “So What?” Test: The best evaluation metric for unsupervised learning is often utility. Does the new structure help solve the business problem? If a clustering model groups customers in a way that allows the marketing team to craft better campaigns, it is a good model, regardless of its Silhouette Score.
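
As a concrete example of an internal metric, here is a minimal sketch that recomputes the K-Means clusters from the Code Practice section and scores them with the Silhouette Score, using only the features and the cluster assignments:

🐍 Python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X = load_iris().data
clusters = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X)

# Internal metric: needs only the features and the cluster assignments, never the true labels
score = silhouette_score(X, clusters)
print(f"Silhouette Score: {score:.3f}")   # closer to 1.0 means tighter, better-separated clusters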

Q2: When should I choose unsupervised over supervised learning?

Use unsupervised learning when:

  • ✅ You don’t have labeled data
  • ✅ Labeling is too expensive or impossible
  • ✅ You want to explore data structure before building models
  • ✅ You’re looking for anomalies or unusual patterns
  • ✅ You need to reduce dimensionality for visualization or efficiency

Q3: Can unsupervised learning create labels for supervised learning?

Yes! This is called semi-supervised learning or self-training:

graph LR
    A["Unlabeled Data"] --> B["Clustering"]
    B --> C["Pseudo-labels"]
    C --> D["Train Classifier"]
    D --> E["Final Model"]
    style C fill:#fff9c4

This approach can leverage large amounts of unlabeled data to improve models when labeled data is scarce.
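
A minimal sketch of this pseudo-labeling idea (the choice of RandomForestClassifier as the downstream model is an illustrative assumption):

🐍 Python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X = load_iris().data                      # pretend the true labels are unavailable

# Step 1: clustering invents pseudo-labels from the data's structure alone
pseudo_labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X)

# Step 2: train an ordinary supervised classifier on those pseudo-labels
clf = RandomForestClassifier(random_state=42)
clf.fit(X, pseudo_labels)

# The resulting model assigns new samples to the discovered groups
print(clf.predict(X[:5]))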

Q4: What’s the difference between clustering and classification?

| Aspect | Clustering | Classification |
| --- | --- | --- |
| Labels | No predefined labels | Known classes |
| Goal | Discover groups | Assign to known groups |
| Evaluation | Internal metrics | Accuracy, F1, etc. |
| Learning type | Unsupervised | Supervised |

The Challenge of Unsupervised Learning

Key insight: Unsupervised learning has no single “correct answer.” Different algorithms may produce very different results on the same data. Choosing the right number of clusters or the right algorithm requires domain knowledge and experimentation.

UML Series Roadmap

This series will cover the following topics:

| Post | Topic | Key Concepts |
| --- | --- | --- |
| UML-01 | Introduction (this post) | Overview and taxonomy |
| UML-02 | K-Means Clustering | Lloyd’s algorithm, initialization, elbow method |
| UML-03 | Hierarchical Clustering | Dendrograms, linkage methods |
| UML-04 | DBSCAN | Density-based clustering, core points |
| UML-05 | Gaussian Mixture Models | EM algorithm, soft clustering |
| UML-06 | PCA | Eigendecomposition, variance explained |
| UML-07 | t-SNE & UMAP | Visualization techniques |
| UML-08 | Anomaly Detection | Isolation Forest, LOF |
| UML-09 | Association Rules | Apriori, market basket analysis |
| UML-10 | Conclusion | Algorithm selection guide |

Summary

| Concept | Key Points |
| --- | --- |
| Unsupervised Learning | Learning from unlabeled data |
| Clustering | Grouping similar data points together |
| Dimensionality Reduction | Compressing data to fewer dimensions |
| Anomaly Detection | Finding unusual data points |
| Key Challenge | No labels means no single “correct” answer |
| Evaluation | Requires internal metrics or domain knowledge |
