UML-10: Unsupervised Learning Series — Conclusion and Algorithm Guide
Summary
The Explorer's Field Guide: A complete index of the Unsupervised Learning series, summarizing 9 algorithms with their core analogies and a master decision compass.
The Explorer’s Journal
We began this journey with a Blank Map (“Dark Data”). Over 9 expeditions, we have filled it with tools, sketches, and discoveries. You now possess a complete Unsupervised Learning Toolkit.
The Toolkit Summary
Here is your index of the 9 powerful tools we’ve mastered:
| Algorithm | Analogy | Best For… |
|---|---|---|
| K-Means | The Delivery Center | Simple, spherical groups (Optimization). |
| Hierarchical | The Digital Librarian | Organizing data into a taxonomy. |
| DBSCAN | The Island Finder | Weird shapes & checking for noise. |
| GMM | The Foggy Explorer | Overlapping, soft clusters (Probability). |
| PCA | The Photographer | Compression & finding the “best angle”. |
| t-SNE | The Social Planner | Visualizing local neighborhood structure. |
| UMAP | The Sketch Artist | Fast, global structure visualization. |
| Anomaly | The Rare Species Hunter | Finding fraud & outliers (Phoenixes). |
| Assoc. Rules | The Symbiosis Scouter | Finding interactions (Clownfish & Anemone). |
The Decision Compass
Where should you go next? Use this map to navigate your future data expeditions.
```mermaid
graph TD
    A["🎯 Start: Unknown Data"] --> B{"What is your Goal?"}
    B -->|Find Groups| C{"Do you know how<br/>many groups (K)?"}
    B -->|Simplify Data| D{"Preprocessing or<br/>Visualization?"}
    B -->|Find Outliers| E["Anomaly Detection"]
    B -->|Find Rules| F["Association Rules<br/>(Apriori)"]
    C -->|Yes, e.g. Size S/M/L| G["K-Means / GMM"]
    C -->|No, discover K| H{"Is the shape<br/>complex?"}
    H -->|No| I["Hierarchical Clustering"]
    H -->|Yes, arbitrary| J["DBSCAN / HDBSCAN"]
    D -->|Visualize, non-linear| K["t-SNE / UMAP"]
    D -->|Compress, linear| L["PCA"]
    E --> M["Isolation Forest / LOF"]
    style G fill:#c8e6c9
    style J fill:#c8e6c9
    style K fill:#fff9c4
    style L fill:#fff9c4
    style M fill:#ffcdd2
```
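The compass can also be read as a plain lookup function. The sketch below mirrors the flowchart's branches; the function name and argument names are illustrative, not part of any library:

```python
def choose_algorithm(goal, k_known=False, complex_shape=False, visualize=False):
    """Encode the decision compass: pick a family of algorithms from
    a stated goal and a few yes/no answers."""
    if goal == "groups":
        if k_known:
            return "K-Means / GMM"
        # K unknown: shape decides between taxonomy and density methods.
        return "DBSCAN / HDBSCAN" if complex_shape else "Hierarchical Clustering"
    if goal == "simplify":
        return "t-SNE / UMAP" if visualize else "PCA"
    if goal == "outliers":
        return "Isolation Forest / LOF"
    if goal == "rules":
        return "Association Rules (Apriori)"
    raise ValueError(f"Unknown goal: {goal!r}")

print(choose_algorithm("groups", k_known=False, complex_shape=True))
```

In practice the branches are rarely this clean, so treat the function as a first guess to be validated against your data.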
Algorithm Comparison
Clustering Algorithms
| Algorithm | Specify K? | Cluster Shape | Noise Handling | Complexity |
|---|---|---|---|---|
| K-Means | ✓ Yes | Spherical | ✗ Poor | $O(NKdI)$ |
| Hierarchical | ✗ No | Depends on linkage | ✗ Poor | $O(N^2 \log N)$ |
| DBSCAN | ✗ No | Any | ✓ Built-in | $O(N \log N)$ |
| GMM | ✓ Yes | Elliptical | ✗ Poor | $O(NKd^2)$ per iteration |
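The "Cluster Shape" column is easy to see on data K-Means cannot handle. A minimal sketch, assuming scikit-learn is installed; `eps=0.2` is an illustrative value for this unscaled toy dataset, not a general default:

```python
# Two interleaved half-moons: spherical assumptions fail, density succeeds.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN

X, y = make_moons(n_samples=300, noise=0.05, random_state=42)

# K-Means partitions space with a straight boundary through both crescents.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# DBSCAN follows the dense crescents regardless of their shape.
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
```

Comparing both label sets against `y` (e.g. with adjusted Rand index) makes the gap concrete.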
Dimensionality Reduction
| Algorithm | Linear? | Preserves | Speed | Use Case |
|---|---|---|---|---|
| PCA | ✓ Yes | Global variance | Fast | Preprocessing, compression |
| t-SNE | ✗ No | Local structure | Slow | Visualization (small data) |
| UMAP | ✗ No | Local + some global | Fast | Visualization (large data) |
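The "Preserves global variance" row for PCA can be demonstrated in a few lines. A sketch assuming scikit-learn and NumPy; the synthetic data is built so that almost all variance lives in a 2-D subspace of a 10-D space:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 500 points whose variation comes from 2 latent directions, plus tiny noise.
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 10))

# Two components recover nearly all of the variance: 10-D -> 2-D compression.
pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_.sum())
```

When the printed ratio is close to 1.0, the compressed representation loses almost nothing; real data rarely compresses this cleanly, which is what the explained-variance plot from UML-06 diagnoses.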
Quick Reference
| Scenario | Recommended Algorithm |
|---|---|
| Customer segmentation | K-Means → GMM |
| Document clustering | K-Means with TF-IDF |
| Anomaly detection | Isolation Forest |
| High-D visualization | UMAP → t-SNE |
| Feature compression | PCA |
| Market basket analysis | Apriori / FP-Growth |
| Unknown cluster count | DBSCAN / Hierarchical |
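For the anomaly-detection row, a minimal Isolation Forest sketch, assuming scikit-learn; the planted outliers and the `contamination` value are illustrative:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))   # the usual cloud
outliers = np.array([[8.0, 8.0], [-9.0, 7.0]])           # planted anomalies
X = np.vstack([normal, outliers])

# contamination sets the expected fraction of anomalies in the data.
iso = IsolationForest(contamination=0.02, random_state=42).fit(X)
pred = iso.predict(X)   # +1 = inlier, -1 = anomaly
print(pred[-2:])
```

Isolation Forest flags the planted points because they are isolated by very few random splits, exactly the "Rare Species Hunter" intuition from UML-08.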
Key Concepts Summary
Core Principles
| Concept | Key Insight |
|---|---|
| No labels needed | Discover structure from data alone |
| Evaluation is hard | No ground truth — use internal metrics + domain knowledge |
| Preprocessing matters | Scaling is often required |
| Multiple methods | Try several algorithms, compare results |
| Domain knowledge | Results need interpretation |
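The "Preprocessing matters" row deserves one concrete illustration: when features sit on wildly different scales, any distance-based method effectively sees only the largest one. A sketch with scikit-learn's `StandardScaler` on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Feature 1 is in single digits; feature 2 (say, income) is in the thousands.
X = np.array([[1.0,  20000.0],
              [2.0,  21000.0],
              [1.5, 300000.0]])

# Standardize each column to mean 0, standard deviation 1,
# so both features contribute comparably to Euclidean distances.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0))   # per-column means, now ~0
print(X_scaled.std(axis=0))    # per-column stds, now ~1
```

Run K-Means on `X` and on `X_scaled` and the cluster assignments will generally differ, because the unscaled version is driven almost entirely by the income column.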
Method Selection Checklist
Before choosing an algorithm:
- ✅ What’s your goal? (clustering, reduction, anomaly, patterns)
- ✅ How many clusters expected? (known K or not)
- ✅ What shape are clusters? (spherical, arbitrary)
- ✅ Is there noise? (need noise handling)
- ✅ How big is the data? (scalability concerns)
- ✅ Do you need probabilities? (GMM vs K-Means)
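The checklist's "try several algorithms" advice can be run as a small loop: fit each candidate and compare them with an internal metric, since there is no ground truth. A sketch assuming scikit-learn; the candidate set and parameters are illustrative, not tuned:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=7)

candidates = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=7),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
}

# Silhouette is an internal metric: higher means tighter, better-separated
# clusters, computed without any labels.
scores = {name: silhouette_score(X, model.fit_predict(X))
          for name, model in candidates.items()}
print(scores)
```

The scores rank candidates, but per the "Evaluation is hard" principle above, the final call should still combine the metric with domain knowledge.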
Recommended Next Steps
Deep Learning Extensions
- Autoencoders: Neural network dimensionality reduction
- VAE: Variational Autoencoders for generative modeling
- Self-supervised learning: Learn representations without labels
Advanced Topics
- Spectral clustering: Graph-based clustering
- Gaussian Processes: Probabilistic function learning
- Topic modeling: LDA for text analysis
Real-World Practice
- Kaggle unsupervised competitions
- Customer segmentation projects
- Anomaly detection in time series
- Image clustering and retrieval
Series at a Glance
| Post | Algorithm | Key Takeaway |
|---|---|---|
| UML-01 | Introduction | Taxonomy of unsupervised methods |
| UML-02 | K-Means | Lloyd’s algorithm, elbow method |
| UML-03 | Hierarchical | Dendrograms, linkage methods |
| UML-04 | DBSCAN | Density-based, noise handling |
| UML-05 | GMM | Soft clustering, EM algorithm |
| UML-06 | PCA | Variance explained, eigendecomposition |
| UML-07 | t-SNE/UMAP | Non-linear visualization |
| UML-08 | Anomaly Detection | Isolation Forest, LOF |
| UML-09 | Association Rules | Support, confidence, lift |
References
- scikit-learn User Guide: Unsupervised Learning
- “Pattern Recognition and Machine Learning” by Christopher Bishop
- “The Elements of Statistical Learning” by Hastie, Tibshirani, and Friedman
- UMAP documentation
- mlxtend documentation