ML-13: Naive Bayes: The Power of Generative Models
Learning Objectives
- Distinguish discriminative vs generative models
- Apply Bayes’ theorem for classification
- Understand the “naive” independence assumption
- Handle zero probabilities with Laplace smoothing
Theory
Generative vs Discriminative Models
Before diving into Naive Bayes, it helps to understand two fundamentally different approaches to classification.
Discriminative Models
Directly model $P(y|x)$ — the probability of the class given the features.
Analogy: A discriminative approach to identifying dogs vs cats is like memorizing the differences between them: “If it has pointy ears and meows, it’s a cat.”
Examples: Logistic Regression, SVM, Neural Networks
Generative Models
Model $P(x|y)$ — the probability of features given each class — then use Bayes’ theorem.
Analogy: A generative approach is like learning what dogs look like and what cats look like separately, then asking “Which model better explains this animal?”
Examples: Naive Bayes, Gaussian Mixture Models, Hidden Markov Models
| Aspect | Discriminative | Generative |
|---|---|---|
| Models | Decision boundary directly | How data is generated |
| Learns | $P(y \mid x)$ | $P(x \mid y)$ and $P(y)$ |
| Pros | Often more accurate | Can generate new data, handles missing values |
| Cons | Can’t generate data | Makes stronger assumptions |
Bayes’ Theorem: The Foundation
The heart of Naive Bayes:
$$\boxed{P(y|x) = \frac{P(x|y) \cdot P(y)}{P(x)}}$$
Each term has a specific meaning:
| Term | Name | Meaning |
|---|---|---|
| $P(y \mid x)$ | Posterior | Probability of class $y$ after seeing features $x$ |
| $P(x \mid y)$ | Likelihood | How likely are these features for class $y$ |
| $P(y)$ | Prior | Base probability of class $y$ (before seeing data) |
| $P(x)$ | Evidence | Probability of observing these features |
Spam Classification Example
Suppose we want to classify: “Free money now!”
| Term | Calculation |
|---|---|
| Prior $P(\text{spam})$ | 30% of all emails are spam → $P(\text{spam}) = 0.3$ |
| Likelihood $P(\text{“free money”} \mid \text{spam})$ | 80% of spam emails contain these words → $P(x \mid \text{spam}) = 0.8$ |
| Likelihood $P(\text{“free money”} \mid \text{ham})$ | Only 1% of legit emails have these words → $P(x \mid \text{ham}) = 0.01$ |
Applying Bayes: $$P(\text{spam}|\text{“free money”}) = \frac{0.8 \times 0.3}{0.8 \times 0.3 + 0.01 \times 0.7} = \frac{0.24}{0.247} \approx 97\%$$
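As a quick check, here is the same arithmetic in a few lines of Python; the probabilities are the illustrative values from the table above:
```python
# Posterior P(spam | "free money") via Bayes' theorem.
# The probabilities are the illustrative values from the table above.
p_spam, p_ham = 0.3, 0.7       # priors
p_x_given_spam = 0.8           # P("free money" | spam)
p_x_given_ham = 0.01           # P("free money" | ham)

evidence = p_x_given_spam * p_spam + p_x_given_ham * p_ham
posterior = p_x_given_spam * p_spam / evidence
print(f"P(spam | 'free money') = {posterior:.3f}")   # -> 0.972
```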
The “Naive” Independence Assumption
Here’s the problem: with many features, $P(x_1, x_2, …, x_n | y)$ has exponentially many parameters to estimate.
The Solution: Assume Independence
Naive Bayes assumes that features are conditionally independent given the class:
$$P(x_1, x_2, …, x_n | y) = \prod_{i=1}^n P(x_i | y)$$
This reduces parameters from $O(|V|^n)$ to $O(n \cdot |V|)$!
Why “Naive”?
This assumption is often wrong! In spam detection:
- “Free” and “money” are highly correlated
- “Nigerian” and “prince” appear together
Yet Naive Bayes works surprisingly well because:
- Classification only needs the argmax — exact probabilities don’t matter
- Errors often cancel out across features
- Simple models with lots of data often beat complex models with little data
The Three Variants
Multinomial Naive Bayes
For count data — how many times each feature appears.
$$P(x|y) = \frac{(\sum_i x_i)!}{\prod_i x_i!} \prod_i P(w_i|y)^{x_i}$$
In practice, we use log-probabilities:
$$\log P(y|x) \propto \log P(y) + \sum_i x_i \cdot \log P(w_i|y)$$
| Use Case | Example |
|---|---|
| Text classification | Word counts in documents |
| Topic modeling | Term frequency vectors |
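A minimal sketch of the log-space scoring above, using made-up word probabilities and counts:
```python
import numpy as np

# Scoring one document with Multinomial NB in log-space.
# The word probabilities and counts are made-up illustrative values.
log_prior_spam = np.log(0.3)
word_probs_spam = {"free": 0.05, "money": 0.04, "meeting": 0.001}  # P(w_i | spam)
doc_counts = {"free": 2, "money": 1, "meeting": 0}                 # counts x_i

log_score = log_prior_spam + sum(
    x * np.log(word_probs_spam[w]) for w, x in doc_counts.items()
)
print(f"log-score for spam: {log_score:.2f}")
```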
Bernoulli Naive Bayes
For binary features — presence/absence only.
$$P(x|y) = \prod_i P(w_i|y)^{x_i} \cdot (1 - P(w_i|y))^{(1-x_i)}$$
Key difference from Multinomial: explicitly penalizes absence of features.
| Use Case | Example |
|---|---|
| Short text (tweets) | Word presence, not count |
| Binary features | “Has feature X?” questions |
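A minimal sketch of the Bernoulli likelihood, with made-up probabilities, highlighting how absent words still contribute through the $(1 - P(w_i|y))$ term:
```python
import numpy as np

# Bernoulli NB log-likelihood: absent words contribute via log(1 - P(w|y)).
# The probabilities are made-up illustrative values.
word_probs_spam = {"free": 0.6, "money": 0.5, "meeting": 0.05}  # P(w present | spam)
doc_presence = {"free": 1, "money": 0, "meeting": 0}            # binary features x_i

log_likelihood = sum(
    doc_presence[w] * np.log(p) + (1 - doc_presence[w]) * np.log(1 - p)
    for w, p in word_probs_spam.items()
)
print(f"log P(x | spam) = {log_likelihood:.2f}")
```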
Gaussian Naive Bayes
For continuous features — assumes normal distribution.
$$P(x_i|y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^2}} \exp\left(-\frac{(x_i - \mu_{y,i})^2}{2\sigma_{y,i}^2}\right)$$
For each class $y$ and feature $i$, estimate:
- $\mu_{y,i}$ = mean of feature $i$ for class $y$
- $\sigma_{y,i}^2$ = variance of feature $i$ for class $y$
| Use Case | Example |
|---|---|
| Numeric data | Height, weight, sensor readings |
| Mixed with other models | Baseline comparison |
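A minimal sketch of the per-feature Gaussian likelihood, using made-up class statistics for a single feature:
```python
import numpy as np

# Per-feature Gaussian likelihood used by Gaussian NB.
# Mean and variance are made-up class statistics for one feature (e.g. height in cm).
mu, var = 170.0, 25.0    # mu_{y,i}, sigma_{y,i}^2 estimated from training data
x = 175.0                # feature value of a new sample

likelihood = np.exp(-((x - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
print(f"P(x = {x} | y) = {likelihood:.4f}")   # ~0.0484
```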
Laplace Smoothing: Handling Zero Probabilities
The Problem
If a word never appears in spam training emails, its probability is zero:
$$P(\text{“meeting”}|\text{spam}) = \frac{0}{\text{total spam words}} = 0$$
This zeros out the entire product, regardless of other evidence!
The Solution: Add-α Smoothing
Add a small count $\alpha$ to every feature:
$$P(x_i|y) = \frac{\text{count}(x_i, y) + \alpha}{\text{count}(y) + \alpha \cdot |V|}$$
where $|V|$ is the vocabulary size.
| α Value | Effect |
|---|---|
| $\alpha = 0$ | No smoothing (risk of zeros) |
| $\alpha = 1$ | Laplace smoothing (most common) |
| $\alpha > 1$ | Stronger smoothing → more uniform |
| $\alpha < 1$ | Lidstone smoothing (less shrinkage) |
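A minimal sketch of add-α smoothing with made-up counts; note how the unseen word gets a small non-zero probability:
```python
# Add-alpha smoothing with made-up word counts for the spam class.
alpha = 1.0
vocab = ["free", "money", "meeting"]
spam_counts = {"free": 30, "money": 20, "meeting": 0}   # count(x_i, spam)
total = sum(spam_counts.values())                       # count(spam)

smoothed = {w: (spam_counts[w] + alpha) / (total + alpha * len(vocab)) for w in vocab}
print(smoothed)   # "meeting" now has a small non-zero probability
```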
In practice, alpha=1.0 (Laplace smoothing) is the usual default. For text classification, try values in [0.01, 1.0] via cross-validation.
Code Practice
The following examples demonstrate Naive Bayes in action, from text classification to continuous data.
Text Classification: Spam Detection
🐍 Python
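A minimal sketch, assuming scikit-learn is available; the tiny corpus below is made up for illustration:
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Minimal spam-detection sketch on a tiny, made-up corpus.
emails = [
    "free money now click here",          # spam
    "win a free prize claim now",         # spam
    "meeting scheduled for monday",       # ham
    "please review the project report",   # ham
]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)       # bag-of-words counts

model = MultinomialNB(alpha=1.0)           # Laplace smoothing
model.fit(X, labels)

test = vectorizer.transform(["free prize meeting"])
print(model.predict(test))                 # predicted class
print(model.predict_proba(test).round(3))  # class probabilities
```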
Gaussian Naive Bayes for Continuous Data
🐍 Python
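A minimal sketch using scikit-learn's GaussianNB on the built-in iris dataset:
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Gaussian NB on continuous features (iris measurements).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = GaussianNB()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print("Per-class feature means:\n", model.theta_.round(2))  # learned mu_{y,i}
```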
Effect of Laplace Smoothing (α)
🐍 Python
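A minimal sketch, assuming the 20 newsgroups dataset can be downloaded via scikit-learn; the chosen categories and α grid are illustrative:
```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# How alpha affects cross-validated accuracy on a two-class text problem.
data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"],
                          remove=("headers", "footers", "quotes"))
X = CountVectorizer(stop_words="english").fit_transform(data.data)

for alpha in [0.001, 0.01, 0.1, 1.0, 10.0]:
    scores = cross_val_score(MultinomialNB(alpha=alpha), X, data.target, cv=5)
    print(f"alpha={alpha:<6} mean CV accuracy: {scores.mean():.3f}")
```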
Real-World: Text Classification Pipeline
🐍 Python
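A minimal end-to-end pipeline sketch, again assuming the 20 newsgroups dataset is available; the category list and α value are illustrative choices:
```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# End-to-end text classification: TF-IDF features + Multinomial NB.
categories = ["sci.med", "sci.space", "talk.politics.misc"]
train = fetch_20newsgroups(subset="train", categories=categories,
                           remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", categories=categories,
                          remove=("headers", "footers", "quotes"))

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", sublinear_tf=True)),
    ("nb", MultinomialNB(alpha=0.1)),
])
pipeline.fit(train.data, train.target)

print(classification_report(test.target, pipeline.predict(test.data),
                            target_names=test.target_names))
```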
Deep Dive
Frequently Asked Questions
Q1: Why is it called “naive”?
The conditional independence assumption is almost always wrong in practice:
| Example | Reality | Naive Assumption |
|---|---|---|
| “free” and “money” in spam | Highly correlated | Treated as independent |
| Pixel intensities in images | Neighboring pixels are similar | Treated as independent |
Yet it works because:
- Only the argmax is needed — relative ordering matters, not exact probabilities
- Errors tend to cancel out when averaged over many features
- Fewer parameters = less overfitting with limited data
Q2: When does Naive Bayes excel?
| Scenario | Why NB Works Well |
|---|---|
| Text classification | High-dim sparse features, independence “good enough” |
| Spam filtering | Fast, incremental updates, interpretable |
| As a baseline | Quick to train, hard to beat on small data |
| Multi-class problems | Naturally handles many classes |
| Real-time systems | Prediction is one fast linear pass over the features |
Q3: When does Naive Bayes fail?
| Scenario | Why It Fails | Alternative |
|---|---|---|
| Highly correlated features | Independence assumption badly violated | Logistic Regression, SVM |
| Need calibrated probabilities | NB probabilities are often extreme | Use CalibratedClassifierCV |
| Complex decision boundaries | Linear in log-space only | Random Forest, Neural Networks |
| Numeric features, non-Gaussian | Gaussian NB assumes normality | Transform data or use other models |
Q4: Naive Bayes vs. Logistic Regression vs. SVM
| Aspect | Naive Bayes | Logistic Regression | SVM |
|---|---|---|---|
| Type | Generative | Discriminative | Discriminative |
| Training speed | ⚡ Very fast | 🔶 Fast | 🐢 Slow (RBF) |
| Prediction speed | ⚡ Very fast | ⚡ Very fast | 🔶 Depends on SVs |
| Probability output | ⚠️ Uncalibrated | ✅ Well-calibrated | ⚠️ Needs calibration |
| Handles correlations | ❌ No | ✅ Yes | ✅ Yes |
| Best for | Text, baselines | General classification | Clear margins |
Practical Tips
Text classification checklist:
- Start with MultinomialNB + TF-IDF
- Tune α in [0.01, 0.1, 1.0] via CV
- Compare with Logistic Regression
- Use ComplementNB for imbalanced data (see the sketch after this list)
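A minimal sketch of swapping in ComplementNB; the tiny imbalanced corpus is made up for illustration:
```python
from sklearn.naive_bayes import ComplementNB, MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# ComplementNB is a drop-in replacement for MultinomialNB that is
# usually more robust when classes are imbalanced.
docs = ["free money offer", "cheap pills online", "project status update",
        "meeting agenda attached", "quarterly budget review", "lunch on friday?"]
labels = ["spam", "spam", "ham", "ham", "ham", "ham"]   # imbalanced toward ham

for clf in (MultinomialNB(alpha=0.5), ComplementNB(alpha=0.5)):
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(docs, labels)
    print(type(clf).__name__, "->", model.predict(["free meeting offer"]))
```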
Variant Selection Guide
- Word counts or term frequencies → Multinomial NB
- Binary presence/absence features → Bernoulli NB
- Continuous numeric features → Gaussian NB
- Imbalanced text data → Complement NB
Common Pitfalls
| Pitfall | Symptom | Solution |
|---|---|---|
| Zero probabilities | Prediction always same class | Use α > 0 (Laplace smoothing) |
| Extreme probabilities | All predictions near 0 or 1 | Calibrate with CalibratedClassifierCV |
| Wrong variant | Poor performance | Match variant to feature type |
| No text preprocessing | Low accuracy on text | Use TF-IDF, remove stop words |
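For the calibration pitfall, a minimal sketch of wrapping Naive Bayes in CalibratedClassifierCV, using a built-in scikit-learn dataset for illustration:
```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Calibrating Naive Bayes probabilities with CalibratedClassifierCV.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

raw = GaussianNB().fit(X_train, y_train)
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5)
calibrated.fit(X_train, y_train)

# Raw NB probabilities tend to cluster near 0 or 1; calibrated ones are smoother.
print("raw       :", raw.predict_proba(X_test[:3]).round(3))
print("calibrated:", calibrated.predict_proba(X_test[:3]).round(3))
```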
Summary
| Concept | Key Points |
|---|---|
| Generative Model | Models $P(x \mid y)$, then applies Bayes’ theorem |
| Naive Assumption | Features conditionally independent given class |
| Laplace Smoothing | Adds α to avoid zero probability |
| Variants | Bernoulli (binary), Multinomial (counts), Gaussian (continuous) |