ML-13: Naive Bayes: The Power of Generative Models
Learning Objectives
- Distinguish discriminative vs generative models
- Apply Bayes’ theorem for classification
- Understand the “naive” independence assumption
- Handle zero probabilities with Laplace smoothing
Theory
Generative vs Discriminative Models
Before diving into Naive Bayes, it helps to understand two fundamentally different approaches to classification.
Discriminative Models
Directly model $P(y|x)$ — the probability of the class given the features.
Analogy: A discriminative approach to identifying dogs vs cats is like memorizing the differences between them: “If it has pointy ears and meows, it’s a cat.”
Examples: Logistic Regression, SVM, Neural Networks
Generative Models
Model $P(x|y)$ — the probability of features given each class — then use Bayes’ theorem.
Analogy: A generative approach is like learning what dogs look like and what cats look like separately, then asking “Which model better explains this animal?”
Examples: Naive Bayes, Gaussian Mixture Models, Hidden Markov Models
| Aspect | Discriminative | Generative |
|---|---|---|
| Models | Decision boundary directly | How data is generated |
| Learns | $P(y \mid x)$ | $P(x \mid y)$ and $P(y)$ |
| Pros | Often more accurate | Can generate new data, handles missing values |
| Cons | Can’t generate data | Makes stronger assumptions |
Bayes’ Theorem: The Foundation
The heart of Naive Bayes:
$$\boxed{P(y|x) = \frac{P(x|y) \cdot P(y)}{P(x)}}$$
Each term has a specific meaning:
| Term | Name | Meaning |
|---|---|---|
| $P(y \mid x)$ | Posterior | Probability of class $y$ after seeing features $x$ |
| $P(x \mid y)$ | Likelihood | How likely are these features for class $y$ |
| $P(y)$ | Prior | Base probability of class $y$ (before seeing data) |
| $P(x)$ | Evidence | Probability of observing these features |
Spam Classification Example
Suppose we want to classify: “Free money now!”
| Term | Calculation |
|---|---|
| Prior $P(\text{spam})$ | 30% of all emails are spam → $P(\text{spam}) = 0.3$ |
| Likelihood $P(\text{“free money”} \mid \text{spam})$ | 80% of spam emails contain these words → $P(x \mid \text{spam}) = 0.8$ |
| Likelihood $P(\text{“free money”} \mid \text{ham})$ | Only 1% of legit emails have these words → $P(x \mid \text{ham}) = 0.01$ |
Applying Bayes: $$P(\text{spam}|\text{“free money”}) = \frac{0.8 \times 0.3}{0.8 \times 0.3 + 0.01 \times 0.7} = \frac{0.24}{0.247} \approx 97\%$$
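As a quick check, here is the same arithmetic in a few lines of Python; the probabilities are the illustrative values from the table above:
```python
# Posterior P(spam | "free money") via Bayes' theorem.
# The probabilities are the illustrative values from the table above.
p_spam, p_ham = 0.3, 0.7       # priors
p_x_given_spam = 0.8           # P("free money" | spam)
p_x_given_ham = 0.01           # P("free money" | ham)

evidence = p_x_given_spam * p_spam + p_x_given_ham * p_ham
posterior = p_x_given_spam * p_spam / evidence
print(f"P(spam | 'free money') = {posterior:.3f}")   # -> 0.972
```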
The “Naive” Independence Assumption
Here’s the problem: with many features, $P(x_1, x_2, …, x_n | y)$ has exponentially many parameters to estimate.
The Solution: Assume Independence
Naive Bayes assumes that features are conditionally independent given the class:
$$P(x_1, x_2, …, x_n | y) = \prod_{i=1}^n P(x_i | y)$$
This reduces parameters from $O(|V|^n)$ to $O(n \cdot |V|)$!
Why “Naive”?
This assumption is often wrong! In spam detection:
- “Free” and “money” are highly correlated
- “Nigerian” and “prince” appear together
Yet Naive Bayes works surprisingly well because:
- Classification only needs the argmax — exact probabilities don’t matter
- Errors often cancel out across features
- Simple models with lots of data often beat complex models with little data
The Three Variants
Multinomial Naive Bayes
For count data — how many times each feature appears.
$$P(x|y) = \frac{(\sum_i x_i)!}{\prod_i x_i!} \prod_i P(w_i|y)^{x_i}$$
In practice, we use log-probabilities:
$$\log P(y|x) \propto \log P(y) + \sum_i x_i \cdot \log P(w_i|y)$$
| Use Case | Example |
|---|---|
| Text classification | Word counts in documents |
| Topic modeling | Term frequency vectors |
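A minimal sketch of the log-space scoring above, using made-up word probabilities and counts:
```python
import numpy as np

# Scoring one document with Multinomial NB in log-space.
# The word probabilities and counts are made-up illustrative values.
log_prior_spam = np.log(0.3)
word_probs_spam = {"free": 0.05, "money": 0.04, "meeting": 0.001}  # P(w_i | spam)
doc_counts = {"free": 2, "money": 1, "meeting": 0}                 # counts x_i

log_score = log_prior_spam + sum(
    x * np.log(word_probs_spam[w]) for w, x in doc_counts.items()
)
print(f"log-score for spam: {log_score:.2f}")
```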
Bernoulli Naive Bayes
For binary features — presence/absence only.
$$P(x|y) = \prod_i P(w_i|y)^{x_i} \cdot (1 - P(w_i|y))^{(1-x_i)}$$
Key difference from Multinomial: explicitly penalizes absence of features.
| Use Case | Example |
|---|---|
| Short text (tweets) | Word presence, not count |
| Binary features | “Has feature X?” questions |
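A minimal sketch of the Bernoulli likelihood, with made-up probabilities, highlighting how absent words still contribute through the $(1 - P(w_i|y))$ term:
```python
import numpy as np

# Bernoulli NB log-likelihood: absent words contribute via log(1 - P(w|y)).
# The probabilities are made-up illustrative values.
word_probs_spam = {"free": 0.6, "money": 0.5, "meeting": 0.05}  # P(w present | spam)
doc_presence = {"free": 1, "money": 0, "meeting": 0}            # binary features x_i

log_likelihood = sum(
    doc_presence[w] * np.log(p) + (1 - doc_presence[w]) * np.log(1 - p)
    for w, p in word_probs_spam.items()
)
print(f"log P(x | spam) = {log_likelihood:.2f}")
```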
Gaussian Naive Bayes
For continuous features — assumes normal distribution.
$$P(x_i|y) = \frac{1}{\sqrt{2\pi\sigma_{y,i}^2}} \exp\left(-\frac{(x_i - \mu_{y,i})^2}{2\sigma_{y,i}^2}\right)$$
For each class $y$ and feature $i$, estimate:
- $\mu_{y,i}$ = mean of feature $i$ for class $y$
- $\sigma_{y,i}^2$ = variance of feature $i$ for class $y$
| Use Case | Example |
|---|---|
| Numeric data | Height, weight, sensor readings |
| Mixed with other models | Baseline comparison |
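A minimal sketch of the per-feature Gaussian likelihood, using made-up class statistics for a single feature:
```python
import numpy as np

# Per-feature Gaussian likelihood used by Gaussian NB.
# Mean and variance are made-up class statistics for one feature (e.g. height in cm).
mu, var = 170.0, 25.0    # mu_{y,i}, sigma_{y,i}^2 estimated from training data
x = 175.0                # feature value of a new sample

likelihood = np.exp(-((x - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
print(f"P(x = {x} | y) = {likelihood:.4f}")   # ~0.0484
```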
Laplace Smoothing: Handling Zero Probabilities
The Problem
If a word never appears in spam training emails, its probability is zero:
$$P(\text{“meeting”}|\text{spam}) = \frac{0}{\text{total spam words}} = 0$$
This zeros out the entire product, regardless of other evidence!
The Solution: Add-α Smoothing
Add a small count $\alpha$ to every feature:
$$P(x_i|y) = \frac{\text{count}(x_i, y) + \alpha}{\text{count}(y) + \alpha \cdot |V|}$$
where $|V|$ is the vocabulary size.
| α Value | Effect |
|---|---|
| $\alpha = 0$ | No smoothing (risk of zeros) |
| $\alpha = 1$ | Laplace smoothing (most common) |
| $\alpha > 1$ | Stronger smoothing → more uniform |
| $\alpha < 1$ | Lidstone smoothing (less shrinkage) |
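A minimal sketch of add-α smoothing with made-up counts; note how the unseen word gets a small non-zero probability:
```python
# Add-alpha smoothing with made-up word counts for the spam class.
alpha = 1.0
vocab = ["free", "money", "meeting"]
spam_counts = {"free": 30, "money": 20, "meeting": 0}   # count(x_i, spam)
total = sum(spam_counts.values())                       # count(spam)

smoothed = {w: (spam_counts[w] + alpha) / (total + alpha * len(vocab)) for w in vocab}
print(smoothed)   # "meeting" now has a small non-zero probability
```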
In practice, alpha=1.0 (Laplace smoothing) is the usual default. For text classification, try values in [0.01, 1.0] via cross-validation.
Code Practice
The following examples demonstrate Naive Bayes in action, from text classification to continuous data.
Text Classification: Spam Detection
🐍 Python
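A minimal sketch, assuming scikit-learn is available; the tiny corpus below is made up for illustration:
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Minimal spam-detection sketch on a tiny, made-up corpus.
emails = [
    "free money now click here",          # spam
    "win a free prize claim now",         # spam
    "meeting scheduled for monday",       # ham
    "please review the project report",   # ham
]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)       # bag-of-words counts

model = MultinomialNB(alpha=1.0)           # Laplace smoothing
model.fit(X, labels)

test = vectorizer.transform(["free prize meeting"])
print(model.predict(test))                 # predicted class
print(model.predict_proba(test).round(3))  # class probabilities
```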
Gaussian Naive Bayes for Continuous Data
🐍 Python
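A minimal sketch using scikit-learn's GaussianNB on the built-in iris dataset:
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Gaussian NB on continuous features (iris measurements).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = GaussianNB()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print("Per-class feature means:\n", model.theta_.round(2))  # learned mu_{y,i}
```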
Effect of Laplace Smoothing (α)
🐍 Python
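A minimal sketch, assuming the 20 newsgroups dataset can be downloaded via scikit-learn; the chosen categories and α grid are illustrative:
```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# How alpha affects cross-validated accuracy on a two-class text problem.
data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"],
                          remove=("headers", "footers", "quotes"))
X = CountVectorizer(stop_words="english").fit_transform(data.data)

for alpha in [0.001, 0.01, 0.1, 1.0, 10.0]:
    scores = cross_val_score(MultinomialNB(alpha=alpha), X, data.target, cv=5)
    print(f"alpha={alpha:<6} mean CV accuracy: {scores.mean():.3f}")
```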
Real-World: Text Classification Pipeline
🐍 Python
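A minimal end-to-end pipeline sketch, again assuming the 20 newsgroups dataset is available; the category list and α value are illustrative choices:
```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

# End-to-end text classification: TF-IDF features + Multinomial NB.
categories = ["sci.med", "sci.space", "talk.politics.misc"]
train = fetch_20newsgroups(subset="train", categories=categories,
                           remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", categories=categories,
                          remove=("headers", "footers", "quotes"))

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", sublinear_tf=True)),
    ("nb", MultinomialNB(alpha=0.1)),
])
pipeline.fit(train.data, train.target)

print(classification_report(test.target, pipeline.predict(test.data),
                            target_names=test.target_names))
```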
Deep Dive
Frequently Asked Questions
Q1: Why is it called “naive”?
The conditional independence assumption is almost always wrong in practice:
| Example | Reality | Naive Assumption |
|---|---|---|
| “free” and “money” in spam | Highly correlated | Treated as independent |
| Pixel intensities in images | Neighboring pixels are similar | Treated as independent |
Yet it works because:
- Only the argmax is needed — relative ordering matters, not exact probabilities
- Errors tend to cancel out when averaged over many features
- Fewer parameters = less overfitting with limited data
Q2: When does Naive Bayes excel?
| Scenario | Why NB Works Well |
|---|---|
| Text classification | High-dim sparse features, independence “good enough” |
| Spam filtering | Fast, incremental updates, interpretable |
| As a baseline | Quick to train, hard to beat on small data |
| Multi-class problems | Naturally handles many classes |
| Real-time systems | Prediction is one fast linear pass over the features |
Q3: When does Naive Bayes fail?
| Scenario | Why It Fails | Alternative |
|---|---|---|
| Highly correlated features | Independence assumption badly violated | Logistic Regression, SVM |
| Need calibrated probabilities | NB probabilities are often extreme | Use CalibratedClassifierCV |
| Complex decision boundaries | Linear in log-space only | Random Forest, Neural Networks |
| Numeric features, non-Gaussian | Gaussian NB assumes normality | Transform data or use other models |
Q4: Naive Bayes vs. Logistic Regression vs. SVM
| Aspect | Naive Bayes | Logistic Regression | SVM |
|---|---|---|---|
| Type | Generative | Discriminative | Discriminative |
| Training speed | ⚡ Very fast | 🔶 Fast | 🐢 Slow (RBF) |
| Prediction speed | ⚡ Very fast | ⚡ Very fast | 🔶 Depends on SVs |
| Probability output | ⚠️ Uncalibrated | ✅ Well-calibrated | ⚠️ Needs calibration |
| Handles correlations | ❌ No | ✅ Yes | ✅ Yes |
| Best for | Text, baselines | General classification | Clear margins |
Practical Tips
Text classification checklist:
- Start with MultinomialNB + TF-IDF
- Tune α in [0.01, 0.1, 1.0] via CV
- Compare with Logistic Regression
- Use ComplementNB for imbalanced data (see the sketch after this list)
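A minimal sketch of swapping in ComplementNB; the tiny imbalanced corpus is made up for illustration:
```python
from sklearn.naive_bayes import ComplementNB, MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# ComplementNB is a drop-in replacement for MultinomialNB that is
# usually more robust when classes are imbalanced.
docs = ["free money offer", "cheap pills online", "project status update",
        "meeting agenda attached", "quarterly budget review", "lunch on friday?"]
labels = ["spam", "spam", "ham", "ham", "ham", "ham"]   # imbalanced toward ham

for clf in (MultinomialNB(alpha=0.5), ComplementNB(alpha=0.5)):
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(docs, labels)
    print(type(clf).__name__, "->", model.predict(["free meeting offer"]))
```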
Variant Selection Guide
- Word counts or term frequencies → Multinomial NB
- Binary presence/absence features → Bernoulli NB
- Continuous numeric features → Gaussian NB
- Imbalanced text data → Complement NB
Common Pitfalls
| Pitfall | Symptom | Solution |
|---|---|---|
| Zero probabilities | Prediction always same class | Use α > 0 (Laplace smoothing) |
| Extreme probabilities | All predictions near 0 or 1 | Calibrate with CalibratedClassifierCV |
| Wrong variant | Poor performance | Match variant to feature type |
| No text preprocessing | Low accuracy on text | Use TF-IDF, remove stop words |
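For the calibration pitfall, a minimal sketch of wrapping Naive Bayes in CalibratedClassifierCV, using a built-in scikit-learn dataset for illustration:
```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Calibrating Naive Bayes probabilities with CalibratedClassifierCV.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

raw = GaussianNB().fit(X_train, y_train)
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5)
calibrated.fit(X_train, y_train)

# Raw NB probabilities tend to cluster near 0 or 1; calibrated ones are smoother.
print("raw       :", raw.predict_proba(X_test[:3]).round(3))
print("calibrated:", calibrated.predict_proba(X_test[:3]).round(3))
```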
Summary
| Concept | Key Points |
|---|---|
| Generative Model | Models $P(x \mid y)$, then applies Bayes’ theorem |
| Naive Assumption | Features conditionally independent given class |
| Laplace Smoothing | Adds α to avoid zero probability |
| Variants | Bernoulli (binary), Multinomial (counts), Gaussian (continuous) |