ML-13: Naive Bayes: The Power of Generative Models

Summary
Master Naive Bayes classification: understand generative vs discriminative models, apply Bayes' theorem with the naive independence assumption, and build powerful text classifiers for spam detection and beyond.

Learning Objectives

  • Distinguish discriminative vs generative models
  • Apply Bayes’ theorem for classification
  • Understand the “naive” independence assumption
  • Handle zero probabilities with Laplace smoothing

Theory

Generative vs Discriminative Models

Before diving into Naive Bayes, it helps to understand two fundamentally different approaches to classification.

Discriminative Models

Directly model $P(y|x)$ — the probability of the class given the features.

Analogy: A discriminative approach to identifying dogs vs cats is like memorizing the differences between them: “If it has pointy ears and meows, it’s a cat.”

Examples: Logistic Regression, SVM, Neural Networks

Generative Models

Model $P(x|y)$ — the probability of features given each class — then use Bayes’ theorem.

Analogy: A generative approach is like learning what dogs look like and what cats look like separately, then asking “Which model better explains this animal?”

Examples: Naive Bayes, Gaussian Mixture Models, Hidden Markov Models

flowchart LR
    subgraph disc["🎯 Discriminative"]
        direction TB
        X1[/"Input x"/] --> D(("P(y|x)")) --> Y1[\"Output y"\]
    end
    subgraph gen["🔮 Generative"]
        direction TB
        Y2[/"Class y"/] --> G(("P(x|y)")) --> B{{"Bayes"}} --> P[\"P(y|x)"\]
    end
    disc ~~~ gen
| Aspect | Discriminative | Generative |
| --- | --- | --- |
| Models | Decision boundary directly | How data is generated |
| Learns | $P(y \mid x)$ | $P(x \mid y)$ and $P(y)$ |
| Pros | Often more accurate | Can generate new data, handles missing values |
| Cons | Can’t generate data | Makes stronger assumptions |

Bayes’ Theorem: The Foundation

The heart of Naive Bayes:

$$\boxed{P(y|x) = \frac{P(x|y) \cdot P(y)}{P(x)}}$$

Each term has a specific meaning:

| Term | Name | Meaning |
| --- | --- | --- |
| $P(y \mid x)$ | Posterior | Probability of class $y$ after seeing features $x$ |
| $P(x \mid y)$ | Likelihood | How likely these features are for class $y$ |
| $P(y)$ | Prior | Base probability of class $y$ (before seeing data) |
| $P(x)$ | Evidence | Probability of observing these features |

Spam Classification Example

Suppose we want to classify: “Free money now!”

| Term | Calculation |
| --- | --- |
| Prior $P(\text{spam})$ | 30% of all emails are spam → $P(\text{spam}) = 0.3$ |
| Likelihood $P(\text{“free money”} \mid \text{spam})$ | 80% of spam emails contain these words → $P(x \mid \text{spam}) = 0.8$ |
| Likelihood $P(\text{“free money”} \mid \text{ham})$ | Only 1% of legit emails have these words → $P(x \mid \text{ham}) = 0.01$ |

Applying Bayes: $$P(\text{spam}|\text{“free money”}) = \frac{0.8 \times 0.3}{0.8 \times 0.3 + 0.01 \times 0.7} = \frac{0.24}{0.247} \approx 97\%$$

For classification, $P(x)$ can be ignored! Since it’s the same for all classes: $$\hat{y} = \arg\max_y P(y|x) = \arg\max_y P(x|y) \cdot P(y)$$
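
As a sanity check, the short snippet below (a minimal sketch with the example's probabilities hard-coded) reproduces this calculation and shows that the argmax shortcut gives the same decision without computing $P(x)$:

🐍 Python
# Hand-coded probabilities from the spam example above
p_spam, p_ham = 0.3, 0.7                    # priors P(y)
p_x_given_spam, p_x_given_ham = 0.8, 0.01   # likelihoods P(x|y)

# Full Bayes' theorem: normalize by the evidence P(x)
evidence = p_x_given_spam * p_spam + p_x_given_ham * p_ham
posterior_spam = p_x_given_spam * p_spam / evidence
print(f"P(spam | 'free money') = {posterior_spam:.1%}")   # ≈ 97.2%

# argmax shortcut: the evidence is identical for both classes,
# so comparing the unnormalized scores gives the same decision
score_spam = p_x_given_spam * p_spam   # 0.24
score_ham = p_x_given_ham * p_ham      # 0.007
print("Predicted:", "spam" if score_spam > score_ham else "ham")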

The “Naive” Independence Assumption

Here’s the problem: with many features, $P(x_1, x_2, …, x_n | y)$ has exponentially many parameters to estimate.

The Solution: Assume Independence

Naive Bayes assumes that features are conditionally independent given the class:

$$P(x_1, x_2, …, x_n | y) = \prod_{i=1}^n P(x_i | y)$$

This reduces parameters from $O(|V|^n)$ to $O(n \cdot |V|)$!

Why “Naive”?

This assumption is often wrong! In spam detection:

  • “Free” and “money” are highly correlated
  • “Nigerian” and “prince” appear together

Yet Naive Bayes works surprisingly well because:

  1. Classification only needs the argmax — exact probabilities don’t matter
  2. Errors often cancel out across features
  3. Simple models with lots of data often beat complex models with little data

Naive Bayes is a classic example of the bias-variance tradeoff: high bias (wrong independence assumption) but very low variance (few parameters to estimate).

The Three Variants

Multinomial Naive Bayes

For count data — how many times each feature appears.

$$P(x|y) = \frac{(\sum_i x_i)!}{\prod_i x_i!} \prod_i P(w_i|y)^{x_i}$$

In practice, we use log-probabilities:

$$\log P(y|x) \propto \log P(y) + \sum_i x_i \cdot \log P(w_i|y)$$

| Use Case | Example |
| --- | --- |
| Text classification | Word counts in documents |
| Topic modeling | Term frequency vectors |

Works great with TF-IDF (Term Frequency-Inverse Document Frequency) weighted features too!
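
To see the log-probability form in action, here is a minimal sketch that reuses the small spam/ham example from the Code Practice section below, recomputes $\log P(y) + \sum_i x_i \cdot \log P(w_i|y)$ from the fitted model's class_log_prior_ and feature_log_prob_ attributes, and checks that its argmax matches predict (the exact numbers depend on this toy data):

🐍 Python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# Same toy emails as in the Code Practice section below
texts = ["free money now", "win cash prize", "meeting tomorrow",
         "project deadline", "claim your prize", "urgent meeting"]
labels = [1, 1, 0, 0, 1, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
clf = MultinomialNB().fit(X, labels)

# Joint log score for one test email: log P(y) + sum_i x_i * log P(w_i | y)
x = vectorizer.transform(["free cash meeting"]).toarray()[0]
joint = clf.class_log_prior_ + clf.feature_log_prob_ @ x

print("Joint log scores per class:", joint.round(3))
print("Manual argmax:", clf.classes_[np.argmax(joint)])
print("sklearn predict:", clf.predict(vectorizer.transform(["free cash meeting"]))[0])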

Bernoulli Naive Bayes

For binary features — presence/absence only.

$$P(x|y) = \prod_i P(w_i|y)^{x_i} \cdot (1 - P(w_i|y))^{(1-x_i)}$$

Key difference from Multinomial: explicitly penalizes absence of features.

| Use Case | Example |
| --- | --- |
| Short text (tweets) | Word presence, not count |
| Binary features | “Has feature X?” questions |
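
A minimal BernoulliNB sketch on made-up short texts, using CountVectorizer(binary=True) so features record presence/absence only (BernoulliNB would also binarize counts itself, since its default is binarize=0.0):

🐍 Python
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer

# Made-up short texts (labels: 1 = spam, 0 = not spam)
tweets = ["free prize win", "lunch meeting today", "win free cash", "project status meeting"]
labels = [1, 0, 1, 0]

# binary=True records word presence/absence instead of counts
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(tweets)

clf = BernoulliNB()          # default binarize=0.0 would binarize counts anyway
clf.fit(X, labels)

test = vectorizer.transform(["free meeting today"])
print("P(not spam), P(spam):", clf.predict_proba(test)[0].round(3))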

Gaussian Naive Bayes

For continuous features — assumes normal distribution.

$$P(x_i|y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)$$

For each class $y$ and feature $i$, estimate:

  • $\mu_{y,i}$ = mean of feature $i$ for class $y$
  • $\sigma_{y,i}^2$ = variance of feature $i$ for class $y$
| Use Case | Example |
| --- | --- |
| Numeric data | Height, weight, sensor readings |
| Mixed with other models | Baseline comparison |

Gaussian NB assumes features follow a normal distribution. For skewed data, consider transformations (log, Box-Cox) first.
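
To make the density formula concrete, the sketch below evaluates it directly for a single made-up feature (the class means, variances, and priors are invented purely for illustration):

🐍 Python
import numpy as np

def gaussian_pdf(x, mu, var):
    """P(x_i | y): normal density with class-conditional mean mu and variance var."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Invented class statistics for one feature (say, height in cm) and equal priors
mu_a, var_a = 170.0, 25.0    # class A
mu_b, var_b = 180.0, 16.0    # class B
prior_a = prior_b = 0.5

x = 176.0
score_a = gaussian_pdf(x, mu_a, var_a) * prior_a
score_b = gaussian_pdf(x, mu_b, var_b) * prior_b
print(f"score(A) = {score_a:.4f}, score(B) = {score_b:.4f}")
print("Predicted class:", "A" if score_a > score_b else "B")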

Laplace Smoothing: Handling Zero Probabilities

The Problem

If a word never appears in spam training emails, its probability is zero:

$$P(\text{“meeting”}|\text{spam}) = \frac{0}{\text{total spam words}} = 0$$

This zeros out the entire product, regardless of other evidence!

The Solution: Add-α Smoothing

Add a small count $\alpha$ to every feature:

$$P(x_i|y) = \frac{\text{count}(x_i, y) + \alpha}{\text{count}(y) + \alpha \cdot |V|}$$

where $|V|$ is the vocabulary size.
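
A tiny worked example of the formula, with made-up counts (100 word tokens of spam, a 50-word vocabulary, and a word that never appears in spam):

🐍 Python
# Made-up counts for illustration
count_word_in_spam = 0     # "meeting" never seen in spam training emails
total_spam_words = 100     # total word tokens across all spam emails
vocab_size = 50            # |V|

for alpha in [0.0, 0.1, 1.0]:
    p = (count_word_in_spam + alpha) / (total_spam_words + alpha * vocab_size)
    print(f"alpha = {alpha:<4} P('meeting' | spam) = {p:.4f}")
# alpha = 0 -> 0.0000, which zeros out the whole product
# alpha = 1 -> 1/150 ≈ 0.0067, small but nonzero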

| α Value | Effect |
| --- | --- |
| $\alpha = 0$ | No smoothing (risk of zeros) |
| $\alpha = 1$ | Laplace smoothing (most common) |
| $\alpha > 1$ | Stronger smoothing → more uniform |
| $\alpha < 1$ | Lidstone smoothing (less shrinkage) |

Default in sklearn: alpha=1.0 (Laplace smoothing). For text classification, try values in [0.01, 1.0] via cross-validation.

Code Practice

The following examples demonstrate Naive Bayes in action for text classification.

Text Classification: Spam Detection

🐍 Python
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# Sample email data
texts = [
    "free money now", "win cash prize",       # Spam
    "meeting tomorrow", "project deadline",    # Not spam
    "claim your prize", "urgent meeting"
]
labels = [1, 1, 0, 0, 1, 0]  # 1 = spam, 0 = not spam

# Convert text to word count features
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

print("Vocabulary:", vectorizer.get_feature_names_out())
print("Feature matrix shape:", X.shape)

# Train Naive Bayes
clf = MultinomialNB()
clf.fit(X, labels)

# Predict on new email
test = ["free cash meeting"]
X_test = vectorizer.transform(test)
prediction = clf.predict(X_test)[0]
probabilities = clf.predict_proba(X_test)[0]

print(f"\n📧 New email: '{test[0]}'")
print(f"   Prediction: {'🚨 SPAM' if prediction else '✅ Not Spam'}")
print(f"   P(not spam) = {probabilities[0]:.2%}")
print(f"   P(spam) = {probabilities[1]:.2%}")

Output:

Vocabulary: ['cash' 'claim' 'deadline' 'free' 'meeting' 'money' 'now' 'prize'
 'project' 'tomorrow' 'urgent' 'win' 'your']
Feature matrix shape: (6, 13)

📧 New email: 'free cash meeting'
   Prediction: ✅ Not Spam
   P(not spam) = 53.80%
   P(spam) = 46.20%

The email contains “free” and “cash” (spam words) but also “meeting” (ham word). With limited training data, the model slightly favors Not Spam — showing how NB weighs all word evidence together.

Gaussian Naive Bayes for Continuous Data

🐍 Python
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load Iris dataset
X, y = load_iris(return_X_y=True)
feature_names = load_iris().feature_names
target_names = load_iris().target_names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Gaussian NB
gnb = GaussianNB()
gnb.fit(X_train, y_train)

print("📊 Gaussian Naive Bayes on Iris Dataset")
print("=" * 45)
print(f"Accuracy: {gnb.score(X_test, y_test):.2%}")
print(f"\n📈 Class priors (learned):")
for name, prior in zip(target_names, gnb.class_prior_):
    print(f"   P({name}) = {prior:.2%}")

print(f"\n📐 Feature means per class:")
for i, name in enumerate(target_names):
    print(f"   {name}: {gnb.theta_[i].round(2)}")

Output:

📊 Gaussian Naive Bayes on Iris Dataset
=============================================
Accuracy: 100.00%

📈 Class priors (learned):
   P(setosa) = 33.33%
   P(versicolor) = 34.17%
   P(virginica) = 32.50%

📐 Feature means per class:
   setosa: [4.99 3.45 1.45 0.24]
   versicolor: [5.92 2.77 4.24 1.32]
   virginica: [6.53 2.97 5.52 2.  ]

Gaussian NB achieves 100% accuracy on Iris — competitive with more complex models! The simplicity and speed make it an excellent baseline.

Effect of Laplace Smoothing (α)

🐍 Python
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

# Sample data
emails = ["free money", "meeting schedule", "free cash prize", 
          "project update", "win free", "team meeting"]
labels = [1, 0, 1, 0, 1, 0]  # 1=spam, 0=not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
X_test = vectorizer.transform(["free cash meeting"])

# Compare different alpha values
print("📊 Effect of Smoothing Parameter α")
print("=" * 50)
print(f"{'Alpha':<10} {'P(not spam)':<15} {'P(spam)':<15}")
print("-" * 50)

for alpha in [0.001, 0.1, 1.0, 10.0]:
    clf = MultinomialNB(alpha=alpha)
    clf.fit(X, labels)
    probs = clf.predict_proba(X_test)[0]
    print(f"{alpha:<10} {probs[0]:<15.4f} {probs[1]:<15.4f}")

Output:

📊 Effect of Smoothing Parameter α
==================================================
Alpha      P(not spam)     P(spam)        
--------------------------------------------------
0.001      0.0011          0.9989         
0.1        0.0842          0.9158         
1.0        0.3102          0.6898         
10.0       0.4633          0.5367  

Higher α → probabilities closer to uniform (more smoothing). Lower α → more extreme probabilities. Use cross-validation to find the best α!

Real-World: Text Classification Pipeline

🐍 Python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# Larger sample dataset for text classification
texts = [
    "The game was exciting with great plays", "Baseball scores and statistics",
    "Home run in the ninth inning", "Team wins championship game",
    "NASA launches new satellite", "Astronauts on space station",
    "Mars rover discovers water", "Telescope captures galaxy images",
    "Pitcher throws perfect game", "Stadium filled with fans cheering",
    "Rocket launch successful today", "Space exploration advances",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1]  # 0=sports, 1=space

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42
)

# Build pipeline: TF-IDF + Naive Bayes
clf = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    MultinomialNB(alpha=0.1)
)
clf.fit(X_train, y_train)

# Evaluate
accuracy = clf.score(X_test, y_test)
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"\n🎯 Test Accuracy: {accuracy:.2%}")

# Test prediction
new_text = ["The rocket launched into space successfully"]
pred = clf.predict(new_text)[0]
print(f"\n📧 '{new_text[0]}'")
print(f"   Prediction: {'🚀 Space' if pred == 1 else '⚾ Sports'}")

Output:

Training samples: 9
Test samples: 3

🎯 Test Accuracy: 66.67%

📧 'The rocket launched into space successfully'
   Prediction: 🚀 Space

TF-IDF + Naive Bayes is a powerful text classification pipeline. The same approach scales to millions of documents, since both vectorization and training take a single pass over the sparse data.

Deep Dive

Frequently Asked Questions

Q1: Why is it called “naive”?

The conditional independence assumption is almost always wrong in practice:

| Example | Reality | Naive Assumption |
| --- | --- | --- |
| “free” and “money” in spam | Highly correlated | Treated as independent |
| Pixel intensities in images | Neighboring pixels are similar | Treated as independent |

Yet it works because:

  1. Only the argmax is needed — relative ordering matters, not exact probabilities
  2. Errors tend to cancel out when averaged over many features
  3. Fewer parameters = less overfitting with limited data

Q2: When does Naive Bayes excel?

| Scenario | Why NB Works Well |
| --- | --- |
| Text classification | High-dim sparse features, independence “good enough” |
| Spam filtering | Fast, incremental updates, interpretable |
| As a baseline | Quick to train, hard to beat on small data |
| Multi-class problems | Naturally handles many classes |
| Real-time systems | Prediction is a single fast pass over the features per class |

Q3: When does Naive Bayes fail?

| Scenario | Why It Fails | Alternative |
| --- | --- | --- |
| Highly correlated features | Independence assumption badly violated | Logistic Regression, SVM |
| Need calibrated probabilities | NB probabilities are often extreme | Use CalibratedClassifierCV |
| Complex decision boundaries | Linear in log-space only | Random Forest, Neural Networks |
| Numeric features, non-Gaussian | Gaussian NB assumes normality | Transform data or use other models |
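
If calibrated probabilities matter, sklearn's CalibratedClassifierCV can wrap a Naive Bayes estimator. A minimal sketch on synthetic data (the numbers are only illustrative):

🐍 Python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

raw = GaussianNB().fit(X_train, y_train)
calibrated = CalibratedClassifierCV(GaussianNB(), method="sigmoid", cv=5).fit(X_train, y_train)

# Raw NB probabilities tend to cluster near 0 or 1; calibration pulls them inward
print("Raw NB mean top-class probability:       ",
      raw.predict_proba(X_test).max(axis=1).mean().round(3))
print("Calibrated NB mean top-class probability:",
      calibrated.predict_proba(X_test).max(axis=1).mean().round(3))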

Q4: Naive Bayes vs. Logistic Regression vs. SVM

| Aspect | Naive Bayes | Logistic Regression | SVM |
| --- | --- | --- | --- |
| Type | Generative | Discriminative | Discriminative |
| Training speed | ⚡ Very fast | 🔶 Fast | 🐢 Slow (RBF) |
| Prediction speed | ⚡ Very fast | ⚡ Very fast | 🔶 Depends on SVs |
| Probability output | ⚠️ Uncalibrated | ✅ Well-calibrated | ⚠️ Needs calibration |
| Handles correlations | ❌ No | ✅ Yes | ✅ Yes |
| Best for | Text, baselines | General classification | Clear margins |
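
A rough way to feel these trade-offs is to fit all three on the same dataset and time them. The sketch below uses the digits dataset purely as an illustration (pixel intensities are non-negative counts, so MultinomialNB applies); exact timings and accuracies will vary by machine and sklearn version:

🐍 Python
import time
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)   # pixel values 0-16, non-negative counts
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=5000),
    "SVC (RBF)": SVC(),
}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    acc = model.score(X_test, y_test)
    print(f"{name:<20} train {elapsed:.3f}s   accuracy {acc:.2%}")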

Practical Tips

Text classification checklist (steps 2 and 4 are sketched in code after this list):

  1. Start with MultinomialNB + TF-IDF
  2. Tune α in [0.01, 0.1, 1.0] via CV
  3. Compare with Logistic Regression
  4. Use ComplementNB for imbalanced data
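
A minimal sketch of steps 2 and 4, tuning α with GridSearchCV and swapping in ComplementNB. It reuses the small sports/space dataset from the pipeline example above, which is balanced, so it only shows the API; ComplementNB's advantage appears on imbalanced classes:

🐍 Python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import ComplementNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Small sports (0) vs space (1) dataset from the pipeline example above
texts = [
    "The game was exciting with great plays", "Baseball scores and statistics",
    "Home run in the ninth inning", "Team wins championship game",
    "NASA launches new satellite", "Astronauts on space station",
    "Mars rover discovers water", "Telescope captures galaxy images",
    "Pitcher throws perfect game", "Stadium filled with fans cheering",
    "Rocket launch successful today", "Space exploration advances",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1]

for nb_cls in (MultinomialNB, ComplementNB):
    pipe = make_pipeline(TfidfVectorizer(stop_words="english"), nb_cls())
    step = nb_cls.__name__.lower()                 # pipeline step name, e.g. "multinomialnb"
    grid = GridSearchCV(pipe, {f"{step}__alpha": [0.01, 0.1, 1.0]}, cv=3)
    grid.fit(texts, labels)
    print(f"{nb_cls.__name__}: best {grid.best_params_}, CV accuracy {grid.best_score_:.2f}")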

Variant Selection Guide

graph TD
    A["What type of features?"] --> B{"Binary?"}
    B -->|Yes| C["BernoulliNB"]
    B -->|No| D{"Counts?"}
    D -->|Yes| E["MultinomialNB"]
    D -->|No| F{"Continuous?"}
    F -->|Yes| G["GaussianNB"]
    F -->|"Mixed"| H["Combine variants or transform"]

Common Pitfalls

| Pitfall | Symptom | Solution |
| --- | --- | --- |
| Zero probabilities | Prediction always same class | Use α > 0 (Laplace smoothing) |
| Extreme probabilities | All predictions near 0 or 1 | Calibrate with CalibratedClassifierCV |
| Wrong variant | Poor performance | Match variant to feature type |
| No text preprocessing | Low accuracy on text | Use TF-IDF, remove stop words |

Summary

| Concept | Key Points |
| --- | --- |
| Generative Model | Models $P(x \mid y)$, then applies Bayes’ theorem |
| Naive Assumption | Features conditionally independent given class |
| Laplace Smoothing | Adds α to avoid zero probabilities |
| Variants | Bernoulli (binary), Multinomial (counts), Gaussian (continuous) |

References

  • McCallum, A. & Nigam, K. (1998). “A Comparison of Event Models for Naive Bayes Text Classification”
  • sklearn Naive Bayes
  • Manning, C. et al. “Introduction to Information Retrieval” - Chapter 13