ML-16: Boosting and AdaBoost
Learning Objectives
- Understand Boosting vs Bagging
- Master AdaBoost algorithm
- Implement weighted learning
- Choose between ensemble methods
Theory
Boosting vs Bagging: Two Philosophies
The previous post showed that Bagging (Random Forest) builds trees in parallel on bootstrap samples. Boosting takes a fundamentally different approach: build models sequentially, where each new model focuses on the mistakes of the previous ones.
| Aspect | Bagging | Boosting |
|---|---|---|
| Training | Parallel (independent) | Sequential (each depends on previous) |
| Focus | Reduce variance | Reduce bias |
| Samples | Random bootstrap subsets | Reweighted (hard examples get more weight) |
| Aggregation | Equal vote / average | Weighted vote (better models count more) |
| Overfitting | Resistant | Can overfit noisy data |
The Boosting Intuition
Imagine you’re studying for an exam:
- Round 1: Take a practice test, identify problems you got wrong
- Round 2: Study harder on the topics you missed, take another test
- Round 3: Focus even more on remaining weak spots
- Final: Combine all your knowledge, weighted by how much you improved
This is exactly what Boosting does with classifiers!
AdaBoost: Adaptive Boosting
AdaBoost (Adaptive Boosting), introduced by Freund & Schapire (1997), is the most famous boosting algorithm.
The Algorithm Step-by-Step
Input: Training data $(x_1, y_1), \dots, (x_N, y_N)$ where $y_i \in \{-1, +1\}$
Step 1: Initialize sample weights
$$w_i^{(1)} = \frac{1}{N} \quad \text{for all } i$$
All samples start with equal importance.
Step 2: For each round $t = 1, 2, …, T$:
(a) Train weak learner $h_t$ on weighted samples
(b) Compute weighted error:
$$\epsilon_t = \frac{\sum_{i: h_t(x_i) \neq y_i} w_i^{(t)}}{\sum_i w_i^{(t)}} = \sum_{i: \text{wrong}} w_i^{(t)}$$
(c) Compute model weight (how much this model counts):
$$\boxed{\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)}$$
| $\epsilon_t$ | $\alpha_t$ | Meaning |
|---|---|---|
| 0.5 | 0 | Random guessing → no contribution |
| 0.3 | 0.42 | Good classifier → moderate weight |
| 0.1 | 1.10 | Great classifier → high weight |
| 0.01 | 2.30 | Excellent → very high weight |
(d) Update sample weights:
$$\boxed{w_i^{(t+1)} = w_i^{(t)} \cdot e^{-\alpha_t y_i h_t(x_i)}}$$
Then normalize so weights sum to 1.
| Classification | $y_i \cdot h_t(x_i)$ | Weight change |
|---|---|---|
| Correct | +1 | $w_i \cdot e^{-\alpha_t}$ → decreases |
| Incorrect | -1 | $w_i \cdot e^{+\alpha_t}$ → increases |
Step 3: Final classifier
$$\boxed{H(x) = \text{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)}$$
Each weak learner votes, weighted by its quality!
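To make Steps 1–3 concrete, here is a minimal from-scratch sketch (not sklearn's implementation), assuming an illustrative synthetic dataset from `make_classification` and `DecisionTreeClassifier(max_depth=1)` as the weak learner:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic dataset; labels mapped to {-1, +1} as in the derivation above
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
y = np.where(y == 0, -1, 1)

T = 50                                 # number of boosting rounds
N = len(y)
w = np.full(N, 1.0 / N)                # Step 1: uniform sample weights
stumps, alphas = [], []

for t in range(T):
    # (a) Train a weak learner (decision stump) on the weighted samples
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)

    # (b) Weighted error and (c) model weight alpha_t
    eps = np.clip(w[pred != y].sum() / w.sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - eps) / eps)

    # (d) Reweight: exp(+alpha) for misclassified, exp(-alpha) for correct, then normalize
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Step 3: sign of the alpha-weighted vote
H = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
print(f"Ensemble training accuracy: {(H == y).mean():.3f}")
```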
Worked Example: AdaBoost in Action
Consider 10 samples: 7 correct (✓) and 3 incorrect (✗) after the first stump.
Initial weights: $w_i = 0.1$ for all
After Round 1:
- Error: $\epsilon_1 = 0.1 + 0.1 + 0.1 = 0.3$
- Model weight: $\alpha_1 = \frac{1}{2}\ln\frac{0.7}{0.3} = 0.42$
- New weights:
- Correct: $0.1 \times e^{-0.42} = 0.066$
- Incorrect: $0.1 \times e^{+0.42} = 0.152$
- After normalization: correct ≈ 0.07, incorrect ≈ 0.17 (verified numerically below)
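The arithmetic above can be checked in a few lines of NumPy (a quick numerical verification of the same example):

```python
import numpy as np

w = np.full(10, 0.1)                       # 10 samples, uniform initial weights
wrong = np.array([False] * 7 + [True] * 3) # 7 correct, 3 misclassified

eps = w[wrong].sum()                       # weighted error = 0.3
alpha = 0.5 * np.log((1 - eps) / eps)      # ~0.42

# Misclassified samples get exp(+alpha), correct ones exp(-alpha), then normalize
w_new = w * np.exp(np.where(wrong, alpha, -alpha))
w_new /= w_new.sum()

print(f"alpha = {alpha:.2f}")              # ~0.42
print(f"correct weight   = {w_new[0]:.2f}")   # ~0.07
print(f"incorrect weight = {w_new[-1]:.2f}")  # ~0.17
```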
Why Weak Learners?
Weak learner: A classifier that’s only slightly better than random guessing ($\epsilon < 0.5$).
| Weak Learner | Description |
|---|---|
| Decision Stump | Tree with depth 1 (most common) |
| Linear classifier | Simple linear decision boundary (e.g., perceptron) |
| Small tree | Depth 2-3 |
Code Practice
The following examples demonstrate AdaBoost in action and compare it with other ensemble methods.
Single Stump vs AdaBoost
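A minimal sketch of this comparison, assuming sklearn's built-in breast cancer dataset and a 70/30 train/test split (both illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# A single decision stump: depth-1 tree, the typical AdaBoost weak learner
stump = DecisionTreeClassifier(max_depth=1, random_state=42)
stump.fit(X_train, y_train)

# AdaBoost: 50 stumps combined by a weighted vote
ada = AdaBoostClassifier(n_estimators=50, random_state=42)
ada.fit(X_train, y_train)

print(f"Single stump test accuracy: {stump.score(X_test, y_test):.3f}")
print(f"AdaBoost (50 stumps):       {ada.score(X_test, y_test):.3f}")
```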
Learning Curve: How Many Estimators?
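A sketch of the learning curve, assuming the same illustrative dataset and using `staged_score` to get the accuracy of the partial ensemble after each boosting round:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

ada = AdaBoostClassifier(n_estimators=300, random_state=42)
ada.fit(X_train, y_train)

# staged_score yields the accuracy after each boosting round (1, 2, ..., 300 estimators)
train_scores = list(ada.staged_score(X_train, y_train))
test_scores = list(ada.staged_score(X_test, y_test))

plt.plot(train_scores, label="train")
plt.plot(test_scores, label="test")
plt.xlabel("Number of estimators")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```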
Comparing All Ensemble Methods
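A sketch comparing a single tree, bagging (Random Forest), AdaBoost, and Gradient Boosting with 5-fold cross-validation (dataset and settings are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest (bagging)": RandomForestClassifier(n_estimators=100, random_state=42),
    "AdaBoost (boosting)": AdaBoostClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
}

# 5-fold cross-validated accuracy for each ensemble method
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:25s} {scores.mean():.3f} +/- {scores.std():.3f}")
```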
Visualizing Decision Boundaries
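A sketch that plots the decision boundaries of a single stump, Random Forest, and AdaBoost on a two-dimensional toy dataset (`make_moons` is an illustrative choice):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=300, noise=0.25, random_state=42)

models = {
    "Decision Stump": DecisionTreeClassifier(max_depth=1),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=42),
}

# Grid over the 2D feature space; each model colors it by its predicted class
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 200),
                     np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 200))

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (name, model) in zip(axes, models.items()):
    model.fit(X, y)
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3, cmap="coolwarm")
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", edgecolor="k", s=20)
    ax.set_title(name)
plt.tight_layout()
plt.show()
```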
Deep Dive
Frequently Asked Questions
Q1: Boosting vs Bagging โ when to use which?
| Your Problem | Recommendation | Why |
|---|---|---|
| Overfitting (high variance) | Bagging (Random Forest) | Averaging reduces variance |
| Underfitting (high bias) | Boosting (AdaBoost/GB) | Sequential correction reduces bias |
| Noisy data, outliers | Bagging | More robust to noise |
| Clean data, max accuracy | Gradient Boosting | Best accuracy |
| Fast training needed | Random Forest | Parallelizable |
| Interpretability needed | Single Decision Tree | But lower accuracy |
Q2: AdaBoost vs Gradient Boosting vs XGBoost
| Aspect | AdaBoost | Gradient Boosting | XGBoost/LightGBM |
|---|---|---|---|
| How it learns | Reweight samples | Fit residuals | Fit residuals + regularization |
| Loss function | Exponential | Any differentiable | Any + custom |
| Speed | Medium | Slow | Fast (optimized) |
| Accuracy | Good | Better | Best |
| Hyperparameters | Few | Medium | Many |
| Outlier sensitivity | High | Medium | Medium |
Q3: What if AdaBoost overfits?
| Symptom | Solution |
|---|---|
| Train accuracy » Test accuracy | Reduce n_estimators |
| Performance degrades with more trees | Use learning_rate < 1.0 |
| Sensitive to outliers | Remove outliers or use RandomForest |
| Complex base learner | Use simpler stumps (max_depth=1) |
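One way to act on the first two rows: shrink the learning rate and keep only the best-performing prefix of the ensemble, chosen on held-out data (a sketch; the validation split and values are illustrative):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Shrink each round's contribution, then pick the boosting round with the best validation score
ada = AdaBoostClassifier(n_estimators=500, learning_rate=0.5, random_state=42)
ada.fit(X_train, y_train)

val_scores = np.fromiter(ada.staged_score(X_val, y_val), dtype=float)
best_rounds = int(val_scores.argmax()) + 1
print(f"Best validation accuracy {val_scores.max():.3f} after {best_rounds} estimators")
```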
Q4: Why is AdaBoost sensitive to outliers?
Outliers get higher weights when misclassified → the next learner focuses heavily on them → the ensemble is distorted by outliers.
Solutions:
- Use Gradient Boosting with a robust loss (e.g., Huber loss for regression; see the sketch after this list)
- Remove outliers before training
- Use Random Forest instead
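For regression problems, the robust-loss option looks like the sketch below; the dataset and injected outliers are synthetic, purely for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data with a handful of injected outliers
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
rng = np.random.default_rng(42)
idx = rng.choice(len(y), size=15, replace=False)
y[idx] += rng.normal(0.0, 500.0, size=15)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Squared error chases the outliers; Huber loss caps their influence
for loss in ["squared_error", "huber"]:
    gbr = GradientBoostingRegressor(loss=loss, random_state=42)
    gbr.fit(X_train, y_train)
    print(f"loss={loss:13s}  test R^2: {gbr.score(X_test, y_test):.3f}")
```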
Practical Tips
AdaBoost tuning checklist:
- Start with `n_estimators=50`, `learning_rate=1.0`
- If overfitting: reduce `n_estimators` or `learning_rate`
- Use `algorithm='SAMME.R'` (real-valued, usually better; deprecated in newer scikit-learn releases, which keep only `'SAMME'`)
- For noisy data, consider Random Forest instead
- Monitor train vs test accuracy gap
Key Hyperparameters
| Parameter | Default | Effect |
|---|---|---|
| `n_estimators` | 50 | More → stronger, but risk of overfitting |
| `learning_rate` | 1.0 | Lower → more conservative updates |
| `estimator` | `DecisionTreeClassifier(max_depth=1)` | Weak learner type |
| `algorithm` | 'SAMME.R' | 'SAMME' for discrete, 'SAMME.R' for real-valued predictions |
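A sketch of tuning the two main knobs jointly with `GridSearchCV` (grid values are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Illustrative grid over the two hyperparameters that matter most
param_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.1, 0.5, 1.0],
}

search = GridSearchCV(AdaBoostClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```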
Common Pitfalls
| Pitfall | Symptom | Solution |
|---|---|---|
| Too many estimators | Overfitting | Reduce or use early stopping |
| Strong base learners | Poor ensemble | Use depth=1 stumps |
| Outliers in data | Distorted predictions | Clean data or use RF |
| Class imbalance | Ignores minority | Use class weights |
| Algorithm choice | Poor probabilities | Use 'SAMME.R' for `predict_proba` |
Summary
| Concept | Key Points |
|---|---|
| Boosting | Sequential learning, focus on errors |
| AdaBoost | Reweight samples, weak → strong |
| $\alpha_t$ | Model weight based on error |
| Weight Update | Increase for misclassified |
| vs Bagging | Reduces bias (not variance) |
References
- Freund, Y. & Schapire, R. (1997). “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting”. Journal of Computer and System Sciences, 55(1), 119-139.
- sklearn Ensemble Documentation
- Hastie, T., Tibshirani, R. & Friedman, J. “The Elements of Statistical Learning”, Chapter 10: Boosting and Additive Trees
- Abu-Mostafa, Y. “Learning From Data”