ML-16: Boosting and AdaBoost

Summary
Master the Boosting paradigm: understand sequential ensemble learning, implement AdaBoost step-by-step with sample weight updates, and choose between Random Forest, AdaBoost, and Gradient Boosting.

Learning Objectives

  • Understand Boosting vs Bagging
  • Master AdaBoost algorithm
  • Implement weighted learning
  • Choose between ensemble methods

Theory

Boosting vs Bagging: Two Philosophies

In the last post, we saw that Bagging (Random Forest) builds trees in parallel on independent bootstrap samples. Boosting takes a fundamentally different approach: build models sequentially, where each new model focuses on the mistakes of the previous ones.

```mermaid
graph TD
    subgraph "Boosting (AdaBoost)"
        D2[(Data)] --> S1["🌱 Stump 1"]
        S1 -->|"Reweight"| S2["🌱 Stump 2"]
        S2 -->|"Reweight"| S3["🌱 Stump 3"]
        S3 --> W{{"Weighted Sum"}}
    end
    subgraph "Bagging (Random Forest)"
        D1[(Data)] --> T1["🌳 Tree 1"]
        D1 --> T2["🌳 Tree 2"]
        D1 --> T3["🌳 Tree 3"]
        T1 --> V{{"Vote"}}
        T2 --> V
        T3 --> V
    end
```
| Aspect | Bagging | Boosting |
|---|---|---|
| Training | Parallel (independent) | Sequential (each model depends on the previous) |
| Focus | Reduce variance | Reduce bias |
| Samples | Random bootstrap subsets | Reweighted (hard examples get more weight) |
| Aggregation | Equal vote / average | Weighted vote (better models count more) |
| Overfitting | Resistant | Can overfit noisy data |
Key Insight: Bagging asks “How can I average out noise?” while Boosting asks “How can I learn from my mistakes?”
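In scikit-learn the two philosophies map directly onto BaggingClassifier and AdaBoostClassifier. The following sketch is illustrative (the dataset and settings are not from the text, and the estimator parameter assumes scikit-learn >= 1.2); it trains 100 decision stumps both ways:

```python
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=0)
stump = DecisionTreeClassifier(max_depth=1)

ensembles = {
    # Bagging: 100 stumps trained independently on bootstrap samples
    "Bagging (stumps)": BaggingClassifier(estimator=stump, n_estimators=100,
                                          random_state=0),
    # Boosting: 100 stumps trained sequentially on reweighted samples
    "AdaBoost (stumps)": AdaBoostClassifier(estimator=stump, n_estimators=100,
                                            random_state=0),
}

for name, clf in ensembles.items():
    score = cross_val_score(clf, X, y, cv=5).mean()
    print(f"{name:<18} mean CV accuracy: {score:.2%}")
```

With such high-bias base learners, bagging typically gains little over a single stump while boosting improves substantially, which is exactly the variance-vs-bias split in the table above.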

The Boosting Intuition

Imagine you’re studying for an exam:

  1. Round 1: Take a practice test, identify problems you got wrong
  2. Round 2: Study harder on the topics you missed, take another test
  3. Round 3: Focus even more on remaining weak spots
  4. Final: Combine all your knowledge, weighted by how much you improved

This is exactly what Boosting does with classifiers!

```mermaid
graph TD
    A["📝 Round 1: Equal focus on all topics"] --> B["❌ Missed Q3, Q7, Q12"]
    B --> C["📝 Round 2: Extra focus on Q3, Q7, Q12"]
    C --> D["❌ Still struggling with Q7"]
    D --> E["📝 Round 3: Heavy focus on Q7"]
    E --> F["🎓 Final: Combine all rounds"]
    style F fill:#c8e6c9
```

AdaBoost: Adaptive Boosting

AdaBoost (Adaptive Boosting), introduced by Freund & Schapire (1997), is the most famous boosting algorithm.

The Algorithm Step-by-Step

Input: Training data $(x_1, y_1), \ldots, (x_N, y_N)$ where $y_i \in \{-1, +1\}$

Step 1: Initialize sample weights

$$w_i^{(1)} = \frac{1}{N} \quad \text{for all } i$$

All samples start with equal importance.

Step 2: For each round $t = 1, 2, \ldots, T$:

(a) Train weak learner $h_t$ on weighted samples

(b) Compute weighted error:

$$\epsilon_t = \frac{\sum_{i: h_t(x_i) \neq y_i} w_i^{(t)}}{\sum_i w_i^{(t)}} = \sum_{i: \text{wrong}} w_i^{(t)}$$

(The last simplification holds because the weights are kept normalized, so the denominator is 1.)

(c) Compute model weight (how much this model counts):

$$\boxed{\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)}$$

| $\epsilon_t$ | $\alpha_t$ | Meaning |
|---|---|---|
| 0.5 | 0 | Random guessing → no contribution |
| 0.3 | 0.42 | Good classifier → moderate weight |
| 0.1 | 1.10 | Great classifier → high weight |
| 0.01 | 2.29 | Excellent → very high weight |

(d) Update sample weights:

$$\boxed{w_i^{(t+1)} = w_i^{(t)} \cdot e^{-\alpha_t y_i h_t(x_i)}}$$

Then normalize so weights sum to 1.

| Classification | $y_i \cdot h_t(x_i)$ | Weight change |
|---|---|---|
| Correct | +1 | $w_i \cdot e^{-\alpha_t}$ → decreases |
| Incorrect | -1 | $w_i \cdot e^{+\alpha_t}$ → increases |

Step 3: Final classifier

$$\boxed{H(x) = \text{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)}$$

Each weak learner votes, weighted by its quality!

```mermaid
graph LR
    A["Initialize<br/>wᵢ = 1/N"] --> B["Train weak learner hₜ"]
    B --> C["Compute error εₜ"]
    C --> D["Compute weight<br/>αₜ = ½ln((1-ε)/ε)"]
    D --> E["Update sample weights"]
    E --> F{{"t < T?"}}
    F -->|Yes| B
    F -->|No| G["H(x) = sign(Σαₜhₜ(x))"]
    style G fill:#c8e6c9
```
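To make the four steps concrete, here is a minimal from-scratch sketch of the loop above, using scikit-learn decision stumps as weak learners (the toy dataset and variable names are illustrative, not from the text):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
y = 2 * y - 1  # map labels {0, 1} -> {-1, +1}

N, T = len(X), 50
w = np.full(N, 1.0 / N)        # Step 1: equal sample weights
stumps, alphas = [], []

for t in range(T):
    # (a) train a weak learner on the weighted samples
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)

    # (b) weighted error (weights stay normalized, so no denominator needed)
    eps = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)

    # (c) model weight: alpha_t = 0.5 * ln((1 - eps) / eps)
    alpha = 0.5 * np.log((1 - eps) / eps)

    # (d) reweight: correct samples shrink, misclassified samples grow
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Step 3: final classifier H(x) = sign(sum_t alpha_t * h_t(x))
scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print(f"Hand-rolled AdaBoost training accuracy: {(np.sign(scores) == y).mean():.2%}")
```

In practice you would use sklearn's AdaBoostClassifier (shown in the Code Practice section below), which implements the same idea.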

Worked Example: AdaBoost in Action

Consider 10 samples: 7 classified correctly (✓) and 3 incorrectly (✗) by the first stump.

Initial weights: $w_i = 0.1$ for all

After Round 1:

  • Error: $\epsilon_1 = 0.1 + 0.1 + 0.1 = 0.3$
  • Model weight: $\alpha_1 = \frac{1}{2}\ln\frac{0.7}{0.3} = 0.42$
  • New weights:
    • Correct: $0.1 \times e^{-0.42} = 0.066$
    • Incorrect: $0.1 \times e^{+0.42} = 0.152$
    • After normalization: correct ≈ 0.07, incorrect ≈ 0.17
Now each misclassified sample carries more than twice the weight of a correctly classified one! The next stump will focus on getting these right.
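The arithmetic above can be reproduced with a few lines of NumPy (a quick sanity check of the worked example, not part of the original text):

```python
import numpy as np

w = np.full(10, 0.1)                        # ten samples, equal weights
wrong = np.array([False] * 7 + [True] * 3)  # 3 misclassified by the first stump

eps = w[wrong].sum()                        # 0.3
alpha = 0.5 * np.log((1 - eps) / eps)       # ~0.42

w = w * np.exp(np.where(wrong, alpha, -alpha))  # increase wrong, decrease correct
w /= w.sum()                                    # normalize to sum to 1

print(f"alpha            = {alpha:.2f}")
print(f"correct weight   ~ {w[~wrong][0]:.3f}")  # ~0.071
print(f"incorrect weight ~ {w[wrong][0]:.3f}")   # ~0.167
```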

Why Weak Learners?

Weak learner: A classifier that’s only slightly better than random guessing ($\epsilon < 0.5$).

| Weak Learner | Description |
|---|---|
| Decision stump | Tree with depth 1 (most common) |
| Linear classifier | A simple linear decision boundary |
| Small tree | Depth 2-3 |

Why weak? Strong learners tend to overfit individual samples. Weak learners + boosting = a strong ensemble with a much lower risk of overfitting.
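A quick way to see the effect is to swap the depth-1 stump for a much deeper tree and compare. A minimal sketch (dataset and settings are illustrative; the estimator parameter assumes scikit-learn >= 1.2):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

for depth in [1, 10]:  # weak stump vs strong tree as the base learner
    ada = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=depth),
        n_estimators=50, random_state=1)
    ada.fit(X_tr, y_tr)
    print(f"base max_depth={depth:<2} train={ada.score(X_tr, y_tr):.2%} "
          f"test={ada.score(X_te, y_te):.2%}")
```

The deep-tree version usually drives training accuracy to ~100% without a matching gain on the test set, while the stump version generalizes comparably with a far simpler ensemble.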

Code Practice

The following examples demonstrate AdaBoost in action and compare it with other ensemble methods.

Single Stump vs AdaBoost

๐Ÿ Python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate dataset
X, y = make_classification(n_samples=1000, n_features=20, 
                           n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("๐Ÿ“Š Weak Learner vs AdaBoost")
print("=" * 50)

# Single stump (weak learner)
stump = DecisionTreeClassifier(max_depth=1, random_state=42)
stump.fit(X_train, y_train)
print(f"Single Stump:       Train={stump.score(X_train, y_train):.2%}, "
      f"Test={stump.score(X_test, y_test):.2%}")

# AdaBoost with 50 stumps
ada = AdaBoostClassifier(n_estimators=50, random_state=42)
ada.fit(X_train, y_train)
print(f"AdaBoost (50):      Train={ada.score(X_train, y_train):.2%}, "
      f"Test={ada.score(X_test, y_test):.2%}")

Output:

📊 Weak Learner vs AdaBoost
==================================================
Single Stump:       Train=70.50%, Test=66.00%
AdaBoost (50):      Train=91.38%, Test=85.00%
A single stump is only modestly better than chance (70.5% train, 66% test), but 50 stumps combined via AdaBoost reach 85% on the test set: the power of “weak to strong” learning!

Learning Curve: How Many Estimators?

๐Ÿ Python
print("๐Ÿ“Š AdaBoost Learning Curve")
print("=" * 50)
print(f"{'Estimators':<12} {'Train':>10} {'Test':>10}")
print("-" * 50)

for n_est in [1, 5, 10, 25, 50, 100, 200]:
    ada = AdaBoostClassifier(n_estimators=n_est, random_state=42)
    ada.fit(X_train, y_train)
    
    print(f"{n_est:<12} {ada.score(X_train, y_train):>10.2%} "
          f"{ada.score(X_test, y_test):>10.2%}")

Output:

📊 AdaBoost Learning Curve
==================================================
Estimators        Train       Test
--------------------------------------------------
1                70.50%     66.00%
5                80.50%     74.50%
10               85.50%     85.00%
25               88.88%     83.50%
50               91.38%     85.00%
100              92.75%     86.00%
200              94.75%     88.00%
Accuracy gains slow after roughly 50 estimators: training accuracy keeps climbing, but test accuracy improves only marginally. Balance model complexity against the extra training time.

Comparing All Ensemble Methods

๐Ÿ Python
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier)
import time

classifiers = {
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'AdaBoost': AdaBoostClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
}

print("๐Ÿ“Š Ensemble Method Comparison")
print("=" * 60)
print(f"{'Method':<20} {'Train':>10} {'Test':>10} {'Time':>10}")
print("-" * 60)

for name, clf in classifiers.items():
    start = time.time()
    clf.fit(X_train, y_train)
    elapsed = time.time() - start
    
    train_acc = clf.score(X_train, y_train)
    test_acc = clf.score(X_test, y_test)
    
    print(f"{name:<20} {train_acc:>10.2%} {test_acc:>10.2%} {elapsed:>9.3f}s")

Output:

📊 Ensemble Method Comparison
============================================================
Method                    Train       Test       Time
------------------------------------------------------------
Random Forest           100.00%     92.00%     0.559s
AdaBoost                 92.75%     86.00%     0.647s
Gradient Boosting        99.62%     93.00%     1.078s
Gradient Boosting achieves the highest test accuracy (93%) but is the slowest. Random Forest balances speed and accuracy well (92%). AdaBoost is simpler but less accurate on this kind of data.

Visualizing Decision Boundaries

๐Ÿ Python
from sklearn.datasets import make_moons
import numpy as np

# Create moon dataset (harder to separate)
X_moon, y_moon = make_moons(n_samples=200, noise=0.25, random_state=42)

print("๐Ÿ“Š AdaBoost on Moon Dataset")
print("=" * 40)

for n_est in [1, 5, 20, 50]:
    ada = AdaBoostClassifier(n_estimators=n_est, random_state=42)
    ada.fit(X_moon, y_moon)
    print(f"{n_est:>3} estimators: accuracy = {ada.score(X_moon, y_moon):.1%}")

Output:

📊 AdaBoost on Moon Dataset
========================================
  1 estimators: accuracy = 84.5%
  5 estimators: accuracy = 91.0%
 20 estimators: accuracy = 96.0%
 50 estimators: accuracy = 97.5%
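To actually visualize the boundaries, a matplotlib sketch like the one below (not part of the original snippet; it reuses X_moon, y_moon and the imports from above) plots the decision regions as the number of stumps grows:

```python
import numpy as np
import matplotlib.pyplot as plt

# Grid covering the feature space
xx, yy = np.meshgrid(
    np.linspace(X_moon[:, 0].min() - 0.5, X_moon[:, 0].max() + 0.5, 200),
    np.linspace(X_moon[:, 1].min() - 0.5, X_moon[:, 1].max() + 0.5, 200))

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, n_est in zip(axes, [1, 5, 20, 50]):
    ada = AdaBoostClassifier(n_estimators=n_est, random_state=42)
    ada.fit(X_moon, y_moon)
    Z = ada.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3, cmap="coolwarm")  # decision regions
    ax.scatter(X_moon[:, 0], X_moon[:, 1], c=y_moon, cmap="coolwarm", s=15)
    ax.set_title(f"{n_est} stumps")
plt.tight_layout()
plt.show()
```

With a single stump the boundary is one axis-aligned split; with more stumps it bends around the interleaved moons.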

Deep Dive

Frequently Asked Questions

Q1: Boosting vs Bagging - when to use which?

| Your Problem | Recommendation | Why |
|---|---|---|
| Overfitting (high variance) | Bagging (Random Forest) | Averaging reduces variance |
| Underfitting (high bias) | Boosting (AdaBoost/GB) | Sequential correction reduces bias |
| Noisy data, outliers | Bagging | More robust to noise |
| Clean data, maximum accuracy | Gradient Boosting | Best accuracy |
| Fast training needed | Random Forest | Parallelizable |
| Interpretability needed | Single decision tree | But lower accuracy |

Q2: AdaBoost vs Gradient Boosting vs XGBoost

| Aspect | AdaBoost | Gradient Boosting | XGBoost/LightGBM |
|---|---|---|---|
| How it learns | Reweight samples | Fit residuals | Fit residuals + regularization |
| Loss function | Exponential | Any differentiable | Any + custom |
| Speed | 🔶 Medium | 🐢 Slow | ⚡ Fast (optimized) |
| Accuracy | 🔶 Good | ✅ Better | ✅✅ Best |
| Hyperparameters | Few | Medium | Many |
| Outlier sensitivity | ⚠️ High | 🔶 Medium | 🔶 Medium |
In practice: Use XGBoost/LightGBM for competitions. Use AdaBoost for simplicity and interpretability.
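If the xgboost package is installed, a quick hedged comparison on the same train/test split from the Code Practice section looks like this (hyperparameters are illustrative, not tuned; the Gradient Boosting numbers are already in the comparison table above):

```python
try:
    from xgboost import XGBClassifier  # third-party package, may not be installed

    xgb = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3,
                        random_state=42)
    xgb.fit(X_train, y_train)
    print(f"XGBoost test accuracy: {xgb.score(X_test, y_test):.2%}")
except ImportError:
    print("xgboost is not installed; pip install xgboost to run this comparison")
```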

Q3: What if AdaBoost overfits?

| Symptom | Solution |
|---|---|
| Train accuracy » test accuracy | Reduce n_estimators |
| Performance degrades with more trees | Use learning_rate < 1.0 |
| Sensitive to outliers | Remove outliers or use Random Forest |
| Complex base learner | Use simpler stumps (max_depth=1) |
# Example: Prevent overfitting
ada = AdaBoostClassifier(
    n_estimators=50,
    learning_rate=0.5,  # Shrinkage
    estimator=DecisionTreeClassifier(max_depth=1),
    random_state=42
)

Q4: Why is AdaBoost sensitive to outliers?

Outliers get higher weights when misclassified → the next learner focuses heavily on them → the ensemble is distorted by the outliers.

Solutions:

  • Use Gradient Boosting with a more robust loss (e.g., Huber loss for regression)
  • Remove outliers before training
  • Use Random Forest instead
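A crude way to see this sensitivity is to flip a small fraction of training labels (a stand-in for outliers) and compare AdaBoost with Random Forest. A minimal sketch, reusing X_train/X_test from the Code Practice section; the 5% noise level is chosen arbitrarily:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

rng = np.random.RandomState(0)
y_noisy = y_train.copy()
flip = rng.choice(len(y_noisy), size=int(0.05 * len(y_noisy)), replace=False)
y_noisy[flip] = 1 - y_noisy[flip]  # flip 5% of the training labels

for name, clf in [("AdaBoost", AdaBoostClassifier(n_estimators=100, random_state=42)),
                  ("Random Forest", RandomForestClassifier(n_estimators=100, random_state=42))]:
    clf.fit(X_train, y_noisy)
    print(f"{name:<14} test accuracy with 5% label noise: {clf.score(X_test, y_test):.2%}")
```

AdaBoost typically loses more accuracy here, because the mislabeled points keep getting upweighted.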

Practical Tips

AdaBoost tuning checklist:

  1. Start with n_estimators=50, learning_rate=1.0
  2. If overfitting: reduce n_estimators or learning_rate
  3. Use algorithm='SAMME.R' (real-valued, usually better; note that recent scikit-learn versions deprecate 'SAMME.R' in favor of 'SAMME')
  4. For noisy data, consider Random Forest instead
  5. Monitor train vs test accuracy gap
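The checklist can be automated with a small grid search; a sketch with a deliberately small, illustrative grid (reusing X_train from the Code Practice section; the estimator parameter assumes scikit-learn >= 1.2):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "n_estimators": [25, 50, 100],
    "learning_rate": [0.1, 0.5, 1.0],
}
grid = GridSearchCV(
    AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                       random_state=42),
    param_grid, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)

print("Best parameters: ", grid.best_params_)
print(f"Best CV accuracy: {grid.best_score_:.2%}")
```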

Key Hyperparameters

| Parameter | Default | Effect |
|---|---|---|
| n_estimators | 50 | More → stronger fit, but higher overfitting risk |
| learning_rate | 1.0 | Lower → more conservative updates |
| estimator | DecisionTreeClassifier(max_depth=1) | Weak learner type |
| algorithm | 'SAMME.R' | 'SAMME' for discrete votes, 'SAMME.R' for real-valued predictions |

Common Pitfalls

| Pitfall | Symptom | Solution |
|---|---|---|
| Too many estimators | Overfitting | Reduce n_estimators or use early stopping |
| Strong base learners | Poor ensemble | Use depth-1 stumps |
| Outliers in data | Distorted predictions | Clean the data or use Random Forest |
| Class imbalance | Minority class ignored | Use class weights |
| Algorithm choice | Poor probabilities | Use 'SAMME.R' for predict_proba |

Summary

| Concept | Key Points |
|---|---|
| Boosting | Sequential learning, focus on errors |
| AdaBoost | Reweight samples, weak → strong |
| $\alpha_t$ | Model weight based on error |
| Weight update | Increase for misclassified samples |
| vs Bagging | Reduces bias (not variance) |

References

  • Freund, Y. & Schapire, R. E. (1997). “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting.” Journal of Computer and System Sciences, 55(1), 119-139.
  • scikit-learn ensemble documentation
  • Hastie, T., Tibshirani, R. & Friedman, J. “The Elements of Statistical Learning”, Chapter 10
  • Abu-Mostafa, Y. “Learning From Data”