ML-16: Boosting and AdaBoost
Learning Objectives
- Understand Boosting vs Bagging
- Master AdaBoost algorithm
- Implement weighted learning
- Choose between ensemble methods
Theory
Boosting vs Bagging: Two Philosophies
The previous post showed that Bagging (Random Forest) builds trees in parallel on bootstrap samples. Boosting takes a fundamentally different approach: build models sequentially, where each new model focuses on the mistakes of the previous ones.
| Aspect | Bagging | Boosting |
|---|---|---|
| Training | Parallel (independent) | Sequential (each depends on previous) |
| Focus | Reduce variance | Reduce bias |
| Samples | Random bootstrap subsets | Reweighted (hard examples get more weight) |
| Aggregation | Equal vote / average | Weighted vote (better models count more) |
| Overfitting | Resistant | Can overfit noisy data |
The Boosting Intuition
Imagine you’re studying for an exam:
- Round 1: Take a practice test, identify problems you got wrong
- Round 2: Study harder on the topics you missed, take another test
- Round 3: Focus even more on remaining weak spots
- Final: Combine all your knowledge, weighted by how much you improved
This is exactly what Boosting does with classifiers!
AdaBoost: Adaptive Boosting
AdaBoost (Adaptive Boosting), introduced by Freund & Schapire (1997), is the most famous boosting algorithm.
The Algorithm Step-by-Step
Input: Training data $(x_1, y_1), \dots, (x_N, y_N)$ where $y_i \in \{-1, +1\}$
Step 1: Initialize sample weights
$$w_i^{(1)} = \frac{1}{N} \quad \text{for all } i$$
All samples start with equal importance.
Step 2: For each round $t = 1, 2, …, T$:
(a) Train weak learner $h_t$ on weighted samples
(b) Compute weighted error:
$$\epsilon_t = \frac{\sum_{i: h_t(x_i) \neq y_i} w_i^{(t)}}{\sum_i w_i^{(t)}} = \sum_{i: \text{wrong}} w_i^{(t)}$$
(c) Compute model weight (how much this model counts):
$$\boxed{\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)}$$
| $\epsilon_t$ | $\alpha_t$ | Meaning |
|---|---|---|
| 0.5 | 0 | Random guessing → no contribution |
| 0.3 | 0.42 | Good classifier → moderate weight |
| 0.1 | 1.10 | Great classifier → high weight |
| 0.01 | 2.30 | Excellent → very high weight |
(d) Update sample weights:
$$\boxed{w_i^{(t+1)} = w_i^{(t)} \cdot e^{-\alpha_t y_i h_t(x_i)}}$$
Then normalize so weights sum to 1.
| Classification | $y_i \cdot h_t(x_i)$ | Weight change |
|---|---|---|
| Correct | +1 | $w_i \cdot e^{-\alpha_t}$ → decreases |
| Incorrect | -1 | $w_i \cdot e^{+\alpha_t}$ → increases |
Step 3: Final classifier
$$\boxed{H(x) = \text{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)}$$
Each weak learner votes, weighted by its quality!
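To make Steps 1–3 concrete, here is a minimal from-scratch sketch (not sklearn's implementation), assuming an illustrative synthetic dataset from `make_classification` and `DecisionTreeClassifier(max_depth=1)` as the weak learner:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic dataset; labels mapped to {-1, +1} as in the derivation above
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
y = np.where(y == 0, -1, 1)

T = 50                                 # number of boosting rounds
N = len(y)
w = np.full(N, 1.0 / N)                # Step 1: uniform sample weights
stumps, alphas = [], []

for t in range(T):
    # (a) Train a weak learner (decision stump) on the weighted samples
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)

    # (b) Weighted error and (c) model weight alpha_t
    eps = np.clip(w[pred != y].sum() / w.sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - eps) / eps)

    # (d) Reweight: exp(+alpha) for misclassified, exp(-alpha) for correct, then normalize
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Step 3: sign of the alpha-weighted vote
H = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
print(f"Ensemble training accuracy: {(H == y).mean():.3f}")
```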
Worked Example: AdaBoost in Action
Consider 10 samples: 7 correct (✓) and 3 incorrect (✗) after the first stump.
Initial weights: $w_i = 0.1$ for all
After Round 1:
- Error: $\epsilon_1 = 0.1 + 0.1 + 0.1 = 0.3$
- Model weight: $\alpha_1 = \frac{1}{2}\ln\frac{0.7}{0.3} = 0.42$
- New weights:
- Correct: $0.1 \times e^{-0.42} = 0.066$
- Incorrect: $0.1 \times e^{+0.42} = 0.152$
- After normalization: correct ≈ 0.07, incorrect ≈ 0.17 (verified numerically below)
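The arithmetic above can be checked in a few lines of NumPy (a quick numerical verification of the same example):

```python
import numpy as np

w = np.full(10, 0.1)                       # 10 samples, uniform initial weights
wrong = np.array([False] * 7 + [True] * 3) # 7 correct, 3 misclassified

eps = w[wrong].sum()                       # weighted error = 0.3
alpha = 0.5 * np.log((1 - eps) / eps)      # ~0.42

# Misclassified samples get exp(+alpha), correct ones exp(-alpha), then normalize
w_new = w * np.exp(np.where(wrong, alpha, -alpha))
w_new /= w_new.sum()

print(f"alpha = {alpha:.2f}")              # ~0.42
print(f"correct weight   = {w_new[0]:.2f}")   # ~0.07
print(f"incorrect weight = {w_new[-1]:.2f}")  # ~0.17
```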
Why Weak Learners?
Weak learner: A classifier that’s only slightly better than random guessing ($\epsilon < 0.5$).
| Weak Learner | Description |
|---|---|
| Decision Stump | Tree with depth 1 (most common) |
| Linear classifier | Simple linear decision boundary (e.g., perceptron) |
| Small tree | Depth 2-3 |
Code Practice
The following examples demonstrate AdaBoost in action and compare it with other ensemble methods.
Single Stump vs AdaBoost
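A minimal sketch of this comparison, assuming sklearn's built-in breast cancer dataset and a 70/30 train/test split (both illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# A single decision stump: depth-1 tree, the typical AdaBoost weak learner
stump = DecisionTreeClassifier(max_depth=1, random_state=42)
stump.fit(X_train, y_train)

# AdaBoost: 50 stumps combined by a weighted vote
ada = AdaBoostClassifier(n_estimators=50, random_state=42)
ada.fit(X_train, y_train)

print(f"Single stump test accuracy: {stump.score(X_test, y_test):.3f}")
print(f"AdaBoost (50 stumps):       {ada.score(X_test, y_test):.3f}")
```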
Learning Curve: How Many Estimators?
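A sketch of the learning curve, assuming the same illustrative dataset and using `staged_score` to get the accuracy of the partial ensemble after each boosting round:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

ada = AdaBoostClassifier(n_estimators=300, random_state=42)
ada.fit(X_train, y_train)

# staged_score yields the accuracy after each boosting round (1, 2, ..., 300 estimators)
train_scores = list(ada.staged_score(X_train, y_train))
test_scores = list(ada.staged_score(X_test, y_test))

plt.plot(train_scores, label="train")
plt.plot(test_scores, label="test")
plt.xlabel("Number of estimators")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```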
Comparing All Ensemble Methods
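A sketch comparing a single tree, bagging (Random Forest), AdaBoost, and Gradient Boosting with 5-fold cross-validation (dataset and settings are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest (bagging)": RandomForestClassifier(n_estimators=100, random_state=42),
    "AdaBoost (boosting)": AdaBoostClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
}

# 5-fold cross-validated accuracy for each ensemble method
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:25s} {scores.mean():.3f} +/- {scores.std():.3f}")
```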
Visualizing Decision Boundaries
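A sketch that plots the decision boundaries of a single stump, Random Forest, and AdaBoost on a two-dimensional toy dataset (`make_moons` is an illustrative choice):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=300, noise=0.25, random_state=42)

models = {
    "Decision Stump": DecisionTreeClassifier(max_depth=1),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "AdaBoost": AdaBoostClassifier(n_estimators=100, random_state=42),
}

# Grid over the 2D feature space; each model colors it by its predicted class
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 200),
                     np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 200))

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (name, model) in zip(axes, models.items()):
    model.fit(X, y)
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3, cmap="coolwarm")
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", edgecolor="k", s=20)
    ax.set_title(name)
plt.tight_layout()
plt.show()
```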
Deep Dive
Frequently Asked Questions
Q1: Boosting vs Bagging โ when to use which?
| Your Problem | Recommendation | Why |
|---|---|---|
| Overfitting (high variance) | Bagging (Random Forest) | Averaging reduces variance |
| Underfitting (high bias) | Boosting (AdaBoost/GB) | Sequential correction reduces bias |
| Noisy data, outliers | Bagging | More robust to noise |
| Clean data, max accuracy | Gradient Boosting | Best accuracy |
| Fast training needed | Random Forest | Parallelizable |
| Interpretability needed | Single Decision Tree | But lower accuracy |
Q2: AdaBoost vs Gradient Boosting vs XGBoost
| Aspect | AdaBoost | Gradient Boosting | XGBoost/LightGBM |
|---|---|---|---|
| How it learns | Reweight samples | Fit residuals | Fit residuals + regularization |
| Loss function | Exponential | Any differentiable | Any + custom |
| Speed | Medium | Slow | Fast (optimized) |
| Accuracy | Good | Better | Best |
| Hyperparameters | Few | Medium | Many |
| Outlier sensitivity | High | Medium | Medium |
Q3: What if AdaBoost overfits?
| Symptom | Solution |
|---|---|
| Train accuracy » Test accuracy | Reduce n_estimators |
| Performance degrades with more trees | Use learning_rate < 1.0 |
| Sensitive to outliers | Remove outliers or use RandomForest |
| Complex base learner | Use simpler stumps (max_depth=1) |
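One way to act on the first two rows: shrink the learning rate and keep only the best-performing prefix of the ensemble, chosen on held-out data (a sketch; the validation split and values are illustrative):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Shrink each round's contribution, then pick the boosting round with the best validation score
ada = AdaBoostClassifier(n_estimators=500, learning_rate=0.5, random_state=42)
ada.fit(X_train, y_train)

val_scores = np.fromiter(ada.staged_score(X_val, y_val), dtype=float)
best_rounds = int(val_scores.argmax()) + 1
print(f"Best validation accuracy {val_scores.max():.3f} after {best_rounds} estimators")
```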
Q4: Why is AdaBoost sensitive to outliers?
Outliers get higher weights when misclassified → the next learner focuses heavily on them → the ensemble is distorted by outliers.
Solutions:
- Use Gradient Boosting with a robust loss (e.g., Huber loss for regression; see the sketch after this list)
- Remove outliers before training
- Use Random Forest instead
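For regression problems, the robust-loss option looks like the sketch below; the dataset and injected outliers are synthetic, purely for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data with a handful of injected outliers
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
rng = np.random.default_rng(42)
idx = rng.choice(len(y), size=15, replace=False)
y[idx] += rng.normal(0.0, 500.0, size=15)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Squared error chases the outliers; Huber loss caps their influence
for loss in ["squared_error", "huber"]:
    gbr = GradientBoostingRegressor(loss=loss, random_state=42)
    gbr.fit(X_train, y_train)
    print(f"loss={loss:13s}  test R^2: {gbr.score(X_test, y_test):.3f}")
```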
Practical Tips
AdaBoost tuning checklist:
- Start with `n_estimators=50`, `learning_rate=1.0`
- If overfitting: reduce `n_estimators` or `learning_rate`
- Use `algorithm='SAMME.R'` (real-valued, usually better; deprecated in newer scikit-learn releases, which keep only `'SAMME'`)
- For noisy data, consider Random Forest instead
- Monitor train vs test accuracy gap
Key Hyperparameters
| Parameter | Default | Effect |
|---|---|---|
| `n_estimators` | 50 | More → stronger, but risk of overfitting |
| `learning_rate` | 1.0 | Lower → more conservative updates |
| `estimator` | `DecisionTreeClassifier(max_depth=1)` | Weak learner type |
| `algorithm` | 'SAMME.R' | 'SAMME' for discrete, 'SAMME.R' for real-valued predictions |
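A sketch of tuning the two main knobs jointly with `GridSearchCV` (grid values are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Illustrative grid over the two hyperparameters that matter most
param_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.1, 0.5, 1.0],
}

search = GridSearchCV(AdaBoostClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```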
Common Pitfalls
| Pitfall | Symptom | Solution |
|---|---|---|
| Too many estimators | Overfitting | Reduce or use early stopping |
| Strong base learners | Poor ensemble | Use depth=1 stumps |
| Outliers in data | Distorted predictions | Clean data or use RF |
| Class imbalance | Ignores minority | Use class weights |
| Algorithm choice | Poor probabilities | Use 'SAMME.R' for `predict_proba` |
Summary
| Concept | Key Points |
|---|---|
| Boosting | Sequential learning, focus on errors |
| AdaBoost | Reweight samples, weak → strong |
| $\alpha_t$ | Model weight based on error |
| Weight Update | Increase for misclassified |
| vs Bagging | Reduces bias (not variance) |
References
- Freund, Y. & Schapire, R. (1997). “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting”. Journal of Computer and System Sciences, 55(1), 119-139.
- sklearn Ensemble Documentation
- Hastie, T., Tibshirani, R. & Friedman, J. “The Elements of Statistical Learning”, Chapter 10: Boosting and Additive Trees
- Abu-Mostafa, Y. “Learning From Data”