Summary
Master ensemble learning: understand why combining trees reduces variance, learn Bootstrap aggregating (Bagging), build Random Forests with feature randomization, and evaluate models using Out-of-Bag error.
Learning Objectives
- Understand ensemble learning motivation
- Explain Bootstrap aggregating (Bagging)
- Master Random Forest algorithm
- Use Out-of-Bag (OOB) evaluation
Theory
Why Ensemble Learning?
In the last blog post, we saw that decision trees are prone to overfitting: a tree with no depth limit memorizes the training data perfectly but generalizes poorly.
Key insight: Instead of building one complex model, combine many simple models!
```mermaid
graph TD
    subgraph "Ensemble"
        C["Many Trees"] --> D["Vote/Average"] --> E["Low Variance"]
    end
    subgraph "Single Model"
        A["One Tree"] --> B["High Variance"]
    end
```
| Approach | Bias | Variance | Result |
|---|---|---|---|
| Single deep tree | Low | High | Overfits |
| Single shallow tree | High | Low | Underfits |
| Ensemble of trees | Low | Low | Best of both! |
The Wisdom of Crowds: Just like averaging many people’s guesses gives a better estimate than any individual, averaging many models often outperforms any single model.
Bootstrap Aggregating (Bagging)
Bagging = Bootstrap Aggregating
Step 1: Create Bootstrap Samples
Bootstrap sampling: Draw $n$ samples with replacement from the original dataset of size $n$.
| Original Data | Bootstrap Sample 1 | Bootstrap Sample 2 |
|---|---|---|
| [A, B, C, D, E] | [A, A, C, D, E] | [B, C, C, D, D] |
Key property: on average, each bootstrap sample contains only ~63.2% of the distinct original samples (the remaining draws are duplicates).
$$P(\text{sample not selected}) = \left(1 - \frac{1}{n}\right)^n \xrightarrow{n \to \infty} \frac{1}{e} \approx 0.368$$
So ~36.8% of the samples are left out of each bootstrap sample; these become the Out-of-Bag (OOB) samples!
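You can check the 63.2% figure with a quick simulation; a minimal sketch using NumPy (the dataset size n = 1000 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000   # arbitrary dataset size

# Draw many bootstrap samples and measure the fraction of distinct original points
fractions = []
for _ in range(200):
    sample = rng.integers(0, n, size=n)            # n draws with replacement
    fractions.append(len(np.unique(sample)) / n)   # fraction of unique indices

print(f"Average unique fraction: {np.mean(fractions):.3f}")   # ~0.632
print(f"Theoretical 1 - 1/e:     {1 - 1 / np.e:.3f}")
```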
Step 2: Train Models on Each Sample
```mermaid
graph LR
    D[(Original Data<br/>n samples)] --> S1["Bootstrap 1<br/>~63% unique"]
    D --> S2["Bootstrap 2<br/>~63% unique"]
    D --> S3["Bootstrap 3<br/>~63% unique"]
    S1 --> M1["Tree 1"]
    S2 --> M2["Tree 2"]
    S3 --> M3["Tree 3"]
    M1 --> V{{"Vote /<br/>Average"}}
    M2 --> V
    M3 --> V
    V --> F["Final Prediction"]
    style V fill:#fff9c4
    style F fill:#c8e6c9
```
Step 3: Aggregate Predictions
| Task | Aggregation | Formula |
|---|---|---|
| Classification | Majority vote | $\hat{y} = \text{mode}(h_1(x), \ldots, h_T(x))$ |
| Regression | Average | $\hat{y} = \frac{1}{T}\sum_{t=1}^{T} h_t(x)$ |
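As a tiny illustration of both aggregation rules (the per-tree predictions below are made-up numbers, not the output of real trees):

```python
import numpy as np

# Toy per-tree predictions for a single input x (made-up numbers)
class_preds = np.array([1, 0, 1, 1, 0])           # T = 5 classifiers, labels in {0, 1}
reg_preds = np.array([2.3, 2.9, 2.5, 3.1, 2.7])   # T = 5 regressors

# Classification: majority vote = mode of the predicted labels
y_hat_cls = np.argmax(np.bincount(class_preds))

# Regression: plain average of the individual predictions
y_hat_reg = reg_preds.mean()

print(f"Majority vote: {y_hat_cls}")   # 1 (three of the five trees vote for class 1)
print(f"Average:       {y_hat_reg}")   # 2.7
```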
Why Bagging Reduces Variance
Key mathematical insight: Averaging uncorrelated errors reduces variance.
If each tree has variance $\sigma^2$ and trees are independent:
$$\text{Var}\left(\frac{1}{T}\sum_{t=1}^{T} h_t\right) = \frac{\sigma^2}{T}$$
More trees → lower variance!
Catch: real trees aren't fully independent, since they're trained on similar data. This is why Random Forest adds feature randomization.
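A quick simulation makes both points concrete: with independent errors the variance of the average drops like $\sigma^2/T$, while a shared (correlated) error component puts a floor under it (synthetic Gaussian "predictions", not actual trees):

```python
import numpy as np

rng = np.random.default_rng(0)
T, sigma, runs = 50, 1.0, 10_000

# Independent "tree" errors: variance of the average shrinks like sigma^2 / T
indep = rng.normal(0, sigma, size=(runs, T))
print(f"Independent: Var(mean) = {indep.mean(axis=1).var():.4f} "
      f"(theory {sigma**2 / T:.4f})")

# Correlated errors: a shared component keeps the variance from shrinking
rho = 0.5
shared = rng.normal(0, sigma, size=(runs, 1))
corr = np.sqrt(rho) * shared + np.sqrt(1 - rho) * rng.normal(0, sigma, size=(runs, T))
# Var(mean) -> rho * sigma^2 + (1 - rho) * sigma^2 / T
print(f"Correlated:  Var(mean) = {corr.mean(axis=1).var():.4f} "
      f"(theory {rho * sigma**2 + (1 - rho) * sigma**2 / T:.4f})")
```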
Random Forest = Bagging + Feature Randomization
Random Forest adds one crucial modification: at each split, only a random subset of features is considered.
| At each split | Bagging | Random Forest |
|---|---|---|
| Features considered | All $d$ features | Random subset |
| Classification | All $d$ | $\sqrt{d}$ (default) |
| Regression | All $d$ | $d/3$ (default) |
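In scikit-learn the subset size is controlled by `max_features`; a minimal sketch (note: recent sklearn versions default the classifier to `'sqrt'` but the regressor to all features, so the classic $d/3$ rule has to be set explicitly):

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: consider sqrt(d) features at each split (sklearn's default)
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)

# Regression: the classic d/3 heuristic, passed as a fraction of the feature count
reg = RandomForestRegressor(n_estimators=100, max_features=1/3, random_state=42)
```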
Why Random Features?
Without feature randomization, all trees tend to split on the same "best" features → trees become correlated → less variance reduction.
```mermaid
graph TD
    A["Bagging Only"] --> B["Trees split on<br/>same strong features"]
    B --> C["Correlated trees"]
    C --> D["Limited variance reduction"]
    E["Random Forest"] --> F["Trees split on<br/>different features"]
    F --> G["Uncorrelated trees"]
    G --> H["Maximum variance reduction"]
    style D fill:#ffcdd2
    style H fill:#c8e6c9
```
Intuition: Each tree sees a different “view” of the data. Some trees might miss the strongest feature at some splits, forcing them to find alternative patterns.
Out-of-Bag (OOB) Evaluation
Since ~36.8% of samples are left out of each bootstrap, we get a free validation set!
How OOB Works
- For each sample $x_i$, find trees that didn’t use it for training
- Get predictions from only those trees
- Aggregate these predictions
- Compare to the true label → OOB error
```
Sample 1: Used in trees [1,2,5,7] → OOB prediction from trees [3,4,6,8,9,10]
Sample 2: Used in trees [2,3,4,8] → OOB prediction from trees [1,5,6,7,9,10]
...
```
| Advantage | Description |
|---|---|
| No separate validation set needed | Uses the ~37% left-out samples |
| Unbiased estimate | Similar to cross-validation |
| Computed automatically | Set `oob_score=True` in sklearn |
OOB error is approximately equal to leave-one-out cross-validation error, but much faster to compute!
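If you want per-sample OOB predictions rather than just the aggregate score, scikit-learn exposes `oob_decision_function_` once `oob_score=True` is set. A small self-contained sketch (synthetic data, so the numbers will differ from the examples below):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)

# For each training sample, oob_decision_function_ holds the class probabilities
# averaged over only the trees that did NOT see that sample during training.
oob_pred = np.argmax(rf.oob_decision_function_, axis=1)
print(f"Manual OOB accuracy: {np.mean(oob_pred == y):.2%}")
print(f"rf.oob_score_:       {rf.oob_score_:.2%}")   # should match
```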
Code Practice
The following examples compare single trees, bagging, and Random Forest.
Bagging vs Single Tree
```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate dataset
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Single Tree vs Bagging Comparison")
print("=" * 50)

# Single tree (high variance)
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
print(f"Single Tree: Train={tree.score(X_train, y_train):.2%}, "
      f"Test={tree.score(X_test, y_test):.2%}")

# Bagging (reduced variance)
bag = BaggingClassifier(n_estimators=50, random_state=42)
bag.fit(X_train, y_train)
print(f"Bagging (50 trees): Train={bag.score(X_train, y_train):.2%}, "
      f"Test={bag.score(X_test, y_test):.2%}")
```
Output:
```
Single Tree vs Bagging Comparison
==================================================
Single Tree: Train=100.00%, Test=83.50%
Bagging (50 trees): Train=100.00%, Test=91.00%
```
The single tree achieves 100% training accuracy (overfitting!) but only 83.5% test accuracy. Bagging improves test accuracy to 91% by averaging out the variance.
Random Forest with OOB Score
```python
from sklearn.ensemble import RandomForestClassifier

# Random Forest with OOB evaluation
rf = RandomForestClassifier(
    n_estimators=100,
    oob_score=True,   # Enable OOB
    random_state=42,
    n_jobs=-1         # Use all CPUs
)
rf.fit(X_train, y_train)

print("Random Forest Results")
print("=" * 40)
print(f"Train accuracy: {rf.score(X_train, y_train):.2%}")
print(f"Test accuracy: {rf.score(X_test, y_test):.2%}")
print(f"OOB score: {rf.oob_score_:.2%}")

print(f"\nModel Statistics:")
print(f"  Number of trees: {rf.n_estimators}")
print(f"  Features per split: {rf.max_features}")
print(f"  Max depth: {rf.max_depth or 'None (unlimited)'}")
```
Output:
```
Random Forest Results
========================================
Train accuracy: 100.00%
Test accuracy: 92.00%
OOB score: 92.25%

Model Statistics:
  Number of trees: 100
  Features per split: sqrt
  Max depth: None (unlimited)
```
The OOB score (92.25%) closely matches the test accuracy (92.00%); it's a reliable estimate without needing a separate validation set!
Feature Importance
```python
from sklearn.datasets import load_iris
import numpy as np

# Train on Iris
X_iris, y_iris = load_iris(return_X_y=True)
feature_names = load_iris().feature_names

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_iris, y_iris)

print("Feature Importance (Iris Dataset)")
print("=" * 50)
for name, importance in sorted(zip(feature_names, rf.feature_importances_),
                               key=lambda x: -x[1]):
    bar = "█" * int(importance * 40)
    print(f"{name:<18} {importance:.3f} {bar}")
```
Output:
```
Feature Importance (Iris Dataset)
==================================================
petal length (cm)  0.436 █████████████████
petal width (cm)   0.436 █████████████████
sepal length (cm)  0.106 ████
sepal width (cm)   0.022 
```
Petal measurements dominate! This matches what we saw in the Decision Trees post: petal features are the most discriminative for Iris classification.
Effect of Number of Trees
```python
import warnings

print("Effect of Number of Trees")
print("=" * 45)
print(f"{'Trees':<8} {'Train':>10} {'Test':>10} {'OOB':>10}")
print("-" * 45)

for n_trees in [1, 5, 10, 25, 50, 100, 200]:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                    random_state=42)
        rf.fit(X_train, y_train)
    oob = f"{rf.oob_score_:>10.2%}" if hasattr(rf, 'oob_score_') else "N/A"
    print(f"{n_trees:<8} {rf.score(X_train, y_train):>10.2%} "
          f"{rf.score(X_test, y_test):>10.2%} {oob}")
```
Output:
```
Effect of Number of Trees
=============================================
Trees         Train       Test        OOB
---------------------------------------------
1            91.25%     78.00%     60.00%
5            98.50%     84.00%     77.00%
10           99.38%     87.00%     82.38%
25          100.00%     93.00%     87.88%
50          100.00%     91.00%     91.50%
100         100.00%     92.00%     92.25%
200         100.00%     91.00%     92.75%
```
Accuracy stabilizes around 25-100 trees (~91-93%). More trees improve OOB reliability but add training time with diminishing returns.
Hyperparameter Tuning with GridSearchCV
```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [5, 10, None],
    'min_samples_split': [2, 5],
    'max_features': ['sqrt', 'log2']
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    n_jobs=-1,
    scoring='accuracy'
)
grid.fit(X_train, y_train)

print("Grid Search Results")
print("=" * 50)
print(f"Best CV score: {grid.best_score_:.2%}")
print(f"\nBest Parameters:")
for param, value in grid.best_params_.items():
    print(f"  {param}: {value}")

print(f"\nTest accuracy with best model: {grid.score(X_test, y_test):.2%}")
```
Output:
```
Grid Search Results
==================================================
Best CV score: 92.50%

Best Parameters:
  max_depth: None
  max_features: sqrt
  min_samples_split: 2
  n_estimators: 100

Test accuracy with best model: 92.00%
```
Deep Dive
Frequently Asked Questions
Q1: How many trees should I use?
| Trees | Recommendation |
|---|---|
| 10-50 | Quick experiments, development |
| 100-200 | Production (good tradeoff) |
| 500+ | Diminishing returns, only if compute is cheap |
Start with 100 trees. Increase only if OOB score keeps improving.
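One practical way to check this is to grow the forest incrementally with `warm_start=True` and watch the OOB score, in the spirit of scikit-learn's OOB-error example (the tree counts below are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)

# warm_start=True keeps the already-grown trees and only adds new ones on refit
rf = RandomForestClassifier(warm_start=True, oob_score=True,
                            random_state=42, n_jobs=-1)

for n in [50, 100, 200, 400]:
    rf.set_params(n_estimators=n)
    rf.fit(X, y)
    print(f"{n:>4} trees: OOB score = {rf.oob_score_:.2%}")
```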
Q2: Random Forest vs XGBoost/LightGBM
| Aspect | Random Forest | XGBoost/LightGBM |
|---|---|---|
| Building | Parallel (independent trees) | Sequential (each tree corrects the previous) |
| Speed | Fast training | Slower training, fast prediction |
| Tuning | Few hyperparameters | Many hyperparameters |
| Accuracy | Good | Often better |
| Overfitting | Resistant | Needs regularization |
| Best for | Quick baseline, robust models | Competitions, max accuracy |
Q3: When does Random Forest struggle?
| Scenario | Issue | Alternative |
|---|---|---|
| Very high dimensions | Slow, feature importance unreliable | Use feature selection first |
| Sparse data | Doesn't handle it well | XGBoost, linear models |
| Extrapolation | Can't predict outside the training range | Linear models |
| Interpretability needed | Can't explain easily | Single decision tree, SHAP values |
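The extrapolation row is easy to demonstrate on a toy regression problem: a forest trained on $x \in [0, 10]$ simply flat-lines outside that range (synthetic data, hypothetical setup):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_in = rng.uniform(0, 10, size=(200, 1))
y_in = 2 * X_in.ravel() + rng.normal(0, 0.5, size=200)   # roughly y = 2x

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_in, y_in)

# Inside the training range the fit is fine; outside it the prediction saturates
for x in [5.0, 9.0, 15.0, 20.0]:
    print(f"x = {x:>5.1f} -> prediction {rf.predict([[x]])[0]:6.2f} (true ~ {2 * x:.1f})")
```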
Q4: What if feature importance seems wrong?
| Issue | Cause | Solution |
|---|---|---|
| High-cardinality features dominate | More split opportunities | Use permutation importance |
| Correlated features share importance | Importance spreads across them | Inspect together or remove one |
| Noise features ranked high | Overfitting | Try permutation importance on the test set |
```python
# Permutation importance (more reliable)
from sklearn.inspection import permutation_importance

perm_imp = permutation_importance(rf, X_test, y_test, n_repeats=10)
```
Practical Tips
Random Forest tuning checklist (a baseline following it is sketched after the list):
- Start with defaults (`n_estimators=100`)
- Use `oob_score=True` for quick validation
- If overfitting: reduce `max_depth`, increase `min_samples_leaf`
- If underfitting: increase `n_estimators`, try `max_features='sqrt'`
- Always use `n_jobs=-1` for parallelization
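A minimal baseline that follows this checklist (reusing `X_train` / `y_train` from the Code Practice section above):

```python
from sklearn.ensemble import RandomForestClassifier

# Baseline per the checklist; tighten max_depth / min_samples_leaf only if it overfits
rf = RandomForestClassifier(
    n_estimators=100,   # start with the default number of trees
    oob_score=True,     # free validation estimate from the OOB samples
    n_jobs=-1,          # parallelize across all CPU cores
    random_state=42,
)
rf.fit(X_train, y_train)
print(f"OOB score: {rf.oob_score_:.2%}")
```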
Key Hyperparameters
| Parameter | Default | Effect |
|---|---|---|
| `n_estimators` | 100 | More trees → lower variance, slower training |
| `max_depth` | None | Limits tree complexity |
| `max_features` | 'sqrt' | Features per split (decorrelation) |
| `min_samples_split` | 2 | Min samples to split a node |
| `min_samples_leaf` | 1 | Min samples in a leaf |
Common Pitfalls
| Pitfall | Symptom | Solution |
|---|---|---|
| Too few trees | High variance between runs | Use at least 100 trees |
| Not using OOB | Wasting data on validation | Set `oob_score=True` |
| Ignoring class imbalance | Poor minority class prediction | Use `class_weight='balanced'` |
| Relying on MDI importance | Misleading with correlated/high-cardinality features | Use permutation importance |
| Max depth too high | Overfitting on noisy data | Try `max_depth=10-20` |
Summary
| Concept | Key Points |
|---|---|
| Bagging | Bootstrap + aggregate → reduces variance |
| Random Forest | Bagging + random feature subsets per split |
| OOB Score | Free validation from the ~37% unused samples |
| Feature Importance | MDI (fast) or permutation (more reliable) |
| # Trees | 50-200 typically sufficient; diminishing returns |
| vs Boosting | RF: parallel, robust; Boosting: sequential, often higher accuracy |
References
- Breiman, L. (2001). “Random Forests”
- sklearn Random Forest
- Hastie, T. et al. “The Elements of Statistical Learning” - Chapter 15