Summary
Master ensemble learning: understand why combining trees reduces variance, learn Bootstrap aggregating (Bagging), build Random Forests with feature randomization, and evaluate models using Out-of-Bag error.
Learning Objectives
- Understand ensemble learning motivation
- Explain Bootstrap aggregating (Bagging)
- Master Random Forest algorithm
- Use Out-of-Bag (OOB) evaluation
Theory
Why Ensemble Learning?
In the last blog post, we saw that decision trees are prone to overfitting: a tree with no depth limit memorizes the training data perfectly but generalizes poorly.
Key insight: Instead of building one complex model, combine many simple models!
```mermaid
graph TD
    subgraph "Ensemble"
        C["Many Trees"] --> D["Vote/Average"] --> E["Low Variance"]
    end
    subgraph "Single Model"
        A["One Tree"] --> B["High Variance"]
    end
```
| Approach | Bias | Variance | Result |
|---|---|---|---|
| Single deep tree | Low | High | Overfits |
| Single shallow tree | High | Low | Underfits |
| Ensemble of trees | Low | Low | Best of both! |
The Wisdom of Crowds: Just like averaging many people’s guesses gives a better estimate than any individual, averaging many models often outperforms any single model.
Bootstrap Aggregating (Bagging)
Bagging = Bootstrap Aggregating
Step 1: Create Bootstrap Samples
Bootstrap sampling: Draw $n$ samples with replacement from the original dataset of size $n$.
| Original Data | Bootstrap Sample 1 | Bootstrap Sample 2 |
|---|---|---|
| [A, B, C, D, E] | [A, A, C, D, E] | [B, C, C, D, D] |
Key property: on average, each bootstrap sample contains only ~63.2% of the distinct original samples (the remaining draws are duplicates).
$$P(\text{sample not selected}) = \left(1 - \frac{1}{n}\right)^n \xrightarrow{n \to \infty} \frac{1}{e} \approx 0.368$$
So ~36.8% of the samples are left out of each bootstrap sample; these become the Out-of-Bag (OOB) samples!
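You can check the 63.2% figure with a quick simulation; a minimal sketch using NumPy (the dataset size n = 1000 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000   # arbitrary dataset size

# Draw many bootstrap samples and measure the fraction of distinct original points
fractions = []
for _ in range(200):
    sample = rng.integers(0, n, size=n)            # n draws with replacement
    fractions.append(len(np.unique(sample)) / n)   # fraction of unique indices

print(f"Average unique fraction: {np.mean(fractions):.3f}")   # ~0.632
print(f"Theoretical 1 - 1/e:     {1 - 1 / np.e:.3f}")
```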
Step 2: Train Models on Each Sample
```mermaid
graph LR
    D[(Original Data<br/>n samples)] --> S1["Bootstrap 1<br/>~63% unique"]
    D --> S2["Bootstrap 2<br/>~63% unique"]
    D --> S3["Bootstrap 3<br/>~63% unique"]
    S1 --> M1["Tree 1"]
    S2 --> M2["Tree 2"]
    S3 --> M3["Tree 3"]
    M1 --> V{{"Vote /<br/>Average"}}
    M2 --> V
    M3 --> V
    V --> F["Final Prediction"]
    style V fill:#fff9c4
    style F fill:#c8e6c9
```
Step 3: Aggregate Predictions
| Task | Aggregation | Formula |
|---|---|---|
| Classification | Majority vote | $\hat{y} = \text{mode}(h_1(x), \ldots, h_T(x))$ |
| Regression | Average | $\hat{y} = \frac{1}{T}\sum_{t=1}^{T} h_t(x)$ |
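As a tiny illustration of both aggregation rules (the per-tree predictions below are made-up numbers, not the output of real trees):

```python
import numpy as np

# Toy per-tree predictions for a single input x (made-up numbers)
class_preds = np.array([1, 0, 1, 1, 0])           # T = 5 classifiers, labels in {0, 1}
reg_preds = np.array([2.3, 2.9, 2.5, 3.1, 2.7])   # T = 5 regressors

# Classification: majority vote = mode of the predicted labels
y_hat_cls = np.argmax(np.bincount(class_preds))

# Regression: plain average of the individual predictions
y_hat_reg = reg_preds.mean()

print(f"Majority vote: {y_hat_cls}")   # 1 (three of the five trees vote for class 1)
print(f"Average:       {y_hat_reg}")   # 2.7
```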
Why Bagging Reduces Variance
Key mathematical insight: Averaging uncorrelated errors reduces variance.
If each tree has variance $\sigma^2$ and trees are independent:
$$\text{Var}\left(\frac{1}{T}\sum_{t=1}^{T} h_t\right) = \frac{\sigma^2}{T}$$
More trees → lower variance!
Catch: real trees aren't fully independent, since they're trained on similar data. This is why Random Forest adds feature randomization.
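A quick simulation makes both points concrete: with independent errors the variance of the average drops like $\sigma^2/T$, while a shared (correlated) error component puts a floor under it (synthetic Gaussian "predictions", not actual trees):

```python
import numpy as np

rng = np.random.default_rng(0)
T, sigma, runs = 50, 1.0, 10_000

# Independent "tree" errors: variance of the average shrinks like sigma^2 / T
indep = rng.normal(0, sigma, size=(runs, T))
print(f"Independent: Var(mean) = {indep.mean(axis=1).var():.4f} "
      f"(theory {sigma**2 / T:.4f})")

# Correlated errors: a shared component keeps the variance from shrinking
rho = 0.5
shared = rng.normal(0, sigma, size=(runs, 1))
corr = np.sqrt(rho) * shared + np.sqrt(1 - rho) * rng.normal(0, sigma, size=(runs, T))
# Var(mean) -> rho * sigma^2 + (1 - rho) * sigma^2 / T
print(f"Correlated:  Var(mean) = {corr.mean(axis=1).var():.4f} "
      f"(theory {rho * sigma**2 + (1 - rho) * sigma**2 / T:.4f})")
```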
Random Forest = Bagging + Feature Randomization
Random Forest adds one crucial modification: at each split, only a random subset of features is considered.
| At each split | Bagging | Random Forest |
|---|---|---|
| Features considered | All $d$ features | Random subset |
| Classification | All $d$ | $\sqrt{d}$ (default) |
| Regression | All $d$ | $d/3$ (default) |
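In scikit-learn the subset size is controlled by `max_features`; a minimal sketch (note: recent sklearn versions default the classifier to `'sqrt'` but the regressor to all features, so the classic $d/3$ rule has to be set explicitly):

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: consider sqrt(d) features at each split (sklearn's default)
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)

# Regression: the classic d/3 heuristic, passed as a fraction of the feature count
reg = RandomForestRegressor(n_estimators=100, max_features=1/3, random_state=42)
```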
Why Random Features?
Without feature randomization, all trees tend to split on the same "best" features → trees become correlated → less variance reduction.
```mermaid
graph TD
    A["Bagging Only"] --> B["Trees split on<br/>same strong features"]
    B --> C["Correlated trees"]
    C --> D["Limited variance reduction"]
    E["Random Forest"] --> F["Trees split on<br/>different features"]
    F --> G["Uncorrelated trees"]
    G --> H["Maximum variance reduction"]
    style D fill:#ffcdd2
    style H fill:#c8e6c9
```
Intuition: Each tree sees a different “view” of the data. Some trees might miss the strongest feature at some splits, forcing them to find alternative patterns.
Out-of-Bag (OOB) Evaluation
Since ~36.8% of samples are left out of each bootstrap, we get a free validation set!
How OOB Works
- For each sample $x_i$, find trees that didn’t use it for training
- Get predictions from only those trees
- Aggregate these predictions
- Compare to the true label → OOB error
```
Sample 1: Used in trees [1,2,5,7] → OOB prediction from trees [3,4,6,8,9,10]
Sample 2: Used in trees [2,3,4,8] → OOB prediction from trees [1,5,6,7,9,10]
...
```
| Advantage | Description |
|---|---|
| No separate validation set needed | Uses the ~37% left-out samples |
| Unbiased estimate | Similar to cross-validation |
| Computed automatically | Set `oob_score=True` in sklearn |
OOB error is approximately equal to leave-one-out cross-validation error, but much faster to compute!
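If you want per-sample OOB predictions rather than just the aggregate score, scikit-learn exposes `oob_decision_function_` once `oob_score=True` is set. A small self-contained sketch (synthetic data, so the numbers will differ from the examples below):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)

# For each training sample, oob_decision_function_ holds the class probabilities
# averaged over only the trees that did NOT see that sample during training.
oob_pred = np.argmax(rf.oob_decision_function_, axis=1)
print(f"Manual OOB accuracy: {np.mean(oob_pred == y):.2%}")
print(f"rf.oob_score_:       {rf.oob_score_:.2%}")   # should match
```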
Code Practice
The following examples compare single trees, bagging, and Random Forest.
Bagging vs Single Tree
```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate dataset
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Single Tree vs Bagging Comparison")
print("=" * 50)

# Single tree (high variance)
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
print(f"Single Tree: Train={tree.score(X_train, y_train):.2%}, "
      f"Test={tree.score(X_test, y_test):.2%}")

# Bagging (reduced variance)
bag = BaggingClassifier(n_estimators=50, random_state=42)
bag.fit(X_train, y_train)
print(f"Bagging (50 trees): Train={bag.score(X_train, y_train):.2%}, "
      f"Test={bag.score(X_test, y_test):.2%}")
```
Output:
```
Single Tree vs Bagging Comparison
==================================================
Single Tree: Train=100.00%, Test=83.50%
Bagging (50 trees): Train=100.00%, Test=91.00%
```
The single tree achieves 100% training accuracy (overfitting!) but only 83.5% test accuracy. Bagging improves test accuracy to 91% by averaging out the variance.
Random Forest with OOB Score
```python
from sklearn.ensemble import RandomForestClassifier

# Random Forest with OOB evaluation
rf = RandomForestClassifier(
    n_estimators=100,
    oob_score=True,   # Enable OOB
    random_state=42,
    n_jobs=-1         # Use all CPUs
)
rf.fit(X_train, y_train)

print("Random Forest Results")
print("=" * 40)
print(f"Train accuracy: {rf.score(X_train, y_train):.2%}")
print(f"Test accuracy: {rf.score(X_test, y_test):.2%}")
print(f"OOB score: {rf.oob_score_:.2%}")

print(f"\nModel Statistics:")
print(f"  Number of trees: {rf.n_estimators}")
print(f"  Features per split: {rf.max_features}")
print(f"  Max depth: {rf.max_depth or 'None (unlimited)'}")
```
Output:
```
Random Forest Results
========================================
Train accuracy: 100.00%
Test accuracy: 92.00%
OOB score: 92.25%

Model Statistics:
  Number of trees: 100
  Features per split: sqrt
  Max depth: None (unlimited)
```
The OOB score (92.25%) closely matches the test accuracy (92.00%); it's a reliable estimate without needing a separate validation set!
Feature Importance
```python
from sklearn.datasets import load_iris
import numpy as np

# Train on Iris
X_iris, y_iris = load_iris(return_X_y=True)
feature_names = load_iris().feature_names

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_iris, y_iris)

print("Feature Importance (Iris Dataset)")
print("=" * 50)
for name, importance in sorted(zip(feature_names, rf.feature_importances_),
                               key=lambda x: -x[1]):
    bar = "█" * int(importance * 40)
    print(f"{name:<18} {importance:.3f} {bar}")
```
Output:
```
Feature Importance (Iris Dataset)
==================================================
petal length (cm)  0.436 █████████████████
petal width (cm)   0.436 █████████████████
sepal length (cm)  0.106 ████
sepal width (cm)   0.022 
```
Petal measurements dominate! This matches what we saw in the Decision Trees post: petal features are the most discriminative for Iris classification.
Effect of Number of Trees
```python
import warnings

print("Effect of Number of Trees")
print("=" * 45)
print(f"{'Trees':<8} {'Train':>10} {'Test':>10} {'OOB':>10}")
print("-" * 45)

for n_trees in [1, 5, 10, 25, 50, 100, 200]:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                    random_state=42)
        rf.fit(X_train, y_train)
    oob = f"{rf.oob_score_:>10.2%}" if hasattr(rf, 'oob_score_') else "N/A"
    print(f"{n_trees:<8} {rf.score(X_train, y_train):>10.2%} "
          f"{rf.score(X_test, y_test):>10.2%} {oob}")
```
Output:
```
Effect of Number of Trees
=============================================
Trees         Train       Test        OOB
---------------------------------------------
1            91.25%     78.00%     60.00%
5            98.50%     84.00%     77.00%
10           99.38%     87.00%     82.38%
25          100.00%     93.00%     87.88%
50          100.00%     91.00%     91.50%
100         100.00%     92.00%     92.25%
200         100.00%     91.00%     92.75%
```
Accuracy stabilizes around 25-100 trees (~91-93%). More trees improve OOB reliability but add training time with diminishing returns.
Hyperparameter Tuning with GridSearchCV
```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [5, 10, None],
    'min_samples_split': [2, 5],
    'max_features': ['sqrt', 'log2']
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    n_jobs=-1,
    scoring='accuracy'
)
grid.fit(X_train, y_train)

print("Grid Search Results")
print("=" * 50)
print(f"Best CV score: {grid.best_score_:.2%}")
print(f"\nBest Parameters:")
for param, value in grid.best_params_.items():
    print(f"  {param}: {value}")

print(f"\nTest accuracy with best model: {grid.score(X_test, y_test):.2%}")
```
Output:
```
Grid Search Results
==================================================
Best CV score: 92.50%

Best Parameters:
  max_depth: None
  max_features: sqrt
  min_samples_split: 2
  n_estimators: 100

Test accuracy with best model: 92.00%
```
Deep Dive
Frequently Asked Questions
Q1: How many trees should I use?
| Trees | Recommendation |
|---|---|
| 10-50 | Quick experiments, development |
| 100-200 | Production (good tradeoff) |
| 500+ | Diminishing returns, only if compute is cheap |
Start with 100 trees. Increase only if OOB score keeps improving.
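One practical way to check this is to grow the forest incrementally with `warm_start=True` and watch the OOB score, in the spirit of scikit-learn's OOB-error example (the tree counts below are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)

# warm_start=True keeps the already-grown trees and only adds new ones on refit
rf = RandomForestClassifier(warm_start=True, oob_score=True,
                            random_state=42, n_jobs=-1)

for n in [50, 100, 200, 400]:
    rf.set_params(n_estimators=n)
    rf.fit(X, y)
    print(f"{n:>4} trees: OOB score = {rf.oob_score_:.2%}")
```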
Q2: Random Forest vs XGBoost/LightGBM
| Aspect | Random Forest | XGBoost/LightGBM |
|---|---|---|
| Building | Parallel (independent trees) | Sequential (each tree corrects the previous) |
| Speed | Fast training | Slower training, fast prediction |
| Tuning | Few hyperparameters | Many hyperparameters |
| Accuracy | Good | Often better |
| Overfitting | Resistant | Needs regularization |
| Best for | Quick baseline, robust models | Competitions, max accuracy |
Q3: When does Random Forest struggle?
| Scenario | Issue | Alternative |
|---|---|---|
| Very high dimensions | Slow, feature importance unreliable | Use feature selection first |
| Sparse data | Doesn't handle it well | XGBoost, linear models |
| Extrapolation | Can't predict outside the training range | Linear models |
| Interpretability needed | Can't explain easily | Single decision tree, SHAP values |
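The extrapolation row is easy to demonstrate on a toy regression problem: a forest trained on $x \in [0, 10]$ simply flat-lines outside that range (synthetic data, hypothetical setup):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_in = rng.uniform(0, 10, size=(200, 1))
y_in = 2 * X_in.ravel() + rng.normal(0, 0.5, size=200)   # roughly y = 2x

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_in, y_in)

# Inside the training range the fit is fine; outside it the prediction saturates
for x in [5.0, 9.0, 15.0, 20.0]:
    print(f"x = {x:>5.1f} -> prediction {rf.predict([[x]])[0]:6.2f} (true ~ {2 * x:.1f})")
```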
Q4: What if feature importance seems wrong?
| Issue | Cause | Solution |
|---|---|---|
| High-cardinality features dominate | More split opportunities | Use permutation importance |
| Correlated features share importance | Importance spreads across them | Inspect together or remove one |
| Noise features ranked high | Overfitting | Try permutation importance on the test set |
```python
# Permutation importance (more reliable)
from sklearn.inspection import permutation_importance

perm_imp = permutation_importance(rf, X_test, y_test, n_repeats=10)
```
Practical Tips
Random Forest tuning checklist (a baseline following it is sketched after the list):
- Start with defaults (`n_estimators=100`)
- Use `oob_score=True` for quick validation
- If overfitting: reduce `max_depth`, increase `min_samples_leaf`
- If underfitting: increase `n_estimators`, try `max_features='sqrt'`
- Always use `n_jobs=-1` for parallelization
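A minimal baseline that follows this checklist (reusing `X_train` / `y_train` from the Code Practice section above):

```python
from sklearn.ensemble import RandomForestClassifier

# Baseline per the checklist; tighten max_depth / min_samples_leaf only if it overfits
rf = RandomForestClassifier(
    n_estimators=100,   # start with the default number of trees
    oob_score=True,     # free validation estimate from the OOB samples
    n_jobs=-1,          # parallelize across all CPU cores
    random_state=42,
)
rf.fit(X_train, y_train)
print(f"OOB score: {rf.oob_score_:.2%}")
```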
Key Hyperparameters
| Parameter | Default | Effect |
|---|---|---|
| `n_estimators` | 100 | More trees → lower variance, slower training |
| `max_depth` | None | Limits tree complexity |
| `max_features` | 'sqrt' | Features per split (decorrelation) |
| `min_samples_split` | 2 | Min samples to split a node |
| `min_samples_leaf` | 1 | Min samples in a leaf |
Common Pitfalls
| Pitfall | Symptom | Solution |
|---|---|---|
| Too few trees | High variance between runs | Use at least 100 trees |
| Not using OOB | Wasting data on validation | Set `oob_score=True` |
| Ignoring class imbalance | Poor minority class prediction | Use `class_weight='balanced'` |
| Relying on MDI importance | Misleading with correlated/high-cardinality features | Use permutation importance |
| Max depth too high | Overfitting on noisy data | Try `max_depth=10-20` |
Summary
| Concept | Key Points |
|---|---|
| Bagging | Bootstrap + aggregate → reduces variance |
| Random Forest | Bagging + random feature subsets per split |
| OOB Score | Free validation from the ~37% unused samples |
| Feature Importance | MDI (fast) or permutation (more reliable) |
| # Trees | 50-200 typically sufficient; diminishing returns |
| vs Boosting | RF: parallel, robust; Boosting: sequential, often higher accuracy |
References
- Breiman, L. (2001). “Random Forests”
- sklearn Random Forest
- Hastie, T. et al. “The Elements of Statistical Learning” - Chapter 15