ML-15: Bagging and Random Forest

Summary
Master ensemble learning: understand why combining trees reduces variance, learn Bootstrap aggregating (Bagging), build Random Forests with feature randomization, and evaluate models using Out-of-Bag error.

Learning Objectives

  • Understand ensemble learning motivation
  • Explain Bootstrap aggregating (Bagging)
  • Master Random Forest algorithm
  • Use Out-of-Bag (OOB) evaluation

Theory

Why Ensemble Learning?

In the previous post we saw that decision trees are prone to overfitting. A tree with no depth limit memorizes the training data perfectly but generalizes poorly.

Key insight: Instead of building one complex model, combine many simple models!

graph TD
    subgraph "Ensemble"
        C["🌳🌳🌳 Many Trees"] --> D["Vote/Average"] --> E["Low Variance"]
    end
    subgraph "Single Model"
        A["🌳 One Tree"] --> B["High Variance"]
    end

| Approach | Bias | Variance | Result |
|---|---|---|---|
| Single deep tree | Low | ⚠️ High | Overfits |
| Single shallow tree | ⚠️ High | Low | Underfits |
| Ensemble of trees | Low | ✅ Low | Best of both! |

The Wisdom of Crowds: Just like averaging many people’s guesses gives a better estimate than any individual, averaging many models often outperforms any single model.

Bootstrap Aggregating (Bagging)

Bagging = Bootstrap Aggregating

Step 1: Create Bootstrap Samples

Bootstrap sampling: Draw $n$ samples with replacement from the original dataset of size $n$.

| Original Data | Bootstrap Sample 1 | Bootstrap Sample 2 |
|---|---|---|
| [A, B, C, D, E] | [A, A, C, D, E] | [B, C, C, D, D] |

Key property: on average, each bootstrap sample contains about 63.2% of the distinct original samples.

$$P(\text{sample not selected}) = \left(1 - \frac{1}{n}\right)^n \xrightarrow{n \to \infty} \frac{1}{e} \approx 0.368$$

So ~36.8% of the samples are left out of each bootstrap sample → these become the Out-of-Bag (OOB) samples!
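
This is easy to check empirically. A minimal simulation sketch (my own, not from the original post) draws bootstrap indices with numpy and measures the unique fraction:

🐍 Python

import numpy as np

rng = np.random.default_rng(42)
n = 10_000
fractions = []

# Draw n indices with replacement and count how many distinct
# original samples actually appear in each bootstrap sample
for _ in range(100):
    idx = rng.integers(0, n, size=n)
    fractions.append(len(np.unique(idx)) / n)

print(f"Mean unique fraction: {np.mean(fractions):.3f}")  # about 0.632
print(f"Theory (1 - 1/e):     {1 - 1/np.e:.3f}")          # about 0.632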

Step 2: Train Models on Each Sample

graph LR
    D[(Original Data<br/>n samples)] --> S1["Bootstrap 1<br/>~63% unique"]
    D --> S2["Bootstrap 2<br/>~63% unique"]
    D --> S3["Bootstrap 3<br/>~63% unique"]
    S1 --> M1["🌳 Tree 1"]
    S2 --> M2["🌳 Tree 2"]
    S3 --> M3["🌳 Tree 3"]
    M1 --> V{{"🗳️ Vote /<br/>Average"}}
    M2 --> V
    M3 --> V
    V --> F["Final Prediction"]
    style V fill:#fff9c4
    style F fill:#c8e6c9

Step 3: Aggregate Predictions

| Task | Aggregation | Formula |
|---|---|---|
| Classification | Majority vote | $\hat{y} = \text{mode}(h_1(x), \dots, h_T(x))$ |
| Regression | Average | $\hat{y} = \frac{1}{T}\sum_{t=1}^{T} h_t(x)$ |
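
As a concrete illustration (a tiny sketch with made-up predictions, not the post's code), here is how both aggregation rules look in numpy:

🐍 Python

import numpy as np

# Hypothetical predictions from T = 5 trees for a single sample
class_preds = np.array([1, 0, 1, 1, 0])          # classification: predicted labels
reg_preds = np.array([2.3, 2.9, 3.1, 2.7, 2.5])  # regression: predicted values

# Classification: majority vote = most frequent label
y_hat_cls = np.bincount(class_preds).argmax()

# Regression: simple average of the trees' outputs
y_hat_reg = reg_preds.mean()

print(f"Vote:    {y_hat_cls}")   # 1
print(f"Average: {y_hat_reg}")   # 2.7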

Why Bagging Reduces Variance

Key mathematical insight: Averaging uncorrelated errors reduces variance.

If each tree has variance $\sigma^2$ and trees are independent:

$$\text{Var}\left(\frac{1}{T}\sum_{t=1}^{T} h_t\right) = \frac{\sigma^2}{T}$$

More trees → lower variance! 🎉

Catch: real trees aren't fully independent, since they're trained on similar data. This is why Random Forest adds feature randomization.
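
A quick simulation (my own sketch, assuming each tree's error is independent Gaussian noise with variance $\sigma^2 = 1$) shows the $\sigma^2 / T$ scaling:

🐍 Python

import numpy as np

rng = np.random.default_rng(0)
T, trials = 50, 100_000

# Pretend each tree's prediction is the true value (0) plus unit-variance noise
single = rng.normal(0.0, 1.0, size=trials)
ensemble = rng.normal(0.0, 1.0, size=(trials, T)).mean(axis=1)

print(f"Var(single tree):         {single.var():.4f}")    # about 1.0
print(f"Var(average of {T} trees): {ensemble.var():.4f}")  # about 1/50 = 0.02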

Random Forest = Bagging + Feature Randomization

Random Forest adds one crucial modification: at each split, only a random subset of features is considered.

| At each split | Bagging | Random Forest |
|---|---|---|
| Features considered | All $d$ features | Random subset |
| Classification | All $d$ | $\sqrt{d}$ (default) |
| Regression | All $d$ | $d/3$ (default) |

Why Random Features?

Without feature randomization, all trees tend to split on the same “best” features → trees become correlated → less variance reduction.

graph TD
    A["Bagging Only"] --> B["Trees split on<br/>same strong features"]
    B --> C["Correlated trees"]
    C --> D["Limited variance reduction"]
    E["Random Forest"] --> F["Trees split on<br/>different features"]
    F --> G["Uncorrelated trees"]
    G --> H["Maximum variance reduction"]
    style D fill:#ffcdd2
    style H fill:#c8e6c9
Intuition: Each tree sees a different “view” of the data. Some trees might miss the strongest feature at some splits, forcing them to find alternative patterns.
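
To see the effect in code, the sketch below (my own example on synthetic data; the exact numbers will vary by dataset) toggles sklearn's max_features between all features, which is essentially bagging of full trees, and the 'sqrt' subset used by Random Forest:

🐍 Python

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# max_features=None   -> every split sees all d features (bagging of trees)
# max_features='sqrt' -> every split sees a random sqrt(d) subset (Random Forest)
for mf in [None, "sqrt"]:
    model = RandomForestClassifier(n_estimators=100, max_features=mf,
                                   random_state=42, n_jobs=-1)
    model.fit(X_tr, y_tr)
    print(f"max_features={str(mf):<5} test accuracy = {model.score(X_te, y_te):.2%}")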

Out-of-Bag (OOB) Evaluation

Since ~36.8% of samples are left out of each bootstrap, we get a free validation set!

How OOB Works

  1. For each sample $x_i$, find trees that didn’t use it for training
  2. Get predictions from only those trees
  3. Aggregate these predictions
  4. Compare to true label → OOB error
Sample 1: Used in trees [1,2,5,7] → OOB prediction from trees [3,4,6,8,9,10]
Sample 2: Used in trees [2,3,4,8] → OOB prediction from trees [1,5,6,7,9,10]
...

| Advantage | Description |
|---|---|
| No separate validation set needed | Uses the ~37% left-out samples |
| Unbiased estimate | Similar to cross-validation |
| Computed automatically | Set oob_score=True in sklearn |

OOB error is approximately equal to leave-one-out cross-validation error, but much faster to compute!
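
To demystify this, here is a from-scratch sketch (my own illustration, not the post's code): bag decision trees manually and let each sample be scored only by the trees whose bootstrap missed it.

🐍 Python

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
n, T = len(X), 50
rng = np.random.default_rng(0)

votes = np.zeros((n, 2))  # OOB class-vote counts per sample (binary problem)
for _ in range(T):
    idx = rng.integers(0, n, size=n)        # bootstrap indices (with replacement)
    oob = np.setdiff1d(np.arange(n), idx)   # samples this tree never saw
    tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    votes[oob, tree.predict(X[oob])] += 1   # only OOB trees vote for a sample

has_oob = votes.sum(axis=1) > 0             # samples that were out-of-bag at least once
oob_pred = votes.argmax(axis=1)
print(f"OOB accuracy: {(oob_pred[has_oob] == y[has_oob]).mean():.2%}")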

Code Practice

The following examples compare single trees, bagging, and Random Forest.

Bagging vs Single Tree

๐Ÿ Python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate dataset
X, y = make_classification(n_samples=1000, n_features=20, 
                           n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("๐Ÿ“Š Single Tree vs Bagging Comparison")
print("=" * 50)

# Single tree (high variance)
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
print(f"Single Tree:       Train={tree.score(X_train, y_train):.2%}, "
      f"Test={tree.score(X_test, y_test):.2%}")

# Bagging (reduced variance)
bag = BaggingClassifier(n_estimators=50, random_state=42)
bag.fit(X_train, y_train)
print(f"Bagging (50 trees): Train={bag.score(X_train, y_train):.2%}, "
      f"Test={bag.score(X_test, y_test):.2%}")

Output:

📊 Single Tree vs Bagging Comparison
==================================================
Single Tree:       Train=100.00%, Test=83.50%
Bagging (50 trees): Train=100.00%, Test=91.00%
Single tree achieves 100% training accuracy (overfitting!) but only 83.5% test. Bagging improves test accuracy to 91% by averaging out the variance.

Random Forest with OOB Score

๐Ÿ Python
from sklearn.ensemble import RandomForestClassifier

# Random Forest with OOB evaluation
rf = RandomForestClassifier(
    n_estimators=100, 
    oob_score=True,  # Enable OOB
    random_state=42,
    n_jobs=-1  # Use all CPUs
)
rf.fit(X_train, y_train)

print("๐Ÿ“Š Random Forest Results")
print("=" * 40)
print(f"Train accuracy: {rf.score(X_train, y_train):.2%}")
print(f"Test accuracy:  {rf.score(X_test, y_test):.2%}")
print(f"OOB score:      {rf.oob_score_:.2%}")

print(f"\n๐Ÿ“ˆ Model Statistics:")
print(f"   Number of trees: {rf.n_estimators}")
print(f"   Features per split: {rf.max_features}")
print(f"   Max depth: {rf.max_depth or 'None (unlimited)'}")

Output:

📊 Random Forest Results
========================================
Train accuracy: 100.00%
Test accuracy:  92.00%
OOB score:      92.25%

📈 Model Statistics:
   Number of trees: 100
   Features per split: sqrt
   Max depth: None (unlimited)
OOB score (92.25%) closely matches test accuracy (92.00%); it's a reliable estimate without needing a separate validation set!

Feature Importance

๐Ÿ Python
from sklearn.datasets import load_iris
import numpy as np

# Train on Iris
X_iris, y_iris = load_iris(return_X_y=True)
feature_names = load_iris().feature_names

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_iris, y_iris)

print("๐Ÿ“Š Feature Importance (Iris Dataset)")
print("=" * 50)
for name, importance in sorted(zip(feature_names, rf.feature_importances_), 
                                key=lambda x: -x[1]):
    bar = "โ–ˆ" * int(importance * 40)
    print(f"{name:<18} {importance:.3f} {bar}")

Output:

📊 Feature Importance (Iris Dataset)
==================================================
petal length (cm)  0.436 █████████████████
petal width (cm)   0.436 █████████████████
sepal length (cm)  0.106 ████
sepal width (cm)   0.022 
Petal measurements dominate! This matches what we learned in the Decision Trees post: petal features are the most discriminative for Iris classification.

Effect of Number of Trees

๐Ÿ Python
import warnings

print("๐Ÿ“Š Effect of Number of Trees")
print("=" * 45)
print(f"{'Trees':<8} {'Train':>10} {'Test':>10} {'OOB':>10}")
print("-" * 45)

for n_trees in [1, 5, 10, 25, 50, 100, 200]:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True, 
                                     random_state=42)
        rf.fit(X_train, y_train)
        oob = f"{rf.oob_score_:>10.2%}" if hasattr(rf, 'oob_score_') else "N/A"
    
    print(f"{n_trees:<8} {rf.score(X_train, y_train):>10.2%} "
          f"{rf.score(X_test, y_test):>10.2%} {oob}")

Output:

📊 Effect of Number of Trees
=============================================
Trees         Train       Test        OOB
---------------------------------------------
1            91.25%     78.00%     60.00%
5            98.50%     84.00%     77.00%
10           99.38%     87.00%     82.38%
25          100.00%     93.00%     87.88%
50          100.00%     91.00%     91.50%
100         100.00%     92.00%     92.25%
200         100.00%     91.00%     92.75%
Accuracy stabilizes around 25-100 trees (~91-93%). More trees improve OOB reliability but add training time with diminishing returns.

Hyperparameter Tuning with GridSearchCV

๐Ÿ Python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [5, 10, None],
    'min_samples_split': [2, 5],
    'max_features': ['sqrt', 'log2']
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42), 
    param_grid, 
    cv=5, 
    n_jobs=-1,
    scoring='accuracy'
)
grid.fit(X_train, y_train)

print("๐Ÿ“Š Grid Search Results")
print("=" * 50)
print(f"Best CV score: {grid.best_score_:.2%}")
print(f"\n๐Ÿ† Best Parameters:")
for param, value in grid.best_params_.items():
    print(f"   {param}: {value}")

print(f"\nTest accuracy with best model: {grid.score(X_test, y_test):.2%}")

Output:

📊 Grid Search Results
==================================================
Best CV score: 92.50%

๐Ÿ† Best Parameters:
   max_depth: None
   max_features: sqrt
   min_samples_split: 2
   n_estimators: 100

Test accuracy with best model: 92.00%

Deep Dive

Frequently Asked Questions

Q1: How many trees should I use?

| Trees | Recommendation |
|---|---|
| 10-50 | Quick experiments, development |
| 100-200 | Production (good tradeoff) |
| 500+ | Diminishing returns; only if compute is cheap |

Start with 100 trees. Increase only if OOB score keeps improving.

Q2: Random Forest vs XGBoost/LightGBM

| Aspect | Random Forest | XGBoost/LightGBM |
|---|---|---|
| Building | Parallel (independent trees) | Sequential (each tree corrects the previous) |
| Speed | ⚡ Fast training | 🔶 Slower training, fast prediction |
| Tuning | ✅ Few hyperparameters | ⚠️ Many hyperparameters |
| Accuracy | 🔶 Good | ✅ Often better |
| Overfitting | ✅ Resistant | ⚠️ Needs regularization |
| Best for | Quick baseline, robust models | Competitions, max accuracy |

Q3: When does Random Forest struggle?

| Scenario | Issue | Alternative |
|---|---|---|
| Very high dimensions | Slow; feature importance unreliable | Use feature selection first |
| Sparse data | Doesn't handle it well | XGBoost, linear models |
| Extrapolation | Can't predict outside the training range (see the sketch below) | Linear models (tree ensembles in general can't extrapolate) |
| Interpretability needed | Hard to explain | Single decision tree, SHAP values |
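
The extrapolation row deserves a concrete example. In this small sketch (my own, not from the post), a forest fit on a linear trend flattens out beyond the training range:

🐍 Python

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Train on x in [0, 10] with a simple linear target y = 3x + 1
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = 3 * X.ravel() + 1

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# The true value at x = 20 is 61, but the forest can only average leaf values
# it has already seen, so the prediction saturates near the training-range edge
print(rf.predict([[20.0]]))   # roughly 31, not 61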

Q4: What if feature importance seems wrong?

| Issue | Cause | Solution |
|---|---|---|
| High-cardinality features dominate | More split opportunities | Use permutation importance |
| Correlated features share importance | Importance spreads across them | Inspect them together or remove one |
| Noise features ranked high | Overfitting | Try permutation importance on the test set |

# Permutation importance (more reliable)
from sklearn.inspection import permutation_importance

perm_imp = permutation_importance(rf, X_test, y_test, n_repeats=10)

Practical Tips

Random Forest tuning checklist (a configured sketch follows the list):

  1. Start with defaults (n_estimators=100)
  2. Use oob_score=True for quick validation
  3. If overfitting: reduce max_depth, increase min_samples_leaf
  4. If underfitting: increase max_depth or raise max_features (more features per split)
  5. Always use n_jobs=-1 for parallelization
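
As referenced above, here is one way the checklist translates into a starting configuration (a sketch reusing X_train/y_train from the Code Practice section; the values are defaults to tune from, not recommendations for any specific dataset):

🐍 Python

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,      # 1. start with the default forest size
    oob_score=True,        # 2. free validation estimate
    max_depth=None,        # 3. lower this if the forest overfits
    min_samples_leaf=1,    # 3. raise this if the forest overfits
    max_features="sqrt",   # 4. raise toward d if the forest underfits
    n_jobs=-1,             # 5. parallelize across all CPU cores
    random_state=42,
)
rf.fit(X_train, y_train)
print(f"OOB score: {rf.oob_score_:.2%}")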

Key Hyperparameters

| Parameter | Default | Effect |
|---|---|---|
| n_estimators | 100 | More trees → lower variance, slower training |
| max_depth | None | Limits tree complexity |
| max_features | 'sqrt' | Features considered per split (decorrelation) |
| min_samples_split | 2 | Min samples required to split a node |
| min_samples_leaf | 1 | Min samples required in a leaf |

Common Pitfalls

| Pitfall | Symptom | Solution |
|---|---|---|
| Too few trees | High variance between runs | Use at least 100 trees |
| Not using OOB | Wasting data on a validation split | Set oob_score=True |
| Ignoring class imbalance | Poor minority-class predictions | Use class_weight='balanced' (see the sketch below) |
| Relying on MDI importance | Misleading with correlated/high-cardinality features | Use permutation importance |
| Max depth too high | Overfitting on noisy data | Try max_depth=10-20 |
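
For the class-imbalance pitfall, here is a minimal sketch of class_weight='balanced' on synthetic imbalanced data (my own example; the classification report lets you check minority-class recall):

🐍 Python

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Roughly 95% / 5% imbalanced binary problem
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight='balanced' reweights classes inversely to their frequency
rf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                            random_state=0, n_jobs=-1).fit(X_tr, y_tr)
print(classification_report(y_te, rf.predict(X_te), digits=3))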

Summary

| Concept | Key Points |
|---|---|
| Bagging | Bootstrap + aggregate → reduces variance |
| Random Forest | Bagging + random feature subsets per split |
| OOB Score | Free validation from the ~37% unused samples |
| Feature Importance | MDI (fast) or permutation (more reliable) |
| # Trees | 50-200 typically sufficient; diminishing returns |
| vs Boosting | RF: parallel, robust; boosting: sequential, often higher accuracy |

References

  • Breiman, L. (2001). "Random Forests." Machine Learning, 45(1), 5-32.
  • scikit-learn documentation: Random Forest
  • Hastie, T., Tibshirani, R., & Friedman, J. The Elements of Statistical Learning, Chapter 15.