CW

Random Forest: Bootstrap Aggregating and Feature Randomness

Module 8: Tree-Based ModelsFree Lesson

Advertisement

Random Forest: Bootstrap Aggregating and Feature Randomness

1. Introduction

A Random Forest is an ensemble method that constructs a multitude of decision trees during training and outputs the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. It combines bootstrap aggregating (bagging) with feature randomization to produce trees with low correlation and high predictive power.

The key insight: by introducing controlled randomness into tree construction, we reduce variance without substantially increasing bias.


2. Bootstrap Aggregating (Bagging)

2.1 The Bootstrap Procedure

Bagging (Breiman, 1996) draws B bootstrap samples from the original dataset of size n. Each bootstrap sample is drawn with replacement, meaning approximately 63.2% of unique original samples appear in any given bootstrap sample. The remaining ~36.8% are called out-of-bag (OOB) observations.

2.2 Bootstrap Sampling Visualization

Bootstrap Sampling (with replacement)Original Dataset (n=8)x₁x₂x₃x₄x₅x₆x₇x₈Bootstrap 1x₂x₁x₃x₁x₅x₆x₇x₅Bootstrap 2x₄x₂x₆x₄x₈x₃x₁x₇Bootstrap 3x₇x₃x₁x₃x₂x₅x₈x₆

Each bootstrap sample draws n=8 observations with replacement → duplicates appear, some originals missing

OOB Observation Probability• P(included in bootstrap) = 1 − (1 − 1/n)^n ≈ 1 − e⁻¹ ≈ 0.632• P(OOB / out-of-bag) = (1 − 1/n)^n ≈ e⁻¹ ≈ 0.368• Each observation is OOB for ~36.8% of bootstrap samples

2.3 Mathematical Foundation

For a dataset D={(xi,yi)}i=1n\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}, each bootstrap sample Db\mathcal{D}^*_b is drawn by uniformly sampling nn observations with replacement. The bagging predictor is:

f^bag(x)=1Bb=1Bf^b(x)\hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)

where f^b\hat{f}^{*b} is the model trained on bootstrap sample bb.

For regression, bagging reduces variance:

Var[f^bag]=ρσ2+1ρBσ2\text{Var}[\hat{f}_{\text{bag}}] = \rho \sigma^2 + \frac{1-\rho}{B} \sigma^2

where ρ\rho is the pairwise correlation between trees. Without bagging (single tree), variance = σ2\sigma^2. Bagging reduces the second term by factor 1/B1/B, but the irreducible ρσ2\rho \sigma^2 term limits gains.


3. Random Forest Algorithm

3.1 Two Sources of Randomness

Random Forest (Breiman, 2001) adds a second randomization step: at each node, only a random subset of mm features (out of pp total) is considered for splitting. This decorrelates trees, reducing ρ\rho and thus overall variance.

Random Forest — Parallel ArchitectureTraining Data DBootstrap samples + random m featuresat each node splitTree 1 (D₁*, m=√p)f₁(x)+1-1Tree 2 (D₂*, m=√p)f₂(x)+1+1Tree 3 (D₃*, m=√p)f₃(x)-1+1Tree B (D_B*, m=√p)f_B(x)+1-1···Majority Vote / AveragingRF(x) = majority{f₁, f₂, ..., f_B}Key Insight:Feature randomization at each split reduces tree correlation ρ → lower ensemble varianceClassification: m ≈ √p | Regression: m ≈ p/3 (default heuristics)

3.2 The Algorithm

Architecture Diagram
Algorithm: Random Forest (classification variant)
─────────────────────────────────────────────────
Input:  Training set D = {(x₁,y₁), ..., (xₙ,yₙ)}
        Number of trees B
        Features to consider at each split m
Output: Ensemble classifier |RF|

1:  for b = 1 to B do
2:      Draw bootstrap sample D*_b from D (sample n with replacement)
3:      Grow tree t_b on D*_b:
4:          at each node:
5:              randomly select m features from p total
6:              find best split among these m features
7:              split node into two child nodes
8:          stop when minimum node size or max depth reached
9:  end for
10: Output: |RF(x) = mode{ t_b(x) }_{b=1}^{B}

3.3 Theoretical Justification

Randomness reduces correlation. Consider the variance of the ensemble prediction. For BB identically distributed trees with pairwise correlation ρ\rho and individual variance σ2\sigma^2:

Var[1Bb=1BTb(x)]=ρσ2+1ρBσ2\text{Var}\left[\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right] = \rho \sigma^2 + \frac{1-\rho}{B}\sigma^2
  • As BB \to \infty, the second term vanishes: Varρσ2\text{Var} \to \rho \sigma^2
  • Feature randomization reduces ρ\rho (trees become less correlated)
  • The reduction in ρ\rho is the primary mechanism by which Random Forest outperforms bagged trees

Bias consideration: Random Forests do not reduce bias compared to individual trees (which are typically grown deep with low bias). The variance reduction comes at a slight cost in bias, but this is negligible for large trees.


4. Out-of-Bag (OOB) Evaluation

4.1 Concept

Each training observation (xi,yi)(x_i, y_i) is OOB for approximately 36.8% of the BB trees. We can use these trees to make predictions for that observation without a separate validation set.

Out-of-Bag EvaluationTraining point (xᵢ, yᵢ)Tree 1: includedTree 2: OOB ✓Tree 3: includedTree 4: OOB ✓Tree 5: includedTree 6: OOB ✓Predictions fromOOB trees only:OOB Prediction for xᵢ= mode{t₂(xᵢ), t₄(xᵢ), t₆(xᵢ)}Compare with true label yᵢ→ OOB error estimateRepeat for all n training pointsEach point gets an OOB prediction from the ~36.8% of trees that did NOT see it during trainingOOB error = fraction of points where OOB prediction ≠ true label

4.2 OOB Error Formula

For classification, the OOB error is:

ErrOOB=1ni=1n1[y^iOOByi]\text{Err}_{\text{OOB}} = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\left[\hat{y}_i^{\text{OOB}} \neq y_i\right]

where y^iOOB\hat{y}_i^{\text{OOB}} is the prediction from trees for which observation ii was OOB.

Why OOB works: Each observation is predicted by approximately 36.8% of the ensemble (those trees where it was OOB). This is equivalent to a cross-validation estimate with ≈ 0.632×n training samples per fold — remarkably close to leave-one-out CV but computed at zero additional cost.

4.3 OOB vs Cross-Validation

MethodComputational CostBiasVariance
OOBZero (built-in)Slight upward bias (~0.632 rule)Low
5-fold CV5× training costModerateModerate
10-fold CV10× training costLowerLower
Leave-one-outn× training costLowestHighest

5. Feature Importance

5.1 Mean Decrease Impurity (MDI)

For each feature jj, sum the total decrease in Gini impurity (or variance for regression) contributed by splits on that feature across all trees, then normalize:

ImportancejMDI=1Bb=1Bttreeb1[split feature at node t=j]ΔI(t)\text{Importance}_j^{\text{MDI}} = \frac{1}{B}\sum_{b=1}^{B} \sum_{t \in \text{tree}_b} \mathbb{1}[\text{split feature at node } t = j] \cdot \Delta I(t)

where ΔI(t)\Delta I(t) is the impurity decrease at node tt:

ΔI(t)=ntnI(t)ntLnI(tL)ntRnI(tR)\Delta I(t) = \frac{n_t}{n} \cdot I(t) - \frac{n_{t_L}}{n} \cdot I(t_L) - \frac{n_{t_R}}{n} \cdot I(t_R)
Feature Importance — Mean Decrease ImpurityFeature ImportanceFeatureAge0.32Income0.28Score0.18Hours0.12Zone0.06ID0.04Important featuresLess important features

MDI Bias: MDI importance is biased toward features with many unique values (e.g., high-cardinality categorical features). This is because such features offer more split points, providing more opportunities for impurity reduction.

5.2 Permutation Importance (Mean Decrease in Accuracy)

A more reliable measure. For each feature jj:

  1. Compute baseline OOB accuracy
  2. Randomly permute the values of feature jj across all OOB observations
  3. Compute OOB accuracy again
  4. Importance = decrease in accuracy
ImportancejPerm=1Bb=1B[AccbAccb(πj)]\text{Importance}_j^{\text{Perm}} = \frac{1}{B}\sum_{b=1}^{B} \left[\text{Acc}_b - \text{Acc}_b^{(\pi_j)}\right]

where Accb(πj)\text{Acc}_b^{(\pi_j)} is the OOB accuracy of tree bb when feature jj is permuted.

5.3 Permutation Importance Procedure

Architecture Diagram
For each tree t_b (b = 1, ..., B):
    1. Identify OOB observations O_b ⊂ {1, ..., n}
    2. Compute accuracy: Acc_b = (1/|O_b|) Σ_{i∈O_b} 𝟙[t_b(xᵢ) = yᵢ]
    3. For each feature j:
        a. Create permuted data: x̃ᵢⱼ = x_{π(i),j} for i ∈ O_b
           (permute column j only)
        b. Compute: Acc_b^(π_j) = (1/|O_b|) Σ_{i∈O_b} 𝟙[t_b(x̃ᵢ) = yᵢ]
        c. Contribution_b^(j) = Acc_b − Acc_b^(π_j)
    4. Importance_j = (1/B) Σ_b Contribution_b^(j)

6. Hyperparameter Tuning

6.1 Key Hyperparameters

HyperparameterDefault (sklearn)RangeEffect
n_estimators (B)100[10, 1000+]More trees → lower variance (diminishing returns)
max_features (m)√p (classification), p/3 (regression)[1, p]Fewer features → more decorrelation, less bias
max_depthNone (unlimited)[1, None]Limiting depth → reduces overfitting
min_samples_split2[2, 20+]Higher → simpler trees
min_samples_leaf1[1, 20+]Higher → smoother boundaries
max_leaf_nodesNone[2, None]Limiting leaves → regularization

6.2 The m (max_features) Trade-off

ρ(m)=Corr[f^i(x),f^j(x)]\rho(m) = \text{Corr}[\hat{f}_i(x), \hat{f}_j(x)]
  • m=pm = p: All features considered → equivalent to bagging → ρ\rho is high
  • m=1m = 1: Random feature at each split → maximum decorrelation → high bias
  • mpm \approx \sqrt{p}: Sweet spot for classification (Breiman's recommendation)
Effect of m (max_features) on Correlation and ErrorValuem (max_features) →1√pp/3p/2pCorrelation ρ(m)Error (m)Sweet Spot(classification: √p, regression: p/3)

6.3 Tuning Strategy

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'n_estimators':      [100, 200, 500, 1000],
    'max_features':      ['sqrt', 'log2', 0.3, 0.5, 0.7],
    'max_depth':         [None, 10, 20, 30, 50],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf':  [1, 2, 4],
    'bootstrap':         [True, False],
}

rf = RandomForestClassifier(random_state=42, n_jobs=-1)
search = RandomizedSearchCV(
    rf, param_distributions,
    n_iter=100,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42
)
search.fit(X_train, y_train)
print(f"Best params: {search.best_params_}")
print(f"Best CV accuracy: {search.best_score_:.4f}")

7. Implementation in Python

7.1 Full Working Example

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
    RocCurveDisplay
)
import matplotlib.pyplot as plt

# --- Generate synthetic data ---
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=12,
    n_redundant=4,
    n_classes=2,
    random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# --- Fit Random Forest ---
rf = RandomForestClassifier(
    n_estimators=500,
    max_features='sqrt',
    max_depth=None,
    min_samples_leaf=2,
    oob_score=True,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)

# --- Evaluate ---
print(f"OOB Score: {rf.oob_score_:.4f}")
print(f"Test Accuracy: {rf.score(X_test, y_test):.4f}")
print(f"Test ROC-AUC: {roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, rf.predict(X_test)))

7.2 Feature Importance Analysis

# --- Permutation Importance (more reliable) ---
from sklearn.inspection import permutation_importance

result = permutation_importance(
    rf, X_test, y_test,
    n_repeats=30,
    random_state=42,
    n_jobs=-1
)

feature_names = [f"Feature {i}" for i in range(X.shape[1])]
importances = pd.DataFrame({
    'feature': feature_names,
    'mdi_importance': rf.feature_importances_,
    'perm_importance': result.importances_mean,
    'perm_std': result.importances_std
}).sort_values('perm_importance', ascending=False)

print(importances.head(10))

7.3 OOB Error Convergence

# --- OOB error vs number of trees ---
errors = []
for n_trees in range(1, 501, 10):
    rf_temp = RandomForestClassifier(
        n_estimators=n_trees,
        oob_score=True,
        random_state=42
    )
    rf_temp.fit(X_train, y_train)
    errors.append({
        'n_trees': n_trees,
        'oob_error': 1 - rf_temp.oob_score_
    })

errors_df = pd.DataFrame(errors)

plt.figure(figsize=(8, 4))
plt.plot(errors_df['n_trees'], errors_df['oob_error'], color='#3b82f6', linewidth=2)
plt.xlabel('Number of Trees (B)')
plt.ylabel('OOB Error')
plt.title('OOB Error Convergence')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('oob_convergence.png', dpi=150)
plt.show()

7.4 Partial Dependence Plots

from sklearn.inspection import PartialDependenceDisplay

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
PartialDependenceDisplay.from_estimator(
    rf, X_train,
    features=[0, 1, 2],
    feature_names=feature_names,
    ax=axes,
    kind='both',          # individual + average
    subsample=50,
    random_state=42
)
plt.tight_layout()
plt.savefig('partial_dependence.png', dpi=150)
plt.show()

8. Random Forest vs. Single Decision Tree

8.1 Comparison

PropertySingle Decision TreeRandom Forest
VarianceHigh (unstable)Low (averaged)
BiasLow (deep trees)Low (deep trees per tree)
InterpretabilityHigh (single path)Low (black box)
Overfitting riskHighLow
Training speedFastModerate (B× slower)
Prediction speedO(depth)O(B × depth)
MemoryO(nodes)O(B × nodes)
Handles missing valuesNo (native)Yes (surrogate, sklearn)
Feature importanceSingle tree pathAveraged across trees

8.2 When to Use Random Forest

Good choice:

  • Tabular data with mixed feature types
  • When interpretability is secondary to accuracy
  • When you need robust estimates with minimal tuning
  • As a strong baseline before trying more complex models

Less suitable:

  • Very high-dimensional sparse data (consider linear models or gradient boosting)
  • When prediction speed is critical (consider distillation or pruned trees)
  • When interpretability is paramount (consider single trees or linear models)

9. Extensions and Variants

9.1 Extra-Trees (Extremely Randomized Trees)

Similar to Random Forest but with additional randomization: split thresholds are chosen randomly rather than optimizing over the feature values.

Split at xjtwhere tUniform(minj,maxj)\text{Split at } x_j \leq t \quad \text{where } t \sim \text{Uniform}(\min_j, \max_j)
  • Further reduces variance at cost of slightly higher bias
  • Faster training (no sorting/split optimization per feature)

9.2 Balanced Random Forest

For imbalanced classification, each bootstrap sample is drawn to balance class frequencies:

Class distribution in Db:nkknk=1Kk\text{Class distribution in } \mathcal{D}^*_b: \quad \frac{n_k^*}{\sum_k n_k^*} = \frac{1}{K} \quad \forall k

9.3 Quantile Regression Forests

Extends Random Forest to estimate conditional quantiles q^α(x)\hat{q}_\alpha(x) by keeping all training response values at each leaf and computing weighted quantiles.


10. Summary

Random Forests achieve excellent predictive performance through two simple ideas:

  1. Bootstrap aggregating: Train each tree on a different random sample → decorrelates models → reduces variance
  2. Feature randomization: At each split, consider only a random subset of features → further decorrelates trees → amplifies variance reduction

Key takeaways:

  • OOB evaluation provides a free, unbiased estimate of generalization error
  • Feature importance measures identify predictive variables
  • Random Forests are robust to overfitting with minimal hyperparameter tuning
  • They serve as a strong baseline for tabular data tasks

Rule of thumb: Always try Random Forest first as a baseline. It requires minimal tuning, handles missing values and mixed feature types, and provides built-in feature importance and OOB error estimates. Only move to gradient boosting if Random Forest performance is insufficient.


References

  • Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
  • Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32.
  • Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. Chapter 15.
  • Scikit-learn documentation: Random Forest

Advertisement

Need Expert Data Science Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement