# Random Forest: Bootstrap Aggregating and Feature Randomness

## 1. Introduction

A Random Forest is an ensemble method that constructs many decision trees during training and outputs either the mode of the trees' predicted classes (classification) or the mean of their predictions (regression). It combines bootstrap aggregating (bagging) with feature randomization to produce trees with low correlation and high predictive power.

The key insight: by introducing controlled randomness into tree construction, we reduce variance without substantially increasing bias.

## 2. Bootstrap Aggregating (Bagging)

### 2.1 The Bootstrap Procedure

Bagging (Breiman, 1996) draws B bootstrap samples from the original dataset of size n. Each bootstrap sample consists of n draws with replacement, so on average about 63.2% of the distinct original observations appear in any given bootstrap sample. The remaining ~36.8% are called out-of-bag (OOB) observations.

### 2.2 Bootstrap Sampling Visualization

*Figure: Bootstrap sampling. Panels: Original Dataset (n=8), Bootstrap 1, Bootstrap 2, Bootstrap 3. Each bootstrap sample draws n=8 observations with replacement, so duplicates appear and some originals are missing.*

**OOB Observation Probability**

- P(included in bootstrap) = 1 − (1 − 1/n)^n ≈ 1 − e⁻¹ ≈ 0.632
- P(OOB, out-of-bag) = (1 − 1/n)^n ≈ e⁻¹ ≈ 0.368
- Each observation is OOB for ~36.8% of bootstrap samples

**How this diagram works:** This diagram visualizes the bootstrap sampling procedure that powers random forests. Starting with an original dataset of 8 observations, each bootstrap sample is drawn with replacement — meaning the same point can appear multiple times (shown as duplicates like x₁ appearing twice in Bootstrap 1) while some original points are left out entirely. The green, yellow, and red boxes show three different bootstrap samples, each containing 8 drawn observations but with different compositions. This sampling strategy creates diversity among trees: approximately 63.2% of unique original samples appear in each bootstrap, while the remaining ~36.8% are "out-of-bag" (OOB) observations that can be used for free validation.
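The 63.2% / 36.8% split can be checked numerically. The following is a minimal sketch, not part of the original material: the helper names `bootstrap_indices` and `mean_unique_fraction`, the sample size, and the seed are illustrative assumptions.

```python
import numpy as np

def bootstrap_indices(n, rng):
    """Draw n row indices uniformly at random, with replacement."""
    return rng.integers(0, n, size=n)

def mean_unique_fraction(n, n_repeats, rng):
    """Average share of distinct original observations across bootstrap samples."""
    fractions = [
        np.unique(bootstrap_indices(n, rng)).size / n
        for _ in range(n_repeats)
    ]
    return float(np.mean(fractions))

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
n = 1000
empirical = mean_unique_fraction(n, n_repeats=200, rng=rng)
theoretical = 1 - (1 - 1 / n) ** n

print(f"empirical in-bag fraction:   {empirical:.3f}")
print(f"theoretical in-bag fraction: {theoretical:.3f}")
print(f"theoretical OOB fraction:    {1 - theoretical:.3f}")
```

For large n the empirical fraction should sit close to 1 − e⁻¹; for very small n (such as the n=8 illustration) the exact value (1 − 1/n)^n differs slightly from the limit.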

### 2.3 Mathematical Foundation

For a dataset <MathBlock tex={`\\mathcal{D} = \\{(x_i, y_i)\\}_{i=1}^{n}`} />, each bootstrap sample <MathBlock tex={`\\mathcal{D}^*_b`} /> is drawn by uniformly sampling <MathBlock tex={`n`} /> observations with replacement. The bagging predictor is:



<MathBlock tex={`\\hat{f}_{\\text{bag}}(x) = \\frac{1}{B} \\sum_{b=1}^{B} \\hat{f}^{*b}(x)`} display={true} />



where <MathBlock tex={`\\hat{f}^{*b}`} /> is the model trained on bootstrap sample <MathBlock tex={`b`} />.

For regression, bagging reduces variance:



<MathBlock tex={`\\text{Var}[\\hat{f}_{\\text{bag}}] = \\rho \\sigma^2 + \\frac{1-\\rho}{B} \\sigma^2`} display={true} />



where <MathBlock tex={`\\rho`} /> is the pairwise correlation between trees. Without bagging (single tree), variance = <MathBlock tex={`\\sigma^2`} />. Bagging reduces the second term by factor <MathBlock tex={`1/B`} />, but the **irreducible** <MathBlock tex={`\\rho \\sigma^2`} /> term limits gains.
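To see how the two terms of the variance formula behave, the sketch below evaluates it for a few values of B and compares it with a small Monte Carlo simulation. The simulation uses equicorrelated Gaussian variables as stand-ins for tree predictions, which is an assumption made purely for illustration; the function names are hypothetical.

```python
import numpy as np

def ensemble_variance(rho, sigma2, B):
    """Closed-form variance of the mean of B equicorrelated predictors."""
    return rho * sigma2 + (1 - rho) / B * sigma2

def simulated_variance(rho, sigma2, B, n_draws, rng):
    """Monte Carlo check with Gaussian stand-ins for tree predictions."""
    shared = rng.normal(size=(n_draws, 1))     # component common to all trees
    private = rng.normal(size=(n_draws, B))    # component unique to each tree
    trees = np.sqrt(sigma2) * (np.sqrt(rho) * shared + np.sqrt(1 - rho) * private)
    return float(trees.mean(axis=1).var())

rng = np.random.default_rng(0)
rho, sigma2 = 0.3, 1.0                         # illustrative values
results = {}
for B in (1, 10, 100):
    results[B] = (
        ensemble_variance(rho, sigma2, B),
        simulated_variance(rho, sigma2, B, n_draws=20000, rng=rng),
    )
    print(f"B={B:>3}  formula={results[B][0]:.3f}  simulated={results[B][1]:.3f}")
```

As B grows, the variance approaches the floor set by the first term, which is why lowering the correlation matters more than adding trees once B is large.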

---

## 3. Random Forest Algorithm

### 3.1 Two Sources of Randomness

Random Forest (Breiman, 2001) adds a second randomization step: at each node, only a random subset of <MathBlock tex={`m`} /> features (out of <MathBlock tex={`p`} /> total) is considered for splitting. This decorrelates trees, reducing <MathBlock tex={`\\rho`} /> and thus overall variance.

*Figure: Random Forest, parallel architecture. Each of the B trees is grown on its own bootstrap sample of the training data D, with a random subset of m features considered at each node split. The tree outputs f₁(x), f₂(x), ..., f_B(x) are combined by majority vote (classification) or averaging (regression).*

**Key insight:** Feature randomization at each split reduces the tree correlation ρ, which lowers the ensemble variance. Common default heuristics are m ≈ √p for classification and m ≈ p/3 for regression.

### 3.2 The Algorithm

```
Algorithm: Random Forest (classification variant)
─────────────────────────────────────────────────
Input:  Training set D = {(x₁,y₁), ..., (xₙ,yₙ)}
        Number of trees B
        Features to consider at each split m
Output: Ensemble classifier RF

for b = 1 to B do
    Draw bootstrap sample D*_b from D (sample n with replacement)
    Grow tree t_b on D*_b:
        at each node:
            randomly select m features from p total
            find best split among these m features
            split node into two child nodes
        stop when minimum node size or max depth reached
end for
Output: RF(x) = mode{ t_b(x) }_{b=1}^{B}
```
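The pseudocode above can be sketched in Python by delegating the per-node feature sampling to scikit-learn's `DecisionTreeClassifier` (its `max_features` argument). This is a minimal illustration rather than a reference implementation: `fit_forest` and `predict_forest` are hypothetical helper names, the synthetic data is arbitrary, and integer class labels 0..K-1 are assumed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=50, max_features="sqrt", seed=0):
    """Grow n_trees trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)              # bootstrap sample D*_b
        tree = DecisionTreeClassifier(
            max_features=max_features,                # m features tried per split
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    """Majority vote over trees (labels assumed to be integers 0..K-1)."""
    votes = np.stack([tree.predict(X) for tree in trees]).astype(int)  # (B, n_samples)
    return np.array([np.bincount(column).argmax() for column in votes.T])

# Illustrative synthetic data; the first 300 rows train, the rest test.
X, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=0)
X_train, y_train, X_test, y_test = X[:300], y[:300], X[300:], y[300:]

trees = fit_forest(X_train, y_train)
y_pred = predict_forest(trees, X_test)
accuracy = float(np.mean(y_pred == y_test))
print(f"Test accuracy of the hand-rolled forest: {accuracy:.3f}")
```

In practice `RandomForestClassifier` does the same job with parallelism and probability averaging; the sketch only makes the two sources of randomness explicit.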

### 3.3 Theoretical Justification

**Randomness reduces correlation.** Consider the variance of the ensemble prediction. For <MathBlock tex={`B`} /> identically distributed trees with pairwise correlation <MathBlock tex={`\\rho`} /> and individual variance <MathBlock tex={`\\sigma^2`} />:



<MathBlock tex={`\\text{Var}\\left[\\frac{1}{B}\\sum_{b=1}^{B} T_b(x)\\right] = \\rho \\sigma^2 + \\frac{1-\\rho}{B}\\sigma^2`} display={true} />



- As <MathBlock tex={`B \\to \\infty`} />, the second term vanishes: <MathBlock tex={`\\text{Var} \\to \\rho \\sigma^2`} />
- Feature randomization reduces <MathBlock tex={`\\rho`} /> (trees become less correlated)
- The **reduction in <MathBlock tex={`\\rho`} /> is the primary mechanism** by which Random Forest outperforms bagged trees

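**Bias consideration:** Averaging does not reduce bias: the bias of the forest is essentially that of a single randomized tree, which is why the trees are typically grown deep (low bias). Restricting each split to m features can raise the bias slightly, but for deeply grown trees this cost is usually small compared with the variance reduction.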

---

## 4. Out-of-Bag (OOB) Evaluation

### 4.1 Concept

Each training observation <MathBlock tex={`(x_i, y_i)`} /> is OOB for approximately 36.8% of the <MathBlock tex={`B`} /> trees. We can use these trees to make predictions for that observation without a separate validation set.

*Figure: Out-of-Bag Evaluation. A training point (xᵢ, yᵢ) is included in the bootstrap samples of Trees 1, 3, and 5 and is OOB for Trees 2, 4, and 6. Its OOB prediction is the mode of the OOB trees' predictions t₂(xᵢ), t₄(xᵢ), t₆(xᵢ), which is then compared with the true label yᵢ.*

Repeating this for all n training points gives each point an OOB prediction from the ~36.8% of trees that did not see it during training. The OOB error is the fraction of points whose OOB prediction differs from the true label.

### 4.2 OOB Error Formula

For classification, the OOB error is:



<MathBlock tex={`\\text{Err}_{\\text{OOB}} = \\frac{1}{n}\\sum_{i=1}^{n} \\mathbb{1}\\left[\\hat{y}_i^{\\text{OOB}} \\neq y_i\\right]`} display={true} />



where <MathBlock tex={`\\hat{y}_i^{\\text{OOB}}`} /> is the prediction from trees for which observation <MathBlock tex={`i`} /> was OOB.

<Callout type="info">
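**Why OOB works:** Each observation is predicted only by the roughly 36.8% of trees that never saw it during training, so its OOB prediction is a genuine out-of-sample prediction. The procedure resembles cross-validation in which every tree is trained on ≈ 0.632×n distinct observations. Hastie et al. (2009) note that the resulting error estimate is almost identical to that of N-fold (leave-one-out) cross-validation, and it comes at essentially no additional training cost.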
</Callout>
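The OOB error formula can be computed by hand when the bootstrap indices of each tree are known. The sketch below assumes scikit-learn's `DecisionTreeClassifier` as the base learner, integer labels 0..K-1, and arbitrary synthetic data; `oob_error` is a hypothetical helper, not a library function.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def oob_error(X, y, n_trees=100, seed=0):
    """Return (OOB error, share of observations with at least one OOB vote)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_classes = int(y.max()) + 1
    votes = np.zeros((n, n_classes), dtype=int)       # OOB vote counts per observation
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)              # in-bag indices for this tree
        oob = np.setdiff1d(np.arange(n), idx)         # observations the tree never saw
        tree = DecisionTreeClassifier(
            max_features="sqrt",
            random_state=int(rng.integers(0, 2**31 - 1)),
        ).fit(X[idx], y[idx])
        pred = tree.predict(X[oob]).astype(int)
        votes[oob, pred] += 1                         # one vote per OOB observation
    has_vote = votes.sum(axis=1) > 0
    y_oob = votes.argmax(axis=1)                      # mode of the OOB votes
    error = float(np.mean(y_oob[has_vote] != y[has_vote]))
    return error, float(has_vote.mean())

X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
err, coverage = oob_error(X, y)
print(f"OOB error: {err:.3f} (observations with an OOB vote: {coverage:.0%})")
```

With only a handful of trees some observations may never be out-of-bag, which is why the helper also reports coverage.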

### 4.3 OOB vs Cross-Validation

| Method | Computational Cost | Bias | Variance |
|--------|-------------------|------|----------|
| OOB | **Zero** (built-in) | Slightly pessimistic (each tree sees only ≈ 0.632×n distinct observations) | Low |
| 5-fold CV | 5× training cost | Moderate | Moderate |
| 10-fold CV | 10× training cost | Lower | Lower |
| Leave-one-out | n× training cost | Lowest | Highest |

---

## 5. Feature Importance

### 5.1 Mean Decrease Impurity (MDI)

For each feature <MathBlock tex={`j`} />, sum the total decrease in Gini impurity (or variance for regression) contributed by splits on that feature across all trees, then normalize:



<MathBlock tex={`\\text{Importance}_j^{\\text{MDI}} = \\frac{1}{B}\\sum_{b=1}^{B} \\sum_{t \\in \\text{tree}_b} \\mathbb{1}[\\text{split feature at node } t = j] \\cdot \\Delta I(t)`} display={true} />



where <MathBlock tex={`\\Delta I(t)`} /> is the impurity decrease at node <MathBlock tex={`t`} />:



<MathBlock tex={`\\Delta I(t) = \\frac{n_t}{n} \\cdot I(t) - \\frac{n_{t_L}}{n} \\cdot I(t_L) - \\frac{n_{t_R}}{n} \\cdot I(t_R)`} display={true} />
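As a worked instance of the impurity decrease, the sketch below evaluates ΔI(t) with Gini impurity for one hypothetical split. The labels are made up for illustration, and the node is taken to be the root, so n_t equals n.

```python
import numpy as np

def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def impurity_decrease(parent, left, right, n_total):
    """Weighted impurity decrease of splitting `parent` into `left` and `right`."""
    return (
        len(parent) / n_total * gini(parent)
        - len(left) / n_total * gini(left)
        - len(right) / n_total * gini(right)
    )

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # I(t) = 0.5
left = np.array([0, 0, 0, 1])                 # I(t_L) = 0.375
right = np.array([0, 1, 1, 1])                # I(t_R) = 0.375

delta = impurity_decrease(parent, left, right, n_total=len(parent))
print(delta)  # → 0.125
```

MDI for a feature is then the sum of such decreases over every node that splits on it, averaged across trees.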



<svg viewBox="0 0 720 340" xmlns="http://www.w3.org/2000/svg" style={{ width: '100%', maxWidth: 720 }}>
  <text x="360" y="28" textAnchor="middle" fontSize="17" fontWeight="bold" fill="#1e293b">Feature Importance — Mean Decrease Impurity</text>

  {/* Axes */}
  <line x1="120" y1="60" x2="120" y2="270" stroke="#475569" strokeWidth="2" />
  <line x1="120" y1="270" x2="660" y2="270" stroke="#475569" strokeWidth="2" />

  {/* Y-axis label */}
  <text x="30" y="170" textAnchor="middle" fontSize="12" fill="#475569" transform="rotate(-90  30  170)">Feature Importance</text>

  {/* X-axis label */}
  <text x="390" y="310" textAnchor="middle" fontSize="12" fill="#475569">Feature</text>

  {/* Feature bars */}
  <rect x="140" y="80" width="65" height="190" rx="4" fill="#3b82f6" opacity="0.85" />
  <text x="172" y="290" textAnchor="middle" fontSize="11" fill="#334155">Age</text>
  <text x="172" y="73" textAnchor="middle" fontSize="10" fontWeight="bold" fill="#1e40af">0.32</text>

  <rect x="225" y="100" width="65" height="170" rx="4" fill="#3b82f6" opacity="0.85" />
  <text x="257" y="290" textAnchor="middle" fontSize="11" fill="#334155">Income</text>
  <text x="257" y="93" textAnchor="middle" fontSize="10" fontWeight="bold" fill="#1e40af">0.28</text>

  <rect x="310" y="130" width="65" height="140" rx="4" fill="#3b82f6" opacity="0.85" />
  <text x="342" y="290" textAnchor="middle" fontSize="11" fill="#334155">Score</text>
  <text x="342" y="123" textAnchor="middle" fontSize="10" fontWeight="bold" fill="#1e40af">0.18</text>

  <rect x="395" y="165" width="65" height="105" rx="4" fill="#3b82f6" opacity="0.85" />
  <text x="427" y="290" textAnchor="middle" fontSize="11" fill="#334155">Hours</text>
  <text x="427" y="158" textAnchor="middle" fontSize="10" fontWeight="bold" fill="#1e40af">0.12</text>

  <rect x="480" y="210" width="65" height="60" rx="4" fill="#94a3b8" opacity="0.7" />
  <text x="512" y="290" textAnchor="middle" fontSize="11" fill="#334155">Zone</text>
  <text x="512" y="203" textAnchor="middle" fontSize="10" fontWeight="bold" fill="#475569">0.06</text>

  <rect x="565" y="235" width="65" height="35" rx="4" fill="#94a3b8" opacity="0.7" />
  <text x="597" y="290" textAnchor="middle" fontSize="11" fill="#334155">ID</text>
  <text x="597" y="228" textAnchor="middle" fontSize="10" fontWeight="bold" fill="#475569">0.04</text>

  {/* Legend */}
  <rect x="480" y="50" width="180" height="58" rx="6" fill="white" stroke="#e2e8f0" strokeWidth="1" />
  <rect x="492" y="62" width="12" height="12" rx="2" fill="#3b82f6" opacity="0.85" />
  <text x="510" y="73" fontSize="10" fill="#334155">Important features</text>
  <rect x="492" y="82" width="12" height="12" rx="2" fill="#94a3b8" opacity="0.7" />
  <text x="510" y="93" fontSize="10" fill="#334155">Less important features</text>
</svg>

<Callout type="warning">
**MDI Bias:** MDI importance is biased toward features with many unique values (e.g., high-cardinality categorical features). This is because such features offer more split points, providing more opportunities for impurity reduction.
</Callout>

### 5.2 Permutation Importance (Mean Decrease in Accuracy)

A more reliable measure. For each feature <MathBlock tex={`j`} />:

1. Compute baseline OOB accuracy
2. Randomly permute the values of feature <MathBlock tex={`j`} /> across all OOB observations
3. Compute OOB accuracy again
4. Importance = decrease in accuracy



<MathBlock tex={`\\text{Importance}_j^{\\text{Perm}} = \\frac{1}{B}\\sum_{b=1}^{B} \\left[\\text{Acc}_b - \\text{Acc}_b^{(\\pi_j)}\\right]`} display={true} />



where <MathBlock tex={`\\text{Acc}_b^{(\\pi_j)}`} /> is the OOB accuracy of tree <MathBlock tex={`b`} /> when feature <MathBlock tex={`j`} /> is permuted.

### 5.3 Permutation Importance Procedure

```
For each tree t_b (b = 1, ..., B):
    1. Identify OOB observations O_b ⊆ {1, ..., n}
    2. Compute accuracy: Acc_b = (1/|O_b|) Σ_{i∈O_b} 𝟙[t_b(xᵢ) = yᵢ]
    3. For each feature j:
        a. Create permuted data: x̂ᵢⱼ = x_{π(i),j} for i ∈ O_b
           (permute column j only)
        b. Compute: Acc_b^(π_j) = (1/|O_b|) Σ_{i∈O_b} 𝟙[t_b(x̂ᵢ) = yᵢ]
        c. Contribution_b^(j) = Acc_b − Acc_b^(π_j)
After all trees, for each feature j:
    Importance_j = (1/B) Σ_b Contribution_b^(j)
```
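The per-tree procedure above can be sketched as follows. Because it needs each tree's OOB set, the sketch grows its own bagged trees with `DecisionTreeClassifier` instead of using a fitted `RandomForestClassifier`; the helper name, data, and seeds are illustrative assumptions. With `shuffle=False`, `make_classification` places the informative features in the first columns, which makes the result easy to read.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def oob_permutation_importance(X, y, n_trees=50, seed=0):
    """Mean per-tree drop in OOB accuracy after permuting each feature."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    contributions = np.zeros((n_trees, p))
    for b in range(n_trees):
        idx = rng.integers(0, n, size=n)                   # in-bag indices
        oob = np.setdiff1d(np.arange(n), idx)              # O_b
        tree = DecisionTreeClassifier(max_features="sqrt", random_state=b)
        tree.fit(X[idx], y[idx])
        X_oob, y_oob = X[oob], y[oob]
        acc = np.mean(tree.predict(X_oob) == y_oob)        # Acc_b
        for j in range(p):
            X_perm = X_oob.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])   # permute column j only
            acc_perm = np.mean(tree.predict(X_perm) == y_oob)
            contributions[b, j] = acc - acc_perm
    return contributions.mean(axis=0)                      # average over trees

# Features 0 and 1 are informative; features 2..5 are pure noise.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0,
    n_repeated=0, shuffle=False, random_state=0,
)
importance = oob_permutation_importance(X, y)
for j, value in enumerate(importance):
    print(f"feature {j}: {value:+.3f}")
```

Noise features should land near zero (occasionally slightly negative), while the informative ones show a clear drop in accuracy when permuted.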

---

## 6. Hyperparameter Tuning

### 6.1 Key Hyperparameters

| Hyperparameter | Default (sklearn) | Range | Effect |
|---|---|---|---|
| `n_estimators` (B) | 100 | [10, 1000+] | More trees → lower variance (diminishing returns) |
| `max_features` (m) | `'sqrt'` (≈ √p) for classification; `1.0` (all p features) for regression, where p/3 is a common heuristic | [1, p] | Fewer features → more decorrelation, but higher bias |
| `max_depth` | None (unlimited) | [1, None] | Limiting depth → reduces overfitting |
| `min_samples_split` | 2 | [2, 20+] | Higher → simpler trees |
| `min_samples_leaf` | 1 | [1, 20+] | Higher → smoother boundaries |
| `max_leaf_nodes` | None | [2, None] | Limiting leaves → regularization |

### 6.2 The m (max_features) Trade-off



<MathBlock tex={`\\rho(m) = \\text{Corr}[\\hat{f}_i(x), \\hat{f}_j(x)]`} display={true} />



- **<MathBlock tex={`m = p`} />**: All features considered → equivalent to bagging → <MathBlock tex={`\\rho`} /> is high
- **<MathBlock tex={`m = 1`} />**: Random feature at each split → maximum decorrelation → high bias
- **<MathBlock tex={`m \\approx \\sqrt{p}`} />**: Sweet spot for classification (Breiman's recommendation)
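The trade-off in the list above can be probed empirically. The sketch below fits forests with several `max_features` settings and reports a rough proxy for ρ (the mean pairwise correlation of per-tree class-1 probabilities on held-out data) alongside the OOB score. The proxy, the helper name `mean_tree_correlation`, and the synthetic data are assumptions for illustration; results on real data will differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def mean_tree_correlation(forest, X):
    """Average pairwise correlation of per-tree class-1 probabilities on X."""
    preds = np.stack([tree.predict_proba(X)[:, 1] for tree in forest.estimators_])
    corr = np.corrcoef(preds)                            # (B, B) correlation matrix
    off_diagonal = corr[~np.eye(len(preds), dtype=bool)]
    return float(np.nanmean(off_diagonal))               # ignore constant-output trees

X, y = make_classification(n_samples=600, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

summary = {}
for m in (1, "sqrt", 10, None):                          # None means m = p in scikit-learn
    forest = RandomForestClassifier(
        n_estimators=100, max_features=m, oob_score=True, random_state=0
    )
    forest.fit(X_train, y_train)
    summary[m] = (mean_tree_correlation(forest, X_test), forest.oob_score_)
    print(f"max_features={str(m):>4}  rho~{summary[m][0]:.3f}  OOB score={summary[m][1]:.3f}")
```

The correlation proxy should rise as m moves from 1 toward p, while the OOB score typically peaks somewhere in between.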

*Figure: Effect of m (max_features) on correlation and error. The x-axis shows m at 1, √p, p/3, p/2, and p; the curves show the correlation ρ(m) and the error as functions of m, with the sweet spot marked (classification: √p, regression: p/3).*

### 6.3 Tuning Strategy

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

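# Assumes X_train and y_train already exist, e.g. from the split in Section 7.1.
# Candidate values from which RandomizedSearchCV samples n_iter combinations.
param_distributions = {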
    'n_estimators':      [100, 200, 500, 1000],
    'max_features':      ['sqrt', 'log2', 0.3, 0.5, 0.7],
    'max_depth':         [None, 10, 20, 30, 50],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf':  [1, 2, 4],
    'bootstrap':         [True, False],
}

rf = RandomForestClassifier(random_state=42, n_jobs=-1)
search = RandomizedSearchCV(
    rf, param_distributions,
    n_iter=100,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42
)
search.fit(X_train, y_train)
print(f"Best params: {search.best_params_}")
print(f"Best CV accuracy: {search.best_score_:.4f}")
```

---

## 7. Implementation in Python

### 7.1 Full Working Example

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
    RocCurveDisplay
)
import matplotlib.pyplot as plt

# --- Generate synthetic data ---
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=12,
    n_redundant=4,
    n_classes=2,
    random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# --- Fit Random Forest ---
rf = RandomForestClassifier(
    n_estimators=500,
    max_features='sqrt',
    max_depth=None,
    min_samples_leaf=2,
    oob_score=True,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)

# --- Evaluate ---
print(f"OOB Score: {rf.oob_score_:.4f}")
print(f"Test Accuracy: {rf.score(X_test, y_test):.4f}")
print(f"Test ROC-AUC: {roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, rf.predict(X_test)))
```

### 7.2 Feature Importance Analysis

```python
# --- Permutation Importance (more reliable) ---
from sklearn.inspection import permutation_importance

result = permutation_importance(
    rf, X_test, y_test,
    n_repeats=30,
    random_state=42,
    n_jobs=-1
)

feature_names = [f"Feature {i}" for i in range(X.shape[1])]
importances = pd.DataFrame({
    'feature': feature_names,
    'mdi_importance': rf.feature_importances_,
    'perm_importance': result.importances_mean,
    'perm_std': result.importances_std
}).sort_values('perm_importance', ascending=False)

print(importances.head(10))
```

### 7.3 OOB Error Convergence

```python
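# --- OOB error vs number of trees ---
# warm_start=True keeps the trees that are already fitted and only grows the
# additional ones, so the forest is not retrained from scratch at every step.
rf_temp = RandomForestClassifier(
    warm_start=True,
    oob_score=True,
    random_state=42,
    n_jobs=-1
)

errors = []
# Start at 20 trees: with very few trees some observations are never OOB,
# and scikit-learn warns that their OOB estimate is unreliable.
for n_trees in range(20, 501, 10):
    rf_temp.set_params(n_estimators=n_trees)
    rf_temp.fit(X_train, y_train)
    errors.append({
        'n_trees': n_trees,
        'oob_error': 1 - rf_temp.oob_score_
    })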

errors_df = pd.DataFrame(errors)

plt.figure(figsize=(8, 4))
plt.plot(errors_df['n_trees'], errors_df['oob_error'], color='#3b82f6', linewidth=2)
plt.xlabel('Number of Trees (B)')
plt.ylabel('OOB Error')
plt.title('OOB Error Convergence')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('oob_convergence.png', dpi=150)
plt.show()
```

### 7.4 Partial Dependence Plots

```python
from sklearn.inspection import PartialDependenceDisplay

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
PartialDependenceDisplay.from_estimator(
    rf, X_train,
    features=[0, 1, 2],
    feature_names=feature_names,
    ax=axes,
    kind='both',          # individual + average
    subsample=50,
    random_state=42
)
plt.tight_layout()
plt.savefig('partial_dependence.png', dpi=150)
plt.show()
```

---

## 8. Random Forest vs. Single Decision Tree

### 8.1 Comparison

| Property | Single Decision Tree | Random Forest |
|---|---|---|
| Variance | High (unstable) | Low (averaged) |
| Bias | Low (deep trees) | Low (deep trees per tree) |
| Interpretability | High (single path) | Low (black box) |
| Overfitting risk | High | Low |
| Training speed | Fast | Moderate (B× slower) |
| Prediction speed | O(depth) | O(B × depth) |
| Memory | O(nodes) | O(B × nodes) |
| Handles missing values | Implementation-dependent (e.g., surrogate splits in CART) | Implementation-dependent (older scikit-learn versions require imputation) |
| Feature importance | Single tree path | Averaged across trees |

### 8.2 When to Use Random Forest

**Good choice:**
- Tabular data with mixed feature types
- When interpretability is secondary to accuracy
- When you need robust estimates with minimal tuning
- As a strong baseline before trying more complex models

**Less suitable:**
- Very high-dimensional sparse data (consider linear models or gradient boosting)
- When prediction speed is critical (consider distillation or pruned trees)
- When interpretability is paramount (consider single trees or linear models)

---

## 9. Extensions and Variants

### 9.1 Extra-Trees (Extremely Randomized Trees)

Similar to Random Forest but with additional randomization: split thresholds are chosen randomly rather than optimizing over the feature values.



<MathBlock tex={`\\text{Split at } x_j \\leq t \\quad \\text{where } t \\sim \\text{Uniform}(\\min_j, \\max_j)`} display={true} />



- Further reduces variance at cost of slightly higher bias
- Faster training (no sorting/split optimization per feature)
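scikit-learn exposes this variant as `ExtraTreesClassifier`. The sketch below compares it with `RandomForestClassifier` under 5-fold cross-validation on arbitrary synthetic data; the numbers only illustrate the API and say nothing general about which method is better. Note that scikit-learn's Extra-Trees uses the full training set for each tree by default (`bootstrap=False`).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Illustrative data only; dataset size and seeds are arbitrary choices.
X, y = make_classification(n_samples=600, n_features=20, n_informative=8, random_state=0)

models = {
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=0),
    "ExtraTrees": ExtraTreesClassifier(n_estimators=200, random_state=0),
}

scores = {}
for name, model in models.items():
    # Mean accuracy over 5 cross-validation folds
    scores[name] = float(cross_val_score(model, X, y, cv=5, scoring="accuracy").mean())
    print(f"{name:>12}: {scores[name]:.3f}")
```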

### 9.2 Balanced Random Forest

For imbalanced classification, each bootstrap sample is drawn to balance class frequencies:



<MathBlock tex={`\\text{Class distribution in } \\mathcal{D}^*_b: \\quad \\frac{n_k^*}{\\sum_k n_k^*} = \\frac{1}{K} \\quad \\forall k`} display={true} />
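A balanced bootstrap can be sketched with NumPy as below. Drawing as many observations per class as the smallest class contains is one common choice and is an assumption here; the helper name `balanced_bootstrap_indices` is hypothetical.

```python
import numpy as np

def balanced_bootstrap_indices(y, rng):
    """Draw the same number of indices, with replacement, from every class."""
    classes, counts = np.unique(y, return_counts=True)
    n_per_class = counts.min()                       # size of the smallest class
    parts = [
        rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=True)
        for c in classes
    ]
    return np.concatenate(parts)

rng = np.random.default_rng(0)
y = np.array([0] * 90 + [1] * 10)                    # 90/10 class imbalance
idx = balanced_bootstrap_indices(y, rng)
sample_counts = np.bincount(y[idx])
print(sample_counts)  # → [10 10]
```

Each tree would then be fitted on `X[idx], y[idx]`, so every tree sees the classes in equal proportion while different trees see different majority-class subsets.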



### 9.3 Quantile Regression Forests

Extends Random Forest to estimate conditional quantiles <MathBlock tex={`\\hat{q}_\\alpha(x)`} /> by keeping all training response values at each leaf and computing weighted quantiles.
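A simplified sketch of the idea uses `RandomForestRegressor.apply` to find, for each tree, which training responses share a leaf with the query point, and then takes a weighted quantile of those responses. This is an illustration under stated assumptions (synthetic one-dimensional data, all training points dropped down every tree, hypothetical helper `predict_quantile`), not a faithful reproduction of any particular library's implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(500, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.3, size=500)

forest = RandomForestRegressor(n_estimators=100, min_samples_leaf=10, random_state=0)
forest.fit(X_train, y_train)
train_leaves = forest.apply(X_train)                 # (n_train, B) leaf id per tree

def predict_quantile(x_new, alpha):
    """Weighted alpha-quantile of training responses sharing leaves with x_new."""
    leaves = forest.apply(np.asarray(x_new, dtype=float).reshape(1, -1))[0]  # (B,)
    same_leaf = train_leaves == leaves               # (n_train, B) co-membership
    # Per tree: equal weight within the leaf; then average over trees (sums to 1).
    weights = (same_leaf / same_leaf.sum(axis=0)).mean(axis=1)
    order = np.argsort(y_train)
    cdf = np.cumsum(weights[order])                  # weighted empirical CDF
    k = min(int(np.searchsorted(cdf, alpha)), len(order) - 1)
    return float(y_train[order][k])

q10, q50, q90 = (predict_quantile([5.0], a) for a in (0.1, 0.5, 0.9))
print(f"x=5.0  q10={q10:.2f}  q50={q50:.2f}  q90={q90:.2f}")
```

The interval between the 10% and 90% quantiles gives a rough prediction band, which a plain regression forest (mean only) does not provide.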

---

## 10. Summary

Random Forests achieve excellent predictive performance through two simple ideas:

1. **Bootstrap aggregating**: Train each tree on a different random sample → decorrelates models → reduces variance
2. **Feature randomization**: At each split, consider only a random subset of features → further decorrelates trees → amplifies variance reduction

**Key takeaways:**
- OOB evaluation provides a nearly free estimate of generalization error that is usually slightly pessimistic
- Feature importance measures identify predictive variables
- Random Forests are robust to overfitting with minimal hyperparameter tuning
- They serve as a strong baseline for tabular data tasks

<Callout type="tip">
**Rule of thumb:** Try Random Forest first as a baseline. It requires minimal tuning, handles mixed feature types (and, in some implementations, missing values), and provides built-in feature importance and OOB error estimates. Move to gradient boosting if Random Forest performance is insufficient.
</Callout>

---

## References

- Breiman, L. (1996). Bagging predictors. *Machine Learning*, 24(2), 123–140.
- Breiman, L. (2001). Random Forests. *Machine Learning*, 45(1), 5–32.
- Hastie, T., Tibshirani, R., and Friedman, J. (2009). *The Elements of Statistical Learning* (2nd ed.). Springer. Chapter 15.
- Scikit-learn documentation: [Random Forest](https://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees)