Random Forest: Bootstrap Aggregating and Feature Randomness
1. Introduction
A Random Forest is an ensemble method that constructs a multitude of decision trees during training and outputs the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. It combines bootstrap aggregating (bagging) with feature randomization to produce trees with low correlation and high predictive power.
The key insight: by introducing controlled randomness into tree construction, we reduce variance without substantially increasing bias.
2. Bootstrap Aggregating (Bagging)
2.1 The Bootstrap Procedure
Bagging (Breiman, 1996) draws B bootstrap samples from the original dataset of size n. Each bootstrap sample is drawn with replacement, meaning approximately 63.2% of unique original samples appear in any given bootstrap sample. The remaining ~36.8% are called out-of-bag (OOB) observations.
2.2 Bootstrap Sampling Visualization
2.3 Mathematical Foundation
For a dataset , each bootstrap sample is drawn by uniformly sampling observations with replacement. The bagging predictor is:
where is the model trained on bootstrap sample .
For regression, bagging reduces variance:
where is the pairwise correlation between trees. Without bagging (single tree), variance = . Bagging reduces the second term by factor , but the irreducible term limits gains.
3. Random Forest Algorithm
3.1 Two Sources of Randomness
Random Forest (Breiman, 2001) adds a second randomization step: at each node, only a random subset of features (out of total) is considered for splitting. This decorrelates trees, reducing and thus overall variance.
3.2 The Algorithm
Algorithm: Random Forest (classification variant)
─────────────────────────────────────────────────
Input: Training set D = {(xâ‚,yâ‚), ..., (xâ‚™,yâ‚™)}
Number of trees B
Features to consider at each split m
Output: Ensemble classifier |RF|
1: for b = 1 to B do
2: Draw bootstrap sample D*_b from D (sample n with replacement)
3: Grow tree t_b on D*_b:
4: at each node:
5: randomly select m features from p total
6: find best split among these m features
7: split node into two child nodes
8: stop when minimum node size or max depth reached
9: end for
10: Output: |RF(x) = mode{ t_b(x) }_{b=1}^{B}
3.3 Theoretical Justification
Randomness reduces correlation. Consider the variance of the ensemble prediction. For identically distributed trees with pairwise correlation and individual variance :
- As , the second term vanishes:
- Feature randomization reduces (trees become less correlated)
- The reduction in is the primary mechanism by which Random Forest outperforms bagged trees
Bias consideration: Random Forests do not reduce bias compared to individual trees (which are typically grown deep with low bias). The variance reduction comes at a slight cost in bias, but this is negligible for large trees.
4. Out-of-Bag (OOB) Evaluation
4.1 Concept
Each training observation is OOB for approximately 36.8% of the trees. We can use these trees to make predictions for that observation without a separate validation set.
4.2 OOB Error Formula
For classification, the OOB error is:
where is the prediction from trees for which observation was OOB.
Why OOB works: Each observation is predicted by approximately 36.8% of the ensemble (those trees where it was OOB). This is equivalent to a cross-validation estimate with ≈ 0.632×n training samples per fold — remarkably close to leave-one-out CV but computed at zero additional cost.
4.3 OOB vs Cross-Validation
| Method | Computational Cost | Bias | Variance |
|---|---|---|---|
| OOB | Zero (built-in) | Slight upward bias (~0.632 rule) | Low |
| 5-fold CV | 5× training cost | Moderate | Moderate |
| 10-fold CV | 10× training cost | Lower | Lower |
| Leave-one-out | n× training cost | Lowest | Highest |
5. Feature Importance
5.1 Mean Decrease Impurity (MDI)
For each feature , sum the total decrease in Gini impurity (or variance for regression) contributed by splits on that feature across all trees, then normalize:
where is the impurity decrease at node :
MDI Bias: MDI importance is biased toward features with many unique values (e.g., high-cardinality categorical features). This is because such features offer more split points, providing more opportunities for impurity reduction.
5.2 Permutation Importance (Mean Decrease in Accuracy)
A more reliable measure. For each feature :
- Compute baseline OOB accuracy
- Randomly permute the values of feature across all OOB observations
- Compute OOB accuracy again
- Importance = decrease in accuracy
where is the OOB accuracy of tree when feature is permuted.
5.3 Permutation Importance Procedure
For each tree t_b (b = 1, ..., B):
1. Identify OOB observations O_b ⊂ {1, ..., n}
2. Compute accuracy: Acc_b = (1/|O_b|) Σ_{i∈O_b} ðŸ™[t_b(xáµ¢) = yáµ¢]
3. For each feature j:
a. Create permuted data: x̃ᵢⱼ = x_{π(i),j} for i ∈ O_b
(permute column j only)
b. Compute: Acc_b^(Ï€_j) = (1/|O_b|) Σ_{i∈O_b} ðŸ™[t_b(x̃ᵢ) = yáµ¢]
c. Contribution_b^(j) = Acc_b − Acc_b^(π_j)
4. Importance_j = (1/B) Σ_b Contribution_b^(j)
6. Hyperparameter Tuning
6.1 Key Hyperparameters
| Hyperparameter | Default (sklearn) | Range | Effect |
|---|---|---|---|
n_estimators (B) | 100 | [10, 1000+] | More trees → lower variance (diminishing returns) |
max_features (m) | √p (classification), p/3 (regression) | [1, p] | Fewer features → more decorrelation, less bias |
max_depth | None (unlimited) | [1, None] | Limiting depth → reduces overfitting |
min_samples_split | 2 | [2, 20+] | Higher → simpler trees |
min_samples_leaf | 1 | [1, 20+] | Higher → smoother boundaries |
max_leaf_nodes | None | [2, None] | Limiting leaves → regularization |
6.2 The m (max_features) Trade-off
- : All features considered → equivalent to bagging → is high
- : Random feature at each split → maximum decorrelation → high bias
- : Sweet spot for classification (Breiman's recommendation)
6.3 Tuning Strategy
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
param_distributions = {
'n_estimators': [100, 200, 500, 1000],
'max_features': ['sqrt', 'log2', 0.3, 0.5, 0.7],
'max_depth': [None, 10, 20, 30, 50],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'bootstrap': [True, False],
}
rf = RandomForestClassifier(random_state=42, n_jobs=-1)
search = RandomizedSearchCV(
rf, param_distributions,
n_iter=100,
cv=5,
scoring='accuracy',
n_jobs=-1,
random_state=42
)
search.fit(X_train, y_train)
print(f"Best params: {search.best_params_}")
print(f"Best CV accuracy: {search.best_score_:.4f}")
7. Implementation in Python
7.1 Full Working Example
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
classification_report,
confusion_matrix,
roc_auc_score,
RocCurveDisplay
)
import matplotlib.pyplot as plt
# --- Generate synthetic data ---
X, y = make_classification(
n_samples=1000,
n_features=20,
n_informative=12,
n_redundant=4,
n_classes=2,
random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# --- Fit Random Forest ---
rf = RandomForestClassifier(
n_estimators=500,
max_features='sqrt',
max_depth=None,
min_samples_leaf=2,
oob_score=True,
random_state=42,
n_jobs=-1
)
rf.fit(X_train, y_train)
# --- Evaluate ---
print(f"OOB Score: {rf.oob_score_:.4f}")
print(f"Test Accuracy: {rf.score(X_test, y_test):.4f}")
print(f"Test ROC-AUC: {roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]):.4f}")
print("\nClassification Report:")
print(classification_report(y_test, rf.predict(X_test)))
7.2 Feature Importance Analysis
# --- Permutation Importance (more reliable) ---
from sklearn.inspection import permutation_importance
result = permutation_importance(
rf, X_test, y_test,
n_repeats=30,
random_state=42,
n_jobs=-1
)
feature_names = [f"Feature {i}" for i in range(X.shape[1])]
importances = pd.DataFrame({
'feature': feature_names,
'mdi_importance': rf.feature_importances_,
'perm_importance': result.importances_mean,
'perm_std': result.importances_std
}).sort_values('perm_importance', ascending=False)
print(importances.head(10))
7.3 OOB Error Convergence
# --- OOB error vs number of trees ---
errors = []
for n_trees in range(1, 501, 10):
rf_temp = RandomForestClassifier(
n_estimators=n_trees,
oob_score=True,
random_state=42
)
rf_temp.fit(X_train, y_train)
errors.append({
'n_trees': n_trees,
'oob_error': 1 - rf_temp.oob_score_
})
errors_df = pd.DataFrame(errors)
plt.figure(figsize=(8, 4))
plt.plot(errors_df['n_trees'], errors_df['oob_error'], color='#3b82f6', linewidth=2)
plt.xlabel('Number of Trees (B)')
plt.ylabel('OOB Error')
plt.title('OOB Error Convergence')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('oob_convergence.png', dpi=150)
plt.show()
7.4 Partial Dependence Plots
from sklearn.inspection import PartialDependenceDisplay
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
PartialDependenceDisplay.from_estimator(
rf, X_train,
features=[0, 1, 2],
feature_names=feature_names,
ax=axes,
kind='both', # individual + average
subsample=50,
random_state=42
)
plt.tight_layout()
plt.savefig('partial_dependence.png', dpi=150)
plt.show()
8. Random Forest vs. Single Decision Tree
8.1 Comparison
| Property | Single Decision Tree | Random Forest |
|---|---|---|
| Variance | High (unstable) | Low (averaged) |
| Bias | Low (deep trees) | Low (deep trees per tree) |
| Interpretability | High (single path) | Low (black box) |
| Overfitting risk | High | Low |
| Training speed | Fast | Moderate (B× slower) |
| Prediction speed | O(depth) | O(B × depth) |
| Memory | O(nodes) | O(B × nodes) |
| Handles missing values | No (native) | Yes (surrogate, sklearn) |
| Feature importance | Single tree path | Averaged across trees |
8.2 When to Use Random Forest
Good choice:
- Tabular data with mixed feature types
- When interpretability is secondary to accuracy
- When you need robust estimates with minimal tuning
- As a strong baseline before trying more complex models
Less suitable:
- Very high-dimensional sparse data (consider linear models or gradient boosting)
- When prediction speed is critical (consider distillation or pruned trees)
- When interpretability is paramount (consider single trees or linear models)
9. Extensions and Variants
9.1 Extra-Trees (Extremely Randomized Trees)
Similar to Random Forest but with additional randomization: split thresholds are chosen randomly rather than optimizing over the feature values.
- Further reduces variance at cost of slightly higher bias
- Faster training (no sorting/split optimization per feature)
9.2 Balanced Random Forest
For imbalanced classification, each bootstrap sample is drawn to balance class frequencies:
9.3 Quantile Regression Forests
Extends Random Forest to estimate conditional quantiles by keeping all training response values at each leaf and computing weighted quantiles.
10. Summary
Random Forests achieve excellent predictive performance through two simple ideas:
- Bootstrap aggregating: Train each tree on a different random sample → decorrelates models → reduces variance
- Feature randomization: At each split, consider only a random subset of features → further decorrelates trees → amplifies variance reduction
Key takeaways:
- OOB evaluation provides a free, unbiased estimate of generalization error
- Feature importance measures identify predictive variables
- Random Forests are robust to overfitting with minimal hyperparameter tuning
- They serve as a strong baseline for tabular data tasks
Rule of thumb: Always try Random Forest first as a baseline. It requires minimal tuning, handles missing values and mixed feature types, and provides built-in feature importance and OOB error estimates. Only move to gradient boosting if Random Forest performance is insufficient.
References
- Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
- Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32.
- Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. Chapter 15.
- Scikit-learn documentation: Random Forest