Random Forest: Bagging, Feature Sampling & Out-of-Bag Error
Understanding ensemble methods that combine multiple decision trees
Interview Question
"Explain how Random Forest reduces variance compared to a single decision tree. What is the role of feature sampling in Random Forest? How does out-of-bag error estimation work and why is it useful?"
Difficulty: Medium | Frequently asked at Google, Apple, Amazon
Theoretical Foundation
From Decision Trees to Random Forest
A single decision tree has low bias but high variance. Small changes in training data can lead to completely different tree structures. Random Forest addresses this by building many trees and aggregating their predictions.
Random Forest Algorithm:
- For trees: a. Draw a bootstrap sample from the training data b. Grow a decision tree on c. At each split, select (classification) or (regression) random features
- Output:
- Classification:
- Regression:
Two Sources of Randomness
Random Forest introduces two sources of randomness:
1. Bootstrap Aggregating (Bagging)
Each tree is trained on a bootstrap sample (sampling with replacement) of the original data. On average, each bootstrap sample contains about 63.2% of the original samples (since ).
Why bagging reduces variance:
For independent trees with variance :
where is the correlation between trees. As :
Bagging reduces the second term, but doesn't affect the first. To reduce , we need feature sampling.
2. Feature Sampling (Random Subspace Method)
At each split, only consider a random subset of features. This decorrelates the trees because:
- Different trees will use different features for splits
- Important features won't dominate all trees
- The model becomes more robust to noisy features
Why feature sampling works:
- If one feature is very strong, all trees would use it, making them correlated
- By randomly excluding this feature, some trees must find alternative patterns
- This diversity improves the ensemble's generalization
βΉοΈ
Key Insight: The magic of Random Forest is that it combines two techniques that each reduce variance: bagging (reduces variance by averaging) and feature sampling (reduces correlation between trees). Together, they achieve significant variance reduction.
Out-of-Bag (OOB) Error Estimation
Each bootstrap sample leaves out about 36.8% of the original data. These out-of-bag (OOB) samples can be used as a validation set:
where is the prediction for sample using only trees that did not include in their bootstrap sample.
Advantages of OOB estimation:
- No need for a separate validation set
- Each sample is predicted by about 36.8% of the trees
- Provides an unbiased estimate of generalization error
- Computed during training (no additional cost)
β οΈ
Common Misconception: OOB error is NOT the same as cross-validation error. OOB uses each sample's prediction from trees that didn't see it, while CV uses held-out sets. OOB is generally slightly pessimistic compared to CV.
Feature Importance in Random Forest
Mean Decrease in Impurity (MDI)
Aggregate impurity decrease across all trees:
Mean Decrease in Accuracy (Permutation Importance)
For each feature :
- Compute OOB accuracy with original feature values
- Randomly permute feature across OOB samples
- Compute OOB accuracy with permuted values
- Importance:
Permutation importance is more reliable because:
- Not biased toward high-cardinality features
- Captures feature interactions
- Accounts for feature correlation
Hyperparameters
| Parameter | Description | Default | Recommendation |
|---|---|---|---|
n_estimators | Number of trees | 100 | More is better (diminishing returns) |
max_features | Features per split | sqrt(p) for classification | sqrt(p) or log2(p) |
max_depth | Tree depth | None (unlimited) | Set for very large datasets |
min_samples_leaf | Min samples in leaf | 1 | Increase for noisy data |
bootstrap | Use bootstrap sampling | True | True for standard RF |
π‘
Production Tip: The most important hyperparameter is n_estimators. Always set it as high as your computational budget allows. More trees almost never hurt (though with diminishing returns).
Code Implementation
Explanation of Code
-
RF vs Single Tree: Shows how Random Forest reduces overfitting by comparing training-test accuracy gaps.
-
Number of Trees: Demonstrates diminishing returns as more trees are added.
-
OOB Error: Shows how OOB provides a free validation estimate during training.
-
Feature Importance: Compares impurity-based and permutation-based importance measures.
-
Feature Sampling: Shows how
max_featuresaffects performance and tree correlation. -
Tree Correlation: Demonstrates that fewer features per split leads to less correlated trees.
Real-World Applications
Google: Search Ranking
Google uses Random Forest variants for:
- Query Understanding: Classifying search intent
- Spam Detection: Identifying low-quality web pages
- Feature Selection: Identifying important ranking signals
Apple: Siri Voice Recognition
Apple uses Random Forest for:
- Wake Word Detection: "Hey Siri" activation
- Intent Classification: Understanding user commands
- Device Recommendations: Personalized suggestions
π‘
Google Interview Tip: Be prepared to discuss the bias-variance decomposition of Random Forest. The key insight is that RF reduces variance while maintaining low bias, making it a robust default choice.
Common Follow-Up Questions
Q1: Why does Random Forest use bootstrap sampling instead of using the full dataset for each tree?
Bootstrap sampling introduces diversity between trees. If all trees used the full dataset, they would be identical (for deterministic splitting criteria). Bootstrap ensures each tree sees slightly different data, creating diverse predictions.
Q2: What happens when max_features = 1?
Each split considers only one random feature. This maximizes decorrelation but may miss the best split. The model becomes a "random subspace" method. In practice, sqrt(p) works well for classification.
Q3: Can Random Forest handle missing values?
Standard scikit-learn Random Forest cannot. However, implementations like XGBoost and LightGBM handle missing values by learning the optimal direction (left or right) for missing values at each split.
Q4: How do you determine the optimal number of trees?
There's no overfitting with more trees (unlike boosting). Monitor OOB error or validation error and increase trees until it stabilizes. Typically, 100-500 trees suffice for most problems.
Company-Specific Tips
Google Interview Tips
- Discuss computational complexity: for training
- Be ready to explain why RF doesn't overfit with more trees
- Mention parallelization benefits (trees are independent)
- Talk about feature importance for model interpretability
Apple Interview Tips
- Focus on real-time inference requirements
- Discuss model compression techniques
- Be prepared to explain ensemble diversity
- Mention privacy considerations in federated learning