Google & Apple Interview

Random Forest: Bagging, Feature Sampling & Out-of-Bag Error

Understanding ensemble methods that combine multiple decision trees

Interview Question

"Explain how Random Forest reduces variance compared to a single decision tree. What is the role of feature sampling in Random Forest? How does out-of-bag error estimation work and why is it useful?"

Difficulty: Medium | Frequently asked at Google, Apple, Amazon

Theoretical Foundation

From Decision Trees to Random Forest

A single decision tree has low bias but high variance. Small changes in training data can lead to completely different tree structures. Random Forest addresses this by building many trees and aggregating their predictions.

Random Forest Algorithm:

For $b = 1, \ldots, B$:
a. Draw a bootstrap sample $X^*$ of size $n$ from the training data.
b. Grow a decision tree $T_b$ on $X^*$.
c. While growing $T_b$, consider only a random subset of the $p$ features at each split: typically $\sqrt{p}$ for classification or $p/3$ for regression.

Output:
- Classification: $\hat{f}(x) = \text{majority vote}\{T_1(x), \ldots, T_B(x)\}$
- Regression: $\hat{f}(x) = \frac{1}{B} \sum_{b=1}^{B} T_b(x)$
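The loop above can be sketched in a few lines on top of scikit-learn's `DecisionTreeClassifier`. This is a minimal illustration rather than a reference implementation: the helper names `fit_forest` and `predict_forest`, the synthetic dataset, and the choice of 25 trees are all assumptions made for the example, and the vote assumes binary labels 0/1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_forest(X, y, n_trees=25, seed=0):
    """Grow n_trees trees, each on its own bootstrap sample (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # bootstrap: n draws with replacement
        tree = DecisionTreeClassifier(
            max_features="sqrt",  # sqrt(p) candidate features at each split
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees


def predict_forest(trees, X):
    """Majority vote over the trees (binary labels 0/1 assumed)."""
    votes = np.stack([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)


# Synthetic data, chosen only for illustration.
X, y = make_classification(n_samples=400, n_features=16, n_informative=6, random_state=0)
trees = fit_forest(X[:300], y[:300])
accuracy = (predict_forest(trees, X[300:]) == y[300:]).mean()
print(f"hold-out accuracy: {accuracy:.3f}")
```

An odd number of trees avoids ties in the binary vote. In practice `RandomForestClassifier` does the same work with parallelism and OOB bookkeeping built in.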

Two Sources of Randomness

Random Forest introduces two sources of randomness:

1. Bootstrap Aggregating (Bagging)

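Each tree is trained on a bootstrap sample: $n$ draws with replacement from the $n$ original samples. On average, a bootstrap sample contains about 63.2% of the distinct original samples, since the probability that a given sample is drawn at least once is $1 - (1 - 1/n)^n \approx 1 - e^{-1} \approx 0.632$ for large $n$.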
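The 63.2% figure is easy to check numerically. The sketch below uses only the standard library; the function name `unique_fraction` and the sample size of 10,000 are arbitrary choices for the example.

```python
import random


def unique_fraction(n, seed=0):
    """Fraction of distinct indices in one bootstrap sample of size n."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}  # n draws with replacement
    return len(drawn) / n


n = 10_000
closed_form = 1 - (1 - 1 / n) ** n
print(f"closed form: {closed_form:.4f}")
print(f"simulated:   {unique_fraction(n):.4f}")
```

Both numbers should land close to $1 - e^{-1}$; the complement, roughly 36.8%, is the share of samples left out of each tree.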

Why bagging reduces variance:

For $B$ identically distributed trees, each with variance $\sigma^2$ and pairwise correlation $\rho$:

$$\text{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$$

If the trees were independent ($\rho = 0$), the variance would shrink as $\sigma^2 / B$. In practice trees grown on overlapping data are correlated, and as $B \to \infty$:

$$\text{Var}(\text{RF}) \to \rho\sigma^2$$

Adding trees shrinks the second term but does not affect the first. To reduce $\rho$, we need feature sampling.
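To see the formula at work, the following sketch simulates "tree predictions" as unit-variance Gaussians that share a common component, which gives them pairwise correlation $\rho$. The Gaussian model, the value $\rho = 0.3$, and the function names are assumptions for illustration; real tree predictions are not Gaussian.

```python
import math
import random
import statistics


def ensemble_variance(rho, sigma2, n_trees):
    """Closed form: rho * sigma^2 + (1 - rho) / B * sigma^2."""
    return rho * sigma2 + (1 - rho) / n_trees * sigma2


def simulated_variance(rho, n_trees, trials=10_000, seed=0):
    """Variance of the mean of n_trees unit-variance draws with pairwise correlation rho."""
    rng = random.Random(seed)
    a, b = math.sqrt(rho), math.sqrt(1 - rho)
    means = []
    for _ in range(trials):
        shared = rng.gauss(0, 1)  # component common to all trees
        total = sum(a * shared + b * rng.gauss(0, 1)  # plus each tree's own noise
                    for _ in range(n_trees))
        means.append(total / n_trees)
    return statistics.pvariance(means)


rho = 0.3
for n_trees in (1, 10, 50):
    print(n_trees,
          round(ensemble_variance(rho, 1.0, n_trees), 3),
          round(simulated_variance(rho, n_trees), 3))
print("floor as B grows:", rho)
```

The simulated variance should track the closed form and flatten out near $\rho\sigma^2$, no matter how many trees are averaged.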

2. Feature Sampling (Random Subspace Method)

At each split, only consider a random subset of features. This decorrelates the trees because:

- Different trees will use different features for splits
- Important features won't dominate all trees
- The model becomes more robust to noisy features

Why feature sampling works:

- If one feature is very strong, all trees would use it, making them correlated
- By randomly excluding this feature, some trees must find alternative patterns
- This diversity improves the ensemble's generalization
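The decorrelation effect can be measured directly by comparing the predictions of individual trees inside a fitted forest. The sketch below is one way to do it, assuming scikit-learn's `RandomForestRegressor` and the synthetic Friedman #1 dataset; the helper `mean_tree_correlation` and all parameter values are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor


def mean_tree_correlation(max_features, X_train, y_train, X_test, n_trees=30):
    """Average pairwise correlation between individual trees' test predictions."""
    forest = RandomForestRegressor(
        n_estimators=n_trees, max_features=max_features, random_state=0
    ).fit(X_train, y_train)
    preds = np.stack([tree.predict(X_test) for tree in forest.estimators_])
    corr = np.corrcoef(preds)  # (n_trees, n_trees) correlation matrix
    upper = corr[np.triu_indices(n_trees, k=1)]  # each pair once, diagonal excluded
    return upper.mean()


X, y = make_friedman1(n_samples=600, n_features=10, noise=1.0, random_state=0)
X_train, y_train, X_test = X[:400], y[:400], X[400:]

for m in (None, 3, 1):  # None = all 10 features considered at every split
    print(m, round(mean_tree_correlation(m, X_train, y_train, X_test), 3))
```

Fewer candidate features per split should give a lower average correlation, at the price of weaker individual trees; the best ensemble usually sits somewhere in between.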

ℹ️

Key Insight: The magic of Random Forest is that it combines two techniques that each reduce variance: bagging (reduces variance by averaging) and feature sampling (reduces correlation between trees). Together, they achieve significant variance reduction.

Out-of-Bag (OOB) Error Estimation

Each bootstrap sample leaves out about 36.8% of the original data. These out-of-bag (OOB) samples can be used as a validation set:

$$\text{OOB Error} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[y_i \neq \hat{f}_{-i}(x_i)]$$

where $\hat{f}_{-i}(x_i)$ is the prediction for sample $i$ using only trees that did not include $i$ in their bootstrap sample.
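In scikit-learn, passing `oob_score=True` makes the forest compute this estimate during `fit`. The sketch below compares it with the error on a held-out split; the synthetic dataset and parameter values are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X_train, y_train)

oob_error = 1.0 - forest.oob_score_  # oob_score_ is the OOB accuracy
test_error = 1.0 - forest.score(X_test, y_test)
print(f"OOB error:  {oob_error:.3f}")
print(f"test error: {test_error:.3f}")
```

The per-sample OOB class probabilities are also available as `forest.oob_decision_function_`, which is useful when a metric other than accuracy is needed.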

Advantages of OOB estimation:

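- No need for a separate validation set
- Each sample is predicted only by the roughly 36.8% of trees whose bootstrap sample left it out
- Provides a nearly unbiased (typically slightly pessimistic) estimate of generalization error
- Computed as a by-product of training, at little additional cost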

⚠️

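Common Misconception: OOB error is NOT the same as cross-validation error. OOB scores each sample using only the trees that didn't see it, while CV refits the whole model on each training fold and scores it on the held-out fold. OOB is generally slightly pessimistic compared to CV, because each OOB prediction aggregates only about a third of the trees.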

Feature Importance in Random Forest

Mean Decrease in Impurity (MDI)

Aggregate impurity decrease across all trees:

$$\text{Importance}(j) = \frac{1}{B} \sum_{b=1}^{B} \sum_{t \in T_b} \mathbb{1}[\text{feature } j \text{ used at } t] \cdot \frac{n_t}{n} \cdot \Delta I(t)$$
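scikit-learn exposes MDI as `feature_importances_`. The sketch below checks that the forest-level value is the average of the per-tree values, mirroring the $\frac{1}{B}\sum_b$ in the formula. One difference to keep in mind: scikit-learn normalizes each tree's importances to sum to 1, so the numbers are relative shares rather than raw impurity decreases. The dataset and parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

mdi = forest.feature_importances_  # forest-level MDI, sums to 1
per_tree = np.stack([tree.feature_importances_ for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)  # the 1/B average over trees

for j in np.argsort(mdi)[::-1]:
    print(f"feature {j}: {mdi[j]:.3f}")
```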

Mean Decrease in Accuracy (Permutation Importance)

For each feature $j$ :

a. Compute OOB accuracy $A_j$ with original feature values
b. Randomly permute feature $j$ across OOB samples
c. Compute OOB accuracy $A_j^{\text{perm}}$ with permuted values
d. Importance: $A_j - A_j^{\text{perm}}$

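Permutation importance is often more reliable than MDI because:

- It is not biased toward high-cardinality or continuous features
- It is measured on samples the trees did not train on, so it reflects generalization rather than training fit
- It captures a feature's contribution through interactions with other features

One caveat: permutation importance does not correct for feature correlation. When features are strongly correlated, permuting one of them can understate its importance, because the model can recover the signal from the others.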
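The difference between the two measures shows up clearly on a feature that is pure noise. The sketch below appends such a column to a synthetic dataset and compares MDI with scikit-learn's `permutation_importance`. Note the swap: that function permutes columns of whatever dataset it is given (a held-out split here) rather than the per-tree OOB samples described above. The dataset and parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=5, n_informative=5,
                           n_redundant=0, random_state=0)
rng = np.random.default_rng(0)
X = np.column_stack([X, rng.standard_normal(len(X))])  # column 5: pure noise, unrelated to y

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

mdi = forest.feature_importances_
perm = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)

for j in range(X.shape[1]):
    print(f"feature {j}: MDI={mdi[j]:.3f}  permutation={perm.importances_mean[j]:.3f}")
```

Fully grown trees still split on the noise column, so its MDI is above zero, while its permutation importance on held-out data should hover around zero.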

Hyperparameters

Parameter	Description	Default	Recommendation
`n_estimators`	Number of trees	100	More is better (diminishing returns)
`max_features`	Features per split	sqrt(p) for classification	sqrt(p) or log2(p)
`max_depth`	Tree depth	None (unlimited)	Set for very large datasets
`min_samples_leaf`	Min samples in leaf	1	Increase for noisy data
`bootstrap`	Use bootstrap sampling	True	True for standard RF

💡

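Production Tip: Start with n_estimators. Set it as high as your computational budget allows: more trees almost never hurt accuracy, though returns diminish while training and inference cost grow linearly with the number of trees. Once the tree count is large enough, max_features and min_samples_leaf are usually the parameters most worth tuning.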

Code Implementation

Explanation of Code

- RF vs Single Tree: Shows how Random Forest reduces overfitting by comparing training-test accuracy gaps.
- Number of Trees: Demonstrates diminishing returns as more trees are added.
- OOB Error: Shows how OOB provides a free validation estimate during training.
- Feature Importance: Compares impurity-based and permutation-based importance measures.
- Feature Sampling: Shows how max_features affects performance and tree correlation.
- Tree Correlation: Demonstrates that fewer features per split leads to less correlated trees.
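As a minimal sketch of the first comparison in the list, the following code fits a single fully grown tree and a Random Forest on the same split and reports the gap between training and test accuracy. The synthetic dataset, the 5% label noise (`flip_y`), and the model settings are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
gaps = {}
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)
    gaps[name] = train_acc - test_accuracy[name]  # large gap = overfitting
    print(f"{name}: train={train_acc:.3f} test={test_accuracy[name]:.3f} gap={gaps[name]:.3f}")
```

Both models fit the training set almost perfectly, so the smaller gap for the forest comes from better test accuracy: the averaging removes variance, not training fit.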

Real-World Applications

Google: Search Ranking

Search ranking involves several tasks where Random Forest variants are a natural fit:

- Query Understanding: Classifying search intent
- Spam Detection: Identifying low-quality web pages
- Feature Selection: Identifying important ranking signals

Apple: Siri Voice Recognition

Voice assistants involve classification tasks that tree ensembles such as Random Forest can be applied to:

- Wake Word Detection: "Hey Siri" activation
- Intent Classification: Understanding user commands
- Device Recommendations: Personalized suggestions

💡

Google Interview Tip: Be prepared to discuss the bias-variance decomposition of Random Forest. The key insight is that RF reduces variance while maintaining low bias, making it a robust default choice.

Common Follow-Up Questions

Q1: Why does Random Forest use bootstrap sampling instead of using the full dataset for each tree?

Bootstrap sampling introduces diversity between trees. Without it, and without feature sampling, a deterministic splitting criterion would grow the same tree every time on the full dataset. Bootstrap ensures each tree sees slightly different data, creating diverse predictions, and it also yields the out-of-bag samples used for OOB error estimation.

Q2: What happens when max_features = 1?

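Each split considers only one randomly chosen feature, so the tree still picks the best threshold on that feature but has no say in which feature to split on. This maximizes decorrelation, but individual trees become much weaker because they often miss the best split. In practice, sqrt(p) works well for classification.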

Q3: Can Random Forest handle missing values?

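It depends on the implementation. Older scikit-learn versions cannot and require imputation first; recent versions (1.4 and later) accept NaN values directly. Gradient-boosting libraries such as XGBoost and LightGBM, which also offer random-forest modes, handle missing values by learning the optimal direction (left or right) for missing values at each split.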

Q4: How do you determine the optimal number of trees?

Adding more trees does not cause overfitting (unlike adding rounds in boosting); it only makes the average more stable. Monitor OOB error or validation error and increase the number of trees until it stabilizes. Typically, 100-500 trees suffice for most problems.
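One way to monitor this in scikit-learn is to combine `warm_start=True` with `oob_score=True`, so that each call to `fit` only grows the additional trees and then refreshes the OOB estimate. The tree counts and the synthetic dataset below are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)

forest = RandomForestClassifier(warm_start=True, oob_score=True, random_state=0)
sizes = [50, 100, 200, 400]
oob_errors = []
for n_trees in sizes:
    forest.set_params(n_estimators=n_trees)
    forest.fit(X, y)  # warm_start=True: existing trees are kept, only new ones are grown
    oob_errors.append(1.0 - forest.oob_score_)
    print(f"{n_trees:4d} trees: OOB error = {oob_errors[-1]:.3f}")
```

Once the printed error stops moving by more than its run-to-run noise, extra trees mostly add compute cost.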

Company-Specific Tips

Google Interview Tips

- Discuss computational complexity: $O(B \cdot n \cdot p \cdot \log n)$ for training
- Be ready to explain why RF doesn't overfit with more trees
- Mention parallelization benefits (trees are independent)
- Talk about feature importance for model interpretability

Apple Interview Tips

- Focus on real-time inference requirements
- Discuss model compression techniques
- Be prepared to explain ensemble diversity
- Mention privacy considerations in federated learning
