Random Forest — Complete Guide
Random Forest is an ensemble method that trains many decision trees on resampled data and combines their predictions. It is one of the most widely used ML algorithms and a strong default for tabular data.
How Random Forest Works
Random Forest = Bagging + Random Feature Selection
Step 1: Bootstrap Sampling
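├─ Create N random samples, one per tree, each drawn with replacement and the same size as the training set
├─ Each sample contains ~63% of the distinct original rows (the rest are duplicates)
└─ The other ~37% are left out of that tree (its out-of-bag samples)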
Step 2: Train Decision Tree on Each Sample
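├─ At each split, consider only a random subset of about √p of the p features (classification)
├─ Or about p/3 features (regression); this is a common rule of thumb, though scikit-learn's RandomForestRegressor defaults to all p features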
└─ This decorrelates the trees
Step 3: Aggregate Predictions
├─ Classification: Majority vote
└─ Regression: Average
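The three steps above can be sketched by hand with plain decision trees. This is a minimal illustration rather than how scikit-learn implements its forest: the ensemble size `n_trees`, the seeds, and the synthetic dataset are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
n_trees = 51                      # hypothetical ensemble size (odd, so no tied votes)
n = len(X_train)
trees, unique_fractions = [], []

for t in range(n_trees):
    # Step 1: bootstrap sample, n row indices drawn with replacement
    idx = rng.integers(0, n, size=n)
    unique_fractions.append(len(np.unique(idx)) / n)
    # Step 2: a tree that considers sqrt(p) random features at each split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=t)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Step 3: majority vote over the trees (labels are 0/1 here)
votes = np.stack([tree.predict(X_test) for tree in trees])  # (n_trees, n_test)
forest_pred = (votes.mean(axis=0) >= 0.5).astype(int)

forest_acc = (forest_pred == y_test).mean()
mean_tree_acc = np.mean([(v == y_test).mean() for v in votes])
mean_unique = float(np.mean(unique_fractions))

print(f"Mean distinct-row fraction per bootstrap sample: {mean_unique:.3f}")
print(f"Mean single-tree accuracy: {mean_tree_acc:.3f}")
print(f"Majority-vote accuracy:    {forest_acc:.3f}")
```

The distinct-row fraction should land close to 0.63, and the voted prediction should typically beat the average single tree. Note that scikit-learn's RandomForestClassifier averages the trees' predicted class probabilities instead of counting hard votes, so its predictions can differ slightly from this sketch.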
Why it works:
├─ Each tree is different (bootstrap + random features)
├─ Errors of individual trees partly cancel out when predictions are combined
└─ Combining reduces variance while leaving bias roughly unchanged
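The variance argument can be made concrete with a standard identity. As a simplifying assumption (the symbols are introduced here and are not from the text above), suppose the forest averages B trees whose predictions T_b(x) each have variance σ² and pairwise correlation ρ:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \frac{1}{B^{2}}\left[B\sigma^{2} + B(B-1)\rho\sigma^{2}\right]
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
```

Under this idealized model, adding trees shrinks only the second term, while the first term is a floor set by the correlation between trees. That is the motivation for random feature selection: it lowers ρ, which lowers the floor.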
Python Implementation
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Generate data
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)
# Fix the split seed so the reported accuracy is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# Train
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_split=5,
    min_samples_leaf=2,
    max_features='sqrt',
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)
# Evaluate
y_pred = rf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
# Feature importance
importances = rf.feature_importances_
for i, imp in enumerate(importances):
    print(f"Feature {i}: {imp:.3f}")
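With 20 features, an unsorted list of importances is hard to read. A small sketch of ranking them, with the spread across individual trees as a rough stability indicator, might look like this; `top_k` and the refit on the full synthetic dataset are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X, y)

importances = rf.feature_importances_
# Spread of each feature's importance across the individual trees
std = np.std([tree.feature_importances_ for tree in rf.estimators_], axis=0)

# Feature indices sorted from most to least important
order = np.argsort(importances)[::-1]
top_k = 5  # hypothetical cutoff for display
for rank, i in enumerate(order[:top_k], start=1):
    print(f"{rank}. Feature {i}: {importances[i]:.3f} (+/- {std[i]:.3f})")
```

These are impurity-based importances computed on the training data, so they are best read as a quick ranking. `sklearn.inspection.permutation_importance` on held-out data is a common cross-check.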
Out-of-Bag (OOB) Evaluation
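Each tree is trained on a bootstrap sample containing ~63% of the distinct training rows
The remaining ~37% (that tree's OOB samples) were never seen by it, so every row can be scored using only the trees that did not train on it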
rf = RandomForestClassifier(oob_score=True)
rf.fit(X, y)
print(f"OOB Score: {rf.oob_score_:.3f}")
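Advantage: the OOB score gives a validation-style estimate without setting aside a separate validation set. A held-out test set is still advisable for the final evaluation.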
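One way to see how closely the OOB score tracks a held-out estimate is to compute both on the same model. This is a sketch on synthetic data; the split ratio, seeds, and `n_estimators=200` are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# oob_score=True scores each training row using only trees that did not see it
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

oob_acc = rf.oob_score_               # accuracy estimated from OOB predictions
test_acc = rf.score(X_test, y_test)   # accuracy on the untouched test split

print(f"OOB accuracy:  {oob_acc:.3f}")
print(f"Test accuracy: {test_acc:.3f}")
```

The two numbers usually land close to each other. With very few trees some rows may never be out-of-bag, which makes the OOB estimate unreliable, so it is worth using a reasonably large forest.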
Hyperparameters
n_estimators: Number of trees
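├─ More trees = more stable predictions, at higher training and prediction cost
├─ 100-500 is usually good (scikit-learn's default is 100)
└─ Gains typically flatten out beyond a few hundred trees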
max_depth: Maximum tree depth
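├─ None = default; trees grow until leaves are pure or too small to split (may overfit noisy data)
├─ 10-30 is a common starting range
└─ Deeper = more complex trees with lower bias and higher variance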
min_samples_split: Minimum samples to split
├─ 2 = default (grow fully)
├─ 5-20 = more regularization
└─ Higher = simpler trees
max_features: Features per split
├─ 'sqrt' = √p features (default for classification)
├─ 'log2' = log₂(p) features
└─ A float such as 0.3 = 30% of features (1.0 or None = all features)
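The ranges above can be searched together with cross-validation. The grid below is a deliberately small, hypothetical one built from the values mentioned in this section, not a recommended search space.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Small illustrative grid: 2 x 2 x 2 x 2 = 16 candidate settings
param_grid = {
    "n_estimators": [50, 200],
    "max_depth": [None, 10],
    "min_samples_split": [2, 10],
    "max_features": ["sqrt", 0.3],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,                 # 3-fold cross-validation on the training split
    scoring="accuracy",
    n_jobs=-1,
)
search.fit(X_train, y_train)

best_params = search.best_params_
test_acc = search.score(X_test, y_test)  # best model, refit on all training data

print("Best parameters:", best_params)
print(f"Best CV accuracy: {search.best_score_:.3f}")
print(f"Test accuracy:    {test_acc:.3f}")
```

For larger grids, `RandomizedSearchCV` samples a fixed number of settings and is often cheaper than an exhaustive search.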
Key Takeaways
- Random Forest combines many decision trees for better performance
- Bootstrap sampling + random feature selection decorrelates trees
- Impurity-based feature importances give a quick ranking of which features the trees rely on most
- OOB evaluation provides a validation estimate without a separate hold-out set
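- Robust to overfitting from adding trees: more trees stabilize predictions, though very deep trees can still overfit noisy data
- Needs little preprocessing (no feature scaling); in scikit-learn, categorical features must still be encoded numerically, and native missing-value support exists only in recent versions
- Parallel training: trees are built independently, so n_jobs=-1 can use all cores
- Great baseline: often competitive with more heavily tuned models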