Random Forest — Complete Guide

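Random Forest is an ensemble method that trains many decision trees on random variations of the training data and combines their predictions. It is among the most widely used algorithms for classification and regression on tabular data.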

How Random Forest Works

Random Forest = Bagging + Random Feature Selection

Step 1: Bootstrap Sampling
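├─ Draw one bootstrap sample per tree: n rows sampled with replacement from the n training rows
├─ Each sample contains ~63% of the distinct original rows (the rest are repeats)
└─ The ~37% of rows a tree never sees are its out-of-bag (OOB) samples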
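
The ~63% / ~37% split can be checked numerically. The snippet below is a minimal sketch using NumPy; the dataset size `n_rows` and the variable names are illustrative assumptions, not part of any library API.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 10_000  # hypothetical dataset size, chosen only for illustration

# One bootstrap sample: n_rows row indices drawn with replacement
bootstrap_idx = rng.integers(0, n_rows, size=n_rows)

# Rows drawn at least once are in-bag; rows never drawn are out-of-bag
in_bag = np.unique(bootstrap_idx)
oob_mask = np.ones(n_rows, dtype=bool)
oob_mask[in_bag] = False

unique_fraction = in_bag.size / n_rows
oob_fraction = oob_mask.mean()

print(f"Distinct rows in bootstrap sample: {unique_fraction:.3f}")
print(f"Out-of-bag rows:                   {oob_fraction:.3f}")
print(f"Large-n limit 1 - 1/e:             {1 - np.exp(-1):.3f}")
```

The fraction of distinct rows approaches 1 - 1/e ≈ 0.632 as the dataset grows, which is where the ~63% figure comes from.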

Step 2: Train Decision Tree on Each Sample
├─ At each split, consider only a random subset of ~√p of the p features (classification)
├─ Or ~p/3 features (regression)
└─ This decorrelates the trees, since no single strong feature can dominate every tree

Step 3: Aggregate Predictions
├─ Classification: Majority vote
└─ Regression: Average
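
The three steps above can be sketched by hand with scikit-learn's `DecisionTreeClassifier`. This is a simplified illustration, not how `RandomForestClassifier` is implemented internally; the tree count and variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

rng = np.random.default_rng(42)
n_trees = 25  # illustrative; odd so a binary vote cannot tie
trees = []

for _ in range(n_trees):
    # Step 1: bootstrap sample of the training rows
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Step 2: a tree that considers sqrt(p) random features at each split
    tree = DecisionTreeClassifier(max_features="sqrt",
                                  random_state=int(rng.integers(1_000_000)))
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Step 3: majority vote across trees (labels are 0/1 in this dataset)
all_preds = np.stack([t.predict(X_test) for t in trees])  # (n_trees, n_test)
votes_for_1 = all_preds.sum(axis=0)
forest_pred = (votes_for_1 > n_trees / 2).astype(int)

forest_acc = (forest_pred == y_test).mean()
tree_accs = [(p == y_test).mean() for p in all_preds]
print(f"Mean single-tree accuracy: {np.mean(tree_accs):.3f}")
print(f"Majority-vote accuracy:    {forest_acc:.3f}")
```

Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so its predictions can differ slightly from this sketch.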

Why it works:
├─ Each tree is different (bootstrap + random features)
├─ Errors of individual trees partly cancel out when averaged
└─ Averaging reduces variance while leaving bias roughly unchanged
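
The variance argument can be made concrete with a standard result for averages of correlated variables. It assumes, as a simplification, that the B trees' predictions at a point x are identically distributed with variance σ² and pairwise correlation ρ; the symbols B, T_b, σ², and ρ are notation introduced here for illustration.

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

The second term shrinks as B grows, while the first depends only on ρ. This is why lowering the correlation between trees through random feature selection helps beyond plain bagging.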

Python Implementation

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Generate data
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # fixed seed for a reproducible split

# Train
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_split=5,
    min_samples_leaf=2,
    max_features='sqrt',
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)

# Evaluate
y_pred = rf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")

# Feature importance
importances = rf.feature_importances_
for i, imp in enumerate(importances):
    print(f"Feature {i}: {imp:.3f}")
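
Printing all importances in index order is hard to read with many features. A possible refinement, sketched below, ranks them and reports the spread across individual trees. It refits a forest so that it runs on its own; `top_k` and `std` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X, y)

# Impurity-based importances, normalized to sum to 1
importances = rf.feature_importances_
# Spread of each feature's importance across the individual trees
std = np.std([t.feature_importances_ for t in rf.estimators_], axis=0)

top_k = 5  # illustrative cutoff
ranked = np.argsort(importances)[::-1]  # feature indices, most important first
for i in ranked[:top_k]:
    print(f"Feature {i}: {importances[i]:.3f} +/- {std[i]:.3f}")
```

A large spread relative to the mean suggests the ranking of that feature is not stable from tree to tree.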

Out-of-Bag (OOB) Evaluation

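Each tree is trained on ~63% of the distinct training rows
The remaining ~37% (that tree's OOB samples) were never seen by it, so each row can be scored using only the trees that did not train on it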

# oob_score=True relies on bootstrap sampling (bootstrap=True, the default)
rf = RandomForestClassifier(oob_score=True, random_state=42)
rf.fit(X, y)
print(f"OOB Score: {rf.oob_score_:.3f}")

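Advantage: the OOB score gives a validation-style estimate without setting aside a separate validation set. Keep a held-out test set for the final evaluation.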
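
To see how the OOB estimate relates to a conventional hold-out score, the two can be computed side by side. This is a minimal sketch on synthetic data; the split sizes and tree count are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

oob_acc = rf.oob_score_              # accuracy on out-of-bag rows of the training set
test_acc = rf.score(X_test, y_test)  # accuracy on rows the forest never saw

print(f"OOB accuracy:      {oob_acc:.3f}")
print(f"Hold-out accuracy: {test_acc:.3f}")
```

The two numbers are usually close but not identical, since each OOB prediction uses only the subset of trees that did not train on that row.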

Hyperparameters

n_estimators: Number of trees
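├─ More trees = more stable predictions, at higher training and prediction cost
├─ 100-500 is usually good (scikit-learn's default is 100)
└─ Returns typically diminish beyond a few hundred trees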

max_depth: Maximum tree depth
├─ None = default; trees grow until leaves are pure or too small to split (individual trees may overfit)
├─ 10-30 is usually good
└─ Deeper = more complex

min_samples_split: Minimum samples a node needs before it can be split
├─ 2 = default (grow fully)
├─ 5-20 = more regularization
└─ Higher = simpler trees

max_features: Features per split
├─ 'sqrt' = √p features (the usual choice for classification)
├─ 'log2' = log₂(p) features
└─ A float such as 0.3 = that fraction of the features (here 30%)
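
One way to choose among these values is a randomized search with cross-validation. The sketch below uses scikit-learn's `RandomizedSearchCV`; the candidate values are taken from the ranges above, and the search budget (`n_iter=8`, `cv=3`) is kept small purely so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, random_state=42)

# Candidate values, drawn from the ranges discussed above
param_distributions = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
    "max_features": ["sqrt", "log2", 0.3],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=8,           # number of random combinations to try
    cv=3,               # 3-fold cross-validation per combination
    scoring="accuracy",
    random_state=42,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
```

On real data, a larger `n_iter` and more folds give a more reliable choice, at proportionally higher cost.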

Key Takeaways

Random Forest combines many decision trees for better performance
Bootstrap sampling + random feature selection decorrelates trees
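Feature importance (impurity-based) indicates which features the trees rely on most; it can favor high-cardinality features
OOB evaluation provides a validation estimate at no extra data cost
Adding trees does not cause overfitting, but very deep trees on noisy data still can, so tune depth and leaf size
Needs little preprocessing: no feature scaling required; in scikit-learn, categorical features must be encoded numerically, and native missing-value support depends on the version
Parallel training: trees are independent
Great baseline: often competitive with tuned models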
