Random Forest — Theory, Maths & Complete Python

What Is Random Forest?

Random Forest is an ensemble method that trains many decision trees (typically hundreds), each on a bootstrap sample of the rows and with a random subset of features considered at every split, then aggregates their predictions. It is a strong, low-maintenance baseline for tabular data.

Wisdom of crowds: One tree makes mistakes. 500 trees voting together are much harder to fool, provided their errors are not strongly correlated.

$\hat{y} = \text{mode}\{T_1(x), T_2(x), \ldots, T_B(x)\} \quad \text{(classification)}$

$\hat{y} = \frac{1}{B}\sum_{b=1}^{B} T_b(x) \quad \text{(regression)}$
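The two aggregation rules can be sketched in a few lines of plain Python. The function names below are illustrative only, and the per-tree predictions are made-up inputs rather than output from a fitted model.

```python
from collections import Counter

def aggregate_classification(tree_predictions):
    """Majority vote: return the most common label among the trees' votes."""
    return Counter(tree_predictions).most_common(1)[0][0]

def aggregate_regression(tree_predictions):
    """Average the trees' numeric predictions."""
    return sum(tree_predictions) / len(tree_predictions)

votes = [1, 0, 1, 1, 0]        # hypothetical class votes from B = 5 trees
estimates = [2.0, 3.0, 4.0]    # hypothetical regression outputs from B = 3 trees

print(aggregate_classification(votes))   # 1
print(aggregate_regression(estimates))   # 3.0
```

Note that scikit-learn's RandomForestClassifier averages the per-tree class probabilities and picks the class with the highest mean, rather than counting hard votes. The two rules usually agree but can differ on borderline cases.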

How It Works — Step by Step

TRAINING PHASE:
  For b = 1 to B trees:
    1. Bootstrap sample: draw n samples WITH replacement from training set
       (each bootstrap ~63% unique samples — remaining 37% = OOB samples)
    2. At each node split:
       - Randomly select m features from all p features
       - Find best split among those m features only  ← KEY: feature randomness
    3. Grow tree to maximum depth (no pruning)

  Result: B decorrelated trees {T₁, T₂, ..., T_B}

PREDICTION PHASE:
  Classification: majority vote across all B trees
  Regression:     average prediction across all B trees
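
The steps above can be sketched by hand using scikit-learn's DecisionTreeClassifier as the base learner. This is a minimal illustration, not how RandomForestClassifier is implemented internally: fit_forest and predict_forest are hypothetical helper names, the data is synthetic, and the vote assumes binary 0/1 labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=25, seed=0):
    """Fit n_trees unpruned trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)      # step 1: draw n rows WITH replacement
        tree = DecisionTreeClassifier(
            max_features="sqrt",              # step 2: m = sqrt(p) candidate features per split
            random_state=int(rng.integers(1_000_000)),
        )                                     # step 3: no depth limit, no pruning
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    """Majority vote over the trees (binary 0/1 labels assumed)."""
    votes = np.stack([t.predict(X) for t in trees])   # shape (n_trees, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)

# Synthetic data, used only to exercise the sketch
X_demo, y_demo = make_classification(n_samples=400, n_features=16, random_state=0)
forest = fit_forest(X_demo[:300], y_demo[:300])
demo_acc = (predict_forest(forest, X_demo[300:]) == y_demo[300:]).mean()
print(f"Hold-out accuracy of the hand-rolled forest: {demo_acc:.3f}")
```

An odd number of trees avoids ties in the binary vote.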

Why two sources of randomness?

Randomness Source	What It Does	Effect
Bootstrap sampling	Each tree sees different data	Reduces variance
Feature subsampling	Each split considers random features	Decorrelates trees

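Together they yield a strong, diverse ensemble. Adding more trees does not increase overfitting, but the individual fully grown trees can still fit noise, so very noisy data may benefit from limiting depth or leaf size.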
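
A standard result shows why decorrelation matters. It is added here as a supplement, and the symbols $\sigma^2$ and $\rho$ are not defined elsewhere in this article. Assume every tree's prediction at $x$ has variance $\sigma^2$ and any two trees have pairwise correlation $\rho$. The variance of the averaged prediction is then:

$\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2$

The second term vanishes as $B$ grows, but the first does not. Bootstrap averaging alone therefore stops helping once the trees are highly correlated, and feature subsampling lowers $\rho$ so that adding trees keeps paying off.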

Key Formula: Out-of-Bag (OOB) Error

Each bootstrap sample leaves out ~37% of the data. These Out-of-Bag samples provide a free validation set.

$\text{OOB Error} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\!\left[\hat{y}_i^{\text{OOB}} \neq y_i\right]$

Where $\hat{y}_i^{\text{OOB}}$ is the prediction for observation $i$ using only trees that did not include $i$ in their bootstrap sample.
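
The ~37% figure follows from the sampling scheme: a given row is missed by all $n$ draws with probability $(1 - 1/n)^n$, which approaches $1/e \approx 0.368$ as $n$ grows. The standard-library sketch below checks this by simulation. The helper name unique_fraction is illustrative.

```python
import random

def unique_fraction(n, seed=0):
    """Fraction of distinct rows in one bootstrap sample of size n."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}   # n draws with replacement
    return len(drawn) / n

n = 10_000
in_bag = unique_fraction(n)
closed_form_oob = (1 - 1 / n) ** n                 # approaches 1/e as n grows

print(f"In-bag fraction (simulated): {in_bag:.3f}")
print(f"OOB fraction (simulated):    {1 - in_bag:.3f}")
print(f"OOB fraction (closed form):  {closed_form_oob:.3f}")
```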

Feature Importance

$\text{Importance}(j) = \frac{1}{B}\sum_{b=1}^{B}\sum_{t \in T_b, v(t)=j}\frac{n_t}{n}\left[\text{Gini}(t) - \frac{n_{t_L}}{n_t}\text{Gini}(t_L) - \frac{n_{t_R}}{n_t}\text{Gini}(t_R)\right]$

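Here $v(t)$ is the feature used to split node $t$, $n_t$ is the number of training samples reaching $t$, and $t_L$, $t_R$ are its left and right children. Gini impurity at node $t$: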

$\text{Gini}(t) = 1 - \sum_{k=1}^{K} p_k^2$

Where $p_k$ is the fraction of class $k$ samples at node $t$.
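
As a small worked example, the Gini formula and the weighted impurity decrease for a single split can be computed directly. The function names and the toy labels below are illustrative. scikit-learn computes the same quantity internally and normalizes the resulting importances to sum to 1.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2, with p_k the class fractions at the node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right, n_total):
    """One split's contribution to importance: (n_t / n) * [Gini(t) - weighted child Gini]."""
    n_t = len(parent)
    drop = (gini(parent)
            - len(left) / n_t * gini(left)
            - len(right) / n_t * gini(right))
    return n_t / n_total * drop

parent = [0, 0, 0, 1, 1, 1]            # toy node with a 50/50 class mix
left, right = [0, 0, 0], [1, 1, 1]     # a perfect split

print(gini(parent))                                 # 0.5
print(impurity_decrease(parent, left, right, 6))    # 0.5
```

Summing these contributions over every node that splits on feature $j$, then averaging over trees, gives the importance of $j$.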

Complete Python Implementation

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import (accuracy_score, classification_report,
                             roc_curve, auc, mean_squared_error, r2_score)
from sklearn.inspection import permutation_importance
import warnings
# Hide scikit-learn warnings (e.g. the OOB warning for very small forests in section 4)
warnings.filterwarnings('ignore')

# ── 1. Classification ─────────────────────────────────────────────────
cancer = load_breast_cancer()
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y = cancer.target  # 0=malignant, 1=benign

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train Random Forest
rf = RandomForestClassifier(
    n_estimators=200,       # number of trees
    max_features='sqrt',    # √p features per split (default for classification)
    max_depth=None,         # grow full trees
    min_samples_split=2,
    min_samples_leaf=1,
    bootstrap=True,
    oob_score=True,         # enable OOB evaluation
    n_jobs=-1,              # use all CPU cores
    random_state=42,
)
rf.fit(X_train, y_train)

# OOB score (no need for separate validation set!)
print(f"OOB Score       : {rf.oob_score_:.4f}")
print(f"Test Accuracy   : {accuracy_score(y_test, rf.predict(X_test)):.4f}")
print(f"\nClassification Report:\n{classification_report(y_test, rf.predict(X_test), target_names=['Malignant','Benign'])}")

# ── 2. Feature Importance Analysis ────────────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Gini importance (built-in)
gini_imp = pd.Series(rf.feature_importances_, index=X.columns)
gini_imp.nlargest(15).sort_values().plot(kind='barh', ax=axes[0],
                                          color='steelblue', edgecolor='white')
axes[0].set_title('Feature Importance (Gini / MDI)', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Mean Decrease in Impurity')

# Permutation importance (more reliable)
perm_imp = permutation_importance(rf, X_test, y_test,
                                   n_repeats=10, random_state=42, n_jobs=-1)
perm_df = pd.DataFrame({
    'importance': perm_imp.importances_mean,
    'std':        perm_imp.importances_std,
}, index=X.columns).nlargest(15, 'importance').sort_values('importance')

perm_df['importance'].plot(kind='barh', ax=axes[1],
                           xerr=perm_df['std'],
                           color='#10b981', edgecolor='white', capsize=3)
axes[1].set_title('Feature Importance (Permutation)', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Mean Decrease in Accuracy')

plt.tight_layout()
plt.show()

# ── 3. ROC Curve ──────────────────────────────────────────────────────
y_proba = rf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(7, 6))
plt.plot(fpr, tpr, color='#3b82f6', linewidth=2.5,
         label=f'Random Forest (AUC = {roc_auc:.4f})')
plt.plot([0,1],[0,1], color='gray', linestyle='--', linewidth=1)
plt.fill_between(fpr, tpr, alpha=0.1, color='#3b82f6')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve — Random Forest')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# ── 4. Effect of n_estimators on OOB Error ───────────────────────────
n_trees = range(1, 201, 5)
oob_errors = []

for n in n_trees:
    rf_tmp = RandomForestClassifier(n_estimators=n, oob_score=True,
                                    n_jobs=-1, random_state=42)
    rf_tmp.fit(X_train, y_train)
    oob_errors.append(1 - rf_tmp.oob_score_)

plt.figure(figsize=(9, 4))
plt.plot(list(n_trees), oob_errors, color='#ef4444', linewidth=2)
plt.axvline(x=100, color='gray', linestyle='--', alpha=0.7, label='n=100')
plt.xlabel('Number of Trees')
plt.ylabel('OOB Error Rate')
plt.title('OOB Error vs Number of Trees')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# ── 5. Hyperparameter Tuning (Randomized Search) ─────────────────────
param_dist = {
    'n_estimators':     [50, 100, 200, 300],
    'max_features':     ['sqrt', 'log2', 0.3, 0.5],
    'max_depth':        [None, 5, 10, 20],
    'min_samples_split':[2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

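rscv = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),  # OOB scoring is not needed inside CV
    param_distributions=param_dist,
    n_iter=20,             # number of sampled parameter combinations
    cv=5,
    scoring='f1',          # F1 for the positive class (label 1 = benign)
    n_jobs=-1,             # parallelise across candidates and folds
    random_state=42,
    verbose=0,
)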
rscv.fit(X_train, y_train)

print("Best Parameters:", rscv.best_params_)
print(f"Best CV F1: {rscv.best_score_:.4f}")
print(f"Test F1:    {rscv.score(X_test, y_test):.4f}")

Sample Output:

OOB Score       : 0.9692
Test Accuracy   : 0.9737

Best Parameters: {'n_estimators': 300, 'min_samples_split': 2,
                  'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': None}
Best CV F1: 0.9771
Test F1:    0.9806

Regression with Random Forest

from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame=True)
X_r = housing.data
y_r = housing.target

X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
    X_r, y_r, test_size=0.2, random_state=42
)

rfr = RandomForestRegressor(
    n_estimators=200,
    max_features=0.33,   # ≈ p/3, the classic regression heuristic; scikit-learn's default is 1.0 (all features)
    oob_score=True,
    n_jobs=-1,
    random_state=42,
)
rfr.fit(X_train_r, y_train_r)
y_pred_r = rfr.predict(X_test_r)

rmse = np.sqrt(mean_squared_error(y_test_r, y_pred_r))
r2   = r2_score(y_test_r, y_pred_r)
print(f"RMSE : {rmse:.4f}")
print(f"R²   : {r2:.4f}")
print(f"OOB R²: {rfr.oob_score_:.4f}")  # for a regressor, oob_score_ is R², not an error rate

# Actual vs Predicted
plt.figure(figsize=(7, 6))
plt.scatter(y_test_r, y_pred_r, alpha=0.3, color='steelblue', s=10)
plt.plot([y_test_r.min(), y_test_r.max()],
         [y_test_r.min(), y_test_r.max()],
         'r--', linewidth=2, label='Perfect prediction')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title(f'Actual vs Predicted  (R² = {r2:.3f})')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Hyperparameter Cheat Sheet

Parameter	Default	Effect	Tune If
`n_estimators`	100	More trees → lower variance	Always — try 100–500
`max_features`	`sqrt` (classifier), 1.0 (regressor)	Fewer features → more diversity	High-dim data
`max_depth`	None	Deeper → lower bias, higher variance	Overfitting
`min_samples_leaf`	1	Higher → smoother predictions	Noisy data
`max_samples`	None (draw n rows)	Bootstrap sample size per tree	Large datasets
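
To make the `max_features` settings concrete, the sketch below converts each option into a feature count per split. The helper features_per_split is hypothetical. Its rounding (truncate, then take at least 1) reflects my understanding of how scikit-learn resolves these values, so check the documentation for your version.

```python
import math

def features_per_split(max_features, p):
    """Number of candidate features per split, given p total features (illustrative helper)."""
    if max_features is None:
        return p                                   # consider every feature
    if max_features == "sqrt":
        return max(1, int(math.sqrt(p)))
    if max_features == "log2":
        return max(1, int(math.log2(p)))
    if isinstance(max_features, float):
        return max(1, int(max_features * p))       # interpreted as a fraction of p
    return int(max_features)                       # an explicit count

p = 30  # the breast cancer dataset used above has 30 features
for setting in ["sqrt", "log2", 0.5, None]:
    print(f"max_features={setting!r:7} -> {features_per_split(setting, p)} of {p} features")
```

Smaller values make the trees more diverse but individually weaker, which is the trade-off to tune on high-dimensional data.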

When to Use Random Forest

✅ Use RF When	❌ Consider Alternatives When
Tabular data, mixed types	Sequential/text/image data
Need feature importance	Need full interpretability
Missing values (native support only in recent scikit-learn releases; otherwise impute first)	Ultra-fast inference needed
Medium-sized datasets	Millions of features
Baseline before XGBoost	XGBoost already outperforms

Key Takeaways

Two randomness sources — bootstrap + feature subsets — create diverse, decorrelated trees
OOB score is a nearly free validation estimate that behaves much like cross-validation; use it
Permutation importance is more reliable than Gini importance for correlated features
n_estimators: more trees rarely hurt accuracy, but they cost training time and memory; returns typically diminish after a few hundred trees
Random Forest is an excellent default algorithm before trying gradient boosting