Random Forest — Theory, Maths & Complete Python

What Is Random Forest?

Random Forest is an ensemble method that trains many decision trees (typically hundreds), each on a bootstrap sample of the training data and with only a random subset of features considered at each split, then aggregates their predictions. It is a strong, robust baseline for tabular data.

Wisdom of crowds: One tree makes mistakes. 500 trees voting together are much harder to fool.

$$\hat{y} = \text{mode}\{T_1(x), T_2(x), \ldots, T_B(x)\} \quad \text{(classification)}$$

$$\hat{y} = \frac{1}{B}\sum_{b=1}^{B} T_b(x) \quad \text{(regression)}$$
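As a small illustration of the two aggregation rules, the sketch below applies them to hypothetical per-tree predictions (the arrays `votes` and `values` are made up for the example, and class labels are assumed to be integers 0..K-1). Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so its output can differ slightly from a strict mode.

```python
import numpy as np

def majority_vote(tree_preds):
    """Mode over trees. tree_preds has shape (B, n_samples), integer labels 0..K-1."""
    tree_preds = np.asarray(tree_preds)
    n_classes = tree_preds.max() + 1
    # counts[k, i] = number of trees that voted class k for sample i
    counts = np.stack([(tree_preds == k).sum(axis=0) for k in range(n_classes)])
    return counts.argmax(axis=0)  # ties resolve to the smaller label

def average_prediction(tree_preds):
    """Mean over trees. tree_preds has shape (B, n_samples), real-valued outputs."""
    return np.asarray(tree_preds, dtype=float).mean(axis=0)

# Three hypothetical trees, three samples (classification)
votes = np.array([[0, 1, 1],
                  [0, 1, 0],
                  [1, 1, 0]])
print(majority_vote(votes))        # [0 1 0]

# Three hypothetical trees, two samples (regression)
values = np.array([[2.0, 3.0],
                   [4.0, 5.0],
                   [6.0, 7.0]])
print(average_prediction(values))  # [4. 5.]
```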


How It Works — Step by Step

TRAINING PHASE:
  For b = 1 to B trees:
    1. Bootstrap sample: draw n samples WITH replacement from training set
       (each bootstrap sample contains ~63% of the distinct training rows; the remaining ~37% are that tree's OOB samples)
    2. At each node split:
       - Randomly select m features from all p features
       - Find best split among those m features only  ← KEY: feature randomness
    3. Grow tree to maximum depth (no pruning)

  Result: B decorrelated trees {T₁, T₂, ..., T_B}

PREDICTION PHASE:
  Classification: majority vote across all B trees
  Regression:     average prediction across all B trees
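The training and prediction phases above can be sketched in a few lines on top of scikit-learn's `DecisionTreeClassifier`. This is a minimal illustration, not how `RandomForestClassifier` is implemented internally: the helper names `fit_forest` and `predict_forest` are hypothetical, labels are assumed to be integers 0..K-1, and prediction uses a hard majority vote.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=50, seed=0):
    """Train n_trees unpruned trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)      # step 1: n draws WITH replacement
        tree = DecisionTreeClassifier(
            max_features="sqrt",              # step 2: m = sqrt(p) candidate features per split
            max_depth=None,                   # step 3: grow fully, no pruning
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    """Majority vote across trees (assumes integer labels 0..K-1)."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)   # (B, n_samples)
    n_classes = votes.max() + 1
    counts = np.stack([(votes == k).sum(axis=0) for k in range(n_classes)])
    return counts.argmax(axis=0)

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

forest = fit_forest(X_tr, y_tr, n_trees=50, seed=0)
accuracy = float(np.mean(predict_forest(forest, X_te) == y_te))
print(f"Hand-rolled forest accuracy: {accuracy:.4f}")
```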

Why two sources of randomness?

| Randomness Source | What It Does | Effect |
|---|---|---|
| Bootstrap sampling | Each tree sees different data | Reduces variance |
| Feature subsampling | Each split considers random features | Decorrelates trees |

Both together = a strong, diverse ensemble that is far less prone to overfitting than a single fully grown tree.
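Why decorrelation matters can be made concrete with a standard result: if B identically distributed predictors each have variance σ² and pairwise correlation ρ, their average has variance ρσ² + (1 − ρ)σ²/B. Adding trees shrinks only the second term, so lowering ρ is what lets the ensemble keep improving. The sketch below checks the formula with synthetic Gaussian "tree outputs" (an assumption made for illustration, not real trees).

```python
import numpy as np

def variance_of_average(rho, sigma2, B):
    """Theoretical variance of the mean of B equally correlated predictors."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / B

def simulate_variance(rho, sigma2, B, n_draws=100_000, seed=0):
    """Empirical variance of the mean, using a shared + private noise model."""
    rng = np.random.default_rng(seed)
    shared = rng.normal(size=(n_draws, 1))    # component common to all trees
    private = rng.normal(size=(n_draws, B))   # component unique to each tree
    outputs = np.sqrt(sigma2) * (np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * private)
    return float(outputs.mean(axis=1).var())

for rho in (0.0, 0.3, 0.8):
    theory = variance_of_average(rho, sigma2=1.0, B=20)
    empirical = simulate_variance(rho, sigma2=1.0, B=20)
    print(f"rho={rho:.1f}  theory={theory:.4f}  simulated={empirical:.4f}")
```

With ρ = 0.8 the variance barely drops below that of a single predictor, however many trees are averaged.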


Key Formula: Out-of-Bag (OOB) Error

Each bootstrap sample leaves out ~37% of the data. These Out-of-Bag samples provide a free validation set.
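The ~37% figure follows from the chance that a given row is never drawn in n draws with replacement: (1 − 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. A quick simulation, with the helper name `unique_fraction` chosen here for illustration, confirms the complementary ~63.2% of distinct rows per bootstrap sample.

```python
import numpy as np

def unique_fraction(n, n_repeats=200, seed=0):
    """Average fraction of distinct rows in a bootstrap sample of size n."""
    rng = np.random.default_rng(seed)
    fractions = [
        np.unique(rng.integers(0, n, size=n)).size / n
        for _ in range(n_repeats)
    ]
    return float(np.mean(fractions))

n = 1000
theory = 1 - (1 - 1 / n) ** n      # exact expected fraction for this n
limit = 1 - np.exp(-1)             # large-n limit, about 0.632
print(f"simulated={unique_fraction(n):.4f}  theory={theory:.4f}  limit={limit:.4f}")
```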

$$\text{OOB Error} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\!\left[\hat{y}_i^{\text{OOB}} \neq y_i\right]$$

Where $\hat{y}_i^{\text{OOB}}$ is the prediction for observation $i$ using only trees that did not include $i$ in their bootstrap sample.
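The OOB error can be recomputed by hand from a fitted scikit-learn forest. The sketch below assumes `oob_score=True` was set, which exposes `oob_decision_function_` (class probabilities averaged over only the trees that did not see each row). It fits on the full breast cancer dataset purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

# Row i of oob_decision_function_ uses only trees whose bootstrap sample excluded row i.
oob_pred = rf.classes_[np.argmax(rf.oob_decision_function_, axis=1)]

# Indicator average from the formula: fraction of rows whose OOB prediction is wrong.
oob_error = float(np.mean(oob_pred != y))

print(f"Manual OOB error : {oob_error:.4f}")
print(f"1 - oob_score_   : {1 - rf.oob_score_:.4f}")
```

Because scikit-learn aggregates probabilities instead of hard votes, this matches `1 - rf.oob_score_` rather than a strict majority-vote count.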


Feature Importance

$$\text{Importance}(j) = \frac{1}{B}\sum_{b=1}^{B}\sum_{t \in T_b,\, v(t)=j}\frac{n_t}{n}\left[\text{Gini}(t) - \frac{n_{t_L}}{n_t}\text{Gini}(t_L) - \frac{n_{t_R}}{n_t}\text{Gini}(t_R)\right]$$

Here $v(t)$ is the feature used to split node $t$, $n_t$ is the number of samples reaching $t$, and $t_L$, $t_R$ are its left and right children.

Gini impurity at node $t$:

$$\text{Gini}(t) = 1 - \sum_{k=1}^{K} p_k^2$$

Where $p_k$ is the fraction of class $k$ samples at node $t$.
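A worked example may help: the sketch below computes Gini impurity and the weighted impurity decrease contributed by a single split, i.e. one term of the inner sum in the importance formula. The function names and toy labels are illustrative. scikit-learn's `feature_importances_` additionally normalizes the totals so they sum to 1.

```python
import numpy as np

def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 for the class labels at one node."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    return float(1.0 - np.sum(p ** 2))

def weighted_gini_decrease(parent, left, right, n_total):
    """(n_t / n) * [Gini(t) - (n_tL / n_t) Gini(t_L) - (n_tR / n_t) Gini(t_R)]."""
    n_t = len(parent)
    decrease = (gini(parent)
                - len(left) / n_t * gini(left)
                - len(right) / n_t * gini(right))
    return n_t / n_total * decrease

# A node holding 6 of the 12 training samples, split perfectly into two pure children
parent = [0, 0, 0, 1, 1, 1]
left, right = [0, 0, 0], [1, 1, 1]

print(gini(parent))                                             # 0.5
print(weighted_gini_decrease(parent, left, right, n_total=12))  # 0.25
```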


Complete Python Implementation

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import (train_test_split, cross_val_score,
                                     GridSearchCV, RandomizedSearchCV)
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_curve, auc,
                             mean_squared_error, r2_score)
from sklearn.preprocessing import StandardScaler
from sklearn.inspection import permutation_importance
import warnings
warnings.filterwarnings('ignore')

# ── 1. Classification ─────────────────────────────────────────────────
cancer = load_breast_cancer()
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y = cancer.target  # 0=malignant, 1=benign

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train Random Forest
rf = RandomForestClassifier(
    n_estimators=200,       # number of trees
    max_features='sqrt',    # √p features per split (default for classification)
    max_depth=None,         # grow full trees
    min_samples_split=2,
    min_samples_leaf=1,
    bootstrap=True,
    oob_score=True,         # enable OOB evaluation
    n_jobs=-1,              # use all CPU cores
    random_state=42,
)
rf.fit(X_train, y_train)

# OOB score (no need for separate validation set!)
print(f"OOB Score       : {rf.oob_score_:.4f}")
print(f"Test Accuracy   : {accuracy_score(y_test, rf.predict(X_test)):.4f}")
print(f"\nClassification Report:\n{classification_report(y_test, rf.predict(X_test), target_names=['Malignant','Benign'])}")

# ── 2. Feature Importance Analysis ────────────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Gini importance (built-in)
gini_imp = pd.Series(rf.feature_importances_, index=X.columns)
gini_imp.nlargest(15).sort_values().plot(kind='barh', ax=axes[0],
                                          color='steelblue', edgecolor='white')
axes[0].set_title('Feature Importance (Gini / MDI)', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Mean Decrease in Impurity')

# Permutation importance (more reliable)
perm_imp = permutation_importance(rf, X_test, y_test,
                                   n_repeats=10, random_state=42, n_jobs=-1)
perm_df = pd.DataFrame({
    'importance': perm_imp.importances_mean,
    'std':        perm_imp.importances_std,
}, index=X.columns).nlargest(15, 'importance').sort_values('importance')

perm_df['importance'].plot(kind='barh', ax=axes[1],
                           xerr=perm_df['std'],
                           color='#10b981', edgecolor='white', capsize=3)
axes[1].set_title('Feature Importance (Permutation)', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Mean Decrease in Accuracy')

plt.tight_layout()
plt.show()

# ── 3. ROC Curve ──────────────────────────────────────────────────────
y_proba = rf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(7, 6))
plt.plot(fpr, tpr, color='#3b82f6', linewidth=2.5,
         label=f'Random Forest (AUC = {roc_auc:.4f})')
plt.plot([0,1],[0,1], color='gray', linestyle='--', linewidth=1)
plt.fill_between(fpr, tpr, alpha=0.1, color='#3b82f6')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve — Random Forest')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# ── 4. Effect of n_estimators on OOB Error ───────────────────────────
n_trees = range(15, 201, 5)  # start at 15 trees so nearly every sample has at least one OOB tree
oob_errors = []

for n in n_trees:
    rf_tmp = RandomForestClassifier(n_estimators=n, oob_score=True,
                                    n_jobs=-1, random_state=42)
    rf_tmp.fit(X_train, y_train)
    oob_errors.append(1 - rf_tmp.oob_score_)

plt.figure(figsize=(9, 4))
plt.plot(list(n_trees), oob_errors, color='#ef4444', linewidth=2)
plt.axvline(x=100, color='gray', linestyle='--', alpha=0.7, label='n=100')
plt.xlabel('Number of Trees')
plt.ylabel('OOB Error Rate')
plt.title('OOB Error vs Number of Trees')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# ── 5. Hyperparameter Tuning (Randomized Search) ─────────────────────
param_dist = {
    'n_estimators':     [50, 100, 200, 300],
    'max_features':     ['sqrt', 'log2', 0.3, 0.5],
    'max_depth':        [None, 5, 10, 20],
    'min_samples_split':[2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

rscv = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),  # the search's n_jobs handles parallelism; OOB is not used by CV
    param_distributions=param_dist,
    n_iter=20,
    cv=5,
    scoring='f1',
    n_jobs=-1,
    random_state=42,
    verbose=0,
)
rscv.fit(X_train, y_train)

print("Best Parameters:", rscv.best_params_)
print(f"Best CV F1: {rscv.best_score_:.4f}")
print(f"Test F1:    {rscv.score(X_test, y_test):.4f}")

Sample Output:

OOB Score       : 0.9692
Test Accuracy   : 0.9737

Best Parameters: {'n_estimators': 300, 'min_samples_split': 2,
                  'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': None}
Best CV F1: 0.9771
Test F1:    0.9806
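The OOB-error loop in step 4 refits a new forest for every tree count. A cheaper variant, sketched below, sets `warm_start=True` so each call to `fit` only adds trees to the existing ensemble. It uses the full breast cancer dataset to stay self-contained, and the names `rf_ws` and `tree_counts` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# warm_start=True keeps already-fitted trees and trains only the additional ones
rf_ws = RandomForestClassifier(warm_start=True, oob_score=True, random_state=42)

tree_counts = list(range(25, 201, 25))
oob_errors = []
for n in tree_counts:
    rf_ws.set_params(n_estimators=n)
    rf_ws.fit(X, y)                          # grows the forest to n trees
    oob_errors.append(1 - rf_ws.oob_score_)  # OOB score is refreshed after each growth step

for n, err in zip(tree_counts, oob_errors):
    print(f"{n:>3} trees  OOB error = {err:.4f}")
```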

Regression with Random Forest

from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame=True)
X_r = housing.data
y_r = housing.target

X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
    X_r, y_r, test_size=0.2, random_state=42
)

rfr = RandomForestRegressor(
    n_estimators=200,
    max_features=0.33,   # ~p/3 is a common heuristic for regression; scikit-learn's default is 1.0 (all features)
    oob_score=True,
    n_jobs=-1,
    random_state=42,
)
rfr.fit(X_train_r, y_train_r)
y_pred_r = rfr.predict(X_test_r)

rmse = np.sqrt(mean_squared_error(y_test_r, y_pred_r))
r2   = r2_score(y_test_r, y_pred_r)
print(f"RMSE : {rmse:.4f}")
print(f"R²   : {r2:.4f}")
print(f"OOB R²: {rfr.oob_score_:.4f}")

# Actual vs Predicted
plt.figure(figsize=(7, 6))
plt.scatter(y_test_r, y_pred_r, alpha=0.3, color='steelblue', s=10)
plt.plot([y_test_r.min(), y_test_r.max()],
         [y_test_r.min(), y_test_r.max()],
         'r--', linewidth=2, label='Perfect prediction')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title(f'Actual vs Predicted  (R² = {r2:.3f})')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
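To connect the regressor back to the averaging formula, the sketch below checks that a fitted forest's prediction equals the mean of its individual trees' predictions, and reports the spread across trees as a rough, uncalibrated indication of disagreement. It uses the bundled `load_diabetes` dataset instead of California housing only so that it runs without a download.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True)
rfr_demo = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

X_new = X[:5]
# One row of predictions per tree: shape (B, n_samples)
per_tree = np.stack([tree.predict(X_new) for tree in rfr_demo.estimators_])

forest_mean = per_tree.mean(axis=0)   # (1/B) * sum_b T_b(x)
spread = per_tree.std(axis=0)         # disagreement between trees, not a calibrated interval

print("Mean of trees :", np.round(forest_mean, 2))
print("rfr.predict   :", np.round(rfr_demo.predict(X_new), 2))
print("Std over trees:", np.round(spread, 2))
```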

Hyperparameter Cheat Sheet

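| Parameter | Default | Effect | Tune If |
|---|---|---|---|
| n_estimators | 100 | More trees → lower variance | Always; try 100–500 |
| max_features | 'sqrt' (classifier), 1.0 (regressor) | Fewer features → more diversity | High-dim data |
| max_depth | None | Deeper → lower bias, higher variance | Overfitting |
| min_samples_leaf | 1 | Higher → smoother predictions | Noisy data |
| max_samples | None (draw n samples) | Number or fraction of rows drawn per bootstrap sample | Large datasets |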

When to Use Random Forest

| ✅ Use RF When | ❌ Consider Alternatives When |
|---|---|
| Tabular data, mixed types | Sequential/text/image data |
| Need feature importance | Need full interpretability |
| Missing values (accepted natively in recent scikit-learn releases; impute otherwise) | Ultra-fast inference needed |
| Medium-sized datasets | Millions of features |
| Baseline before XGBoost | XGBoost already outperforms |

Key Takeaways

  1. Two randomness sources — bootstrap + feature subsets — create diverse, decorrelated trees
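  2. OOB score is a nearly free estimate of generalization performance, similar in spirit to cross-validation, so use it
  3. Permutation importance on held-out data is generally more reliable than Gini importance, which is computed on training data and biased toward high-cardinality features; both can be distorted by strongly correlated features
  4. n_estimators: more trees rarely hurt accuracy, but returns diminish (often after a few hundred trees) while training and inference cost keep growing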
  5. Random Forest is an excellent default algorithm before trying gradient boosting
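One way to see why takeaway 3 favors permutation importance is to append a pure-noise column and compare the two measures on it. In the sketch below (an illustrative experiment, not a benchmark), Gini/MDI importance is computed from training-set splits and can give the noise column a small positive score, while permutation importance on held-out data should stay close to zero.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
X_noisy = np.column_stack([X, rng.normal(size=X.shape[0])])  # last column is pure noise

X_tr, X_te, y_tr, y_te = train_test_split(
    X_noisy, y, test_size=0.3, random_state=0, stratify=y
)
rf_noise = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Gini / MDI importance of the noise column (derived from training data only)
mdi_noise = rf_noise.feature_importances_[-1]

# Permutation importance of the noise column, measured on held-out data
perm = permutation_importance(rf_noise, X_te, y_te, n_repeats=10, random_state=0)
perm_noise = perm.importances_mean[-1]

print(f"MDI importance of noise column        : {mdi_noise:.4f}")
print(f"Permutation importance of noise column: {perm_noise:.4f}")
```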
