What Is Random Forest?
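Random Forest is an ensemble method that trains many decision trees (typically hundreds), each on a bootstrap sample of the data with a random subset of features considered at every split, then aggregates their predictions. It is one of the most robust general-purpose algorithms for tabular data and usually works well with little tuning.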
Wisdom of crowds: One tree makes mistakes. 500 trees voting together are much harder to fool.
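To put a number on the voting intuition, here is a minimal sketch under an idealized assumption: every tree is right with the same probability and the trees err independently. Real trees in a forest are correlated, so the gain is smaller in practice. The function name `majority_correct_prob` and the 60% per-tree accuracy are illustrative choices, not part of any library.

```python
from math import comb


def majority_correct_prob(n_voters: int, p_correct: float) -> float:
    """Probability that more than half of n independent voters are right.

    Assumes an odd number of voters (no ties) and independent errors.
    """
    k_min = n_voters // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(k_min, n_voters + 1)
    )


# Hypothetical per-tree accuracy of 60%, with growing (odd) ensemble sizes.
for n in (1, 5, 25, 101, 501):
    print(f"{n:>3} voters -> P(majority correct) = {majority_correct_prob(n, 0.6):.4f}")
```

The probability climbs toward 1 as voters are added, which is why reducing the correlation between trees matters so much.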
How It Works — Step by Step
TRAINING PHASE:
  For b = 1 to B trees:
    1. Bootstrap sample: draw n samples WITH replacement from training set
       (each bootstrap sample holds ~63% of the unique rows; the remaining ~37% are its OOB samples)
    2. At each node split:
       - Randomly select m features from all p features
       - Find best split among those m features only  ← KEY: feature randomness
    3. Grow tree to maximum depth (no pruning)
  Result: B decorrelated trees {T₁, T₂, ..., T_B}

PREDICTION PHASE:
  Classification: majority vote across all B trees
  Regression: average prediction across all B trees
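The steps above can be sketched from scratch using scikit-learn's `DecisionTreeClassifier` as the base learner. This is a teaching sketch, not how `RandomForestClassifier` is implemented: the helper names `fit_forest` and `predict_forest` are hypothetical, and the hard majority vote follows the pseudocode (scikit-learn's forest averages class probabilities instead).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def fit_forest(X, y, n_trees=50, max_features="sqrt", seed=0):
    """Train n_trees unpruned trees, each on its own bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # step 1: draw n rows WITH replacement
        tree = DecisionTreeClassifier(
            max_features=max_features,    # step 2: m random features per split
            max_depth=None,               # step 3: grow fully, no pruning
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees


def predict_forest(trees, X):
    """Hard majority vote over the trees (labels must be 0..K-1 integers)."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)  # shape (B, n)
    return np.array([np.bincount(col).argmax() for col in votes.T])


X_all, y_all = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all
)
trees = fit_forest(X_tr, y_tr)
acc = float(np.mean(predict_forest(trees, X_te) == y_te))
print(f"Hand-rolled forest test accuracy: {acc:.4f}")
```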
Why two sources of randomness?
| Randomness Source | What It Does | Effect |
|---|---|---|
| Bootstrap sampling | Each tree sees different data | Reduces variance |
| Feature subsampling | Each split considers random features | Decorrelates trees |
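Both together = a strong, diverse ensemble that resists overfitting: adding more trees does not increase generalization error, although very deep trees can still fit noise on small or noisy datasets.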
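Why decorrelation matters can be made concrete with a standard result: the average of B predictors, each with variance σ² and pairwise correlation ρ, has variance ρσ² + (1 − ρ)σ²/B. More trees shrink only the second term, so the correlation sets the floor. The sketch below checks that formula by simulation; the function names and the Gaussian "predictions" are illustrative assumptions, not real tree outputs.

```python
import numpy as np


def ensemble_variance(rho: float, sigma2: float, n_trees: int) -> float:
    """Variance of the mean of n_trees equally correlated predictors."""
    return rho * sigma2 + (1 - rho) * sigma2 / n_trees


def simulated_variance(rho: float, n_trees: int, n_trials: int = 50_000, seed: int = 0) -> float:
    """Simulate unit-variance predictors that share a common component."""
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal((n_trials, 1))      # part common to all trees
    own = rng.standard_normal((n_trials, n_trees))   # part unique to each tree
    preds = np.sqrt(rho) * shared + np.sqrt(1 - rho) * own  # pairwise corr = rho
    return float(preds.mean(axis=1).var())


for rho in (0.1, 0.6):
    theory = ensemble_variance(rho, 1.0, 25)
    sim = simulated_variance(rho, 25)
    print(f"rho={rho}: formula={theory:.4f}, simulated={sim:.4f}")
```

With 25 trees the weakly correlated ensemble ends up with far lower variance than the strongly correlated one, which is the job feature subsampling does.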
Key Formula: Out-of-Bag (OOB) Error
Each bootstrap sample leaves out ~37% of the data. These Out-of-Bag samples provide a free validation set.
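$$\text{OOB Error} = \frac{1}{n} \sum_{i=1}^{n} L\left(y_i, \hat{y}_i^{\text{OOB}}\right)$$
Where $\hat{y}_i^{\text{OOB}}$ is the prediction for observation $i$ using only trees that did not include $i$ in their bootstrap sample, and $L$ is the loss (misclassification for classification, squared error for regression).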
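As a rough illustration of the OOB idea, the sketch below computes the OOB error by hand: each tree votes only on the rows it never saw, and the per-row votes are aggregated at the end. It also records the fraction of rows left out of each bootstrap sample, which should sit near (1 − 1/n)ⁿ ≈ e⁻¹ ≈ 0.37. Variable names are illustrative, and scikit-learn's own `oob_score_` aggregates class probabilities rather than hard votes, so the numbers need not match exactly.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X_oob, y_oob = load_breast_cancer(return_X_y=True)
n, n_trees = X_oob.shape[0], 100
rng = np.random.default_rng(0)

votes = np.zeros((n, 2))   # OOB vote counts per row for classes 0 and 1
oob_fractions = []

for b in range(n_trees):
    idx = rng.integers(0, n, size=n)   # bootstrap sample (with replacement)
    oob_mask = np.ones(n, dtype=bool)
    oob_mask[idx] = False              # rows never drawn are out-of-bag
    oob_fractions.append(oob_mask.mean())

    tree = DecisionTreeClassifier(max_features="sqrt", random_state=b)
    tree.fit(X_oob[idx], y_oob[idx])
    pred = tree.predict(X_oob[oob_mask])
    votes[np.flatnonzero(oob_mask), pred] += 1   # this tree votes on its OOB rows only

has_vote = votes.sum(axis=1) > 0       # rows that were OOB for at least one tree
oob_pred = votes.argmax(axis=1)
oob_error = float(np.mean(oob_pred[has_vote] != y_oob[has_vote]))
mean_oob_fraction = float(np.mean(oob_fractions))

print(f"Mean OOB fraction per tree: {mean_oob_fraction:.3f}")
print(f"Manual OOB error          : {oob_error:.4f}")
```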
Feature Importance
Gini impurity at node $t$:
$$G(t) = 1 - \sum_{k=1}^{K} p_{k,t}^{2}$$
Where $p_{k,t}$ is the fraction of class $k$ samples at node $t$.
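A small sketch of the Gini calculation may help. It computes the impurity of a node and the impurity decrease produced by one split; impurity-based (MDI) importance accumulates such decreases, weighted by node size, over every node where a feature is used. The function names `gini` and `impurity_decrease` are illustrative, not scikit-learn API.

```python
import numpy as np


def gini(labels) -> float:
    """Gini impurity 1 - sum_k p_k^2 of the class labels in one node."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p**2))


def impurity_decrease(parent, left, right) -> float:
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)


parent = [0, 0, 0, 1, 1, 1]
print(gini(parent))                                       # evenly mixed two-class node
print(impurity_decrease(parent, [0, 0, 0], [1, 1, 1]))    # perfect split
print(impurity_decrease(parent, [0, 0, 1], [0, 1, 1]))    # weak split
```

A split that separates the classes perfectly recovers the whole parent impurity, while a weak split recovers only a small part of it.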
Complete Python Implementation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import (train_test_split, cross_val_score,
                                     GridSearchCV, RandomizedSearchCV)
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_curve, auc,
                             mean_squared_error, r2_score)
from sklearn.preprocessing import StandardScaler
from sklearn.inspection import permutation_importance
import warnings
warnings.filterwarnings('ignore')
# ── 1. Classification ─────────────────────────────────────────────────
cancer = load_breast_cancer()
X = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y = cancer.target # 0=malignant, 1=benign
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Train Random Forest
rf = RandomForestClassifier(
    n_estimators=200,        # number of trees
    max_features='sqrt',     # √p features per split (default for classification)
    max_depth=None,          # grow full trees
    min_samples_split=2,
    min_samples_leaf=1,
    bootstrap=True,
    oob_score=True,          # enable OOB evaluation
    n_jobs=-1,               # use all CPU cores
    random_state=42,
)
rf.fit(X_train, y_train)
# OOB score (no need for separate validation set!)
print(f"OOB Score : {rf.oob_score_:.4f}")
print(f"Test Accuracy : {accuracy_score(y_test, rf.predict(X_test)):.4f}")
print(f"\nClassification Report:\n{classification_report(y_test, rf.predict(X_test), target_names=['Malignant','Benign'])}")
# ── 2. Feature Importance Analysis ────────────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# Gini importance (built-in)
gini_imp = pd.Series(rf.feature_importances_, index=X.columns)
gini_imp.nlargest(15).sort_values().plot(kind='barh', ax=axes[0],
                                         color='steelblue', edgecolor='white')
axes[0].set_title('Feature Importance (Gini / MDI)', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Mean Decrease in Impurity')
# Permutation importance (computed on held-out data, usually more reliable)
perm_imp = permutation_importance(rf, X_test, y_test,
                                  n_repeats=10, random_state=42, n_jobs=-1)
perm_df = pd.DataFrame({
    'importance': perm_imp.importances_mean,
    'std': perm_imp.importances_std,
}, index=X.columns).nlargest(15, 'importance').sort_values('importance')
perm_df['importance'].plot(kind='barh', ax=axes[1],
                           xerr=perm_df['std'],
                           color='#10b981', edgecolor='white', capsize=3)
axes[1].set_title('Feature Importance (Permutation)', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Mean Decrease in Accuracy')
plt.tight_layout()
plt.show()
# ── 3. ROC Curve ──────────────────────────────────────────────────────
y_proba = rf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(7, 6))
plt.plot(fpr, tpr, color='#3b82f6', linewidth=2.5,
         label=f'Random Forest (AUC = {roc_auc:.4f})')
plt.plot([0,1],[0,1], color='gray', linestyle='--', linewidth=1)
plt.fill_between(fpr, tpr, alpha=0.1, color='#3b82f6')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve — Random Forest')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# ── 4. Effect of n_estimators on OOB Error ───────────────────────────
n_trees = range(1, 201, 5)
oob_errors = []
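# Refit a fresh forest for each size; very small forests leave some rows
# without any OOB prediction, so the first few points are noisy.
for n in n_trees:
    rf_tmp = RandomForestClassifier(n_estimators=n, oob_score=True,
                                    n_jobs=-1, random_state=42)
    rf_tmp.fit(X_train, y_train)
    oob_errors.append(1 - rf_tmp.oob_score_)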
plt.figure(figsize=(9, 4))
plt.plot(list(n_trees), oob_errors, color='#ef4444', linewidth=2)
plt.axvline(x=100, color='gray', linestyle='--', alpha=0.7, label='n=100')
plt.xlabel('Number of Trees')
plt.ylabel('OOB Error Rate')
plt.title('OOB Error vs Number of Trees')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# ── 5. Hyperparameter Tuning (Randomized Search) ─────────────────────
param_dist = {
    'n_estimators': [50, 100, 200, 300],
    'max_features': ['sqrt', 'log2', 0.3, 0.5],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
rscv = RandomizedSearchCV(
    RandomForestClassifier(oob_score=True, n_jobs=-1, random_state=42),
    param_distributions=param_dist,
    n_iter=20,
    cv=5,
    scoring='f1',
    n_jobs=-1,
    random_state=42,
    verbose=0,
)
rscv.fit(X_train, y_train)
print("Best Parameters:", rscv.best_params_)
print(f"Best CV F1: {rscv.best_score_:.4f}")
print(f"Test F1: {rscv.score(X_test, y_test):.4f}")
Sample Output:
OOB Score : 0.9692
Test Accuracy : 0.9737
Best Parameters: {'n_estimators': 300, 'min_samples_split': 2,
'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': None}
Best CV F1: 0.9771
Test F1: 0.9806
Regression with Random Forest
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing(as_frame=True)
X_r = housing.data
y_r = housing.target
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
    X_r, y_r, test_size=0.2, random_state=42
)
rfr = RandomForestRegressor(
    n_estimators=200,
    max_features=0.33,   # ≈ p/3, the classic recommendation for regression
                         # (scikit-learn's own default is 1.0, i.e. all features)
    oob_score=True,      # for regressors, oob_score_ is an R² value
    n_jobs=-1,
    random_state=42,
)
rfr.fit(X_train_r, y_train_r)
y_pred_r = rfr.predict(X_test_r)
rmse = np.sqrt(mean_squared_error(y_test_r, y_pred_r))
r2 = r2_score(y_test_r, y_pred_r)
print(f"RMSE : {rmse:.4f}")
print(f"R² : {r2:.4f}")
print(f"OOB : {rfr.oob_score_:.4f}")
# Actual vs Predicted
plt.figure(figsize=(7, 6))
plt.scatter(y_test_r, y_pred_r, alpha=0.3, color='steelblue', s=10)
plt.plot([y_test_r.min(), y_test_r.max()],
         [y_test_r.min(), y_test_r.max()],
         'r--', linewidth=2, label='Perfect prediction')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title(f'Actual vs Predicted (R² = {r2:.3f})')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
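To make the regression aggregation rule concrete, the following sketch checks that a fitted `RandomForestRegressor` predicts the plain mean of its individual trees' predictions. It uses a small synthetic dataset from `make_regression` (an assumption made to keep the example fast); the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Stack each tree's predictions: shape (n_trees, n_samples)
per_tree = np.stack([tree.predict(X_demo) for tree in forest.estimators_])
manual_mean = per_tree.mean(axis=0)

matches = bool(np.allclose(manual_mean, forest.predict(X_demo)))
print("Forest prediction equals mean of tree predictions:", matches)
print("Std. dev. across trees for the first row:", per_tree[:, 0].std())
```

The spread across trees for a single row gives a rough, uncalibrated sense of how much the individual trees disagree.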
Hyperparameter Cheat Sheet
| Parameter | Default | Effect | Tune If |
|---|---|---|---|
| n_estimators | 100 | More trees → lower variance | Always; try 100–500 |
| max_features | 'sqrt' (classifier), 1.0 (regressor) | Fewer features → more diversity | High-dim data |
| max_depth | None | Deeper → lower bias, higher variance | Overfitting |
| min_samples_leaf | 1 | Higher → smoother predictions | Noisy data |
| max_samples | None (draw n samples) | Bootstrap sample size or fraction | Large datasets |
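Defaults have shifted between scikit-learn releases, so it can be worth checking them against the installed version instead of trusting any table. A minimal sketch using `get_params()`, which every scikit-learn estimator provides:

```python
import sklearn
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

keys = ["n_estimators", "max_features", "max_depth", "min_samples_leaf", "max_samples"]

# Read the defaults straight from freshly constructed estimators.
clf_defaults = {k: RandomForestClassifier().get_params()[k] for k in keys}
reg_defaults = {k: RandomForestRegressor().get_params()[k] for k in keys}

print("scikit-learn version:", sklearn.__version__)
for k in keys:
    print(f"{k:<18} classifier={clf_defaults[k]!r:<8} regressor={reg_defaults[k]!r}")
```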
When to Use Random Forest
| ✅ Use RF When | ❌ Consider Alternatives When |
|---|---|
| Tabular data, mixed types | Sequential/text/image data |
| Need feature importance | Need full interpretability |
| Missing values (recent scikit-learn versions handle NaN natively; older ones need imputation) | Ultra-fast inference needed |
| Medium-sized datasets | Millions of features |
| Baseline before XGBoost | XGBoost already outperforms |
Key Takeaways
- Two randomness sources — bootstrap + feature subsets — create diverse, decorrelated trees
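- OOB score is a free validation estimate, similar in spirit to cross-validation; use it
- Permutation importance on held-out data is usually more reliable than Gini (MDI) importance, which is computed on training data and biased toward high-cardinality features; both can be diluted when features are strongly correlated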
- n_estimators — more is almost always better; diminishing returns after ~200 trees
- Random Forest is an excellent default algorithm before trying gradient boosting
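One way to see why impurity-based importance deserves caution is to append a column of pure noise and compare the two measures. This is a sketch under assumptions: the noise column, split sizes, and variable names are illustrative, and the exact numbers will vary with the seed and scikit-learn version.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_bc, y_bc = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
# Append one column of random numbers that carries no signal (last column).
X_noisy = np.column_stack([X_bc, rng.standard_normal(X_bc.shape[0])])

X_tr_n, X_te_n, y_tr_n, y_te_n = train_test_split(
    X_noisy, y_bc, test_size=0.3, random_state=0, stratify=y_bc
)
model = RandomForestClassifier(n_estimators=200, random_state=0, n_jobs=-1)
model.fit(X_tr_n, y_tr_n)

# MDI is accumulated from training-set splits, so fully grown trees
# can still assign some credit to the noise column.
mdi_noise = float(model.feature_importances_[-1])

# Permutation importance measures the drop in held-out accuracy when shuffled.
perm = permutation_importance(model, X_te_n, y_te_n, n_repeats=10, random_state=0)
perm_noise = float(perm.importances_mean[-1])

print(f"MDI importance of noise column        : {mdi_noise:.4f}")
print(f"Permutation importance of noise column: {perm_noise:.4f}")
```

Shuffling the noise column should barely move held-out accuracy, whereas its MDI score stays above zero because deep trees split on it while fitting the training data.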