Random Forest + Feature Importance

Overview

Random Forest is an ensemble learning method that builds multiple decision trees and combines their predictions. It addresses the overfitting problem of single decision trees while maintaining high accuracy.

Ensemble Learning

The Wisdom of Crowds

Individual trees may make errors, but combining many trees reduces variance without increasing bias:

\text{Var}(\bar{X}) = \frac{\sigma^2}{n}

ℹ️ Ensemble Variance Reduction

When combining $n$ independent models each with variance $\sigma^2$ , the ensemble variance is $\sigma^2/n$ . This is the mathematical foundation for why bagging reduces overfitting. In practice, trees are not fully independent (they share training data), so the actual variance reduction is less dramatic but still substantial.

Architecture Diagram

Single Decision Tree          Random Forest (100 trees)
+---+---+---+---+           +---+---+---+---+
| A | A | B | B |           | A | A | A | B |
+---+---+---+---+    -->    +---+---+---+---+
| A | B | B | B |           | A | B | B | B |
+---+---+---+---+           +---+---+---+---+

Accuracy: 85%               Accuracy: 92%
Variance: High              Variance: Low
Overfitting risk            Robust predictions

Bagging (Bootstrap Aggregating)

Algorithm

Create $B$ bootstrap samples from training data
Train a decision tree on each bootstrap sample
Aggregate predictions (majority vote or averaging)

DfBootstrap Sampling

Given dataset $D$ of size $n$ , create sample $D_b$ by sampling $n$ times with replacement:

D_b = \{x_{i_1}, x_{i_2}, \dots, x_{i_n}\}, \quad i_j \sim \text{Uniform}(1, n)

Properties:

Each bootstrap sample contains ~63.2% unique samples
~36.8% samples are out-of-bag (OOB) for each tree
Different trees see different data → diversity

Architecture Diagram

Original Dataset D: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Bootstrap Sample 1: [2, 3, 3, 5, 7, 7, 8, 9, 10, 10]
Bootstrap Sample 2: [1, 1, 2, 4, 4, 6, 7, 8, 9, 9]
Bootstrap Sample 3: [1, 3, 5, 5, 6, 7, 8, 8, 10, 10]

Each tree trained on different subset!

Mathematical Formulation

For $B$ trees, the bagging prediction:

Bagging Prediction

\hat{f}_{\text{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B} T_b(x)

Here,

$B$ =Number of trees
$T_b(x)$ =Prediction from tree b

Classification: Majority vote

\hat{y} = \arg\max_k \sum_{b=1}^{B} \mathbf{1}(T_b(x) = k)

Regression: Average

\hat{y} = \frac{1}{B}\sum_{b=1}^{B} T_b(x)

Random Forest Algorithm

Key Innovation: Random Feature Selection

At each split, only consider a random subset of features:

Features Per Split

m = \begin{cases} \lfloor\sqrt{p}\rfloor & \text{Classification} \\ \lfloor p/3 \rfloor & \text{Regression} \end{cases}

Here,

$p$ =Total number of features
$m$ =Features to consider at each split

Why Random Features?

Architecture Diagram

Without Feature Randomness:        With Feature Randomness:
+----+----+----+----+             +----+----+----+----+
| T1 | T2 | T3 | T4 |             | T1 | T2 | T3 | T4 |
+----+----+----+----+             +----+----+----+----+
| X1 | X1 | X1 | X1 |             | X1 | X3 | X2 | X4 |
| X2 | X2 | X2 | X2 |             | X3 | X1 | X4 | X2 |
+----+----+----+----+             +----+----+----+----+

All trees use same strong feature     Trees use different features
Trees are correlated                  Trees are diverse
Ensemble has less benefit             Ensemble has more benefit

Complete Algorithm

Architecture Diagram

Input: Training data D = {(x₁,y₁), ..., (xₙ,yₙ)}
Parameters: B = number of trees, m = features per split

For b = 1 to B:
    1. Create bootstrap sample D_b from D
    2. Grow tree T_b on D_b:
       - At each node:
         * Select m random features
         * Find best split among those features
         * Split node into two children
       - Stop when: min_samples_leaf reached or impurity = 0
    3. Do NOT prune the tree

Output: Forest {T₁, T₂, ..., T_B}
Prediction: Majority vote (classification) or average (regression)

Python Implementation

Basic Random Forest

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report, r2_score
from sklearn.datasets import make_classification, make_regression

X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_redundant=5, n_classes=2, random_state=42
)

feature_names = [f'Feature_{i}' for i in range(X.shape[1])]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

rf = RandomForestClassifier(
    n_estimators=100, max_depth=None, min_samples_split=2,
    min_samples_leaf=1, max_features='sqrt', bootstrap=True,
    oob_score=True, random_state=42, n_jobs=-1
)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
print("Random Forest Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"OOB Score: {rf.oob_score_:.4f}")

Out-of-Bag (OOB) Error

Concept

Each tree uses ~63.2% of data; the remaining ~36.8% (OOB) can be used for validation:

Architecture Diagram

Original Data: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Tree 1: Training on [2, 3, 5, 7, 8, 10]  OOB: [1, 4, 6, 9]
Tree 2: Training on [1, 2, 4, 6, 7, 9]  OOB: [3, 5, 8, 10]
Tree 3: Training on [1, 3, 5, 7, 8, 10] OOB: [2, 4, 6, 9]

For each sample, only use trees that didn't see it in training

OOB Score Calculation

For sample $i$ , predict using only trees where $i$ was OOB:

OOB Score

\text{OOB Score} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}(\hat{y}_i^{\text{OOB}} = y_i)

Here,

$\hat{y}_i^{\text{OOB}}$ =OOB prediction for sample i
$y_i$ =True label for sample i

OOB vs Cross-Validation

rf_with_oob = RandomForestClassifier(
    n_estimators=200, oob_score=True, random_state=42
)
rf_with_oob.fit(X_train, y_train)

print(f"OOB Score: {rf_with_oob.oob_score_:.4f}")

cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=42),
    X_train, y_train, cv=5, scoring='accuracy'
)
print(f"CV Accuracy: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

Feature Importance

Mean Decrease in Impurity (MDI)

MDI Feature Importance

\text{Importance}(j) = \frac{1}{B}\sum_{b=1}^{B} \sum_{t \in \text{split on } j \text{ in } T_b} p(t) \cdot \Delta I(t)

Here,

$B$ =Number of trees
$p(t)$ =Proportion of samples at node t
$\Delta I(t)$ =Impurity decrease at node t

importances = rf.feature_importances_
std = np.std([tree.feature_importances_ for tree in rf.estimators_], axis=0)
indices = np.argsort(importances)[::-1]

print("Feature Importance Ranking:")
print("-" * 40)
for i, idx in enumerate(indices[:10]):
    print(f"{i+1:2d}. {feature_names[idx]:15s} {importances[idx]:.4f} ± {std[idx]:.4f}")

Permutation Importance (More Reliable)

Measures importance by shuffling each feature and measuring accuracy decrease:

Permutation Importance

\text{PermImportance}(j) = \text{Score}(X) - \text{Score}(X_{\pi_j})

Here,

$X_{\pi_j}$ =X with feature j randomly permuted

from sklearn.inspection import permutation_importance

perm_importance = permutation_importance(
    rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1
)

perm_indices = perm_importance.importances_mean.argsort()[::-1]

print("\nPermutation Importance:")
print("-" * 40)
for i, idx in enumerate(perm_indices[:10]):
    print(f"{i+1:2d}. {feature_names[idx]:15s} {perm_importance.importances_mean[idx]:.4f} "
          f"± {perm_importance.importances_std[idx]:.4f}")

MDI vs Permutation Importance

Aspect	MDI (Gini)	Permutation
Speed	Fast (during training)	Slow (requires re-evaluation)
Bias	Biased toward high-cardinality features	Less biased
Correlated features	Can be misleading	More reliable
Interpretability	Less intuitive	More intuitive

Hyperparameter Tuning

Key Parameters

Architecture Diagram

Random Forest Hyperparameters
├── n_estimators (Number of trees)
│   └── More trees → better performance but slower
│       Sweet spot: 100-500
│
├── max_features (Features per split)
│   ├── Classification: sqrt(p) or log2(p)
│   └── Regression: p/3
│
├── max_depth (Tree depth)
│   ├── None: fully grown trees (default)
│   └── Limit: reduces overfitting
│
├── min_samples_leaf (Leaf size)
│   ├── 1: fully grown (default)
│   └── Increase: smoother boundaries
│
└── bootstrap (Sampling method)
    ├── True: standard bagging (default)
    └── False: use all data for each tree

Grid Search Example

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None]
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_grid,
    n_iter=50, cv=5, scoring='accuracy',
    random_state=42, n_jobs=-1
)
random_search.fit(X_train, y_train)

print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV accuracy: {random_search.best_score_:.4f}")

Complete Real-World Example

📝Credit Card Fraud Detection with Random Forest

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    classification_report, confusion_matrix, roc_auc_score,
    average_precision_score
)
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
n = 10000

amount = np.random.exponential(100, n)
time_since_last = np.random.exponential(2, n)
distance_from_home = np.random.exponential(20, n)
num_transactions_24h = np.random.poisson(5, n)
merchant_risk = np.random.beta(2, 5, n)

fraud_score = (
    0.3 * (amount > 500).astype(int) +
    0.2 * (time_since_last < 0.5).astype(int) +
    0.25 * (distance_from_home > 50).astype(int) +
    0.15 * (num_transactions_24h > 10).astype(int) +
    0.1 * (merchant_risk > 0.6).astype(int) +
    np.random.normal(0, 0.1, n)
)
is_fraud = (fraud_score > 0.6).astype(int)

df = pd.DataFrame({
    'amount': amount, 'time_since_last': time_since_last,
    'distance_from_home': distance_from_home,
    'num_transactions_24h': num_transactions_24h,
    'merchant_risk': merchant_risk, 'is_fraud': is_fraud
})

X = df.drop('is_fraud', axis=1)
y = df['is_fraud']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

rf_fraud = RandomForestClassifier(
    n_estimators=200, max_depth=10, min_samples_leaf=5,
    class_weight='balanced', random_state=42, n_jobs=-1, oob_score=True
)
rf_fraud.fit(X_train_scaled, y_train)

y_pred = rf_fraud.predict(X_test_scaled)
y_prob = rf_fraud.predict_proba(X_test_scaled)[:, 1]

print("=== Fraud Detection Model ===")
print(f"OOB Score: {rf_fraud.oob_score_:.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
print(f"Average Precision: {average_precision_score(y_test, y_prob):.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Legitimate', 'Fraud']))

Comparison with Other Algorithms

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
import time

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
}

print("Algorithm Comparison:")
print("-" * 65)
print(f"{'Algorithm':25s} {'CV Accuracy':15s} {'Training Time':15s}")
print("-" * 65)

for name, model in models.items():
    start = time.time()
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    train_time = time.time() - start
    print(f"{name:25s} {scores.mean():.4f} ± {scores.std():.4f} {train_time:.3f}s")

When to Use Random Forest

Advantages

Advantage	Description
Robust	Less prone to overfitting than single trees
No scaling needed	Tree-based, invariant to feature scale
Handles missing values	Some implementations support NaN
Feature importance	Built-in feature ranking
Parallel training	Each tree is independent
Out-of-bag score	Free validation estimate

Disadvantages

Disadvantage	Description
Less interpretable	Hard to explain 100 trees vs 1
Memory intensive	Stores many trees
Slower prediction	Must query all trees
Can overfit	With very noisy data
Biased toward high-cardinality features	MDI importance biased

Key Takeaways

📋Summary: Random Forest

Bagging + Random Features = Random Forest; reduces variance while maintaining low bias
Bootstrap sampling means each tree sees ~63.2% of data; remaining ~36.8% is OOB
OOB score provides free validation without separate validation set
Feature importance via MDI (fast) or permutation (more reliable)
n_estimators: More trees is almost always better (diminishing returns after ~200)
max_features: Use $\sqrt{p}$ for classification, $p/3$ for regression
class_weight='balanced' handles imbalanced datasets
Parallelizable: Trees are independent, can use all CPU cores

Practice Exercises

Exercise 1: OOB Validation

from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)

# Train with OOB score
# Train with 5-fold CV
# Are they similar? Why or why not?

Exercise 2: Feature Importance Stability

# Train 10 Random Forests with different random seeds
# Collect feature importances from each
# Calculate mean and std of each feature's importance
# Plot the results
# Which features are consistently important?

Exercise 3: Hyperparameter Sensitivity

n_estimators = [10, 50, 100, 200, 500]
max_features = ['sqrt', 'log2', 0.1, 0.3, 0.5]

# Plot accuracy heatmap
# Which combination works best?

Exercise 4: Compare Ensemble Methods

from sklearn.ensemble import (
    RandomForestClassifier, GradientBoostingClassifier,
    AdaBoostClassifier, BaggingClassifier
)

# Compare:
# a) Random Forest
# b) Gradient Boosting
# c) AdaBoost
# d) BaggingClassifier with DecisionTree base
# Which performs best? Why?

Discussion Questions

Why does Random Forest reduce overfitting compared to single decision trees?
How would you explain Random Forest to a business stakeholder?
When might Gradient Boosting be preferred over Random Forest?

Random Forest + Feature Importance

Random Forest + Feature Importance

Overview

Ensemble Learning

The Wisdom of Crowds

Bagging (Bootstrap Aggregating)

Algorithm

DfBootstrap Sampling

Mathematical Formulation

Bagging Prediction

Random Forest Algorithm

Key Innovation: Random Feature Selection

Features Per Split

Why Random Features?

Complete Algorithm

Python Implementation

Basic Random Forest

Out-of-Bag (OOB) Error

Concept

OOB Score Calculation

OOB Score

OOB vs Cross-Validation

Feature Importance

Mean Decrease in Impurity (MDI)

MDI Feature Importance

Permutation Importance (More Reliable)

Permutation Importance

MDI vs Permutation Importance

Hyperparameter Tuning

Key Parameters

Grid Search Example

Complete Real-World Example

📝Credit Card Fraud Detection with Random Forest

Comparison with Other Algorithms

When to Use Random Forest

Advantages

Disadvantages

Key Takeaways

📋Summary: Random Forest

Practice Exercises

Exercise 1: OOB Validation

Exercise 2: Feature Importance Stability

Exercise 3: Hyperparameter Sensitivity

Exercise 4: Compare Ensemble Methods

Discussion Questions

Need Expert Data Science Help?