Random Forest + Feature Importance

Module 2: Machine LearningFree Lesson

Advertisement

Random Forest + Feature Importance

Overview

Random Forest is an ensemble learning method that builds multiple decision trees and combines their predictions. It addresses the overfitting problem of single decision trees while maintaining high accuracy.


Ensemble Learning

The Wisdom of Crowds

Individual trees may make errors, but combining many trees reduces variance without increasing bias:

Var(Xˉ)=σ2n\text{Var}(\bar{X}) = \frac{\sigma^2}{n}

ℹ️ Ensemble Variance Reduction

When combining nn independent models each with variance σ2\sigma^2, the ensemble variance is σ2/n\sigma^2/n. This is the mathematical foundation for why bagging reduces overfitting. In practice, trees are not fully independent (they share training data), so the actual variance reduction is less dramatic but still substantial.

Architecture Diagram
Single Decision Tree          Random Forest (100 trees)
+---+---+---+---+           +---+---+---+---+
| A | A | B | B |           | A | A | A | B |
+---+---+---+---+    -->    +---+---+---+---+
| A | B | B | B |           | A | B | B | B |
+---+---+---+---+           +---+---+---+---+

Accuracy: 85%               Accuracy: 92%
Variance: High              Variance: Low
Overfitting risk            Robust predictions

Bagging (Bootstrap Aggregating)

Algorithm

  1. Create BB bootstrap samples from training data
  2. Train a decision tree on each bootstrap sample
  3. Aggregate predictions (majority vote or averaging)

DfBootstrap Sampling

Given dataset DD of size nn, create sample DbD_b by sampling nn times with replacement:

Db={xi1,xi2,,xin},ijUniform(1,n)D_b = \{x_{i_1}, x_{i_2}, \dots, x_{i_n}\}, \quad i_j \sim \text{Uniform}(1, n)

Properties:

  • Each bootstrap sample contains ~63.2% unique samples
  • ~36.8% samples are out-of-bag (OOB) for each tree
  • Different trees see different data → diversity
Architecture Diagram
Original Dataset D: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Bootstrap Sample 1: [2, 3, 3, 5, 7, 7, 8, 9, 10, 10]
Bootstrap Sample 2: [1, 1, 2, 4, 4, 6, 7, 8, 9, 9]
Bootstrap Sample 3: [1, 3, 5, 5, 6, 7, 8, 8, 10, 10]

Each tree trained on different subset!

Mathematical Formulation

For BB trees, the bagging prediction:

Bagging Prediction

f^bag(x)=1Bb=1BTb(x)\hat{f}_{\text{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B} T_b(x)

Here,

  • BB=Number of trees
  • Tb(x)T_b(x)=Prediction from tree b

Classification: Majority vote

y^=argmaxkb=1B1(Tb(x)=k)\hat{y} = \arg\max_k \sum_{b=1}^{B} \mathbf{1}(T_b(x) = k)

Regression: Average

y^=1Bb=1BTb(x)\hat{y} = \frac{1}{B}\sum_{b=1}^{B} T_b(x)

Random Forest Algorithm

Key Innovation: Random Feature Selection

At each split, only consider a random subset of features:

Features Per Split

m={pClassificationp/3Regressionm = \begin{cases} \lfloor\sqrt{p}\rfloor & \text{Classification} \\ \lfloor p/3 \rfloor & \text{Regression} \end{cases}

Here,

  • pp=Total number of features
  • mm=Features to consider at each split

Why Random Features?

Architecture Diagram
Without Feature Randomness:        With Feature Randomness:
+----+----+----+----+             +----+----+----+----+
| T1 | T2 | T3 | T4 |             | T1 | T2 | T3 | T4 |
+----+----+----+----+             +----+----+----+----+
| X1 | X1 | X1 | X1 |             | X1 | X3 | X2 | X4 |
| X2 | X2 | X2 | X2 |             | X3 | X1 | X4 | X2 |
+----+----+----+----+             +----+----+----+----+

All trees use same strong feature     Trees use different features
Trees are correlated                  Trees are diverse
Ensemble has less benefit             Ensemble has more benefit

Complete Algorithm

Architecture Diagram
Input: Training data D = {(x₁,y₁), ..., (xₙ,yₙ)}
Parameters: B = number of trees, m = features per split

For b = 1 to B:
    1. Create bootstrap sample D_b from D
    2. Grow tree T_b on D_b:
       - At each node:
         * Select m random features
         * Find best split among those features
         * Split node into two children
       - Stop when: min_samples_leaf reached or impurity = 0
    3. Do NOT prune the tree

Output: Forest {T₁, T₂, ..., T_B}
Prediction: Majority vote (classification) or average (regression)

Python Implementation

Basic Random Forest

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report, r2_score
from sklearn.datasets import make_classification, make_regression

X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10,
    n_redundant=5, n_classes=2, random_state=42
)

feature_names = [f'Feature_{i}' for i in range(X.shape[1])]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

rf = RandomForestClassifier(
    n_estimators=100, max_depth=None, min_samples_split=2,
    min_samples_leaf=1, max_features='sqrt', bootstrap=True,
    oob_score=True, random_state=42, n_jobs=-1
)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
print("Random Forest Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"OOB Score: {rf.oob_score_:.4f}")

Out-of-Bag (OOB) Error

Concept

Each tree uses ~63.2% of data; the remaining ~36.8% (OOB) can be used for validation:

Architecture Diagram
Original Data: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

Tree 1: Training on [2, 3, 5, 7, 8, 10]  OOB: [1, 4, 6, 9]
Tree 2: Training on [1, 2, 4, 6, 7, 9]  OOB: [3, 5, 8, 10]
Tree 3: Training on [1, 3, 5, 7, 8, 10] OOB: [2, 4, 6, 9]

For each sample, only use trees that didn't see it in training

OOB Score Calculation

For sample ii, predict using only trees where ii was OOB:

OOB Score

OOB Score=1ni=1n1(y^iOOB=yi)\text{OOB Score} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}(\hat{y}_i^{\text{OOB}} = y_i)

Here,

  • y^iOOB\hat{y}_i^{\text{OOB}}=OOB prediction for sample i
  • yiy_i=True label for sample i

OOB vs Cross-Validation

rf_with_oob = RandomForestClassifier(
    n_estimators=200, oob_score=True, random_state=42
)
rf_with_oob.fit(X_train, y_train)

print(f"OOB Score: {rf_with_oob.oob_score_:.4f}")

cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=42),
    X_train, y_train, cv=5, scoring='accuracy'
)
print(f"CV Accuracy: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

Feature Importance

Mean Decrease in Impurity (MDI)

MDI Feature Importance

Importance(j)=1Bb=1Btsplit on j in Tbp(t)ΔI(t)\text{Importance}(j) = \frac{1}{B}\sum_{b=1}^{B} \sum_{t \in \text{split on } j \text{ in } T_b} p(t) \cdot \Delta I(t)

Here,

  • BB=Number of trees
  • p(t)p(t)=Proportion of samples at node t
  • ΔI(t)\Delta I(t)=Impurity decrease at node t
importances = rf.feature_importances_
std = np.std([tree.feature_importances_ for tree in rf.estimators_], axis=0)
indices = np.argsort(importances)[::-1]

print("Feature Importance Ranking:")
print("-" * 40)
for i, idx in enumerate(indices[:10]):
    print(f"{i+1:2d}. {feature_names[idx]:15s} {importances[idx]:.4f} ± {std[idx]:.4f}")

Permutation Importance (More Reliable)

Measures importance by shuffling each feature and measuring accuracy decrease:

Permutation Importance

PermImportance(j)=Score(X)Score(Xπj)\text{PermImportance}(j) = \text{Score}(X) - \text{Score}(X_{\pi_j})

Here,

  • XπjX_{\pi_j}=X with feature j randomly permuted
from sklearn.inspection import permutation_importance

perm_importance = permutation_importance(
    rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1
)

perm_indices = perm_importance.importances_mean.argsort()[::-1]

print("\nPermutation Importance:")
print("-" * 40)
for i, idx in enumerate(perm_indices[:10]):
    print(f"{i+1:2d}. {feature_names[idx]:15s} {perm_importance.importances_mean[idx]:.4f} "
          f"± {perm_importance.importances_std[idx]:.4f}")

MDI vs Permutation Importance

AspectMDI (Gini)Permutation
SpeedFast (during training)Slow (requires re-evaluation)
BiasBiased toward high-cardinality featuresLess biased
Correlated featuresCan be misleadingMore reliable
InterpretabilityLess intuitiveMore intuitive

Hyperparameter Tuning

Key Parameters

Architecture Diagram
Random Forest Hyperparameters
├── n_estimators (Number of trees)
│   └── More trees → better performance but slower
│       Sweet spot: 100-500
│
├── max_features (Features per split)
│   ├── Classification: sqrt(p) or log2(p)
│   └── Regression: p/3
│
├── max_depth (Tree depth)
│   ├── None: fully grown trees (default)
│   └── Limit: reduces overfitting
│
├── min_samples_leaf (Leaf size)
│   ├── 1: fully grown (default)
│   └── Increase: smoother boundaries
│
└── bootstrap (Sampling method)
    ├── True: standard bagging (default)
    └── False: use all data for each tree

Grid Search Example

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None]
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_grid,
    n_iter=50, cv=5, scoring='accuracy',
    random_state=42, n_jobs=-1
)
random_search.fit(X_train, y_train)

print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV accuracy: {random_search.best_score_:.4f}")

Complete Real-World Example

📝Credit Card Fraud Detection with Random Forest

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    classification_report, confusion_matrix, roc_auc_score,
    average_precision_score
)
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
n = 10000

amount = np.random.exponential(100, n)
time_since_last = np.random.exponential(2, n)
distance_from_home = np.random.exponential(20, n)
num_transactions_24h = np.random.poisson(5, n)
merchant_risk = np.random.beta(2, 5, n)

fraud_score = (
    0.3 * (amount > 500).astype(int) +
    0.2 * (time_since_last < 0.5).astype(int) +
    0.25 * (distance_from_home > 50).astype(int) +
    0.15 * (num_transactions_24h > 10).astype(int) +
    0.1 * (merchant_risk > 0.6).astype(int) +
    np.random.normal(0, 0.1, n)
)
is_fraud = (fraud_score > 0.6).astype(int)

df = pd.DataFrame({
    'amount': amount, 'time_since_last': time_since_last,
    'distance_from_home': distance_from_home,
    'num_transactions_24h': num_transactions_24h,
    'merchant_risk': merchant_risk, 'is_fraud': is_fraud
})

X = df.drop('is_fraud', axis=1)
y = df['is_fraud']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

rf_fraud = RandomForestClassifier(
    n_estimators=200, max_depth=10, min_samples_leaf=5,
    class_weight='balanced', random_state=42, n_jobs=-1, oob_score=True
)
rf_fraud.fit(X_train_scaled, y_train)

y_pred = rf_fraud.predict(X_test_scaled)
y_prob = rf_fraud.predict_proba(X_test_scaled)[:, 1]

print("=== Fraud Detection Model ===")
print(f"OOB Score: {rf_fraud.oob_score_:.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
print(f"Average Precision: {average_precision_score(y_test, y_prob):.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Legitimate', 'Fraud']))

Comparison with Other Algorithms

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
import time

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
}

print("Algorithm Comparison:")
print("-" * 65)
print(f"{'Algorithm':25s} {'CV Accuracy':15s} {'Training Time':15s}")
print("-" * 65)

for name, model in models.items():
    start = time.time()
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    train_time = time.time() - start
    print(f"{name:25s} {scores.mean():.4f} ± {scores.std():.4f} {train_time:.3f}s")

When to Use Random Forest

Advantages

AdvantageDescription
RobustLess prone to overfitting than single trees
No scaling neededTree-based, invariant to feature scale
Handles missing valuesSome implementations support NaN
Feature importanceBuilt-in feature ranking
Parallel trainingEach tree is independent
Out-of-bag scoreFree validation estimate

Disadvantages

DisadvantageDescription
Less interpretableHard to explain 100 trees vs 1
Memory intensiveStores many trees
Slower predictionMust query all trees
Can overfitWith very noisy data
Biased toward high-cardinality featuresMDI importance biased

Key Takeaways

📋Summary: Random Forest

  1. Bagging + Random Features = Random Forest; reduces variance while maintaining low bias
  2. Bootstrap sampling means each tree sees ~63.2% of data; remaining ~36.8% is OOB
  3. OOB score provides free validation without separate validation set
  4. Feature importance via MDI (fast) or permutation (more reliable)
  5. n_estimators: More trees is almost always better (diminishing returns after ~200)
  6. max_features: Use p\sqrt{p} for classification, p/3p/3 for regression
  7. class_weight='balanced' handles imbalanced datasets
  8. Parallelizable: Trees are independent, can use all CPU cores

Practice Exercises

Exercise 1: OOB Validation

from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)

# Train with OOB score
# Train with 5-fold CV
# Are they similar? Why or why not?

Exercise 2: Feature Importance Stability

# Train 10 Random Forests with different random seeds
# Collect feature importances from each
# Calculate mean and std of each feature's importance
# Plot the results
# Which features are consistently important?

Exercise 3: Hyperparameter Sensitivity

n_estimators = [10, 50, 100, 200, 500]
max_features = ['sqrt', 'log2', 0.1, 0.3, 0.5]

# Plot accuracy heatmap
# Which combination works best?

Exercise 4: Compare Ensemble Methods

from sklearn.ensemble import (
    RandomForestClassifier, GradientBoostingClassifier,
    AdaBoostClassifier, BaggingClassifier
)

# Compare:
# a) Random Forest
# b) Gradient Boosting
# c) AdaBoost
# d) BaggingClassifier with DecisionTree base
# Which performs best? Why?

Discussion Questions

  1. Why does Random Forest reduce overfitting compared to single decision trees?
  2. How would you explain Random Forest to a business stakeholder?
  3. When might Gradient Boosting be preferred over Random Forest?

Advertisement

Need Expert Data Science Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement