Random Forest + Feature Importance
Overview
Random Forest is an ensemble learning method that builds multiple decision trees and combines their predictions. It addresses the overfitting problem of single decision trees while maintaining high accuracy.
Ensemble Learning
The Wisdom of Crowds
Individual trees may make errors, but combining many trees reduces variance without increasing bias:
ℹ️ Ensemble Variance Reduction
When combining independent models each with variance , the ensemble variance is . This is the mathematical foundation for why bagging reduces overfitting. In practice, trees are not fully independent (they share training data), so the actual variance reduction is less dramatic but still substantial.
Single Decision Tree Random Forest (100 trees)
+---+---+---+---+ +---+---+---+---+
| A | A | B | B | | A | A | A | B |
+---+---+---+---+ --> +---+---+---+---+
| A | B | B | B | | A | B | B | B |
+---+---+---+---+ +---+---+---+---+
Accuracy: 85% Accuracy: 92%
Variance: High Variance: Low
Overfitting risk Robust predictions
Bagging (Bootstrap Aggregating)
Algorithm
- Create bootstrap samples from training data
- Train a decision tree on each bootstrap sample
- Aggregate predictions (majority vote or averaging)
DfBootstrap Sampling
Given dataset of size , create sample by sampling times with replacement:
Properties:
- Each bootstrap sample contains ~63.2% unique samples
- ~36.8% samples are out-of-bag (OOB) for each tree
- Different trees see different data → diversity
Original Dataset D: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Bootstrap Sample 1: [2, 3, 3, 5, 7, 7, 8, 9, 10, 10]
Bootstrap Sample 2: [1, 1, 2, 4, 4, 6, 7, 8, 9, 9]
Bootstrap Sample 3: [1, 3, 5, 5, 6, 7, 8, 8, 10, 10]
Each tree trained on different subset!
Mathematical Formulation
For trees, the bagging prediction:
Bagging Prediction
Here,
- =Number of trees
- =Prediction from tree b
Classification: Majority vote
Regression: Average
Random Forest Algorithm
Key Innovation: Random Feature Selection
At each split, only consider a random subset of features:
Features Per Split
Here,
- =Total number of features
- =Features to consider at each split
Why Random Features?
Without Feature Randomness: With Feature Randomness:
+----+----+----+----+ +----+----+----+----+
| T1 | T2 | T3 | T4 | | T1 | T2 | T3 | T4 |
+----+----+----+----+ +----+----+----+----+
| X1 | X1 | X1 | X1 | | X1 | X3 | X2 | X4 |
| X2 | X2 | X2 | X2 | | X3 | X1 | X4 | X2 |
+----+----+----+----+ +----+----+----+----+
All trees use same strong feature Trees use different features
Trees are correlated Trees are diverse
Ensemble has less benefit Ensemble has more benefit
Complete Algorithm
Input: Training data D = {(x₁,y₁), ..., (xₙ,yₙ)}
Parameters: B = number of trees, m = features per split
For b = 1 to B:
1. Create bootstrap sample D_b from D
2. Grow tree T_b on D_b:
- At each node:
* Select m random features
* Find best split among those features
* Split node into two children
- Stop when: min_samples_leaf reached or impurity = 0
3. Do NOT prune the tree
Output: Forest {T₁, T₂, ..., T_B}
Prediction: Majority vote (classification) or average (regression)
Python Implementation
Basic Random Forest
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report, r2_score
from sklearn.datasets import make_classification, make_regression
X, y = make_classification(
n_samples=1000, n_features=20, n_informative=10,
n_redundant=5, n_classes=2, random_state=42
)
feature_names = [f'Feature_{i}' for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
rf = RandomForestClassifier(
n_estimators=100, max_depth=None, min_samples_split=2,
min_samples_leaf=1, max_features='sqrt', bootstrap=True,
oob_score=True, random_state=42, n_jobs=-1
)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
print("Random Forest Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"OOB Score: {rf.oob_score_:.4f}")
Out-of-Bag (OOB) Error
Concept
Each tree uses ~63.2% of data; the remaining ~36.8% (OOB) can be used for validation:
Original Data: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Tree 1: Training on [2, 3, 5, 7, 8, 10] OOB: [1, 4, 6, 9]
Tree 2: Training on [1, 2, 4, 6, 7, 9] OOB: [3, 5, 8, 10]
Tree 3: Training on [1, 3, 5, 7, 8, 10] OOB: [2, 4, 6, 9]
For each sample, only use trees that didn't see it in training
OOB Score Calculation
For sample , predict using only trees where was OOB:
OOB Score
Here,
- =OOB prediction for sample i
- =True label for sample i
OOB vs Cross-Validation
rf_with_oob = RandomForestClassifier(
n_estimators=200, oob_score=True, random_state=42
)
rf_with_oob.fit(X_train, y_train)
print(f"OOB Score: {rf_with_oob.oob_score_:.4f}")
cv_scores = cross_val_score(
RandomForestClassifier(n_estimators=200, random_state=42),
X_train, y_train, cv=5, scoring='accuracy'
)
print(f"CV Accuracy: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
Feature Importance
Mean Decrease in Impurity (MDI)
MDI Feature Importance
Here,
- =Number of trees
- =Proportion of samples at node t
- =Impurity decrease at node t
importances = rf.feature_importances_
std = np.std([tree.feature_importances_ for tree in rf.estimators_], axis=0)
indices = np.argsort(importances)[::-1]
print("Feature Importance Ranking:")
print("-" * 40)
for i, idx in enumerate(indices[:10]):
print(f"{i+1:2d}. {feature_names[idx]:15s} {importances[idx]:.4f} ± {std[idx]:.4f}")
Permutation Importance (More Reliable)
Measures importance by shuffling each feature and measuring accuracy decrease:
Permutation Importance
Here,
- =X with feature j randomly permuted
from sklearn.inspection import permutation_importance
perm_importance = permutation_importance(
rf, X_test, y_test, n_repeats=10, random_state=42, n_jobs=-1
)
perm_indices = perm_importance.importances_mean.argsort()[::-1]
print("\nPermutation Importance:")
print("-" * 40)
for i, idx in enumerate(perm_indices[:10]):
print(f"{i+1:2d}. {feature_names[idx]:15s} {perm_importance.importances_mean[idx]:.4f} "
f"± {perm_importance.importances_std[idx]:.4f}")
MDI vs Permutation Importance
| Aspect | MDI (Gini) | Permutation |
|---|---|---|
| Speed | Fast (during training) | Slow (requires re-evaluation) |
| Bias | Biased toward high-cardinality features | Less biased |
| Correlated features | Can be misleading | More reliable |
| Interpretability | Less intuitive | More intuitive |
Hyperparameter Tuning
Key Parameters
Random Forest Hyperparameters
├── n_estimators (Number of trees)
│ └── More trees → better performance but slower
│ Sweet spot: 100-500
│
├── max_features (Features per split)
│ ├── Classification: sqrt(p) or log2(p)
│ └── Regression: p/3
│
├── max_depth (Tree depth)
│ ├── None: fully grown trees (default)
│ └── Limit: reduces overfitting
│
├── min_samples_leaf (Leaf size)
│ ├── 1: fully grown (default)
│ └── Increase: smoother boundaries
│
└── bootstrap (Sampling method)
├── True: standard bagging (default)
└── False: use all data for each tree
Grid Search Example
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [5, 10, 20, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'max_features': ['sqrt', 'log2', None]
}
random_search = RandomizedSearchCV(
RandomForestClassifier(random_state=42),
param_distributions=param_grid,
n_iter=50, cv=5, scoring='accuracy',
random_state=42, n_jobs=-1
)
random_search.fit(X_train, y_train)
print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV accuracy: {random_search.best_score_:.4f}")
Complete Real-World Example
📝Credit Card Fraud Detection with Random Forest
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
classification_report, confusion_matrix, roc_auc_score,
average_precision_score
)
from sklearn.preprocessing import StandardScaler
np.random.seed(42)
n = 10000
amount = np.random.exponential(100, n)
time_since_last = np.random.exponential(2, n)
distance_from_home = np.random.exponential(20, n)
num_transactions_24h = np.random.poisson(5, n)
merchant_risk = np.random.beta(2, 5, n)
fraud_score = (
0.3 * (amount > 500).astype(int) +
0.2 * (time_since_last < 0.5).astype(int) +
0.25 * (distance_from_home > 50).astype(int) +
0.15 * (num_transactions_24h > 10).astype(int) +
0.1 * (merchant_risk > 0.6).astype(int) +
np.random.normal(0, 0.1, n)
)
is_fraud = (fraud_score > 0.6).astype(int)
df = pd.DataFrame({
'amount': amount, 'time_since_last': time_since_last,
'distance_from_home': distance_from_home,
'num_transactions_24h': num_transactions_24h,
'merchant_risk': merchant_risk, 'is_fraud': is_fraud
})
X = df.drop('is_fraud', axis=1)
y = df['is_fraud']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
rf_fraud = RandomForestClassifier(
n_estimators=200, max_depth=10, min_samples_leaf=5,
class_weight='balanced', random_state=42, n_jobs=-1, oob_score=True
)
rf_fraud.fit(X_train_scaled, y_train)
y_pred = rf_fraud.predict(X_test_scaled)
y_prob = rf_fraud.predict_proba(X_test_scaled)[:, 1]
print("=== Fraud Detection Model ===")
print(f"OOB Score: {rf_fraud.oob_score_:.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
print(f"Average Precision: {average_precision_score(y_test, y_prob):.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Legitimate', 'Fraud']))
Comparison with Other Algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
import time
models = {
'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
'Decision Tree': DecisionTreeClassifier(random_state=42),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
}
print("Algorithm Comparison:")
print("-" * 65)
print(f"{'Algorithm':25s} {'CV Accuracy':15s} {'Training Time':15s}")
print("-" * 65)
for name, model in models.items():
start = time.time()
scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
train_time = time.time() - start
print(f"{name:25s} {scores.mean():.4f} ± {scores.std():.4f} {train_time:.3f}s")
When to Use Random Forest
Advantages
| Advantage | Description |
|---|---|
| Robust | Less prone to overfitting than single trees |
| No scaling needed | Tree-based, invariant to feature scale |
| Handles missing values | Some implementations support NaN |
| Feature importance | Built-in feature ranking |
| Parallel training | Each tree is independent |
| Out-of-bag score | Free validation estimate |
Disadvantages
| Disadvantage | Description |
|---|---|
| Less interpretable | Hard to explain 100 trees vs 1 |
| Memory intensive | Stores many trees |
| Slower prediction | Must query all trees |
| Can overfit | With very noisy data |
| Biased toward high-cardinality features | MDI importance biased |
Key Takeaways
📋Summary: Random Forest
- Bagging + Random Features = Random Forest; reduces variance while maintaining low bias
- Bootstrap sampling means each tree sees ~63.2% of data; remaining ~36.8% is OOB
- OOB score provides free validation without separate validation set
- Feature importance via MDI (fast) or permutation (more reliable)
- n_estimators: More trees is almost always better (diminishing returns after ~200)
- max_features: Use for classification, for regression
- class_weight='balanced' handles imbalanced datasets
- Parallelizable: Trees are independent, can use all CPU cores
Practice Exercises
Exercise 1: OOB Validation
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
# Train with OOB score
# Train with 5-fold CV
# Are they similar? Why or why not?
Exercise 2: Feature Importance Stability
# Train 10 Random Forests with different random seeds
# Collect feature importances from each
# Calculate mean and std of each feature's importance
# Plot the results
# Which features are consistently important?
Exercise 3: Hyperparameter Sensitivity
n_estimators = [10, 50, 100, 200, 500]
max_features = ['sqrt', 'log2', 0.1, 0.3, 0.5]
# Plot accuracy heatmap
# Which combination works best?
Exercise 4: Compare Ensemble Methods
from sklearn.ensemble import (
RandomForestClassifier, GradientBoostingClassifier,
AdaBoostClassifier, BaggingClassifier
)
# Compare:
# a) Random Forest
# b) Gradient Boosting
# c) AdaBoost
# d) BaggingClassifier with DecisionTree base
# Which performs best? Why?
Discussion Questions
- Why does Random Forest reduce overfitting compared to single decision trees?
- How would you explain Random Forest to a business stakeholder?
- When might Gradient Boosting be preferred over Random Forest?