Decision Trees + Overfitting

Overview

Decision trees are intuitive, interpretable models that recursively split the feature space into regions. They form the foundation for powerful ensemble methods like Random Forests and Gradient Boosting.

Tree Structure

Visual Representation

Architecture Diagram

                    [Income > 50k?]
                    /              \
                 Yes                No
                /                    \
      [Age > 35?]              [Credit Score > 600?]
      /         \               /              \
   Yes          No            Yes               No
   /             \            /                  \
[Approve]    [Review]    [Approve]           [Deny]

Leaf nodes: Final predictions
Internal nodes: Decision rules
Root node: First split (most important feature)

Mathematical Representation

DfDecision Tree Function

A decision tree partitions the feature space $\mathcal{X}$ into $M$ regions $R_1, R_2, \\dots, R_M$ :

T(\mathbf{x}) = \sum_{m=1}^{M} c_m \cdot \mathbf{1}(\mathbf{x} \in R_m)

Where:

$R_m$ = region (leaf node)
$c_m$ = prediction for region $m$ (class label or mean value)
$\mathbf{1}(\\cdot)$ = indicator function

For classification: $c_m = \\arg\\max_k \\hat{p}_{mk}$ (majority class)

For regression: $c_m = \\frac{1}{|R_m|}\\sum_{x_i \\in R_m} y_i$ (mean of region)

Splitting Criteria

For Classification

The goal is to find splits that create "pure" child nodes.

Gini Impurity

DfGini Impurity

Measures the probability of incorrectly classifying a randomly chosen element:

\text{Gini}(R) = 1 - \sum_{k=1}^{K} p_{mk}^2

where $p_{mk}$ is the proportion of class $k$ in region $R$ .

Architecture Diagram

Gini Values:
- Pure node (all same class): Gini = 0
- Maximum impurity (50/50): Gini = 0.5
- Binary classification: Gini ∈ [0, 0.5]

Example:
Class distribution: [0.5, 0.5]
Gini = 1 - (0.5² + 0.5²) = 1 - 0.5 = 0.5 (maximum impurity)

Class distribution: [0.9, 0.1]
Gini = 1 - (0.9² + 0.1²) = 1 - 0.82 = 0.18 (low impurity)

Entropy and Information Gain

DfEntropy

H(R) = -\sum_{k=1}^{K} p_{mk} \log_2(p_{mk})

Information Gain

IG = H(R) - \frac{|R_L|}{|R|} H(R_L) - \frac{|R_R|}{|R|} H(R_R)

Here,

$H(R)$ =Entropy of parent node
$R_L, R_R$ =Left and right child nodes
$|R|$ =Number of samples in region

Architecture Diagram

Entropy Values:
- Pure node: Entropy = 0
- Maximum impurity (binary): Entropy = 1.0

Entropy Scale:
0.0 |___________________________
    |                            |  Pure
0.5 |         *                  |
    |            *               |
1.0 |_______________*___________|  Maximum impurity

Gain Ratio (C4.5)

Normalizes Information Gain to avoid bias toward multi-way splits:

\text{GainRatio} = \frac{IG}{\text{SplitInfo}}

\text{SplitInfo} = -\sum_{j=1}^{V} \frac{|R_j|}{|R|} \log_2\frac{|R_j|}{|R|}

For Regression

Variance Reduction (MSE)

Variance of Region

\text{Var}(R) = \frac{1}{|R|}\sum_{x_i \in R}(y_i - \bar{y}_R)^2

Here,

$\bar{y}_R$ =Mean of target values in region R

Information Gain (Variance Reduction):

\Delta = \text{Var}(R) - \frac{|R_L|}{|R|}\text{Var}(R_L) - \frac{|R_R|}{|R|}\text{Var}(R_R)

Mean Absolute Error (MAE)

\text{MAE}(R) = \frac{1}{|R|}\sum_{x_i \in R}|y_i - \bar{y}_R|

Split Algorithm

Greedy Search

import numpy as np

def find_best_split(X, y, criterion='gini'):
    """Find the best feature and threshold to split on."""
    best_gain = -np.inf
    best_feature = None
    best_threshold = None

    n_samples, n_features = X.shape

    for feature_idx in range(n_features):
        thresholds = np.unique(X[:, feature_idx])

        for threshold in thresholds:
            left_mask = X[:, feature_idx] <= threshold
            right_mask = ~left_mask

            if left_mask.sum() == 0 or right_mask.sum() == 0:
                continue

            if criterion == 'gini':
                gain = gini_impurity(y) - (
                    left_mask.sum()/n_samples * gini_impurity(y[left_mask]) +
                    right_mask.sum()/n_samples * gini_impurity(y[right_mask])
                )
            elif criterion == 'entropy':
                gain = entropy(y) - (
                    left_mask.sum()/n_samples * entropy(y[left_mask]) +
                    right_mask.sum()/n_samples * entropy(y[right_mask])
                )

            if gain > best_gain:
                best_gain = gain
                best_feature = feature_idx
                best_threshold = threshold

    return best_feature, best_threshold, best_gain

Impurity Functions Comparison

def gini_impurity(y):
    _, counts = np.unique(y, return_counts=True)
    probs = counts / len(y)
    return 1 - np.sum(probs ** 2)

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    probs = counts / len(y)
    return -np.sum(probs * np.log2(probs + 1e-10))

def variance(y):
    return np.var(y)

distributions = [
    np.array([1, 1, 1, 1, 0]),
    np.array([1, 1, 1, 1, 1]),
    np.array([1, 0, 1, 0, 1]),
    np.array([1, 1, 1, 0, 0]),
]

print("Distribution | Gini | Entropy")
print("-" * 45)
for dist in distributions:
    print(f"{str(dist):15s} | {gini_impurity(dist):.3f} | {entropy(dist):.3f}")

Python Implementation

Single Decision Tree

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_text
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names
class_names = iris.target_names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

dt_classifier = DecisionTreeClassifier(
    criterion='gini',
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    random_state=42
)
dt_classifier.fit(X_train, y_train)

y_pred = dt_classifier.predict(X_test)
print("Decision Tree Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=class_names))

print("\nDecision Tree Rules:")
tree_rules = export_text(dt_classifier, feature_names=feature_names)
print(tree_rules)

Overfitting in Decision Trees

The Problem

Without constraints, trees grow until nodes are pure, memorizing training data:

Architecture Diagram

Training Accuracy vs Tree Depth

Accuracy
^
100%|                          ___________  Training
    |                        /
    |                      /
 90%|                    /
    |                  /         ________  Validation
    |                /         /
 80%|              /         /
    |            /         /
 70%|          /         /
    |        /         /
 60%|      /         /
    |    /         /
 50%|__/________/
    +------------------------------------> Tree Depth
        1  2  3  4  5  6  7  8  9  10

Optimal depth: ~4-5 (gap between training and validation)

Quantifying Overfitting

from sklearn.model_selection import cross_val_score

depths = range(1, 21)
train_scores = []
cv_scores = []

for depth in depths:
    dt = DecisionTreeClassifier(max_depth=depth, random_state=42)
    dt.fit(X_train, y_train)

    train_scores.append(accuracy_score(y_train, dt.predict(X_train)))
    cv_scores.append(cross_val_score(dt, X_train, y_train, cv=5).mean())

optimal_depth = depths[np.argmax(cv_scores)]
print(f"Optimal depth: {optimal_depth}")

Pruning Strategies

Pre-Pruning (Early Stopping)

Stop growing the tree before it overfits:

Parameter	Description	Typical Values
`max_depth`	Maximum tree depth	3-20
`min_samples_split`	Min samples to split a node	2-50
`min_samples_leaf`	Min samples in leaf node	1-20
`max_features`	Number of features to consider	sqrt(n) or log(n)

Post-Pruning (Cost Complexity Pruning)

Grow full tree, then remove branches that don't improve generalization.

R_\alpha(T) = R(T) + \alpha |T|

Minimal Cost-Complexity Pruning:

\alpha_{\text{eff}} = \min\left\{\alpha : \text{pruned subtree } T_\alpha \text{ is optimal}\right\}

path = dt_classifier.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas

trees = []
for ccp_alpha in ccp_alphas:
    dt = DecisionTreeClassifier(ccp_alpha=ccp_alpha, random_state=42)
    dt.fit(X_train, y_train)
    trees.append(dt)

cv_scores = [cross_val_score(t, X_train, y_train, cv=5).mean() for t in trees]
optimal_idx = np.argmax(cv_scores)
optimal_alpha = ccp_alphas[optimal_idx]

print(f"Optimal ccp_alpha: {optimal_alpha:.6f}")
print(f"Best CV accuracy: {cv_scores[optimal_idx]:.4f}")

Pruning Comparison

Architecture Diagram

Full Tree (α=0)          Pruned (α=0.01)        Highly Pruned (α=0.1)
+---+---+---+---+       +-------+-------+       +---------------+
| A | A | B | B |       |   A   |   B   |       |       A       |
+---+---+---+---+       +-------+-------+       +---------------+
| A | B | B | B |       |   B   |   B   |       |       B       |
+---+---+---+---+       +-------+-------+       +---------------+

16 leaves                 4 leaves               2 leaves
Overfits                  Good balance            Underfits

Feature Importance

Mean Decrease in Impurity (MDI)

Feature Importance (MDI)

\text{Importance}(j) = \sum_{t \in \text{splits on } j} p(t) \cdot \Delta I(t)

Here,

$p(t)$ =Proportion of samples reaching node t
$\Delta I(t)$ =Impurity decrease at node t

importances = dt_classifier.feature_importances_
indices = np.argsort(importances)[::-1]

print("\nFeature Importance Ranking:")
for i, idx in enumerate(indices):
    print(f"{i+1}. {feature_names[idx]}: {importances[idx]:.4f}")

Limitations of MDI Importance

Biased toward high-cardinality features
Can be misleading with correlated features
Alternative: permutation importance

Complete Python Example

📝Decision Tree with Hyperparameter Tuning

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report

np.random.seed(42)
n = 800

data = pd.DataFrame({
    'Pclass': np.random.choice([1, 2, 3], n, p=[0.24, 0.21, 0.55]),
    'Sex': np.random.choice(['male', 'female'], n, p=[0.65, 0.35]),
    'Age': np.random.normal(30, 14, n).clip(0.5, 80),
    'SibSp': np.random.choice([0, 1, 2, 3, 4], n, p=[0.6, 0.2, 0.1, 0.05, 0.05]),
    'Parch': np.random.choice([0, 1, 2, 3], n, p=[0.7, 0.15, 0.1, 0.05]),
    'Fare': np.random.exponential(30, n)
})

survival_prob = (
    0.3 * (data['Sex'] == 'female').astype(int) +
    0.2 * (data['Pclass'] == 1).astype(int) -
    0.15 * (data['Age'] > 50).astype(int) +
    0.1 * (data['Fare'] > 50).astype(int) +
    0.05 * (data['SibSp'] <= 1).astype(int) +
    np.random.normal(0, 0.2, n)
)
data['Survived'] = (survival_prob > 0.4).astype(int)

data['Sex_encoded'] = (data['Sex'] == 'female').astype(int)
features = ['Pclass', 'Sex_encoded', 'Age', 'SibSp', 'Parch', 'Fare']

X = data[features]
y = data['Survived']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

param_grid = {
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5, 10],
    'criterion': ['gini', 'entropy']
}

grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid, cv=5, scoring='accuracy', n_jobs=-1
)
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV accuracy: {grid_search.best_score_:.4f}")

best_dt = grid_search.best_estimator_
y_pred = best_dt.predict(X_test)

print(f"\nTest Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"\nDecision Tree Rules:")
print(export_text(best_dt, feature_names=features))

Key Takeaways

📋Summary: Decision Trees

Intuitive and interpretable — easy to explain decisions to non-technical stakeholders
No feature scaling required — splits are based on thresholds
Handles both classification and regression — same algorithm, different impurity measures
Prone to overfitting — use pruning or ensemble methods
Gini vs Entropy: Results are usually similar; Gini is slightly faster
Cost-complexity pruning uses $R_\\alpha(T) = R(T) + \\alpha|T|$ to balance fit and complexity
Greedy algorithm — finds locally optimal splits, not globally optimal tree
Foundation for ensembles — Random Forest and Gradient Boosting build on decision trees

Practice Exercises

Exercise 1: Manual Split Calculation

Given a node with samples: [A, A, A, B, B]

a) Calculate Gini impurity b) Calculate Entropy c) If splitting produces left=[A, A, A] and right=[B, B], calculate Information Gain

Exercise 2: Pruning Comparison

# Train 5 trees with different max_depth values (1, 3, 5, 10, None)
# For each:
#   - Calculate training accuracy
#   - Calculate test accuracy
#   - Count number of leaves
# Plot the results

Exercise 3: Feature Importance Stability

# Train 100 decision trees on bootstrap samples
# Collect feature importances from each
# Calculate mean and std of each feature's importance
# Which features are consistently important?

Exercise 4: Classification vs Regression

# On the same dataset:
# a) Train a classification tree
# b) Train a regression tree (treat target as continuous)
# c) Compare predictions and performance
# d) When might regression trees be preferred for classification?

Discussion Questions

Why don't decision trees require feature scaling?
How would you handle missing values in a decision tree?
What are the advantages of using decision trees over logistic regression?

Decision Trees + Overfitting

Decision Trees + Overfitting

Overview

Tree Structure

Visual Representation

Mathematical Representation

DfDecision Tree Function

Splitting Criteria

For Classification

Gini Impurity

DfGini Impurity

Entropy and Information Gain

DfEntropy

Information Gain

Gain Ratio (C4.5)

For Regression

Variance Reduction (MSE)

Variance of Region

Mean Absolute Error (MAE)

Split Algorithm

Greedy Search

Impurity Functions Comparison

Python Implementation

Single Decision Tree

Overfitting in Decision Trees

The Problem

Quantifying Overfitting

Pruning Strategies

Pre-Pruning (Early Stopping)

Post-Pruning (Cost Complexity Pruning)

Pruning Comparison

Feature Importance

Mean Decrease in Impurity (MDI)

Feature Importance (MDI)

Limitations of MDI Importance

Complete Python Example

📝Decision Tree with Hyperparameter Tuning

Key Takeaways

📋Summary: Decision Trees

Practice Exercises

Exercise 1: Manual Split Calculation

Exercise 2: Pruning Comparison

Exercise 3: Feature Importance Stability

Exercise 4: Classification vs Regression

Discussion Questions

Need Expert Data Science Help?