Decision Trees + Overfitting

Module 2: Machine LearningFree Lesson

Advertisement

Decision Trees + Overfitting

Overview

Decision trees are intuitive, interpretable models that recursively split the feature space into regions. They form the foundation for powerful ensemble methods like Random Forests and Gradient Boosting.


Tree Structure

Visual Representation

Architecture Diagram
                    [Income > 50k?]
                    /              \
                 Yes                No
                /                    \
      [Age > 35?]              [Credit Score > 600?]
      /         \               /              \
   Yes          No            Yes               No
   /             \            /                  \
[Approve]    [Review]    [Approve]           [Deny]

Leaf nodes: Final predictions
Internal nodes: Decision rules
Root node: First split (most important feature)

Mathematical Representation

DfDecision Tree Function

A decision tree partitions the feature space X\mathcal{X} into MM regions R1,R2,dots,RMR_1, R_2, \\dots, R_M:

T(x)=m=1Mcm1(xRm)T(\mathbf{x}) = \sum_{m=1}^{M} c_m \cdot \mathbf{1}(\mathbf{x} \in R_m)

Where:

  • RmR_m = region (leaf node)
  • cmc_m = prediction for region mm (class label or mean value)
  • 1(cdot)\mathbf{1}(\\cdot) = indicator function

For classification: cm=argmaxkhatpmkc_m = \\arg\\max_k \\hat{p}_{mk} (majority class)

For regression: cm=frac1RmsumxiinRmyic_m = \\frac{1}{|R_m|}\\sum_{x_i \\in R_m} y_i (mean of region)


Splitting Criteria

For Classification

The goal is to find splits that create "pure" child nodes.

Gini Impurity

DfGini Impurity

Measures the probability of incorrectly classifying a randomly chosen element:

Gini(R)=1k=1Kpmk2\text{Gini}(R) = 1 - \sum_{k=1}^{K} p_{mk}^2

where pmkp_{mk} is the proportion of class kk in region RR.

Architecture Diagram
Gini Values:
- Pure node (all same class): Gini = 0
- Maximum impurity (50/50): Gini = 0.5
- Binary classification: Gini ∈ [0, 0.5]

Example:
Class distribution: [0.5, 0.5]
Gini = 1 - (0.5² + 0.5²) = 1 - 0.5 = 0.5 (maximum impurity)

Class distribution: [0.9, 0.1]
Gini = 1 - (0.9² + 0.1²) = 1 - 0.82 = 0.18 (low impurity)

Entropy and Information Gain

DfEntropy

H(R)=k=1Kpmklog2(pmk)H(R) = -\sum_{k=1}^{K} p_{mk} \log_2(p_{mk})

Information Gain

IG=H(R)RLRH(RL)RRRH(RR)IG = H(R) - \frac{|R_L|}{|R|} H(R_L) - \frac{|R_R|}{|R|} H(R_R)

Here,

  • H(R)H(R)=Entropy of parent node
  • RL,RRR_L, R_R=Left and right child nodes
  • R|R|=Number of samples in region
Architecture Diagram
Entropy Values:
- Pure node: Entropy = 0
- Maximum impurity (binary): Entropy = 1.0

Entropy Scale:
0.0 |___________________________
    |                            |  Pure
0.5 |         *                  |
    |            *               |
1.0 |_______________*___________|  Maximum impurity

Gain Ratio (C4.5)

Normalizes Information Gain to avoid bias toward multi-way splits:

GainRatio=IGSplitInfo\text{GainRatio} = \frac{IG}{\text{SplitInfo}}
SplitInfo=j=1VRjRlog2RjR\text{SplitInfo} = -\sum_{j=1}^{V} \frac{|R_j|}{|R|} \log_2\frac{|R_j|}{|R|}

For Regression

Variance Reduction (MSE)

Variance of Region

Var(R)=1RxiR(yiyˉR)2\text{Var}(R) = \frac{1}{|R|}\sum_{x_i \in R}(y_i - \bar{y}_R)^2

Here,

  • yˉR\bar{y}_R=Mean of target values in region R

Information Gain (Variance Reduction):

Δ=Var(R)RLRVar(RL)RRRVar(RR)\Delta = \text{Var}(R) - \frac{|R_L|}{|R|}\text{Var}(R_L) - \frac{|R_R|}{|R|}\text{Var}(R_R)

Mean Absolute Error (MAE)

MAE(R)=1RxiRyiyˉR\text{MAE}(R) = \frac{1}{|R|}\sum_{x_i \in R}|y_i - \bar{y}_R|

Split Algorithm

Greedy Search

import numpy as np

def find_best_split(X, y, criterion='gini'):
    """Find the best feature and threshold to split on."""
    best_gain = -np.inf
    best_feature = None
    best_threshold = None

    n_samples, n_features = X.shape

    for feature_idx in range(n_features):
        thresholds = np.unique(X[:, feature_idx])

        for threshold in thresholds:
            left_mask = X[:, feature_idx] <= threshold
            right_mask = ~left_mask

            if left_mask.sum() == 0 or right_mask.sum() == 0:
                continue

            if criterion == 'gini':
                gain = gini_impurity(y) - (
                    left_mask.sum()/n_samples * gini_impurity(y[left_mask]) +
                    right_mask.sum()/n_samples * gini_impurity(y[right_mask])
                )
            elif criterion == 'entropy':
                gain = entropy(y) - (
                    left_mask.sum()/n_samples * entropy(y[left_mask]) +
                    right_mask.sum()/n_samples * entropy(y[right_mask])
                )

            if gain > best_gain:
                best_gain = gain
                best_feature = feature_idx
                best_threshold = threshold

    return best_feature, best_threshold, best_gain

Impurity Functions Comparison

def gini_impurity(y):
    _, counts = np.unique(y, return_counts=True)
    probs = counts / len(y)
    return 1 - np.sum(probs ** 2)

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    probs = counts / len(y)
    return -np.sum(probs * np.log2(probs + 1e-10))

def variance(y):
    return np.var(y)

distributions = [
    np.array([1, 1, 1, 1, 0]),
    np.array([1, 1, 1, 1, 1]),
    np.array([1, 0, 1, 0, 1]),
    np.array([1, 1, 1, 0, 0]),
]

print("Distribution | Gini | Entropy")
print("-" * 45)
for dist in distributions:
    print(f"{str(dist):15s} | {gini_impurity(dist):.3f} | {entropy(dist):.3f}")

Python Implementation

Single Decision Tree

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_text
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names
class_names = iris.target_names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

dt_classifier = DecisionTreeClassifier(
    criterion='gini',
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    random_state=42
)
dt_classifier.fit(X_train, y_train)

y_pred = dt_classifier.predict(X_test)
print("Decision Tree Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=class_names))

print("\nDecision Tree Rules:")
tree_rules = export_text(dt_classifier, feature_names=feature_names)
print(tree_rules)

Overfitting in Decision Trees

The Problem

Without constraints, trees grow until nodes are pure, memorizing training data:

Architecture Diagram
Training Accuracy vs Tree Depth

Accuracy
^
100%|                          ___________  Training
    |                        /
    |                      /
 90%|                    /
    |                  /         ________  Validation
    |                /         /
 80%|              /         /
    |            /         /
 70%|          /         /
    |        /         /
 60%|      /         /
    |    /         /
 50%|__/________/
    +------------------------------------> Tree Depth
        1  2  3  4  5  6  7  8  9  10

Optimal depth: ~4-5 (gap between training and validation)

Quantifying Overfitting

from sklearn.model_selection import cross_val_score

depths = range(1, 21)
train_scores = []
cv_scores = []

for depth in depths:
    dt = DecisionTreeClassifier(max_depth=depth, random_state=42)
    dt.fit(X_train, y_train)

    train_scores.append(accuracy_score(y_train, dt.predict(X_train)))
    cv_scores.append(cross_val_score(dt, X_train, y_train, cv=5).mean())

optimal_depth = depths[np.argmax(cv_scores)]
print(f"Optimal depth: {optimal_depth}")

Pruning Strategies

Pre-Pruning (Early Stopping)

Stop growing the tree before it overfits:

ParameterDescriptionTypical Values
max_depthMaximum tree depth3-20
min_samples_splitMin samples to split a node2-50
min_samples_leafMin samples in leaf node1-20
max_featuresNumber of features to considersqrt(n) or log(n)

Post-Pruning (Cost Complexity Pruning)

Grow full tree, then remove branches that don't improve generalization.

Rα(T)=R(T)+αTR_\alpha(T) = R(T) + \alpha |T|

Minimal Cost-Complexity Pruning:

αeff=min{α:pruned subtree Tα is optimal}\alpha_{\text{eff}} = \min\left\{\alpha : \text{pruned subtree } T_\alpha \text{ is optimal}\right\}
path = dt_classifier.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas

trees = []
for ccp_alpha in ccp_alphas:
    dt = DecisionTreeClassifier(ccp_alpha=ccp_alpha, random_state=42)
    dt.fit(X_train, y_train)
    trees.append(dt)

cv_scores = [cross_val_score(t, X_train, y_train, cv=5).mean() for t in trees]
optimal_idx = np.argmax(cv_scores)
optimal_alpha = ccp_alphas[optimal_idx]

print(f"Optimal ccp_alpha: {optimal_alpha:.6f}")
print(f"Best CV accuracy: {cv_scores[optimal_idx]:.4f}")

Pruning Comparison

Architecture Diagram
Full Tree (α=0)          Pruned (α=0.01)        Highly Pruned (α=0.1)
+---+---+---+---+       +-------+-------+       +---------------+
| A | A | B | B |       |   A   |   B   |       |       A       |
+---+---+---+---+       +-------+-------+       +---------------+
| A | B | B | B |       |   B   |   B   |       |       B       |
+---+---+---+---+       +-------+-------+       +---------------+

16 leaves                 4 leaves               2 leaves
Overfits                  Good balance            Underfits

Feature Importance

Mean Decrease in Impurity (MDI)

Feature Importance (MDI)

Importance(j)=tsplits on jp(t)ΔI(t)\text{Importance}(j) = \sum_{t \in \text{splits on } j} p(t) \cdot \Delta I(t)

Here,

  • p(t)p(t)=Proportion of samples reaching node t
  • ΔI(t)\Delta I(t)=Impurity decrease at node t
importances = dt_classifier.feature_importances_
indices = np.argsort(importances)[::-1]

print("\nFeature Importance Ranking:")
for i, idx in enumerate(indices):
    print(f"{i+1}. {feature_names[idx]}: {importances[idx]:.4f}")

Limitations of MDI Importance

  • Biased toward high-cardinality features
  • Can be misleading with correlated features
  • Alternative: permutation importance

Complete Python Example

📝Decision Tree with Hyperparameter Tuning

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report

np.random.seed(42)
n = 800

data = pd.DataFrame({
    'Pclass': np.random.choice([1, 2, 3], n, p=[0.24, 0.21, 0.55]),
    'Sex': np.random.choice(['male', 'female'], n, p=[0.65, 0.35]),
    'Age': np.random.normal(30, 14, n).clip(0.5, 80),
    'SibSp': np.random.choice([0, 1, 2, 3, 4], n, p=[0.6, 0.2, 0.1, 0.05, 0.05]),
    'Parch': np.random.choice([0, 1, 2, 3], n, p=[0.7, 0.15, 0.1, 0.05]),
    'Fare': np.random.exponential(30, n)
})

survival_prob = (
    0.3 * (data['Sex'] == 'female').astype(int) +
    0.2 * (data['Pclass'] == 1).astype(int) -
    0.15 * (data['Age'] > 50).astype(int) +
    0.1 * (data['Fare'] > 50).astype(int) +
    0.05 * (data['SibSp'] <= 1).astype(int) +
    np.random.normal(0, 0.2, n)
)
data['Survived'] = (survival_prob > 0.4).astype(int)

data['Sex_encoded'] = (data['Sex'] == 'female').astype(int)
features = ['Pclass', 'Sex_encoded', 'Age', 'SibSp', 'Parch', 'Fare']

X = data[features]
y = data['Survived']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

param_grid = {
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5, 10],
    'criterion': ['gini', 'entropy']
}

grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid, cv=5, scoring='accuracy', n_jobs=-1
)
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV accuracy: {grid_search.best_score_:.4f}")

best_dt = grid_search.best_estimator_
y_pred = best_dt.predict(X_test)

print(f"\nTest Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"\nDecision Tree Rules:")
print(export_text(best_dt, feature_names=features))

Key Takeaways

📋Summary: Decision Trees

  1. Intuitive and interpretable — easy to explain decisions to non-technical stakeholders
  2. No feature scaling required — splits are based on thresholds
  3. Handles both classification and regression — same algorithm, different impurity measures
  4. Prone to overfitting — use pruning or ensemble methods
  5. Gini vs Entropy: Results are usually similar; Gini is slightly faster
  6. Cost-complexity pruning uses R_\\alpha(T) = R(T) + \\alpha|T| to balance fit and complexity
  7. Greedy algorithm — finds locally optimal splits, not globally optimal tree
  8. Foundation for ensembles — Random Forest and Gradient Boosting build on decision trees

Practice Exercises

Exercise 1: Manual Split Calculation

Given a node with samples: [A, A, A, B, B]

a) Calculate Gini impurity b) Calculate Entropy c) If splitting produces left=[A, A, A] and right=[B, B], calculate Information Gain

Exercise 2: Pruning Comparison

# Train 5 trees with different max_depth values (1, 3, 5, 10, None)
# For each:
#   - Calculate training accuracy
#   - Calculate test accuracy
#   - Count number of leaves
# Plot the results

Exercise 3: Feature Importance Stability

# Train 100 decision trees on bootstrap samples
# Collect feature importances from each
# Calculate mean and std of each feature's importance
# Which features are consistently important?

Exercise 4: Classification vs Regression

# On the same dataset:
# a) Train a classification tree
# b) Train a regression tree (treat target as continuous)
# c) Compare predictions and performance
# d) When might regression trees be preferred for classification?

Discussion Questions

  1. Why don't decision trees require feature scaling?
  2. How would you handle missing values in a decision tree?
  3. What are the advantages of using decision trees over logistic regression?

Advertisement

Need Expert Data Science Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement