Decision Trees + Overfitting
Overview
Decision trees are intuitive, interpretable models that recursively split the feature space into regions. They form the foundation for powerful ensemble methods like Random Forests and Gradient Boosting.
Tree Structure
Visual Representation
[Income > 50k?]
/ \
Yes No
/ \
[Age > 35?] [Credit Score > 600?]
/ \ / \
Yes No Yes No
/ \ / \
[Approve] [Review] [Approve] [Deny]
Leaf nodes: Final predictions
Internal nodes: Decision rules
Root node: First split (most important feature)
Mathematical Representation
DfDecision Tree Function
A decision tree partitions the feature space into regions :
Where:
- = region (leaf node)
- = prediction for region (class label or mean value)
- = indicator function
For classification: (majority class)
For regression: (mean of region)
Splitting Criteria
For Classification
The goal is to find splits that create "pure" child nodes.
Gini Impurity
DfGini Impurity
Measures the probability of incorrectly classifying a randomly chosen element:
where is the proportion of class in region .
Gini Values:
- Pure node (all same class): Gini = 0
- Maximum impurity (50/50): Gini = 0.5
- Binary classification: Gini ∈ [0, 0.5]
Example:
Class distribution: [0.5, 0.5]
Gini = 1 - (0.5² + 0.5²) = 1 - 0.5 = 0.5 (maximum impurity)
Class distribution: [0.9, 0.1]
Gini = 1 - (0.9² + 0.1²) = 1 - 0.82 = 0.18 (low impurity)
Entropy and Information Gain
DfEntropy
Information Gain
Here,
- =Entropy of parent node
- =Left and right child nodes
- =Number of samples in region
Entropy Values:
- Pure node: Entropy = 0
- Maximum impurity (binary): Entropy = 1.0
Entropy Scale:
0.0 |___________________________
| | Pure
0.5 | * |
| * |
1.0 |_______________*___________| Maximum impurity
Gain Ratio (C4.5)
Normalizes Information Gain to avoid bias toward multi-way splits:
For Regression
Variance Reduction (MSE)
Variance of Region
Here,
- =Mean of target values in region R
Information Gain (Variance Reduction):
Mean Absolute Error (MAE)
Split Algorithm
Greedy Search
import numpy as np
def find_best_split(X, y, criterion='gini'):
"""Find the best feature and threshold to split on."""
best_gain = -np.inf
best_feature = None
best_threshold = None
n_samples, n_features = X.shape
for feature_idx in range(n_features):
thresholds = np.unique(X[:, feature_idx])
for threshold in thresholds:
left_mask = X[:, feature_idx] <= threshold
right_mask = ~left_mask
if left_mask.sum() == 0 or right_mask.sum() == 0:
continue
if criterion == 'gini':
gain = gini_impurity(y) - (
left_mask.sum()/n_samples * gini_impurity(y[left_mask]) +
right_mask.sum()/n_samples * gini_impurity(y[right_mask])
)
elif criterion == 'entropy':
gain = entropy(y) - (
left_mask.sum()/n_samples * entropy(y[left_mask]) +
right_mask.sum()/n_samples * entropy(y[right_mask])
)
if gain > best_gain:
best_gain = gain
best_feature = feature_idx
best_threshold = threshold
return best_feature, best_threshold, best_gain
Impurity Functions Comparison
def gini_impurity(y):
_, counts = np.unique(y, return_counts=True)
probs = counts / len(y)
return 1 - np.sum(probs ** 2)
def entropy(y):
_, counts = np.unique(y, return_counts=True)
probs = counts / len(y)
return -np.sum(probs * np.log2(probs + 1e-10))
def variance(y):
return np.var(y)
distributions = [
np.array([1, 1, 1, 1, 0]),
np.array([1, 1, 1, 1, 1]),
np.array([1, 0, 1, 0, 1]),
np.array([1, 1, 1, 0, 0]),
]
print("Distribution | Gini | Entropy")
print("-" * 45)
for dist in distributions:
print(f"{str(dist):15s} | {gini_impurity(dist):.3f} | {entropy(dist):.3f}")
Python Implementation
Single Decision Tree
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_text
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target
feature_names = iris.feature_names
class_names = iris.target_names
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
dt_classifier = DecisionTreeClassifier(
criterion='gini',
max_depth=None,
min_samples_split=2,
min_samples_leaf=1,
random_state=42
)
dt_classifier.fit(X_train, y_train)
y_pred = dt_classifier.predict(X_test)
print("Decision Tree Performance:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=class_names))
print("\nDecision Tree Rules:")
tree_rules = export_text(dt_classifier, feature_names=feature_names)
print(tree_rules)
Overfitting in Decision Trees
The Problem
Without constraints, trees grow until nodes are pure, memorizing training data:
Training Accuracy vs Tree Depth
Accuracy
^
100%| ___________ Training
| /
| /
90%| /
| / ________ Validation
| / /
80%| / /
| / /
70%| / /
| / /
60%| / /
| / /
50%|__/________/
+------------------------------------> Tree Depth
1 2 3 4 5 6 7 8 9 10
Optimal depth: ~4-5 (gap between training and validation)
Quantifying Overfitting
from sklearn.model_selection import cross_val_score
depths = range(1, 21)
train_scores = []
cv_scores = []
for depth in depths:
dt = DecisionTreeClassifier(max_depth=depth, random_state=42)
dt.fit(X_train, y_train)
train_scores.append(accuracy_score(y_train, dt.predict(X_train)))
cv_scores.append(cross_val_score(dt, X_train, y_train, cv=5).mean())
optimal_depth = depths[np.argmax(cv_scores)]
print(f"Optimal depth: {optimal_depth}")
Pruning Strategies
Pre-Pruning (Early Stopping)
Stop growing the tree before it overfits:
| Parameter | Description | Typical Values |
|---|---|---|
max_depth | Maximum tree depth | 3-20 |
min_samples_split | Min samples to split a node | 2-50 |
min_samples_leaf | Min samples in leaf node | 1-20 |
max_features | Number of features to consider | sqrt(n) or log(n) |
Post-Pruning (Cost Complexity Pruning)
Grow full tree, then remove branches that don't improve generalization.
Minimal Cost-Complexity Pruning:
path = dt_classifier.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas
trees = []
for ccp_alpha in ccp_alphas:
dt = DecisionTreeClassifier(ccp_alpha=ccp_alpha, random_state=42)
dt.fit(X_train, y_train)
trees.append(dt)
cv_scores = [cross_val_score(t, X_train, y_train, cv=5).mean() for t in trees]
optimal_idx = np.argmax(cv_scores)
optimal_alpha = ccp_alphas[optimal_idx]
print(f"Optimal ccp_alpha: {optimal_alpha:.6f}")
print(f"Best CV accuracy: {cv_scores[optimal_idx]:.4f}")
Pruning Comparison
Full Tree (α=0) Pruned (α=0.01) Highly Pruned (α=0.1)
+---+---+---+---+ +-------+-------+ +---------------+
| A | A | B | B | | A | B | | A |
+---+---+---+---+ +-------+-------+ +---------------+
| A | B | B | B | | B | B | | B |
+---+---+---+---+ +-------+-------+ +---------------+
16 leaves 4 leaves 2 leaves
Overfits Good balance Underfits
Feature Importance
Mean Decrease in Impurity (MDI)
Feature Importance (MDI)
Here,
- =Proportion of samples reaching node t
- =Impurity decrease at node t
importances = dt_classifier.feature_importances_
indices = np.argsort(importances)[::-1]
print("\nFeature Importance Ranking:")
for i, idx in enumerate(indices):
print(f"{i+1}. {feature_names[idx]}: {importances[idx]:.4f}")
Limitations of MDI Importance
- Biased toward high-cardinality features
- Can be misleading with correlated features
- Alternative: permutation importance
Complete Python Example
📝Decision Tree with Hyperparameter Tuning
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report
np.random.seed(42)
n = 800
data = pd.DataFrame({
'Pclass': np.random.choice([1, 2, 3], n, p=[0.24, 0.21, 0.55]),
'Sex': np.random.choice(['male', 'female'], n, p=[0.65, 0.35]),
'Age': np.random.normal(30, 14, n).clip(0.5, 80),
'SibSp': np.random.choice([0, 1, 2, 3, 4], n, p=[0.6, 0.2, 0.1, 0.05, 0.05]),
'Parch': np.random.choice([0, 1, 2, 3], n, p=[0.7, 0.15, 0.1, 0.05]),
'Fare': np.random.exponential(30, n)
})
survival_prob = (
0.3 * (data['Sex'] == 'female').astype(int) +
0.2 * (data['Pclass'] == 1).astype(int) -
0.15 * (data['Age'] > 50).astype(int) +
0.1 * (data['Fare'] > 50).astype(int) +
0.05 * (data['SibSp'] <= 1).astype(int) +
np.random.normal(0, 0.2, n)
)
data['Survived'] = (survival_prob > 0.4).astype(int)
data['Sex_encoded'] = (data['Sex'] == 'female').astype(int)
features = ['Pclass', 'Sex_encoded', 'Age', 'SibSp', 'Parch', 'Fare']
X = data[features]
y = data['Survived']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
param_grid = {
'max_depth': [3, 5, 7, 10, None],
'min_samples_split': [2, 10, 20],
'min_samples_leaf': [1, 5, 10],
'criterion': ['gini', 'entropy']
}
grid_search = GridSearchCV(
DecisionTreeClassifier(random_state=42),
param_grid, cv=5, scoring='accuracy', n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV accuracy: {grid_search.best_score_:.4f}")
best_dt = grid_search.best_estimator_
y_pred = best_dt.predict(X_test)
print(f"\nTest Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"\nDecision Tree Rules:")
print(export_text(best_dt, feature_names=features))
Key Takeaways
📋Summary: Decision Trees
- Intuitive and interpretable — easy to explain decisions to non-technical stakeholders
- No feature scaling required — splits are based on thresholds
- Handles both classification and regression — same algorithm, different impurity measures
- Prone to overfitting — use pruning or ensemble methods
- Gini vs Entropy: Results are usually similar; Gini is slightly faster
- Cost-complexity pruning uses R_\\alpha(T) = R(T) + \\alpha|T| to balance fit and complexity
- Greedy algorithm — finds locally optimal splits, not globally optimal tree
- Foundation for ensembles — Random Forest and Gradient Boosting build on decision trees
Practice Exercises
Exercise 1: Manual Split Calculation
Given a node with samples: [A, A, A, B, B]
a) Calculate Gini impurity b) Calculate Entropy c) If splitting produces left=[A, A, A] and right=[B, B], calculate Information Gain
Exercise 2: Pruning Comparison
# Train 5 trees with different max_depth values (1, 3, 5, 10, None)
# For each:
# - Calculate training accuracy
# - Calculate test accuracy
# - Count number of leaves
# Plot the results
Exercise 3: Feature Importance Stability
# Train 100 decision trees on bootstrap samples
# Collect feature importances from each
# Calculate mean and std of each feature's importance
# Which features are consistently important?
Exercise 4: Classification vs Regression
# On the same dataset:
# a) Train a classification tree
# b) Train a regression tree (treat target as continuous)
# c) Compare predictions and performance
# d) When might regression trees be preferred for classification?
Discussion Questions
- Why don't decision trees require feature scaling?
- How would you handle missing values in a decision tree?
- What are the advantages of using decision trees over logistic regression?