CW

Decision Trees: Gini, Entropy and Pruning

Module 8: Tree-Based ModelsFree Lesson

Advertisement

Decision Trees: Gini, Entropy and Pruning

Decision trees are non-parametric supervised learning models that partition feature space into regions, making predictions based on the majority class (classification) or mean value (regression) within each region. Their intuitive, tree-like structure makes them one of the most interpretable models in machine learning.


How Decision Trees Work

A decision tree recursively partitions the input space by selecting feature thresholds that maximize predictive purity. Each internal node represents a feature test, each branch represents the outcome of that test, and each leaf node represents a prediction.

Recursive Partitioning Process:

  1. Select the best split — evaluate all features and thresholds
  2. Partition the data — split into child nodes
  3. Repeat — recursively split child nodes until stopping criteria are met
  4. Assign predictions — leaves hold class labels or regression values
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# Train a decision tree
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

# Visualize
plt.figure(figsize=(12, 8))
plot_tree(tree, feature_names=feature_names, class_names=class_names, filled=True)
plt.show()

Decision Tree Structure

Age ≤ 30?Gini = 0.48YesNoIncome ≤ 50k?Gini = 0.32Student?Gini = 0.42YesNoYesNoBuy: Yes87% confidenceBuy: No62% confidenceBuy: Yes78% confidenceBuy: No91% confidenceTree LegendDecision Node (feature test)Leaf Node (prediction)Each node shows split criterion and Gini impurity

Impurity Measures

The quality of a split is measured by the reduction in impurity it produces. Three primary impurity measures are used in practice.

Gini Impurity

The Gini impurity measures the probability of incorrectly classifying a randomly chosen element if it were labeled according to the class distribution of the dataset.

Gini(D)=1k=1Kpk2\text{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^2

where pkp_k is the proportion of class kk in dataset DD, and KK is the number of classes.

Properties:

  • Ranges from 0 (pure node) to 11K1 - \frac{1}{K} (maximum impurity)
  • For binary classification: maximum Gini = 0.5
  • Computationally efficient (no logarithms required)

Entropy

Entropy measures the expected amount of information (surprise) in the class distribution.

Entropy(D)=k=1Kpklog2(pk)\text{Entropy}(D) = -\sum_{k=1}^{K} p_k \log_2(p_k)

Properties:

  • Ranges from 0 (pure node) to log2(K)\log_2(K) (maximum impurity)
  • For binary classification: maximum Entropy = 1.0
  • Rooted in Shannon's information theory

Information Gain

Information Gain quantifies the reduction in entropy (or impurity) achieved by splitting on a feature AA.

IG(D,A)=Entropy(D)v=1VDvDEntropy(Dv)\text{IG}(D, A) = \text{Entropy}(D) - \sum_{v=1}^{V} \frac{|D_v|}{|D|} \cdot \text{Entropy}(D_v)

where DvD_v is the subset of DD where feature AA takes value vv, and VV is the number of distinct values of AA.

Gini vs Entropy Comparison

Class Probability (p)Impurity00.250.500.751.000.250.500.751.0Gini ImpurityEntropyGini max = 0.5Entropy max = 1.0

Key Insight: Both measures peak at p=0.5p = 0.5 for binary classification, but Entropy has a sharper peak. In practice, Gini and Entropy produce very similar trees; the difference rarely exceeds 1–2% in accuracy.


Splitting Criteria: Mathematical Framework

For a candidate split on feature AA with threshold tt:

Ginisplit(D,A,t)=DleftDGini(Dleft)+DrightDGini(Dright)\text{Gini}_{\text{split}}(D, A, t) = \frac{|D_{\text{left}}|}{|D|} \cdot \text{Gini}(D_{\text{left}}) + \frac{|D_{\text{right}}|}{|D|} \cdot \text{Gini}(D_{\text{right}})

Goal: Minimize Ginisplit\text{Gini}_{\text{split}} (or maximize Information Gain)

Gain Ratio (C4.5)

To avoid bias toward multi-valued features, C4.5 uses the Gain Ratio:

GainRatio(D,A)=IG(D,A)SplitInfo(D,A)\text{GainRatio}(D, A) = \frac{\text{IG}(D, A)}{\text{SplitInfo}(D, A)}

where:

SplitInfo(D,A)=v=1VDvDlog2DvD\text{SplitInfo}(D, A) = -\sum_{v=1}^{V} \frac{|D_v|}{|D|} \log_2 \frac{|D_v|}{|D|}

For Regression Trees

The splitting criterion minimizes the mean squared error (MSE):

MSEsplit=DleftDVar(Dleft)+DrightDVar(Dright)\text{MSE}_{\text{split}} = \frac{|D_{\text{left}}|}{|D|} \cdot \text{Var}(D_{\text{left}}) + \frac{|D_{\text{right}}|}{|D|} \cdot \text{Var}(D_{\text{right}})

Tree Building Algorithms

ID3 (Iterative Dichotomiser 3)

  • Uses Information Gain as splitting criterion
  • Handles only categorical features
  • No pruning — grows until pure leaves or no more splits
  • Prone to overfitting

C4.5 (Successor to ID3)

  • Uses Gain Ratio to reduce multi-valued feature bias
  • Handles continuous features via thresholding
  • Handles missing values through fractional instance weighting
  • Includes post-pruning via error-based pruning

CART (Classification and Regression Trees)

  • Uses Gini impurity (classification) or MSE (regression)
  • Produces binary trees only (2-way splits)
  • Supports both classification and regression
  • Uses cost-complexity pruning for generalization
from sklearn.tree import DecisionTreeClassifier

# CART with different criteria
cart_gini = DecisionTreeClassifier(criterion='gini', max_depth=5)
cart_entropy = DecisionTreeClassifier(criterion='entropy', max_depth=5)

cart_gini.fit(X_train, y_train)
cart_entropy.fit(X_train, y_train)

print(f"Gini accuracy:   {cart_gini.score(X_test, y_test):.4f}")
print(f"Entropy accuracy: {cart_entropy.score(X_test, y_test):.4f}")

Hyperparameters

ParameterDescriptionDefault
criterion'gini' or 'entropy' (classification), 'mse' (regression)'gini'
max_depthMaximum tree depthNone (unlimited)
min_samples_splitMinimum samples to split a node2
min_samples_leafMinimum samples in a leaf node1
max_featuresNumber of features to consider for best splitNone (all)
max_leaf_nodesMaximum number of leaf nodesNone (unlimited)
min_impurity_decreaseMinimum impurity decrease for a split0.0

Pruning Strategies

Pruning removes branches that provide little predictive power, reducing overfitting and improving generalization.

Before Pruning (Overfitting)RootNode ANode BC1C2D1D2E1E2F1F2→PruneAfter Pruning (Generalized)RootLeaf 1Leaf 2Pruning Benefits• Reduces overfitting• Improves generalization• Faster inference time

Pre-Pruning (Early Stopping)

Stop growing the tree before it becomes too complex.

  • max_depth — limit tree height
  • min_samples_split — require minimum samples to split
  • min_samples_leaf — require minimum samples in leaves
  • min_impurity_decrease — require minimum improvement

Advantage: Computationally efficient — avoids growing unnecessary branches.

Disadvantage: May stop too early (horizon effect) — a seemingly poor split now might lead to a good split later.

Post-Pruning

Grow the full tree first, then remove branches that don't improve generalization.

Reduced Error Pruning

  1. Grow tree to maximum depth
  2. For each non-leaf node (bottom-up):
    • Evaluate validation accuracy if subtree is replaced by a leaf
    • Prune if accuracy doesn't decrease

Cost-Complexity Pruning (CART)

Minimize the cost-complexity function:

Rα(T)=R(T)+αTR_\alpha(T) = R(T) + \alpha \cdot |T|

where:

  • R(T)R(T) is the misclassification rate (resubstitution error)
  • T|T| is the number of leaf nodes
  • α0\alpha \geq 0 is the complexity parameter

Optimal α\alpha is found via cross-validation:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

# Find optimal ccp_alpha
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas

cv_scores = []
for alpha in ccp_alphas:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    scores = cross_val_score(tree, X_train, y_train, cv=5, scoring='accuracy')
    cv_scores.append(scores.mean())

optimal_alpha = ccp_alphas[np.argmax(cv_scores)]
print(f"Optimal ccp_alpha: {optimal_alpha:.6f}")

# Train final model
pruned_tree = DecisionTreeClassifier(ccp_alpha=optimal_alpha, random_state=42)
pruned_tree.fit(X_train, y_train)

Overfitting vs Proper Fit

UnderfittingHigh bias, low varianceTrain acc: LowProper FitBalanced bias-varianceTrain acc: High, Test acc: HighOverfittingLow bias, high varianceTrain acc: High, Test acc: LowTree Complexity →↑ Optimal Complexity (via Pruning/Cross-Validation)

Advantages and Disadvantages

Advantages

AdvantageDescription
InterpretabilityVisual tree structure is easy to explain to stakeholders
No feature scalingInvariant to monotonic feature transformations
Handles mixed typesWorks with numerical and categorical features
Feature importanceBuilt-in feature ranking via impurity decrease
Non-parametricNo assumptions about data distribution
Fast inferenceO(logn)O(\log n) prediction time

Disadvantages

DisadvantageDescription
OverfittingProne to memorizing noise without regularization
InstabilitySmall data changes can produce different trees
Greedy optimizationLocally optimal splits ≠ globally optimal tree
Imbalanced dataBias toward majority classes
Axis-aligned splitsCannot efficiently represent diagonal boundaries
ExtrapolationCannot predict beyond training range (regression)

Implementation in Python

Complete Pipeline

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree, export_text
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt

# Load dataset
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# --- 1. Basic Decision Tree ---
basic_tree = DecisionTreeClassifier(random_state=42)
basic_tree.fit(X_train, y_train)

print(f"Depth: {basic_tree.get_depth()}")
print(f"Leaves: {basic_tree.get_n_leaves()}")
print(f"Train accuracy: {basic_tree.score(X_train, y_train):.4f}")
print(f"Test accuracy:  {basic_tree.score(X_test, y_test):.4f}")

# --- 2. Hyperparameter Tuning ---
param_grid = {
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': [2, 5, 10, 20],
    'min_samples_leaf': [1, 2, 5, 10],
    'criterion': ['gini', 'entropy'],
    'ccp_alpha': [0.0, 0.001, 0.01, 0.02, 0.05]
}

grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=0
)
grid_search.fit(X_train, y_train)

print(f"\nBest params: {grid_search.best_params_}")
print(f"Best CV accuracy: {grid_search.best_score_:.4f}")

best_tree = grid_search.best_estimator_
print(f"Test accuracy: {best_tree.score(X_test, y_test):.4f}")

# --- 3. Cost-Complexity Pruning Path ---
clf = DecisionTreeClassifier(random_state=42)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

fig, ax = plt.subplots(1, 2, figsize=(14, 5))

ax[0].plot(ccp_alphas, impurities, marker='o', drawstyle='steps-post')
ax[0].set_xlabel('ccp_alpha')
ax[0].set_ylabel('Total impurity of leaves')
ax[0].set_title('Effective Alphas vs Impurities')

trees = []
for ccp_alpha in ccp_alphas:
    tree = DecisionTreeClassifier(ccp_alpha=ccp_alpha, random_state=42)
    tree.fit(X_train, y_train)
    trees.append(tree)

train_scores = [t.score(X_train, y_train) for t in trees]
test_scores = [t.score(X_test, y_test) for t in trees]

ax[1].plot(ccp_alphas, train_scores, marker='o', label='train', drawstyle='steps-post')
ax[1].plot(ccp_alphas, test_scores, marker='o', label='test', drawstyle='steps-post')
ax[1].set_xlabel('ccp_alpha')
ax[1].set_ylabel('Accuracy')
ax[1].set_title('Accuracy vs Alpha')
ax[1].legend()

plt.tight_layout()
plt.show()

# --- 4. Feature Importance ---
importances = best_tree.feature_importances_
indices = np.argsort(importances)[::-1]

plt.figure(figsize=(10, 6))
plt.title('Feature Importances')
plt.bar(range(X.shape[1]), importances[indices], align='center')
plt.xticks(range(X.shape[1]), X.columns[indices], rotation=45, ha='right')
plt.tight_layout()
plt.show()

# --- 5. Text Representation ---
print(export_text(best_tree, feature_names=list(X.columns), max_depth=3))

Regression Trees

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Regression example
reg_tree = DecisionTreeRegressor(
    max_depth=5,
    min_samples_leaf=10,
    random_state=42
)
reg_tree.fit(X_train_reg, y_train_reg)

y_pred = reg_tree.predict(X_test_reg)
print(f"RMSE: {np.sqrt(mean_squared_error(y_test_reg, y_pred)):.4f}")
print(f"R²:   {r2_score(y_test_reg, y_pred):.4f}")

Key Takeaways

  1. Gini vs Entropy — both produce similar trees; Gini is slightly faster, Entropy tends to produce slightly more balanced trees
  2. Pre-pruning is computationally efficient but risks the horizon effect; post-pruning is more robust but requires a validation set
  3. Cost-complexity pruning with cross-validation is the standard approach for finding optimal tree size
  4. Decision trees are the foundation for ensemble methods (Random Forests, Gradient Boosting) which address individual tree limitations
  5. Feature importance from trees is based on impurity decrease and can be biased toward high-cardinality features — consider permutation importance as an alternative

Further Reading

  • Breiman, L. et al. (1984). Classification and Regression Trees. Wadsworth.
  • Quinlan, J.R. (1986). Induction of Decision Trees. Machine Learning, 1(1), 81–106.
  • Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
  • Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer. Chapter 9

Advertisement

Need Expert Data Science Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement