Cross-Validation and Bias-Variance Tradeoff

Module 7: Machine Learning Fundamentals

Why Cross-Validation?

The Holdout Method Limitation

The naive holdout approach splits data into training and test sets once. This has critical flaws:

[Figure: Holdout method with a single split into a training set (80%) and a test set (20%). Split A happens to yield a good test set and Split B a poor one. The performance estimate depends on an arbitrary split, so the evaluation metric has high variance.]

Key limitations:

  • Performance estimate has high variance (depends on single split)
  • Wastes data (test set never used for training)
  • Can't assess model stability
  • Risk of optimistic/pessimistic bias
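The split dependence listed above can be made concrete with a small simulation. This is only a sketch: the synthetic linear data, the helper name `holdout_mse`, and the choice of 20 seeds are assumptions made for illustration, not part of the lesson.

```python
import numpy as np

def holdout_mse(X, y, seed, test_frac=0.2):
    """Fit ordinary least squares on one random 80/20 split; return the test MSE."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_test = int(len(y) * test_frac)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    # Coefficients are estimated from the training rows only
    beta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    resid = y[test_idx] - X[test_idx] @ beta
    return float(np.mean(resid ** 2))

# Synthetic data (assumed): 100 samples, 5 features, unit-variance noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(size=100)

# Same data and same model, evaluated on 20 different random splits
holdout_scores = np.array([holdout_mse(X, y, seed) for seed in range(20)])
print(f"min={holdout_scores.min():.3f}  max={holdout_scores.max():.3f}  "
      f"std={holdout_scores.std():.3f}")
```

Nothing changes between runs except the split, so the spread between the minimum and maximum is purely an artifact of which rows landed in the test set.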

K-Fold Cross-Validation

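K-fold cross-validation is the standard approach to model evaluation. Split the data into $k$ folds, train on $k-1$ of them, test on the remaining one, and rotate until every fold has served as the test set once.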

[Figure: K-Fold Cross-Validation Process (k=5)]
Fold 1: Test  Train Train Train Train → Score₁
Fold 2: Train Test  Train Train Train → Score₂
Fold 3: Train Train Test  Train Train → Score₃
Fold 4: Train Train Train Test  Train → Score₄
Fold 5: Train Train Train Train Test  → Score₅
CV Score = (Score₁ + Score₂ + Score₃ + Score₄ + Score₅) / 5

Mathematical Formulation

For a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ partitioned into $k$ folds $\{F_1, F_2, \ldots, F_k\}$:

$$\text{CV}_{(k)} = \frac{1}{k} \sum_{i=1}^{k} \mathcal{L}\left(\hat{f}^{-F_i}, F_i\right)$$

where $\hat{f}^{-F_i}$ is the model trained on all data except fold $F_i$, and $\mathcal{L}$ is the loss function evaluated on the held-out fold.
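The formula above translates almost line for line into code. The following is a minimal sketch, assuming a least-squares model and mean squared error as the loss; the names `k_fold_cv`, `fit_ols`, and `mse` are hypothetical helpers, and the data is synthetic.

```python
import numpy as np

def k_fold_cv(X, y, k, fit, loss, seed=0):
    """Return (CV_(k), per-fold losses) for the given fit and loss functions."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)   # F_1, ..., F_k
    fold_losses = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])          # model trained without F_i
        fold_losses.append(loss(model, X[test_idx], y[test_idx]))
    return float(np.mean(fold_losses)), fold_losses

def fit_ols(X, y):
    """Ordinary least squares; the 'model' is just the coefficient vector."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def mse(beta, X, y):
    return float(np.mean((y - X @ beta) ** 2))

# Synthetic regression data (assumed): noise standard deviation 0.5
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

cv_score, fold_losses = k_fold_cv(X, y, k=5, fit=fit_ols, loss=mse)
print(f"CV_(5) = {cv_score:.4f}")
print("per-fold:", [round(v, 4) for v in fold_losses])
```

Passing `fit` and `loss` as arguments keeps the rotation logic separate from the model, which mirrors how the formula treats the fitted model and the loss as interchangeable parts.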

Choice of k

| k | Pros | Cons |
|---|------|------|
| 5 | Good bias-variance tradeoff | Standard choice |
| 10 | Lower bias estimate | Higher computational cost |
| $n$ (LOO) | Nearly unbiased | High variance, expensive |
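The cost and bias columns of the table follow from simple arithmetic: k-fold needs k model fits, and each fit sees a fraction (k-1)/k of the data. A small sketch, where `k_fold_budget` is a hypothetical helper and n = 1000 is an assumed dataset size:

```python
def k_fold_budget(n_samples, k):
    """Return (number of model fits, fraction of the data used to train each fit)."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    return k, (k - 1) / k

n = 1000  # assumed dataset size
for k in (5, 10, n):
    fits, frac = k_fold_budget(n, k)
    print(f"k={k:>4}: {fits:>4} fits, each trained on {frac:.1%} of the data")
```

As k grows, each model is trained on nearly the full dataset, which is why the error estimate becomes less pessimistic, while the number of fits grows linearly with k.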

Stratified K-Fold

Stratified K-Fold ensures each fold maintains the same class distribution as the full dataset, which is critical for imbalanced problems.

[Figure: Regular vs Stratified K-Fold on an imbalanced dataset (90% Class A, 10% Class B)]

Regular K-Fold: Fold 1: 95% A, 5% B; Fold 2: 88% A, 12% B; Fold 3: 100% A, 0% B; Fold 4: 77% A, 23% B. Fold 3 has no Class B samples, so the minority class is never evaluated on that fold.

Stratified K-Fold: Folds 1 to 4 each contain 90% A, 10% B. Each fold preserves the class distribution, which gives a reliable evaluation.

scikit-learn: StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
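The idea behind stratification can be sketched without any library: group sample indices by class, then deal each class across the folds in turn. This is an illustrative simplification, not scikit-learn's implementation; `stratified_folds` is a hypothetical helper and it does no shuffling.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, round-robin within each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal this class's samples across the folds like cards
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# 90% class "A" and 10% class "B", as in the example above
labels = ["A"] * 180 + ["B"] * 20
folds = stratified_folds(labels, k=4)
for i, fold in enumerate(folds, start=1):
    counts = Counter(labels[j] for j in fold)
    print(f"Fold {i}: {counts['A']} A, {counts['B']} B")
```

Every fold ends up with 45 samples of A and 5 of B, so the 90/10 ratio is preserved exactly. In practice the indices within each class would be shuffled first, which is what `shuffle=True` requests in `StratifiedKFold`.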

Leave-One-Out (LOO) Cross-Validation

A special case of K-Fold where $k = n$ (the number of samples):

$$\text{LOO-CV} = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\left(\hat{f}^{-x_i}, (x_i, y_i)\right)$$

Characteristics:

  • Nearly unbiased estimate of generalization error
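  • High variance (the training sets are nearly identical, so the per-sample errors are highly correlated and averaging them reduces variance only slightly)
  • Computationally expensive: $n$ model fits, i.e. $O(n)$ times the cost of a single fit
  • Asymptotically equivalent to AIC for linear models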
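The cost listed above is easy to see in a brute-force loop. The sketch below assumes ordinary least squares with squared-error loss on synthetic data. For that specific model there is a well-known closed form, in which the leave-one-out residual equals the ordinary residual divided by one minus the leverage, and the code uses it only as a cross-check on the loop.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
# Synthetic data (assumed): intercept column plus two features
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

# Brute force: n separate fits, each leaving out exactly one sample
n_fits = 0
loo_errors = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    n_fits += 1
    loo_errors[i] = (y[i] - X[i] @ beta) ** 2
loo_cv = float(loo_errors.mean())

# Cross-check for least squares only: a single fit plus the hat-matrix diagonal h_ii
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
hat_diag = np.einsum("ij,ji->i", X, np.linalg.pinv(X))
loo_shortcut = float(np.mean(((y - X @ beta_full) / (1.0 - hat_diag)) ** 2))

print(f"fits performed: {n_fits}")
print(f"brute-force LOO-CV: {loo_cv:.6f}")
print(f"closed-form check:  {loo_shortcut:.6f}")
```

The shortcut applies to linear least squares and does not carry over to arbitrary models, for which the n separate fits are generally unavoidable.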

Time Series Cross-Validation

Standard K-Fold violates temporal ordering. Use expanding or sliding windows instead.

[Figure: Time Series Cross-Validation (Expanding Window), time running left to right]
Split 1: Train (t₁-t₆), Test (t₇)
Split 2: Train (t₁-t₈), Test (t₉)
Split 3: Train (t₁-t₁₀), Test (t₁₁)
Split 4: Train (t₁-t₁₂), Test (t₁₃)

Critical rule: never use future data to predict the past. Each training set is a prefix of the data, and the test set always follows the training period.
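An expanding-window splitter is short enough to write by hand. The sketch below reproduces the four splits described above; `expanding_window_splits` and its parameters are hypothetical names chosen for this example.

```python
def expanding_window_splits(n_samples, initial_train, test_size, step):
    """Yield (train_indices, test_indices); every training set is a prefix of the data."""
    train_end = initial_train
    while train_end + test_size <= n_samples:
        train = list(range(train_end))
        test = list(range(train_end, train_end + test_size))
        yield train, test
        train_end += step

splits = list(expanding_window_splits(n_samples=13, initial_train=6,
                                      test_size=1, step=2))
for i, (train, test) in enumerate(splits, start=1):
    # +1 converts 0-based indices to the t1..t13 labels used above
    print(f"Split {i}: train t1-t{train[-1] + 1}, test t{test[0] + 1}")
```

This prints four splits, from "train t1-t6, test t7" up to "train t1-t12, test t13". A sliding window would differ only in dropping the oldest observations so that the training length stays fixed.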

Bias-Variance Tradeoff

Mathematical Decomposition

Assume $y = f(x) + \epsilon$, where $f$ is the true function and $\epsilon$ is zero-mean noise with variance $\sigma^2_{\epsilon}$. For a model $\hat{f}$ trained on dataset $\mathcal{D}$, the expected prediction error at point $x$ (with the expectation taken over training sets and noise) decomposes as:

$$\mathbb{E}\left[\left(y - \hat{f}(x)\right)^2\right] = \underbrace{\text{Bias}^2\left(\hat{f}(x)\right)}_{\text{Systematic error}} + \underbrace{\text{Var}\left(\hat{f}(x)\right)}_{\text{Sensitivity to data}} + \underbrace{\sigma^2_{\epsilon}}_{\text{Irreducible noise}}$$

where:

$$\text{Bias}\left(\hat{f}(x)\right) = \mathbb{E}\left[\hat{f}(x)\right] - f(x)$$

$$\text{Var}\left(\hat{f}(x)\right) = \mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]$$
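The decomposition can be checked numerically by refitting a model on many simulated training sets and inspecting its predictions at one point. This is a sketch under stated assumptions: the true function sin(2πx), the noise level 0.3, the evaluation point 0.25, and the polynomial degrees 1 and 5 are all illustrative choices, and `decompose` is a hypothetical helper.

```python
import numpy as np

def true_f(x):
    return np.sin(2 * np.pi * x)

def decompose(degree, x0=0.25, n_train=30, sigma=0.3, n_trials=500, seed=0):
    """Estimate bias^2 and variance of a polynomial fit at x0 over many training sets."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_train)      # fixed inputs; only the noise is redrawn
    preds = np.empty(n_trials)
    for t in range(n_trials):
        y = true_f(x) + rng.normal(scale=sigma, size=n_train)
        preds[t] = np.polyval(np.polyfit(x, y, degree), x0)
    bias_sq = (preds.mean() - true_f(x0)) ** 2
    variance = preds.var()
    # Mean squared distance of the predictions from the noiseless truth
    mse_vs_truth = np.mean((preds - true_f(x0)) ** 2)
    return bias_sq, variance, mse_vs_truth

sigma = 0.3
results = {d: decompose(d, sigma=sigma) for d in (1, 5)}
for d, (bias_sq, variance, mse_vs_truth) in results.items():
    total = bias_sq + variance + sigma ** 2
    print(f"degree {d}: bias^2={bias_sq:.4f}  var={variance:.4f}  "
          f"noise={sigma ** 2:.4f}  total={total:.4f}")
```

In this setup the straight line should show the larger squared bias and the degree-5 polynomial the larger variance, which is the tradeoff in miniature. The noise term is added analytically because it does not depend on the model.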

Intuition

[Figure: Bias-variance target analogy. Panels: low bias, high variance (high model complexity); high bias, high variance (wrong model family); low bias, low variance (the sweet spot).]

The tradeoff: increasing model complexity reduces bias but increases variance. The optimal complexity minimizes total error = Bias² + Variance + Noise.

Underfitting vs Overfitting Diagnosis

[Figure: Learning curves for diagnosing model problems. Each panel plots the training and validation score against training size.]

  • Underfitting (high bias): both curves plateau at a low score
  • Good fit: the curves converge at a high score with a small gap
  • Overfitting (high variance): a large gap remains between the training and validation curves
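A learning curve can be computed by hand by refitting a model at several training sizes. The sketch below assumes a correctly specified least-squares model on synthetic data, so it illustrates the "good fit" pattern. It tracks error (MSE, lower is better) rather than score, so the validation curve sits above the training curve. `learning_curve_ols` is a hypothetical helper.

```python
import numpy as np

def learning_curve_ols(train_sizes, n_features=5, n_val=2000, n_repeats=20, seed=0):
    """Average training and validation MSE of least squares at each training size."""
    rng = np.random.default_rng(seed)
    beta_true = rng.normal(size=n_features)
    train_mse = np.zeros(len(train_sizes))
    val_mse = np.zeros(len(train_sizes))
    for _ in range(n_repeats):
        # A fresh validation set per repeat, drawn from the same distribution
        X_val = rng.normal(size=(n_val, n_features))
        y_val = X_val @ beta_true + rng.normal(size=n_val)
        for j, n in enumerate(train_sizes):
            X = rng.normal(size=(n, n_features))
            y = X @ beta_true + rng.normal(size=n)
            beta, *_ = np.linalg.lstsq(X, y, rcond=None)
            train_mse[j] += np.mean((y - X @ beta) ** 2)
            val_mse[j] += np.mean((y_val - X_val @ beta) ** 2)
    return train_mse / n_repeats, val_mse / n_repeats

sizes = [10, 50, 200, 800]
train_mse, val_mse = learning_curve_ols(sizes)
gaps = val_mse - train_mse
for n, tr, va, gap in zip(sizes, train_mse, val_mse, gaps):
    print(f"n={n:4d}: train={tr:.3f}  val={va:.3f}  gap={gap:.3f}")
```

With little data the model fits the training rows far better than unseen rows. As the training size grows, both errors approach the noise level and the gap closes.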

Diagnostic Summary

| Symptom | Diagnosis | Remedy |
|---------|-----------|--------|
| Train acc ≫ Val acc | Overfitting | Regularization, more data, simpler model |
| Train acc ≈ Val acc (both low) | Underfitting | More features, more complex model |
| High CV variance | Unstable model | More data, simpler model, ensemble |
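The table can be turned into a rough rule-of-thumb checker. The thresholds below (`gap_tol`, `low_score`, `std_tol`) are arbitrary assumptions chosen for illustration and would need tuning for a given problem; `diagnose` is a hypothetical helper.

```python
def diagnose(train_acc, val_acc, cv_std=0.0,
             gap_tol=0.05, low_score=0.70, std_tol=0.05):
    """Map scores to the diagnoses in the table; all thresholds are illustrative."""
    findings = []
    if train_acc - val_acc > gap_tol:
        findings.append("overfitting")        # Train acc >> Val acc
    elif train_acc < low_score and val_acc < low_score:
        findings.append("underfitting")       # Train acc ~ Val acc, both low
    if cv_std > std_tol:
        findings.append("unstable")           # High variance across CV folds
    return findings or ["no obvious problem"]

print(diagnose(0.99, 0.80))               # ['overfitting']
print(diagnose(0.62, 0.60))               # ['underfitting']
print(diagnose(0.90, 0.88, cv_std=0.10))  # ['unstable']
```

What counts as a "low" score depends on the task and the achievable accuracy, so the fixed 0.70 cutoff here is only a placeholder.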

Model Selection with Cross-Validation

Use nested cross-validation to avoid optimistic bias when selecting hyperparameters:

Outer Loop: Evaluate generalization
┌─────────────────────────────────────────────┐
│  Split into Train/Test                      │
│                                             │
│  Inner Loop: Hyperparameter Tuning          │
│  ┌───────────────────────────────────────┐  │
│  │  Split Train into Train/Val           │  │
│  │  Try all hyperparameter combinations  │  │
│  │  Select best by validation score      │  │
│  └───────────────────────────────────────┘  │
│                                             │
│  Train final model with best params on Train│
│  Evaluate on outer Test                     │
└─────────────────────────────────────────────┘
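The two loops in the diagram can be written out explicitly. The following is a minimal sketch, assuming closed-form ridge regression with mean squared error and a small grid of `alpha` values; all function names and the synthetic data are hypothetical, chosen to keep the example self-contained.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression without an intercept."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def mse(beta, X, y):
    return float(np.mean((y - X @ beta) ** 2))

def split_folds(n, k, rng):
    """Shuffle indices 0..n-1 and cut them into k folds."""
    return np.array_split(rng.permutation(n), k)

def cv_mse(X, y, alpha, folds):
    """Mean held-out MSE of ridge(alpha) over the given folds."""
    losses = []
    for i, val in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        losses.append(mse(ridge_fit(X[train], y[train], alpha), X[val], y[val]))
    return float(np.mean(losses))

def nested_cv(X, y, alphas, outer_k=5, inner_k=3, seed=0):
    rng = np.random.default_rng(seed)
    outer_folds = split_folds(len(y), outer_k, rng)
    outer_scores, chosen = [], []
    for i, test in enumerate(outer_folds):
        train = np.concatenate([f for j, f in enumerate(outer_folds) if j != i])
        X_tr, y_tr = X[train], y[train]
        # Inner loop: tune alpha using the outer training portion only
        inner_folds = split_folds(len(train), inner_k, rng)
        inner_losses = [cv_mse(X_tr, y_tr, a, inner_folds) for a in alphas]
        best = alphas[int(np.argmin(inner_losses))]
        # Refit on the whole outer training portion, score on the untouched test fold
        outer_scores.append(mse(ridge_fit(X_tr, y_tr, best), X[test], y[test]))
        chosen.append(best)
    return outer_scores, chosen

# Synthetic regression data (assumed)
rng = np.random.default_rng(2)
X = rng.normal(size=(150, 10))
y = X @ rng.normal(size=10) + rng.normal(size=150)
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]

outer_scores, chosen = nested_cv(X, y, alphas)
print(f"nested CV MSE: {np.mean(outer_scores):.3f} (std {np.std(outer_scores):.3f})")
print("alpha chosen per outer fold:", chosen)
```

The outer test fold never influences the choice of `alpha`, which is what removes the optimistic bias. The selected `alpha` may differ between outer folds, and that is expected, because nested CV evaluates the tuning procedure rather than one fixed setting.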

Implementation in Python

import numpy as np
from sklearn.model_selection import (
    KFold, StratifiedKFold, TimeSeriesSplit,
    cross_val_score, GridSearchCV, learning_curve
)
from sklearn.linear_model import RidgeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, n_classes=2,
                           random_state=42)

# --- Basic K-Fold ---
# RidgeClassifier rather than Ridge: accuracy needs class predictions, and Ridge is a regressor
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RidgeClassifier(alpha=1.0), X, y, cv=kfold, scoring='accuracy')
print(f"K-Fold CV: {scores.mean():.4f} ± {scores.std():.4f}")

# --- Stratified K-Fold ---
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=42),
                         X, y, cv=skfold, scoring='accuracy')
print(f"Stratified CV: {scores.mean():.4f} ± {scores.std():.4f}")

# --- Nested CV for model selection ---
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [5, 10, 20, None]
}

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])

grid_search = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring='accuracy')
nested_scores = cross_val_score(grid_search, X, y, cv=outer_cv, scoring='accuracy')
print(f"Nested CV: {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")

# --- Time Series CV ---
# TimeSeriesSplit preserves row order; X here only stands in for time-ordered data
tscv = TimeSeriesSplit(n_splits=5)
for i, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f"Split {i+1}: Train={len(train_idx)}, Test={len(test_idx)}")

# --- Learning Curves ---
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=100), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring='accuracy', n_jobs=-1
)

print(f"\nLearning Curve (sample sizes):")
for size, train, val in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"  n={size:4d}: train={train:.3f}, val={val:.3f}, gap={train-val:.3f}")

Key Takeaways

  1. Always use cross-validation: single-split holdout estimates are unreliable
  2. Stratified K-Fold is essential for classification (especially imbalanced)
  3. Time series require temporal ordering: never shuffle
  4. Bias-variance tradeoff is fundamental: optimize the total error, not just bias
  5. Learning curves reveal whether you need more data, more features, or regularization
  6. Nested CV avoids optimistic bias in model selection
  7. Variance of CV scores matters: high variance signals instability
