Cross-Validation & Bias-Variance Tradeoff
Why Cross-Validation?
When building machine learning models, we need reliable estimates of how well a model will perform on unseen data. Evaluating on the training data alone gives an overly optimistic estimate: a model that has memorized the training set rather than learned generalizable relationships (overfitting) will still score well on it.
The Core Problem
┌─────────────────────────────────────────────────────────┐
│ MODEL EVALUATION │
├─────────────────────────────────────────────────────────┤
│ │
│ Training Data: Model learns patterns here │
│ ↓ │
│ Test Data: How do we know if it works? │
│ ↓ │
│ Problem: We only have ONE dataset! │
│ ↓ │
│ Solution: Cross-Validation │
│ │
└─────────────────────────────────────────────────────────┘
Cross-validation is a resampling technique that:
- Splits data into multiple subsets (folds)
- Trains on some folds, validates on others
- Repeats the process systematically
- Provides robust performance estimates
K-Fold Cross-Validation
The most common approach: split the data into K folds of roughly equal size, then let each fold serve as the validation set exactly once.
Visual Representation
Dataset split into 5 folds:
Fold 1: [VAL] [TRN] [TRN] [TRN] [TRN] -> Score_1
Fold 2: [TRN] [VAL] [TRN] [TRN] [TRN] -> Score_2
Fold 3: [TRN] [TRN] [VAL] [TRN] [TRN] -> Score_3
Fold 4: [TRN] [TRN] [TRN] [VAL] [TRN] -> Score_4
Fold 5: [TRN] [TRN] [TRN] [TRN] [VAL] -> Score_5
Final Score = (Score_1 + Score_2 + Score_3 + Score_4 + Score_5) / 5
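The fold rotation above can be sketched by hand in NumPy. This is a minimal illustration rather than a reference implementation: the helper name `k_fold_scores`, the synthetic data, and the choice of ordinary least squares as the model are all assumptions made for the example.

```python
import numpy as np

def k_fold_scores(X, y, k=5, seed=0):
    """Return one R^2 score per fold for an ordinary least-squares fit."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))       # shuffle once before splitting
    folds = np.array_split(indices, k)      # k roughly equal index groups
    scores = []
    for i in range(k):
        val_idx = folds[i]                  # fold i is held out for validation
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit least squares (with an intercept column) on the training folds only
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        # Score on the held-out fold
        A_val = np.column_stack([np.ones(len(val_idx)), X[val_idx]])
        residual = y[val_idx] - A_val @ coef
        ss_res = np.sum(residual ** 2)
        ss_tot = np.sum((y[val_idx] - y[val_idx].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return np.array(scores)

# Hypothetical synthetic data: linear signal plus Gaussian noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

scores = k_fold_scores(X, y, k=5)
final_score = scores.mean()                 # (Score_1 + ... + Score_5) / 5
print(scores.round(3), round(final_score, 3))
```

Every sample is used for validation exactly once and for training K-1 times, which is why the averaged score uses all of the data without ever scoring a model on points it was fitted to.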
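Mathematical Formulation
The K-Fold CV estimate is the average of the per-fold scores:
$$\text{CV}_K = \frac{1}{K} \sum_{k=1}^{K} \text{Score}_k$$
Bias-Variance of K-Fold CV
The K-Fold CV estimator has the following bias-variance properties:
As $K$ increases, the bias decreases (more training data per fold) but the variance tends to increase (the training sets overlap more, so the per-fold estimates are more strongly correlated). $K = 5$ or $K = 10$ provides a good balance.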
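The "more training data per fold" effect is simple arithmetic, sketched below. The helper `fold_sizes` is a hypothetical name introduced for illustration, and the sample count of 500 is chosen only because it divides evenly by each K shown.

```python
def fold_sizes(n_samples, k):
    """Return (train_size, val_size) for one split when k divides n_samples evenly."""
    val_size = n_samples // k
    return n_samples - val_size, val_size

n_samples = 500
for k in (2, 5, 10, n_samples):             # k == n_samples is leave-one-out
    train_size, val_size = fold_sizes(n_samples, k)
    print(f"K={k:>3}: train on {train_size} ({train_size / n_samples:.1%}), "
          f"validate on {val_size}, model fits: {k}")
```

Going from K=2 to K=10 raises the training share from 50% to 90% of the data, at the price of five times as many model fits. Beyond K=10 the training share grows only slowly while the cost keeps rising.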
Complete Python Implementation
K-Fold Cross-Validation Strategies
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

np.random.seed(42)

# 500 samples with 20 generated features, plus 10 extra columns of pure noise
X, y = make_regression(n_samples=500, n_features=20, noise=25, random_state=42)
X_noisy = np.column_stack([X, np.random.randn(500, 10)])

# 1. Basic K-Fold Cross-Validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

models = {
    'Linear Regression': LinearRegression(),
    'Ridge (α=1.0)': Ridge(alpha=1.0),
    'Random Forest': RandomForestRegressor(n_estimators=50, random_state=42)
}

# Score every model on the same five splits so the results are comparable
for name, model in models.items():
    cv_scores = cross_val_score(model, X_noisy, y, cv=kfold, scoring='r2', n_jobs=-1)
    print(f"{name}:")
    print(f"  Mean R²: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
Bias-Variance Tradeoff
The bias-variance tradeoff is fundamental to understanding model performance.
The Decomposition
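For any model $\hat{f}$, the expected prediction error at a point $x$ can be decomposed:
$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \text{Bias}[\hat{f}(x)]^2 + \text{Var}[\hat{f}(x)] + \sigma^2$$
where $\sigma^2$ is the irreducible noise in $y$.
Mathematical Definition
Given true function $f(x)$ and model $\hat{f}(x)$:
Bias (systematic error): $\text{Bias}[\hat{f}(x)] = \mathbb{E}[\hat{f}(x)] - f(x)$
Variance (sensitivity to training data): $\text{Var}[\hat{f}(x)] = \mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]$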
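Bias and variance can be estimated empirically by refitting the same model on many independently drawn training sets. The sketch below is one way to do that under stated assumptions: the sine target, noise level, and training-set size mirror the analysis code in this section, while the helper name `bias_variance`, the number of repeats, and the evaluation grid are choices made for this example. Error is measured against the noise-free true function, so the irreducible noise term does not appear.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def true_function(x):
    return np.sin(1.5 * np.pi * x)

def bias_variance(degree, n_repeats=200, n_train=30, noise_sd=0.3, seed=0):
    """Estimate average bias^2 and variance of a polynomial fit of the given degree."""
    rng = np.random.default_rng(seed)
    x_eval = np.linspace(0.05, 0.95, 50)    # fixed evaluation points
    preds = np.empty((n_repeats, x_eval.size))
    for r in range(n_repeats):
        # Draw a fresh training set from the same data-generating process
        x = rng.uniform(0, 1, n_train)
        y = true_function(x) + rng.normal(0, noise_sd, n_train)
        coefs = P.polyfit(x, y, degree)
        preds[r] = P.polyval(x_eval, coefs)
    mean_pred = preds.mean(axis=0)          # approximates the expected prediction
    bias_sq = np.mean((mean_pred - true_function(x_eval)) ** 2)
    variance = np.mean(preds.var(axis=0))   # spread of predictions across training sets
    mse_vs_truth = np.mean((preds - true_function(x_eval)) ** 2)
    return bias_sq, variance, mse_vs_truth

results = {}
for degree in (1, 3, 9):
    results[degree] = bias_variance(degree)
    b2, var, mse = results[degree]
    print(f"degree={degree}: bias^2={b2:.4f}  variance={var:.4f}  "
          f"sum={b2 + var:.4f}  mse={mse:.4f}")
```

The `sum` and `mse` columns agree because the squared error against the true function splits exactly into squared bias plus variance. Low degrees should be dominated by bias and high degrees by variance.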
Visual Explanation
HIGH BIAS, LOW VARIANCE (Underfitting): Simple Model
• Misses true patterns
• Consistent predictions
• High training error
• High test error

LOW BIAS, HIGH VARIANCE (Overfitting): Complex Model
• Captures noise
• Highly variable predictions
• Low training error
• High test error

LOW BIAS, LOW VARIANCE (Optimal Model): Balanced Model
• Captures true patterns
• Consistent predictions
• Low training error
• Low test error
Complete Bias-Variance Analysis
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

np.random.seed(42)

def true_function(x):
    return np.sin(1.5 * np.pi * x)

# Small noisy training set and a larger test set from the same process
n_train = 30
X_train = np.sort(np.random.uniform(0, 1, n_train))
y_train = true_function(X_train) + np.random.normal(0, 0.3, n_train)
n_test = 100
X_test = np.sort(np.random.uniform(0, 1, n_test))
y_test = true_function(X_test) + np.random.normal(0, 0.3, n_test)

# Polynomial degrees from too simple (1) to far too flexible (20)
degrees = [1, 3, 5, 10, 15, 20]
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.ravel()

for i, degree in enumerate(degrees):
    # Expand x into polynomial features, fitting the transformer on training data only
    poly = PolynomialFeatures(degree=degree)
    X_poly_train = poly.fit_transform(X_train.reshape(-1, 1))
    X_poly_test = poly.transform(X_test.reshape(-1, 1))

    model = LinearRegression()
    model.fit(X_poly_train, y_train)

    y_pred_train = model.predict(X_poly_train)
    y_pred_test = model.predict(X_poly_test)
    train_mse = mean_squared_error(y_train, y_pred_train)
    test_mse = mean_squared_error(y_test, y_pred_test)

    # Plot the fitted curve against the data and the true function
    ax = axes[i]
    X_plot = np.linspace(0, 1, 100)
    X_poly_plot = poly.transform(X_plot.reshape(-1, 1))
    y_plot = model.predict(X_poly_plot)
    ax.scatter(X_train, y_train, alpha=0.6, label='Train', s=20)
    ax.scatter(X_test, y_test, alpha=0.3, label='Test', s=20)
    ax.plot(X_plot, true_function(X_plot), 'g--', label='True', linewidth=2)
    ax.plot(X_plot, y_plot, 'r-', label=f'Degree {degree}', linewidth=2)
    ax.set_title(f'Degree {degree}\nTrain MSE: {train_mse:.4f}\nTest MSE: {test_mse:.4f}')
    ax.legend(loc='upper right', fontsize=8)
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
Learning Curves Analysis
Learning curves visualize how model performance changes with training data size.
Types of Learning Curves
(Two sketches of error versus amount of training data, each with a Train curve and a CV curve.)

UNDERFITTING (High Bias):
• Both curves converge
• High final error
• Need more complex model

OVERFITTING (High Variance):
• Curves don't converge
• Gap between curves
• Need more data or simpler model
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, cv=5):
    # Evaluate the estimator at 10 training-set sizes, from 10% to 100% of the data
    train_sizes_abs, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='neg_mean_squared_error'
    )
    # Scores are negated MSE: flip the sign and average across folds
    train_scores_mean = -train_scores.mean(axis=1)
    test_scores_mean = -test_scores.mean(axis=1)

    plt.figure(figsize=(10, 6))
    plt.title(title)
    plt.xlabel("Training examples")
    plt.ylabel("MSE")
    plt.grid(True, alpha=0.3)
    plt.plot(train_sizes_abs, train_scores_mean, 'o-', label="Training MSE")
    plt.plot(train_sizes_abs, test_scores_mean, 'o-', label="Cross-validation MSE")
    plt.legend(loc="best")
    plt.tight_layout()
    plt.show()
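The same diagnosis can be read from the numbers without a plot. The sketch below assumes a synthetic regression problem and a `Ridge` model chosen only for illustration. It prints training error, cross-validation error, and the gap between them at each training-set size.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

# Hypothetical data: 300 samples, 20 features, noisy linear target
X, y = make_regression(n_samples=300, n_features=20, noise=25, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    Ridge(alpha=1.0), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5),
    scoring="neg_mean_squared_error",
)

# Scores are negated MSE, so flip the sign to get errors
train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)

for n, tr, va in zip(train_sizes, train_mse, val_mse):
    print(f"n={int(n):>3}: train MSE={tr:9.1f}  CV MSE={va:9.1f}  gap={va - tr:9.1f}")
```

A gap that stays large as the training set grows points toward overfitting. Two curves that meet at a high error point toward underfitting.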
Key Takeaways
Summary: Cross-Validation & Bias-Variance Tradeoff
- Cross-validation provides robust performance estimates without wasting data
- K-Fold (K=5 or 10) is the standard choice for most problems
- Stratified K-Fold is essential for imbalanced classification
- Always use pipelines to prevent data leakage in preprocessing
- Bias-variance tradeoff: simple models tend to have high bias, complex models tend to have high variance
- Learning curves diagnose underfitting vs overfitting
- The sweet spot is the model complexity at which error on held-out data is minimized
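Two of the takeaways above, stratified folds and leakage-free preprocessing, can be combined in a few lines. This is a sketch rather than a recommended configuration: the breast cancer dataset and `SVC` come from the exercises in this section, while the step names and hyperparameters are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# The scaler lives inside the pipeline, so it is re-fit on the training
# folds of every split and never sees the validation fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("svc", SVC(kernel="rbf", C=1.0)),
])

# Stratification keeps the class ratio roughly equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
print(f"Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Scaling the full dataset before calling `cross_val_score` would let statistics from the validation folds influence training, which is the leakage the pipeline avoids.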
Practice Exercises
Exercise 1: Cross-Validation Strategies
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC
X, y = load_breast_cancer(return_X_y=True)
# Compare different CV strategies:
# 1. K-Fold with K=3, 5, 10, 20
# 2. Stratified K-Fold
# 3. Leave-One-Out
# 4. Repeated K-Fold
# Which gives the most stable estimates?
Exercise 2: Bias-Variance Analysis
# Analyze bias-variance tradeoff for:
# 1. Polynomial regression (degrees 1-20)
# 2. Decision trees (max_depth 1-20)
# 3. Random forests (n_estimators 1-100)
# Plot learning curves and identify optimal complexity.
Exercise 3: Practical Model Selection
from sklearn.datasets import fetch_california_housing
X, y = fetch_california_housing(return_X_y=True)
# Build a complete model selection pipeline:
# 1. Split data into train/test (80/20)
# 2. Use 5-fold CV on training set
# 3. Try 3+ different algorithms
# 4. Select best model
# 5. Evaluate on test set
# 6. Compare CV estimate with test performance
Exercise 4: Diagnose and Fix
# Given a model with poor performance:
# 1. Plot learning curve
# 2. Determine if underfitting or overfitting
# 3. Apply appropriate fix
# 4. Re-evaluate with CV
Summary Table
| Technique | Use Case | Pros | Cons |
|---|---|---|---|
| K-Fold | General purpose | Good bias-variance balance | Ignores class distribution |
| Stratified K-Fold | Classification | Maintains class balance | Applies only to discrete class labels |
| LOO | Very small datasets | Maximum training data | Computationally expensive (one fit per sample) |
| Repeated K-Fold | Stable estimates | More robust results | Cost scales with repeats (e.g. 10x slower with 10 repeats) |
| Time Series Split | Temporal data | Respects order | Limited folds |
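Each row of the table corresponds to a splitter class in `sklearn.model_selection`. The sketch below instantiates them on a tiny made-up dataset and counts how many model fits each one implies. The dataset and the specific `n_splits` and `n_repeats` values are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import (
    KFold, StratifiedKFold, LeaveOneOut, RepeatedKFold, TimeSeriesSplit,
)

# Tiny hypothetical dataset: 20 samples, binary labels with a 3:1 imbalance
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)

splitters = {
    "K-Fold": KFold(n_splits=5, shuffle=True, random_state=0),
    "Stratified K-Fold": StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    "LOO": LeaveOneOut(),
    "Repeated K-Fold": RepeatedKFold(n_splits=5, n_repeats=10, random_state=0),
    "Time Series": TimeSeriesSplit(n_splits=5),
}

# Number of train/validate rounds (and therefore model fits) per strategy
n_fits = {}
for name, splitter in splitters.items():
    n_fits[name] = splitter.get_n_splits(X, y)
    print(f"{name:<18} model fits: {n_fits[name]}")

# Time-series splits only ever validate on samples that come after the training samples
ordered = []
for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
    ordered.append(train_idx.max() < val_idx.min())
    print("train up to index", train_idx.max(),
          "-> validate on", val_idx.min(), "to", val_idx.max())
```

The fit counts make the cost column concrete: LOO needs one fit per sample, and Repeated K-Fold multiplies the K-Fold cost by the number of repeats.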