Why Cross-Validation?
The Holdout Method Limitation
The naive holdout approach splits the data into training and test sets once, so every conclusion rests on that single partition.
Key limitations:
- Performance estimate has high variance (depends on single split)
- Wastes data (test set never used for training)
- Can't assess model stability
- Risk of optimistic or pessimistic bias, depending on which samples happen to land in the test set
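To make the variance point concrete, here is a minimal sketch that repeats a single holdout split 200 times on the same data and the same model. The synthetic data, the least-squares model, and the helper name `holdout_mse` are illustrative assumptions, not part of the original material.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (illustrative only): y = 2x + noise
n = 100
X = rng.normal(size=(n, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=n)

def holdout_mse(X, y, test_fraction, rng):
    """Fit least squares on one random train split; return MSE on the held-out part."""
    idx = rng.permutation(len(y))
    n_test = int(len(y) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    # Add an intercept column and solve the least-squares problem
    A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
    A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
    coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
    return float(np.mean((A_test @ coef - y[test_idx]) ** 2))

# Same data, same model, 200 different single splits
estimates = np.array([holdout_mse(X, y, 0.2, rng) for _ in range(200)])
print(f"holdout MSE: min={estimates.min():.3f}, max={estimates.max():.3f}, "
      f"std={estimates.std():.3f}")
```

The spread between the minimum and maximum is the uncertainty that a single reported holdout number hides.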
K-Fold Cross-Validation
The gold standard for model evaluation. Split the data into $k$ folds, train on $k-1$ of them, test on the remaining one, and rotate until every fold has served as the test set once.
Mathematical Formulation
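For dataset $D$ partitioned into $k$ folds $D_1, \dots, D_k$:
$$\mathrm{CV}_k = \frac{1}{k} \sum_{i=1}^{k} \frac{1}{|D_i|} \sum_{(x, y) \in D_i} L\big(y, \hat{f}^{-i}(x)\big)$$
where $\hat{f}^{-i}$ is the model trained on all data except fold $D_i$, and $L$ is the loss function.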
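The k-fold estimate is the average, over folds, of the loss of a model trained on the other folds. A from-scratch sketch with NumPy may help make the rotation explicit; the least-squares model, the synthetic data, and the helper names (`kfold_indices`, `kfold_cv_mse`) are assumptions chosen for illustration.

```python
import numpy as np

def kfold_indices(n, k, rng):
    """Shuffle 0..n-1 and cut the result into k nearly equal folds."""
    return np.array_split(rng.permutation(n), k)

def fit_least_squares(X, y):
    A = np.column_stack([np.ones(len(y)), X])  # intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def kfold_cv_mse(X, y, k, rng):
    """Average squared-error loss over k folds, each held out exactly once."""
    folds = kfold_indices(len(y), k, rng)
    fold_losses = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except fold i
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coef = fit_least_squares(X[train_idx], y[train_idx])
        residuals = predict(coef, X[test_idx]) - y[test_idx]
        fold_losses.append(np.mean(residuals ** 2))
    return float(np.mean(fold_losses)), folds

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

cv_mse, folds = kfold_cv_mse(X, y, k=5, rng=rng)
print(f"5-fold CV MSE: {cv_mse:.3f}")
```

In practice `sklearn.model_selection.cross_val_score` does this bookkeeping; the loop is only meant to show what is being averaged.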
Choice of k
| k | Pros | Cons |
|---|---|---|
| 5 | Good bias-variance tradeoff, common default | Slightly pessimistic estimate (each model sees 80% of the data) |
| 10 | Lower bias estimate | Higher computational cost |
| $n$ (LOO) | Nearly unbiased | High variance, expensive |
Stratified K-Fold
Ensures each fold maintains approximately the same class distribution as the full dataset, which is critical for imbalanced problems.
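One simple way to get stratified folds is to deal each class's samples round-robin across the folds. The sketch below uses only the standard library; it is an illustration of the idea, not scikit-learn's `StratifiedKFold` algorithm, and the function name `stratified_folds` is hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign each sample index to one of k folds, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal this class's samples round-robin so every fold gets its share
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Imbalanced toy labels: 90 negatives, 10 positives
labels = [0] * 90 + [1] * 10
folds = stratified_folds(labels, k=5)
for i, fold in enumerate(folds):
    counts = Counter(labels[idx] for idx in fold)
    print(f"fold {i}: {counts[0]} negatives, {counts[1]} positives")
```

With these toy labels every fold ends up with 18 negatives and 2 positives, matching the 9:1 ratio of the full dataset. A plain shuffled split could easily leave a fold with no positives at all.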
Leave-One-Out (LOO) Cross-Validation
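A special case of K-Fold where $k = n$ (the number of samples):
$$\mathrm{CV}_{\mathrm{LOO}} = \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, \hat{f}^{-i}(x_i)\big)$$
where $\hat{f}^{-i}$ is the model trained on every sample except $(x_i, y_i)$.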
Characteristics:
- Nearly unbiased estimate of generalization error
- High variance (each training set differs by only 1 sample)
- Computationally expensive: $n$ model fits
- Approximately equivalent to AIC for linear models
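The cost of $n$ fits is not always unavoidable. For ordinary least squares specifically, a standard identity gives each leave-one-out residual from a single fit: the ordinary residual divided by one minus the sample's leverage. The sketch below checks that identity against brute-force refitting; the synthetic data and function names are illustrative assumptions, and the shortcut does not carry over to arbitrary models.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 features
y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(scale=0.3, size=n)

def loo_brute_force(X, y):
    """Refit the model n times, each time leaving one sample out."""
    errors = np.empty(len(y))
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        coef, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        errors[i] = y[i] - X[i] @ coef
    return errors

def loo_hat_matrix(X, y):
    """Single fit: LOO residual = ordinary residual / (1 - leverage)."""
    H = X @ np.linalg.solve(X.T @ X, X.T)  # hat (projection) matrix
    residuals = y - H @ y
    return residuals / (1.0 - np.diag(H))

brute = loo_brute_force(X, y)
fast = loo_hat_matrix(X, y)
print(f"LOO MSE (n fits): {np.mean(brute ** 2):.4f}")
print(f"LOO MSE (1 fit):  {np.mean(fast ** 2):.4f}")
```

Both routes give the same residuals up to floating-point error, which is why LOO is cheap for linear least squares and expensive almost everywhere else.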
Time Series Cross-Validation
Standard K-Fold ignores temporal ordering, so a model can be trained on observations that come after the ones it is tested on. Use expanding or sliding windows instead.
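The two window schemes differ only in where the training range starts. A minimal standard-library sketch follows; the function name `time_series_splits` and its parameters are hypothetical, and scikit-learn's `TimeSeriesSplit` is the usual tool in practice.

```python
def time_series_splits(n_samples, n_splits, test_size, max_train_size=None):
    """Yield (train, test) index lists where training always precedes testing.

    max_train_size=None gives an expanding window; an integer gives a
    sliding window that keeps only the most recent observations.
    """
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for i in range(n_splits):
        test_start = first_test_start + i * test_size
        if max_train_size is None:
            train_start = 0
        else:
            train_start = max(0, test_start - max_train_size)
        train = list(range(train_start, test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test

for train, test in time_series_splits(n_samples=12, n_splits=3, test_size=2):
    print(f"expanding: train={train[0]}..{train[-1]}, test={test[0]}..{test[-1]}")

for train, test in time_series_splits(12, 3, 2, max_train_size=4):
    print(f"sliding:   train={train[0]}..{train[-1]}, test={test[0]}..{test[-1]}")
```

In both variants every training index is strictly smaller than every test index, so no future information reaches the model.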
Bias-Variance Tradeoff
Mathematical Decomposition
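For a model $\hat{f}_D$ trained on dataset $D$, with $y = f(x) + \varepsilon$, $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, the expected prediction error at point $x$ decomposes as:
$$\mathbb{E}\big[(y - \hat{f}_D(x))^2\big] = \mathrm{Bias}\big[\hat{f}_D(x)\big]^2 + \mathrm{Var}\big[\hat{f}_D(x)\big] + \sigma^2$$
where:
- $\mathrm{Bias}\big[\hat{f}_D(x)\big] = \mathbb{E}_D\big[\hat{f}_D(x)\big] - f(x)$ is the systematic error of the average model
- $\mathrm{Var}\big[\hat{f}_D(x)\big] = \mathbb{E}_D\big[(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)])^2\big]$ is the sensitivity of the model to the particular training set
- $\sigma^2$ is the irreducible noise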
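The decomposition can be estimated by simulation: draw many training sets, fit the same model class on each, and look at the spread of predictions at one point. The sketch below does this for polynomial fits of increasing degree. The target function, noise level, sample sizes, and evaluation point are all assumptions made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)

noise_sd = 0.3   # sigma: standard deviation of the irreducible noise
x0 = 0.25        # point at which the error is decomposed
n_train, n_datasets = 30, 2000

def decompose(degree):
    """Estimate bias^2 and variance of a polynomial fit of this degree at x0."""
    preds = np.empty(n_datasets)
    for d in range(n_datasets):
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + rng.normal(scale=noise_sd, size=n_train)
        preds[d] = np.polyval(np.polyfit(x, y, degree), x0)
    bias_sq = (preds.mean() - true_f(x0)) ** 2
    variance = preds.var()
    # Error against the noise-free target; adding sigma^2 gives the full error
    mse_to_truth = np.mean((preds - true_f(x0)) ** 2)
    return bias_sq, variance, mse_to_truth

results = {}
for degree in (1, 3, 7):
    results[degree] = decompose(degree)
    bias_sq, variance, mse_to_truth = results[degree]
    print(f"degree {degree}: bias^2={bias_sq:.4f}, var={variance:.4f}, "
          f"bias^2+var={bias_sq + variance:.4f}, "
          f"expected error={mse_to_truth + noise_sd ** 2:.4f}")
```

The straight line cannot follow the sine curve, so its bias term dominates. More flexible fits reduce the bias and typically pay for it with a larger variance term.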
Intuition
Underfitting vs Overfitting Diagnosis
Diagnostic Summary
| Symptom | Diagnosis | Remedy |
|---|---|---|
| Train acc ≫ Val acc | Overfitting | Regularization, more data, simpler model |
| Train acc ≈ Val acc (both low) | Underfitting | More features, more complex model |
| High CV variance | Unstable model | More data, simpler model, ensemble |
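The table can be turned into a rough rule-of-thumb check. The thresholds below (`gap_tol`, `low_acc`, `std_tol`) are arbitrary assumptions for illustration only; sensible values depend on the problem, the metric, and the baseline accuracy.

```python
def diagnose(train_acc, val_acc, cv_std,
             gap_tol=0.05, low_acc=0.70, std_tol=0.05):
    """Map train/validation accuracy and CV spread to the table's diagnoses.

    All three thresholds are hypothetical defaults, not recommended values.
    """
    findings = []
    if train_acc - val_acc > gap_tol:
        findings.append("overfitting")       # train score far above validation
    elif train_acc < low_acc and val_acc < low_acc:
        findings.append("underfitting")      # both scores low and close together
    if cv_std > std_tol:
        findings.append("unstable")          # scores vary a lot across folds
    return findings or ["no obvious problem"]

print(diagnose(train_acc=0.99, val_acc=0.80, cv_std=0.01))
print(diagnose(train_acc=0.62, val_acc=0.60, cv_std=0.01))
print(diagnose(train_acc=0.85, val_acc=0.84, cv_std=0.10))
```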
Model Selection with Cross-Validation
Use nested cross-validation to avoid optimistic bias when selecting hyperparameters:
Outer Loop: Evaluate generalization
┌─────────────────────────────────────────────┐
│ Split into Train/Test                       │
│                                             │
│ Inner Loop: Hyperparameter Tuning           │
│ ┌─────────────────────────────────────────┐ │
│ │ Split Train into Train/Val              │ │
│ │ Try all hyperparameter combinations     │ │
│ │ Select best by validation score         │ │
│ └─────────────────────────────────────────┘ │
│                                             │
│ Train final model with best params on Train │
│ Evaluate on outer Test                      │
└─────────────────────────────────────────────┘
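Written out as explicit loops, nested cross-validation looks roughly like the sketch below. It uses closed-form ridge regression in NumPy so that both loops are visible; the model, the `alphas` grid, the synthetic data, and all function names are assumptions for illustration.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression (no intercept, for brevity)."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def mse(X, y, coef):
    return float(np.mean((X @ coef - y) ** 2))

def cv_mse(X, y, alpha, folds):
    """Mean validation MSE of ridge(alpha) over the given folds."""
    losses = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        coef = ridge_fit(X[train_idx], y[train_idx], alpha)
        losses.append(mse(X[val_idx], y[val_idx], coef))
    return float(np.mean(losses))

def nested_cv(X, y, alphas, k_outer=5, k_inner=3, seed=0):
    rng = np.random.default_rng(seed)
    outer_folds = np.array_split(rng.permutation(len(y)), k_outer)
    outer_scores, chosen = [], []
    for i, test_idx in enumerate(outer_folds):
        train_idx = np.concatenate(
            [f for j, f in enumerate(outer_folds) if j != i])
        X_tr, y_tr = X[train_idx], y[train_idx]
        # Inner loop: tune alpha using only the outer training portion
        inner_folds = np.array_split(rng.permutation(len(y_tr)), k_inner)
        best_alpha = min(alphas, key=lambda a: cv_mse(X_tr, y_tr, a, inner_folds))
        # Refit on the whole outer training portion, score on the untouched fold
        coef = ridge_fit(X_tr, y_tr, best_alpha)
        outer_scores.append(mse(X[test_idx], y[test_idx], coef))
        chosen.append(best_alpha)
    return outer_scores, chosen

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=1.0, size=150)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
outer_scores, chosen = nested_cv(X, y, alphas)
print(f"nested CV MSE: {np.mean(outer_scores):.3f} ± {np.std(outer_scores):.3f}")
print(f"alpha chosen per outer fold: {chosen}")
```

The outer test fold never influences the choice of `alpha`, which is what keeps the reported score from being optimistically biased. Different outer folds may select different values, so the result estimates the performance of the whole tuning procedure rather than of one fixed hyperparameter setting.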
Implementation in Python
import numpy as np
from sklearn.model_selection import (
    KFold, StratifiedKFold, TimeSeriesSplit,
    cross_val_score, GridSearchCV, learning_curve
)
from sklearn.linear_model import RidgeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification

# Synthetic binary classification problem
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, n_classes=2,
                           random_state=42)

# --- Basic K-Fold ---
# RidgeClassifier predicts class labels, which 'accuracy' requires;
# the plain Ridge regressor outputs continuous values and cannot be scored this way.
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RidgeClassifier(alpha=1.0), X, y, cv=kfold, scoring='accuracy')
print(f"K-Fold CV: {scores.mean():.4f} ± {scores.std():.4f}")

# --- Stratified K-Fold ---
# Each fold keeps roughly the same class proportions as y
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=42),
                         X, y, cv=skfold, scoring='accuracy')
print(f"Stratified CV: {scores.mean():.4f} ± {scores.std():.4f}")

# --- Nested CV for model selection ---
# Outer folds estimate generalization; inner folds tune hyperparameters
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Keys use the '<step name>__<parameter>' convention of Pipeline
param_grid = {
    'classifier__n_estimators': [50, 100, 200],
    'classifier__max_depth': [5, 10, 20, None]
}
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])

# GridSearchCV is refit inside every outer training fold
grid_search = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring='accuracy')
nested_scores = cross_val_score(grid_search, X, y, cv=outer_cv, scoring='accuracy')
print(f"Nested CV: {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")

# --- Time Series CV ---
# X here has no real time order; the loop only shows how the split sizes grow
tscv = TimeSeriesSplit(n_splits=5)
for i, (train_idx, test_idx) in enumerate(tscv.split(X)):
    print(f"Split {i+1}: Train={len(train_idx)}, Test={len(test_idx)}")

# --- Learning Curves ---
# Scores have shape (n_train_sizes, n_folds); average over folds per size
train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=42), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring='accuracy', n_jobs=-1
)
print("\nLearning Curve (sample sizes):")
for size, train, val in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"  n={size:4d}: train={train:.3f}, val={val:.3f}, gap={train-val:.3f}")
Key Takeaways
- Prefer cross-validation to a single holdout split: one split gives a high-variance estimate
- Stratified K-Fold is essential for classification (especially imbalanced)
- Time series require temporal ordering: never shuffle
- Bias-variance tradeoff is fundamental: optimize the total error, not just bias
- Learning curves reveal whether you need more data, more features, or regularization
- Nested CV avoids optimistic bias in model selection
- Variance of CV scores matters: high variance signals instability