
Model Selection

Master cross-validation, AIC, BIC, and the bias-variance tradeoff.



ℹ️ Why It Matters

Choosing the right model complexity prevents overfitting and improves generalization. A model that is too simple underfits (high bias); one that is too complex overfits (high variance). Model selection methods such as cross-validation, AIC, and BIC find the sweet spot that balances fit and complexity for reliable prediction on unseen data.


Overview

The bias-variance tradeoff decomposes expected prediction error into squared bias (error from incorrect assumptions), variance (error from sensitivity to the training data), and irreducible noise. K-fold cross-validation estimates out-of-sample performance by training on $K-1$ folds and testing on the held-out fold, repeating so that each fold is held out once. AIC ($-2\ell + 2k$) targets predictive accuracy and tends to favor larger models. BIC ($-2\ell + k\log n$) penalizes each parameter more heavily than AIC whenever $\log n > 2$ (that is, $n \ge 8$) and so favors simpler models. Regularization (Ridge/Lasso) selects complexity implicitly by shrinking coefficients; Lasso drives some exactly to zero, which performs automatic feature selection. In every case the goal is to minimize expected prediction error on new data.


Key Concepts

Bias-Variance Decomposition

$$\text{Expected Error} = \text{Bias}^2 + \text{Variance} + \text{Noise}$$

Here,

  • $\text{Bias}^2$ = Error from incorrect assumptions (underfitting)
  • $\text{Variance}$ = Error from sensitivity to training data (overfitting)
  • $\text{Noise}$ = Irreducible error from random variation
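As a concrete check of the decomposition, the sketch below uses a deliberately simple estimator that is not part of the lesson: the sample mean multiplied by a shrinkage factor `c`. The function names, the factor `c`, and the parameter values are illustrative assumptions. For this estimator the three terms have closed forms, and a seeded simulation should land close to their sum.

```python
import random


def shrunk_mean_error(mu, sigma, n, c):
    """Analytic bias^2, variance and noise for the estimator c * sample_mean.

    The prediction target is a new observation y = mu + eps, Var(eps) = sigma^2.
    """
    bias_sq = ((c - 1.0) * mu) ** 2     # E[c * mean] - mu = (c - 1) * mu
    variance = c ** 2 * sigma ** 2 / n  # Var(c * mean) = c^2 * sigma^2 / n
    noise = sigma ** 2                  # irreducible error of the new observation
    return bias_sq, variance, noise


def simulated_error(mu, sigma, n, c, trials=20000, seed=0):
    """Monte Carlo estimate of E[(y_new - c * sample_mean)^2]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        estimate = c * sum(sample) / n
        y_new = rng.gauss(mu, sigma)
        total += (y_new - estimate) ** 2
    return total / trials


# Hypothetical settings chosen only for illustration.
bias_sq, variance, noise = shrunk_mean_error(mu=2.0, sigma=1.0, n=10, c=0.8)
print("analytic: ", bias_sq + variance + noise)
print("simulated:", simulated_error(mu=2.0, sigma=1.0, n=10, c=0.8))
```

Setting `c = 1` removes the bias term and leaves only variance and noise; smaller `c` trades some bias for lower variance.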

AIC (Akaike Information Criterion)

$$\text{AIC} = -2\ell + 2k$$

Here,

  • $\ell$ = Maximized log-likelihood
  • $k$ = Number of parameters

BIC (Bayesian Information Criterion)

$$\text{BIC} = -2\ell + k\log n$$

Here,

  • $n$ = Sample size
  • $k$ = Number of parameters
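Both criteria are direct functions of the maximized log-likelihood, so they are easy to compute by hand. The sketch below implements the two formulas and applies them to two hypothetical fits; the log-likelihood values, parameter counts, and sample size are made up for illustration. The values are chosen so that the two criteria disagree.

```python
import math


def aic(log_likelihood, k):
    """AIC = -2 * loglik + 2 * k."""
    return -2.0 * log_likelihood + 2.0 * k


def bic(log_likelihood, k, n):
    """BIC = -2 * loglik + k * log(n), using the natural logarithm."""
    return -2.0 * log_likelihood + k * math.log(n)


# Hypothetical fitted models on the same n = 200 observations.
n = 200
small = {"loglik": -310.0, "k": 3}
large = {"loglik": -303.0, "k": 8}

for name, m in (("small", small), ("large", large)):
    print(name, aic(m["loglik"], m["k"]), bic(m["loglik"], m["k"], n))
```

With these numbers the larger model has the lower AIC, while the smaller model has the lower BIC, because each extra parameter costs $\log 200 \approx 5.3$ under BIC instead of 2 under AIC.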

K-Fold Cross-Validation

$$CV_{(K)} = \frac{1}{K}\sum_{i=1}^{K} \text{MSE}_i$$

Here,

  • $K$ = Number of folds
  • $\text{MSE}_i$ = Mean squared error on fold $i$
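The averaging in the formula can be sketched in plain Python. The example below is a minimal illustration, assuming a one-predictor least-squares line as the model and synthetic data; the helper names `kfold_indices`, `fit_line`, and `cv_mse` are hypothetical and not from any library.

```python
import random


def kfold_indices(n, K, seed=0):
    """Shuffle 0..n-1 and split the indices into K nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::K] for i in range(K)]


def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def cv_mse(xs, ys, K=5, seed=0):
    """Return (average of the K fold MSEs, list of per-fold MSEs)."""
    fold_mse = []
    for test_idx in kfold_indices(len(xs), K, seed):
        held_out = set(test_idx)
        train_idx = [j for j in range(len(xs)) if j not in held_out]
        # Fit on the K-1 training folds only.
        b0, b1 = fit_line([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        # Score on the held-out fold.
        errors = [(ys[j] - (b0 + b1 * xs[j])) ** 2 for j in test_idx]
        fold_mse.append(sum(errors) / len(errors))
    return sum(fold_mse) / K, fold_mse


# Synthetic data (an assumption for illustration): y = 1 + 2x + noise.
rng = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 0.5) for x in xs]
cv_estimate, per_fold = cv_mse(xs, ys, K=5)
print(cv_estimate, per_fold)
```

The indices are shuffled before splitting so that each fold covers the whole range of `x`; for ordered data such as time series a different splitting scheme would be needed.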

Ridge Regression (L2)

$$\hat{\beta}_{\text{Ridge}} = \arg\min_\beta \left[\sum_i (y_i - X_i\beta)^2 + \alpha\sum_j \beta_j^2\right]$$

Here,

  • $\alpha$ = Regularization strength
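To see the shrinkage effect, consider the simplest possible case: one predictor and no intercept, which is an assumption made here to keep the algebra short. Setting the derivative of the ridge objective to zero gives $\hat{\beta} = \sum x_i y_i / (\sum x_i^2 + \alpha)$. The function name `ridge_1d` and the data are illustrative.

```python
def ridge_1d(xs, ys, alpha):
    """Minimize sum((y - x*b)^2) + alpha * b^2 over a single coefficient b."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


# Toy data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

print(ridge_1d(xs, ys, alpha=0.0))   # 2.0, the ordinary least-squares slope
print(ridge_1d(xs, ys, alpha=10.0))  # 1.5
print(ridge_1d(xs, ys, alpha=30.0))  # 1.0
```

The coefficient moves toward zero as `alpha` grows but never reaches it exactly, which is why Ridge does not perform feature selection.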

Lasso Regression (L1)

$$\hat{\beta}_{\text{Lasso}} = \arg\min_\beta \left[\sum_i (y_i - X_i\beta)^2 + \alpha\sum_j |\beta_j|\right]$$

Here,

  • $\alpha$ = Regularization strength
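In the same one-predictor, no-intercept setting (again an assumption for illustration), the Lasso objective as written above has a closed-form solution by soft-thresholding: $\hat{\beta} = \operatorname{sign}(s)\,\max(|s| - \alpha/2,\ 0) / \sum x_i^2$ with $s = \sum x_i y_i$. Note that libraries often scale the squared-error term differently, so their `alpha` values are not directly comparable to this one. The name `lasso_1d` is hypothetical.

```python
import math


def lasso_1d(xs, ys, alpha):
    """Minimize sum((y - x*b)^2) + alpha * |b| over a single coefficient b."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Soft-threshold the correlation term, then rescale.
    shrunk = max(abs(sxy) - alpha / 2.0, 0.0)
    return math.copysign(shrunk, sxy) / sxx


# Toy data lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

print(lasso_1d(xs, ys, alpha=0.0))    # 2.0, the ordinary least-squares slope
print(lasso_1d(xs, ys, alpha=60.0))   # 1.0
print(lasso_1d(xs, ys, alpha=120.0))  # 0.0, the coefficient is removed
```

Unlike the ridge solution, the coefficient becomes exactly zero once `alpha` is large enough, which is the mechanism behind Lasso's feature selection.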

AIC vs BIC

| Criterion | Penalty | Favors | Best For | Consistency |
|---|---|---|---|---|
| AIC | $2k$ | Larger models | Prediction accuracy | No |
| BIC | $k\log n$ | Simpler models | Interpretability | Yes |

Regularization Comparison

| Method | Penalty | Effect | Feature Selection? |
|---|---|---|---|
| Ridge (L2) | $\alpha\sum\beta_j^2$ | Shrinks all coefficients | No |
| Lasso (L1) | $\alpha\sum\lvert\beta_j\rvert$ | Drives some to zero | Yes |
| Elastic Net | L1 + L2 | Combines both | Yes (through the L1 term) |

Quick Example

Choosing Between Models

Model A: AIC = 100, BIC = 110. Model B: AIC = 105, BIC = 105.

  • If prediction is the goal: prefer Model A (lower AIC).
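  • If the goal is interpretability, or the true model is believed to be sparse: prefer Model B (lower BIC).
  • In large samples, BIC is consistent: it selects the true model with probability approaching 1, provided the true model is in the candidate set. AIC instead targets the candidate with the smallest expected Kullback-Leibler divergence from the truth, which makes it the better choice for prediction.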

Cross-Validation for Regularization

Ridge regression with $\alpha = 0.01$: CV MSE = 45.2. $\alpha = 1$: CV MSE = 38.7. $\alpha = 100$: CV MSE = 52.1.

The best choice is $\alpha = 1$: it has the lowest CV MSE, so it balances bias and variance best among the three candidates. An $\alpha$ that is too small overfits; one that is too large underfits.
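The selection step can be sketched as a small grid search. The example below does not reproduce the CV MSE figures quoted above; it uses synthetic data, a one-predictor ridge fit without intercept, and hypothetical helper names (`ridge_1d`, `cv_mse_ridge`), all of which are assumptions for illustration.

```python
import random


def ridge_1d(xs, ys, alpha):
    """One-coefficient ridge fit without intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


def cv_mse_ridge(xs, ys, alpha, K=5):
    """K-fold CV estimate of the MSE for a given alpha."""
    n = len(xs)
    fold_mse = []
    for i in range(K):
        test_idx = list(range(i, n, K))  # every K-th point forms fold i
        held_out = set(test_idx)
        train_x = [xs[j] for j in range(n) if j not in held_out]
        train_y = [ys[j] for j in range(n) if j not in held_out]
        b = ridge_1d(train_x, train_y, alpha)
        errors = [(ys[j] - b * xs[j]) ** 2 for j in test_idx]
        fold_mse.append(sum(errors) / len(errors))
    return sum(fold_mse) / K


# Synthetic data: y = 2x + noise, with x drawn at random.
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(40)]
ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]

alphas = [0.01, 1.0, 100.0]
scores = {a: cv_mse_ridge(xs, ys, a) for a in alphas}
best_alpha = min(scores, key=scores.get)  # alpha with the lowest CV MSE
print(scores, best_alpha)
```

In practice the grid is usually finer and spaced logarithmically, since useful values of `alpha` can span several orders of magnitude.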


Key Takeaways

Summary: Model Selection

  • Bias-Variance Tradeoff: Increasing complexity typically reduces bias but increases variance. The optimal model minimizes the sum of squared bias and variance.
  • Cross-Validation: K-fold CV estimates generalization error. Use it to compare models and tune hyperparameters.
  • AIC vs BIC: AIC targets predictive accuracy and tends to favor larger models. BIC penalizes complexity more heavily once $n \ge 8$ and favors simpler models.
  • Regularization: Ridge ($\ell_2$) shrinks all coefficients. Lasso ($\ell_1$) drives some to zero, performing feature selection.
  • Underfitting vs Overfitting: High bias = too simple; high variance = too complex. Monitor training vs. validation error curves.
  • Workflow: Split data into train/validation/test sets, use CV to select hyperparameters, then evaluate once on the held-out test set.
  • Adjusted R²: Penalizes for adding predictors. Use it to compare models with different numbers of features.
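The adjusted R² mentioned in the last point is not defined in this lesson; the sketch below assumes the standard definition, $1 - (1 - R^2)\,\frac{n-1}{n-p-1}$, where $p$ is the number of predictors excluding the intercept. The R² values in the example are hypothetical.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors (plus intercept) and n samples."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Hypothetical comparison on n = 50 observations.
print(adjusted_r2(0.80, n=50, p=3))   # smaller model
print(adjusted_r2(0.81, n=50, p=10))  # larger model, slightly higher raw R^2
```

Here the larger model has the higher raw R² but the lower adjusted R², because the small gain in fit does not offset seven extra predictors.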

Deep Dive

For detailed explanations, worked examples, and Python implementations, explore the dedicated statistics lessons:

Cross-Validation

  • Cross-Validation: K-fold, stratified, leave-one-out, and nested cross-validation

Information Criteria

  • AIC and BIC: Derivation, interpretation, model averaging, and when each is appropriate

ROC and AUC

  • ROC and AUC: Threshold-independent evaluation, ROC curves, AUC interpretation, and trade-offs
