Model Selection
Why It Matters
Choosing the right model complexity prevents overfitting and improves generalization. Too simple → underfitting (high bias). Too complex → overfitting (high variance). Model selection methods (cross-validation, AIC, BIC) find the sweet spot that balances fit and complexity for reliable prediction on unseen data.
Overview
The bias-variance tradeoff decomposes prediction error into bias² (error from incorrect assumptions), variance (error from sensitivity to training data), and irreducible noise. Cross-validation estimates out-of-sample performance by training on all but one fold and testing on the held-out fold, repeating for each fold. AIC ($-2\ln\hat{L} + 2k$, with $\hat{L}$ the maximized likelihood and $k$ the number of parameters) targets prediction error and tends to favor larger models. BIC ($-2\ln\hat{L} + k\ln n$, with $n$ the sample size) penalizes complexity more heavily once $n \ge 8$ and favors simpler models. Regularization (Ridge/Lasso) implicitly selects complexity by shrinking coefficients; Lasso drives some to zero for automatic feature selection. The goal is always to minimize expected prediction error on new data.
Key Concepts
Bias-Variance Decomposition
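$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \text{Bias}\big[\hat{f}(x)\big]^2 + \text{Var}\big[\hat{f}(x)\big] + \sigma^2$$
Here,
- $\text{Bias}[\hat{f}(x)]^2$ = Error from incorrect assumptions (underfitting)
- $\text{Var}[\hat{f}(x)]$ = Error from sensitivity to training data (overfitting)
- $\sigma^2$ = Irreducible error from random variation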
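The decomposition can be checked numerically. The following is a minimal simulation sketch, not a definitive procedure: the true function $\sin(2\pi x)$, the Gaussian noise level, the polynomial degrees, and the helper name `bias_variance` are all illustrative assumptions that do not come from the text above.

```python
import numpy as np

def bias_variance(degree, n_train=30, n_repeats=200, noise_sd=0.3, seed=0):
    """Estimate squared bias and variance of a polynomial fit by simulation."""
    rng = np.random.default_rng(seed)
    x_test = np.linspace(0.05, 0.95, 50)
    f_true = np.sin(2 * np.pi * x_test)  # assumed true function
    preds = np.empty((n_repeats, x_test.size))
    for r in range(n_repeats):
        # Draw a fresh training set on every repetition.
        x = rng.uniform(0.0, 1.0, n_train)
        y = np.sin(2 * np.pi * x) + rng.normal(0.0, noise_sd, n_train)
        coefs = np.polyfit(x, y, degree)
        preds[r] = np.polyval(coefs, x_test)
    mean_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((mean_pred - f_true) ** 2))  # averaged over test points
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance

for degree in (1, 3, 9):
    b2, var = bias_variance(degree)
    print(f"degree={degree}: bias^2={b2:.4f}, variance={var:.4f}")
```

Under these assumptions the straight line should show the largest squared bias and the degree-9 polynomial the largest variance, which is the pattern the tradeoff describes.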
AIC (Akaike Information Criterion)
$$\text{AIC} = -2\ln\hat{L} + 2k$$
Here,
- $\ln\hat{L}$ = Maximized log-likelihood
- $k$ = Number of parameters
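As a hedged illustration of the formula, the sketch below computes AIC for a model with Gaussian errors, where the maximized log-likelihood has a closed form in terms of the residuals. The Gaussian assumption, the convention of counting the error variance as one of the $k$ parameters, the residual values, and the function names are all assumptions made for this example.

```python
import math

def gaussian_log_likelihood(residuals):
    """Maximized Gaussian log-likelihood, using the MLE variance RSS / n."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    sigma2 = rss / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1.0)

def aic(log_lik, k):
    """AIC = -2 ln(L-hat) + 2k."""
    return -2.0 * log_lik + 2.0 * k

# Hypothetical residuals from a fitted line (slope, intercept, variance: k = 3).
residuals = [0.5, -0.3, 0.2, -0.4, 0.1, -0.1]
log_lik = gaussian_log_likelihood(residuals)
print("log-likelihood:", log_lik)
print("AIC:", aic(log_lik, k=3))
```

Only differences in AIC between models fitted to the same data are meaningful; the absolute value has no interpretation on its own.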
BIC (Bayesian Information Criterion)
$$\text{BIC} = -2\ln\hat{L} + k\ln n$$
Here,
- $n$ = Sample size
- $k$ = Number of parameters
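The practical difference from AIC lies in the per-parameter penalty: a constant 2 for AIC versus $\ln n$ for BIC. The short sketch below (function names are illustrative) shows where the two cross; since $\ln 8 \approx 2.08$, BIC is the stricter criterion for any sample of eight or more observations.

```python
import math

def bic(log_lik, k, n):
    """BIC = -2 ln(L-hat) + k ln(n)."""
    return -2.0 * log_lik + k * math.log(n)

def penalty_per_parameter(n):
    """Cost of one extra parameter under each criterion."""
    return {"AIC": 2.0, "BIC": math.log(n)}

for n in (5, 8, 100, 10_000):
    p = penalty_per_parameter(n)
    print(f"n={n}: AIC penalty={p['AIC']:.2f}, BIC penalty={p['BIC']:.2f}")
```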
K-Fold Cross-Validation
$$\text{CV}_{(K)} = \frac{1}{K}\sum_{i=1}^{K}\text{MSE}_i$$
Here,
- $K$ = Number of folds
- $\text{MSE}_i$ = Mean squared error on fold $i$
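The averaging over folds can be sketched directly with NumPy. This is a minimal version under stated assumptions: the model is ordinary least squares, the folds come from a random shuffle, and the name `k_fold_mse` and the synthetic data are hypothetical.

```python
import numpy as np

def k_fold_mse(X, y, k=5, seed=0):
    """Average held-out MSE of ordinary least squares over k folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))
    folds = np.array_split(indices, k)
    fold_mse = []
    for i in range(k):
        test_idx = folds[i]
        # Train on every fold except the i-th.
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        beta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        errors = y[test_idx] - X[test_idx] @ beta
        fold_mse.append(float(np.mean(errors ** 2)))
    return float(np.mean(fold_mse)), fold_mse

# Synthetic data: intercept column plus two features, unit-variance noise.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=100)
cv_score, per_fold = k_fold_mse(X, y, k=5)
print("CV MSE:", cv_score)
```

Every observation is used for testing exactly once, so the estimate uses all of the data without ever scoring a model on points it was trained on.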
Ridge Regression (L2)
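$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\left\{\sum_{i=1}^{n}\big(y_i - x_i^\top\beta\big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2\right\}$$
Here,
- $\lambda$ = Regularization strength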
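For this penalty the minimizer has a closed form, $(X^\top X + \lambda I)^{-1}X^\top y$. The sketch below implements it under the assumption that the data are already centered (no intercept is fitted); `ridge_fit` and the synthetic data are illustrative names and values.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution; assumes centered data and no intercept."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=50)

for lam in (0.0, 1.0, 10.0, 100.0):
    beta = ridge_fit(X, y, lam)
    print(f"lambda={lam}: coefficient norm={np.linalg.norm(beta):.3f}")
```

The coefficient norm shrinks as $\lambda$ grows, but individual coefficients generally stay nonzero.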
Lasso Regression (L1)
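$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\left\{\sum_{i=1}^{n}\big(y_i - x_i^\top\beta\big)^2 + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert\right\}$$
Here,
- $\lambda$ = Regularization strength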
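Lasso has no closed form in general, but coordinate descent with soft-thresholding is one common way to solve it. The sketch below is a bare-bones version under stated assumptions: the objective is residual sum of squares plus $\lambda\sum_j\lvert\beta_j\rvert$ (so the threshold is $\lambda/2$), the data are centered, a fixed number of sweeps replaces a convergence check, and all names and data are illustrative.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t, returning exactly zero when |rho| <= t."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize RSS + lam * sum(|beta_j|) by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back.
            partial_resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_resid
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_beta = np.array([3.0, 0.0, 0.0, -2.0, 0.0])  # sparse by construction
y = X @ true_beta + 0.1 * rng.normal(size=100)
print(lasso_coordinate_descent(X, y, lam=20.0))
```

The soft-threshold step is what produces exact zeros: any coefficient whose correlation with the partial residual falls below the threshold is set to zero rather than merely shrunk.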
AIC vs BIC
| Criterion | Penalty | Favors | Best For | Consistency |
|---|---|---|---|---|
| AIC | $2k$ | Larger models | Prediction accuracy | No |
| BIC | $k\ln n$ | Simpler models | Interpretability | Yes |
Regularization Comparison
| Method | Penalty | Effect | Feature Selection? |
|---|---|---|---|
| Ridge (L2) | $\lambda\sum_j\beta_j^2$ | Shrinks all coefficients | No |
| Lasso (L1) | $\lambda\sum_j\lvert\beta_j\rvert$ | Drives some to zero | Yes |
| Elastic Net | L1 + L2 | Combines both | Partially |
Quick Example
Choosing Between Models
Model A: AIC = 100, BIC = 110. Model B: AIC = 105, BIC = 105.
- If prediction is the goal: prefer Model A (lower AIC).
- If interpretability matters or the true model is believed to be sparse: prefer Model B (lower BIC).
- In large samples, BIC is consistent (it selects the true model if that model is in the candidate set), whereas AIC targets the model with the smallest expected KL divergence from the truth (the best predictive model).
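The comparison above amounts to taking the minimum of each criterion across the candidates. A small sketch using the scores from this example (the dictionary layout and the helper `best_by` are illustrative):

```python
# Scores taken from the example above.
models = {
    "A": {"AIC": 100, "BIC": 110},
    "B": {"AIC": 105, "BIC": 105},
}

def best_by(criterion, scores):
    """Return the name of the model with the lowest value of the criterion."""
    return min(scores, key=lambda name: scores[name][criterion])

print("Lowest AIC:", best_by("AIC", models))
print("Lowest BIC:", best_by("BIC", models))
```

When the two criteria disagree, as they do here, the choice depends on whether prediction or parsimony is the priority.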
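Cross-Validation for Regularization
Ridge regression at three increasing values of $\lambda$ (small, moderate, large): CV MSE = 45.2, 38.7, and 52.1, respectively.
Best: the moderate $\lambda$, which has the lowest CV MSE because it balances bias and variance. Too small a $\lambda$ overfits; too large a $\lambda$ underfits.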
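A grid search of this kind can be sketched in a few lines. This is a rough illustration under stated assumptions: the $\lambda$ grid, the synthetic data, and the helper names are hypothetical, ridge is fitted in closed form on centered data without an intercept, and the resulting CV MSE values will not match the numbers quoted above.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution; assumes centered data and no intercept."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_mse_for_lambda(X, y, lam, k=5, seed=0):
    """K-fold CV estimate of the test MSE for one value of lambda."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        beta = ridge_fit(X[train], y[train], lam)
        scores.append(np.mean((y[test] - X[test] @ beta) ** 2))
    return float(np.mean(scores))

def select_lambda(X, y, lambdas, k=5):
    """Score every candidate on the same folds and return the best one."""
    scores = {lam: cv_mse_for_lambda(X, y, lam, k) for lam in lambdas}
    return min(scores, key=scores.get), scores

# Synthetic data with many weak predictors relative to the sample size.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 30))
y = X @ rng.normal(scale=0.5, size=30) + rng.normal(scale=2.0, size=60)

lambdas = (0.01, 1.0, 10.0, 100.0, 1000.0)  # hypothetical grid
best_lam, scores = select_lambda(X, y, lambdas)
for lam, mse in scores.items():
    print(f"lambda={lam}: CV MSE={mse:.2f}")
print("selected lambda:", best_lam)
```

Using the same fold assignment for every candidate keeps the comparison fair, since differences in CV MSE then reflect $\lambda$ rather than a different split.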
Key Takeaways
Summary: Model Selection
- Bias-Variance Tradeoff: Increasing complexity reduces bias but increases variance. The optimal model minimizes the sum of squared bias and variance.
- Cross-Validation: K-fold CV estimates generalization error. Use it to compare models and tune hyperparameters.
- AIC vs BIC: AIC minimizes prediction error (favors larger models). BIC penalizes complexity more (favors simpler models).
- Regularization: Ridge ($\lambda\sum_j\beta_j^2$) shrinks all coefficients. Lasso ($\lambda\sum_j\lvert\beta_j\rvert$) drives some to zero, performing feature selection.
- Underfitting vs Overfitting: High bias = too simple; high variance = too complex. Monitor training vs. validation error curves.
- Workflow: Split data into train/validation/test → use CV to select hyperparameters → evaluate once on held-out test set.
- Adjusted R²: Penalizes for adding predictors. Use it to compare models with different numbers of features.
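The adjusted R² in the last item is usually defined as $1 - (1 - R^2)\frac{n-1}{n-p-1}$, with $n$ observations and $p$ predictors (excluding the intercept). A brief sketch with made-up numbers shows how a small gain in R² can be outweighed by the cost of extra predictors; the function name and the example values are illustrative.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (intercept excluded)."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical comparison: five extra predictors raise R^2 only slightly.
small = adjusted_r2(0.800, n=50, p=5)
large = adjusted_r2(0.801, n=50, p=10)
print("5 predictors: ", small)
print("10 predictors:", large)
```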
Deep Dive
For detailed explanations, worked examples, and Python implementations, explore the dedicated statistics lessons:
Cross-Validation
- Cross-Validation: K-fold, stratified, leave-one-out, and nested cross-validation
Information Criteria
- AIC and BIC: Derivation, interpretation, model averaging, and when each is appropriate
ROC and AUC
- ROC and AUC: Threshold-independent evaluation, ROC curves, AUC interpretation, and trade-offs
Related Topics
- Regression Assumptions: What happens when model assumptions are violated
- Multiple Linear Regression: Variable selection in regression models
- Ridge Regression: L2 regularization for multicollinearity
- Lasso Regression: L1 regularization for feature selection
- Elastic Net: Combining L1 and L2 penalties