Linear Regression: Bias-Variance Tradeoff & Regularization
Understanding the foundational algorithm behind predictive modeling
Interview Question
"Explain the bias-variance tradeoff in the context of linear regression. How does regularization (L1 vs L2) address overfitting, and when would you choose one over the other?"
Difficulty: Medium-Hard | Frequently asked at Google, Amazon, Meta
Theoretical Foundation
What is Linear Regression?
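Linear regression models the relationship between a dependent variable and one or more independent variables by fitting a linear equation:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon$$
where $\beta_0$ is the intercept, $\beta_1, \dots, \beta_p$ are coefficients, and $\epsilon$ is the error term assumed to follow $\mathcal{N}(0, \sigma^2)$.
The ordinary least squares (OLS) estimator minimizes the residual sum of squares (with a leading 1 in each $x_i$ to carry the intercept):
$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 = \arg\min_{\beta} \|y - X\beta\|_2^2$$
The closed-form solution is:
$$\hat{\beta} = (X^\top X)^{-1} X^\top y$$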
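As a minimal sketch of the closed-form fit, assuming NumPy is available: the function name `fit_ols` and the synthetic data below are illustrative and not part of the original material. It fits the same data with a least-squares solver and with the normal equations, which should agree.

```python
import numpy as np

def fit_ols(X, y):
    """Return [intercept, coef_1, ..., coef_p] minimizing the residual sum of squares."""
    # Prepend a column of ones so the first coefficient is the intercept.
    X1 = np.column_stack([np.ones(len(X)), X])
    # lstsq solves the least-squares problem without explicitly inverting X^T X,
    # which is numerically safer when columns are nearly collinear.
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

# Synthetic data (assumed for illustration): intercept 1.5, coefficients 2, -1, 0.5.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1.5 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

beta_hat = fit_ols(X, y)

# The textbook normal-equations form (X^T X)^{-1} X^T y, for comparison.
X1 = np.column_stack([np.ones(len(X)), X])
beta_normal = np.linalg.solve(X1.T @ X1, X1.T @ y)

print("lstsq:           ", np.round(beta_hat, 3))
print("normal equations:", np.round(beta_normal, 3))
```

In practice a solver such as `lstsq` is usually preferred over forming the inverse, since both give the same estimate on well-conditioned data but the solver degrades more gracefully.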
ℹ️
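Key Insight: The OLS estimator is the Best Linear Unbiased Estimator (BLUE) under the Gauss-Markov assumptions. However, "best" only means lowest variance among linear unbiased estimators. A biased estimator can still achieve lower prediction error, especially when the model is misspecified or when $p$ (the number of features) is large relative to $n$ (the number of samples).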
The Bias-Variance Tradeoff
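The expected squared prediction error at a point $x$ can be decomposed into three components:
$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \text{Bias}[\hat{f}(x)]^2 + \text{Var}[\hat{f}(x)] + \sigma^2$$
Bias measures how far the model's average prediction is from the true function. High bias indicates the model is too simple (underfitting). For linear regression:
$$\text{Bias}[\hat{f}(x)] = \mathbb{E}[\hat{f}(x)] - f(x)$$
This is zero for OLS when the true relationship is linear in the included features, since then $\mathbb{E}[\hat{\beta}] = \beta$.
Variance measures how much the model's predictions change when it is trained on different samples from the same population. High variance indicates the model is too complex (overfitting):
$$\text{Var}[\hat{f}(x)] = \mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]$$
For OLS, $\text{Var}(\hat{\beta}) = \sigma^2 (X^\top X)^{-1}$, which becomes large when features are highly correlated or when $p$ approaches $n$.
Irreducible error $\sigma^2$ represents noise that no model can eliminate.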
⚠️
Common Interview Trap: Many candidates explain bias and variance separately but fail to articulate the tradeoff. As model complexity increases, bias decreases but variance increases. The optimal model minimizes the total error, not just one component.
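The tradeoff can be made concrete with a small simulation. The sketch below is an illustration under stated assumptions, not a general-purpose tool: it assumes NumPy, a fixed training design where only the noise is redrawn, a ridge-style penalty $\lambda$ as the complexity knob, and made-up sizes (`n=30`, `p=15`). The helper name `ridge_bias_variance` is hypothetical.

```python
import numpy as np

def ridge_bias_variance(lam, n=30, p=15, n_sims=1000, sigma=1.0, seed=0):
    """Monte Carlo estimate of (bias^2, variance) of ridge predictions for one lambda."""
    rng = np.random.default_rng(seed)
    beta = rng.normal(size=p)            # "true" coefficients (assumed)
    X = rng.normal(size=(n, p))          # fixed training design
    X_test = rng.normal(size=(200, p))   # points where predictions are evaluated
    f_test = X_test @ beta               # noise-free targets at those points

    # Each column of Y is one simulated training response vector.
    Y = (X @ beta)[:, None] + rng.normal(scale=sigma, size=(n, n_sims))

    # Ridge closed form applied to every simulated dataset at once.
    B = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)   # shape (p, n_sims)
    preds = X_test @ B                                        # shape (200, n_sims)

    bias_sq = np.mean((preds.mean(axis=1) - f_test) ** 2)     # averaged over test points
    variance = np.mean(preds.var(axis=1))
    return bias_sq, variance

lams = [0.0, 1.0, 10.0, 100.0]
results = {lam: ridge_bias_variance(lam) for lam in lams}
for lam in lams:
    b2, var = results[lam]
    print(f"lambda={lam:6.1f}  bias^2={b2:8.4f}  variance={var:8.4f}  sum={b2 + var:8.4f}")
```

Reading the printed rows from top to bottom shows the two components moving in opposite directions as the model becomes more constrained, and the row with the smallest sum marks the best compromise for this particular setup.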
Visual Intuition
Think of a dartboard analogy:
- Low Bias, Low Variance = Darts clustered at the bullseye (ideal)
- Low Bias, High Variance = Darts scattered but centered on bullseye
- High Bias, Low Variance = Darts clustered but far from bullseye
- High Bias, High Variance = Darts scattered and far from bullseye (worst)
How Regularization Addresses Overfitting
When we have many features or correlated features, OLS can produce unstable estimates with high variance. Regularization adds a penalty term to the loss function to constrain model complexity.
Ridge Regression (L2 Regularization)
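Ridge adds a squared $\ell_2$ penalty to the residual sum of squares:
$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$$
The closed-form solution becomes:
$$\hat{\beta}^{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$$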
Key properties:
- Shrinks coefficients toward zero but never exactly to zero
- Handles multicollinearity by distributing weight among correlated features
- The tuning parameter $\lambda \geq 0$ controls the strength of regularization
- When $\lambda \to \infty$, all coefficients $\to 0$
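These properties can be checked numerically. The following sketch assumes NumPy and the unscaled objective $\|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2$; the near-duplicate feature pair, the noise levels, and the helper name `ridge` are illustrative choices, not from the original.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate (X^T X + lam * I)^{-1} X^T y, no intercept."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)     # nearly a copy of x1 (multicollinearity)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

lams = [0.0, 0.1, 1.0, 10.0, 1000.0]
betas = {lam: ridge(X, y, lam) for lam in lams}
norms = [float(np.linalg.norm(betas[lam])) for lam in lams]

for lam in lams:
    print(f"lambda={lam:7.1f}  beta={np.round(betas[lam], 3)}")
```

With `lam=0.0` this is plain OLS, whose two coefficients can be far apart because the data barely distinguish the twin features. A modest penalty pulls them toward a shared value, and the coefficient norm keeps shrinking as `lam` grows without any entry becoming exactly zero.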
Lasso Regression (L1 Regularization)
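Lasso replaces the squared penalty with an $\ell_1$ penalty:
$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$$
Key properties: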
- Can shrink coefficients exactly to zero (sparse solutions)
- Performs automatic feature selection
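- The $\ell_1$ penalty creates a diamond-shaped constraint region whose corners lie on the axes, so the solution often lands where some coefficients are exactly zero
- No closed-form solution in general; typically solved with coordinate descent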
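A minimal sketch of cyclic coordinate descent for the Lasso, assuming NumPy and the unscaled objective $\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ (libraries often scale the loss differently, so their $\lambda$ values are not directly comparable). The function names, the fixed iteration count, and the synthetic data are illustrative assumptions.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t, returning exactly 0 when |rho| <= t."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)      # x_j^T x_j for each column
    residual = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual, adding back its own contribution.
            rho = X[:, j] @ residual + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam / 2.0) / col_sq[j]
            # Keep the residual in sync with the updated coefficient.
            residual += X[:, j] * (beta[j] - new_bj)
            beta[j] = new_bj
    return beta

# Synthetic data (assumed): only the first two of ten features matter.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
beta_true = np.array([3.0, -2.0] + [0.0] * 8)
y = X @ beta_true + rng.normal(scale=0.5, size=100)

beta_hat = lasso_cd(X, y, lam=60.0)
print("estimate:", np.round(beta_hat, 3))
print("non-zero coefficients:", int(np.count_nonzero(beta_hat)))
```

The soft-thresholding step is what produces exact zeros: any coordinate whose correlation with the residual falls below `lam / 2` is set to zero rather than merely made small. The surviving coefficients are shrunk relative to their true values, which is the bias paid for sparsity.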
Elastic Net (L1 + L2 Combined)
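Elastic Net combines both penalties:
$$\hat{\beta}^{\text{enet}} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2$$
💡
When to use Elastic Net: When you have many correlated features and want both feature selection (from the L1 term) and the grouping effect (from the L2 term), where correlated features tend to be kept or dropped together. Netflix famously uses Elastic Net for their recommendation system feature selection.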
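The grouping behaviour can be illustrated by extending coordinate descent with an L2 term. This is a sketch under assumptions: NumPy, the unscaled objective $\|y - X\beta\|_2^2 + \lambda_1\|\beta\|_1 + \lambda_2\|\beta\|_2^2$, a hypothetical helper `elastic_net_cd`, and a toy dataset with two nearly identical features.

```python
import numpy as np

def elastic_net_cd(X, y, l1, l2, n_iter=200):
    """Minimize ||y - X b||^2 + l1 * ||b||_1 + l2 * ||b||_2^2 by coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    residual = y.copy()                      # equals y - X @ beta because beta starts at zero
    for _ in range(n_iter):
        for j in range(p):
            rho = X[:, j] @ residual + col_sq[j] * beta[j]
            # Soft-threshold for the L1 term, extra l2 in the denominator for the L2 term.
            shrunk = np.sign(rho) * max(abs(rho) - l1 / 2.0, 0.0)
            new_bj = shrunk / (col_sq[j] + l2)
            residual += X[:, j] * (beta[j] - new_bj)
            beta[j] = new_bj
    return beta

rng = np.random.default_rng(5)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)     # near-duplicate of x1
X = np.column_stack([x1, x2])
y = 1.5 * x1 + 1.5 * x2 + rng.normal(scale=0.5, size=n)

b_lasso = elastic_net_cd(X, y, l1=60.0, l2=0.0)    # l2=0 recovers the Lasso update
b_enet = elastic_net_cd(X, y, l1=60.0, l2=10.0)

print("L1 only:     ", np.round(b_lasso, 3))
print("L1 + L2 mix: ", np.round(b_enet, 3))
```

With the L2 term switched on, the objective is strictly convex and the two twin features receive nearly equal weights. The L1-only run is printed for comparison; how it splits weight between the twins depends on the data and on the update order.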
L1 vs L2: When to Choose Which?
| Criterion | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Feature Selection | Yes (sparse solutions) | No |
| Multicollinearity | Tends to keep one feature from a correlated group | Spreads weight across the correlated group |
| Interpretability | Higher (fewer features) | Lower |
| Computation | Slower (no closed-form) | Faster (closed-form) |
| When $p > n$ | Selects at most $n$ features | Can include all $p$ features |
| Solution Uniqueness | Not unique if $p > n$ | Always unique (for $\lambda > 0$) |
| Geometric Interpretation | Diamond constraint (corners on axes) | Circular constraint (no corners) |
Code Implementation
Explanation of Code
The code above demonstrates:
- Data Generation: Creates a synthetic dataset with 50 features but only 10 truly informative, simulating real-world scenarios where many features are irrelevant.
- Model Comparison: Shows how OLS, Ridge, Lasso, and Elastic Net perform differently on the same data. Notice Lasso produces sparse solutions while Ridge keeps all features.
- Cross-Validation for λ Selection: Demonstrates how to find the optimal regularization strength by balancing training and validation performance.
- Coefficient Shrinkage Paths: Visualizes how L1 and L2 regularization differently shrink coefficients as λ increases.
- Bias-Variance Decomposition: Uses bootstrap to directly estimate bias² and variance at different regularization strengths.
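One of the steps listed above, cross-validation for λ, could be sketched with NumPy alone roughly as follows. Only the 50-feature, 10-informative setup comes from the description; the sample size, noise level, λ grid, fold count, and the helper names `ridge` and `kfold_mse` are assumptions made for this illustration.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate, no intercept (the data below are zero-mean by construction)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def kfold_mse(X, y, lam, k=5, seed=0):
    """Average validation MSE of ridge over k shuffled folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        beta = ridge(X[train], y[train], lam)
        errors.append(np.mean((y[val] - X[val] @ beta) ** 2))
    return float(np.mean(errors))

# Synthetic data: 50 features, only the first 10 carry signal.
rng = np.random.default_rng(42)
n, p, n_informative = 120, 50, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:n_informative] = rng.normal(scale=2.0, size=n_informative)
y = X @ beta_true + rng.normal(scale=3.0, size=n)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
cv_scores = {lam: kfold_mse(X, y, lam) for lam in lambdas}
best_lam = min(cv_scores, key=cv_scores.get)

for lam in lambdas:
    print(f"lambda={lam:7.2f}  CV MSE={cv_scores[lam]:8.3f}")
print("selected lambda:", best_lam)
```

The same folds are reused for every candidate λ (fixed `seed`), so differences in the scores reflect the penalty rather than the split. A finer, log-spaced grid would normally be used in practice.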
Real-World Applications
Google: Ad Click Prediction
Google uses regularized linear models (specifically FTRL-Proximal) for online ad click prediction. The sparsity induced by L1 regularization is crucial for serving predictions at very large scale with limited memory.
Amazon: Dynamic Pricing
Amazon's pricing algorithms use Ridge regression to handle the multicollinearity between features like competitor price, demand, and inventory levels.
Finance: Risk Models
Banks use Elastic Net for credit scoring where:
- L1 selects the most predictive financial ratios
- L2 handles the natural correlation between financial metrics
- The resulting model must be interpretable for regulatory compliance
💡
Production Tip: In production systems, always standardize your features before applying regularization. The penalty is applied to coefficient magnitudes, so features on different scales will be penalized unequally.
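The effect of scale can be shown directly. The sketch below assumes NumPy and a plain closed-form ridge fit; the helper names, the penalty value, and the unit change (one feature multiplied by 1000, as in metres to millimetres) are illustrative assumptions.

```python
import numpy as np

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def standardize(X_train, X_new):
    """Scale both sets using statistics from the training set only."""
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0)
    return (X_train - mu) / sd, (X_new - mu) / sd

def ridge_predict(X_train, y_train, X_new, lam, scale):
    if scale:
        X_train, X_new = standardize(X_train, X_new)
    y_mean = y_train.mean()                       # the intercept is handled by centering y
    beta = ridge(X_train, y_train - y_mean, lam)
    return X_new @ beta + y_mean

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.5, size=100)
X_new = rng.normal(size=(5, 3))

# Same information, but feature 0 is recorded in a unit 1000x smaller.
units = np.array([1000.0, 1.0, 1.0])
X_mm, X_new_mm = X * units, X_new * units

raw_a = ridge_predict(X, y, X_new, lam=50.0, scale=False)
raw_b = ridge_predict(X_mm, y, X_new_mm, lam=50.0, scale=False)
std_a = ridge_predict(X, y, X_new, lam=50.0, scale=True)
std_b = ridge_predict(X_mm, y, X_new_mm, lam=50.0, scale=True)

print("raw features, predictions change with units: ", not np.allclose(raw_a, raw_b))
print("standardized, predictions unchanged by units:", np.allclose(std_a, std_b))
```

Note that the scaling statistics come from the training data and are then applied to new data, which mirrors how a fitted scaler would be reused at serving time.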
Common Follow-Up Questions
Q1: What happens to the bias-variance tradeoff as you increase λ?
As λ increases:
- Bias increases: The model becomes more constrained, moving away from the true function
- Variance decreases: The model becomes more stable across different training sets
- Optimal λ minimizes the total prediction error
Q2: Can Lasso select more features than the number of samples?
No. When $p > n$, Lasso yields at most $n$ non-zero coefficients, where $n$ is the number of samples and $p$ is the number of features. This is a fundamental limitation that Elastic Net addresses through its L2 term.
Q3: How do you choose between Ridge and Lasso in practice?
- If you believe most features are relevant → Ridge
- If you believe few features are relevant → Lasso
- If you're unsure → Elastic Net with cross-validation
- Always check the correlation structure of your features
Q4: What is the connection between regularization and Bayesian inference?
- Ridge = Gaussian prior on coefficients: $\beta_j \sim \mathcal{N}(0, \tau^2)$
- Lasso = Laplace prior on coefficients: $\beta_j \sim \text{Laplace}(0, b)$
- The MAP estimate with these priors yields the regularized solutions
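The MAP connection can be written out as a short derivation. This sketch assumes Gaussian noise with variance $\sigma^2$, independent priors on each coefficient, and the unscaled objectives $\|y - X\beta\|_2^2 + \lambda \cdot \text{penalty}$; the symbols $\tau$ and $b$ for the prior scales are notation chosen for this illustration.

$$
\begin{aligned}
\hat{\beta}_{\text{MAP}} &= \arg\max_{\beta}\; \big[\log p(y \mid X, \beta) + \log p(\beta)\big] \\[4pt]
\text{Gaussian prior, } \beta_j \sim \mathcal{N}(0, \tau^2):\quad
\hat{\beta}_{\text{MAP}} &= \arg\min_{\beta}\; \frac{1}{2\sigma^2}\|y - X\beta\|_2^2 + \frac{1}{2\tau^2}\|\beta\|_2^2 \\
&= \arg\min_{\beta}\; \|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2, \qquad \lambda = \frac{\sigma^2}{\tau^2} \\[4pt]
\text{Laplace prior, } p(\beta_j) = \tfrac{1}{2b} e^{-|\beta_j|/b}:\quad
\hat{\beta}_{\text{MAP}} &= \arg\min_{\beta}\; \frac{1}{2\sigma^2}\|y - X\beta\|_2^2 + \frac{1}{b}\|\beta\|_1 \\
&= \arg\min_{\beta}\; \|y - X\beta\|_2^2 + \lambda\|\beta\|_1, \qquad \lambda = \frac{2\sigma^2}{b}
\end{aligned}
$$

In both cases a tighter prior (smaller $\tau$ or $b$) corresponds to a larger $\lambda$, that is, stronger regularization.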
Company-Specific Tips
Google Interview Tips
- Emphasize understanding of the mathematical derivation of the closed-form solution
- Be prepared to discuss computational complexity when $n$ or $p$ is very large (the closed-form solution costs $O(np^2 + p^3)$)
- Mention SGD variants for online learning scenarios
- Discuss FTRL-Proximal for sparse online learning
Amazon Interview Tips
- Focus on business impact: How does regularization improve predictions on unseen data?
- Discuss A/B testing regularized models in production
- Be ready to explain why regularization reduces overfitting using the variance formula
- Mention interpretability as a business requirement