
Regularization: L1, L2, Elastic Net & Dropout


OpenAI & Anthropic Interview


Techniques to prevent overfitting and improve generalization

Interview Question

"Explain the difference between L1 and L2 regularization from both geometric and Bayesian perspectives. What is Elastic Net and when would you use it? How does dropout work in neural networks?"

Difficulty: Hard | Frequently asked at OpenAI, Anthropic, Google


Theoretical Foundation

Why Regularization?

Without regularization, models can overfit by learning noise, assigning large weights to irrelevant features, and creating overly complex decision boundaries. Regularization adds constraints to prevent this.

L2 Regularization (Ridge)

Mathematical Formulation

$$J(\beta) = \text{Loss}(\beta) + \lambda \|\beta\|_2^2 = \text{Loss}(\beta) + \lambda \sum_{j=1}^{p} \beta_j^2$$

Geometric Interpretation

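The L2 constraint region is a ball (a hypersphere and its interior). The constrained solution is the point where the lowest loss contour first touches this ball. Because the boundary is smooth and has no corners, the contact point typically does not lie on a coordinate axis, so coefficients shrink without becoming exactly zero.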

Bayesian Interpretation

L2 regularization corresponds to a Gaussian prior on coefficients:

$$P(\beta) \propto \exp\left(-\frac{\lambda}{2} \|\beta\|_2^2\right)$$
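To make the link to the penalty explicit, here is a short worked derivation. It assumes that Loss(β) is the negative log-likelihood −log P(y | β), which the text does not state; under that assumption the MAP estimate with the Gaussian prior above is a penalized fit:

```latex
\begin{aligned}
\hat{\beta}_{\text{MAP}}
  &= \arg\max_{\beta} \; \log P(y \mid \beta) + \log P(\beta) \\
  &= \arg\min_{\beta} \; -\log P(y \mid \beta) + \frac{\lambda}{2} \|\beta\|_2^2 \\
  &= \arg\min_{\beta} \; \text{Loss}(\beta) + \frac{\lambda}{2} \|\beta\|_2^2
\end{aligned}
```

The factor 1/2 relative to the ridge objective is only a convention, since rescaling λ absorbs it. The prior is a zero-mean Gaussian with variance 1/λ per coefficient, so a larger λ means a tighter prior and stronger shrinkage.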

Properties

  • Shrinks coefficients toward zero, but typically not exactly to zero
  • Handles multicollinearity by spreading weight across correlated features
  • Differentiable everywhere
  • Closed-form solution exists for linear regression with squared-error loss
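The closed-form solution can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the loss is the sum of squared errors ||y − Xβ||² with no intercept; the function name `ridge_closed_form` and the synthetic data are hypothetical, not from the original page.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimize ||y - X b||^2 + lam * ||b||_2^2 (no intercept term).

    Setting the gradient to zero gives (X^T X + lam * I) b = X^T y.
    """
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    # Solve the linear system rather than forming an explicit inverse.
    return np.linalg.solve(A, X.T @ y)

# Synthetic data (assumed for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y = X @ true_beta + 0.1 * rng.normal(size=50)

beta_ols = ridge_closed_form(X, y, lam=0.0)     # ordinary least squares
beta_ridge = ridge_closed_form(X, y, lam=10.0)  # shrunk toward zero
```

With λ > 0 the matrix XᵀX + λI is always invertible, which is why ridge stays well defined even when features are collinear.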

L1 Regularization (Lasso)

Mathematical Formulation

$$J(\beta) = \text{Loss}(\beta) + \lambda \|\beta\|_1 = \text{Loss}(\beta) + \lambda \sum_{j=1}^{p} |\beta_j|$$

Geometric Interpretation

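The L1 constraint region is a cross-polytope: a diamond in two dimensions and an octahedron in three (the hypercube is the L∞ ball, not the L1 ball). Its corners lie on the coordinate axes, and the loss contour often first touches the region at a corner or edge, where one or more coefficients are exactly zero. This is what promotes sparsity.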

Bayesian Interpretation

L1 regularization corresponds to a Laplace prior:

$$P(\beta) \propto \exp\left(-\lambda \|\beta\|_1\right)$$

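The Laplace prior has a sharp, non-differentiable peak at zero (the Gaussian also peaks at zero, but smoothly). That kink is what lets the MAP estimate land exactly at zero, encouraging sparsity.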

Properties

  • Can shrink coefficients exactly to zero (sparse solutions)
  • Performs automatic feature selection
  • Not differentiable at zero
  • No closed-form solution in general (use coordinate descent or proximal gradient methods)
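Coordinate descent for the lasso can be sketched as follows. This is a minimal illustration, assuming the loss is 0.5·||y − Xβ||² with no intercept; the function names and the synthetic data are hypothetical. Each coordinate update has a closed form given by the soft-thresholding operator, which is where the exact zeros come from.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; any |z| <= t maps to exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimize 0.5 * ||y - X b||^2 + lam * ||b||_1 by cyclic updates."""
    beta = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)   # x_j^T x_j for each column
    residual = y.copy()             # equals y - X @ beta since beta is zero
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            residual += X[:, j] * beta[j]   # remove feature j's contribution
            rho = X[:, j] @ residual        # correlation with partial residual
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            residual -= X[:, j] * beta[j]   # add the updated contribution back
    return beta

# Synthetic data (assumed): only features 0 and 3 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_beta = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y = X @ true_beta + 0.1 * rng.normal(size=200)

beta_lasso = lasso_coordinate_descent(X, y, lam=50.0)
```

In this setup the coefficients of the three irrelevant features come out exactly zero, while the two informative coefficients survive, slightly shrunk toward zero.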

L1 vs L2 Comparison

| Aspect | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Sparsity | Yes (feature selection) | No |
| Geometry | Diamond (corners on axes) | Sphere (no corners) |
| Prior | Laplace | Gaussian |
| Differentiability | Not at zero | Everywhere |
| Closed-form | No | Yes |
| Multicollinearity | Selects one feature | Distributes weight |

⚠️

Interview Trap: Don't just say "L1 does feature selection." Explain why geometrically (the diamond's corners lie on the axes) and probabilistically (the Laplace prior has a sharp, non-differentiable peak at zero, unlike the smooth Gaussian).

Elastic Net

Combines L1 and L2:

$$J(\beta) = \text{Loss}(\beta) + \alpha \lambda \|\beta\|_1 + (1-\alpha) \lambda \|\beta\|_2^2$$

Use Elastic Net when you have many correlated features and want both feature selection (L1) and stability (L2).
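The behaviour on correlated features can be seen in a small sketch. It assumes the loss is 0.5·||y − Xβ||² with no intercept and uses coordinate descent; the function names and the duplicated-feature data are hypothetical. Setting `alpha=1.0` recovers the pure L1 penalty for comparison.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; any |z| <= t maps to exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def elastic_net_cd(X, y, lam, alpha, n_sweeps=300):
    """Coordinate descent for
    0.5 * ||y - X b||^2 + alpha*lam*||b||_1 + (1 - alpha)*lam*||b||_2^2.
    """
    beta = np.zeros(X.shape[1])
    col_sq = (X ** 2).sum(axis=0)
    residual = y.copy()   # equals y - X @ beta since beta starts at zero
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            residual += X[:, j] * beta[j]
            rho = X[:, j] @ residual
            # The L1 part thresholds; the L2 part enlarges the denominator.
            denom = col_sq[j] + 2.0 * (1.0 - alpha) * lam
            beta[j] = soft_threshold(rho, alpha * lam) / denom
            residual -= X[:, j] * beta[j]
    return beta

# Two identical columns: an extreme case of correlated features (assumed data).
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X_dup = np.column_stack([x1, x1])
y_dup = 3.0 * x1 + 0.1 * rng.normal(size=100)

beta_l1 = elastic_net_cd(X_dup, y_dup, lam=10.0, alpha=1.0)  # pure L1
beta_en = elastic_net_cd(X_dup, y_dup, lam=10.0, alpha=0.5)  # L1 + L2 mix
```

In this setup the pure L1 fit puts essentially all weight on whichever copy is visited first, while the mixed penalty splits the weight equally between the two copies. That grouping effect is the stability the L2 term adds.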

Dropout (Neural Networks)

During training, randomly zero out each neuron with probability $p$ and scale the remaining activations by $1/(1-p)$ so that the expected activation is unchanged. At inference, use all neurons with no extra scaling.

Dropout works by:

  1. Ensemble Effect: Approximates training an ensemble of up to $2^N$ weight-sharing sub-networks, where $N$ is the number of units that can be dropped
  2. Reduced Co-adaptation: Forces neurons to learn robust features
  3. Noise Injection: Acts as implicit data augmentation
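The train/inference behaviour described above can be sketched in NumPy. This is a minimal illustration of "inverted" dropout, assuming 0 ≤ p < 1; the function name `dropout` and its signature are hypothetical and do not mirror any particular framework's API.

```python
import numpy as np

def dropout(x, p, training, rng):
    """Inverted dropout: drop each unit with probability p during training.

    Assumes 0 <= p < 1. Kept units are scaled by 1 / (1 - p) so the
    expected activation matches inference, where x passes through unchanged.
    """
    if not training or p == 0.0:
        return x
    keep_mask = rng.random(x.shape) >= p   # True with probability 1 - p
    return x * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
activations = np.ones(100_000)

train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False, rng=rng)
```

Because a fresh mask is drawn on every call, each mini-batch effectively trains a different sub-network, while the scaling keeps the average activation the same in both modes.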

💡

OpenAI Interview Tip: Dropout can be interpreted as approximate Bayesian inference in deep Gaussian processes.




Real-World Applications

OpenAI: Large Language Models

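  • Weight Decay: L2-style regularization in transformer training (with AdamW the decay is decoupled from the gradient update, so it is not identical to an L2 penalty)
  • Dropout: Preventing overfitting in attention layers
  • Early Stopping: Stopping training when validation loss stops improving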

Anthropic: AI Safety

  • Robustness: Regularizing against adversarial examples
  • Interpretability: Sparse models are more interpretable
  • Generalization: Ensuring models work on unseen distributions

Common Follow-Up Questions

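Q1: Why does L1 produce sparse solutions? Geometrically, the L1 constraint region has corners on the coordinate axes. The loss contours often first touch the region at a corner, where some coefficients are exactly zero. Probabilistically, the Laplace prior has a sharp, non-differentiable peak at zero, so the MAP estimate can sit exactly at zero.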

Q2: When should you use Elastic Net? When you have many correlated features, want both feature selection and stability, or when $p > n$.

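Q3: How does dropout relate to bagging? Dropout trains a different sub-network on each mini-batch, approximating an ensemble of up to $2^N$ networks. Unlike bagging, the sub-networks share weights, and each one is trained for only a step or so rather than to convergence.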

Q4: What is the relationship between regularization and model complexity? Regularization reduces effective model complexity by constraining the parameter space.
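For ridge regression this can be made quantitative. A minimal sketch, assuming squared-error loss ||y − Xβ||² with penalty λ||β||₂²: the effective degrees of freedom are the trace of the hat matrix X(XᵀX + λI)⁻¹Xᵀ, which equals Σⱼ dⱼ² / (dⱼ² + λ) for the singular values dⱼ of X. The function name `ridge_effective_dof` and the random design matrix are hypothetical.

```python
import numpy as np

def ridge_effective_dof(X, lam):
    """Effective degrees of freedom of ridge: sum_j d_j^2 / (d_j^2 + lam)."""
    d = np.linalg.svd(X, compute_uv=False)   # singular values of X
    return float(np.sum(d ** 2 / (d ** 2 + lam)))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))   # assumed full-rank design with 4 features

dof_none = ridge_effective_dof(X, 0.0)     # no regularization
dof_mild = ridge_effective_dof(X, 1.0)
dof_strong = ridge_effective_dof(X, 10.0)
```

With λ = 0 the effective degrees of freedom equal the number of features for a full-rank design; they fall smoothly toward zero as λ grows, so λ acts as a continuous complexity dial.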

