Regularization: Lasso, Ridge, ElasticNet

Why Regularization?

Without constraints, models minimize training error at the cost of generalization. Regularization adds a penalty term that discourages complex models.

Definition: Regularization

A technique that adds a penalty to the loss function to constrain the model's complexity, preventing overfitting by discouraging large coefficient values. Regularization introduces a bias-variance tradeoff: it slightly increases bias but significantly reduces variance, leading to better generalization on unseen data.

Overfitting Visualization:
                    Loss
                      ↑
    Training Loss ───→│   ╲
                      │    ╲─────── Training
                      │     ╲
                      │      ╲
                      │       ╲_____ Validation
                      │
                      └──────────────────→ Model Complexity

  Regularization shifts the minimum toward simpler models:
                      ↑
    Loss              │  ╲
                      │   ╲
                      │    ╲    ← Regularized
                      │     ╲
                      │      ╲
                      └──────────────────→

The Bias-Variance Tradeoff

E[\text{Total Error}] = \text{Bias}^2 + \text{Variance} + \sigma^2_{\text{irreducible}}

Theorem: Bias-Variance Decomposition

For a model $\hat{f}$ trained on dataset $D$ , the expected prediction error for any test point $x$ decomposes as: $E_D[(y - \hat{f}_D(x))^2] = \text{Bias}^2(\hat{f}(x)) + \text{Var}(\hat{f}(x)) + \sigma^2$ where $\sigma^2$ is the irreducible noise variance. Regularization increases $\text{Bias}^2$ but reduces $\text{Variance}$ , often decreasing the total error.

Bias: Error from wrong assumptions (underfitting)
Variance: Error from sensitivity to training data (overfitting)

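With a well-chosen $\alpha$ , regularization adds a small amount of bias in exchange for a larger reduction in variance. If $\alpha$ is too large, the added bias dominates and the model underfits.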
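The tradeoff can be made concrete with a small simulation. The sketch below is illustrative only: the synthetic data, the fixed test point x0, and the helper name bias2_and_variance are assumptions, not part of the material above. It refits Ridge on many noisy copies of the same design and estimates Bias² and Variance of the prediction at one point for several alpha values.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, p, sigma = 30, 20, 2.0
beta_true = rng.normal(size=p)       # hypothetical true coefficients
X = rng.normal(size=(n, p))          # fixed design, reused in every trial
x0 = rng.normal(size=(1, p))         # a single test point
f_x0 = (x0 @ beta_true)[0]           # noise-free target value at x0

def bias2_and_variance(alpha, n_trials=500):
    """Monte Carlo estimate of Bias^2 and Variance of the prediction at x0."""
    preds = np.empty(n_trials)
    for t in range(n_trials):
        # Fresh noise on the same design gives a new training set each trial
        y = X @ beta_true + rng.normal(scale=sigma, size=n)
        model = Ridge(alpha=alpha, fit_intercept=False).fit(X, y)
        preds[t] = model.predict(x0)[0]
    return (preds.mean() - f_x0) ** 2, preds.var()

results = {alpha: bias2_and_variance(alpha) for alpha in (0.01, 1.0, 10.0, 100.0)}
for alpha, (b2, var) in results.items():
    print(f"alpha={alpha:>6}: bias^2={b2:8.3f}  variance={var:8.3f}")
```

Under these assumptions the variance estimate should fall as alpha grows, while the squared bias tends to rise. The exact numbers depend on the random seed and on the chosen test point.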

Ridge Regression (L2 Regularization)

Adds the sum of squared coefficients to the loss function:

L_{\text{Ridge}} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{p} \beta_j^2 = \|y - X\beta\|_2^2 + \alpha\|\beta\|_2^2

Geometric Interpretation

L2 Constraint Region (circle):

  β₂
   ↑     ╭──────────╮
   │    │            │
   │   │   Ellipse   │
   │   │ (OLS Loss)  │
   │    │            │
   │     ╰──────────╯
   └──────────────────→ β₁

The optimal β̂ is where the loss ellipse first touches the L2 circle.
Result: coefficients are shrunk but never exactly zero.
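
The circle in the picture comes from the constrained form of the same problem. As a sketch, using a budget $t$ that does not appear in the text above: for every $\alpha > 0$ there is a $t \geq 0$ such that the penalized Ridge problem has the same solution as

```latex
\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \|y - X\beta\|_2^2
\quad \text{subject to} \quad \|\beta\|_2^2 \le t
```

In two dimensions the feasible set $\beta_1^2 + \beta_2^2 \le t$ is a disk of radius $\sqrt{t}$ , and a larger $\alpha$ corresponds to a smaller $t$ .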

Mathematical Derivation

Definition: Ridge Regression

Also known as Tikhonov regularization, Ridge regression adds an L2 penalty $\alpha \|\beta\|_2^2$ to the ordinary least squares loss. The closed-form solution is $\hat{\beta}_{\text{ridge}} = (X^TX + \alpha I)^{-1}X^Ty$ . The penalty shrinks all coefficients toward zero but never sets them exactly to zero.

The closed-form solution:

Ridge Closed-Form Solution

\hat{\beta}_{\text{ridge}} = (X^TX + \alpha I)^{-1}X^Ty

Here,

$I$ = p × p identity matrix
$\alpha$ = Regularization strength ($\alpha \geq 0$)
$X$ = Feature matrix (n × p)

As $\alpha \to 0$ : $\hat{\beta}_{\text{ridge}} \to \hat{\beta}_{\text{OLS}}$ (provided $X^TX$ is invertible)
As $\alpha \to \infty$ : $\hat{\beta}_{\text{ridge}} \to 0$
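
The closed form and its two limits can be checked numerically. This is a minimal sketch on synthetic data; the helper ridge_closed_form and the data are assumptions made for illustration. The intercept is switched off in scikit-learn's Ridge so that both computations solve exactly the same problem.

```python
import numpy as np
from sklearn.linear_model import Ridge

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha * I) beta = X^T y without forming an explicit inverse."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=50)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_small = ridge_closed_form(X, y, alpha=1e-8)   # alpha -> 0: approaches OLS
beta_large = ridge_closed_form(X, y, alpha=1e8)    # alpha -> infinity: approaches 0
beta_sklearn = Ridge(alpha=1.0, fit_intercept=False).fit(X, y).coef_

print("matches OLS for tiny alpha:", np.allclose(beta_small, beta_ols, atol=1e-6))
print("largest |coefficient| for huge alpha:", np.abs(beta_large).max())
print("matches sklearn at alpha=1:", np.allclose(ridge_closed_form(X, y, 1.0), beta_sklearn))
```

Using np.linalg.solve rather than an explicit matrix inverse is a common choice for numerical stability; the result is the same estimator.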

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error

# Generate data with multicollinearity
np.random.seed(42)
X, y, true_coefs = make_regression(
    n_samples=200, n_features=20, n_informative=5,
    noise=10, coef=True, random_state=42
)

# Add correlated features
X[:, 5] = X[:, 0] + np.random.randn(200) * 0.1
X[:, 6] = X[:, 1] + np.random.randn(200) * 0.1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Ridge regression with cross-validation
alphas = np.logspace(-4, 4, 100)
ridge_cv = RidgeCV(alphas=alphas, scoring='neg_mean_squared_error')
ridge_cv.fit(X_train_scaled, y_train)

print(f"Best alpha: {ridge_cv.alpha_:.4f}")
print(f"Test MSE:   {mean_squared_error(y_test, ridge_cv.predict(X_test_scaled)):.4f}")

Ridge regression is particularly effective when many features contribute small effects (the "many weak predictors" scenario). It shrinks all coefficients toward zero, most strongly along directions in which the features vary little, and keeps every feature in the model while reducing the influence of noisy ones. It handles multicollinearity well because, for any $\alpha > 0$ , the $\alpha I$ term makes $X^TX + \alpha I$ invertible even when $X^TX$ is singular.
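
The invertibility claim is easy to see on a deliberately degenerate example. The sketch below assumes a design with one exactly duplicated column (an extreme form of multicollinearity chosen for illustration) and compares the smallest eigenvalue of $X^TX$ with that of $X^TX + \alpha I$ .

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X = np.column_stack([X, X[:, 0]])     # fourth column duplicates the first
gram = X.T @ X                        # X^T X, singular by construction

alpha = 1.0
eig_gram = np.linalg.eigvalsh(gram)                      # eigenvalues, ascending
eig_ridge = np.linalg.eigvalsh(gram + alpha * np.eye(4))

print("rank of X:", np.linalg.matrix_rank(X), "of", X.shape[1], "columns")
print("smallest eigenvalue of X^T X:          ", eig_gram[0])
print("smallest eigenvalue of X^T X + alpha*I:", eig_ridge[0])

# The regularized system has a unique solution even though OLS does not
y = rng.normal(size=100)
beta = np.linalg.solve(gram + alpha * np.eye(4), X.T @ y)
```

Adding $\alpha I$ shifts every eigenvalue up by $\alpha$ , so the smallest one moves from (numerically) zero to about $\alpha$ .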

Lasso Regression (L1 Regularization)

Adds the sum of absolute coefficient values:

L_{\text{Lasso}} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{p} |\beta_j| = \|y - X\beta\|_2^2 + \alpha\|\beta\|_1
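
A note on scale when using scikit-learn: its Lasso estimator minimizes $\frac{1}{2n}\|y - X\beta\|_2^2 + \alpha\|\beta\|_1$ , so its alpha equals the $\alpha$ above divided by $2n$ . The sketch below checks this on synthetic data (the data and variable names are assumptions for illustration) through the smallest penalty at which every coefficient is zero, which is $2\max_j |x_j^T y|$ on the scale of the formula above when no intercept is fitted.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 80
X = rng.normal(size=(n, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=n)

# Smallest penalty that zeroes every coefficient (no intercept):
alpha_max_doc = 2 * np.max(np.abs(X.T @ y))   # scale of ||y - Xb||^2 + alpha * ||b||_1
alpha_max_skl = alpha_max_doc / (2 * n)       # same threshold on scikit-learn's scale

just_above = Lasso(alpha=1.01 * alpha_max_skl, fit_intercept=False).fit(X, y)
just_below = Lasso(alpha=0.90 * alpha_max_skl, fit_intercept=False).fit(X, y)

print("nonzero coefficients just above the threshold:", np.count_nonzero(just_above.coef_))
print("nonzero coefficients just below the threshold:", np.count_nonzero(just_below.coef_))
```

The practical consequence is that alpha values tuned with LassoCV are not numerically comparable to the $\alpha$ in the unscaled loss without this conversion.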

Geometric Interpretation

L1 Constraint Region (diamond):

  β₂
   ↑      ◇
   │     ╱ ╲
   │    ╱   ╲
   │   ╱ Loss╲
   │  ╱Ellipse╲
   │ ╱         ╲
   ◇────────────◇──→ β₁
    ╲           ╱
     ╲         ╱
      ╲       ╱
       ╲─────╱

The optimal β̂ often lands at a corner, where one or more coefficients are exactly zero.
Result: automatic feature selection (sparse model).

Why Lasso Produces Sparsity

Definition: Lasso Regression

Least Absolute Shrinkage and Selection Operator (Lasso) regression adds an L1 penalty $\alpha \|\beta\|_1$ to the OLS loss. Unlike Ridge, Lasso can set coefficients exactly to zero, performing automatic feature selection. The L1 penalty's diamond-shaped constraint region has corners where the loss ellipse is most likely to intersect, producing sparse solutions.

The L1 penalty creates corners in the constraint region where the loss ellipse is most likely to touch, setting coefficients to exactly zero.

Theorem: Lasso Sparsity

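The Lasso estimator is sparse: for sufficiently large $\alpha$ , some coefficients $\hat{\beta}_j$ are exactly zero, and the number of zero coefficients generally grows as $\alpha$ increases. Sparsity comes from the non-differentiability of the $|\beta_j|$ term at zero, which creates the corners in the constraint region: a coefficient stays at exactly zero until the gradient of the squared-error term with respect to it exceeds $\alpha$ in magnitude.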
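
The effect of the kink at zero is easiest to see in a special case that is an assumption here, not something stated above: an orthonormal design with $X^TX = I$ . In that case the Lasso solution for the loss $\|y - X\beta\|_2^2 + \alpha\|\beta\|_1$ is the soft-thresholded OLS coefficient with threshold $\alpha/2$ , while Ridge simply rescales each OLS coefficient by $1/(1+\alpha)$ . The coefficient values below are hypothetical.

```python
import numpy as np

def soft_threshold(b, t):
    """Shrink b toward zero by t; values with |b| <= t become exactly zero."""
    return np.sign(b) * np.maximum(np.abs(b) - t, 0.0)

alpha = 2.0
b_ols = np.array([3.0, -1.5, 0.8, -0.2])        # hypothetical OLS coefficients

beta_lasso = soft_threshold(b_ols, alpha / 2)   # threshold alpha / 2 = 1.0
beta_ridge = b_ols / (1 + alpha)                # uniform rescaling, no exact zeros

print("Lasso:", beta_lasso)
print("Ridge:", beta_ridge)
print("Lasso zeros:", int(np.sum(beta_lasso == 0)), "| Ridge zeros:", int(np.sum(beta_ridge == 0)))
```

The two small coefficients fall inside the threshold and are set to zero by the Lasso rule, whereas the Ridge rule leaves all four nonzero.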

from sklearn.linear_model import Lasso, LassoCV

# Lasso with cross-validation
lasso_cv = LassoCV(alphas=alphas, cv=5, random_state=42, max_iter=10000)
lasso_cv.fit(X_train_scaled, y_train)

print(f"Best alpha: {lasso_cv.alpha_:.4f}")
print(f"Test MSE:   {mean_squared_error(y_test, lasso_cv.predict(X_test_scaled)):.4f}")

# Feature selection
n_zero = np.sum(lasso_cv.coef_ == 0)
print(f"\nFeatures eliminated: {n_zero}/{X.shape[1]}")
print("\nLasso Coefficients:")
for i, coef in enumerate(lasso_cv.coef_):
    if abs(coef) > 0.01:
        print(f"  Feature {i:2d}: {coef:+.4f}")

ElasticNet: Best of Both Worlds

Combines L1 and L2 penalties:

L_{\text{ElasticNet}} = \|y - X\beta\|_2^2 + \alpha \left[ \rho \|\beta\|_1 + \frac{(1-\rho)}{2} \|\beta\|_2^2 \right]

Where:

$\alpha$ controls overall regularization strength
$\rho$ (l1_ratio) controls the mix between L1 and L2

ElasticNet Spectrum:

  ρ = 1.0                ρ = 0.5                ρ = 0.0
  Pure Lasso             Balanced               Pure Ridge
  ◆════════════════════════════════════════════════◆
  ║  Feature selection  │  Shrinks all features  ║
  ║  + grouped effects  │                         ║
  ◆════════════════════════════════════════════════◆

ElasticNet is preferred over Lasso when you have groups of correlated features. Lasso tends to arbitrarily select one feature from a correlated group and zero out the rest, while ElasticNet tends to keep or drop the entire group. Set $\rho = 0.5$ as a balanced default, and increase toward 1.0 if you want more feature selection.
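
The grouping behavior can be illustrated with the most extreme case of correlation, two identical columns. This is a sketch under that assumption, with synthetic data and hand-picked alpha values; with merely correlated features the contrast is softer. The tolerance is tightened so that the coordinate descent solver converges closely.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
x = rng.normal(size=200)
X_dup = np.column_stack([x, x])        # two identical features
y_dup = 3.0 * x + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5, tol=1e-10, max_iter=100_000).fit(X_dup, y_dup)
enet = ElasticNet(alpha=0.5, l1_ratio=0.5, tol=1e-10, max_iter=100_000).fit(X_dup, y_dup)

print("Lasso coefficients:     ", lasso.coef_)
print("ElasticNet coefficients:", enet.coef_)
```

With identical columns the Lasso solution is not unique, and the solver typically assigns almost all of the weight to one column. The L2 part of the ElasticNet penalty makes the solution unique and splits the weight evenly across the pair.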

from sklearn.linear_model import ElasticNet, ElasticNetCV

# ElasticNet with cross-validation
l1_ratios = [0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 1.0]

enet_cv = ElasticNetCV(
    l1_ratio=l1_ratios,
    alphas=alphas,
    cv=5,
    random_state=42,
    max_iter=10000
)
enet_cv.fit(X_train_scaled, y_train)

print(f"Best alpha:     {enet_cv.alpha_:.4f}")
print(f"Best l1_ratio:  {enet_cv.l1_ratio_:.4f}")
print(f"Test MSE:       {mean_squared_error(y_test, enet_cv.predict(X_test_scaled)):.4f}")

n_zero = np.sum(enet_cv.coef_ == 0)
print(f"\nFeatures eliminated: {n_zero}/{X.shape[1]}")

Coefficient Path Visualization

# scikit-learn has no ridge_path helper; the Ridge path is computed with a loop below
from sklearn.linear_model import lasso_path

# Compute coefficient paths
alphas_path = np.logspace(-2, 4, 100)

# Ridge path
ridge_coefs = []
for a in alphas_path:
    ridge = Ridge(alpha=a)
    ridge.fit(X_train_scaled, y_train)
    ridge_coefs.append(ridge.coef_)
ridge_coefs = np.array(ridge_coefs)

# Lasso path
lasso_alphas, lasso_coefs, _ = lasso_path(
    X_train_scaled, y_train, alphas=alphas_path
)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Ridge paths
for i in range(X.shape[1]):
    axes[0].plot(np.log10(alphas_path), ridge_coefs[:, i], linewidth=0.8)
axes[0].axvline(x=np.log10(ridge_cv.alpha_), color='k', linestyle='--', label='CV alpha')
axes[0].set_xlabel('log10(alpha)')
axes[0].set_ylabel('Coefficient Value')
axes[0].set_title('Ridge Coefficient Paths')
axes[0].legend()

# Lasso paths (lasso_path returns coefficients with shape (n_features, n_alphas))
for i in range(X.shape[1]):
    axes[1].plot(np.log10(lasso_alphas), lasso_coefs[i, :], linewidth=0.8)
axes[1].axvline(x=np.log10(lasso_cv.alpha_), color='k', linestyle='--', label='CV alpha')
axes[1].set_xlabel('log10(alpha)')
axes[1].set_ylabel('Coefficient Value')
axes[1].set_title('Lasso Coefficient Paths')
axes[1].legend()

plt.tight_layout()
plt.savefig('regularization_paths.png', dpi=150)
plt.show()

Comparing Ridge, Lasso, and ElasticNet

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression

housing = fetch_california_housing()
X_h, y_h = housing.data, housing.target
feature_names = housing.feature_names

X_h_train, X_h_test, y_h_train, y_h_test = train_test_split(
    X_h, y_h, test_size=0.2, random_state=42
)

scaler_h = StandardScaler()
X_h_train_s = scaler_h.fit_transform(X_h_train)
X_h_test_s = scaler_h.transform(X_h_test)

models = {
    # Unpenalized baseline; scikit-learn advises against Ridge(alpha=0) for numerical reasons
    'OLS': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.01),
    'ElasticNet': ElasticNet(alpha=0.01, l1_ratio=0.5),
}

print("California Housing - Model Comparison:")
print(f"{'Model':<15} {'Train R²':>10} {'Test R²':>10} {'Features':>10}")
print("-" * 50)

for name, model in models.items():
    model.fit(X_h_train_s, y_h_train)
    train_r2 = model.score(X_h_train_s, y_h_train)
    test_r2 = model.score(X_h_test_s, y_h_test)
    n_features = np.sum(np.abs(model.coef_) > 1e-4)
    print(f"{name:<15} {train_r2:>10.4f} {test_r2:>10.4f} {n_features:>10}")

When to Use Each

from sklearn.feature_selection import SelectFromModel

def choose_regularization(X, y, n_samples, n_features):
    """Decision logic for regularization choice."""

    # High-dimensional, sparse signal
    if n_features > n_samples:
        print("Recommendation: Lasso or ElasticNet (L1)")
        return 'lasso'

    # Many correlated features
    corr_matrix = np.abs(np.corrcoef(X.T))
    high_corr = (corr_matrix > 0.8).sum() - X.shape[1]
    if high_corr > X.shape[1] * 0.3:
        print("Recommendation: Ridge or ElasticNet (L2-heavy)")
        return 'ridge'

    # General case
    print("Recommendation: ElasticNet (balanced)")
    return 'elasticnet'

# Apply feature selection with Lasso
selector = SelectFromModel(
    Lasso(alpha=lasso_cv.alpha_),
    threshold='median'
)
X_train_selected = selector.fit_transform(X_train_scaled, y_train)
X_test_selected = selector.transform(X_test_scaled)

print(f"Features before selection: {X_train_scaled.shape[1]}")
print(f"Features after selection:  {X_train_selected.shape[1]}")
print(f"Selected features: {np.where(selector.get_support())[0]}")

Comparison Table

┌──────────────┬────────────────────┬────────────────────┬────────────────────┐
│ Property     │ Ridge (L2)         │ Lasso (L1)         │ ElasticNet         │
├──────────────┼────────────────────┼────────────────────┼────────────────────┤
│ Penalty      │ Σβⱼ²               │ Σ|βⱼ|              │ ρΣ|βⱼ|+(1-ρ)/2·Σβⱼ²│
│ Sparsity     │ No (shrinks only)  │ Yes (exactly 0)    │ Yes                │
│ Feature Sel. │ No                 │ Yes                │ Yes (grouped)      │
│ Multicollin. │ Handles well       │ Picks one          │ Groups correlated  │
│ Computation  │ Closed-form        │ No closed-form     │ No closed-form     │
│ Use Case     │ Many small effects │ Few large effects  │ Mixed effects      │
└──────────────┴────────────────────┴────────────────────┴────────────────────┘

Key Takeaways

Ridge (L2): Shrinks all coefficients — best when most features contribute
Lasso (L1): Zeroes out coefficients — best for feature selection
ElasticNet: Combines both; best when features are correlated or when it is unclear whether the signal is sparse
Always standardize features before applying regularization
Use cross-validation to select $\alpha$ and $\rho$ — avoid manual tuning
Coefficient paths reveal how regularization affects each feature

Summary: Regularization (Lasso, Ridge, ElasticNet)

Regularization adds a penalty to the loss function to prevent overfitting, trading increased bias for reduced variance.
The bias-variance decomposition $E[\text{Error}] = \text{Bias}^2 + \text{Variance} + \sigma^2$ explains why regularization works: with a well-chosen $\alpha$ , it reduces variance by more than it increases squared bias.
Ridge (L2) adds $\alpha \|\beta\|_2^2$ and shrinks all coefficients toward zero but never sets them exactly to zero. Best when many features contribute small effects.
Lasso (L1) adds $\alpha \|\beta\|_1$ and can set coefficients exactly to zero, performing automatic feature selection. Best when few features dominate.
ElasticNet combines L1 and L2: $\alpha [\rho \|\beta\|_1 + (1-\rho)/2 \|\beta\|_2^2]$ . Best for correlated feature groups, where it tends to keep or drop a whole group together.
The geometric intuition: L2 constraint is a circle (no corners, no sparsity); L1 constraint is a diamond (corners create sparsity).
The matrix $X^TX + \alpha I$ in the Ridge closed form $\hat{\beta} = (X^TX + \alpha I)^{-1}X^Ty$ is invertible for every $\alpha > 0$ , even when $X^TX$ is singular (multicollinearity).
Always standardize features before regularization — the penalty is scale-sensitive.
Use cross-validation (RidgeCV, LassoCV, ElasticNetCV) to select $\alpha$ and $\rho$ .
Coefficient paths visualize how each feature's coefficient changes with $\alpha$ — useful for understanding feature importance.

Practice Exercises

Feature Selection: Compare Ridge, Lasso, and ElasticNet on a dataset with 100 features but only 5 relevant. Which method recovers the true features?
Multicollinearity: Create 3 highly correlated features. How does each method handle them?
Alpha Sensitivity: Plot test R² vs log(alpha) for all three methods. Where do they diverge?
Real Dataset: Apply regularized regression to a gene expression dataset (n << p). Which method performs best and why?
