Regularization: Lasso, Ridge, ElasticNet

Why Regularization?

Without constraints, models minimize training error at the cost of generalization. Regularization adds a penalty term that discourages complex models.

Definition: Regularization

A technique that adds a penalty to the loss function to constrain the model's complexity, preventing overfitting by discouraging large coefficient values. Regularization introduces a bias-variance tradeoff: it slightly increases bias but significantly reduces variance, leading to better generalization on unseen data.
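As a minimal sketch of this definition, the snippet below fits the same high-degree polynomial features with and without an L2 penalty and compares the size of the learned coefficients. The synthetic sine data, the polynomial degree of 12, and `alpha=1.0` are illustrative assumptions, not values from the lesson.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Assumed toy data: a noisy sine curve sampled at 30 points
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(30, 1))
y = np.sin(3 * x[:, 0]) + rng.normal(scale=0.2, size=30)

def coef_norm(estimator):
    """Fit degree-12 polynomial features and return the coefficient norm."""
    model = make_pipeline(
        PolynomialFeatures(degree=12, include_bias=False),
        StandardScaler(),
        estimator,
    )
    model.fit(x, y)
    return np.linalg.norm(model[-1].coef_)

ols_norm = coef_norm(LinearRegression())    # no penalty
ridge_norm = coef_norm(Ridge(alpha=1.0))    # L2 penalty on the coefficients

print(f"Coefficient norm without penalty: {ols_norm:.2f}")
print(f"Coefficient norm with L2 penalty: {ridge_norm:.2f}")
```

The penalized fit ends up with a smaller coefficient norm, which is the "discouraging large coefficient values" part of the definition.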

Figure: Overfitting visualization. Loss is plotted against model complexity, with training and validation curves; regularization shifts the minimum toward simpler models.

The Bias-Variance Tradeoff

$$E[\text{Total Error}] = \text{Bias}^2 + \text{Variance} + \sigma^2_{\text{irreducible}}$$

Theorem: Bias-Variance Decomposition

For a model $\hat{f}$ trained on dataset $D$, the expected prediction error for any test point $x$ decomposes as $E_D[(y - \hat{f}_D(x))^2] = \text{Bias}^2(\hat{f}(x)) + \text{Var}(\hat{f}(x)) + \sigma^2$, where $\sigma^2$ is the irreducible noise variance. Regularization increases $\text{Bias}^2$ but reduces $\text{Variance}$, often decreasing the total error.

  • Bias: Error from wrong assumptions (underfitting)
  • Variance: Error from sensitivity to training data (overfitting)

With a well-chosen penalty strength, regularization increases bias slightly and reduces variance by more, so the total error falls.
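The direction of this tradeoff can be checked with a small simulation. The sketch below is an illustration under assumed settings (a fixed random design, true coefficients all equal to 2, `alpha = 10`, 500 noise draws): it estimates the squared bias and the variance of the prediction at one test point for an unpenalized fit and for an L2-penalized fit. It shows only the direction of each effect; whether the total error falls depends on the data and on the penalty strength.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma, alpha, reps = 30, 10, 1.0, 10.0, 500   # assumed settings

X = rng.normal(size=(n, p))      # fixed design matrix
beta = np.full(p, 2.0)           # assumed true coefficients
x0 = np.ones(p)                  # test point
f_x0 = x0 @ beta                 # noiseless target at the test point

# Both estimators are linear in y, so precompute the map y -> prediction at x0
ols_map = x0 @ np.linalg.solve(X.T @ X, X.T)
ridge_map = x0 @ np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T)

ols_preds, ridge_preds = [], []
for _ in range(reps):
    # Redraw only the noise; each draw plays the role of a new training set
    y = X @ beta + rng.normal(scale=sigma, size=n)
    ols_preds.append(ols_map @ y)
    ridge_preds.append(ridge_map @ y)

def bias2_and_variance(preds):
    preds = np.asarray(preds)
    return (preds.mean() - f_x0) ** 2, preds.var()

bias2_ols, var_ols = bias2_and_variance(ols_preds)
bias2_ridge, var_ridge = bias2_and_variance(ridge_preds)

print(f"Unpenalized: bias^2 = {bias2_ols:.4f}, variance = {var_ols:.4f}")
print(f"Penalized:   bias^2 = {bias2_ridge:.4f}, variance = {var_ridge:.4f}")
```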

Ridge Regression (L2 Regularization)

Adds the sum of squared coefficients to the loss function:

$$L_{\text{Ridge}} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{p} \beta_j^2 = \|y - X\beta\|_2^2 + \alpha\|\beta\|_2^2$$

Geometric Interpretation

Figure: L2 constraint region (a circle) in the ($\beta_1$, $\beta_2$) plane, shown with the elliptical contours of the OLS loss.

The optimal $\hat{\beta}$ is where the loss ellipse first touches the L2 circle.
Result: coefficients are shrunk but never exactly zero.

Mathematical Derivation

Definition: Ridge Regression

Also known as Tikhonov regularization, Ridge regression adds an L2 penalty $\alpha \|\beta\|_2^2$ to the ordinary least squares loss. The closed-form solution is $\hat{\beta}_{\text{ridge}} = (X^TX + \alpha I)^{-1}X^Ty$. The penalty shrinks all coefficients toward zero but never sets them exactly to zero.

The closed-form solution:

Ridge Closed-Form Solution

$$\hat{\beta}_{\text{ridge}} = (X^TX + \alpha I)^{-1}X^Ty$$

Here,

  • $I$ = $p \times p$ identity matrix
  • $\alpha$ = Regularization strength ($\geq 0$)
  • $X$ = Feature matrix ($n \times p$)

As $\alpha \to 0$: $\hat{\beta}_{\text{ridge}} \to \hat{\beta}_{\text{OLS}}$

As $\alpha \to \infty$: $\hat{\beta}_{\text{ridge}} \to 0$
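The closed form and its two limits can be checked numerically. The sketch below is a minimal illustration on assumed synthetic data: `ridge_closed_form` is a hypothetical helper name, and the comparison uses `Ridge(fit_intercept=False)` so that scikit-learn minimizes the same objective as the formula.

```python
import numpy as np
from sklearn.linear_model import Ridge

def ridge_closed_form(X, y, alpha):
    """Solve (X^T X + alpha I) beta = X^T y without forming an explicit inverse."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

# Assumed toy data with known coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.5, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=50)

beta_manual = ridge_closed_form(X, y, alpha=1.0)
beta_sklearn = Ridge(alpha=1.0, fit_intercept=False).fit(X, y).coef_

# Limits: a tiny alpha approaches OLS, a huge alpha approaches zero
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_tiny = ridge_closed_form(X, y, alpha=1e-10)
beta_huge = ridge_closed_form(X, y, alpha=1e10)

print("Closed form:  ", np.round(beta_manual, 4))
print("scikit-learn: ", np.round(beta_sklearn, 4))
print("alpha -> 0:   ", np.round(beta_tiny, 4), "(OLS:", np.round(beta_ols, 4), ")")
print("alpha -> inf: ", np.round(beta_huge, 4))
```

Using `np.linalg.solve` rather than computing the inverse is a common numerical choice; the result is the same estimator.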

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error

# Generate data with multicollinearity
np.random.seed(42)
X, y, true_coefs = make_regression(
    n_samples=200, n_features=20, n_informative=5,
    noise=10, coef=True, random_state=42
)

# Add correlated features
X[:, 5] = X[:, 0] + np.random.randn(200) * 0.1
X[:, 6] = X[:, 1] + np.random.randn(200) * 0.1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Ridge regression with cross-validation
alphas = np.logspace(-4, 4, 100)
ridge_cv = RidgeCV(alphas=alphas, scoring='neg_mean_squared_error')
ridge_cv.fit(X_train_scaled, y_train)

print(f"Best alpha: {ridge_cv.alpha_:.4f}")
print(f"Test MSE:   {mean_squared_error(y_test, ridge_cv.predict(X_test_scaled)):.4f}")

Ridge regression is particularly effective when many features contribute small effects (the "many weak predictors" scenario). It shrinks all coefficients toward zero, keeping every feature in the model while reducing the influence of noisy ones. It handles multicollinearity well because, for any $\alpha > 0$, the $\alpha I$ term makes $X^TX + \alpha I$ invertible even when $X^TX$ is singular.
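The invertibility argument can be seen on a deliberately degenerate example. In the sketch below (assumed toy data, with the fourth column an exact copy of the first), $X^TX$ loses rank, while adding $\alpha I$ restores full rank and the ridge system has a unique solution.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
X = np.column_stack([X, X[:, 0]])   # 4th column duplicates the 1st (perfect collinearity)
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=40)

gram = X.T @ X
alpha = 1.0

rank_gram = np.linalg.matrix_rank(gram)                       # rank-deficient
rank_reg = np.linalg.matrix_rank(gram + alpha * np.eye(4))    # full rank

# The regularized system can be solved even though gram alone is singular
beta = np.linalg.solve(gram + alpha * np.eye(4), X.T @ y)

print(f"rank(X^T X)           = {rank_gram} of 4")
print(f"rank(X^T X + alpha I) = {rank_reg} of 4")
print("Ridge coefficients:", np.round(beta, 4))
```

In this example the two duplicated columns receive equal coefficients, since the ridge objective treats them symmetrically and has a unique minimizer.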

Lasso Regression (L1 Regularization)

Adds the sum of absolute coefficient values:

$$L_{\text{Lasso}} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \alpha \sum_{j=1}^{p} |\beta_j| = \|y - X\beta\|_2^2 + \alpha\|\beta\|_1$$

Geometric Interpretation

Figure: L1 constraint region (a diamond with its corners on the $\beta_1$ and $\beta_2$ axes), shown with the elliptical contours of the loss.

The optimal $\hat{\beta}$ often lands on a corner of the diamond, where one or more coefficients are exactly zero.
Result: automatic feature selection (sparse model).

Why Lasso Produces Sparsity

Definition: Lasso Regression

Least Absolute Shrinkage and Selection Operator (Lasso) regression adds an L1 penalty $\alpha \|\beta\|_1$ to the OLS loss. Unlike Ridge, Lasso can set coefficients exactly to zero, performing automatic feature selection. The L1 penalty's diamond-shaped constraint region has corners where the loss ellipse is most likely to intersect, producing sparse solutions.

The L1 penalty creates corners in the constraint region where the loss ellipse is most likely to touch, setting coefficients to exactly zero.

Theorem: Lasso Sparsity

The Lasso estimator is sparse: for sufficiently large $\alpha$, some coefficients $\hat{\beta}_j$ are exactly zero. The number of zero coefficients generally increases with $\alpha$. This is due to the non-differentiability of the $|\beta_j|$ term at zero, which creates corners in the constraint region.
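One way to see the effect of the kink at zero is the special case of an orthonormal design ($X^TX = I$), which is an assumption made here for illustration. Writing $z = X^Ty$, the Lasso loss above separates per coefficient and is minimized by soft-thresholding $z_j$ at $\alpha/2$, while the Ridge loss is minimized by the rescaling $z_j / (1 + \alpha)$. The helper names below are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Lasso solution per coordinate (orthonormal design): shift toward 0, clip at 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ridge_shrink(z, alpha):
    """Ridge solution per coordinate (orthonormal design): pure rescaling."""
    return z / (1.0 + alpha)

z = np.array([3.0, -0.5, 1.2, -2.0])   # assumed values of X^T y
alpha = 2.0

lasso_coefs = soft_threshold(z, alpha / 2.0)   # threshold is alpha / 2 for this loss
ridge_coefs = ridge_shrink(z, alpha)

print("z:    ", z)
print("Lasso:", lasso_coefs)
print("Ridge:", np.round(ridge_coefs, 4))
```

Any $z_j$ with $|z_j| \le \alpha/2$ is mapped to exactly zero by the Lasso rule, whereas the Ridge rule only scales values and leaves every nonzero $z_j$ nonzero.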

from sklearn.linear_model import Lasso, LassoCV

# Lasso with cross-validation
lasso_cv = LassoCV(alphas=alphas, cv=5, random_state=42, max_iter=10000)
lasso_cv.fit(X_train_scaled, y_train)

print(f"Best alpha: {lasso_cv.alpha_:.4f}")
print(f"Test MSE:   {mean_squared_error(y_test, lasso_cv.predict(X_test_scaled)):.4f}")

# Feature selection
n_zero = np.sum(lasso_cv.coef_ == 0)
print(f"\nFeatures eliminated: {n_zero}/{X.shape[1]}")
print("\nLasso Coefficients:")
for i, coef in enumerate(lasso_cv.coef_):
    if abs(coef) > 0.01:
        print(f"  Feature {i:2d}: {coef:+.4f}")
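To see how sparsity changes with the penalty strength, the following self-contained sketch refits `Lasso` at a few values of `alpha` and counts the nonzero coefficients. The data settings and the chosen fractions are assumptions for illustration. Note that scikit-learn's `Lasso` scales the squared-error term by $1/(2n)$, so its `alpha` is on a different scale from the $\alpha$ in the loss written earlier; under that scaling, all coefficients are zero once `alpha` reaches $\max_j |x_j^T (y - \bar{y})| / n$.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

# Smallest alpha (scikit-learn scaling) at which every coefficient is zero
alpha_max = np.max(np.abs(X_demo.T @ (y_demo - y_demo.mean()))) / len(y_demo)

nonzero_counts = {}
for frac in [1.01, 0.5, 0.1, 0.001]:
    model = Lasso(alpha=frac * alpha_max, max_iter=10000).fit(X_demo, y_demo)
    nonzero_counts[frac] = int(np.count_nonzero(model.coef_))

for frac, count in nonzero_counts.items():
    print(f"alpha = {frac:>5} * alpha_max -> {count:2d} nonzero coefficients")
```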

ElasticNet: Best of Both Worlds

Combines L1 and L2 penalties:

$$L_{\text{ElasticNet}} = \|y - X\beta\|_2^2 + \alpha \left[ \rho \|\beta\|_1 + \frac{(1-\rho)}{2} \|\beta\|_2^2 \right]$$

Where:

  • $\alpha$ controls overall regularization strength
  • $\rho$ (l1_ratio) controls the mix between L1 and L2

ElasticNet Spectrum:

  ρ = 1.0 (Pure Lasso)  <->  ρ = 0.5 (Balanced)  <->  ρ = 0.0 (Pure Ridge)
  Lasso side: feature selection + grouped effects
  Ridge side: shrinks all features

ElasticNet is preferred over Lasso when you have groups of correlated features. Lasso tends to arbitrarily select one feature from a correlated group and zero out the rest, while ElasticNet tends to keep or drop the entire group. Set $\rho = 0.5$ as a balanced default, and increase toward 1.0 if you want more feature selection.
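How $\rho$ trades the two penalty terms against each other can be checked by evaluating the bracketed penalty from the ElasticNet loss directly. The function name and the example vector below are hypothetical, chosen only for illustration.

```python
import numpy as np

def elastic_net_penalty(beta, alpha, rho):
    """alpha * [rho * ||beta||_1 + (1 - rho) / 2 * ||beta||_2^2]"""
    beta = np.asarray(beta, dtype=float)
    l1 = np.sum(np.abs(beta))
    l2_squared = np.sum(beta ** 2)
    return alpha * (rho * l1 + 0.5 * (1.0 - rho) * l2_squared)

beta_example = [3.0, -4.0]   # ||beta||_1 = 7, ||beta||_2^2 = 25

pure_l1 = elastic_net_penalty(beta_example, alpha=1.0, rho=1.0)   # L1 term only
balanced = elastic_net_penalty(beta_example, alpha=1.0, rho=0.5)  # half of each term
pure_l2 = elastic_net_penalty(beta_example, alpha=1.0, rho=0.0)   # half the squared L2 norm

print(f"rho = 1.0: {pure_l1}")
print(f"rho = 0.5: {balanced}")
print(f"rho = 0.0: {pure_l2}")
```

At $\rho = 0$ the penalty is $\frac{\alpha}{2}\|\beta\|_2^2$, half the Ridge penalty written earlier, so the same numeric `alpha` does not mean the same strength across the two models.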

from sklearn.linear_model import ElasticNet, ElasticNetCV

# ElasticNet with cross-validation
l1_ratios = [0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 1.0]

enet_cv = ElasticNetCV(
    l1_ratio=l1_ratios,
    alphas=alphas,
    cv=5,
    random_state=42,
    max_iter=10000
)
enet_cv.fit(X_train_scaled, y_train)

print(f"Best alpha:     {enet_cv.alpha_:.4f}")
print(f"Best l1_ratio:  {enet_cv.l1_ratio_:.4f}")
print(f"Test MSE:       {mean_squared_error(y_test, enet_cv.predict(X_test_scaled)):.4f}")

n_zero = np.sum(enet_cv.coef_ == 0)
print(f"\nFeatures eliminated: {n_zero}/{X.shape[1]}")
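The grouping behaviour described for correlated features can be illustrated with an extreme case. The sketch below uses assumed toy data in which two columns are exact copies of each other; `alpha=0.1`, `l1_ratio=0.5`, and the tight `tol` are illustrative choices. With exact duplicates the Lasso solution is not unique, so which column it keeps depends on the solver, whereas the L2 part of ElasticNet makes the solution unique and symmetric in the two copies.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)
unrelated = rng.normal(size=n)

# Columns 0 and 1 are identical; column 2 is unrelated to the target
X_dup = np.column_stack([signal, signal, unrelated])
y_dup = 3.0 * signal + rng.normal(scale=0.5, size=n)

# Tight tolerance so coordinate descent runs close to the exact minimizer
lasso_dup = Lasso(alpha=0.1, tol=1e-12, max_iter=100_000).fit(X_dup, y_dup)
enet_dup = ElasticNet(alpha=0.1, l1_ratio=0.5, tol=1e-12, max_iter=100_000).fit(X_dup, y_dup)

print("Lasso coefficients:     ", np.round(lasso_dup.coef_, 4))
print("ElasticNet coefficients:", np.round(enet_dup.coef_, 4))
```

In the ElasticNet fit the two duplicated columns share the weight equally, which is the grouped effect; how the Lasso fit splits the weight is left to the printed output rather than asserted here.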

Coefficient Path Visualization

from sklearn.linear_model import lasso_path

# Compute coefficient paths
alphas_path = np.logspace(-2, 4, 100)

# Ridge path
ridge_coefs = []
for a in alphas_path:
    ridge = Ridge(alpha=a)
    ridge.fit(X_train_scaled, y_train)
    ridge_coefs.append(ridge.coef_)
ridge_coefs = np.array(ridge_coefs)

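# Lasso path (lasso_path fits no intercept, so center y first;
# X_train_scaled is already centered by StandardScaler)
lasso_alphas, lasso_coefs, _ = lasso_path(
    X_train_scaled, y_train - y_train.mean(), alphas=alphas_path
)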

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Ridge paths
for i in range(X.shape[1]):
    axes[0].plot(np.log10(alphas_path), ridge_coefs[:, i], linewidth=0.8)
axes[0].axvline(x=np.log10(ridge_cv.alpha_), color='k', linestyle='--', label='CV alpha')
axes[0].set_xlabel('log10(alpha)')
axes[0].set_ylabel('Coefficient Value')
axes[0].set_title('Ridge Coefficient Paths')
axes[0].legend()

# Lasso paths
# lasso_coefs has shape (n_features, n_alphas), so index features by row
for i in range(X.shape[1]):
    axes[1].plot(np.log10(lasso_alphas), lasso_coefs[i], linewidth=0.8)
axes[1].axvline(x=np.log10(lasso_cv.alpha_), color='k', linestyle='--', label='CV alpha')
axes[1].set_xlabel('log10(alpha)')
axes[1].set_ylabel('Coefficient Value')
axes[1].set_title('Lasso Coefficient Paths')
axes[1].legend()

plt.tight_layout()
plt.savefig('regularization_paths.png', dpi=150)
plt.show()

Comparing Ridge, Lasso, and ElasticNet

from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
X_h, y_h = housing.data, housing.target
feature_names = housing.feature_names

X_h_train, X_h_test, y_h_train, y_h_test = train_test_split(
    X_h, y_h, test_size=0.2, random_state=42
)

scaler_h = StandardScaler()
X_h_train_s = scaler_h.fit_transform(X_h_train)
X_h_test_s = scaler_h.transform(X_h_test)

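from sklearn.linear_model import LinearRegression

models = {
    # Plain least squares; scikit-learn advises LinearRegression over Ridge(alpha=0)
    'OLS': LinearRegression(),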
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.01),
    'ElasticNet': ElasticNet(alpha=0.01, l1_ratio=0.5),
}

print("California Housing - Model Comparison:")
print(f"{'Model':<15} {'Train R²':>10} {'Test R²':>10} {'Features':>10}")
print("-" * 50)

for name, model in models.items():
    model.fit(X_h_train_s, y_h_train)
    train_r2 = model.score(X_h_train_s, y_h_train)
    test_r2 = model.score(X_h_test_s, y_h_test)
    n_features = np.sum(np.abs(model.coef_) > 1e-4)
    print(f"{name:<15} {train_r2:>10.4f} {test_r2:>10.4f} {n_features:>10}")

When to Use Each

from sklearn.feature_selection import SelectFromModel

def choose_regularization(X, y, n_samples, n_features):
    """Decision logic for regularization choice."""

    # High-dimensional, sparse signal
    if n_features > n_samples:
        print("Recommendation: Lasso or ElasticNet (L1)")
        return 'lasso'

    # Many correlated features
    corr_matrix = np.abs(np.corrcoef(X.T))
    high_corr = (corr_matrix > 0.8).sum() - X.shape[1]
    if high_corr > X.shape[1] * 0.3:
        print("Recommendation: Ridge or ElasticNet (L2-heavy)")
        return 'ridge'

    # General case
    print("Recommendation: ElasticNet (balanced)")
    return 'elasticnet'

# Apply feature selection with Lasso
selector = SelectFromModel(
    Lasso(alpha=lasso_cv.alpha_),
    threshold='median'
)
X_train_selected = selector.fit_transform(X_train_scaled, y_train)
X_test_selected = selector.transform(X_test_scaled)

print(f"Features before selection: {X_train_scaled.shape[1]}")
print(f"Features after selection:  {X_train_selected.shape[1]}")
print(f"Selected features: {np.where(selector.get_support())[0]}")

Comparison Table

┌──────────────┬───────────────────────┬────────────────────┬──────────────────────────────┐
│ Property     │ Ridge (L2)            │ Lasso (L1)         │ ElasticNet                   │
├──────────────┼───────────────────────┼────────────────────┼──────────────────────────────┤
│ Penalty      │ Σβⱼ²                  │ Σ|βⱼ|              │ ρ·Σ|βⱼ| + (1-ρ)/2·Σβⱼ²       │
│ Sparsity     │ No (shrinks toward 0) │ Yes (exactly 0)    │ Yes                          │
│ Feature Sel. │ No                    │ Yes                │ Yes (grouped)                │
│ Multicollin. │ Handles well          │ Picks one          │ Groups correlated            │
│ Computation  │ Closed-form           │ No closed-form     │ No closed-form               │
│ Use Case     │ Many small effects    │ Few large effects  │ Mixed effects                │
└──────────────┴───────────────────────┴────────────────────┴──────────────────────────────┘

Key Takeaways

  1. Ridge (L2): Shrinks all coefficients; best when most features contribute
  2. Lasso (L1): Zeroes out coefficients; best for feature selection
  3. ElasticNet: Combines both; best when features are correlated or the structure is unknown
  4. Always standardize features before applying regularization
  5. Use cross-validation to select $\alpha$ and $\rho$ rather than tuning them by hand
  6. Coefficient paths reveal how regularization affects each feature
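The standardization advice in takeaway 4 follows from the penalty acting on raw coefficient values, so a change of units changes the fit. The sketch below is an illustration on assumed toy data: one feature is multiplied by 1000 (as in a unit change), and in-sample predictions are compared before and after for an unpenalized model and for Ridge with an illustrative `alpha=10.0`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X_units = rng.normal(size=(100, 3))
y_units = X_units @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=100)

# Same data, but the first feature is expressed in units 1000 times smaller
X_rescaled = X_units.copy()
X_rescaled[:, 0] *= 1000.0

def fitted_predictions(estimator, features):
    """Fit on the given features and return in-sample predictions."""
    return estimator.fit(features, y_units).predict(features)

ols_unchanged = np.allclose(
    fitted_predictions(LinearRegression(), X_units),
    fitted_predictions(LinearRegression(), X_rescaled),
)
ridge_unchanged = np.allclose(
    fitted_predictions(Ridge(alpha=10.0), X_units),
    fitted_predictions(Ridge(alpha=10.0), X_rescaled),
)

print(f"Unpenalized predictions unchanged by rescaling: {ols_unchanged}")
print(f"Ridge predictions unchanged by rescaling:       {ridge_unchanged}")
```

Standardizing first puts every feature on a common scale, so the penalty treats them comparably.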

Summary: Regularization (Lasso, Ridge, ElasticNet)

  1. Regularization adds a penalty to the loss function to prevent overfitting, trading increased bias for reduced variance.
  2. The bias-variance decomposition $E[\text{Error}] = \text{Bias}^2 + \text{Variance} + \sigma^2$ explains why regularization works: with a well-chosen penalty, it reduces variance more than it increases bias.
  3. Ridge (L2) adds $\alpha \|\beta\|_2^2$ and shrinks all coefficients toward zero but never sets them exactly to zero. Best when many features contribute small effects.
  4. Lasso (L1) adds $\alpha \|\beta\|_1$ and can set coefficients exactly to zero, performing automatic feature selection. Best when few features dominate.
  5. ElasticNet combines L1 and L2: $\alpha [\rho \|\beta\|_1 + \frac{(1-\rho)}{2} \|\beta\|_2^2]$. Best for correlated feature groups, where it tends to select or drop entire groups.
  6. The geometric intuition: L2 constraint is a circle (no corners, no sparsity); L1 constraint is a diamond (corners create sparsity).
  7. The Ridge closed form $\hat{\beta} = (X^TX + \alpha I)^{-1}X^Ty$ is well defined for any $\alpha > 0$: the matrix $X^TX + \alpha I$ is invertible even when $X^TX$ is singular (multicollinearity).
  8. Always standardize features before regularization, because the penalty is scale-sensitive.
  9. Use cross-validation (RidgeCV, LassoCV, ElasticNetCV) to select $\alpha$ and $\rho$.
  10. Coefficient paths visualize how each feature's coefficient changes with $\alpha$, which is useful for understanding feature importance.

Practice Exercises

  1. Feature Selection: Compare Ridge, Lasso, and ElasticNet on a dataset with 100 features but only 5 relevant. Which method recovers the true features?
  2. Multicollinearity: Create 3 highly correlated features. How does each method handle them?
  3. Alpha Sensitivity: Plot test R² vs log(alpha) for all three methods. Where do they diverge?
  4. Real Dataset: Apply regularized regression to a gene expression dataset (n << p). Which method performs best and why?
