
Regularization: Ridge, Lasso and ElasticNet

Module 7: Machine Learning Fundamentals


The Problem of Overfitting

When we fit a model to training data, we want it to generalize well to unseen data. A model that memorizes training data but fails on test data is said to overfit.

  • Underfit: high bias, low variance; the model is too simple (under-parameterized)
  • Good fit: balanced bias and variance; generalizes well (right complexity)
  • Overfit: low bias, high variance; memorizes noise (over-parameterized)

Bias-Variance Tradeoff

The fundamental tension in supervised learning:

| Source | Description | Effect on Test Error |
|---|---|---|
| Bias | Error from wrong assumptions | High → underfitting |
| Variance | Error from sensitivity to training data | High → overfitting |
| Irreducible | Noise in the data | Cannot reduce |

$$\text{Expected Error} = \text{Bias}^2 + \text{Variance} + \sigma^2_{\text{noise}}$$

As model complexity increases:

  • Bias decreases (model captures more patterns)
  • Variance increases (model becomes more sensitive)
  • Total error follows a U-shaped curve
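The effect of complexity on training and test error can be sketched with polynomial fits of increasing degree. The sine target, noise level, sample sizes and degrees below are illustrative assumptions, not values from this lesson.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    # Hypothetical ground-truth function used only for this illustration.
    return np.sin(2 * np.pi * x)

# Small noisy training set and a larger test set from the same distribution.
x_tr = rng.uniform(0, 1, 25)
y_tr = true_fn(x_tr) + rng.normal(scale=0.3, size=x_tr.size)
x_te = rng.uniform(0, 1, 500)
y_te = true_fn(x_te) + rng.normal(scale=0.3, size=x_te.size)

degrees = [1, 3, 5, 9]
train_mse, test_mse = [], []
for d in degrees:
    # Least-squares polynomial fit; a higher degree means a more complex model.
    coefs = np.polyfit(x_tr, y_tr, deg=d)
    train_mse.append(np.mean((np.polyval(coefs, x_tr) - y_tr) ** 2))
    test_mse.append(np.mean((np.polyval(coefs, x_te) - y_te) ** 2))

for d, tr, te in zip(degrees, train_mse, test_mse):
    print(f"degree={d:2d}  train MSE={tr:.3f}  test MSE={te:.3f}")
```

Training error can only go down as the degree grows, because each model contains the previous one. Test error is what reveals whether the extra flexibility is fitting signal or noise.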

Why Regularization?

Regularization addresses overfitting by penalizing model complexity. Instead of minimizing only the loss function, we minimize:

$$\text{Regularized Objective} = \underbrace{\text{Loss}(\mathbf{w})}_{\text{fit to data}} + \underbrace{\lambda \cdot \Omega(\mathbf{w})}_{\text{penalty on complexity}}$$

where:

  • $\lambda \geq 0$ is the regularization strength (hyperparameter)
  • $\Omega(\mathbf{w})$ is the regularization term (penalty function)
  • $\mathbf{w}$ are the model parameters

Key insight: Regularization trades training performance for generalization.


Ridge Regression (L2 Regularization)

Formulation

Ridge regression adds an L2 penalty (squared magnitude of coefficients):

$$\mathcal{L}_{\text{Ridge}} = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2_2 + \lambda \|\mathbf{w}\|^2_2$$

Expanding both terms:

$$\mathcal{L}_{\text{Ridge}} = \sum_{i=1}^{n}(y_i - \mathbf{x}_i^T\mathbf{w})^2 + \lambda \sum_{j=1}^{p} w_j^2$$

Closed-Form Solution

Taking the derivative with respect to $\mathbf{w}$ and setting it to zero:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = -2\mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w}) + 2\lambda\mathbf{w} = 0$$

Rearranging gives $(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})\mathbf{w} = \mathbf{X}^T\mathbf{y}$, so:

$$\boxed{\hat{\mathbf{w}}_{\text{Ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}}$$

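Key properties:

  • Always has a unique solution ($\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}$ is always invertible for $\lambda > 0$)
  • Coefficients are shrunk toward zero but, in general, never exactly zero
  • Equivalent to the OLS solution with $\mathbf{X}^T\mathbf{X}$ replaced by $\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}$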
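As a minimal sketch of the closed-form solution, the snippet below solves the Ridge normal equations with NumPy and checks that the gradient vanishes at the result. The helper name `ridge_closed_form`, the synthetic data and the λ grid are assumptions made for illustration, and no intercept is fitted.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for w (no intercept term)."""
    p = X.shape[1]
    # Solving the linear system is more stable than forming the inverse explicitly.
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = np.array([2.0, -1.0, 0.0, 0.5, 3.0])
y = X @ w_true + rng.normal(scale=0.5, size=100)

lambdas = [0.0, 1.0, 10.0, 100.0]
norms, max_grads = [], []
for lam in lambdas:
    w = ridge_closed_form(X, y, lam)
    # Gradient of the Ridge loss; it should be (numerically) zero at the solution.
    grad = -2 * X.T @ (y - X @ w) + 2 * lam * w
    norms.append(np.linalg.norm(w))
    max_grads.append(np.abs(grad).max())
    print(f"lambda={lam:6.1f}  ||w||={norms[-1]:.3f}  max|grad|={max_grads[-1]:.1e}")
```

The coefficient norm shrinks as λ grows, while every individual coefficient generally stays non-zero.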

Geometric Interpretation

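[Figure: L1 vs L2 constraint regions in the (w₁, w₂) plane, showing the OLS solution, the circular L2 (Ridge) region, the L1 (Lasso) region with corners on the axes, and the resulting Ridge and Lasso solutions.]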

The Ridge solution is where the elliptical contours of the OLS loss meet the circular L2 constraint. Since circles are smooth, the intersection rarely occurs exactly on an axis.


Lasso Regression (L1 Regularization)

Formulation

Lasso (Least Absolute Shrinkage and Selection Operator) adds an L1 penalty (absolute magnitude):

$$\mathcal{L}_{\text{Lasso}} = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2_2 + \lambda \|\mathbf{w}\|_1$$

Expanding:

$$\mathcal{L}_{\text{Lasso}} = \sum_{i=1}^{n}(y_i - \mathbf{x}_i^T\mathbf{w})^2 + \lambda \sum_{j=1}^{p} |w_j|$$

No Closed-Form Solution

Unlike Ridge, Lasso has no closed-form solution in general, because the absolute value is not differentiable at zero. It is usually solved iteratively with:

  • Coordinate descent
  • Proximal gradient methods
  • Subgradient methods

Sparsity Property

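The L1 penalty induces sparsity: it drives some coefficients to exactly zero. In coordinate descent, each coefficient is updated with the others held fixed:

$$\hat{w}_j^{\text{Lasso}} = \begin{cases} 0 & \text{if } |S_j| \leq \lambda/2 \\ \dfrac{S_j - \operatorname{sign}(S_j)\,\lambda/2}{\mathbf{x}_j^T\mathbf{x}_j} & \text{otherwise} \end{cases}$$

where $S_j = \mathbf{x}_j^T(\mathbf{y} - \mathbf{X}_{-j}\mathbf{w}_{-j})$ is the inner product of feature $j$ with the partial residual that leaves feature $j$ out. The update is the soft-thresholding operator applied to $S_j$ with threshold $\lambda/2$.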
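The coordinate-wise update can be sketched as a small cyclic coordinate descent solver for the objective $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda\|\mathbf{w}\|_1$. The function names, the fixed number of sweeps and the synthetic data are assumptions for illustration; production solvers add convergence checks and intercept handling.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t and clip to zero inside [-t, t]."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimize ||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)          # x_j^T x_j for each feature
    residual = y - X @ w
    for _ in range(n_sweeps):
        for j in range(p):
            # S_j: feature j against the residual with feature j's own contribution added back.
            s_j = X[:, j] @ residual + col_sq[j] * w[j]
            w_new = soft_threshold(s_j, lam / 2) / col_sq[j]
            # Keep the residual consistent with the updated coefficient.
            residual += X[:, j] * (w[j] - w_new)
            w[j] = w_new
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
w_true = np.array([3.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0])
y = X @ w_true + rng.normal(scale=0.5, size=100)

lam = 20.0
w_hat = lasso_coordinate_descent(X, y, lam)
print("coefficients:", np.round(w_hat, 3))
print("exact zeros:", int(np.sum(w_hat == 0.0)))

# Above this value every |S_j| is below the threshold at w = 0, so nothing is selected.
lam_max = 2.0 * np.abs(X.T @ y).max()
w_all_zero = lasso_coordinate_descent(X, y, 1.01 * lam_max)
```

Because the threshold clips small values of $S_j$ to zero exactly, irrelevant features drop out of the model instead of merely becoming small.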

Why does L1 produce sparsity?

The L1 constraint region has corners on the axes. The elliptical contours of the OLS loss are more likely to first touch the region at one of these corners, yielding solutions where some $w_j = 0$.


ElasticNet: Best of Both Worlds

Formulation

ElasticNet combines L1 and L2 penalties:

$$\mathcal{L}_{\text{ElasticNet}} = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2_2 + \lambda_1 \|\mathbf{w}\|_1 + \lambda_2 \|\mathbf{w}\|_2^2$$

Using the mixing parameter $\alpha \in [0, 1]$:

$$\mathcal{L}_{\text{ElasticNet}} = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2_2 + \lambda \left[\alpha \|\mathbf{w}\|_1 + \frac{(1-\alpha)}{2} \|\mathbf{w}\|_2^2\right]$$

| $\alpha$ Value | Behavior |
|---|---|
| $\alpha = 0$ | Pure Ridge (L2 only) |
| $\alpha = 1$ | Pure Lasso (L1 only) |
| $0 < \alpha < 1$ | ElasticNet |

Advantages over Lasso

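  1. Grouped selection: When features are highly correlated, Lasso tends to select one of them somewhat arbitrarily; ElasticNet tends to select the whole group
  2. Strict convexity: For $\alpha < 1$ the L2 term makes the objective strictly convex, so the solution is unique. The penalty is still non-differentiable at zero because of the L1 term, which is what keeps the solution sparse
  3. More stable: Less sensitive to small changes in the data
  4. $p > n$: Lasso can select at most $n$ features when the number of features exceeds the number of samples; ElasticNet does not have this limit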
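Grouped selection can be illustrated with an extreme case: two identical copies of the same informative feature plus one noise feature. The data, the `alpha` values and the variable names are assumptions chosen for illustration; note that in scikit-learn `alpha` is the overall strength and `l1_ratio` is the mixing parameter.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
x = (x - x.mean()) / x.std()
noise_feature = rng.normal(size=n)

# Columns 0 and 1 are exact duplicates; column 2 is unrelated to the target.
X_dup = np.column_stack([x, x, noise_feature])
y_dup = 3.0 * x + rng.normal(scale=0.5, size=n)

# Tight tolerance so both solvers run close to their optimum.
lasso = Lasso(alpha=0.5, tol=1e-10, max_iter=100_000).fit(X_dup, y_dup)
enet = ElasticNet(alpha=0.5, l1_ratio=0.5, tol=1e-10, max_iter=100_000).fit(X_dup, y_dup)

print("Lasso coefficients:     ", np.round(lasso.coef_, 3))
print("ElasticNet coefficients:", np.round(enet.coef_, 3))
```

With the L2 term present, the two duplicated columns receive equal weight. Lasso has no such preference, so how it divides the weight between duplicates depends on the solver, and it typically concentrates it on one copy.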

Coefficient Shrinkage Visualization

[Figure: Coefficient shrinkage. Absolute coefficient value versus regularization strength (λ) for Ridge, Lasso and ElasticNet. Key observations: Ridge shows gradual shrinkage, Lasso gives a sparse solution, and ElasticNet is a compromise.]
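The three shrinkage behaviours have simple closed forms in the special case of an orthonormal design ($\mathbf{X}^T\mathbf{X} = \mathbf{I}$), where each coefficient can be treated separately. The sketch below applies those formulas, derived from the objectives in this lesson, to a single OLS coefficient; the orthonormality assumption, the function names and the λ grid are illustrative.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ridge_shrink(w_ols, lam):
    # Minimizes (w - w_ols)^2 + lam * w^2.
    return w_ols / (1.0 + lam)

def lasso_shrink(w_ols, lam):
    # Minimizes (w - w_ols)^2 + lam * |w|.
    return soft_threshold(w_ols, lam / 2.0)

def elastic_net_shrink(w_ols, lam, alpha=0.5):
    # Minimizes (w - w_ols)^2 + lam * (alpha * |w| + (1 - alpha) / 2 * w^2).
    return soft_threshold(w_ols, lam * alpha / 2.0) / (1.0 + lam * (1.0 - alpha) / 2.0)

w_ols = 1.0
for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
    print(
        f"lambda={lam:6.1f}  ridge={ridge_shrink(w_ols, lam):.4f}  "
        f"lasso={lasso_shrink(w_ols, lam):.4f}  "
        f"elastic_net={elastic_net_shrink(w_ols, lam):.4f}"
    )
```

Ridge rescales the coefficient and never reaches zero, Lasso subtracts a constant and hits zero once λ/2 exceeds the OLS value, and ElasticNet does a smaller subtraction followed by a rescaling.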

Regularization Path

The regularization path shows how coefficients change as $\lambda$ varies:

[Figure: Lasso regularization path. Coefficient value (wⱼ) versus log(λ), with λ_min and λ_1se marked and features labelled as selected or dropped. As λ increases, coefficients go to 0.]

Key Observations

  1. Left side ($\lambda$ small): All features included, model close to OLS
  2. Right side ($\lambda$ large): Most coefficients zero, simple model
  3. Sparsity: Lasso drives coefficients to exactly zero sequentially
  4. $\lambda_{\min}$: Value that minimizes cross-validation error
  5. $\lambda_{1\text{se}}$: Largest $\lambda$ within 1 SE of $\lambda_{\min}$ (sparser model)

Choosing Lambda: Cross-Validation

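We select $\lambda$ using k-fold cross-validation:

$$\text{CV}(\lambda) = \frac{1}{k}\sum_{i=1}^{k} \text{Loss}\left(\hat{\mathbf{w}}^{(-i)}_\lambda, \mathcal{D}_i\right)$$

where $\hat{\mathbf{w}}^{(-i)}_\lambda$ is the model fitted with strength $\lambda$ on all folds except fold $i$, and $\mathcal{D}_i$ is the held-out fold.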

Common Selection Strategies

| Strategy | Description | When to Use |
|---|---|---|
| $\lambda_{\min}$ | Minimizes CV error | Maximum predictive power |
| $\lambda_{1\text{se}}$ | Largest $\lambda$ within 1 SE of $\lambda_{\min}$ | Simpler, more interpretable model |

[Figure: Cross-validation curve. Mean squared error versus log(λ) with ±1 SE bands; λ_min marks the minimum error and λ_1se the 1 SE rule.]
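Both selection strategies can be sketched with a hand-rolled k-fold loop. The example uses the Ridge closed-form solution as the model being tuned; the helper names, the synthetic data and the λ grid are assumptions for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form Ridge solution without an intercept."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def kfold_cv_curve(X, y, lambdas, k=5, seed=0):
    """Return the mean and standard error of the validation MSE for each lambda."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    fold_mse = np.empty((len(lambdas), k))
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        for a, lam in enumerate(lambdas):
            w = ridge_fit(X[train_idx], y[train_idx], lam)
            fold_mse[a, i] = np.mean((y[val_idx] - X[val_idx] @ w) ** 2)
    mean = fold_mse.mean(axis=1)
    se = fold_mse.std(axis=1, ddof=1) / np.sqrt(k)
    return mean, se

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 30))
w_true = np.concatenate([rng.normal(size=5), np.zeros(25)])
y = X @ w_true + rng.normal(scale=2.0, size=80)

lambdas = np.logspace(-2, 3, 30)
cv_mean, cv_se = kfold_cv_curve(X, y, lambdas)

# lambda_min: the grid value with the lowest mean CV error.
i_min = int(np.argmin(cv_mean))
lambda_min = lambdas[i_min]

# 1-SE rule: the largest lambda whose CV error is within one SE of the minimum.
threshold = cv_mean[i_min] + cv_se[i_min]
lambda_1se = lambdas[cv_mean <= threshold].max()

print(f"lambda_min = {lambda_min:.4g}, lambda_1se = {lambda_1se:.4g}")
```

The 1-SE choice is never smaller than $\lambda_{\min}$, so it always gives an equally or more strongly regularized model.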

Implementation in Python

Basic Setup

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso, ElasticNet, RidgeCV, LassoCV, ElasticNetCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score

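rng = np.random.default_rng(42)
n_samples, n_features = 200, 50

# Draw independent features and a sparse ground-truth coefficient vector
# (10 informative features). The target is rebuilt below, so it is discarded here.
X, _, true_coef = make_regression(
    n_samples=n_samples,
    n_features=n_features,
    n_informative=10,
    noise=0.0,
    coef=True,
    random_state=42
)

# Correlate neighbouring features with an AR(1) structure: corr(i, j) = 0.8 ** |i - j|.
# This matrix is positive definite, which np.linalg.cholesky requires.
idx = np.arange(n_features)
correlation_matrix = 0.8 ** np.abs(idx[:, None] - idx[None, :])

L = np.linalg.cholesky(correlation_matrix)
X = X @ L.T

# Generate the target from the correlated features so true_coef stays the ground truth.
y = X @ true_coef + rng.normal(scale=20, size=n_samples)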

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Ridge Regression

ridge_cv = RidgeCV(
    alphas=np.logspace(-3, 3, 100),
    scoring='neg_mean_squared_error',
    cv=5
)
ridge_cv.fit(X_train_scaled, y_train)

print(f"Best alpha (Ridge): {ridge_cv.alpha_:.4f}")

y_pred_ridge = ridge_cv.predict(X_test_scaled)
print(f"Ridge R²: {r2_score(y_test, y_pred_ridge):.4f}")
print(f"Ridge MSE: {mean_squared_error(y_test, y_pred_ridge):.4f}")
print(f"Non-zero coefficients: {np.sum(ridge_cv.coef_ != 0)}/{n_features}")

Lasso Regression

lasso_cv = LassoCV(
    alphas=np.logspace(-3, 1, 100),
    cv=5,
    max_iter=10000,
    random_state=42
)
lasso_cv.fit(X_train_scaled, y_train)

print(f"Best alpha (Lasso): {lasso_cv.alpha_:.4f}")

y_pred_lasso = lasso_cv.predict(X_test_scaled)
print(f"Lasso R²: {r2_score(y_test, y_pred_lasso):.4f}")
print(f"Lasso MSE: {mean_squared_error(y_test, y_pred_lasso):.4f}")
print(f"Non-zero coefficients: {np.sum(lasso_cv.coef_ != 0)}/{n_features}")
print(f"Zero coefficients: {np.sum(lasso_cv.coef_ == 0)}/{n_features}")

selected_features = np.where(lasso_cv.coef_ != 0)[0]
print(f"Selected features: {selected_features}")
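One detail worth checking when moving between the formulas and the code: scikit-learn's `Lasso` minimizes $\frac{1}{2n}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \texttt{alpha}\,\|\mathbf{w}\|_1$, so the $\lambda$ in this lesson's Lasso objective corresponds to `alpha = λ / (2n)`. The sketch below verifies this through the optimality condition $|2\,\mathbf{x}_j^T\mathbf{r}| \leq \lambda$ for the residual $\mathbf{r}$. The data and the value of `lam` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 120, 10
X_demo = rng.normal(size=(n, p))
y_demo = 4.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(size=n)

lam = 60.0               # lambda in this lesson's objective
alpha = lam / (2 * n)    # equivalent scikit-learn alpha

model = Lasso(alpha=alpha, fit_intercept=False, tol=1e-10, max_iter=100_000)
model.fit(X_demo, y_demo)

residual = y_demo - X_demo @ model.coef_
# Optimality check for ||y - Xw||^2 + lam * ||w||_1:
# |2 x_j^T r| <= lam for every feature, with equality where w_j != 0.
kkt = np.abs(2 * X_demo.T @ residual)
active = model.coef_ != 0

print("max |2 x_j^T r|:", round(float(kkt.max()), 4), " lambda:", lam)
print("active features:", np.where(active)[0])
```

`Ridge` in scikit-learn does not divide the squared error by $2n$, so its `alpha` is on a different scale from the `alpha` of `Lasso` and `ElasticNet`.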

ElasticNet

elastic_cv = ElasticNetCV(
    l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 1.0],
    alphas=np.logspace(-3, 1, 100),
    cv=5,
    max_iter=10000,
    random_state=42
)
elastic_cv.fit(X_train_scaled, y_train)

print(f"Best alpha: {elastic_cv.alpha_:.4f}")
print(f"Best l1_ratio: {elastic_cv.l1_ratio_:.4f}")

y_pred_elastic = elastic_cv.predict(X_test_scaled)
print(f"ElasticNet R²: {r2_score(y_test, y_pred_elastic):.4f}")
print(f"ElasticNet MSE: {mean_squared_error(y_test, y_pred_elastic):.4f}")
print(f"Non-zero coefficients: {np.sum(elastic_cv.coef_ != 0)}/{n_features}")

Visualizing the Regularization Path

def plot_regularization_path(X, y, alphas):
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
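    # scikit-learn's Ridge alpha is on a different scale from the Lasso/ElasticNet
    # alpha, so the x-axes of the three panels are not directly comparable.
    # A higher max_iter helps coordinate descent converge for small alpha.
    models = [
        ('Ridge', Ridge()),
        ('Lasso', Lasso(max_iter=10000)),
        ('ElasticNet', ElasticNet(l1_ratio=0.5, max_iter=10000))
    ]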
    for ax, (name, model) in zip(axes, models):
        coefs = []
        for a in alphas:
            model.set_params(alpha=a)
            model.fit(X, y)
            coefs.append(model.coef_)
        coefs = np.array(coefs)
        for i in range(coefs.shape[1]):
            ax.plot(np.log10(alphas), coefs[:, i], linewidth=0.8)
        ax.set_xlabel('log10(alpha)')
        ax.set_ylabel('Coefficient Value')
        ax.set_title(f'{name} Path')
        ax.axhline(y=0, color='k', linestyle='--', linewidth=0.5)
        ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig('regularization_paths.png', dpi=150, bbox_inches='tight')
    plt.show()

alphas = np.logspace(-3, 3, 200)
plot_regularization_path(X_train_scaled, y_train, alphas)

Comparing Models

def compare_regularization(X_train, y_train, X_test, y_test):
    results = {}
    ridge = RidgeCV(
        alphas=np.logspace(-3, 3, 100),
        scoring='neg_mean_squared_error',
        cv=5
    )
    ridge.fit(X_train, y_train)
    results['Ridge'] = {
        'alpha': ridge.alpha_,
        'r2': r2_score(y_test, ridge.predict(X_test)),
        'mse': mean_squared_error(y_test, ridge.predict(X_test)),
        'n_nonzero': np.sum(ridge.coef_ != 0)
    }
    lasso = LassoCV(
        alphas=np.logspace(-3, 1, 100),
        cv=5,
        max_iter=10000,
        random_state=42
    )
    lasso.fit(X_train, y_train)
    results['Lasso'] = {
        'alpha': lasso.alpha_,
        'r2': r2_score(y_test, lasso.predict(X_test)),
        'mse': mean_squared_error(y_test, lasso.predict(X_test)),
        'n_nonzero': np.sum(lasso.coef_ != 0)
    }
    elastic = ElasticNetCV(
        l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 1.0],
        alphas=np.logspace(-3, 1, 100),
        cv=5,
        max_iter=10000,
        random_state=42
    )
    elastic.fit(X_train, y_train)
    results['ElasticNet'] = {
        'alpha': elastic.alpha_,
        'l1_ratio': elastic.l1_ratio_,
        'r2': r2_score(y_test, elastic.predict(X_test)),
        'mse': mean_squared_error(y_test, elastic.predict(X_test)),
        'n_nonzero': np.sum(elastic.coef_ != 0)
    }
    return results

results = compare_regularization(X_train_scaled, y_train, X_test_scaled, y_test)

for model, metrics in results.items():
    print(f"\n{model}:")
    print(f"  alpha = {metrics['alpha']:.4f}")
    if 'l1_ratio' in metrics:
        print(f"  l1_ratio = {metrics['l1_ratio']:.2f}")
    print(f"  R2 = {metrics['r2']:.4f}")
    print(f"  MSE = {metrics['mse']:.4f}")
    print(f"  Non-zero coefficients: {metrics['n_nonzero']}/{n_features}")

When to Use Each Method

| Scenario | Recommended Method | Reason |
|---|---|---|
| Many small effects | Ridge | Keeps all features, reduces magnitude |
| Few strong predictors | Lasso | Automatic feature selection |
| Correlated features | ElasticNet | Grouped selection, stability |
| High-dimensional ($p > n$) | ElasticNet | Handles collinearity, selects features |
| Interpretability needed | Lasso | Sparse model |
| Maximum accuracy needed | Ridge/ElasticNet | Depends on data structure |

Practical Guidelines

1. Always Standardize

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

2. Start with Cross-Validation

from sklearn.linear_model import ElasticNetCV

model = ElasticNetCV(
    l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0],
    alphas=np.logspace(-4, 2, 50),
    cv=5,
    max_iter=10000
)
model.fit(X_train_scaled, y_train)

3. Examine the Regularization Path

from sklearn.linear_model import lasso_path

alphas, coefs, _ = lasso_path(X_train_scaled, y_train, alphas=np.logspace(-3, 1, 50))
n_nonzero = [np.sum(np.abs(c) > 1e-10) for c in coefs.T]

plt.figure(figsize=(8, 5))
plt.plot(np.log10(alphas), n_nonzero)
plt.xlabel('log10(alpha)')
plt.ylabel('Number of Non-zero Coefficients')
plt.title('Feature Selection vs Regularization Strength')
plt.grid(True, alpha=0.3)
plt.show()

4. Compare with OLS

from sklearn.linear_model import LinearRegression

ols = LinearRegression()
ols.fit(X_train_scaled, y_train)

print("Model Comparison:")
print(f"  OLS: R2 = {r2_score(y_test, ols.predict(X_test_scaled)):.4f}")
print(f"  Ridge: R2 = {r2_score(y_test, ridge_cv.predict(X_test_scaled)):.4f}")
print(f"  Lasso: R2 = {r2_score(y_test, lasso_cv.predict(X_test_scaled)):.4f}")
print(f"  ElasticNet: R2 = {r2_score(y_test, elastic_cv.predict(X_test_scaled)):.4f}")

Summary

| Property | Ridge (L2) | Lasso (L1) | ElasticNet |
|---|---|---|---|
| Penalty | $\sum w_j^2$ | $\sum \lvert w_j \rvert$ | $\alpha\sum \lvert w_j \rvert + \frac{1-\alpha}{2}\sum w_j^2$ |
| Sparsity | No | Yes | Yes |
| Feature Selection | No | Yes | Yes |
| Correlated Features | Keeps all | Selects one | Selects group |
| Solution | Closed-form | Iterative | Iterative |
| When to Use | Many small effects | Few strong predictors | Mixed scenarios |

Key Takeaways:

  1. Regularization prevents overfitting by penalizing model complexity
  2. Ridge shrinks coefficients but keeps all features
  3. Lasso performs automatic feature selection via sparsity
  4. ElasticNet combines both benefits and is often a good default choice
  5. Always use cross-validation to select the regularization strength $\lambda$
  6. Standardize features before applying regularization
