Regularization: Ridge, Lasso and ElasticNet

The Problem of Overfitting

When we fit a model to training data, we want it to generalize well to unseen data. A model that memorizes training data but fails on test data is said to overfit.

Bias-Variance Tradeoff

The fundamental tension in supervised learning:

Source	Description	Effect on Test Error
Bias	Error from wrong assumptions	High → underfitting
Variance	Error from sensitivity to training data	High → overfitting
Irreducible	Noise in the data	Cannot reduce

As model complexity increases:

Bias decreases (model captures more patterns)
Variance increases (model becomes more sensitive)
Total error follows a U-shaped curve
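The trend above can be illustrated with a small experiment. The sketch below is only an illustration under assumed conditions: a noisy sine wave as the target, polynomial degree as the measure of complexity, and arbitrary sample sizes. None of these choices come from a specific dataset.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)

def make_data(n):
    # Noisy samples of a sine wave (an assumed toy target)
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

degrees = [1, 3, 6, 9]
train_mse, test_mse = [], []
for d in degrees:
    # Least-squares polynomial fit of degree d on the training set
    coefs = P.polyfit(x_train, y_train, d)
    train_mse.append(np.mean((y_train - P.polyval(x_train, coefs)) ** 2))
    test_mse.append(np.mean((y_test - P.polyval(x_test, coefs)) ** 2))

for d, tr, te in zip(degrees, train_mse, test_mse):
    print(f"degree={d}  train MSE={tr:.3f}  test MSE={te:.3f}")
```

Training error keeps falling as the degree grows, while test error typically stops improving and then rises once the model starts fitting noise.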

Why Regularization?

Regularization addresses overfitting by penalizing model complexity. Instead of minimizing only the loss function $L(w)$, we minimize:

$$J(w) = L(w) + \lambda R(w)$$

where:

$\lambda \ge 0$ is the regularization strength (hyperparameter)
$R(w)$ is the regularization term (penalty function)
$w$ are the model parameters

Key insight: Regularization trades training performance for generalization.
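As a minimal sketch of this idea, the function below evaluates a penalized objective for a linear model, assuming a squared-error loss and either an L1 or an L2 penalty. The function name `penalized_loss` and its arguments are hypothetical, chosen for illustration.

```python
import numpy as np

def penalized_loss(w, X, y, lam, penalty="l2"):
    """Squared-error loss plus lam times a penalty on the coefficients w."""
    residual = y - X @ w
    data_loss = np.sum(residual ** 2)
    if penalty == "l2":
        reg = np.sum(w ** 2)        # squared magnitude of coefficients
    elif penalty == "l1":
        reg = np.sum(np.abs(w))     # absolute magnitude of coefficients
    else:
        raise ValueError("penalty must be 'l1' or 'l2'")
    return data_loss + lam * reg

X = np.eye(2)
y = np.array([1.0, 2.0])
w = np.array([2.0, 0.0])

loss_l2 = penalized_loss(w, X, y, lam=0.5, penalty="l2")  # 5 + 0.5 * 4 = 7.0
loss_l1 = penalized_loss(w, X, y, lam=0.5, penalty="l1")  # 5 + 0.5 * 2 = 6.0
print(loss_l2, loss_l1)
```

With `lam=0` the function reduces to the plain squared-error loss, so the penalty is the only thing pushing the coefficients toward zero.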

Ridge Regression (L2 Regularization)

Formulation

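Ridge regression adds an L2 penalty (squared magnitude of coefficients) to the least-squares loss, where $w$ is the coefficient vector:

$$\hat{w}_{\text{ridge}} = \arg\min_{w} \; \|y - Xw\|_2^2 + \lambda \|w\|_2^2$$

Expanding the penalty term:

$$\|w\|_2^2 = \sum_{j=1}^{p} w_j^2$$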

Closed-Form Solution

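Taking the derivative with respect to $w$ and setting it to zero:

$$\hat{w}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$$

Key properties:

Always has a unique solution ($X^\top X + \lambda I$ is always invertible for $\lambda > 0$)
Coefficients are shrunk toward zero but never exactly zero
Equivalent to OLS with $X^\top X$ replaced by $X^\top X + \lambda I$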
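The closed form can be checked numerically. The sketch below assumes the standard ridge solution $(X^\top X + \lambda I)^{-1} X^\top y$ with no intercept, and compares it with scikit-learn's `Ridge`, whose `alpha` plays the role of $\lambda$ for this objective. The helper name `ridge_closed_form` and the synthetic data are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y without forming an explicit inverse."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=0.1, size=100)

lam = 10.0
w_closed = ridge_closed_form(X, y, lam)
w_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print(np.round(w_closed, 4))
print(np.round(w_sklearn, 4))
```

Using `np.linalg.solve` rather than inverting the matrix is a common choice for numerical stability. Note that the coefficient whose true value is zero is shrunk but generally not set to exactly zero.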

Geometric Interpretation

The Ridge solution is where the elliptical contours of the OLS loss meet the circular L2 constraint. Since circles are smooth, the intersection rarely occurs exactly on an axis.

Lasso Regression (L1 Regularization)

Formulation

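Lasso (Least Absolute Shrinkage and Selection Operator) adds an L1 penalty (absolute magnitude):

$$\hat{w}_{\text{lasso}} = \arg\min_{w} \; \|y - Xw\|_2^2 + \lambda \|w\|_1$$

Expanding:

$$\|w\|_1 = \sum_{j=1}^{p} |w_j|$$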

No Closed-Form Solution

Unlike Ridge, Lasso has no closed-form solution in general, because the absolute value is not differentiable at zero. We use:

Coordinate descent
Proximal gradient methods
Subgradient methods
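To make the first of these concrete, here is a minimal coordinate descent sketch. It assumes the objective $\|y - Xw\|_2^2 + \lambda\|w\|_1$ with no intercept and a fixed number of sweeps; it is not how any particular library implements the solver. Each coordinate update shrinks the one-dimensional least-squares solution toward zero by $\lambda/2$. Because scikit-learn's `Lasso` scales the squared error by $1/(2n)$, the matching setting is `alpha = lam / (2 * n_samples)`.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_coordinate_descent(X, y, lam, n_sweeps=500):
    """Minimize ||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    col_sq_norm = np.sum(X ** 2, axis=0)
    residual = y - X @ w
    for _ in range(n_sweeps):
        for j in range(n_features):
            residual += X[:, j] * w[j]      # remove feature j's contribution
            rho = X[:, j] @ residual        # correlation with partial residual
            # Shrink by lam / 2, then rescale by the column's squared norm
            w[j] = np.sign(rho) * max(abs(rho) - lam / 2, 0.0) / col_sq_norm[j]
            residual -= X[:, j] * w[j]      # add the updated contribution back
    return w

rng = np.random.default_rng(0)
n_samples = 100
X = rng.standard_normal((n_samples, 8))
w_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0, 1.5, 0.0])
y = X @ w_true + rng.normal(scale=0.5, size=n_samples)

lam = 40.0
w_cd = lasso_coordinate_descent(X, y, lam)
w_sk = Lasso(alpha=lam / (2 * n_samples), fit_intercept=False,
             tol=1e-10, max_iter=100000).fit(X, y).coef_

print(np.round(w_cd, 4))
print(np.round(w_sk, 4))
```

A production solver would stop on a convergence criterion instead of a fixed sweep count; the fixed count keeps the sketch short.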

Sparsity Property

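The L1 penalty induces sparsity: it drives some coefficients to exactly zero. For an orthonormal design ($X^\top X = I$) and the objective $\|y - Xw\|_2^2 + \lambda\|w\|_1$, the solution is

$$\hat{w}_j = S_{\lambda/2}\left(\hat{w}_j^{\text{OLS}}\right) = \operatorname{sign}\left(\hat{w}_j^{\text{OLS}}\right) \max\left(\left|\hat{w}_j^{\text{OLS}}\right| - \tfrac{\lambda}{2},\; 0\right)$$

where $S_t(\cdot)$ is the soft-thresholding operator with threshold $t$.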
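A short sketch of the soft-thresholding operator shows where the exact zeros come from. The input coefficients and the threshold below are hypothetical values picked for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; anything within [-t, t] becomes zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

w_ols = np.array([2.0, 0.2, -1.0, -0.1])  # hypothetical unpenalized coefficients
w_shrunk = soft_threshold(w_ols, t=0.3)

print(w_shrunk)
print("zeroed coefficients:", int(np.sum(w_shrunk == 0)))
```

Coefficients larger than the threshold in magnitude are moved toward zero by the threshold; the two smaller ones are set to zero outright, which is the sparsity effect described above.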

Why does L1 produce sparsity?

The L1 constraint region (a diamond in two dimensions) has corners on the axes. The elliptical contours of the OLS loss are more likely to first touch the region at one of these corners, yielding solutions where some $w_j = 0$.

ElasticNet: Best of Both Worlds

Formulation

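ElasticNet combines L1 and L2 penalties:

$$\hat{w}_{\text{enet}} = \arg\min_{w} \; \|y - Xw\|_2^2 + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2$$

Using the mixing parameter $\alpha \in [0, 1]$, with $\lambda_1 = \lambda\alpha$ and $\lambda_2 = \lambda(1 - \alpha)$:

$$\hat{w}_{\text{enet}} = \arg\min_{w} \; \|y - Xw\|_2^2 + \lambda \left( \alpha \|w\|_1 + (1 - \alpha) \|w\|_2^2 \right)$$

Note that scikit-learn exposes the mixing parameter as `l1_ratio` and calls the overall strength `alpha`, with a differently scaled objective.

Value	Behavior
$\alpha = 0$	Pure Ridge (L2 only)
$\alpha = 1$	Pure Lasso (L1 only)
$0 < \alpha < 1$	ElasticNet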

Advantages over Lasso

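Grouped selection: When features are correlated, Lasso tends to select one arbitrarily; ElasticNet tends to select the group
Strictly convex penalty: The L2 term makes the objective strictly convex, so the solution is unique (the penalty is still non-differentiable at zero because of the L1 term)
More stable: Less sensitive to small changes in data
$p > n$: Works when the number of features exceeds the number of samples, where Lasso can select at most $n$ features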
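Grouped selection is easiest to see in an extreme case. The sketch below assumes a synthetic dataset in which two columns are exact copies of each other; the `alpha` and `l1_ratio` values are arbitrary illustrative settings for scikit-learn's `Lasso` and `ElasticNet`.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n = 200
z = rng.standard_normal(n)
unrelated = rng.standard_normal(n)

# Columns 0 and 1 are identical; column 2 is unrelated to the target
X = np.column_stack([z, z, unrelated])
y = 3.0 * z + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.1, tol=1e-10, max_iter=100000).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, tol=1e-10, max_iter=100000).fit(X, y)

print("Lasso coefficients:     ", np.round(lasso.coef_, 4))
print("ElasticNet coefficients:", np.round(enet.coef_, 4))
```

Because the ElasticNet objective is strictly convex, the two identical columns receive equal coefficients. The Lasso objective is indifferent to how the weight is split between them, so the split it reports depends on the solver and can land entirely on one column.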

Coefficient Shrinkage Visualization

Regularization Path

The regularization path shows how coefficients change as $\lambda$ varies:

[Figure: Lasso regularization path. Coefficient value (wⱼ) versus log(λ), with λ_min and λ_1se marked and features labeled as selected or dropped. As λ increases, coefficients → 0.]

Key Observations

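Left side ($\lambda$ small): All features included, model close to OLS
Right side ($\lambda$ large): Most coefficients zero, simple model
Sparsity: Lasso drives coefficients to exactly zero sequentially
$\lambda_{\min}$: Value of $\lambda$ that minimizes cross-validation error
$\lambda_{\text{1se}}$: Largest $\lambda$ within 1 SE of $\lambda_{\min}$, that is, whose cross-validation error is within one standard error of the minimum (sparser model)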

Choosing Lambda: Cross-Validation

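We select $\lambda$ using k-fold cross-validation, averaging the validation error over the $k$ folds:

$$\text{CV}(\lambda) = \frac{1}{k} \sum_{i=1}^{k} \text{MSE}_i(\lambda), \qquad \lambda_{\min} = \arg\min_{\lambda} \text{CV}(\lambda)$$

where $\text{MSE}_i(\lambda)$ is the error on fold $i$ of the model trained on the remaining folds.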

Common Selection Strategies

Strategy	Description	When to Use
$\lambda_{\min}$	Minimizes CV error	Maximum predictive power
$\lambda_{\text{1se}}$	Largest $\lambda$ within 1 SE of $\lambda_{\min}$	Simpler, more interpretable model
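scikit-learn's `LassoCV` reports only the error-minimizing value, so the one-standard-error choice has to be computed by hand. The sketch below is one way to do that from the fitted `alphas_` and `mse_path_` attributes (scikit-learn calls the regularization strength `alpha`). The synthetic data and the alpha grid are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=20, random_state=0)
X = StandardScaler().fit_transform(X)

model = LassoCV(alphas=np.logspace(-2, 2, 60), cv=5, max_iter=10000).fit(X, y)

# mse_path_ has shape (n_alphas, n_folds); rows line up with alphas_
n_folds = model.mse_path_.shape[1]
mean_mse = model.mse_path_.mean(axis=1)
se_mse = model.mse_path_.std(axis=1, ddof=1) / np.sqrt(n_folds)

i_min = np.argmin(mean_mse)
alpha_min = model.alphas_[i_min]

# Largest alpha whose mean CV error is within one SE of the minimum
threshold = mean_mse[i_min] + se_mse[i_min]
alpha_1se = model.alphas_[mean_mse <= threshold].max()

print(f"alpha_min = {alpha_min:.4f}")
print(f"alpha_1se = {alpha_1se:.4f}")
```

Refitting a plain `Lasso` with `alpha_1se` then gives the sparser model, at the cost of a cross-validation error that is slightly higher but statistically hard to distinguish from the minimum.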

Implementation in Python

Basic Setup

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso, ElasticNet, RidgeCV, LassoCV, ElasticNetCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score

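rng = np.random.default_rng(42)
n_samples, n_features = 200, 50

# Draw independent features and a sparse coefficient vector (10 informative features).
# noise=0 here: the target is rebuilt below, after the features have been correlated.
X, _, true_coef = make_regression(
    n_samples=n_samples,
    n_features=n_features,
    n_informative=10,
    noise=0,
    coef=True,
    random_state=42
)

# AR(1) correlation: corr(x_i, x_j) = 0.8 ** |i - j|. This matrix is positive
# definite. A tridiagonal matrix with 0.8 off the diagonal is not, so
# np.linalg.cholesky would raise LinAlgError on it.
idx = np.arange(n_features)
correlation_matrix = 0.8 ** np.abs(idx[:, None] - idx[None, :])

L = np.linalg.cholesky(correlation_matrix)
X = X @ L.T

# Generate the target from the correlated features so true_coef stays the ground truth
y = X @ true_coef + rng.normal(scale=20, size=n_samples)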

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Ridge Regression

ridge_cv = RidgeCV(
    alphas=np.logspace(-3, 3, 100),
    scoring='neg_mean_squared_error',
    cv=5
)
ridge_cv.fit(X_train_scaled, y_train)

print(f"Best alpha (Ridge): {ridge_cv.alpha_:.4f}")

y_pred_ridge = ridge_cv.predict(X_test_scaled)
print(f"Ridge R²: {r2_score(y_test, y_pred_ridge):.4f}")
print(f"Ridge MSE: {mean_squared_error(y_test, y_pred_ridge):.4f}")
print(f"Non-zero coefficients: {np.sum(ridge_cv.coef_ != 0)}/{n_features}")

Lasso Regression

lasso_cv = LassoCV(
    alphas=np.logspace(-3, 1, 100),
    cv=5,
    max_iter=10000,
    random_state=42
)
lasso_cv.fit(X_train_scaled, y_train)

print(f"Best alpha (Lasso): {lasso_cv.alpha_:.4f}")

y_pred_lasso = lasso_cv.predict(X_test_scaled)
print(f"Lasso R²: {r2_score(y_test, y_pred_lasso):.4f}")
print(f"Lasso MSE: {mean_squared_error(y_test, y_pred_lasso):.4f}")
print(f"Non-zero coefficients: {np.sum(lasso_cv.coef_ != 0)}/{n_features}")
print(f"Zero coefficients: {np.sum(lasso_cv.coef_ == 0)}/{n_features}")

selected_features = np.where(lasso_cv.coef_ != 0)[0]
print(f"Selected features: {selected_features}")

ElasticNet

elastic_cv = ElasticNetCV(
    l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 1.0],
    alphas=np.logspace(-3, 1, 100),
    cv=5,
    max_iter=10000,
    random_state=42
)
elastic_cv.fit(X_train_scaled, y_train)

print(f"Best alpha: {elastic_cv.alpha_:.4f}")
print(f"Best l1_ratio: {elastic_cv.l1_ratio_:.4f}")

y_pred_elastic = elastic_cv.predict(X_test_scaled)
print(f"ElasticNet R²: {r2_score(y_test, y_pred_elastic):.4f}")
print(f"ElasticNet MSE: {mean_squared_error(y_test, y_pred_elastic):.4f}")
print(f"Non-zero coefficients: {np.sum(elastic_cv.coef_ != 0)}/{n_features}")

Visualizing the Regularization Path

def plot_regularization_path(X, y, alphas):
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    models = [
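        ('Ridge', Ridge()),
        # Raise max_iter so the coordinate descent solvers converge at small alphas
        ('Lasso', Lasso(max_iter=10000)),
        ('ElasticNet', ElasticNet(l1_ratio=0.5, max_iter=10000))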
    ]
    for ax, (name, model) in zip(axes, models):
        coefs = []
        for a in alphas:
            model.set_params(alpha=a)
            model.fit(X, y)
            coefs.append(model.coef_)
        coefs = np.array(coefs)
        for i in range(coefs.shape[1]):
            ax.plot(np.log10(alphas), coefs[:, i], linewidth=0.8)
        ax.set_xlabel('log10(alpha)')
        ax.set_ylabel('Coefficient Value')
        ax.set_title(f'{name} Path')
        ax.axhline(y=0, color='k', linestyle='--', linewidth=0.5)
        ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig('regularization_paths.png', dpi=150, bbox_inches='tight')
    plt.show()

alphas = np.logspace(-3, 3, 200)
plot_regularization_path(X_train_scaled, y_train, alphas)

Comparing Models

def compare_regularization(X_train, y_train, X_test, y_test):
    results = {}
    ridge = RidgeCV(
        alphas=np.logspace(-3, 3, 100),
        scoring='neg_mean_squared_error',
        cv=5
    )
    ridge.fit(X_train, y_train)
    results['Ridge'] = {
        'alpha': ridge.alpha_,
        'r2': r2_score(y_test, ridge.predict(X_test)),
        'mse': mean_squared_error(y_test, ridge.predict(X_test)),
        'n_nonzero': np.sum(ridge.coef_ != 0)
    }
    lasso = LassoCV(
        alphas=np.logspace(-3, 1, 100),
        cv=5,
        max_iter=10000,
        random_state=42
    )
    lasso.fit(X_train, y_train)
    results['Lasso'] = {
        'alpha': lasso.alpha_,
        'r2': r2_score(y_test, lasso.predict(X_test)),
        'mse': mean_squared_error(y_test, lasso.predict(X_test)),
        'n_nonzero': np.sum(lasso.coef_ != 0)
    }
    elastic = ElasticNetCV(
        l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 1.0],
        alphas=np.logspace(-3, 1, 100),
        cv=5,
        max_iter=10000,
        random_state=42
    )
    elastic.fit(X_train, y_train)
    results['ElasticNet'] = {
        'alpha': elastic.alpha_,
        'l1_ratio': elastic.l1_ratio_,
        'r2': r2_score(y_test, elastic.predict(X_test)),
        'mse': mean_squared_error(y_test, elastic.predict(X_test)),
        'n_nonzero': np.sum(elastic.coef_ != 0)
    }
    return results

results = compare_regularization(X_train_scaled, y_train, X_test_scaled, y_test)

for model, metrics in results.items():
    print(f"\n{model}:")
    print(f"  alpha = {metrics['alpha']:.4f}")
    if 'l1_ratio' in metrics:
        print(f"  l1_ratio = {metrics['l1_ratio']:.2f}")
    print(f"  R2 = {metrics['r2']:.4f}")
    print(f"  MSE = {metrics['mse']:.4f}")
    print(f"  Non-zero coefficients: {metrics['n_nonzero']}/{n_features}")

When to Use Each Method

Scenario	Recommended Method	Reason
Many small effects	Ridge	Keeps all features, reduces magnitude
Few strong predictors	Lasso	Automatic feature selection
Correlated features	ElasticNet	Grouped selection, stability
High-dimensional ($p > n$)	ElasticNet	Handles collinearity, selects features
Interpretability needed	Lasso	Sparse model
Maximum accuracy needed	Ridge/ElasticNet	Depends on data structure

Practical Guidelines

1. Always Standardize

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
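One way to make this step hard to forget is to put the scaler inside a scikit-learn `Pipeline`, so that it is refit on the training portion of every cross-validation fold. The sketch below assumes synthetic data with one feature deliberately placed on a much larger scale; the `alpha` value is arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10, random_state=0)
X[:, 0] *= 1000.0  # put one feature on a much larger scale

# The scaler is fit only on the training part of each fold
pipe = make_pipeline(StandardScaler(), Lasso(alpha=1.0, max_iter=10000))
scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")

print(f"Mean CV R2: {scores.mean():.4f}")
```

Without standardization, the penalty would treat the rescaled feature differently from the others simply because of its units, since the same coefficient penalty corresponds to very different effect sizes.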

2. Start with Cross-Validation

from sklearn.linear_model import ElasticNetCV

model = ElasticNetCV(
    l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0],
    alphas=np.logspace(-4, 2, 50),
    cv=5,
    max_iter=10000
)
model.fit(X_train_scaled, y_train)

3. Examine the Regularization Path

from sklearn.linear_model import lasso_path

alphas, coefs, _ = lasso_path(X_train_scaled, y_train, alphas=np.logspace(-3, 1, 50))
n_nonzero = [np.sum(np.abs(c) > 1e-10) for c in coefs.T]

plt.figure(figsize=(8, 5))
plt.plot(np.log10(alphas), n_nonzero)
plt.xlabel('log10(alpha)')
plt.ylabel('Number of Non-zero Coefficients')
plt.title('Feature Selection vs Regularization Strength')
plt.grid(True, alpha=0.3)
plt.show()

4. Compare with OLS

from sklearn.linear_model import LinearRegression

ols = LinearRegression()
ols.fit(X_train_scaled, y_train)

print("Model Comparison:")
print(f"  OLS: R2 = {r2_score(y_test, ols.predict(X_test_scaled)):.4f}")
print(f"  Ridge: R2 = {r2_score(y_test, ridge_cv.predict(X_test_scaled)):.4f}")
print(f"  Lasso: R2 = {r2_score(y_test, lasso_cv.predict(X_test_scaled)):.4f}")
print(f"  ElasticNet: R2 = {r2_score(y_test, elastic_cv.predict(X_test_scaled)):.4f}")

Summary

Property	Ridge (L2)	Lasso (L1)	ElasticNet
Penalty	$\lambda \|w\|_2^2$	$\lambda \|w\|_1$	$\lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2$
Sparsity	No	Yes	Yes
Feature Selection	No	Yes	Yes
Correlated Features	Keeps all	Selects one	Selects group
Solution	Closed-form	Iterative	Iterative
When to Use	Many small effects	Few strong predictors	Mixed scenarios

Key Takeaways:

Regularization prevents overfitting by penalizing model complexity
Ridge shrinks coefficients but keeps all features
Lasso performs automatic feature selection via sparsity
ElasticNet combines both benefits and is often a good default when features are correlated
Always use cross-validation to select the regularization strength
Standardize features before applying regularization
