Regularization: Lasso, Ridge, ElasticNet
Why Regularization?
Without constraints, models minimize training error at the cost of generalization. Regularization adds a penalty term that discourages complex models.
Definition: Regularization
A technique that adds a penalty to the loss function to constrain the model's complexity, preventing overfitting by discouraging large coefficient values. Regularization introduces a bias-variance tradeoff: it accepts a small increase in bias in exchange for a reduction in variance, which often leads to better generalization on unseen data.
Overfitting Visualization:
[Figure: training loss and validation loss plotted against model complexity.]
Regularization shifts the minimum toward simpler models:
[Figure: loss plotted against model complexity, with the regularized curve marked.]
The Bias-Variance Tradeoff
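Theorem: Bias-Variance Decomposition
For a model $\hat{f}$ trained on a dataset $\mathcal{D}$, the expected prediction error at any test point $x$ decomposes as:
$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \text{Bias}[\hat{f}(x)]^2 + \text{Var}[\hat{f}(x)] + \sigma^2$$
where $\sigma^2$ is the irreducible noise variance. Regularization increases $\text{Bias}^2$ but reduces $\text{Var}$, often decreasing the total error.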
- Bias: Error from wrong assumptions (underfitting)
- Variance: Error from sensitivity to training data (overfitting)
With a well-chosen penalty strength, regularization adds a little bias and removes more variance than it adds in squared bias.
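The tradeoff can be checked with a small simulation. The sketch below is illustrative only: it assumes a fixed design matrix, a hypothetical true coefficient vector of all ones, a hypothetical test point, and a no-intercept ridge fit written directly in NumPy. It resamples the noise many times and estimates the squared bias and the variance of the prediction at the test point for several penalty strengths.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate without an intercept."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def bias_variance_at_point(lam, n_trials=500, n=30, p=10, sigma=1.0, seed=0):
    """Estimate squared bias and variance of the ridge prediction at one test point."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))      # fixed design (assumption of this sketch)
    beta_true = np.ones(p)           # hypothetical true coefficients
    x_test = np.ones(p)              # hypothetical test point
    f_true = x_test @ beta_true
    preds = np.empty(n_trials)
    for t in range(n_trials):
        # Only the noise is redrawn; X stays the same across trials.
        y = X @ beta_true + sigma * rng.normal(size=n)
        preds[t] = x_test @ ridge_fit(X, y, lam)
    bias_sq = (preds.mean() - f_true) ** 2
    variance = preds.var()
    return bias_sq, variance

results = {lam: bias_variance_at_point(lam) for lam in [0.0, 1.0, 10.0, 100.0]}
for lam, (bias_sq, variance) in results.items():
    print(f"lambda={lam:6.1f}  bias^2={bias_sq:8.4f}  variance={variance:8.4f}")
```

Because every penalty strength reuses the same seed, the rows are directly comparable: the variance column should fall and the squared-bias column should rise as lambda grows. Whether the total goes down depends on the signal and noise levels chosen here.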
Ridge Regression (L2 Regularization)
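Adds the sum of squared coefficients to the loss function:
$$\mathcal{L}_{\text{ridge}}(\beta) = \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$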
Geometric Interpretation
L2 Constraint Region (circle):
[Figure: elliptical OLS loss contours and the circular L2 constraint region in the (β₁, β₂) plane.]
The optimal β̂ is where the loss ellipse first touches the L2 circle.
Result: coefficients are shrunk but, in general, never exactly zero.
Mathematical Derivation
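Definition: Ridge Regression
Also known as Tikhonov regularization, Ridge regression adds an L2 penalty to the ordinary least squares loss. The closed-form solution is $\hat{\beta}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$. The penalty shrinks all coefficients toward zero but, in general, does not set them exactly to zero.
The closed-form solution:
Ridge Closed-Form Solution
$$\hat{\beta}_{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$$
Here,
- $I$ = p × p identity matrix
- $\lambda$ = Regularization strength ($\lambda \geq 0$)
- $X$ = Feature matrix (n × p)
As $\lambda \to 0$: $\hat{\beta}_{\text{ridge}} \to \hat{\beta}_{\text{OLS}}$. As $\lambda \to \infty$: $\hat{\beta}_{\text{ridge}} \to 0$.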
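The closed form $(X^\top X + \lambda I)^{-1} X^\top y$ and its two limits can be checked numerically. This is a minimal sketch on synthetic data with hypothetical coefficients; it omits the intercept, so it assumes the features and target are already centered. It uses a linear solve rather than an explicit matrix inverse, which is the usual numerically safer choice.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y for beta (no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
beta_true = np.array([2.0, -1.0, 0.5])   # hypothetical coefficients
y = X @ beta_true + 0.1 * rng.normal(size=50)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_small = ridge_closed_form(X, y, 1e-10)   # lambda near 0: close to OLS
beta_mid = ridge_closed_form(X, y, 10.0)      # moderate shrinkage
beta_large = ridge_closed_form(X, y, 1e10)    # huge lambda: close to 0

print("OLS:          ", beta_ols)
print("lambda=1e-10: ", beta_small)
print("lambda=10:    ", beta_mid)
print("lambda=1e10:  ", beta_large)
```

The norm of the coefficient vector shrinks as lambda grows, while none of the moderate-lambda coefficients land exactly on zero.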
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
# Generate data with multicollinearity
np.random.seed(42)
X, y, true_coefs = make_regression(
    n_samples=200, n_features=20, n_informative=5,
    noise=10, coef=True, random_state=42
)
# Add correlated features: overwrite columns 5 and 6 with noisy copies
# of columns 0 and 1 (y is left unchanged)
X[:, 5] = X[:, 0] + np.random.randn(200) * 0.1
X[:, 6] = X[:, 1] + np.random.randn(200) * 0.1
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Ridge regression with cross-validation
alphas = np.logspace(-4, 4, 100)
ridge_cv = RidgeCV(alphas=alphas, scoring='neg_mean_squared_error')
ridge_cv.fit(X_train_scaled, y_train)
print(f"Best alpha: {ridge_cv.alpha_:.4f}")
print(f"Test MSE: {mean_squared_error(y_test, ridge_cv.predict(X_test_scaled)):.4f}")
Ridge regression is particularly effective when many features contribute small effects (the "many weak predictors" scenario). It shrinks all coefficients toward zero, most strongly along low-variance directions of the feature matrix, keeping all features in the model while reducing the influence of noisy ones. It handles multicollinearity well because, for any $\lambda > 0$, the $\lambda I$ term makes $X^\top X + \lambda I$ invertible even when $X^\top X$ is singular.
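The invertibility argument can be seen directly from the eigenvalues. The sketch below builds a deliberately degenerate feature matrix (one column is an exact copy of another, an assumption chosen to make $X^\top X$ singular) and compares the smallest eigenvalue with and without the $\lambda I$ term. The data and coefficients are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x3 = rng.normal(size=100)
X = np.column_stack([x1, x1, x3])     # column 1 is an exact copy of column 0
y = 3.0 * x1 + 0.5 * x3 + 0.1 * rng.normal(size=100)

gram = X.T @ X
lam = 1.0

# eigvalsh returns eigenvalues in ascending order.
eig_ols = np.linalg.eigvalsh(gram)                      # smallest is ~0: singular
eig_ridge = np.linalg.eigvalsh(gram + lam * np.eye(3))  # every eigenvalue shifted up by lam

# The ridge system is well-posed even though the OLS normal equations are not.
beta_ridge = np.linalg.solve(gram + lam * np.eye(3), X.T @ y)

print("smallest eigenvalue of X^T X:         ", eig_ols[0])
print("smallest eigenvalue of X^T X + lam I: ", eig_ridge[0])
print("ridge coefficients:", beta_ridge)
```

Adding $\lambda I$ raises every eigenvalue by $\lambda$, so the smallest one moves from about 0 to about 1 here. The two duplicated columns receive the same ridge coefficient, which is the sense in which Ridge spreads weight across correlated features instead of picking one.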
Lasso Regression (L1 Regularization)
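Adds the sum of absolute coefficient values:
$$\mathcal{L}_{\text{lasso}}(\beta) = \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$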
Geometric Interpretation
L1 Constraint Region (diamond):
[Figure: elliptical loss contours and the diamond-shaped L1 constraint region in the (β₁, β₂) plane.]
The optimal β̂ often lies at a corner of the diamond, where one or more coefficients are exactly zero.
Result: automatic feature selection (sparse model).
Why Lasso Produces Sparsity
Definition: Lasso Regression
Least Absolute Shrinkage and Selection Operator (Lasso) regression adds an L1 penalty to the OLS loss. Unlike Ridge, Lasso can set coefficients exactly to zero, performing automatic feature selection. The L1 penalty's diamond-shaped constraint region has corners on the coordinate axes, and the loss ellipse often first touches the region at one of them, which sets the corresponding coefficients to exactly zero.
Theorem: Lasso Sparsity
The Lasso estimator is sparse: for sufficiently large $\lambda$, some coefficients are exactly zero, and the number of zero coefficients generally grows with $\lambda$. This is due to the non-differentiability of the $|\beta_j|$ term at zero, which creates corners in the constraint region.
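The sparsity mechanism is easiest to see in the special case of an orthonormal design ($X^\top X = I$), which is an assumption of this sketch and not the general setting. There, minimizing the squared-error loss plus $\lambda \sum_j |\beta_j|$ reduces to soft-thresholding each OLS coefficient at $\lambda / 2$, while the corresponding L2 penalty $\lambda \sum_j \beta_j^2$ rescales each coefficient by $1 / (1 + \lambda)$. The OLS coefficients below are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink each entry of z toward zero by t, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

z = np.array([3.0, -1.5, 0.4, -0.2, 0.0])   # hypothetical OLS coefficients
lam = 1.0

# Orthonormal-design solutions for the two penalties:
beta_lasso = soft_threshold(z, lam / 2)     # exact zeros once |z_j| <= lam / 2
beta_ridge = z / (1 + lam)                  # proportional shrinkage, no new zeros

print("OLS:  ", z)
print("Lasso:", beta_lasso)
print("Ridge:", beta_ridge)
print("zeros (Lasso):", int(np.sum(beta_lasso == 0)))
print("zeros (Ridge):", int(np.sum(beta_ridge == 0)))
```

Coefficients whose magnitude is below the threshold are set to exactly zero by the Lasso rule, whereas the Ridge rule only zeroes a coefficient that was already zero.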
from sklearn.linear_model import Lasso, LassoCV
# Lasso with cross-validation
lasso_cv = LassoCV(alphas=alphas, cv=5, random_state=42, max_iter=10000)
lasso_cv.fit(X_train_scaled, y_train)
print(f"Best alpha: {lasso_cv.alpha_:.4f}")
print(f"Test MSE: {mean_squared_error(y_test, lasso_cv.predict(X_test_scaled)):.4f}")
# Feature selection
n_zero = np.sum(lasso_cv.coef_ == 0)
print(f"\nFeatures eliminated: {n_zero}/{X.shape[1]}")
print("\nLasso Coefficients:")
for i, coef in enumerate(lasso_cv.coef_):
    if abs(coef) > 0.01:
        print(f" Feature {i:2d}: {coef:+.4f}")
ElasticNet: Best of Both Worlds
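Combines L1 and L2 penalties:
$$\mathcal{L}_{\text{enet}}(\beta) = \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 + \lambda \Big[ \rho \sum_{j=1}^{p} |\beta_j| + (1 - \rho) \sum_{j=1}^{p} \beta_j^2 \Big]$$
Where:
- $\lambda$ controls overall regularization strength
- $\rho$ (l1_ratio) controls the mix between L1 and L2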
ElasticNet Spectrum:
- ρ = 1.0: pure Lasso (feature selection)
- ρ = 0.5: balanced (feature selection + grouped effects)
- ρ = 0.0: pure Ridge (shrinks all features)
ElasticNet is preferred over Lasso when you have groups of correlated features. Lasso tends to arbitrarily select one feature from a correlated group and zero out the rest, while ElasticNet tends to keep or drop the group together. Use $\rho = 0.5$ as a balanced default, and increase it toward 1.0 if you want more feature selection.
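The grouping behaviour can be illustrated on an extreme case. The sketch below uses synthetic data in which two columns are exact duplicates (an assumption chosen to make the effect obvious) and fits scikit-learn's Lasso and ElasticNet with a tight tolerance. Note that scikit-learn scales its objective differently from the formula above (the squared error is divided by 2n and the L2 term by 2), so its `alpha` is not numerically identical to $\lambda$. The values `alpha=0.1` and `l1_ratio=0.5` are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
other = rng.normal(size=n)
X = np.column_stack([x, x, other])     # columns 0 and 1 are exact duplicates
y = 2.0 * x + 0.1 * rng.normal(size=n)

# A tight tolerance so the comparison is not blurred by early stopping.
lasso = Lasso(alpha=0.1, tol=1e-10, max_iter=100_000).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, tol=1e-10, max_iter=100_000).fit(X, y)

print("Lasso coefficients:     ", np.round(lasso.coef_, 3))
print("ElasticNet coefficients:", np.round(enet.coef_, 3))
```

With duplicated columns the Lasso solution is not unique, and coordinate descent typically leaves nearly all of the weight on one copy. The L2 part of the ElasticNet penalty makes the solution unique, and the two copies end up sharing the weight equally.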
from sklearn.linear_model import ElasticNet, ElasticNetCV
# ElasticNet with cross-validation
l1_ratios = [0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 1.0]
enet_cv = ElasticNetCV(
    l1_ratio=l1_ratios,
    alphas=alphas,
    cv=5,
    random_state=42,
    max_iter=10000
)
enet_cv.fit(X_train_scaled, y_train)
print(f"Best alpha: {enet_cv.alpha_:.4f}")
print(f"Best l1_ratio: {enet_cv.l1_ratio_:.4f}")
print(f"Test MSE: {mean_squared_error(y_test, enet_cv.predict(X_test_scaled)):.4f}")
n_zero = np.sum(enet_cv.coef_ == 0)
print(f"\nFeatures eliminated: {n_zero}/{X.shape[1]}")
Coefficient Path Visualization
from sklearn.linear_model import lasso_path
# Compute coefficient paths
alphas_path = np.logspace(-2, 4, 100)
# Ridge path: refit the model at each alpha
ridge_coefs = []
for a in alphas_path:
    ridge = Ridge(alpha=a)
    ridge.fit(X_train_scaled, y_train)
    ridge_coefs.append(ridge.coef_)
ridge_coefs = np.array(ridge_coefs)  # shape (n_alphas, n_features)
# Lasso path: lasso_path fits no intercept, so center y first.
# The returned coefficients have shape (n_features, n_alphas).
lasso_alphas, lasso_coefs, _ = lasso_path(
    X_train_scaled, y_train - y_train.mean(), alphas=alphas_path
)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Ridge paths
for i in range(X.shape[1]):
    axes[0].plot(np.log10(alphas_path), ridge_coefs[:, i], linewidth=0.8)
axes[0].axvline(x=np.log10(ridge_cv.alpha_), color='k', linestyle='--', label='CV alpha')
axes[0].set_xlabel('log10(alpha)')
axes[0].set_ylabel('Coefficient Value')
axes[0].set_title('Ridge Coefficient Paths')
axes[0].legend()
# Lasso paths: row i of lasso_coefs is the path of feature i
for i in range(X.shape[1]):
    axes[1].plot(np.log10(lasso_alphas), lasso_coefs[i], linewidth=0.8)
axes[1].axvline(x=np.log10(lasso_cv.alpha_), color='k', linestyle='--', label='CV alpha')
axes[1].set_xlabel('log10(alpha)')
axes[1].set_ylabel('Coefficient Value')
axes[1].set_title('Lasso Coefficient Paths')
axes[1].legend()
plt.tight_layout()
plt.savefig('regularization_paths.png', dpi=150)
plt.show()
Comparing Ridge, Lasso, and ElasticNet
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
housing = fetch_california_housing()
X_h, y_h = housing.data, housing.target
feature_names = housing.feature_names
X_h_train, X_h_test, y_h_train, y_h_test = train_test_split(
    X_h, y_h, test_size=0.2, random_state=42
)
scaler_h = StandardScaler()
X_h_train_s = scaler_h.fit_transform(X_h_train)
X_h_test_s = scaler_h.transform(X_h_test)
models = {
    # Plain least squares as the unregularized baseline
    'OLS': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.01),
    'ElasticNet': ElasticNet(alpha=0.01, l1_ratio=0.5),
}
print("California Housing - Model Comparison:")
print(f"{'Model':<15} {'Train R²':>10} {'Test R²':>10} {'Features':>10}")
print("-" * 50)
for name, model in models.items():
    model.fit(X_h_train_s, y_h_train)
    train_r2 = model.score(X_h_train_s, y_h_train)
    test_r2 = model.score(X_h_test_s, y_h_test)
    # Count coefficients that are not (numerically) zero
    n_features = np.sum(np.abs(model.coef_) > 1e-4)
    print(f"{name:<15} {train_r2:>10.4f} {test_r2:>10.4f} {n_features:>10}")
When to Use Each
from sklearn.feature_selection import SelectFromModel
def choose_regularization(X, y, n_samples, n_features):
    """Decision logic for regularization choice."""
    # High-dimensional, sparse signal
    if n_features > n_samples:
        print("Recommendation: Lasso or ElasticNet (L1)")
        return 'lasso'
    # Many correlated features
    corr_matrix = np.abs(np.corrcoef(X.T))
    # Off-diagonal entries above 0.8 (each correlated pair is counted twice)
    high_corr = (corr_matrix > 0.8).sum() - X.shape[1]
    if high_corr > X.shape[1] * 0.3:
        print("Recommendation: Ridge or ElasticNet (L2-heavy)")
        return 'ridge'
    # General case
    print("Recommendation: ElasticNet (balanced)")
    return 'elasticnet'
# Apply feature selection with Lasso
selector = SelectFromModel(
    Lasso(alpha=lasso_cv.alpha_),
    threshold='median'
)
X_train_selected = selector.fit_transform(X_train_scaled, y_train)
X_test_selected = selector.transform(X_test_scaled)
print(f"Features before selection: {X_train_scaled.shape[1]}")
print(f"Features after selection: {X_train_selected.shape[1]}")
print(f"Selected features: {np.where(selector.get_support())[0]}")
Comparison Table
┌──────────────┬──────────────────────┬────────────────────┬────────────────────────┐
│ Property     │ Ridge (L2)           │ Lasso (L1)         │ ElasticNet             │
├──────────────┼──────────────────────┼────────────────────┼────────────────────────┤
│ Penalty      │ Σβⱼ²                 │ Σ|βⱼ|              │ ρ·Σ|βⱼ| + (1-ρ)·Σβⱼ²   │
│ Sparsity     │ No (never exactly 0) │ Yes (exactly 0)    │ Yes                    │
│ Feature Sel. │ No                   │ Yes                │ Yes (grouped)          │
│ Multicollin. │ Handles well         │ Picks one          │ Groups correlated      │
│ Computation  │ Closed-form          │ No closed-form     │ No closed-form         │
│ Use Case     │ Many small effects   │ Few large effects  │ Mixed effects          │
└──────────────┴──────────────────────┴────────────────────┴────────────────────────┘
Key Takeaways
- Ridge (L2): Shrinks all coefficients; best when most features contribute
- Lasso (L1): Zeroes out coefficients; best for feature selection
- ElasticNet: Combines both; best when features are correlated or the structure is unknown
- Always standardize features before applying regularization
- Use cross-validation to select $\lambda$ (alpha) and $\rho$ (l1_ratio) rather than tuning by hand
- Coefficient paths reveal how regularization affects each feature
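The standardization and cross-validation takeaways can be combined in one estimator. The sketch below is one possible arrangement, not the only one: it chains `StandardScaler` and `LassoCV` with scikit-learn's `make_pipeline` on synthetic data, after rescaling one feature by a factor of 1000 (an arbitrary choice) to mimic mixed units. The scaler is fit on the training split only and reapplied to the test split automatically.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5,
    noise=10, random_state=0
)
X[:, 0] *= 1000.0   # put one feature on a much larger scale (illustrative)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Scaling happens inside the pipeline, so predict/score reuse the training statistics.
model = make_pipeline(
    StandardScaler(),
    LassoCV(cv=5, random_state=0, max_iter=10_000),
)
model.fit(X_train, y_train)

lasso = model.named_steps["lassocv"]   # make_pipeline names steps by lowercased class name
print(f"Selected alpha: {lasso.alpha_:.4f}")
print(f"Non-zero coefficients: {np.count_nonzero(lasso.coef_)}/{X.shape[1]}")
print(f"Test R^2: {model.score(X_test, y_test):.4f}")
```

In this arrangement the scaler is fit once on the whole training split before `LassoCV` runs its internal folds. If the scaling itself must be refit inside every fold, wrapping a `Lasso` pipeline in `GridSearchCV` is a common alternative.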
Summary: Regularization (Lasso, Ridge, ElasticNet)
- Regularization adds a penalty to the loss function to prevent overfitting, trading increased bias for reduced variance.
- The bias-variance decomposition explains why regularization works: with a well-chosen penalty, it reduces variance more than it increases squared bias.
- Ridge (L2) adds $\lambda \sum_j \beta_j^2$ and shrinks all coefficients toward zero but never sets them exactly to zero. Best when many features contribute small effects.
- Lasso (L1) adds $\lambda \sum_j |\beta_j|$ and can set coefficients exactly to zero, performing automatic feature selection. Best when few features dominate.
- ElasticNet combines L1 and L2: $\lambda \big[ \rho \sum_j |\beta_j| + (1 - \rho) \sum_j \beta_j^2 \big]$. Best for correlated feature groups, which it tends to select or drop together.
- The geometric intuition: L2 constraint is a circle (no corners, no sparsity); L1 constraint is a diamond (corners create sparsity).
- The matrix $X^\top X + \lambda I$ in the Ridge closed form is invertible for any $\lambda > 0$, even when $X^\top X$ is singular (multicollinearity).
- Always standardize features before regularization, because the penalty is scale-sensitive.
- Use cross-validation (RidgeCV, LassoCV, ElasticNetCV) to select $\lambda$ (alpha) and $\rho$ (l1_ratio).
- Coefficient paths visualize how each feature's coefficient changes with $\lambda$, which is useful for understanding feature importance.
Practice Exercises
- Feature Selection: Compare Ridge, Lasso, and ElasticNet on a dataset with 100 features but only 5 relevant. Which method recovers the true features?
- Multicollinearity: Create 3 highly correlated features. How does each method handle them?
- Alpha Sensitivity: Plot test R² vs log(alpha) for all three methods. Where do they diverge?
- Real Dataset: Apply regularized regression to a gene expression dataset (n << p). Which method performs best and why?