Regularization: Ridge, Lasso and ElasticNet
The Problem of Overfitting
When we fit a model to training data, we want it to generalize well to unseen data. A model that memorizes training data but fails on test data is said to overfit.
Bias-Variance Tradeoff
The fundamental tension in supervised learning:
| Source | Description | Effect on Test Error |
|---|---|---|
| Bias | Error from wrong assumptions | High → underfitting |
| Variance | Error from sensitivity to training data | High → overfitting |
| Irreducible | Noise in the data | Cannot reduce |
As model complexity increases:
- Bias decreases (model captures more patterns)
- Variance increases (model becomes more sensitive)
- Total error follows a U-shaped curve
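The trend described above can be illustrated with a small experiment. This is a minimal sketch, assuming a hypothetical sine-shaped ground truth with Gaussian noise and using polynomial degree as the measure of model complexity; none of these choices come from the text itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)


def sample(n):
    """Draw n noisy observations of a (hypothetical) sine-shaped target."""
    x = rng.uniform(0, 1, size=(n, 1))
    y = np.sin(2 * np.pi * x[:, 0]) + rng.normal(scale=0.3, size=n)
    return x, y


x_train, y_train = sample(30)
x_test, y_test = sample(1000)

degrees = [1, 3, 6, 9]  # increasing model complexity
train_mse, test_mse = [], []
for d in degrees:
    model = make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
    model.fit(x_train, y_train)
    train_mse.append(mean_squared_error(y_train, model.predict(x_train)))
    test_mse.append(mean_squared_error(y_test, model.predict(x_test)))

for d, tr, te in zip(degrees, train_mse, test_mse):
    print(f"degree={d}: train MSE={tr:.3f}, test MSE={te:.3f}")
```

Training error can only go down as the degree grows, because each model contains the previous one. Test error typically falls at first and then rises again once the model starts fitting noise.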
Why Regularization?
Regularization addresses overfitting by penalizing model complexity. Instead of minimizing only the loss function, we minimize:
$$\min_{\beta} \; L(\beta) + \lambda R(\beta)$$
where:
- $\lambda \ge 0$ is the regularization strength (hyperparameter)
- $R(\beta)$ is the regularization term (penalty function)
- $\beta$ are the model parameters
Key insight: Regularization trades training performance for generalization.
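The penalized objective (loss plus regularization strength times penalty) can be written as a short function. This is a sketch that assumes a squared-error loss; the names `penalized_loss`, `l1_penalty` and `l2_penalty` are illustrative and not taken from any library.

```python
import numpy as np


def l2_penalty(beta):
    """Sum of squared coefficients (the Ridge penalty)."""
    return np.sum(beta ** 2)


def l1_penalty(beta):
    """Sum of absolute coefficients (the Lasso penalty)."""
    return np.sum(np.abs(beta))


def penalized_loss(beta, X, y, lam, penalty):
    """Squared-error loss plus lam times the penalty evaluated at beta."""
    residual = y - X @ beta
    return residual @ residual + lam * penalty(beta)


X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
beta = np.array([1.0, 2.0])  # fits y exactly, so only the penalty contributes

ridge_value = penalized_loss(beta, X, y, lam=0.1, penalty=l2_penalty)  # 0.1 * (1 + 4)
lasso_value = penalized_loss(beta, X, y, lam=0.1, penalty=l1_penalty)  # 0.1 * (1 + 2)
print(ridge_value, lasso_value)
```

Even a parameter vector with zero training error pays a cost proportional to its size, which is what pushes the minimizer toward smaller coefficients.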
Ridge Regression (L2 Regularization)
Formulation
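Ridge regression adds an L2 penalty (squared magnitude of coefficients):
$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2$$
Expanding the penalty term:
$$\|\beta\|_2^2 = \sum_{j=1}^{p} \beta_j^2$$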
Closed-Form Solution
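Taking the derivative with respect to $\beta$ and setting it to zero:
$$\hat{\beta}^{\text{ridge}} = (X^\top X + \lambda I)^{-1} X^\top y$$
Key properties:
- Always has a unique solution ($X^\top X + \lambda I$ is invertible for any $\lambda > 0$)
- Coefficients are shrunk toward zero but are, in general, never exactly zero
- Equivalent to OLS with the Gram matrix $X^\top X$ replaced by $X^\top X + \lambda I$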
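The closed-form solution can be checked numerically. This is a minimal sketch on synthetic data (the data and the value of `lam` are arbitrary), assuming the objective $\|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2$ with no intercept, which is what scikit-learn's `Ridge(alpha=lam, fit_intercept=False)` minimizes.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
beta_true = np.array([3.0, -2.0, 0.0, 1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=100)

lam = 10.0
p = X.shape[1]

# Solve (X^T X + lam * I) beta = X^T y rather than forming the inverse explicitly
beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# fit_intercept=False so both approaches solve exactly the same problem
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

# Plain least squares for comparison
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print("closed form matches sklearn:", np.allclose(beta_closed, beta_sklearn))
print("ridge norm:", np.linalg.norm(beta_closed), "OLS norm:", np.linalg.norm(beta_ols))
```

Using `np.linalg.solve` instead of an explicit matrix inverse is a common numerical choice; the result is the same formula.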
Geometric Interpretation
The Ridge solution is where the elliptical contours of the OLS loss meet the circular L2 constraint. Since circles are smooth, the intersection rarely occurs exactly on an axis.
Lasso Regression (L1 Regularization)
Formulation
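Lasso (Least Absolute Shrinkage and Selection Operator) adds an L1 penalty (absolute magnitude):
$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$$
Expanding:
$$\|\beta\|_1 = \sum_{j=1}^{p} |\beta_j|$$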
No Closed-Form Solution
Unlike Ridge, Lasso has no closed-form solution in general, because the absolute value is not differentiable at zero. Common solvers include:
- Coordinate descent
- Proximal gradient methods
- Subgradient methods
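Of the methods listed above, coordinate descent is the simplest to sketch. The version below assumes the objective $\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ with no intercept; the function names and the fixed number of sweeps are illustrative choices, not a production solver. The result is compared against scikit-learn's `Lasso`, which scales the squared error by $1/(2n)$, so its `alpha` corresponds to $\lambda / (2n)$.

```python
import numpy as np
from sklearn.linear_model import Lasso


def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0 when |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


def lasso_coordinate_descent(X, y, lam, n_sweeps=500):
    """Minimize ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)        # X_j^T X_j for every column
    residual = y - X @ beta
    for _ in range(n_sweeps):
        for j in range(p):
            residual += X[:, j] * beta[j]  # remove feature j from the fit
            rho = X[:, j] @ residual       # correlation with the partial residual
            beta[j] = soft_threshold(rho, lam / 2) / col_sq[j]
            residual -= X[:, j] * beta[j]  # put the updated contribution back
    return beta


rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=n)

lam = 40.0
beta_cd = lasso_coordinate_descent(X, y, lam)

# scikit-learn divides the squared error by 2n, hence alpha = lam / (2n)
beta_sklearn = Lasso(alpha=lam / (2 * n), fit_intercept=False,
                     tol=1e-10, max_iter=100_000).fit(X, y).coef_

print("coordinate descent:", beta_cd.round(4))
print("scikit-learn:      ", beta_sklearn.round(4))
```

Each one-dimensional subproblem has an exact solution, which is why coordinate descent works well here even though the full problem has no closed form.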
Sparsity Property
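The L1 penalty induces sparsity: it drives some coefficients to exactly zero. For an orthonormal design ($X^\top X = I$) and the objective $\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$, the solution has the closed form:
$$\hat{\beta}_j^{\text{lasso}} = S_{\lambda/2}\big(\hat{\beta}_j^{\text{OLS}}\big) = \operatorname{sign}\big(\hat{\beta}_j^{\text{OLS}}\big)\,\big(|\hat{\beta}_j^{\text{OLS}}| - \lambda/2\big)_+$$
where $S_t(\cdot)$ is the soft-thresholding operator with threshold $t$.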
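Soft-thresholding can be seen directly in the special case of an orthonormal design. The sketch below assumes the objective $\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$, for which the threshold is $\lambda/2$; the data is synthetic and the orthonormal columns come from a QR decomposition.

```python
import numpy as np
from sklearn.linear_model import Lasso


def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly 0 when |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)


rng = np.random.default_rng(1)
n, p = 50, 4
# Q has orthonormal columns, so Q^T Q = I
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))
y = Q @ np.array([5.0, -3.0, 0.2, 0.0]) + rng.normal(scale=0.1, size=n)

lam = 2.0
beta_ols = Q.T @ y                                # OLS solution when Q^T Q = I
beta_formula = soft_threshold(beta_ols, lam / 2)  # closed-form Lasso solution

# scikit-learn divides the squared error by 2n, hence alpha = lam / (2n)
beta_sklearn = Lasso(alpha=lam / (2 * n), fit_intercept=False,
                     tol=1e-12, max_iter=100_000).fit(Q, y).coef_

print("OLS:           ", beta_ols.round(3))
print("soft-threshold:", beta_formula.round(3))
print("scikit-learn:  ", beta_sklearn.round(3))
```

Coefficients whose OLS estimate is smaller in magnitude than the threshold are set to exactly zero, and the rest are moved toward zero by the threshold.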
Why does L1 produce sparsity?
The L1 constraint region has corners on the axes. The elliptical contours of the OLS loss are more likely to first touch the region at one of these corners, yielding solutions where some $\beta_j = 0$.
ElasticNet: Best of Both Worlds
Formulation
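ElasticNet combines L1 and L2 penalties:
$$\hat{\beta}^{\text{enet}} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2$$
Using the mixing parameter $\alpha \in [0, 1]$, with $\lambda_1 = \lambda\alpha$ and $\lambda_2 = \lambda(1 - \alpha)$:
$$\hat{\beta}^{\text{enet}} = \arg\min_{\beta} \; \|y - X\beta\|_2^2 + \lambda \left[ \alpha \|\beta\|_1 + (1 - \alpha) \|\beta\|_2^2 \right]$$
| Value | Behavior |
|---|---|
| $\alpha = 0$ | Pure Ridge (L2 only) |
| $\alpha = 1$ | Pure Lasso (L1 only) |
| $0 < \alpha < 1$ | ElasticNet |
Note that scikit-learn uses different names: its `alpha` is the overall strength (our $\lambda$, up to scaling) and its `l1_ratio` plays the role of the mixing parameter.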
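Because scikit-learn parameterizes ElasticNet differently, it helps to convert between the two forms. The sketch below assumes the objective $\|y - X\beta\|_2^2 + \lambda[\alpha\|\beta\|_1 + (1-\alpha)\|\beta\|_2^2]$ (the mixing parameter is called `mix` in the code) and scikit-learn's documented objective $\frac{1}{2n}\|y - Xw\|_2^2 + a\,r\,\|w\|_1 + \frac{1}{2} a (1 - r)\|w\|_2^2$ with `alpha` $= a$ and `l1_ratio` $= r$. The helper `to_sklearn_params` is hypothetical, derived by matching the two penalty terms.

```python
import numpy as np
from sklearn.linear_model import ElasticNet


def to_sklearn_params(lam, mix, n_samples):
    """Map (lam, mix) of the unscaled objective to scikit-learn's (alpha, l1_ratio)."""
    alpha = lam * (2.0 - mix) / (2.0 * n_samples)
    l1_ratio = mix / (2.0 - mix)
    return alpha, l1_ratio


def objective(beta, X, y, lam, mix):
    """||y - X b||^2 + lam * (mix * ||b||_1 + (1 - mix) * ||b||_2^2)."""
    residual = y - X @ beta
    penalty = mix * np.sum(np.abs(beta)) + (1 - mix) * np.sum(beta ** 2)
    return residual @ residual + lam * penalty


rng = np.random.default_rng(0)
n, p = 80, 6
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.5, size=n)

lam, mix = 20.0, 0.5
alpha, l1_ratio = to_sklearn_params(lam, mix, n)
beta_hat = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False,
                      tol=1e-10, max_iter=100_000).fit(X, y).coef_

# If the mapping is right, no nearby point should have a lower objective value
best = objective(beta_hat, X, y, lam, mix)
perturbed = [objective(beta_hat + rng.normal(scale=0.01, size=p), X, y, lam, mix)
             for _ in range(200)]
print("sklearn solution minimizes the unscaled objective:", best <= min(perturbed))
```

At the endpoints the mapping reduces to the expected special cases: `mix = 1` gives `l1_ratio = 1` (Lasso) and `mix = 0` gives `l1_ratio = 0` (Ridge).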
Advantages over Lasso
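- Grouped selection: When features are correlated, Lasso tends to select one of them arbitrarily; ElasticNet tends to select the whole group
- Strictly convex objective: For $\alpha < 1$ the L2 term makes the objective strictly convex, so the solution is unique (the penalty is still non-differentiable at zero because of the L1 term)
- More stable: Less sensitive to small changes in data
- $p > n$: Works when the number of features exceeds the number of samples, where Lasso can select at most $n$ features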
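The grouping behavior is easiest to see in the extreme case of two identical features. This is a small synthetic sketch; the `alpha` and `l1_ratio` values are arbitrary illustrations in scikit-learn's parameterization.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=n)
# Columns 0 and 1 are identical copies of the same signal; column 2 is pure noise
X = np.column_stack([signal, signal, rng.normal(size=n)])
y = 3.0 * signal + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.1, tol=1e-10, max_iter=100_000).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, tol=1e-10, max_iter=100_000).fit(X, y)

print("Lasso:     ", lasso.coef_.round(3))
print("ElasticNet:", enet.coef_.round(3))
```

With exactly duplicated columns the Lasso objective has many equally good solutions, so how the weight is divided between the copies depends on the solver. The L2 term in ElasticNet makes the solution unique and splits the weight equally between the two copies.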
Coefficient Shrinkage Visualization
Regularization Path
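The regularization path shows how each coefficient changes as $\lambda$ varies. In the usual plot, $\lambda$ increases from left to right (typically on a log scale), with one curve per coefficient.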
Key Observations
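- Left side ($\lambda$ small): All features included, model close to OLS
- Right side ($\lambda$ large): Most coefficients zero, simple model
- Sparsity: Lasso drives coefficients to exactly zero one after another as $\lambda$ grows
- $\lambda_{\min}$: Value that minimizes cross-validation error
- $\lambda_{1se}$: Largest $\lambda$ whose cross-validation error is within 1 SE of the error at $\lambda_{\min}$ (sparser model)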
Choosing Lambda: Cross-Validation
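We select $\lambda$ using k-fold cross-validation: the training data is split into $k$ folds, each fold is held out once for validation, and the $\lambda$ with the lowest average validation error is chosen:
$$\text{CV}(\lambda) = \frac{1}{k} \sum_{i=1}^{k} \text{MSE}_i(\lambda), \qquad \lambda_{\min} = \arg\min_{\lambda} \text{CV}(\lambda)$$
where $\text{MSE}_i(\lambda)$ is the validation error on fold $i$ of the model trained on the other $k - 1$ folds.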
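The k-fold procedure can be written out by hand to make each step explicit. This is a minimal sketch using Ridge on synthetic data; the grid of candidate values and the dataset are arbitrary assumptions, and in practice `RidgeCV`, `LassoCV` or `ElasticNetCV` do the same work.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=150, n_features=20, n_informative=5,
                       noise=15, random_state=0)

lambdas = np.logspace(-2, 3, 20)  # candidate regularization strengths
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

cv_error = []
for lam in lambdas:
    fold_mse = []
    for train_idx, val_idx in kfold.split(X):
        # Scaling is refit inside each fold so the validation fold stays unseen
        model = make_pipeline(StandardScaler(), Ridge(alpha=lam))
        model.fit(X[train_idx], y[train_idx])
        fold_mse.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))
    cv_error.append(np.mean(fold_mse))  # average validation error for this lambda

best_lambda = lambdas[int(np.argmin(cv_error))]
print(f"lambda with lowest CV error: {best_lambda:.4f}")
```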
Common Selection Strategies
| Strategy | Description | When to Use |
|---|---|---|
| $\lambda_{\min}$ | Minimizes CV error | Maximum predictive power |
| $\lambda_{1se}$ | Largest $\lambda$ with CV error within 1 SE of the minimum | Simpler, more interpretable model |
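scikit-learn's `LassoCV` reports only the error-minimizing value, but the one-standard-error choice can be computed from its `mse_path_` and `alphas_` attributes. This is a sketch on synthetic data; the variable names `alpha_min` and `alpha_1se` are illustrative, and the standard error is estimated across the folds.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=20, random_state=0)
# Scaled once up front for brevity; a Pipeline would avoid leaking fold statistics
X = StandardScaler().fit_transform(X)

lasso = LassoCV(alphas=np.logspace(-2, 2, 60), cv=5, max_iter=10_000).fit(X, y)

# mse_path_ has shape (n_alphas, n_folds); alphas_ holds the matching alpha values
n_folds = lasso.mse_path_.shape[1]
mean_mse = lasso.mse_path_.mean(axis=1)
se_mse = lasso.mse_path_.std(axis=1, ddof=1) / np.sqrt(n_folds)

i_min = int(np.argmin(mean_mse))
alpha_min = lasso.alphas_[i_min]

# Largest alpha whose mean CV error is within one SE of the minimum
within_one_se = mean_mse <= mean_mse[i_min] + se_mse[i_min]
alpha_1se = lasso.alphas_[within_one_se].max()

print(f"alpha_min = {alpha_min:.4f}, alpha_1se = {alpha_1se:.4f}")
```

Refitting `Lasso(alpha=alpha_1se)` then gives the sparser model, at the cost of a CV error that is statistically hard to tell apart from the minimum.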
Implementation in Python
Basic Setup
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso, ElasticNet, RidgeCV, LassoCV, ElasticNetCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score
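# Fix the seed so the synthetic data is reproducible
np.random.seed(42)
n_samples, n_features = 200, 50
# Sparse ground truth: only 10 of the 50 features are informative
X, y, true_coef = make_regression(
    n_samples=n_samples,
    n_features=n_features,
    n_informative=10,
    noise=20,
    coef=True,
    random_state=42
)
# Induce correlation between features with an AR(1) matrix, 0.8 ** |i - j|.
# A tridiagonal matrix with 0.8 next to the diagonal is not positive definite,
# so np.linalg.cholesky would raise LinAlgError on it.
idx = np.arange(n_features)
correlation_matrix = 0.8 ** np.abs(idx[:, None] - idx[None, :])
L = np.linalg.cholesky(correlation_matrix)
# y was generated from the uncorrelated X, so true_coef no longer applies to the mixed X
X = X @ L.T
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)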
Ridge Regression
ridge_cv = RidgeCV(
    alphas=np.logspace(-3, 3, 100),
    scoring='neg_mean_squared_error',
    cv=5
)
ridge_cv.fit(X_train_scaled, y_train)
print(f"Best alpha (Ridge): {ridge_cv.alpha_:.4f}")
y_pred_ridge = ridge_cv.predict(X_test_scaled)
print(f"Ridge R²: {r2_score(y_test, y_pred_ridge):.4f}")
print(f"Ridge MSE: {mean_squared_error(y_test, y_pred_ridge):.4f}")
print(f"Non-zero coefficients: {np.sum(ridge_cv.coef_ != 0)}/{n_features}")
Lasso Regression
lasso_cv = LassoCV(
    alphas=np.logspace(-3, 1, 100),
    cv=5,
    max_iter=10000,
    random_state=42
)
lasso_cv.fit(X_train_scaled, y_train)
print(f"Best alpha (Lasso): {lasso_cv.alpha_:.4f}")
y_pred_lasso = lasso_cv.predict(X_test_scaled)
print(f"Lasso R²: {r2_score(y_test, y_pred_lasso):.4f}")
print(f"Lasso MSE: {mean_squared_error(y_test, y_pred_lasso):.4f}")
print(f"Non-zero coefficients: {np.sum(lasso_cv.coef_ != 0)}/{n_features}")
print(f"Zero coefficients: {np.sum(lasso_cv.coef_ == 0)}/{n_features}")
selected_features = np.where(lasso_cv.coef_ != 0)[0]
print(f"Selected features: {selected_features}")
ElasticNet
elastic_cv = ElasticNetCV(
    l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 1.0],
    alphas=np.logspace(-3, 1, 100),
    cv=5,
    max_iter=10000,
    random_state=42
)
elastic_cv.fit(X_train_scaled, y_train)
print(f"Best alpha: {elastic_cv.alpha_:.4f}")
print(f"Best l1_ratio: {elastic_cv.l1_ratio_:.4f}")
y_pred_elastic = elastic_cv.predict(X_test_scaled)
print(f"ElasticNet R²: {r2_score(y_test, y_pred_elastic):.4f}")
print(f"ElasticNet MSE: {mean_squared_error(y_test, y_pred_elastic):.4f}")
print(f"Non-zero coefficients: {np.sum(elastic_cv.coef_ != 0)}/{n_features}")
Visualizing the Regularization Path
def plot_regularization_path(X, y, alphas):
    """Plot coefficient paths for Ridge, Lasso and ElasticNet side by side."""
    fig, axes = plt.subplots(1, 3, figsize=(15, 5))
    # A higher max_iter avoids convergence warnings at small alpha
    models = [
        ('Ridge', Ridge()),
        ('Lasso', Lasso(max_iter=10000)),
        ('ElasticNet', ElasticNet(l1_ratio=0.5, max_iter=10000))
    ]
    for ax, (name, model) in zip(axes, models):
        # Refit at every alpha and record the coefficients
        coefs = []
        for a in alphas:
            model.set_params(alpha=a)
            model.fit(X, y)
            coefs.append(model.coef_.copy())
        coefs = np.array(coefs)
        # One curve per feature
        for i in range(coefs.shape[1]):
            ax.plot(np.log10(alphas), coefs[:, i], linewidth=0.8)
        ax.set_xlabel('log10(alpha)')
        ax.set_ylabel('Coefficient Value')
        ax.set_title(f'{name} Path')
        ax.axhline(y=0, color='k', linestyle='--', linewidth=0.5)
        ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig('regularization_paths.png', dpi=150, bbox_inches='tight')
    plt.show()
alphas = np.logspace(-3, 3, 200)
plot_regularization_path(X_train_scaled, y_train, alphas)
Comparing Models
def compare_regularization(X_train, y_train, X_test, y_test):
    """Fit Ridge, Lasso and ElasticNet with CV and collect test metrics."""
    results = {}
    ridge = RidgeCV(
        alphas=np.logspace(-3, 3, 100),
        scoring='neg_mean_squared_error',
        cv=5
    )
    ridge.fit(X_train, y_train)
    results['Ridge'] = {
        'alpha': ridge.alpha_,
        'r2': r2_score(y_test, ridge.predict(X_test)),
        'mse': mean_squared_error(y_test, ridge.predict(X_test)),
        'n_nonzero': np.sum(ridge.coef_ != 0)
    }
    lasso = LassoCV(
        alphas=np.logspace(-3, 1, 100),
        cv=5,
        max_iter=10000,
        random_state=42
    )
    lasso.fit(X_train, y_train)
    results['Lasso'] = {
        'alpha': lasso.alpha_,
        'r2': r2_score(y_test, lasso.predict(X_test)),
        'mse': mean_squared_error(y_test, lasso.predict(X_test)),
        'n_nonzero': np.sum(lasso.coef_ != 0)
    }
    elastic = ElasticNetCV(
        l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 0.95, 1.0],
        alphas=np.logspace(-3, 1, 100),
        cv=5,
        max_iter=10000,
        random_state=42
    )
    elastic.fit(X_train, y_train)
    results['ElasticNet'] = {
        'alpha': elastic.alpha_,
        'l1_ratio': elastic.l1_ratio_,
        'r2': r2_score(y_test, elastic.predict(X_test)),
        'mse': mean_squared_error(y_test, elastic.predict(X_test)),
        'n_nonzero': np.sum(elastic.coef_ != 0)
    }
    return results
results = compare_regularization(X_train_scaled, y_train, X_test_scaled, y_test)
for model, metrics in results.items():
    print(f"\n{model}:")
    print(f" alpha = {metrics['alpha']:.4f}")
    if 'l1_ratio' in metrics:
        print(f" l1_ratio = {metrics['l1_ratio']:.2f}")
    print(f" R2 = {metrics['r2']:.4f}")
    print(f" MSE = {metrics['mse']:.4f}")
    print(f" Non-zero coefficients: {metrics['n_nonzero']}/{n_features}")
When to Use Each Method
| Scenario | Recommended Method | Reason |
|---|---|---|
| Many small effects | Ridge | Keeps all features, reduces magnitude |
| Few strong predictors | Lasso | Automatic feature selection |
| Correlated features | ElasticNet | Grouped selection, stability |
| High-dimensional ($p > n$) | ElasticNet | Handles collinearity, selects features |
| Interpretability needed | Lasso | Sparse model |
| Maximum accuracy needed | Ridge/ElasticNet | Depends on data structure |
Practical Guidelines
1. Always Standardize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
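When cross-validation is involved, one way to keep the scaling step honest is to wrap it in a pipeline, so the scaler is refit on the training part of every split. This is a minimal sketch on synthetic data; the feature rescaling is a made-up unit mismatch added for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=20, noise=10, random_state=0)
# Give the features very different scales (hypothetical unit mismatch)
X = X * np.logspace(-2, 2, X.shape[1])

# The scaler is refit on the training part of every CV split
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"mean CV R^2: {scores.mean():.3f}")
```

Without standardization the penalty would hit features measured in small units much harder than features measured in large units, since the size of a coefficient depends on the scale of its feature.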
2. Start with Cross-Validation
from sklearn.linear_model import ElasticNetCV
model = ElasticNetCV(
    l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1.0],
    alphas=np.logspace(-4, 2, 50),
    cv=5,
    max_iter=10000
)
model.fit(X_train_scaled, y_train)
3. Examine the Regularization Path
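from sklearn.linear_model import lasso_path
# lasso_path does not fit an intercept, so center y (X_train_scaled is already centered)
alphas, coefs, _ = lasso_path(X_train_scaled, y_train - y_train.mean(), alphas=np.logspace(-3, 1, 50))
# coefs has shape (n_features, n_alphas): count the non-zero coefficients at each alpha
n_nonzero = [np.sum(np.abs(c) > 1e-10) for c in coefs.T]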
plt.figure(figsize=(8, 5))
plt.plot(np.log10(alphas), n_nonzero)
plt.xlabel('log10(alpha)')
plt.ylabel('Number of Non-zero Coefficients')
plt.title('Feature Selection vs Regularization Strength')
plt.grid(True, alpha=0.3)
plt.show()
4. Compare with OLS
from sklearn.linear_model import LinearRegression
ols = LinearRegression()
ols.fit(X_train_scaled, y_train)
print("Model Comparison:")
print(f" OLS: R2 = {r2_score(y_test, ols.predict(X_test_scaled)):.4f}")
print(f" Ridge: R2 = {r2_score(y_test, ridge_cv.predict(X_test_scaled)):.4f}")
print(f" Lasso: R2 = {r2_score(y_test, lasso_cv.predict(X_test_scaled)):.4f}")
print(f" ElasticNet: R2 = {r2_score(y_test, elastic_cv.predict(X_test_scaled)):.4f}")
Summary
| Property | Ridge (L2) | Lasso (L1) | ElasticNet |
|---|---|---|---|
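| Penalty | $\lambda \lVert\beta\rVert_2^2$ | $\lambda \lVert\beta\rVert_1$ | $\lambda_1 \lVert\beta\rVert_1 + \lambda_2 \lVert\beta\rVert_2^2$ |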
| Sparsity | No | Yes | Yes |
| Feature Selection | No | Yes | Yes |
| Correlated Features | Keeps all | Selects one | Selects group |
| Solution | Closed-form | Iterative | Iterative |
| When to Use | Many small effects | Few strong predictors | Mixed scenarios |
Key Takeaways:
- Regularization prevents overfitting by penalizing model complexity
- Ridge shrinks coefficients but keeps all features
- Lasso performs automatic feature selection via sparsity
- ElasticNet combines both benefits and is often a good default, especially when features are correlated
- Always use cross-validation to select the regularization strength
- Standardize features before applying regularization