Lasso Regression (L1 Regularization)
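Lasso (Least Absolute Shrinkage and Selection Operator) adds an L1 penalty to the least-squares loss. In the scaling used by scikit-learn's `Lasso`, where `alpha` plays the role of λ:

$$
\hat{\beta} = \arg\min_{\beta} \; \frac{1}{2n}\lVert y - X\beta \rVert_2^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
$$

Unlike Ridge, Lasso produces sparse solutions: as λ grows, more and more coefficients are set to exactly zero.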
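To see where the exact zeros come from, it helps to look at a special case. The following is a worked sketch under an assumption not made in the text: the columns of $X$ are orthonormal after scaling, $\tfrac{1}{n}X^\top X = I$, and the objective is $\tfrac{1}{2n}\lVert y - X\beta\rVert_2^2 + \lambda\lVert\beta\rVert_1$. Writing $b = \tfrac{1}{n}X^\top y$ for the least-squares solution, the problem separates into one scalar problem per coordinate:

$$
\hat{\beta}_j = \arg\min_{\beta_j} \; \tfrac{1}{2}\beta_j^2 - b_j\beta_j + \lambda\lvert\beta_j\rvert
= \operatorname{sign}(b_j)\,\max\bigl(\lvert b_j\rvert - \lambda,\; 0\bigr)
$$

Any coefficient with $\lvert b_j\rvert \le \lambda$ is set to exactly zero. In the same setting, a ridge penalty $\tfrac{\lambda}{2}\lVert\beta\rVert_2^2$ gives $\hat{\beta}_j = b_j / (1 + \lambda)$, which shrinks every coefficient but reaches zero only if $b_j = 0$.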
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, LassoCV, Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, train_test_split
np.random.seed(42)
n, p = 150, 30 # 30 features, only 5 are truly relevant
X = np.random.randn(n, p)
true_beta = np.array([3, -2, 1.5, -1, 0.8] + [0]*25) # only 5 nonzero
y = X @ true_beta + np.random.randn(n)*1.5
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
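# Coefficient paths for Lasso and Ridge over a grid of penalties.
# Note: scikit-learn scales the two objectives differently, so the same
# alpha is not an equally strong penalty in both models.
lambdas = np.logspace(-3, 2, 200)
lasso_coefs = []
ridge_coefs = []
for lam in lambdas:
    lasso_pipe = Pipeline([('s', StandardScaler()), ('m', Lasso(alpha=lam, max_iter=10000))])
    ridge_pipe = Pipeline([('s', StandardScaler()), ('m', Ridge(alpha=lam))])
    lasso_pipe.fit(X_train, y_train)
    ridge_pipe.fit(X_train, y_train)
    lasso_coefs.append(lasso_pipe.named_steps['m'].coef_)
    ridge_coefs.append(ridge_pipe.named_steps['m'].coef_)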
lasso_coefs = np.array(lasso_coefs)
ridge_coefs = np.array(ridge_coefs)
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
# Lasso path: truly relevant features in red, irrelevant ones in gray
for j in range(p):
    axes[0].plot(np.log10(lambdas), lasso_coefs[:, j],
                 linewidth=1.5, color='red' if j < 5 else 'lightgray', alpha=0.7)
axes[0].set_xlabel('log₁₀(λ)')
axes[0].set_ylabel('Coefficient')
axes[0].set_title('Lasso Path: Sparse Solutions\n(coefficients become exactly 0)')
# Compare: sparsity at optimal λ
lasso_cv = Pipeline([('s', StandardScaler()), ('m', LassoCV(cv=5, random_state=42))])
lasso_cv.fit(X_train, y_train)
best_lam = lasso_cv.named_steps['m'].alpha_
best_coefs = lasso_cv.named_steps['m'].coef_
nonzero = (best_coefs != 0).sum()
axes[1].bar(range(p), best_coefs, color=['red' if j<5 else 'steelblue' for j in range(p)])
axes[1].axhline(0, color='black', linewidth=0.5)
axes[1].set_title(f'Lasso Coefficients (λ={best_lam:.4f})\n{nonzero}/{p} nonzero features selected')
axes[1].set_xlabel('Feature Index')
# Lasso vs Ridge: number of nonzero features
lasso_nonzero = [(lasso_coefs[i] != 0).sum() for i in range(len(lambdas))]
ridge_nonzero = [(ridge_coefs[i] != 0).sum() for i in range(len(lambdas))]
axes[2].plot(np.log10(lambdas), lasso_nonzero, 'r-', linewidth=2, label='Lasso')
axes[2].plot(np.log10(lambdas), ridge_nonzero, 'b-', linewidth=2, label='Ridge')
axes[2].set_title('Sparsity: Lasso vs Ridge')
axes[2].set_xlabel('log₁₀(λ)')
axes[2].set_ylabel('# Nonzero Coefficients')
axes[2].legend()
plt.tight_layout()
plt.savefig('lasso_regression.png', dpi=150)
plt.show()
test_mse_lasso = np.mean((y_test - lasso_cv.predict(X_test))**2)
print(f"Lasso: best λ={best_lam:.4f}, {nonzero} features selected, Test MSE={test_mse_lasso:.4f}")
print(f"True nonzero features: {(true_beta!=0).sum()}")
# Count features whose zero/nonzero status matches the true model
support_match = np.sum((best_coefs != 0) == (true_beta != 0))
print(f"Features with correct zero/nonzero status: {support_match}/{p}")
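The agreement count printed above lumps together two kinds of error: irrelevant features that were kept and relevant features that were dropped. A small sketch of how the two could be separated is shown here; `support_recovery` is a hypothetical helper, and the coefficient vectors are made-up toy values, not results from the script.

```python
import numpy as np

def support_recovery(est_coef, true_coef):
    """Return (true positives, false positives, false negatives) of the selected support."""
    selected = np.asarray(est_coef) != 0   # features the model kept
    relevant = np.asarray(true_coef) != 0  # features that truly matter
    tp = int(np.sum(selected & relevant))
    fp = int(np.sum(selected & ~relevant))
    fn = int(np.sum(~selected & relevant))
    return tp, fp, fn

# Toy, made-up coefficient vectors purely for illustration
true_coef = np.array([3.0, -2.0, 0.0, 0.0, 0.0])
est_coef = np.array([2.7, 0.0, 0.1, 0.0, 0.0])
tp, fp, fn = support_recovery(est_coef, true_coef)
print(f"true positives={tp}, false positives={fp}, false negatives={fn}")
```

With the variables from the script, `support_recovery(best_coefs, true_beta)` would give the same breakdown for the cross-validated fit.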
Ridge vs Lasso
| Aspect | Ridge | Lasso |
|---|---|---|
| Penalty | L2 (squared) | L1 (absolute) |
| Shrinkage | Toward 0, not to 0 | Can be exactly 0 |
| Feature selection | ❌ No | ✅ Yes (sparse) |
| Multicollinearity | Keeps all correlated features, shrinking them together | Tends to keep one of a correlated group, somewhat arbitrarily |
| Solution | Closed form | Iterative (coordinate descent) |
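The "iterative (coordinate descent)" entry can be sketched as follows. This is a minimal illustration, not scikit-learn's implementation: it assumes no intercept, the objective (1/(2n))‖y − Xw‖² + α‖w‖₁, and a fixed number of sweeps rather than a convergence check. `lasso_cd` and `soft_threshold` are hypothetical helper names, and the data are synthetic.

```python
import numpy as np
from sklearn.linear_model import Lasso

def soft_threshold(z, t):
    """Shrink scalar z toward zero by t; returns 0 when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, alpha, n_sweeps=200):
    """Cyclic coordinate descent for (1/(2n))||y - Xw||^2 + alpha*||w||_1 (no intercept)."""
    n, p = X.shape
    w = np.zeros(p)
    resid = y.astype(float).copy()       # current residual y - X @ w
    col_sq = (X ** 2).sum(axis=0) / n    # (1/n) * ||x_j||^2 for each column
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of feature j with the residual that excludes feature j
            rho = X[:, j] @ resid / n + col_sq[j] * w[j]
            w_new = soft_threshold(rho, alpha) / col_sq[j]
            resid -= X[:, j] * (w_new - w[j])  # keep the residual in sync
            w[j] = w_new
    return w

# Synthetic data: 2 relevant features out of 8
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
beta = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta + 0.5 * rng.standard_normal(100)

w_cd = lasso_cd(X, y, alpha=0.1)
w_sk = Lasso(alpha=0.1, fit_intercept=False, tol=1e-10, max_iter=10000).fit(X, y).coef_
print(np.round(w_cd, 3))
print("max abs difference from sklearn:", np.max(np.abs(w_cd - w_sk)))
```

Each coordinate update is a one-dimensional Lasso problem, solved in closed form by soft-thresholding, which is what makes the exact zeros appear. Ridge needs no such loop because its objective is smooth and can be solved as a single linear system.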
Key Takeaways
- Lasso performs feature selection: for large enough λ, some coefficients are exactly zero
- The L1 penalty creates sparsity because the L1 ball has corners on the coordinate axes, where the constrained solution tends to land
- Use Lasso when you believe few features truly matter (sparse true model)
- Use Ridge when all features contribute (dense true model)
- Elastic Net combines the L1 and L2 penalties, keeping sparsity while treating correlated features more evenly
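The difference in how the three penalties treat correlated features can be illustrated with an extreme case: two columns that are exact duplicates. The sketch below uses synthetic data, and the penalty strengths (`alpha`, `l1_ratio`) are arbitrary illustrative choices, not tuned values.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(0)
n = 200
x = rng.standard_normal(n)
noise_feats = rng.standard_normal((n, 3))
# Columns 0 and 1 are exact duplicates (an extreme case of multicollinearity)
X = np.column_stack([x, x, noise_feats])
y = 3.0 * x + 0.5 * rng.standard_normal(n)

models = {
    "Lasso": Lasso(alpha=0.1),
    "Ridge": Ridge(alpha=1.0),
    # Tight tolerance so the solver settles close to the optimum
    "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.5, tol=1e-10, max_iter=100000),
}
coefs = {}
for name, model in models.items():
    coefs[name] = model.fit(X, y).coef_
    print(f"{name:>10}: {np.round(coefs[name], 3)}")
```

By symmetry, Ridge assigns the two duplicates equal weight, and the L2 term in Elastic Net pushes it toward the same even split. How Lasso divides the weight between the duplicates is not unique and depends on the solver.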