Lasso Regression (L1 Regularization) — Feature Selection

Lasso (Least Absolute Shrinkage and Selection Operator) adds an L1 penalty:

$$\text{Minimize: } \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda\sum_j|\beta_j|$$

Unlike Ridge, Lasso produces sparse solutions: as λ grows, many coefficients shrink to exactly zero, so fitting the model also selects features.
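To see where the exact zeros come from, it can help to look at a special case. Assuming an orthonormal design (an assumption made here for illustration only; the simulated data in this lesson does not satisfy it), the Lasso estimate is the soft-thresholded least-squares estimate, while Ridge only rescales it. The helper names `soft_threshold` and `ridge_shrink` and the input vector `ols` are hypothetical, and the threshold `t` stands in for the penalty strength up to a constant that depends on how the objective is scaled.

```python
import numpy as np

def soft_threshold(z, t):
    """Move z toward 0 by t; any value within [-t, t] becomes exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ridge_shrink(z, t):
    """Proportional shrinkage: values get smaller but stay nonzero unless z is 0."""
    return z / (1.0 + t)

# Hypothetical least-squares coefficients, chosen by hand for illustration
ols = np.array([3.0, -2.0, 0.4, -0.1, 0.0])

lasso_like = soft_threshold(ols, 0.5)
ridge_like = ridge_shrink(ols, 0.5)

print("lasso-like:", lasso_like)
print("ridge-like:", ridge_like)
print("exact zeros (lasso-like):", int((lasso_like == 0).sum()))
print("exact zeros (ridge-like):", int((ridge_like == 0).sum()))
```

The two small coefficients (0.4 and -0.1) fall inside the threshold and are set to zero by soft-thresholding, whereas the proportional shrinkage leaves them small but nonzero.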

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, LassoCV, Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, train_test_split

np.random.seed(42)
n, p = 150, 30  # 30 features, only 5 are truly relevant
X = np.random.randn(n, p)
true_beta = np.array([3, -2, 1.5, -1, 0.8] + [0]*25)  # only 5 nonzero
y = X @ true_beta + np.random.randn(n)*1.5

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

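# Coefficient paths over a grid of penalty strengths.
# sklearn names the penalty `alpha`. Lasso scales the squared loss by 1/(2n)
# and Ridge does not, so the same grid value is not the same effective penalty
# for the two models.
lambdas = np.logspace(-3, 2, 200)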
lasso_coefs = []
ridge_coefs = []
for lam in lambdas:
    l = Pipeline([('s', StandardScaler()), ('m', Lasso(alpha=lam, max_iter=10000))])
    r = Pipeline([('s', StandardScaler()), ('m', Ridge(alpha=lam))])
    l.fit(X_train, y_train)
    r.fit(X_train, y_train)
    lasso_coefs.append(l.named_steps['m'].coef_)
    ridge_coefs.append(r.named_steps['m'].coef_)

lasso_coefs = np.array(lasso_coefs)
ridge_coefs = np.array(ridge_coefs)

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Lasso path
for j in range(p):
    axes[0].plot(np.log10(lambdas), lasso_coefs[:, j],
                 linewidth=1.5, color='red' if j < 5 else 'lightgray', alpha=0.7)
axes[0].set_xlabel('log₁₀(λ)')
axes[0].set_ylabel('Coefficient')
axes[0].set_title('Lasso Path: Sparse Solutions\n(coefficients become exactly 0)')

# Compare: sparsity at optimal λ
lasso_cv = Pipeline([('s', StandardScaler()), ('m', LassoCV(cv=5, random_state=42))])
lasso_cv.fit(X_train, y_train)
best_lam = lasso_cv.named_steps['m'].alpha_
best_coefs = lasso_cv.named_steps['m'].coef_

nonzero = (best_coefs != 0).sum()
axes[1].bar(range(p), best_coefs, color=['red' if j<5 else 'steelblue' for j in range(p)])
axes[1].axhline(0, color='black', linewidth=0.5)
axes[1].set_title(f'Lasso Coefficients (λ={best_lam:.4f})\n{nonzero}/{p} nonzero features selected')
axes[1].set_xlabel('Feature Index')

# Lasso vs Ridge: number of nonzero features
lasso_nonzero = [(lasso_coefs[i] != 0).sum() for i in range(len(lambdas))]
ridge_nonzero = [(ridge_coefs[i] != 0).sum() for i in range(len(lambdas))]
axes[2].plot(np.log10(lambdas), lasso_nonzero, 'r-', linewidth=2, label='Lasso')
axes[2].plot(np.log10(lambdas), ridge_nonzero, 'b-', linewidth=2, label='Ridge')
axes[2].set_title('Sparsity: Lasso vs Ridge')
axes[2].set_xlabel('log₁₀(λ)')
axes[2].set_ylabel('# Nonzero Coefficients')
axes[2].legend()

plt.tight_layout()
plt.savefig('lasso_regression.png', dpi=150)
plt.show()

test_mse_lasso = np.mean((y_test - lasso_cv.predict(X_test))**2)
print(f"Lasso: best λ={best_lam:.4f}, {nonzero} features selected, Test MSE={test_mse_lasso:.4f}")
print(f"True nonzero features: {(true_beta!=0).sum()}")
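# Count features whose zero/nonzero status matches the true model
# (true zeros that were dropped count as matches too)
support_match = int(((best_coefs != 0) == (true_beta != 0)).sum())
print(f"Zero/nonzero status matches truth for {support_match}/{p} features")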
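The agreement count printed at the end of the script mixes two things: real features that were kept and irrelevant features that were dropped. A finer breakdown can be sketched as follows. `support_report` is a hypothetical helper, and the `estimated` vector is a made-up example, not output from the run above.

```python
import numpy as np

def support_report(est_coefs, true_coefs):
    """Compare the nonzero pattern of estimated coefficients with the true one."""
    est = np.asarray(est_coefs) != 0
    true = np.asarray(true_coefs) != 0
    tp = int((est & true).sum())    # relevant features that were kept
    fp = int((est & ~true).sum())   # irrelevant features that were kept
    fn = int((~est & true).sum())   # relevant features that were dropped
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}

# Same true coefficients as in the lesson's simulation
true_beta = np.array([3, -2, 1.5, -1, 0.8] + [0] * 25)

# Made-up estimate for illustration: four real features kept, one missed,
# two irrelevant features kept with small weights
estimated = np.zeros(30)
estimated[[0, 1, 2, 3]] = [2.8, -1.9, 1.2, -0.7]
estimated[[10, 17]] = [0.05, -0.03]

report = support_report(estimated, true_beta)
print(report)
```

To apply it to the fitted model, one would pass `best_coefs` in place of `estimated`.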

Ridge vs Lasso

| Aspect | Ridge | Lasso |
|---|---|---|
| Penalty | L2 (squared) | L1 (absolute) |
| Shrinkage | Toward 0, not to 0 | Can be exactly 0 |
| Feature selection | ❌ No | ✅ Yes (sparse) |
| Multicollinearity | Keeps all | Picks one arbitrarily |
| Solution | Closed form | Iterative (coordinate descent) |
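The last row of the comparison can be made concrete with a small NumPy sketch. It is a simplified illustration, not how scikit-learn implements either model: there is no intercept, no convergence check, and a fixed number of sweeps. The function names and the simulated data are assumptions. The Lasso objective used here is scaled as (1/(2n))·‖y − Xβ‖² + λ‖β‖₁.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge has a direct solution: solve (X^T X + lam*I) beta = X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/(2n))*||y - X b||^2 + lam*||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n   # (1/n) * x_j^T x_j for each column
    residual = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0:
                continue
            # Correlation of feature j with the residual that excludes feature j
            rho = X[:, j] @ residual / n + col_sq[j] * beta[j]
            # Soft-threshold, then rescale by the column's squared norm
            new_bj = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            # Keep the residual in sync with the updated coefficient
            residual += X[:, j] * (beta[j] - new_bj)
            beta[j] = new_bj
    return beta

# Small simulated problem: 2 relevant features, 4 irrelevant ones
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 6))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(100)

beta_ridge = ridge_closed_form(X, y, lam=1.0)
beta_lasso = lasso_coordinate_descent(X, y, lam=0.2)
print("ridge:", np.round(beta_ridge, 3))
print("lasso:", np.round(beta_lasso, 3))
```

Each coordinate update is a one-dimensional soft-thresholding step, which is why coefficients can land on exactly zero during the iteration.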

Key Takeaways

  1. Lasso performs feature selection — coefficients shrink to exactly zero
  2. The L1 penalty creates sparsity because the L1 ball has corners on the coordinate axes, where some coefficients are exactly zero
  3. Use Lasso when you believe few features truly matter (sparse true model)
  4. Use Ridge when all features contribute (dense true model)
  5. Elastic Net combines the L1 and L2 penalties, keeping some sparsity while handling correlated features more stably than Lasso alone
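As a hedged illustration of the Elastic Net takeaway, the sketch below fits scikit-learn's `Lasso` and `ElasticNet` on a small simulated problem with two nearly identical features. The data, the `alpha` value, and `l1_ratio=0.5` are arbitrary choices for demonstration, not tuned recommendations. How the weight is split between the two correlated features depends on the data and the penalty settings, so the printed coefficients are best inspected directly.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n = 200

# Two nearly identical features plus one irrelevant feature (simulated data)
x1 = rng.standard_normal(n)
x2 = x1 + 0.01 * rng.standard_normal(n)
irrelevant = rng.standard_normal(n)
X = np.column_stack([x1, x2, irrelevant])
y = x1 + x2 + 0.5 * rng.standard_normal(n)

# l1_ratio=1.0 would reduce ElasticNet to Lasso; 0.5 mixes L1 and L2 equally
lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print("Lasso coefficients:      ", np.round(lasso.coef_, 3))
print("Elastic Net coefficients:", np.round(enet.coef_, 3))
```

In practice both `alpha` and `l1_ratio` would typically be chosen by cross-validation, for example with `ElasticNetCV`.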
