R-Squared (Coefficient of Determination)
R² measures the proportion of variance in Y explained by the model.
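To make the definition concrete, the sketch below computes R² by hand as 1 - SS_res / SS_tot for a least-squares line fitted with `np.polyfit`. The helper name `r_squared` and the toy data are assumptions made for illustration; they are not part of the plotting example that follows.

```python
import numpy as np

def r_squared(y, y_hat):
    """R² = 1 - SS_res / SS_tot (illustrative helper, not a library function)."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)      # variation left unexplained by the model
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation around the mean of y
    return 1.0 - ss_res / ss_tot

# Toy data (assumed for illustration): a noisy linear trend
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 5 + rng.normal(0, 2, size=x.size)

slope, intercept = np.polyfit(x, y, deg=1)  # least-squares line
y_hat = slope * x + intercept

r2 = r_squared(y, y_hat)
# Predicting the mean of y for every point explains nothing, so R² is 0
r2_mean_only = r_squared(y, np.full_like(y, y.mean()))

print(f"R² of fitted line:          {r2:.3f}")
print(f"R² of mean-only prediction: {r2_mean_only:.3f}")
```

The mean-only baseline is the reference point: R² reports how much better the fitted line does than always predicting the average of Y.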
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
np.random.seed(42)
# Show R² visually for different strengths of relationship
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
noise_levels = [0.5, 2, 5, 10, 20, 50]
X = np.linspace(0, 10, 100)
for ax, noise in zip(axes.flat, noise_levels):
    y = 2*X + 5 + np.random.normal(0, noise, 100)
    model = sm.OLS(y, sm.add_constant(X)).fit()
    ax.scatter(X, y, alpha=0.4, s=20, color='steelblue')
    ax.plot(X, model.fittedvalues, 'r-', linewidth=2)
    ax.set_title(f'Noise SD = {noise}\nR² = {model.rsquared:.4f}')
plt.tight_layout()
plt.savefig('r_squared_examples.png', dpi=150)
plt.show()
Adjusted R-Squared
Problem with R²: Adding a predictor, even an irrelevant one, never decreases R² and almost always increases it, so R² alone rewards larger models.
Solution: Adjusted R² penalizes for the number of predictors p, given n observations:
R²_adj = 1 - (1 - R²)(n - 1) / (n - p - 1)
- If a new predictor improves fit beyond chance: R²_adj increases
- If a new predictor adds noise: R²_adj decreases
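As a rough illustration of the penalty, the sketch below applies the usual adjustment to a small model and to the same model padded with unrelated columns. The helpers `adjusted_r2` and `ols_r2` and the simulated data are assumptions for this example; whether adjusted R² actually drops in a given sample depends on the random draw.

```python
import numpy as np

def adjusted_r2(r2, n, p):
    """1 - (1 - R²)(n - 1)/(n - p - 1); p counts predictors, excluding the intercept."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def ols_r2(X, y):
    """R² of a least-squares fit with an intercept (illustrative helper)."""
    X_dm = np.column_stack([np.ones(len(y)), X])     # prepend intercept column
    beta, *_ = np.linalg.lstsq(X_dm, y, rcond=None)
    resid = y - X_dm @ beta
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

# Simulated data (assumed): one real predictor plus five columns of pure noise
rng = np.random.default_rng(1)
n = 60
x1 = rng.normal(size=n)
y = 3 * x1 + rng.normal(size=n)
junk = rng.normal(size=(n, 5))

r2_small = ols_r2(x1.reshape(-1, 1), y)
r2_big = ols_r2(np.column_stack([x1, junk]), y)

print(f"1 predictor : R² = {r2_small:.4f}, adj R² = {adjusted_r2(r2_small, n, 1):.4f}")
print(f"6 predictors: R² = {r2_big:.4f}, adj R² = {adjusted_r2(r2_big, n, 6):.4f}")
```

Because the larger model contains the smaller one, its R² cannot be lower; the adjustment is what lets the comparison account for the five extra columns.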
# Demonstrate: adding useless predictors never lowers R² but tends to lower adj-R²
import pandas as pd
from sklearn.datasets import make_regression
np.random.seed(42)
n = 100
X_real, y = make_regression(n_samples=n, n_features=3, noise=20, random_state=42)
# Draw the 17 random (useless) predictors once so each model nests the previous one;
# redrawing them at every k would make the R² values non-comparable
X_all = np.column_stack([X_real, np.random.normal(0, 1, (n, 17))])
def compute_r2(y, X_cols):
    """Fit OLS with an intercept and return (R², adjusted R²)."""
    X_dm = sm.add_constant(X_cols)
    model = sm.OLS(y, X_dm).fit()
    return model.rsquared, model.rsquared_adj
results = []
for k in range(1, 21):
    # Columns 0-2 are the real predictors; columns 3 and up are pure noise
    r2, r2_adj = compute_r2(y, X_all[:, :k])
    results.append({'k': k, 'R2': r2, 'R2_adj': r2_adj})
results_df = pd.DataFrame(results)
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(results_df['k'], results_df['R2'], 'b-o', label='R²')
ax.plot(results_df['k'], results_df['R2_adj'], 'r-o', label='Adjusted R²')
ax.axvline(3, color='gray', linestyle=':', label='True # predictors')
ax.set_xlabel('Number of Predictors')
ax.set_ylabel('R² Value')
ax.set_title('R² Never Decreases; Adjusted R² Penalizes Irrelevant Predictors')
ax.legend()
ax.grid(True, alpha=0.3)
plt.savefig('r2_adj_comparison.png', dpi=150)
plt.show()
print("R² and adjusted R² from the true model (k=3) onward:")
print(results_df[results_df['k']>=3].to_string(index=False))
Limitations of R²
| Limitation | Description |
|---|---|
| Domain-specific | R²=0.30 can be good in social science, terrible in engineering |
| Not comparable across responses | Can't compare R² between models with different Y variables (e.g., Y vs. log Y) |
| Doesn't validate predictions | A high in-sample R² doesn't guarantee accurate predictions on new data |
| Sensitive to outliers | One influential point can change R² dramatically |
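| Not appropriate for nonlinear models | The variance decomposition behind R² generally fails outside linear least squares; for GLMs such as logistic regression, use a pseudo-R² instead |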
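The outlier row can be illustrated with a small deterministic sketch. The data below are made up for this purpose: ten points with essentially no linear trend, then the same points plus one far-away observation. The helper `simple_r2` is a hypothetical name.

```python
import numpy as np

def simple_r2(x, y):
    """R² of a least-squares line through (x, y) (illustrative helper)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    resid = y - (slope * x + intercept)
    return 1.0 - np.sum(resid ** 2) / np.sum((y - np.mean(y)) ** 2)

# Ten points that alternate between +1 and -1: almost no linear relationship
x = np.arange(10, dtype=float)
y = np.where(x % 2 == 0, 1.0, -1.0)
r2_clean = simple_r2(x, y)

# Same points plus a single influential observation far from the rest
x_out = np.append(x, 100.0)
y_out = np.append(y, 100.0)
r2_outlier = simple_r2(x_out, y_out)

print(f"R² without the outlier: {r2_clean:.3f}")
print(f"R² with the outlier:    {r2_outlier:.3f}")
```

A single point is enough to move R² from near 0 to near 1 here, which is why residual and influence diagnostics are worth checking before trusting the number.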
Key Takeaways
- R² = proportion of Y variance explained; it ranges from 0 to 1 for OLS with an intercept, evaluated on the training data
- Adjusted R² penalizes extra predictors; use it for quick comparisons between models of the same Y
- R² = r² (Pearson r squared) for simple regression only
- High R² ≠ good model: check residuals, assumptions, and out-of-sample performance
- Use AIC/BIC for formal model comparison instead of adj-R²
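The identity between R² and the squared Pearson correlation in simple regression can be checked numerically. This is a minimal sketch on simulated data (the coefficients and noise level are arbitrary assumptions), using only NumPy.

```python
import numpy as np

# Simulated data (assumed for illustration): one predictor, linear trend plus noise
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=80)
y = 1.5 * x - 2 + rng.normal(0, 3, size=80)

# Simple linear regression with an intercept
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept
r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# Pearson correlation between x and y
pearson_r = np.corrcoef(x, y)[0, 1]

print(f"R²        = {r2:.6f}")
print(f"Pearson r² = {pearson_r ** 2:.6f}")
```

With more than one predictor there is no single r to square; the analogous quantity is the squared correlation between Y and the fitted values.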