R-Squared and Adjusted R-Squared — Measuring Model Fit


R-Squared (Coefficient of Determination)

R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} = 1 - \frac{\sum(y_i-\hat{y}_i)^2}{\sum(y_i-\bar{y})^2}

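R² measures the proportion of the variance in Y that the model explains. SS_res is the residual sum of squares (observed minus fitted values) and SS_tot is the total sum of squares around the mean of Y.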
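
To make the formula concrete, here is a minimal sketch that computes R² directly from the two sums of squares. It uses NumPy only, with a made-up data set and an arbitrary seed (both are assumptions for illustration, not part of the lesson's data). The value should agree with what `model.rsquared` reports for the same fit in statsmodels.

```python
import numpy as np

# Hypothetical data: a linear trend plus Gaussian noise (seed chosen arbitrarily)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 5 + rng.normal(0, 3, size=x.size)

# Least-squares line with an intercept
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

ss_res = np.sum((y - y_hat) ** 2)      # variation left unexplained by the line
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation around the mean of y
r2 = 1 - ss_res / ss_tot

# Reference point: predicting the mean for every observation gives R² = 0
r2_mean_only = 1 - np.sum((y - y.mean()) ** 2) / ss_tot

print(f"SS_res = {ss_res:.2f}, SS_tot = {ss_tot:.2f}")
print(f"R² (fitted line)    = {r2:.4f}")
print(f"R² (mean-only model) = {r2_mean_only:.4f}")
```

The mean-only comparison shows what the denominator represents: R² reports how much better the fitted line does than always predicting the average of Y.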

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

np.random.seed(42)

# Show R² visually for different strengths of relationship
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
noise_levels = [0.5, 2, 5, 10, 20, 50]
X = np.linspace(0, 10, 100)

for ax, noise in zip(axes.flat, noise_levels):
    y = 2*X + 5 + np.random.normal(0, noise, 100)
    model = sm.OLS(y, sm.add_constant(X)).fit()
    
    ax.scatter(X, y, alpha=0.4, s=20, color='steelblue')
    ax.plot(X, model.fittedvalues, 'r-', linewidth=2)
    ax.set_title(f'Noise SD = {noise}\nR² = {model.rsquared:.4f}')

plt.tight_layout()
plt.savefig('r_squared_examples.png', dpi=150)
plt.show()

Adjusted R-Squared

Problem with R²: Adding any predictor, even an irrelevant one, never decreases R² and almost always increases it.

Solution: Adjusted R² penalizes for the number of predictors p, given n observations:

R^2_{adj} = 1 - \frac{(1-R^2)(n-1)}{n-p-1}

  • If a new predictor improves fit beyond chance: R²_adj increases
  • If a new predictor adds noise: R²_adj decreases
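
The adjustment formula can be applied by hand, as in the sketch below. The helper name `r2_and_adjusted` and the simulated data are hypothetical, introduced only for illustration. The same quantity is reported by statsmodels as `rsquared_adj`.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """Return (R², adjusted R²) for an OLS fit of y on X plus an intercept."""
    n, p = X.shape                                   # p counts predictors, not the intercept
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    ss_res = np.sum((y - design @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, r2_adj

# Hypothetical data: one real predictor and one column unrelated to y
rng = np.random.default_rng(1)
n = 60
x1 = rng.normal(size=n)
y = 3 * x1 + rng.normal(size=n)
junk = rng.normal(size=(n, 1))

r2_a, adj_a = r2_and_adjusted(x1.reshape(-1, 1), y)
r2_b, adj_b = r2_and_adjusted(np.column_stack([x1, junk]), y)

print(f"1 predictor : R² = {r2_a:.4f}, adj R² = {adj_a:.4f}")
print(f"+ junk col  : R² = {r2_b:.4f}, adj R² = {adj_b:.4f}")
```

R² for the larger model cannot be lower than for the smaller one, because the models are nested. Whether adjusted R² falls depends on the particular random draw: a junk column can occasionally help by chance, which is why the bullets above are phrased in terms of improving fit "beyond chance".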

# Demonstrate: adding useless predictors never lowers R² but tends to lower adj-R²
import pandas as pd
from sklearn.datasets import make_regression

np.random.seed(42)
n = 100
X_real, y = make_regression(n_samples=n, n_features=3, noise=20, random_state=42)

# Draw the useless predictors once so that each model nests the previous one;
# redrawing them for every k would not guarantee a non-decreasing R²
X_noise = np.random.normal(0, 1, (n, 17))
X_all = np.column_stack([X_real, X_noise])

def compute_r2(y, X_cols):
    """Fit OLS with an intercept and return (R², adjusted R²)."""
    X_dm = sm.add_constant(X_cols)
    model = sm.OLS(y, X_dm).fit()
    return model.rsquared, model.rsquared_adj

results = []
for k in range(1, 21):
    # Columns 0-2 are the real predictors; columns 3 and up are pure noise
    r2, r2_adj = compute_r2(y, X_all[:, :k])
    results.append({'k': k, 'R2': r2, 'R2_adj': r2_adj})

results_df = pd.DataFrame(results)

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(results_df['k'], results_df['R2'], 'b-o', label='R²')
ax.plot(results_df['k'], results_df['R2_adj'], 'r-o', label='Adjusted R²')
ax.axvline(3, color='gray', linestyle=':', label='True # predictors')
ax.set_xlabel('Number of Predictors')
ax.set_ylabel('R² Value')
ax.set_title('R² Never Decreases; Adjusted R² Penalizes Irrelevant Predictors')
ax.legend()
ax.grid(True, alpha=0.3)
plt.savefig('r2_adj_comparison.png', dpi=150)
plt.show()

print("From the true model (k=3) onward, as useless predictors are added:")
print(results_df[results_df['k']>=3].to_string(index=False))

Limitations of R²

Limitation | Description
Domain-specific | R² = 0.30 can be good in social science but poor in engineering
Not for model comparison | R² can't be compared across models with different Y variables (including a transformed Y such as log Y)
Doesn't validate predictions | A high R² doesn't mean predictions are accurate
Sensitive to outliers | One influential point can change R² dramatically
Not appropriate for nonlinear models | Use a pseudo-R² instead (for example, with logistic regression)
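
The outlier limitation is easy to reproduce. The sketch below uses a small hand-made data set (the numbers are invented for illustration) in which X and Y are unrelated, then adds a single far-away point and refits.

```python
import numpy as np

def r_squared(x, y):
    """R² of a least-squares line (with intercept) fitted to x and y."""
    slope, intercept = np.polyfit(x, y, deg=1)
    ss_res = np.sum((y - (slope * x + intercept)) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Eight made-up points with no linear relationship between x and y
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([5, 3, 6, 4, 5, 3, 6, 4], dtype=float)

# The same points plus one influential observation far from the rest
x_out = np.append(x, 30.0)
y_out = np.append(y, 30.0)

r2_without = r_squared(x, y)
r2_with = r_squared(x_out, y_out)

print(f"R² without the outlier: {r2_without:.4f}")
print(f"R² with the outlier   : {r2_with:.4f}")
```

In this constructed case one added point should move R² from essentially zero to above 0.9, even though the other eight observations show no trend at all. A residual or leverage plot would expose the problem that the single summary number hides.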

Key Takeaways

  1. R² = proportion of Y variance explained: ranges from 0 to 1 for an OLS fit with an intercept, evaluated on the data used to fit it
  2. Adjusted R² penalizes for extra predictors: prefer it over R² when comparing models with different numbers of predictors
  3. R² = r² (Pearson r squared) for simple regression only
  4. High R² ≠ good model: check residuals, assumptions, and out-of-sample performance
  5. For formal model comparison, use AIC/BIC rather than adj-R²
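
Takeaway 3 can be checked numerically. The following sketch, on simulated data with an arbitrary seed (an assumption for illustration), compares the squared Pearson correlation with R² from a one-predictor least-squares fit.

```python
import numpy as np

# Hypothetical data for a simple (one-predictor) regression
rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=80)
y = 1.5 * x - 2 + rng.normal(0, 4, size=x.size)

# Pearson correlation between x and y
r = np.corrcoef(x, y)[0, 1]

# R² from the least-squares line with an intercept
slope, intercept = np.polyfit(x, y, deg=1)
ss_res = np.sum((y - (slope * x + intercept)) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"r   = {r:.6f}")
print(f"r²  = {r ** 2:.6f}")
print(f"R²  = {r2:.6f}")
```

The two values agree up to floating-point error. With more than one predictor there is no single Pearson r between Y and the predictors, so the identity in this form applies only to simple regression.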
