R-Squared (Coefficient of Determination)

Measuring How Well Your Model Fits the Data

R-squared quantifies the proportion of variance explained by your model, while adjusted R-squared penalizes for unnecessary predictors. Together they provide a balanced view of model quality versus complexity.

Model Comparison — Choose between competing models with different numbers of predictors
Business Reporting — Communicate model performance to stakeholders in intuitive percentages
Feature Engineering — Evaluate whether adding variables actually improves explanatory power

Explained variance matters, but parsimony prevents overfitting.

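R² measures the proportion of variance in Y explained by the model: R² = 1 − SS_res / SS_tot, where SS_res is the residual sum of squares and SS_tot is the total sum of squares of Y around its mean.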
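As a minimal sketch of that definition, the snippet below computes R² by hand from the residual and total sums of squares. The helper name `r_squared` and the synthetic data are illustrative assumptions, not part of any library. For a straight-line fit with an intercept, the result should match the squared correlation between X and Y.

```python
import numpy as np

def r_squared(y, y_hat):
    """Hypothetical helper: R² = 1 - SS_res / SS_tot for fitted values y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)        # variation left unexplained by the model
    ss_tot = np.sum((y - np.mean(y)) ** 2)   # total variation of y around its mean
    return 1.0 - ss_res / ss_tot

# Synthetic data (assumed for illustration): a linear trend plus Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 2 * x + 5 + rng.normal(0, 2, size=x.size)

# Least-squares line with an intercept
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

r2 = r_squared(y, y_hat)
r = np.corrcoef(x, y)[0, 1]
print(f"R² from sums of squares: {r2:.4f}")
print(f"Squared correlation:     {r ** 2:.4f}")
```

The same number is what statsmodels reports as `rsquared` for an OLS fit that includes a constant.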


import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

np.random.seed(42)

# Show R² visually for different strengths of relationship
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
noise_levels = [0.5, 2, 5, 10, 20, 50]
X = np.linspace(0, 10, 100)

for ax, noise in zip(axes.flat, noise_levels):
    y = 2*X + 5 + np.random.normal(0, noise, 100)
    model = sm.OLS(y, sm.add_constant(X)).fit()

    ax.scatter(X, y, alpha=0.4, s=20, color='steelblue')
    ax.plot(X, model.fittedvalues, 'r-', linewidth=2)
    ax.set_title(f'Noise SD = {noise}\nR² = {model.rsquared:.4f}')

plt.tight_layout()
plt.savefig('r_squared_examples.png', dpi=150)
plt.show()

Adjusted R-Squared

Problem with R²: Adding a predictor, even an irrelevant one, never decreases R² and almost always increases it, so R² alone rewards model complexity.

Solution: Adjusted R² penalizes for the number of predictors p:

R²_adj = 1 − (1 − R²)(n − 1) / (n − p − 1)

where n is the number of observations.

If a new predictor improves fit by more than chance alone would (for a single added predictor, |t| > 1): R²_adj increases
If a new predictor adds only noise: R²_adj usually decreases
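To make the penalty concrete, here is a small sketch of the standard adjusted R² formula, 1 − (1 − R²)(n − 1)/(n − p − 1), with n observations and p predictors (excluding the intercept). The function name `adjusted_r2` and the example values are assumptions chosen for illustration.

```python
def adjusted_r2(r2, n, p):
    """Hypothetical helper: adjusted R² for n observations and p predictors
    (the intercept is not counted in p)."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 for adjusted R² to be defined")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hold R² fixed at 0.80 with n = 100 and vary the number of predictors:
# the same raw R² is worth less as the model gets more complex.
for p in (3, 10, 20):
    print(f"p = {p:2d}  adjusted R² = {adjusted_r2(0.80, n=100, p=p):.5f}")
```

For an OLS fit that includes a constant, this should agree with the `rsquared_adj` attribute that statsmodels reports.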


# Demonstrate: adding useless predictors never lowers R², but tends to lower adjusted R²
import pandas as pd
from sklearn.datasets import make_regression

np.random.seed(42)
n = 100
X_real, y = make_regression(n_samples=n, n_features=3, noise=20, random_state=42)

# Draw the 17 random (useless) predictors once, so each model is nested in
# the next one and R² cannot decrease as k grows
X_noise = np.random.normal(0, 1, (n, 17))
X_all = np.column_stack([X_real, X_noise])

def compute_r2(y, X_cols):
    """Fit OLS with an intercept and return (R², adjusted R²)."""
    X_dm = sm.add_constant(X_cols)
    model = sm.OLS(y, X_dm).fit()
    return model.rsquared, model.rsquared_adj

results = []
for k in range(1, 21):
    # Columns 0-2 are the real predictors; columns 3 and up are pure noise
    r2, r2_adj = compute_r2(y, X_all[:, :k])
    results.append({'k': k, 'R2': r2, 'R2_adj': r2_adj})

results_df = pd.DataFrame(results)

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(results_df['k'], results_df['R2'], 'b-o', label='R²')
ax.plot(results_df['k'], results_df['R2_adj'], 'r-o', label='Adjusted R²')
ax.axvline(3, color='gray', linestyle=':', label='True # predictors')
ax.set_xlabel('Number of Predictors')
ax.set_ylabel('R² Value')
ax.set_title('R² Never Decreases; Adjusted R² Penalizes Irrelevant Predictors')
ax.legend()
ax.grid(True, alpha=0.3)
plt.savefig('r2_adj_comparison.png', dpi=150)
plt.show()

print("Adding useless predictors (k > 3; k = 3 is the true model):")
print(results_df[results_df['k'] >= 3].to_string(index=False))

Limitations of R²

| Limitation | Description |
|-----------|-------------|
| Domain-specific | R² = 0.30 can be good in social science but poor in engineering |
| Not comparable across responses | R² can't be compared between models fitted to different Y variables (e.g., Y vs. log Y) |
| Doesn't validate predictions | A high in-sample R² doesn't guarantee accurate predictions on new data |
| Sensitive to outliers | One influential point can change R² dramatically |
| Not appropriate for nonlinear models | For models such as logistic regression, use a pseudo-R² instead |
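The outlier row can be illustrated with a short sketch. The data below are a deterministic toy set invented for this example (a near-perfect line, plus one high-leverage point added afterwards), and `line_r2` is a hypothetical helper rather than a library function.

```python
import numpy as np

def line_r2(x, y):
    """Hypothetical helper: R² of a least-squares straight line through (x, y)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    resid = y - (slope * x + intercept)
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

# Toy data (assumed): y = 2x + 5 with a small alternating +1 / -1 wiggle
x = np.arange(20, dtype=float)
y = 2 * x + 5 + np.where(np.arange(20) % 2 == 0, 1.0, -1.0)

# Append a single influential point far from the rest of the data
x_out = np.append(x, 40.0)
y_out = np.append(y, 0.0)

r2_clean = line_r2(x, y)
r2_outlier = line_r2(x_out, y_out)
print(f"R² without the outlier: {r2_clean:.3f}")
print(f"R² with one outlier:    {r2_outlier:.3f}")
```

One point out of 21 is enough to take the fit from nearly perfect to almost no explained variance, which is why residual and leverage diagnostics are worth checking alongside R².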
