Regression Analysis
ℹ️ Why It Matters
Regression models relationships between variables, enabling prediction and, when the study design supports it, causal inference. It is the foundation of predictive modeling, from simple trend lines to complex machine learning pipelines. Understanding regression assumptions and diagnostics helps ensure that model coefficients, p-values, and predictions are trustworthy. Without checking assumptions, regression results can be deeply misleading.
Overview
Simple linear regression models the relationship between one predictor and one outcome: $y = \beta_0 + \beta_1 x + \epsilon$. Multiple linear regression extends this to multiple predictors. Coefficients are estimated via ordinary least squares (OLS), which minimizes the sum of squared residuals. Key diagnostics include checking linearity (residuals vs. fitted plot), normality (Q-Q plot of residuals), homoscedasticity (constant residual variance), and independence (no autocorrelation). R² measures the proportion of variance explained; adjusted R² penalizes for adding predictors. Depending on which assumption fails, violations can bias coefficients, distort standard errors, or invalidate inference.
Key Concepts
Simple Linear Regression
$$y = \beta_0 + \beta_1 x + \epsilon$$
Here,
- $\beta_0$ = Intercept (value of y when x = 0)
- $\beta_1$ = Slope (change in y per unit change in x)
- $\epsilon$ = Error term, $\epsilon \sim N(0, \sigma^2)$
OLS Estimator for Slope
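$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
Here,
- $\hat{\beta}_1$ = Estimated slope (BLUE under Gauss-Markov)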
OLS Estimator for Intercept
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
Here,
- $\hat{\beta}_0$ = Estimated intercept
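The closed-form estimators can be checked numerically. The following is a minimal sketch in plain Python: the slope is the sum of cross-deviations divided by the sum of squared x-deviations, and the intercept is the mean of y minus the slope times the mean of x. The function name `fit_simple_ols` and the data are illustrative assumptions, not part of the lesson.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) from the closed-form OLS formulas."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator: sum of cross-deviations; denominator: sum of squared x-deviations.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("x must vary for the slope to be defined")
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up data lying exactly on y = 1 + 2x, so the fit should recover those values.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_simple_ols(x, y)
print(intercept, slope)
```

Because the data contain no noise, the recovered intercept and slope match the generating line; with real data the estimates would only approximate the underlying relationship.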
Multiple Linear Regression
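$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon$$
Here,
- $x_1, \dots, x_p$ = Predictor variables
- $\beta_1, \dots, \beta_p$ = Partial regression coefficients (effect of each predictor holding others constant)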
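With several predictors, the least-squares coefficients solve the normal equations $(X^\top X)\,b = X^\top y$, where $X$ is the design matrix with a leading column of ones. The sketch below does this in plain Python with a small Gaussian-elimination solver. The helper names and the data are hypothetical, and in practice a library least-squares routine would usually be preferable for numerical stability.

```python
def solve_linear_system(a, b):
    """Solve a @ z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be perfectly collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - tail) / m[i][i]
    return z


def fit_multiple_ols(rows, y):
    """Return [b0, b1, ..., bp] minimizing the sum of squared residuals."""
    design = [[1.0] + [float(v) for v in r] for r in rows]  # leading 1 = intercept
    k = len(design[0])
    # Normal equations: (X^T X) b = X^T y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up data generated exactly from y = 1 + 2*x1 + 3*x2.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 3)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
coefs = fit_multiple_ols(rows, y)
print([round(c, 6) for c in coefs])
```

The solver raises an error when a pivot is numerically zero, which is what perfectly collinear predictors produce.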
R-Squared
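$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$
Here,
- $SS_{res} = \sum_i (y_i - \hat{y}_i)^2$ = Sum of squared residuals
- $SS_{tot} = \sum_i (y_i - \bar{y})^2$ = Total sum of squares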
Adjusted R-Squared
$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$
Here,
- $p$ = Number of predictors
- $n$ = Sample size
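Both fit measures are short computations once fitted values are available. This sketch assumes the standard definitions: R² is one minus the ratio of the residual sum of squares to the total sum of squares, and the adjusted version rescales the unexplained share by (n - 1)/(n - p - 1). The function names and the small data set are illustrative assumptions.

```python
def r_squared(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n observations and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 observations")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Made-up observations and fitted values from a one-predictor line
# (y_hat = 2.2 + 0.6 * x for x = 1..5).
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]

r2 = r_squared(y, y_hat)
r2_adj = adjusted_r_squared(r2, n=len(y), p=1)
print(round(r2, 4), round(r2_adj, 4))
```

Here the residual sum of squares is 2.4 and the total is 6, so R² is 0.6, and the adjustment for one predictor with five observations lowers it to about 0.467.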
Diagnostic Checklist
| Assumption | What to Check | How to Check | Remedy if Violated |
|---|---|---|---|
| Linearity | Linear relationship | Residuals vs. fitted plot | Add polynomial terms, transforms |
| Normality | Residuals ~ Normal | Q-Q plot, Shapiro-Wilk | Transform, robust regression |
| Homoscedasticity | Constant variance | Residuals vs. fitted (funnel = bad) | Weighted least squares, robust SE |
| Independence | No autocorrelation | Durbin-Watson test | Time series models, mixed effects |
| No multicollinearity | Predictors not highly correlated | Variance inflation factor (VIF > 10 is a common warning threshold) | Remove/combine predictors, Ridge |
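Two of the checks in the table reduce to simple formulas. The sketch below computes the Durbin-Watson statistic from a residual series and the VIF for the special case of two predictors, where it equals 1 / (1 - r²) with r the correlation between them. The function names and numbers are hypothetical; this follows the textbook definitions rather than any particular library's API.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by sum of squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model: 1 / (1 - r^2)."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    r2 = s12 ** 2 / (s11 * s22)  # squared Pearson correlation
    return 1 / (1 - r2)


# Made-up residuals that drift from positive to negative (positive autocorrelation).
residuals = [0.5, 0.4, 0.3, -0.2, -0.4, -0.6]
dw = durbin_watson(residuals)

# Made-up predictors with correlation 0.8.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
vif = vif_two_predictors(x1, x2)
print(round(dw, 3), round(vif, 3))
```

A Durbin-Watson value near 2 is consistent with no first-order autocorrelation, while values well below 2 point to positive autocorrelation, as in this residual series. The VIF here is about 2.78, well under the threshold of 10 from the table.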
Quick Example
📝Multiple Regression Prediction
Model: . If , :
Each coefficient represents the change in the predicted $y$ for a one-unit increase in the predictor, holding all others constant. So $\beta_1 = 3$ means each unit increase in $x_1$ increases the predicted $y$ by 3 units, regardless of $x_2$.
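The "holding all others constant" reading can be verified directly. In this sketch only the slope of 3 comes from the example; the intercept, the second coefficient, the predictor values, and the assignment of the slope to the first predictor are hypothetical values chosen for illustration.

```python
# Hypothetical coefficients: only the slope of 3 is taken from the example text.
B0, B1, B2 = 10.0, 3.0, -2.0


def predict(x1, x2):
    """Prediction from a two-predictor linear model."""
    return B0 + B1 * x1 + B2 * x2


# Raise x1 by one unit at several fixed values of x2 and record the change.
deltas = []
for x2 in (0.0, 5.0, 20.0):
    delta = predict(5.0, x2) - predict(4.0, x2)
    deltas.append(delta)
    print(f"x2 = {x2}: change in prediction = {delta}")
```

The change is 3 at every value of the second predictor because the model is additive; a model with an interaction term would not have this property.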
📝R-Squared Interpretation
$R^2 = 0.85$ means 85% of the variance in $y$ is explained by the predictors. But adjusted $R^2$ is lower after penalizing for 10 predictors, suggesting some predictors may not be adding value.
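To make the penalty concrete, here is the arithmetic for $R^2 = 0.85$ with 10 predictors, using the standard adjustment $1 - (1 - R^2)(n - 1)/(n - p - 1)$. The sample sizes $n = 50$ and $n = 25$ are assumptions chosen for illustration, since the example does not state one.

$$R^2_{adj}\,\big|_{n=50} = 1 - \frac{(1 - 0.85)(50 - 1)}{50 - 10 - 1} = 1 - \frac{0.15 \times 49}{39} \approx 0.812$$

$$R^2_{adj}\,\big|_{n=25} = 1 - \frac{(1 - 0.85)(25 - 1)}{25 - 10 - 1} = 1 - \frac{0.15 \times 24}{14} \approx 0.743$$

The same 10 predictors cost more when there are fewer observations to support them.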
Key Takeaways
📋Summary: Regression Analysis
- Simple Linear Regression: $y = \beta_0 + \beta_1 x + \epsilon$. Slope $\beta_1$ captures the linear relationship between x and y.
- Multiple Regression: Extends to multiple predictors. Each $\beta_j$ is a partial effect (holding others constant).
- OLS: Minimizes $\sum_i (y_i - \hat{y}_i)^2$. Coefficient estimates are BLUE (Best Linear Unbiased Estimators) under Gauss-Markov assumptions.
- R²: Proportion of variance explained. Adjusted R² penalizes for number of predictors.
- Diagnostics: Always check linearity, normality, homoscedasticity, and independence before interpreting coefficients.
- Multicollinearity: Correlated predictors inflate standard errors. A VIF above 10 is a common warning threshold.
- Extrapolation: Predictions outside the range of observed x values are unreliable.
Deep Dive
For detailed explanations, worked examples, and Python implementations, explore the dedicated statistics lessons:
Simple Linear Regression
- Simple Linear Regression — Full derivation, OLS, geometric interpretation, and examples
OLS Estimation
- OLS Estimation — Gauss-Markov theorem, BLUE properties, matrix formulation, and efficiency
Assumptions
- Regression Assumptions — Gauss-Markov assumptions, what happens when they fail, and remedies
Diagnostics
- Residual Analysis — Residual plots, Q-Q plots, influence measures, Cook's distance, and leverage
- R-Squared and Adjusted R-Squared — Interpreting model fit, adjusted R², and information criteria
Multiple Regression
- Multiple Linear Regression — Extending to multiple predictors, interpretation, and variable selection
Related Topics
- Multicollinearity — Diagnosing and addressing correlated predictors
- Heteroscedasticity — Non-constant variance and robust standard errors
- Autocorrelation — Serial correlation in time series regression
- Polynomial Regression — Modeling non-linear relationships