Statistics

Regression Analysis

Master simple and multiple linear regression, OLS estimation, assumptions, diagnostics, and applications.

📂 Regression · 📖 Lesson 52 of 100 · 🎓 Free Course


ℹ️ Why It Matters

Regression models relationships between variables, enabling prediction and, when the study design supports it, causal inference. It is the foundation of predictive modeling, from simple trend lines to complex machine learning pipelines. Understanding regression assumptions and diagnostics helps ensure that model coefficients, p-values, and predictions are trustworthy. Without checking assumptions, regression results can be deeply misleading.


Overview

Simple linear regression models the relationship between one predictor and one outcome: $y = \beta_0 + \beta_1 x + \epsilon$. Multiple linear regression extends this to several predictors. Coefficients are estimated via ordinary least squares (OLS), which minimizes the sum of squared residuals. Key diagnostics include checking linearity (residuals vs. fitted plot), normality (Q-Q plot of residuals), homoscedasticity (constant residual variance), and independence (no autocorrelation). $R^2$ measures the proportion of variance explained; adjusted $R^2$ penalizes for adding predictors. Depending on which assumption fails, violations can lead to biased coefficients, incorrect standard errors, or invalid inference.


Key Concepts

Simple Linear Regression

$$y = \beta_0 + \beta_1 x + \epsilon$$

Here,

  • $\beta_0$ = Intercept (value of $y$ when $x = 0$)
  • $\beta_1$ = Slope (change in $y$ per unit change in $x$)
  • $\epsilon$ = Error term, $\epsilon \sim N(0, \sigma^2)$

OLS Estimator for Slope

$$\hat{\beta}_1 = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2}$$

Here,

  • $\hat{\beta}_1$ = Estimated slope (BLUE under Gauss-Markov)

OLS Estimator for Intercept

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$$

Here,

  • $\hat{\beta}_0$ = Estimated intercept
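The two closed-form estimators above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `fit_simple_ols` and the toy data are assumptions made for this example, not part of the lesson.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) from the closed-form OLS formulas."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope estimator.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    # The fitted line always passes through (x_bar, y_bar).
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical toy data, chosen only to keep the arithmetic easy to follow.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
intercept, slope = fit_simple_ols(x, y)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

For this toy data the means are $\bar{x} = 3$ and $\bar{y} = 4$, the slope numerator is 6 and the denominator is 10, so the fit is approximately $\hat{y} = 2.2 + 0.6x$.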

Multiple Linear Regression

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon$$

Here,

  • $x_1, \ldots, x_p$ = Predictor variables
  • $\beta_1, \ldots, \beta_p$ = Partial regression coefficients (effect of each predictor holding others constant)
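With several predictors, the OLS coefficients solve the normal equations $(X^\top X)\hat{\beta} = X^\top y$, where $X$ is the design matrix with a leading column of ones. The sketch below solves them with plain Gaussian elimination so that it runs on the standard library alone; in practice a numerical library would normally be used instead. The function names and the toy data are assumptions for illustration only.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [a | b] so the inputs are left untouched.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_multiple_ols(rows, y):
    """Return [b0, b1, ..., bp] minimizing the sum of squared residuals."""
    # Prepend a column of ones so the first coefficient is the intercept.
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical noise-free data generated from y = 2 + 3*x1 - x2.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [3, 7, 7, 11, 12]
coefficients = fit_multiple_ols(rows, y)
print([round(c, 6) for c in coefficients])
```

Because the toy responses contain no noise, the fit recovers the generating coefficients (2, 3, and -1) up to floating-point error. The normal equations break down when predictors are perfectly collinear, which is one reason multicollinearity matters.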

R-Squared

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}$$

Here,

  • $SS_{res}$ = Sum of squared residuals
  • $SS_{tot}$ = Total sum of squares

Adjusted R-Squared

$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$

Here,

  • $p$ = Number of predictors
  • $n$ = Sample size
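Both goodness-of-fit formulas translate directly into code. The following is a small sketch, assuming observed values `y` and fitted values `y_hat` are already available as lists; the function names and numbers are illustrative, not taken from the lesson.

```python
def r_squared(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """Penalize R^2 for using p predictors on n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Hypothetical observed values and fitted values from a one-predictor model.
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]

r2 = r_squared(y, y_hat)
r2_adj = adjusted_r_squared(r2, n=len(y), p=1)
print(f"R^2 = {r2:.3f}, adjusted R^2 = {r2_adj:.3f}")
```

Here $SS_{res} = 2.4$ and $SS_{tot} = 6$, so $R^2 = 0.6$, and the adjustment with $n = 5$, $p = 1$ lowers it to about 0.467. The gap between the two values shrinks as $n$ grows relative to $p$.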

Diagnostic Checklist

| Assumption | What to Check | How to Check | Remedy if Violated |
| --- | --- | --- | --- |
| Linearity | Linear relationship | Residuals vs. fitted plot | Add polynomial terms, transforms |
| Normality | Residuals ~ Normal | Q-Q plot, Shapiro-Wilk | Transform, robust regression |
| Homoscedasticity | Constant variance | Residuals vs. fitted (funnel = bad) | Weighted least squares, robust SE |
| Independence | No autocorrelation | Durbin-Watson test | Time series models, mixed effects |
| No multicollinearity | Predictors not highly correlated | VIF > 10 threshold | Remove/combine predictors, Ridge |
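Two of the numeric checks in the checklist are simple enough to sketch by hand. The code below computes the Durbin-Watson statistic from a list of residuals, and the variance inflation factor (VIF) for the special case of exactly two predictors, where it reduces to $1 / (1 - r^2)$ with $r$ the correlation between them. The function names and data are assumptions for illustration; with more than two predictors, each VIF requires regressing one predictor on all the others.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic; values near 2 suggest no first-order autocorrelation."""
    num = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    den = sum(e ** 2 for e in residuals)
    return num / den


def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    r_sq = s12 ** 2 / (s11 * s22)  # squared correlation between x1 and x2
    return 1.0 / (1.0 - r_sq)


# Hypothetical residuals that flip sign at every step (negative autocorrelation).
dw_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])

# Hypothetical predictors with correlation 0.8.
vif = vif_two_predictors([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])

print(f"Durbin-Watson = {dw_alternating:.2f}, VIF = {vif:.2f}")
```

The statistic ranges from 0 to 4: values well below 2 point to positive autocorrelation and values well above 2 to negative autocorrelation, as with the alternating residuals here. The VIF of roughly 2.8 sits far below the threshold of 10 listed in the table.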

Quick Example

📝Multiple Regression Prediction

Model: $\hat{y} = 2 + 3x_1 - x_2$. If $x_1 = 4$ and $x_2 = 2$:

$$\hat{y} = 2 + 3(4) - 2 = 2 + 12 - 2 = 12$$

Each coefficient represents the change in predicted $y$ for a one-unit increase in that predictor, holding all others constant. So $\beta_1 = 3$ means each unit increase in $x_1$ raises the predicted $y$ by 3 units, whatever value $x_2$ is held at.
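The same calculation, and the "holding others constant" interpretation, can be checked with a short sketch. The helper name `predict` is an assumption for this example; the coefficients are the ones from the model above.

```python
def predict(x1, x2):
    """Prediction from the example model y_hat = 2 + 3*x1 - x2."""
    return 2 + 3 * x1 - x2


y_hat = predict(4, 2)

# Raise x1 by one unit while holding x2 fixed at two different values.
effect_at_x2_2 = predict(5, 2) - predict(4, 2)
effect_at_x2_10 = predict(5, 10) - predict(4, 10)

print(y_hat, effect_at_x2_2, effect_at_x2_10)
```

The effect of $x_1$ is the same at both values of $x_2$ because the model contains no interaction term; adding a term such as $x_1 x_2$ would make the effect of $x_1$ depend on $x_2$.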

📝R-Squared Interpretation

$R^2 = 0.85$ means 85% of the variance in $y$ is explained by the predictors. If the adjusted $R^2$ drops to 0.82 after penalizing for 10 predictors, that suggests some predictors may not be adding value.


Key Takeaways

📋Summary: Regression Analysis

  • Simple Linear Regression: $y = \beta_0 + \beta_1 x + \epsilon$. Slope captures the linear relationship between $x$ and $y$.
  • Multiple Regression: Extends to multiple predictors. Each $\beta_j$ is a partial effect (holding others constant).
  • OLS: Minimizes $\sum(y_i - \hat{y}_i)^2$. Coefficients are BLUE (Best Linear Unbiased Estimators) under Gauss-Markov assumptions.
  • $R^2$: Proportion of variance explained. Adjusted $R^2$ penalizes for number of predictors.
  • Diagnostics: Always check linearity, normality, homoscedasticity, and independence before interpreting coefficients.
  • Multicollinearity: Correlated predictors inflate standard errors. A VIF above 10 is a common warning threshold.
  • Extrapolation: Predictions outside the range of observed x values are unreliable.

Deep Dive

For detailed explanations, worked examples, and Python implementations, explore the dedicated statistics lessons:

Simple Linear Regression

OLS Estimation

  • OLS Estimation — Gauss-Markov theorem, BLUE properties, matrix formulation, and efficiency

Assumptions

Diagnostics

Multiple Regression
