R Linear Regression — Modeling Relationships
Learning Objectives
By the end of this tutorial, you will be able to:
- Fit simple and multiple linear regression models
- Interpret regression coefficients and diagnostics
- Check model assumptions (normality, homoscedasticity, multicollinearity)
- Perform model selection and comparison
- Visualize regression results
Simple Linear Regression
# Model: y = β₀ + β₁x + ε
# Using mtcars
model <- lm(mpg ~ wt, data = mtcars)
# Summary
summary(model)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 37.285 1.878 19.858 < 2e-16 ***
# wt -5.344 0.559 -9.559 1.29e-10 ***
# Predictions
predict(model, newdata = data.frame(wt = 3))
# Confidence intervals for the coefficients
confint(model, level = 0.95)
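The estimates reported by summary() can be reproduced by hand from the least-squares formulas: the slope is cov(x, y) / var(x), and the intercept is mean(y) minus the slope times mean(x). The sketch below is only a cross-check against lm(); the names x, y, b0, b1, fit and manual are introduced here for illustration.

```r
# Reproduce the simple-regression coefficients from the closed-form formulas
x <- mtcars$wt
y <- mtcars$mpg

b1 <- cov(x, y) / var(x)        # slope
b0 <- mean(y) - b1 * mean(x)    # intercept

fit <- lm(mpg ~ wt, data = mtcars)
manual <- c(b0, b1)

# Compare with lm(); unname() drops the coefficient labels before comparing
all.equal(unname(coef(fit)), manual)
```

The comparison should return TRUE, because lm() solves the same least-squares problem (with a numerically more stable method).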
Multiple Linear Regression
# Multiple predictors
model <- lm(mpg ~ wt + hp + cyl, data = mtcars)
summary(model)
# Stepwise variable selection (AIC-based)
step_model <- step(model, direction = "both")
summary(step_model)
# Interactions
model_interaction <- lm(mpg ~ wt * hp, data = mtcars)
summary(model_interaction)
# Polynomial regression
model_poly <- lm(mpg ~ poly(wt, 2), data = mtcars)
summary(model_poly)
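By default, poly() builds orthogonal polynomial terms, so the coefficients of model_poly are not the coefficients of wt and wt². The following sketch, using the illustrative names m_orth, m_raw and m_manual, suggests that the orthogonal and raw parameterizations describe the same fitted curve and differ only in how the coefficients are expressed.

```r
# Orthogonal polynomial terms (the default for poly())
m_orth <- lm(mpg ~ poly(wt, 2), data = mtcars)

# Raw polynomial terms: coefficients apply directly to wt and wt^2
m_raw <- lm(mpg ~ poly(wt, 2, raw = TRUE), data = mtcars)

# Equivalent raw model written with I()
m_manual <- lm(mpg ~ wt + I(wt^2), data = mtcars)

# Same fitted values, different coefficient scales
all.equal(fitted(m_orth), fitted(m_raw))
coef(m_orth)
coef(m_raw)
```

Orthogonal terms are usually better conditioned numerically; raw terms are easier to read when the coefficients themselves need interpreting.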
Model Diagnostics
model <- lm(mpg ~ wt + hp, data = mtcars)
# Diagnostic plots
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))
# Residuals
residuals(model)
rstandard(model) # Standardized residuals
rstudent(model) # Studentized residuals
# Influential points
cooks.distance(model)
hatvalues(model)
# Normality of residuals (Shapiro-Wilk test)
shapiro.test(residuals(model))
# Homoscedasticity
library(lmtest)
bptest(model) # Breusch-Pagan test
# Multicollinearity
library(car)
vif(model) # Variance Inflation Factor
# Rule of thumb: VIF above 5 (some authors use 10) suggests problematic multicollinearity
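To see what the variance inflation factor measures, it can be computed with base R alone: regress each predictor on the remaining predictors and take 1 / (1 - R²) of that auxiliary regression. This is a minimal sketch for models with numeric predictors only; the names predictors, vif_manual, others and aux are introduced here for illustration.

```r
# Manual VIF for the predictors of lm(mpg ~ wt + hp, data = mtcars)
predictors <- c("wt", "hp")

vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  # Auxiliary regression: predictor p explained by the other predictors
  aux <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})

vif_manual
```

With only two predictors, both values are equal, because each auxiliary R² is the squared correlation between wt and hp.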
Model Comparison
# Full vs reduced model
model1 <- lm(mpg ~ wt, data = mtcars)
model2 <- lm(mpg ~ wt + hp, data = mtcars)
model3 <- lm(mpg ~ wt + hp + cyl, data = mtcars)
# ANOVA comparison
anova(model1, model2, model3)
# Information criteria
AIC(model1, model2, model3)
BIC(model1, model2, model3)
# Adjusted R²
summary(model1)$adj.r.squared
summary(model2)$adj.r.squared
summary(model3)$adj.r.squared
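The information criteria above are penalized log-likelihoods: AIC = -2·logLik + 2k and BIC = -2·logLik + log(n)·k, where k is the number of estimated parameters (for lm(), the coefficients plus the residual variance). A short sketch that rebuilds both values for one model, with m, ll, k, n, aic_manual and bic_manual as illustrative names:

```r
m <- lm(mpg ~ wt + hp, data = mtcars)

ll <- logLik(m)
k <- attr(ll, "df")   # estimated parameters: 3 coefficients + residual variance
n <- nobs(m)          # number of observations used in the fit

aic_manual <- -2 * as.numeric(ll) + 2 * k
bic_manual <- -2 * as.numeric(ll) + log(n) * k

# Both comparisons should agree with the built-in functions
all.equal(aic_manual, AIC(m))
all.equal(bic_manual, BIC(m))
```

Lower values indicate a better trade-off between fit and complexity. Because log(n) exceeds 2 once n is 8 or more, BIC penalizes extra predictors more heavily than AIC on this dataset.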
Visualization
library(ggplot2)
# Scatter + regression line
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm")
# Two predictors, with the response mapped to color (a scatter plot, not a fitted surface)
ggplot(mtcars, aes(x = wt, y = hp, color = mpg)) +
  geom_point(size = 3) +
  scale_color_viridis_c()
# Residual plot
ggplot(data.frame(fitted = fitted(model), residuals = residuals(model)),
       aes(x = fitted, y = residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")
# QQ plot
qqnorm(residuals(model))
qqline(residuals(model))
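To draw the fitted surface of a two-predictor model, one option is to predict over a grid of predictor values and plot the contours. The sketch below uses base graphics rather than ggplot2 so that it runs without extra packages; the grid resolution of 50 and the names fit, wt_seq, hp_seq, grid and z are assumptions made for illustration.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Evenly spaced values spanning the observed range of each predictor
wt_seq <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 50)
hp_seq <- seq(min(mtcars$hp), max(mtcars$hp), length.out = 50)

# All wt/hp combinations; expand.grid varies the first column fastest
grid <- expand.grid(wt = wt_seq, hp = hp_seq)
grid$mpg_hat <- predict(fit, newdata = grid)

# Reshape so rows index wt and columns index hp, as contour() expects
z <- matrix(grid$mpg_hat, nrow = length(wt_seq), ncol = length(hp_seq))

contour(wt_seq, hp_seq, z, xlab = "wt", ylab = "hp",
        main = "Predicted mpg from lm(mpg ~ wt + hp)")
points(mtcars$wt, mtcars$hp, pch = 19)   # observed cars
```

Because the model has no interaction term, the contour lines are straight and parallel; fitting mpg ~ wt * hp instead would bend them.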
Practical Examples
Example 1: Predict House Prices
# Simulated data
set.seed(42)
n <- 200
data <- data.frame(
  sqft = runif(n, 1000, 3000),
  bedrooms = sample(2:5, n, replace = TRUE),
  age = runif(n, 0, 50)
)
data$price <- 50000 + 150 * data$sqft + 10000 * data$bedrooms -
  500 * data$age + rnorm(n, 0, 20000)
model <- lm(price ~ sqft + bedrooms + age, data = data)
summary(model)
# Predict
new_house <- data.frame(sqft = 2000, bedrooms = 3, age = 10)
predict(model, newdata = new_house, interval = "prediction")
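The interval argument of predict() matters: "confidence" gives an interval for the mean response at the given predictor values, while "prediction" gives an interval for a single new observation and is therefore wider. A small sketch on mtcars, kept separate from the simulated house data; fit, new_car, conf_int and pred_int are illustrative names.

```r
fit <- lm(mpg ~ wt, data = mtcars)
new_car <- data.frame(wt = 3)

# Interval for the average mpg of all cars with wt = 3
conf_int <- predict(fit, newdata = new_car, interval = "confidence", level = 0.95)

# Interval for the mpg of one new car with wt = 3
pred_int <- predict(fit, newdata = new_car, interval = "prediction", level = 0.95)

conf_int
pred_int

# Same point estimate, but the prediction interval is wider
width_conf <- conf_int[, "upr"] - conf_int[, "lwr"]
width_pred <- pred_int[, "upr"] - pred_int[, "lwr"]
width_pred > width_conf
```

The prediction interval adds the residual variance of an individual observation to the uncertainty in the fitted mean, which is why it is the appropriate choice when quoting a range for one house or one car.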
Practice Exercises
Exercise 1: Model Building
Build a model to predict mpg from mtcars using all available predictors. Then use stepwise selection to find the best model.
Solution
# Full model
full_model <- lm(mpg ~ ., data = mtcars)
summary(full_model)
# Stepwise selection
best_model <- step(full_model, direction = "both", trace = 0)
summary(best_model)
# Compare
AIC(full_model, best_model)
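As a follow-up to the solution, it can help to list which predictors the stepwise search kept and to compare the selected model against the full one with a nested F-test. This is a sketch under the assumption that the selected model's terms are a subset of the full model's; kept and dropped are names introduced here.

```r
full_model <- lm(mpg ~ ., data = mtcars)
best_model <- step(full_model, direction = "both", trace = 0)

# Terms retained and removed by the stepwise search
kept <- attr(terms(best_model), "term.labels")
dropped <- setdiff(attr(terms(full_model), "term.labels"), kept)
kept
dropped

# Nested F-test: do the dropped terms jointly improve the fit?
anova(best_model, full_model)
```

A large p-value in the F-test suggests the dropped predictors add little beyond the retained ones. Stepwise selection is data-driven, though, so p-values from the selected model tend to be optimistic.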
Key Takeaways
- lm() fits linear models: y ~ x1 + x2 + x3
- summary() shows coefficients, R², and p-values
- Check assumptions: normality, homoscedasticity, no multicollinearity
- A VIF greater than 5 is a common rule-of-thumb signal of multicollinearity problems
- Use step() for AIC-based variable selection
- predict() generates predictions, with confidence or prediction intervals when the interval argument is set
- Diagnostic plots reveal model problems
Next: Learn about R Logistic Regression — modeling binary outcomes.