Sampling Bias and Errors — Types, Detection, and Prevention


Even the most carefully designed study can be undermined by bias. Understanding bias is not pessimism — it's statistical integrity.


Types of Error in Statistics

Sampling Error

The unavoidable random difference between a sample statistic and the true population parameter. It decreases with larger samples and can be quantified.

$$\text{Sampling Error} = \bar{x} - \mu$$

This is expected and manageable. Confidence intervals are designed to account for it.
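To make "decreases with larger samples" concrete, here is a minimal simulation sketch. The population mean and standard deviation (`mu`, `sigma`) and the sample sizes are illustrative assumptions, not values from the lesson. It compares the average absolute sampling error with the theoretical standard error, sigma / sqrt(n).

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 50.0, 10.0  # hypothetical population mean and standard deviation

mean_abs_error = {}
for n in (25, 100, 400, 1600):
    # Draw 2,000 independent samples of size n and record each sample mean
    sample_means = rng.normal(mu, sigma, size=(2000, n)).mean(axis=1)
    # Average size of the sampling error |xbar - mu| across the 2,000 samples
    mean_abs_error[n] = np.abs(sample_means - mu).mean()
    print(f"n={n:5d}  mean |xbar - mu| = {mean_abs_error[n]:.3f}  "
          f"theoretical SE = {sigma / np.sqrt(n):.3f}")
```

Each fourfold increase in n should roughly halve the typical error, which is the 1/sqrt(n) behavior that confidence intervals rely on.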

Non-Sampling Error (Bias)

Systematic errors that don't go away with larger samples. A biased survey of 1 million people is still biased.

$$\text{Bias} = E[\hat{\theta}] - \theta$$
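The contrast with sampling error can be sketched in a few lines. The population and the biased selection rule (units above the mean are three times as likely to be drawn) are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
population = rng.normal(100.0, 15.0, size=200_000)  # hypothetical population
theta = population.mean()                            # true parameter

# Hypothetical biased mechanism: units above the mean are 3x as likely to be drawn
weights = np.where(population > theta, 3.0, 1.0)
p = weights / weights.sum()

errors = {}
for n in (100, 10_000, 1_000_000):
    srs = rng.choice(population, size=n, replace=True)             # simple random sample
    biased = rng.choice(population, size=n, replace=True, p=p)     # biased sample
    errors[n] = (srs.mean() - theta, biased.mean() - theta)
    print(f"n={n:9,d}  random-sample error = {errors[n][0]:+.3f}  "
          f"biased-sample error = {errors[n][1]:+.3f}")
```

Under these assumptions the random-sample error shrinks toward zero as n grows, while the biased-sample error settles near a fixed positive value.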


Types of Bias

Selection Bias

Certain individuals are more likely to be included in the sample due to how the sample was drawn.

Classic example: The Literary Digest 1936 US Election Poll

  • Sent 10 million surveys to car owners and phone subscribers
  • Got 2.4 million responses
  • Predicted Landon would beat Roosevelt 57% to 43%
  • Roosevelt won with about 61% of the popular vote to Landon's 37%

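Problem: In 1936, car owners and phone subscribers tended to be wealthier than average, and wealthier voters were systematically more Republican than the general population. The sampling frame itself excluded much of Roosevelt's support.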
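A quick calculation shows why sample size could not rescue this poll. The sketch below uses the figures quoted above and the usual normal-approximation margin of error, which assumes simple random sampling (an assumption the poll did not satisfy).

```python
import math

n = 2_400_000   # responses received, from the figures above
p_hat = 0.57    # Landon's predicted share
actual = 0.37   # Landon's actual share, as quoted above

# 95% margin of error if this had been a simple random sample
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
miss = p_hat - actual

print(f"Nominal margin of error: {100 * margin:.2f} percentage points")
print(f"Actual miss:             {100 * miss:.0f} percentage points")
```

The nominal margin of error is a small fraction of one percentage point, while the actual miss was about 20 points. Nearly all of the error was bias, not sampling error.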

import numpy as np

np.random.seed(42)
n = 10000

# Income in USD for each member of the population
population_income = np.random.normal(50000, 20000, n)

# High-income people support the policy less (e.g., a wealth tax)
p_support = np.where(population_income > 60000, 0.35, 0.65)
population_support = np.random.binomial(1, p_support)

# Selection bias: online survey where higher-income people are more likely to respond
prob_respond = np.clip(0.1 + (population_income - 30000) / 200000, 0.05, 0.95)
responded = np.random.binomial(1, prob_respond)

# Compare the respondents with the full population
selected = population_support[responded == 1]
true_support = population_support.mean()
print(f"True support: {true_support:.3f}")
print(f"Biased sample support: {selected.mean():.3f}")
print(f"Bias: {selected.mean() - true_support:+.3f}")

Nonresponse Bias

People who don't respond to a survey differ systematically from those who do.

Examples:

  • Happy customers ignore feedback surveys; dissatisfied customers respond
  • Busy people (who may have different characteristics) skip phone surveys
  • Sensitive topics (income, drug use) get more refusals from those actually affected

Detection: a common heuristic is to compare early and late respondents, since late respondents tend to resemble nonrespondents. The simulation below shows the size of the bias when dissatisfied people respond less often:

# Simulate a survey with nonresponse
np.random.seed(1)
n = 1000
# True satisfaction, clipped so every value stays on the 1-10 scale
true_job_satisfaction = np.clip(np.random.normal(6.5, 2, n), 1, 10)

# Dissatisfied people are less likely to respond
prob_respond = np.clip(0.3 + 0.08 * true_job_satisfaction, 0.1, 0.95)
responded = np.random.binomial(1, prob_respond)

all_responses = true_job_satisfaction[responded == 1]
bias = all_responses.mean() - true_job_satisfaction.mean()
print(f"True mean satisfaction: {true_job_satisfaction.mean():.3f}")
print(f"Survey mean satisfaction (biased): {all_responses.mean():.3f}")
print(f"Nonresponse bias: {bias:+.3f}")
print("The survey overestimates satisfaction because unhappy people were less likely to respond.")
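The early-versus-late comparison can be sketched as follows. The response mechanism is hypothetical: each person has a per-contact response probability that rises with satisfaction, and anyone who has not answered after three contact attempts counts as a nonrespondent. Treating late respondents as a stand-in for nonrespondents is a heuristic, not a formal correction.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000
satisfaction = np.clip(rng.normal(6.5, 2.0, n), 1, 10)  # 1-10 scale

# Hypothetical mechanism: per-contact response probability rises with satisfaction
p_contact = np.clip(0.05 + 0.05 * satisfaction, 0.05, 0.95)
# Contact attempt at which each person would first respond
attempts_needed = rng.geometric(p_contact)

early = satisfaction[attempts_needed == 1]                            # answered first contact
late = satisfaction[(attempts_needed >= 2) & (attempts_needed <= 3)]  # answered a reminder
never = satisfaction[attempts_needed > 3]                             # nonrespondents

print(f"Early respondents: {early.mean():.2f}")
print(f"Late respondents:  {late.mean():.2f}")
print(f"Nonrespondents:    {never.mean():.2f}  (unobserved in a real survey)")
```

In a real survey only the first two groups are observed. A gap between early and late respondents is a warning sign that nonrespondents may differ even more.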

Survivorship Bias

Analyzing only survivors (successes) while ignoring those that failed and are no longer visible.

Examples:

  • Studying successful startups to find success strategies (ignoring the thousands that failed with the same strategies)
  • WWII bomber damage (Abraham Wald's analysis): reinforce the areas where returning planes were not hit, because planes hit in those areas never made it back
  • Investment fund performance: funds that failed were removed from databases

# Survivorship bias in investment funds
np.random.seed(7)
n_funds = 1000
years = 10

# Simulate annual returns: mean 5% with 20% std dev
returns = np.random.normal(0.05, 0.20, (n_funds, years))

# Simplifying assumption: a fund closes after any year with a return below -20%.
# Survival must depend on performance; if it were independent of returns,
# survivors would look no different from the full set of funds.
alive = (returns > -0.20).cumprod(axis=1)

# Surviving funds (still alive at year 10)
survived = alive[:, -1] == 1
n_survived = survived.sum()
print(f"Funds surviving 10 years: {n_survived}/{n_funds} ({100*n_survived/n_funds:.1f}%)")

# Compare average returns. "All funds" includes the returns closed funds
# would have earned, which serves as the unbiased benchmark here.
all_fund_returns = returns.mean(axis=1)
survivor_returns = returns[survived].mean(axis=1)

print(f"All funds average return: {all_fund_returns.mean():.3f}")
print(f"Surviving funds average return: {survivor_returns.mean():.3f}")
print(f"Survivorship bias inflates returns by: {survivor_returns.mean() - all_fund_returns.mean():+.3f}")

Measurement Bias

Systematic errors in how variables are measured, causing values to be consistently too high or too low.

Examples:

  • Self-reported weight: people tend to underreport
  • Social desirability bias: answering to appear socially acceptable
  • Question ordering effects
  • Interviewer effects (different interviewers get different answers)

# Social desirability bias: hours of exercise per week
np.random.seed(3)
n = 500
true_hours = np.random.exponential(scale=3, size=n)  # true behavior

# People exaggerate to the interviewer
exaggeration = np.random.normal(loc=1.5, scale=0.5, size=n)  # add ~1.5 hours bias
reported_hours = true_hours + exaggeration

print(f"True mean exercise: {true_hours.mean():.2f} hours/week")
print(f"Reported mean exercise: {reported_hours.mean():.2f} hours/week")
print(f"Measurement bias: +{reported_hours.mean() - true_hours.mean():.2f} hours")
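In practice the true values are not available for everyone, so measurement bias is usually estimated from a validation subsample. The sketch below assumes a hypothetical validation study in which 60 respondents also have an objective measurement (for example, from an activity tracker) and reuses the same simulated data-generating process as above.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 500
true_hours = rng.exponential(scale=3.0, size=n)             # objective measure
reported_hours = true_hours + rng.normal(1.5, 0.5, size=n)  # self-report

# Hypothetical validation subsample: objective data exist for only 60 people
idx = rng.choice(n, size=60, replace=False)
diff = reported_hours[idx] - true_hours[idx]

# Estimated bias with an approximate 95% confidence interval
bias_hat = diff.mean()
se = diff.std(ddof=1) / np.sqrt(len(diff))
ci = (bias_hat - 1.96 * se, bias_hat + 1.96 * se)

# Subtract the estimated bias from the full-sample self-reported mean
corrected_mean = reported_hours.mean() - bias_hat

print(f"Estimated bias: {bias_hat:+.2f} hours (95% CI {ci[0]:.2f} to {ci[1]:.2f})")
print(f"Reported mean: {reported_hours.mean():.2f}, corrected mean: {corrected_mean:.2f}")
```

This correction only works if the validation subsample is itself drawn at random and the bias is roughly the same for everyone.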

Detecting and Reducing Bias

| Bias Type | Detection | Reduction |
| --- | --- | --- |
| Selection | Compare selected vs. population on known variables | Probability sampling, weighting |
| Nonresponse | Follow-up of nonrespondents, callback analysis | Maximize response rate, imputation |
| Survivorship | Check if missing data is random | Include all units, censoring analysis |
| Measurement | Validate against objective measures | Anonymize surveys, indirect questions |
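As one illustration of the weighting entry in the table, here is a minimal post-stratification sketch. The population shares, support rates, and response rates are hypothetical. The method assumes the population share of each stratum is known and that, within a stratum, responding is unrelated to the answer.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
high_income = rng.random(n) < 0.30                           # hypothetical: 30% high income
support = rng.random(n) < np.where(high_income, 0.35, 0.65)  # support differs by stratum

# Biased response: high-income people respond four times as often
responded = rng.random(n) < np.where(high_income, 0.40, 0.10)
sample_support = support[responded]
sample_high = high_income[responded]

# Naive estimate: unweighted mean of the respondents
naive = sample_support.mean()

# Post-stratification: weight each stratum's sample mean by its population share
pop_share = {True: high_income.mean(), False: 1 - high_income.mean()}
weighted = sum(pop_share[g] * sample_support[sample_high == g].mean()
               for g in (True, False))

print(f"True support:      {support.mean():.3f}")
print(f"Naive estimate:    {naive:.3f}")
print(f"Weighted estimate: {weighted:.3f}")
```

Weighting removes the bias here only because income is the sole driver of response. If response also depended on the opinion itself, the weighted estimate would remain biased.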

Key Takeaways

  1. Sampling error is random and decreases with n — it's manageable
  2. Bias is systematic and doesn't decrease with n — bigger biased samples are still biased
  3. Survivorship bias is everywhere in business, medicine, and social science
  4. Nonresponse bias is often larger than sampling error in practice
  5. Probability sampling (not convenience sampling) is the best protection against selection bias
  6. Validate your measures — what you measure may not be what you intend to measure
