Data Collection Methods
The quality of any statistical analysis is bounded by the quality of its data. If the data are collected poorly, no amount of sophisticated analysis will rescue the conclusions.
"Garbage in, garbage out." — Computer science proverb that applies equally to statistics.
Primary vs Secondary Data
| Type | Definition | Examples |
|---|---|---|
| Primary | Collected directly for the current study | Survey you design, experiment you run |
| Secondary | Pre-existing data collected by others | Government census, hospital records |
Primary advantages: tailored to your question, you control quality
Secondary advantages: cheap, large scale, historical depth
Experimental Studies
The gold standard for establishing causation. The researcher:
- Randomly assigns subjects to treatment/control groups
- Applies a treatment (intervention)
- Measures the outcome
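Randomization is the key: on average, it balances confounders (both measured and unmeasured) across groups, so a systematic difference in outcomes can be attributed to the treatment.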
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
np.random.seed(42)
n = 100 # 100 participants
# Random assignment to treatment or control
assignments = np.random.choice(['treatment', 'control'], size=n)
# Simulate outcomes (treatment has true effect of +5 points)
outcomes = np.where(assignments == 'treatment',
                    np.random.normal(75, 10, n),  # treatment group
                    np.random.normal(70, 10, n))  # control group
df = pd.DataFrame({'group': assignments, 'score': outcomes})
# Compare groups
for group in ['treatment', 'control']:
    subset = df[df['group'] == group]['score']
    print(f"{group}: mean={subset.mean():.2f}, n={len(subset)}")
# Hypothesis test (two-sample t-test; equal variances are assumed by default)
t_stat, p_val = ttest_ind(
    df[df['group'] == 'treatment']['score'],
    df[df['group'] == 'control']['score']
)
print(f"\nt-statistic = {t_stat:.3f}, p-value = {p_val:.4f}")
print("Conclusion:", "Significant difference" if p_val < 0.05 else "No significant difference")
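To see what randomization buys, the sketch below checks covariate balance. It assumes a hypothetical baseline covariate, `age`, that is not part of the simulation above, and compares random assignment with a self-selected assignment in which older participants are more likely to opt into treatment. The names `age`, `random_group`, `self_selected`, and `mean_gap`, and all numeric settings, are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000  # hypothetical number of participants

# Hypothetical baseline covariate that could confound the outcome
age = rng.normal(40, 12, n)

# Random assignment: a fair coin flip that ignores age
random_group = rng.random(n) < 0.5

# Self-selection: the probability of choosing treatment rises with age
p_opt_in = 1 / (1 + np.exp(-(age - 40) / 6))
self_selected = rng.random(n) < p_opt_in

def mean_gap(covariate, in_treatment):
    """Difference in covariate means, treatment minus control."""
    return covariate[in_treatment].mean() - covariate[~in_treatment].mean()

gap_random = mean_gap(age, random_group)
gap_self = mean_gap(age, self_selected)

print(f"Age gap under random assignment: {gap_random:+.2f} years")
print(f"Age gap under self-selection:    {gap_self:+.2f} years")
```

Under random assignment the age gap should be close to zero, while self-selection leaves the groups differing in age before any treatment is applied. The same logic extends to covariates nobody measured, which is why randomization supports causal claims.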
Observational Studies
The researcher observes without intervening. Such studies can show association but cannot, on their own, establish causation, because a confounder may drive both the exposure and the outcome.
Types of Observational Studies
| Type | Direction in Time | Strength of Evidence | Example |
|---|---|---|---|
| Cross-sectional | Snapshot (single point in time) | Weak | Survey of current diet and BMI |
| Case-Control | Backward (retrospective) | Moderate | Lung cancer patients vs. controls → smoking history |
| Cohort | Forward (prospective) | Strong | Follow smokers vs. non-smokers for 20 years |
# Observational study simulation: coffee and productivity
# True causal structure: Exercise → (Coffee consumption + Productivity)
# Naive analysis might conclude coffee CAUSES productivity
np.random.seed(0)
n = 500
exercise = np.random.normal(5, 2, n) # hours/week — the true cause
coffee = 0.5 * exercise + np.random.normal(2, 1, n) # coffee correlated with exercise
productivity = 0.8 * exercise + np.random.normal(7, 2, n) # productivity caused by exercise
# Naive correlation
from scipy.stats import pearsonr
r_naive, p_naive = pearsonr(coffee, productivity)
print(f"Coffee × Productivity correlation: r = {r_naive:.3f}, p = {p_naive:.4f}")
# Partial correlation (controlling for exercise): correlate the residuals
# left over after regressing each variable on exercise
X = np.column_stack([np.ones(n), exercise])  # design matrix: intercept + exercise
coffee_resid = coffee - X @ np.linalg.lstsq(X, coffee, rcond=None)[0]
prod_resid = productivity - X @ np.linalg.lstsq(X, productivity, rcond=None)[0]
r_partial, p_partial = pearsonr(coffee_resid, prod_resid)
print(f"Coffee × Productivity (controlling exercise): r = {r_partial:.3f}, p = {p_partial:.4f}")
print("\nCoffee's apparent effect was due to exercise (confounding)!")
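The same adjustment can be expressed as a multiple regression, which some readers may find easier to interpret than residual correlations. This is a minimal sketch that regenerates simulated data with the same assumed coefficients as above (using a different random generator, so the exact numbers will differ) and fits two ordinary least squares models with `np.linalg.lstsq`. The helper `ols_coefs` is a hypothetical name introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Same assumed causal structure as above: exercise drives both variables
exercise = rng.normal(5, 2, n)
coffee = 0.5 * exercise + rng.normal(2, 1, n)
productivity = 0.8 * exercise + rng.normal(7, 2, n)

def ols_coefs(y, *predictors):
    """Least-squares coefficients, intercept first (illustrative helper)."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

naive = ols_coefs(productivity, coffee)               # ignores the confounder
adjusted = ols_coefs(productivity, coffee, exercise)  # includes the confounder

print(f"Coffee slope, unadjusted:            {naive[1]:.3f}")
print(f"Coffee slope, adjusted for exercise: {adjusted[1]:.3f}")
print(f"Exercise slope in adjusted model:    {adjusted[2]:.3f}")
```

Because coffee has no direct effect in this simulation, its adjusted slope should sit near zero while the exercise slope lands near the assumed 0.8. In real observational data the adjustment only works for confounders that were actually measured.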
Survey Methods
Key Survey Design Principles
1. Question Wording
- Clear, unambiguous language
- Avoid leading questions: "Don't you agree that X is better?" → BAD
- Neutral framing: "Which do you prefer, X or Y?" → BETTER
2. Response Options
- Exhaustive (cover all possibilities)
- Mutually exclusive (no overlap)
- Balanced scale (equal positive and negative options)
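For numeric response brackets, the exhaustive and mutually exclusive criteria can be checked mechanically. The sketch below assumes a hypothetical age question with integer-valued, inclusive brackets; `check_brackets` and the example brackets are illustrative and not part of any survey library.

```python
def check_brackets(brackets, lowest, highest):
    """Report gaps and overlaps in inclusive integer brackets.

    brackets: list of (low, high) tuples, both ends inclusive,
              assumed to lie within [lowest, highest].
    lowest, highest: the range of values a respondent could report.
    """
    problems = []
    covered_to = lowest - 1  # largest value covered by the brackets seen so far
    for low, high in sorted(brackets):
        if low > covered_to + 1:
            problems.append(f"gap: {covered_to + 1}-{low - 1} not covered")
        elif low <= covered_to:
            problems.append(
                f"overlap: {low}-{min(high, covered_to)} appears more than once"
            )
        covered_to = max(covered_to, high)
    if covered_to < highest:
        problems.append(f"gap: {covered_to + 1}-{highest} not covered")
    return problems

# Flawed options: 25 and 35 each fall in two brackets, and 56+ has no option
flawed = [(18, 25), (25, 35), (35, 55)]
# Repaired options: every age from 18 to 120 falls in exactly one bracket
repaired = [(18, 24), (25, 34), (35, 54), (55, 120)]

print(check_brackets(flawed, 18, 120))
print(check_brackets(repaired, 18, 120))
```

A check like this catches only structural problems. Whether the wording is neutral and the scale balanced still needs human review and pilot testing.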
3. Survey Modes
| Mode | Cost | Response Rate | Coverage | Best For |
|---|---|---|---|---|
| In-person | High | ~80% | Limited | Detailed interviews |
| Phone | Medium | ~15-30% | Broad | National surveys |
| Mail | Low-Medium | ~10-20% | Very broad | Sensitive topics |
| Online | Very low | ~10-30% | Internet users | Large-scale, fast |
# Simulating selection bias in surveys (differential nonresponse)
# Suppose true population approval = 55%
np.random.seed(1)
true_approval = 0.55
# Random phone sample (representative)
n_phone = 1000
phone_responses = np.random.binomial(1, true_approval, n_phone)
print(f"Phone survey: {phone_responses.mean():.3f} (true: {true_approval})")
# Online opt-in sample (selection bias — engaged users more opinionated)
# People who disapprove are angrier and more likely to respond
prob_respond_approve = 0.3
prob_respond_disapprove = 0.6
population = np.random.binomial(1, true_approval, 10000)
responded = np.where(population == 1,
                     np.random.binomial(1, prob_respond_approve, 10000),
                     np.random.binomial(1, prob_respond_disapprove, 10000))
online_sample = population[responded == 1]
print(f"Online opt-in: {online_sample.mean():.3f} (true: {true_approval})")
print("Selection bias introduced error!")
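The size of this bias follows directly from the response probabilities, so it can be worked out without simulation. The sketch below reuses the values assumed in the simulation above; `expected_opt_in_share` is a hypothetical helper name.

```python
def expected_opt_in_share(true_share, p_respond_yes, p_respond_no):
    """Expected share of 'approve' among those who respond.

    Each group is weighted by how likely its members are to respond.
    """
    yes_respondents = true_share * p_respond_yes
    no_respondents = (1 - true_share) * p_respond_no
    return yes_respondents / (yes_respondents + no_respondents)

true_approval = 0.55
biased = expected_opt_in_share(true_approval, 0.3, 0.6)    # rates from the simulation
unbiased = expected_opt_in_share(true_approval, 0.4, 0.4)  # equal response rates

print(f"Expected opt-in estimate: {biased:.3f} (true: {true_approval})")
print(f"Equal response rates:     {unbiased:.3f} (true: {true_approval})")
```

With these assumed rates the opt-in estimate settles near 0.38 however many people respond: a larger sample shrinks sampling error but leaves this bias untouched. When both groups respond at the same rate, the expected estimate equals the true share.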
Common Data Collection Errors
| Error Type | Description | Example |
|---|---|---|
| Sampling error | Random variation from sample to sample | Poll shows 52% support; true value is 50% |
| Coverage error | Population not fully covered | Phone survey misses people without phones |
| Nonresponse error | Non-responders differ systematically | Dissatisfied customers less likely to respond |
| Measurement error | Inaccurate responses | People underreport alcohol consumption |
| Processing error | Data entry mistakes | Mistyped values during transcription |
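Of the errors in the table, only sampling error can be quantified from the sample alone. As a rough sketch, the normal-approximation margin of error for a sample proportion is z * sqrt(p(1 - p) / n). The sample size of 1,000 below is an assumption, since the poll example in the table does not state one, and `margin_of_error` is an illustrative helper.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Normal-approximation margin of error for a sample proportion.

    z=1.96 corresponds to roughly 95% confidence.
    """
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

p_hat = 0.52  # poll result from the table's example
n = 1000      # assumed sample size (not given in the text)

moe = margin_of_error(p_hat, n)
low, high = p_hat - moe, p_hat + moe
print(f"{p_hat:.0%} +/- {moe:.1%} -> interval ({low:.3f}, {high:.3f})")
```

Under this assumed sample size the interval includes 50%, so a 52% reading is consistent with a true value of 50%. The formula covers random sampling error only; coverage, nonresponse, measurement, and processing errors are not reflected in it.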
Key Takeaways
- Randomized experiments are the most reliable way to establish causation
- Observational studies show association — confounding is always a threat
- Surveys require careful design — question wording and mode affect results dramatically
- Nonresponse bias is one of the biggest practical threats to survey validity
- Secondary data is valuable but comes with limitations you didn't control
- Pre-register your study design before collecting data to avoid p-hacking