Data Collection Methods — Surveys, Experiments, Observations

Data Collection Methods

The quality of any statistical analysis is limited by the quality of its data. If the data are collected poorly, no amount of sophisticated analysis will save your conclusions.

"Garbage in, garbage out." — Computer science proverb that applies equally to statistics.


Primary vs Secondary Data

| Type | Definition | Examples |
|---|---|---|
| Primary | Collected directly for the current study | Survey you design, experiment you run |
| Secondary | Pre-existing data collected by others | Government census, hospital records |

Primary advantages: tailored to your question, you control quality
Secondary advantages: cheap, large scale, historical depth


Experimental Studies

Experiments are the gold standard for establishing causation. The researcher:

  1. Randomly assigns subjects to treatment/control groups
  2. Applies a treatment (intervention)
  3. Measures the outcome

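Randomization is the key: on average, it balances confounders, both measured and unmeasured, across the groups, so a systematic difference in outcomes can be attributed to the treatment.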

import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

np.random.seed(42)
n = 100  # 100 participants

# Random assignment to treatment or control
assignments = np.random.choice(['treatment', 'control'], size=n)

# Simulate outcomes (treatment has true effect of +5 points)
outcomes = np.where(assignments == 'treatment',
                    np.random.normal(75, 10, n),   # treatment group
                    np.random.normal(70, 10, n))   # control group

df = pd.DataFrame({'group': assignments, 'score': outcomes})

# Compare groups
for group in ['treatment', 'control']:
    subset = df[df['group'] == group]['score']
    print(f"{group}: mean={subset.mean():.2f}, n={len(subset)}")

# Hypothesis test
t_stat, p_val = ttest_ind(
    df[df['group']=='treatment']['score'],
    df[df['group']=='control']['score']
)
print(f"\nt-statistic = {t_stat:.3f}, p-value = {p_val:.4f}")
print("Conclusion:", "Significant difference" if p_val < 0.05 else "No significant difference")
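
The example above assigns each participant by an independent coin flip, so the two group sizes will usually differ. A minimal sketch of one alternative, complete randomization with fixed group sizes, is shown below; it also checks that a baseline covariate ends up similar in both groups. The `age` covariate, the seed, and the function name `randomize_equal_groups` are assumptions made for illustration and are not part of the lesson's data.

```python
import numpy as np

def randomize_equal_groups(n, rng):
    """Assign exactly half of n subjects to treatment by shuffling labels."""
    labels = np.array(['treatment'] * (n // 2) + ['control'] * (n - n // 2))
    return rng.permutation(labels)

rng = np.random.default_rng(42)   # hypothetical seed
n = 1000
age = rng.normal(40, 12, n)       # hypothetical baseline covariate

groups = randomize_equal_groups(n, rng)
n_treat = int((groups == 'treatment').sum())
n_ctrl = int((groups == 'control').sum())

# With random assignment, the covariate means should differ only by chance
mean_diff = age[groups == 'treatment'].mean() - age[groups == 'control'].mean()
print(f"treatment n={n_treat}, control n={n_ctrl}")
print(f"difference in mean age = {mean_diff:.2f} years")
```

Because assignment ignores age entirely, any remaining difference in mean age is chance variation, and it tends to shrink as the sample grows.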

Observational Studies

The researcher observes without intervening. Such studies can show association, but on their own they cannot establish causation, because confounding may explain the result.

Types of Observational Studies

| Type | Direction in Time | Strength of Evidence | Example |
|---|---|---|---|
| Cross-sectional | Snapshot (no time) | Weak | Survey of current diet and BMI |
| Case-Control | Backward (retrospective) | Moderate | Lung cancer patients vs. controls → smoking history |
| Cohort | Forward (prospective) | Strong | Follow smokers vs. non-smokers for 20 years |

# Observational study simulation: coffee and productivity
# True causal structure: Exercise → (Coffee consumption + Productivity)
# Naive analysis might conclude coffee CAUSES productivity

np.random.seed(0)
n = 500

exercise = np.random.normal(5, 2, n)  # hours/week — the true cause
coffee = 0.5 * exercise + np.random.normal(2, 1, n)  # coffee correlated with exercise
productivity = 0.8 * exercise + np.random.normal(7, 2, n)  # productivity caused by exercise

# Naive correlation
from scipy.stats import pearsonr
r_naive, p_naive = pearsonr(coffee, productivity)
print(f"Coffee × Productivity correlation: r = {r_naive:.3f}, p = {p_naive:.4f}")

# Partial correlation (controlling for exercise) shows the truth
from numpy.linalg import lstsq
# Residualize out exercise: regress each variable on exercise, keep what is left over
X = np.column_stack([np.ones(n), exercise])  # design matrix: intercept + exercise
coffee_resid = coffee - X @ lstsq(X, coffee, rcond=None)[0]
prod_resid = productivity - X @ lstsq(X, productivity, rcond=None)[0]

r_partial, p_partial = pearsonr(coffee_resid, prod_resid)
print(f"Coffee × Productivity (controlling exercise): r = {r_partial:.3f}, p = {p_partial:.4f}")
print("\nCoffee's apparent effect was mostly due to exercise (confounding)!")
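
Another way to see the same confounding, sketched here under the same assumed data-generating process, is to fit two least-squares regressions: productivity on coffee alone, and productivity on coffee plus exercise. The sample size, seed, and variable names below are illustrative assumptions. Coffee has no causal effect in this simulation, so its coefficient should shrink toward zero once exercise is included.

```python
import numpy as np

rng = np.random.default_rng(0)    # hypothetical seed
n = 5000                          # larger sample to reduce noise (assumed)

exercise = rng.normal(5, 2, n)
coffee = 0.5 * exercise + rng.normal(2, 1, n)
productivity = 0.8 * exercise + rng.normal(7, 2, n)

# Model 1: productivity ~ coffee (exercise omitted)
X_naive = np.column_stack([np.ones(n), coffee])
beta_naive = np.linalg.lstsq(X_naive, productivity, rcond=None)[0]

# Model 2: productivity ~ coffee + exercise (confounder included)
X_adj = np.column_stack([np.ones(n), coffee, exercise])
beta_adj = np.linalg.lstsq(X_adj, productivity, rcond=None)[0]

coffee_naive = beta_naive[1]
coffee_adjusted = beta_adj[1]
print(f"Coffee coefficient, exercise omitted:  {coffee_naive:.3f}")
print(f"Coffee coefficient, exercise included: {coffee_adjusted:.3f}")
```

In real observational data this adjustment only works for confounders that were actually measured; unmeasured ones remain in the estimate.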

Survey Methods

Key Survey Design Principles

1. Question Wording

  • Clear, unambiguous language
  • Avoid leading questions: "Don't you agree that X is better?" → BAD
  • Neutral framing: "Which do you prefer, X or Y?" → BETTER

2. Response Options

  • Exhaustive (cover all possibilities)
  • Mutually exclusive (no overlap)
  • Balanced scale (equal positive and negative options)
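
For numeric answer choices such as age brackets, the first two properties can be checked mechanically. The sketch below assumes integer-valued brackets given as inclusive `(low, high)` pairs; the function name `check_brackets` and the example brackets are hypothetical.

```python
def check_brackets(brackets, lowest, highest):
    """Find gaps and overlaps in inclusive integer brackets over [lowest, highest]."""
    gaps, overlaps = [], []
    for value in range(lowest, highest + 1):
        hits = sum(low <= value <= high for low, high in brackets)
        if hits == 0:
            gaps.append(value)        # not exhaustive: no option fits
        elif hits > 1:
            overlaps.append(value)    # not mutually exclusive: several options fit
    return gaps, overlaps

# Flawed design: 25 fits two options, and ages 45-54 fit none
flawed = [(18, 25), (25, 34), (35, 44), (55, 99)]
gaps, overlaps = check_brackets(flawed, 18, 99)
print("first gaps:", gaps[:3], "overlaps:", overlaps)

# Repaired design: every age from 18 to 99 fits exactly one option
fixed = [(18, 24), (25, 34), (35, 44), (45, 54), (55, 99)]
print(check_brackets(fixed, 18, 99))
```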

3. Survey Modes

| Mode | Cost | Response Rate | Coverage | Best For |
|---|---|---|---|---|
| In-person | High | ~80% | Limited | Detailed interviews |
| Phone | Medium | ~15-30% | Broad | National surveys |
| Mail | Low-Medium | ~10-20% | Very broad | Sensitive topics |
| Online | Very low | ~10-30% | Internet users | Large-scale, fast |

# Simulating response bias in surveys
# Suppose true population approval = 55%

np.random.seed(1)
true_approval = 0.55

# Random phone sample (representative)
n_phone = 1000
phone_responses = np.random.binomial(1, true_approval, n_phone)
print(f"Phone survey: {phone_responses.mean():.3f} (true: {true_approval})")

# Online opt-in sample (selection bias — engaged users more opinionated)
# People who disapprove are angrier and more likely to respond
prob_respond_approve = 0.3
prob_respond_disapprove = 0.6
population = np.random.binomial(1, true_approval, 10000)
responded = np.where(population == 1,
                     np.random.binomial(1, prob_respond_approve, 10000),
                     np.random.binomial(1, prob_respond_disapprove, 10000))
online_sample = population[responded == 1]
print(f"Online opt-in: {online_sample.mean():.3f} (true: {true_approval})")
print("Selection bias introduced error!")
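
The size of this bias can also be worked out directly rather than simulated. As a rough sketch: if a fraction p of the population approves, approvers respond with probability a, and disapprovers respond with probability d, then the expected approval share among respondents is p·a / (p·a + (1 − p)·d). The helper name `expected_optin_share` is hypothetical; the inputs are the values used in the simulation above.

```python
def expected_optin_share(p_approve, respond_if_approve, respond_if_disapprove):
    """Expected approval share among people who choose to respond."""
    approve_responders = p_approve * respond_if_approve
    disapprove_responders = (1 - p_approve) * respond_if_disapprove
    return approve_responders / (approve_responders + disapprove_responders)

# Values from the simulation: 0.165 / (0.165 + 0.270)
biased = expected_optin_share(0.55, 0.3, 0.6)
print(f"Expected opt-in estimate: {biased:.3f}")

# Equal response rates leave the estimate unbiased
unbiased = expected_optin_share(0.55, 0.3, 0.3)
print(f"With equal response rates: {unbiased:.3f}")
```

Collecting more opt-in responses does not shrink this gap, because it comes from who responds rather than from how many respond.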

Common Data Collection Errors

| Error Type | Description | Example |
|---|---|---|
| Sampling error | Random variation from sample to sample | Poll shows 52% support; true value is 50% |
| Coverage error | Population not fully covered | Phone survey misses people without phones |
| Nonresponse error | Non-responders differ systematically | Dissatisfied customers less likely to respond |
| Measurement error | Inaccurate responses | People underreport alcohol consumption |
| Processing error | Data entry mistakes | Mistyped values during transcription |
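
Under simple random sampling, sampling error is the one row in this table that can be quantified from the sample size. The sketch below simulates repeated polls from a population with 50% true support and compares the spread of the estimates with the standard error formula sqrt(p(1 − p)/n). The poll size of 1,000, the number of repetitions, and the seed are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)    # hypothetical seed
true_support = 0.50
n_poll = 1000                     # respondents per poll (assumed)
n_repeats = 5000                  # number of simulated polls (assumed)

# Each simulated poll yields one estimate of support
estimates = rng.binomial(n_poll, true_support, n_repeats) / n_poll

simulated_se = estimates.std()
theoretical_se = np.sqrt(true_support * (1 - true_support) / n_poll)
share_off_by_2pts = np.mean(np.abs(estimates - true_support) >= 0.02)

print(f"Simulated SE:   {simulated_se:.4f}")
print(f"Theoretical SE: {theoretical_se:.4f}")
print(f"Polls off by 2+ points: {share_off_by_2pts:.1%}")
```

The other error types in the table are systematic, so a larger sample does not reduce them the way it reduces sampling error.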

Key Takeaways

  1. Randomized experiments are the most reliable way to establish causation
  2. Observational studies show association — confounding is always a threat
  3. Surveys require careful design — question wording and mode affect results dramatically
  4. Nonresponse bias is one of the biggest practical threats to survey validity
  5. Secondary data is valuable but comes with limitations you didn't control
  6. Pre-register your study design before collecting data to avoid p-hacking
