Sampling Techniques
The sampling method determines how well a sample represents the population and how precise the resulting estimates are. Different methods trade off statistical efficiency, practical feasibility, and cost.
Simple Random Sampling (SRS)
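Every possible sample of size n has the same probability of being selected, so every individual in the population has an equal chance of selection. Samples are usually drawn without replacement, which means individual draws are not strictly independent.
When to use: When you have a complete list (sampling frame) and no reason to believe subgroups differ on the outcome of interest.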
import numpy as np
import pandas as pd
np.random.seed(42)
# Simulate a population of 1,000 employees
population = pd.DataFrame({
    'employee_id': range(1, 1001),
    'department': np.random.choice(['Engineering', 'Sales', 'HR', 'Marketing'], 1000,
                                   p=[0.4, 0.3, 0.15, 0.15]),
    'salary': np.random.normal(75000, 15000, 1000).round(2)
})
# Simple random sample: select 50 employees
srs = population.sample(n=50, random_state=42)
print("SRS — Department distribution:")
print(srs['department'].value_counts(normalize=True).round(3))
print(f"SRS mean salary: ${srs['salary'].mean():,.2f}")
print(f"True mean salary: ${population['salary'].mean():,.2f}")
Advantages: Unbiased, easy to implement, backed by well-understood theory
Disadvantages: Requires a complete sampling frame; small subgroups may be underrepresented or missed entirely
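To put a number on the risk of missing a small subgroup, the sketch below computes the chance that an SRS drawn without replacement contains nobody from a subgroup. The function name `prob_subgroup_missed` and the subgroup size of 10 are assumptions made for illustration; the population and sample sizes match the example above.

```python
from math import comb

def prob_subgroup_missed(N, n, g):
    """P(an SRS of size n, drawn without replacement from N units,
    contains none of the g units in a subgroup)."""
    # Count the samples that avoid the subgroup, divided by all possible samples
    return comb(N - g, n) / comb(N, n)

# Hypothetical numbers: 1,000 employees, sample of 50, subgroup of 10 people
p_missed = prob_subgroup_missed(N=1000, n=50, g=10)
print(f"P(sample contains nobody from the subgroup) = {p_missed:.3f}")
```

If that probability is uncomfortably high for a subgroup you care about, stratifying on the subgroup guarantees it is represented.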
Systematic Sampling
Select every k-th individual from an ordered list, starting at a random point within the first k entries.
Sampling interval: k = N/n (population size / sample size), rounded down to an integer when N is not a multiple of n
def systematic_sample(df, n):
    """Select every k-th row of df, starting from a random offset in [0, k)."""
    N = len(df)
    if not 0 < n <= N:
        raise ValueError("n must be between 1 and len(df)")
    k = N // n  # sampling interval (integer part of N / n)
    start = np.random.randint(0, k)  # random starting point, 0 <= start < k
    indices = range(start, N, k)
    # When N is not a multiple of n the range can yield extra rows; keep the first n
    return df.iloc[list(indices)[:n]]

systematic = systematic_sample(population, n=50)
print("\nSystematic Sample:")
print(f"Mean salary: ${systematic['salary'].mean():,.2f}")
print("Dept distribution:")
print(systematic['department'].value_counts(normalize=True).round(3))
Advantages: Simple to execute, good spread across list
Disadvantages: Periodicity bias if the list has a pattern at interval k
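The periodicity problem is easiest to see on a frame with a built-in cycle. The sketch below uses a made-up series of daily sales with a weekend bump (all values and the helper `systematic_means` are hypothetical) and compares an interval that matches the weekly cycle, k = 7, with one that does not, k = 8.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frame: 364 days of sales with a weekly cycle
# (positions 5 and 6 within each week are weekend days with higher sales).
days = np.arange(364)
is_weekend = (days % 7) >= 5
sales = np.where(is_weekend, 150.0, 100.0) + rng.normal(0, 5, size=days.size)
true_mean = sales.mean()

def systematic_means(values, k):
    """Return the sample mean for every possible starting point 0..k-1."""
    return np.array([values[start::k].mean() for start in range(k)])

means_k7 = systematic_means(sales, k=7)  # interval lines up with the weekly cycle
means_k8 = systematic_means(sales, k=8)  # interval drifts through the days of the week

print(f"True mean: {true_mean:.1f}")
print(f"k=7 sample means range: {means_k7.min():.1f} to {means_k7.max():.1f}")
print(f"k=8 sample means range: {means_k8.min():.1f} to {means_k8.max():.1f}")
```

With k = 7 every sample consists of a single day of the week, so the estimate depends almost entirely on the random start. With k = 8 each sample cycles through all seven days and the estimates stay close to the true mean.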
Stratified Sampling
Divide the population into non-overlapping strata (subgroups) based on a known variable, then draw an independent random sample from each stratum.
Proportional Stratification
Sample size from each stratum ∝ stratum size.
Optimal Stratification
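Allocate more of the sample to strata that are larger and more variable. Under Neyman allocation, the sample size for stratum h is proportional to N_h × S_h, where N_h is the stratum size and S_h is the standard deviation of the outcome within the stratum.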
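As a rough sketch of Neyman allocation, the function below splits a total sample size across strata in proportion to stratum size times stratum standard deviation. The function name `neyman_allocation` and the standard deviations are assumptions for illustration; the stratum sizes follow the 40/30/15/15 split used for the simulated population above. In practice the standard deviations are unknown and have to be estimated, for example from a pilot study or earlier data.

```python
import pandas as pd

def neyman_allocation(stratum_sizes, stratum_sds, n_total):
    """Allocate n_total across strata in proportion to N_h * S_h."""
    sizes = pd.Series(stratum_sizes, dtype=float)
    sds = pd.Series(stratum_sds, dtype=float)
    weights = sizes * sds
    raw = n_total * weights / weights.sum()
    # Rounding can leave the total off by one or two; adjust by hand if needed
    return raw.round().astype(int)

# Hypothetical inputs: the salary standard deviations are made up
sizes = {'Engineering': 400, 'Sales': 300, 'HR': 150, 'Marketing': 150}
sds = {'Engineering': 20000, 'Sales': 25000, 'HR': 8000, 'Marketing': 12000}

allocation = neyman_allocation(sizes, sds, n_total=50)
print(allocation)
```

Compared with proportional allocation (20/15/7.5/7.5 here), the highly variable Sales stratum receives more of the sample and the low-variance HR stratum receives less.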
# Proportional stratified sample by department
n_total = 50
dept_sizes = population['department'].value_counts()
dept_proportions = dept_sizes / len(population)
print("Dept proportions:", dept_proportions.to_dict())

stratified_samples = []
for dept, prop in dept_proportions.items():
    # Each stratum is rounded separately, so the total may differ slightly from n_total
    n_dept = round(prop * n_total)
    dept_population = population[population['department'] == dept]
    sample = dept_population.sample(n=min(n_dept, len(dept_population)), random_state=42)
    stratified_samples.append(sample)

stratified = pd.concat(stratified_samples)
print(f"\nStratified sample n={len(stratified)}")
print(f"Mean salary: ${stratified['salary'].mean():,.2f}")
# Compare precision: stratified vs SRS
# Draw many samples with each design and compare the spread of the sample means
srs_means = [population.sample(50)['salary'].mean() for _ in range(1000)]
strat_means = []
for _ in range(1000):
    samples = [population[population['department'] == d].sample(round(p * 50))
               for d, p in dept_proportions.items()]
    strat_means.append(pd.concat(samples)['salary'].mean())
print(f"\nSE of SRS mean: ${np.std(srs_means):,.2f}")
print(f"SE of Stratified mean: ${np.std(strat_means):,.2f}")
# Salary was simulated independently of department, so expect the two SEs to be close here
print("Stratified sampling is more efficient when strata differ in means.")
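In the simulation above, salary is drawn independently of department, so stratifying by department has little to gain. The sketch below repeats the comparison on a hypothetical population in which mean salary does depend on department. The department means, the within-department standard deviation of 10,000, and the sample size of 40 are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rs = np.random.RandomState(1)

# Hypothetical population in which mean salary depends on department
dept_means = {'Engineering': 95000, 'Sales': 70000, 'HR': 60000, 'Marketing': 65000}
dept_sizes = {'Engineering': 400, 'Sales': 300, 'HR': 150, 'Marketing': 150}

pop = pd.concat(
    [pd.DataFrame({'department': d,
                   'salary': rs.normal(dept_means[d], 10000, size)})
     for d, size in dept_sizes.items()],
    ignore_index=True,
)
strata = {d: group for d, group in pop.groupby('department')}

n_total = 40  # proportional allocation gives whole numbers: 16/12/6/6
alloc = {d: n_total * size // len(pop) for d, size in dept_sizes.items()}

srs_means, strat_means = [], []
for _ in range(500):
    srs_means.append(pop.sample(n_total, random_state=rs)['salary'].mean())
    parts = [strata[d].sample(n_d, random_state=rs) for d, n_d in alloc.items()]
    strat_means.append(pd.concat(parts)['salary'].mean())

se_srs = np.std(srs_means, ddof=1)
se_strat = np.std(strat_means, ddof=1)
print(f"SE of SRS mean:        ${se_srs:,.2f}")
print(f"SE of stratified mean: ${se_strat:,.2f}")
```

Stratification removes the between-department component of the variance, so the stratified standard error should come out noticeably smaller in this setup.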
Cluster Sampling
Divide population into clusters (naturally occurring groups), randomly select clusters, then survey all members within selected clusters.
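Unlike stratified sampling, where each stratum should be internally homogeneous, each cluster should ideally be heterogeneous (diverse internally), so that it resembles the population as a whole.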
# Cluster sampling: schools in a district
# Population: 100 schools, each with 50 students
n_schools = 100
students_per_school = 50
schools = pd.DataFrame({
    'school_id': range(1, n_schools + 1),
    'district': np.repeat(['North', 'South', 'East', 'West'], 25),
})
# Expand to students
students = schools.loc[schools.index.repeat(students_per_school)].reset_index(drop=True)
students['score'] = np.random.normal(70, 12, len(students))
# Select 10 clusters (schools) at random
selected_schools = np.random.choice(schools['school_id'], size=10, replace=False)
cluster_sample = students[students['school_id'].isin(selected_schools)]
print(f"Cluster sample: {len(cluster_sample)} students from {len(selected_schools)} schools")
print(f"Mean score: {cluster_sample['score'].mean():.2f}")
print(f"True mean score: {students['score'].mean():.2f}")
Advantages: Practical when the population is geographically spread out; no complete list of individuals is needed, only a list of clusters
Disadvantages: Usually less precise than SRS of the same size (intra-cluster correlation inflates variance); standard errors must be adjusted for the design effect
Comparison Summary
| Method | Cost | Precision | Best When |
|---|---|---|---|
| SRS | Medium | Moderate | Homogeneous population, complete frame available |
| Systematic | Low | Moderate | Ordered list, no periodicity |
| Stratified | Medium | High | Subgroups differ on outcome variable |
| Cluster | Low | Lower | No complete frame, clustered population |
| Multistage | Medium | Moderate | Large national surveys |
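Multistage sampling appears in the table without an example of its own. A minimal two-stage sketch, assuming a hypothetical frame of 100 schools with 50 students each: first draw a simple random sample of schools, then a simple random sample of students within each selected school. The stage sizes (20 schools, 10 students per school) are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Hypothetical frame: 100 schools with 50 students each
students = pd.DataFrame({
    'school_id': np.repeat(np.arange(1, 101), 50),
    'score': rng.normal(70, 12, 100 * 50),
})

# Stage 1: simple random sample of 20 schools
stage1_schools = rng.choice(students['school_id'].unique(), size=20, replace=False)

# Stage 2: simple random sample of 10 students within each selected school
in_stage1 = students[students['school_id'].isin(stage1_schools)]
two_stage = in_stage1.groupby('school_id').sample(n=10, random_state=2)

print(f"Two-stage sample: {len(two_stage)} students "
      f"from {two_stage['school_id'].nunique()} schools")
print(f"Mean score: {two_stage['score'].mean():.2f}")
```

Because every school here has the same size and the same number of students is drawn from each, every student has the same overall chance of selection (20/100 × 10/50). With unequal cluster sizes, the estimates would need weights.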
Design Effect (DEFF)
The design effect compares the variance of an estimator under a complex sample design to the variance it would have under SRS with the same sample size:
DEFF = Var_complex(estimate) / Var_SRS(estimate)
- DEFF > 1: Complex sample is less efficient than SRS (common in cluster sampling)
- DEFF < 1: Complex sample is more efficient (common in stratified sampling)
# Design effect example for cluster sampling
# Intraclass correlation (ICC) measures similarity within clusters
def design_effect_cluster(icc, cluster_size):
    """Design effect for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc

icc = 0.3 # moderate intraclass correlation
cluster_size = 50 # students per school
deff = design_effect_cluster(icc, cluster_size)
print(f"ICC = {icc}, cluster size = {cluster_size}")
print(f"Design Effect = {deff:.2f}")
print("Effective sample size = actual n / DEFF")
print(f"To get precision of n=500 SRS, need {500*deff:.0f} cluster sample observations")
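The formula can be checked against a simulation. The sketch below builds a hypothetical population in which students in the same school share a school-level effect (the standard deviations 6 and 10 are made-up variance components), then compares the variance of the mean under one-stage cluster sampling with the variance under SRS of the same size.

```python
import numpy as np

rng = np.random.default_rng(3)

n_schools, m = 100, 50             # clusters and students per cluster
sd_between, sd_within = 6.0, 10.0  # hypothetical variance components
icc = sd_between**2 / (sd_between**2 + sd_within**2)

# Score = overall mean + shared school effect + individual noise
school_effect = rng.normal(0, sd_between, size=(n_schools, 1))
scores = 70 + school_effect + rng.normal(0, sd_within, size=(n_schools, m))

n_clusters = 10
n = n_clusters * m                 # 500 students under either design
flat = scores.ravel()

cluster_means, srs_means = [], []
for _ in range(2000):
    chosen = rng.choice(n_schools, size=n_clusters, replace=False)
    cluster_means.append(scores[chosen].mean())  # all students in chosen schools
    srs_means.append(rng.choice(flat, size=n, replace=False).mean())

deff_empirical = np.var(cluster_means) / np.var(srs_means)
deff_formula = 1 + (m - 1) * icc
print(f"ICC = {icc:.3f}")
print(f"Empirical DEFF = {deff_empirical:.2f}")
print(f"Formula DEFF   = {deff_formula:.2f}")
```

The two numbers will not match exactly, because the simulated population is finite and the school effects are a single random draw, but they should land in the same neighborhood.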
Key Takeaways
- SRS is the theoretical baseline against which other designs are judged, but it is often impractical at scale
- Stratified sampling improves precision when strata differ on the outcome
- Cluster sampling sacrifices precision for practicality
- Systematic sampling is quick but beware periodic patterns in the frame
- Multistage sampling combines methods, for example sampling clusters and then individuals within them; it's how major national surveys work
- Always account for your sampling design when estimating standard errors