Sampling Techniques

Sampling Theory

The Art of Choosing Who Gets Counted

How you sample determines what you can conclude. The wrong method turns a million-dollar study into a confident wrong answer.

Key things this concept helps with:

Simple Random Sampling — The gold standard when you have a complete list and every voice deserves equal weight
Stratified Sampling — Guaranteed representation from every subgroup, delivering sharper estimates where it matters
Cluster Sampling — Practical precision when the population is spread across geography or organizations
Design Effect — Quantifying exactly how much efficiency you gain or lose with complex designs

Every dataset begins with a sample — choose your sampling method wisely, or your conclusions stand on sand.

What is Sampling?

Definition

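Sampling is the process of selecting a subset of a population in order to estimate characteristics of the whole population. Choosing the right sampling method is crucial: different methods trade off statistical efficiency, practical feasibility, and cost.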

Simple Random Sampling (SRS)

When to use: When you have a complete list (sampling frame) and no reason to believe subgroups differ.

import numpy as np
import pandas as pd

np.random.seed(42)

# Simulate a population of 1,000 employees
population = pd.DataFrame({
    'employee_id': range(1, 1001),
    'department': np.random.choice(['Engineering', 'Sales', 'HR', 'Marketing'], 1000,
                                   p=[0.4, 0.3, 0.15, 0.15]),
    'salary': np.random.normal(75000, 15000, 1000).round(2)
})

# Simple random sample: select 50 employees
srs = population.sample(n=50, random_state=42)
print("SRS — Department distribution:")
print(srs['department'].value_counts(normalize=True).round(3))
print(f"SRS mean salary: ${srs['salary'].mean():,.2f}")
print(f"True mean salary: ${population['salary'].mean():,.2f}")

Advantages: Unbiased, easy to implement, solid theoretical foundation

Disadvantages: Requires a complete sampling frame, may miss small subgroups
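Because an SRS is drawn without replacement from a finite population, the standard error of the sample mean can include the finite population correction, 1 - n/N. The following is a minimal sketch of that calculation; the helper name `srs_mean_with_se` and the simulated salaries are illustrative assumptions, not part of the example above.

```python
import numpy as np

def srs_mean_with_se(values, population_size):
    """Return (mean, standard error) for a simple random sample drawn without replacement."""
    values = np.asarray(values, dtype=float)
    n = values.size
    if n < 2:
        raise ValueError("need at least two observations to estimate a standard error")
    if n > population_size:
        raise ValueError("sample cannot be larger than the population")
    sample_var = values.var(ddof=1)      # unbiased sample variance
    fpc = 1.0 - n / population_size      # finite population correction
    se = np.sqrt(fpc * sample_var / n)
    return values.mean(), se

rng = np.random.default_rng(0)
# Hypothetical population of 1,000 salaries (an assumption for illustration)
salaries = rng.normal(75_000, 15_000, size=1_000)
sample = rng.choice(salaries, size=50, replace=False)

mean, se = srs_mean_with_se(sample, population_size=salaries.size)
print(f"Sample mean: {mean:,.2f}")
print(f"Standard error (with FPC): {se:,.2f}")
print(f"Approximate 95% interval: {mean - 1.96 * se:,.2f} to {mean + 1.96 * se:,.2f}")
```

When the sample is a small fraction of the population the correction is close to 1 and changes little; it matters once n/N is no longer negligible.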

Systematic Sampling

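Select every k-th unit from an ordered list, starting at a random position within the first k units.

Sampling interval: k = N/n (population size / sample size)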

def systematic_sample(df, n):
    """Select every k-th observation from a DataFrame."""
    N = len(df)
    k = N // n  # sampling interval
    start = np.random.randint(0, k)  # random starting point
    indices = range(start, N, k)
    return df.iloc[list(indices)[:n]]

systematic = systematic_sample(population, n=50)
print("\nSystematic Sample:")
print(f"Mean salary: ${systematic['salary'].mean():,.2f}")
print("Dept distribution:")
print(systematic['department'].value_counts(normalize=True).round(3))

Advantages: Simple to execute, good spread across the list

Disadvantages: Periodicity bias if the list has a pattern at interval k
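Periodicity bias is easiest to see with a list that repeats on a fixed cycle. The sketch below assumes hypothetical daily sales with a weekend bump (the data and the helper `systematic_indices` are made up for illustration) and compares an interval that matches the cycle with one that does not.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical daily sales for 52 weeks; days 5 and 6 of each week (the weekend) sell more
n_days = 364
day_of_week = np.arange(n_days) % 7
sales = 100 + 50 * (day_of_week >= 5) + rng.normal(0, 5, n_days)

def systematic_indices(N, k, start):
    """Positions start, start + k, start + 2k, ... below N."""
    return np.arange(start, N, k)

# k = 7 matches the weekly cycle, so every selected day falls on the same weekday
for start in (0, 5):
    idx = systematic_indices(n_days, k=7, start=start)
    print(f"k=7, start={start}: mean = {sales[idx].mean():.1f}")

# k = 10 does not line up with the cycle, so weekdays and weekends are mixed
idx = systematic_indices(n_days, k=10, start=3)
print(f"k=10, start=3: mean = {sales[idx].mean():.1f}")
print(f"True mean: {sales.mean():.1f}")
```

With k = 7 the estimate depends almost entirely on the random start, which is the failure mode to check for before sampling from any list with a known cycle.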

Stratified Sampling

Proportional Stratification

Sample size from each stratum ∝ stratum size.

Optimal Stratification

Sample more from strata with higher variability (Neyman allocation).

# Proportional stratified sample by department
n_total = 50

dept_sizes = population['department'].value_counts()
dept_proportions = dept_sizes / len(population)

print("Dept proportions:", dept_proportions.to_dict())

stratified_samples = []
for dept, prop in dept_proportions.items():
    n_dept = round(prop * n_total)
    dept_population = population[population['department'] == dept]
    sample = dept_population.sample(n=min(n_dept, len(dept_population)), random_state=42)
    stratified_samples.append(sample)

stratified = pd.concat(stratified_samples)
print(f"\nStratified sample n={len(stratified)}")
print(f"Mean salary: ${stratified['salary'].mean():,.2f}")

# Compare precision: stratified vs SRS
# Run many samples and compare standard errors
srs_means = [population.sample(50)['salary'].mean() for _ in range(1000)]
strat_means = []
for _ in range(1000):
    samples = [population[population['department']==d].sample(round(p*50))
               for d, p in dept_proportions.items()]
    strat_means.append(pd.concat(samples)['salary'].mean())

print(f"\nSE of SRS mean:        ${np.std(srs_means):,.2f}")
print(f"SE of Stratified mean: ${np.std(strat_means):,.2f}")
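# Note: in this simulation salary is generated independently of department,
# so the two standard errors should come out close to each other.
# Stratification gains precision only when strata differ on the outcome.
print("Stratified sampling is more efficient when strata means differ; here they do not, so expect similar SEs.")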
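Neyman allocation, mentioned under Optimal Stratification, assigns each stratum a share of the sample proportional to N_h × S_h (stratum size times stratum standard deviation). The sketch below assumes the stratum standard deviations are known or can be estimated from pilot data, which is often not the case in practice; the function name `neyman_allocation` and the population `pop_demo` are hypothetical.

```python
import numpy as np
import pandas as pd

def neyman_allocation(df, stratum_col, value_col, n_total):
    """Allocate n_total across strata in proportion to N_h * S_h."""
    grouped = df.groupby(stratum_col)[value_col]
    sizes = grouped.size()                    # N_h
    weights = sizes * grouped.std(ddof=1)     # N_h * S_h
    raw = n_total * weights / weights.sum()
    # Rounding can make the total differ slightly from n_total.
    # Keep at least one unit per stratum and never exceed the stratum size.
    alloc = raw.round().astype(int).clip(lower=1)
    return np.minimum(alloc, sizes)

rng = np.random.default_rng(2)
# Hypothetical population: Sales salaries vary far more than HR salaries
pop_demo = pd.DataFrame({
    "department": np.repeat(["Engineering", "Sales", "HR"], [400, 400, 200]),
    "salary": np.concatenate([
        rng.normal(90_000, 10_000, 400),
        rng.normal(70_000, 30_000, 400),
        rng.normal(60_000, 5_000, 200),
    ]),
})

allocation = neyman_allocation(pop_demo, "department", "salary", n_total=50)
print(allocation)
print("Total allocated:", int(allocation.sum()))
```

Compared with proportional allocation, most of the sample goes to the high-variance Sales stratum, where each extra observation reduces uncertainty the most.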

Cluster Sampling

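Randomly select whole groups (clusters), then survey the units inside the selected groups.

Unlike stratified sampling, where each stratum should be internally homogeneous, cluster sampling works best when each cluster is internally heterogeneous, so that every cluster resembles a small version of the population.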

# Cluster sampling: schools in a district
# Population: 100 schools, each with 50 students

n_schools = 100
students_per_school = 50

schools = pd.DataFrame({
    'school_id': range(1, n_schools+1),
    'district': np.repeat(['North', 'South', 'East', 'West'], 25),
})

# Expand to students
students = schools.loc[schools.index.repeat(students_per_school)].reset_index(drop=True)
students['score'] = np.random.normal(70, 12, len(students))

# Select 10 clusters (schools) at random
selected_schools = np.random.choice(schools['school_id'], size=10, replace=False)
cluster_sample = students[students['school_id'].isin(selected_schools)]

print(f"Cluster sample: {len(cluster_sample)} students from {len(selected_schools)} schools")
print(f"Mean score: {cluster_sample['score'].mean():.2f}")
print(f"True mean score: {students['score'].mean():.2f}")

Advantages: Practical when the population is geographically spread, no need for a complete list of individuals

Disadvantages: Less precise than SRS (intra-cluster correlation inflates variance), needs a design effect correction

Comparison Summary

Method	Cost	Precision	Best When
SRS	Medium	Moderate	Homogeneous population, complete frame available
Systematic	Low	Moderate	Ordered list, no periodicity
Stratified	Medium	High	Subgroups differ on outcome variable
Cluster	Low	Lower	No complete frame, clustered population
Multistage	Medium	Moderate	Large national surveys

Design Effect (DEFF)

DEFF greater than 1: Complex sample is less efficient than SRS (common in cluster sampling)
DEFF less than 1: Complex sample is more efficient (common in stratified sampling)
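For reference, the design effect is usually written as a ratio of variances for the same estimator and the same sample size. The notation below is an assumption added for illustration: θ̂ is the estimate, m is the (equal) cluster size, and ρ is the intraclass correlation.

```latex
\mathrm{DEFF} = \frac{\operatorname{Var}_{\text{design}}(\hat{\theta})}{\operatorname{Var}_{\text{SRS}}(\hat{\theta})},
\qquad
n_{\text{eff}} = \frac{n}{\mathrm{DEFF}},
\qquad
\mathrm{DEFF}_{\text{cluster}} \approx 1 + (m - 1)\,\rho
```

The last expression is an approximation for single-stage cluster sampling with equal cluster sizes.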

# Design effect example for cluster sampling
# Intraclass correlation (ICC) measures similarity within clusters
def design_effect_cluster(icc, cluster_size):
    """Calculate design effect for cluster sampling."""
    return 1 + (cluster_size - 1) * icc

icc = 0.3  # moderate intraclass correlation
cluster_size = 50  # students per school

deff = design_effect_cluster(icc, cluster_size)
print(f"ICC = {icc}, cluster size = {cluster_size}")
print(f"Design Effect = {deff:.2f}")
print("Effective sample size = actual n / DEFF")
print(f"To get precision of n=500 SRS, need {500*deff:.0f} cluster sample observations")
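The formula can be checked against a simulation. The sketch below builds a hypothetical clustered population with a known ICC (all sizes and variances are assumptions), draws many cluster samples and many SRS samples of the same size, and compares the variances of the resulting means.

```python
import numpy as np

rng = np.random.default_rng(3)

n_clusters, m = 200, 20                   # hypothetical: 200 schools, 20 students each
sigma_between, sigma_within = 6.0, 10.0
icc = sigma_between**2 / (sigma_between**2 + sigma_within**2)

# Score = overall mean + school effect + student-level noise
school_effect = rng.normal(0, sigma_between, size=(n_clusters, 1))
scores = 70 + school_effect + rng.normal(0, sigma_within, size=(n_clusters, m))
flat = scores.ravel()

n_sampled_clusters = 10
n_obs = n_sampled_clusters * m            # same number of students in both designs
reps = 2000

cluster_means = np.empty(reps)
srs_means = np.empty(reps)
for r in range(reps):
    chosen = rng.choice(n_clusters, size=n_sampled_clusters, replace=False)
    cluster_means[r] = scores[chosen].mean()
    srs_means[r] = rng.choice(flat, size=n_obs, replace=False).mean()

deff_empirical = cluster_means.var(ddof=1) / srs_means.var(ddof=1)
deff_formula = 1 + (m - 1) * icc

print(f"ICC (by construction): {icc:.3f}")
print(f"DEFF from formula:     {deff_formula:.2f}")
print(f"DEFF from simulation:  {deff_empirical:.2f}")
print(f"Effective sample size of {n_obs} clustered observations: {n_obs / deff_formula:.0f}")
```

The two DEFF values should be of similar magnitude but will not match exactly, since the simulated population is finite and the simulation itself is noisy.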

Sampling in Machine Learning

Sampling Method	ML Use Case	Why
Simple Random	Train/test split	Unbiased performance estimate
Stratified	Classification splits	Preserve class balance
Systematic	Time series splits	Respect temporal order
Cluster	Group-wise splits	Keep correlated rows (same user, site, or patient) together to avoid leakage
Bootstrap	Bagging, Random Forests	Ensemble diversity

from sklearn.model_selection import train_test_split, StratifiedKFold
import numpy as np
import pandas as pd

np.random.seed(42)

# Simulated customer dataset
n = 2000
X = np.random.randn(n, 5)
y = (X[:,0] + X[:,1] > 0).astype(int)  # binary classification

# 1. Simple Random Split (standard ML)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("=== Random Split ===")
print(f"Train class balance: {y_train.mean():.3f}")
print(f"Test class balance:  {y_test.mean():.3f}")

# 2. Stratified Split (preserves class balance)
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print("\n=== Stratified Split ===")
print(f"Train class balance: {y_train_s.mean():.3f}")
print(f"Test class balance:  {y_test_s.mean():.3f}")

# 3. Stratified K-Fold Cross-Validation (repeated stratified splits)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    print(f"Fold {fold+1}: train={len(train_idx)}, val={len(val_idx)}, "
          f"train_balance={y[train_idx].mean():.3f}")
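The bootstrap row in the table refers to sampling rows with replacement, which is what bagging does before fitting each base model. A minimal sketch with NumPy only, assuming a dataset of 2,000 rows (the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000  # number of rows in the hypothetical training set

# A bootstrap sample draws n row indices with replacement
boot_idx = rng.integers(0, n, size=n)
in_bag = np.unique(boot_idx)
out_of_bag = np.setdiff1d(np.arange(n), in_bag)

print(f"Unique rows in the bootstrap sample: {in_bag.size} ({in_bag.size / n:.1%})")
print(f"Out-of-bag rows: {out_of_bag.size} ({out_of_bag.size / n:.1%})")
# In expectation about 1 - 1/e (roughly 63.2%) of rows appear at least once
```

The out-of-bag rows are never seen by the model fitted on that bootstrap sample, which is why they can serve as a built-in validation set in bagged ensembles.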

Key Takeaways

SRS is the theoretical gold standard — but often impractical when populations are large or spread out.

Stratified sampling boosts precision when subgroups differ on your outcome variable.

Cluster sampling trades precision for practicality — use design effect to quantify the cost.

Always account for your sampling design when computing standard errors: ignoring clustering typically understates uncertainty, while ignoring stratification can overstate it.

The best sampling method is not the one that is easiest — it is the one that gets you closest to the truth with the resources you have.
