Sampling Methods & Distributions

ℹ️ Why It Matters

In the real world, examining an entire population is almost never feasible: it is too expensive, too time-consuming, or literally impossible. Sampling is the disciplined art of selecting a subset to learn about the whole. Done well, a random sample of about 1,000 people can estimate a national vote share to within roughly 3 percentage points at 95% confidence. Done poorly, even a million data points can mislead. Understanding sampling methods, bias, and sampling distributions is the bedrock of statistics, A/B testing, clinical trials, and machine learning.

Overview

Every dataset used in machine learning is a sample from some larger data-generating process. Simple random sampling (SRS) gives every subset of size $n$ the same probability of selection, making it the baseline for unbiased estimation. Stratified sampling divides the population into subgroups (strata) and samples within each, guaranteeing representation and reducing variance when strata differ. Cluster sampling selects whole groups and surveys everyone in them, which can sharply reduce cost for geographically dispersed populations. Systematic sampling picks every $k$-th individual after a random start; it is simple but vulnerable to periodicity in the list. The sampling distribution of a statistic describes how it varies across all possible samples, and the standard error (for the sample mean, $SE = \sigma/\sqrt{n}$) quantifies that variability. The Central Limit Theorem guarantees that sample means are approximately normal for large $n$, regardless of the shape of the population distribution, provided its variance is finite.
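The four designs described above can be sketched with Python's standard library. This is a minimal illustration, not a survey-grade implementation: the population of 1,000 unit ids, the four strata, the 50 clusters, and the names `stratum_of` and `cluster_of` are all hypothetical.

```python
import random

rng = random.Random(0)  # arbitrary seed, used only for reproducibility

# Hypothetical population: 1,000 unit ids, 4 strata, 50 clusters of 20 units.
N, n = 1000, 100
ids = list(range(N))
stratum_of = {i: i % 4 for i in ids}    # hypothetical stratum labels
cluster_of = {i: i // 20 for i in ids}  # hypothetical cluster labels

# Simple random sampling: every subset of size n is equally likely.
srs = rng.sample(ids, n)

# Stratified sampling (proportional allocation): an SRS inside each stratum.
stratified = []
for h in range(4):
    members = [i for i in ids if stratum_of[i] == h]
    stratified += rng.sample(members, n // 4)

# One-stage cluster sampling: draw 5 whole clusters, keep every unit in them.
chosen = set(rng.sample(range(50), 5))
cluster_sample = [i for i in ids if cluster_of[i] in chosen]

# Systematic sampling: random start in [0, k), then every k-th unit.
k = N // n
start = rng.randrange(k)
systematic = ids[start::k]
```

Each design returns 100 units here, but only the stratified sample is guaranteed to contain exactly 25 units from every stratum.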

Key Concepts

Key Estimators

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \quad s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2, \quad \hat{p} = \frac{\text{successes}}{n}$$

Here,

$\bar{X}$ = Sample mean, an unbiased estimator of $\mu$
$s^2$ = Sample variance (Bessel-corrected), an unbiased estimator of $\sigma^2$
$\hat{p}$ = Sample proportion, an unbiased estimator of $\pi$
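As a rough sketch, the three estimators map directly onto the standard-library `statistics` module. The `incomes` and `clicked` lists are made-up data for illustration only.

```python
import statistics

incomes = [3200, 4100, 2900, 5600, 4800, 3900]   # hypothetical sample
x_bar = statistics.fmean(incomes)                # sample mean
s2 = statistics.variance(incomes)                # divides by n - 1 (Bessel-corrected)
biased = statistics.pvariance(incomes)           # divides by n, shown for contrast

clicked = [1, 0, 0, 1, 1, 0, 0, 0]               # hypothetical 0/1 outcomes
p_hat = sum(clicked) / len(clicked)              # sample proportion
```

The two variance values differ by the factor $n/(n-1)$, which matters most for small samples.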

Standard Error of the Mean

$$SE(\bar{X}) = \frac{\sigma}{\sqrt{n}}$$

Here,

$\sigma$ = Population standard deviation
$n$ = Sample size
$SE$ = Standard deviation of the sampling distribution of $\bar{X}$
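One way to see that the SE is the standard deviation of the sampling distribution is to simulate it. The sketch below assumes a normal population with hypothetical parameters (mean 50, $\sigma = 10$) and compares the spread of simulated sample means with $\sigma/\sqrt{n}$.

```python
import math
import random
import statistics

rng = random.Random(1)            # arbitrary seed for reproducibility
sigma, n, reps = 10.0, 25, 4000   # hypothetical population SD, sample size, replications

# Draw many samples of size n and record each sample mean.
means = [
    statistics.fmean([rng.gauss(50.0, sigma) for _ in range(n)])
    for _ in range(reps)
]

empirical_se = statistics.stdev(means)   # spread of the simulated sample means
theoretical_se = sigma / math.sqrt(n)    # sigma / sqrt(n) = 2.0
```

With a few thousand replications the two values should agree closely, though never exactly.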

Stratified Estimator

$$\bar{X}_{st} = \sum_{h=1}^{H} W_h \bar{X}_h, \quad \text{Var}(\bar{X}_{st}) = \sum_{h=1}^{H} W_h^2 \frac{\sigma_h^2}{n_h}$$

Here,

$H$ = Number of strata
$W_h = N_h / N$ = Population weight of stratum $h$
$\bar{X}_h$ = Sample mean within stratum $h$
$\sigma_h^2$ = Population variance within stratum $h$
$n_h$ = Sample size allocated to stratum $h$
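A small worked sketch of the stratified estimator follows. The three strata and their summary statistics are hypothetical, and sample variances stand in for the unknown $\sigma_h^2$, which is a common plug-in assumption rather than part of the formula itself.

```python
# Hypothetical strata: (population size N_h, sample mean, sample variance, n_h)
strata = [
    (6000, 42.0, 25.0, 60),
    (3000, 55.0, 64.0, 30),
    (1000, 90.0, 400.0, 10),
]

N = sum(N_h for N_h, _, _, _ in strata)

# Weighted mean: sum of W_h * xbar_h with W_h = N_h / N.
mean_st = sum((N_h / N) * xbar_h for N_h, xbar_h, _, _ in strata)

# Variance: sum of W_h^2 * s_h^2 / n_h (sample variances stand in for sigma_h^2).
var_st = sum((N_h / N) ** 2 * s2_h / n_h for N_h, _, s2_h, n_h in strata)
se_st = var_st ** 0.5
```

Here the weights are 0.6, 0.3, and 0.1, so the estimate is about 50.7 with a variance of about 0.742.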

Sample Size for Mean

$$n = \left(\frac{z_{\alpha/2} \cdot \sigma}{E}\right)^2$$

Here,

$E$ = Desired margin of error
$z_{\alpha/2}$ = Critical value (1.96 for 95%)
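The formula can be wrapped in a small helper. This sketch assumes $\sigma$ is known and rounds up, since a fractional observation is not possible; the function name `sample_size_mean` and the example inputs are illustrative.

```python
import math
from statistics import NormalDist

def sample_size_mean(sigma, margin, confidence=0.95):
    """Smallest n with z * sigma / sqrt(n) <= margin (sigma assumed known)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # z_{alpha/2}
    return math.ceil((z * sigma / margin) ** 2)

n_500 = sample_size_mean(sigma=4000, margin=500)   # 246
n_250 = sample_size_mean(sigma=4000, margin=250)   # 984
```

Halving the margin from 500 to 250 roughly quadruples the required sample size.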

Sample Size for Proportion

$$n = \frac{z_{\alpha/2}^2 \cdot \hat{p}(1 - \hat{p})}{E^2}$$

Here,

$\hat{p}$ = Prior estimate of proportion (use 0.5 if unknown)
$E$ = Desired margin of error
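A corresponding sketch for proportions, using the conservative default $\hat{p} = 0.5$ when no prior estimate exists. The function name `sample_size_proportion` is illustrative, and the margin is expressed as a fraction (0.03 means 3 percentage points).

```python
import math
from statistics import NormalDist

def sample_size_proportion(margin, p=0.5, confidence=0.95):
    """Smallest n with z * sqrt(p * (1 - p) / n) <= margin."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # z_{alpha/2}
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

n_3pt = sample_size_proportion(0.03)   # 1068
n_2pt = sample_size_proportion(0.02)   # 2401
```

Using $\hat{p} = 0.5$ maximizes $\hat{p}(1 - \hat{p})$, so these are upper bounds for the chosen margin and confidence level.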

Central Limit Theorem

$$\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty$$

Here,

$\bar{X}_n$ = Sample mean of $n$ observations
$\mu$ = Population mean
$\sigma$ = Population standard deviation
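The theorem can be checked empirically with a skewed population. The sketch below assumes an Exponential(1) population (mean 1, standard deviation 1) and a hypothetical sample size of 50, then standardizes many sample means as in the formula above.

```python
import math
import random
import statistics

rng = random.Random(2)     # arbitrary seed for reproducibility
mu = sigma = 1.0           # Exponential(rate=1) has mean 1 and SD 1
n, reps = 50, 5000         # hypothetical sample size and replication count

def standardized_mean():
    """Draw one sample of size n and return (xbar - mu) / (sigma / sqrt(n))."""
    xs = [rng.expovariate(1.0) for _ in range(n)]
    return (statistics.fmean(xs) - mu) / (sigma / math.sqrt(n))

z = [standardized_mean() for _ in range(reps)]

# If the normal approximation is good, about 95% of z fall within +/-1.96.
coverage = sum(abs(v) <= 1.96 for v in z) / reps
```

Rerunning with a much smaller `n` should show the coverage and the symmetry of `z` degrade, because the skew of the population has not yet averaged out.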

Sampling Methods Comparison

Method	How It Works	Best For	Key Advantage	Key Disadvantage
Simple Random	Equal chance for every subset of size n	Homogeneous populations	Unbiased, easy to analyze	Requires complete sampling frame
Stratified	Sample within each known subgroup	Heterogeneous populations	Lower variance than SRS	Requires prior knowledge of strata
Cluster	Sample clusters, survey everyone in them	Geographically dispersed populations	Cheaper than SRS	Higher variance due to intra-cluster correlation (ICC)
Systematic	Every k-th after random start	Ordered lists	Simple to implement	Vulnerable to periodicity
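The periodicity risk of systematic sampling is easy to demonstrate. The sketch below uses a hypothetical daily sales series with a weekly cycle; the numbers are invented to make the effect visible.

```python
import statistics

# Hypothetical daily sales with a weekly cycle: weekends (days 5 and 6) are higher.
sales = [100 + (50 if day % 7 in (5, 6) else 0) for day in range(364)]
true_mean = statistics.fmean(sales)

# Every 7th day always lands on the same weekday, so each start gives a biased mean.
means_k7 = [statistics.fmean(sales[start::7]) for start in range(7)]

# An interval that does not share the period mixes weekdays and weekends.
means_k10 = [statistics.fmean(sales[start::10]) for start in range(10)]
```

With an interval of 7, every possible sample sees only weekdays or only weekends, so the estimate is either 100 or 150 and never near the true mean of about 114.3. With an interval of 10, every start gives an estimate within a few units of it.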

Sampling Bias Types

Type	Description	Example
Selection bias	Sampling method excludes groups	Voluntary response surveys
Non-response bias	Selected individuals decline	Phone surveys missing workers
Survivorship bias	Only "surviving" cases observed	Studying successful companies only
Undercoverage bias	Some members have zero selection chance	Online-only surveys
Convenience sampling	Easiest-to-reach individuals	Surveying friends and family

Quick Example

📝Standard Error and Sample Size

The standard deviation of monthly incomes is \$4,000. You sample 64 people.

$$SE = \frac{\sigma}{\sqrt{n}} = \frac{4000}{\sqrt{64}} = \frac{4000}{8} = 500$$

The SE is \$500: the sample mean typically deviates from the true mean by about \$500. To halve the SE to \$250, and with it the margin of error, you need 4× the sample size (256 people), because $SE \propto 1/\sqrt{n}$.

📝Non-Response Bias

In a survey, 1,000 people are selected and 40% respond. Respondent mean income = \$55,000. Non-respondent mean = \$38,000.

True population mean: $\mu = 0.4 \times 55{,}000 + 0.6 \times 38{,}000 = \$44{,}800$.

Bias = $55{,}000 - 44{,}800 = \$10{,}200$, a 22.8% overestimate.
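The same arithmetic in code, with one extra step: the bias of the respondent mean equals the non-response rate times the gap between the two group means. Variable names are illustrative, and in practice the non-respondent mean is unknown, which is exactly why this bias is hard to detect.

```python
response_rate = 0.40
mean_resp, mean_nonresp = 55_000, 38_000

# Population mean is the response-rate-weighted mix of the two groups.
true_mean = response_rate * mean_resp + (1 - response_rate) * mean_nonresp

# Bias of reporting the respondent mean as if it were the population mean.
bias = mean_resp - true_mean
relative_bias = bias / true_mean

# Equivalent decomposition: (non-response rate) * (gap between group means).
bias_from_gap = (1 - response_rate) * (mean_resp - mean_nonresp)
```

The decomposition shows that the bias vanishes only if everyone responds or if respondents and non-respondents have the same mean.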

Key Takeaways

📋Summary: Sampling Methods

SE shrinks in proportion to $1/\sqrt{n}$: quadrupling $n$ halves the standard error. This is the fundamental law of statistical precision.
Probability sampling (SRS, stratified, cluster, systematic) is required for valid design-based inference. Convenience samples have unknown selection probabilities, so their bias cannot be measured or corrected from the sample alone.
Stratified > SRS when strata differ substantially in the outcome. It controls for known heterogeneity and reduces variance.
Non-response bias occurs when non-respondents differ systematically from respondents. Track response rates and apply weighting corrections.
CLT convergence depends on population skewness: as rough rules of thumb, near-symmetric distributions may need only $n \geq 10$, while heavily skewed ones may need $n \geq 50$ to $100$ or more.
Finite population correction applies when $n/N > 5\%$: $SE_{adj} = SE \cdot \sqrt{(N-n)/(N-1)}$.
To halve margin of error, quadruple $n$ : The $1/\sqrt{n}$ rate means precision improvement is expensive.
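The finite population correction from the summary can be sketched as a small helper. The function name `standard_error` and the population sizes below are hypothetical; the 5% threshold follows the rule of thumb stated above.

```python
import math

def standard_error(sigma, n, N=None):
    """SE of the mean; applies the finite population correction when n/N > 5%."""
    se = sigma / math.sqrt(n)
    if N is not None and n / N > 0.05:
        se *= math.sqrt((N - n) / (N - 1))
    return se

se_large_pop = standard_error(4000, 64)          # 500.0, population treated as infinite
se_small_pop = standard_error(4000, 64, N=400)   # n/N = 16%, so the correction applies
```

For a population of only 400, the corrected SE drops to roughly 459, because sampling 16% of the population without replacement leaves less uncertainty about the rest.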

Deep Dive

For detailed explanations, worked examples, and Python implementations, explore the dedicated statistics lessons:

Population and Sample

Population vs Sample — Parameters vs. statistics, sampling frames, and the goal of statistical inference

Data Collection

Data Collection Methods — Surveys, experiments, observational studies, and their trade-offs

Sampling Techniques

Sampling Techniques — SRS, stratified, cluster, and systematic sampling with formulas, examples, and allocation strategies

Bias and Errors

Sampling Bias and Errors — Selection bias, non-response bias, survivorship bias, famous polling failures, and mitigation strategies
