
Sampling Methods & Distributions

Master sampling methods, sampling distributions, standard error, and the foundations of statistical inference.

ℹ️ Why It Matters

In the real world, examining an entire population is almost never feasible: it is too expensive, too time-consuming, or literally impossible. Sampling is the disciplined art of selecting a subset to learn about the whole. Done well, a random sample of about 1,000 people can estimate a population proportion, such as a candidate's vote share, to within roughly ±3 percentage points at 95% confidence. Done poorly, even a million data points can mislead. Understanding sampling methods, bias, and sampling distributions is the bedrock of statistics, A/B testing, clinical trials, and machine learning.


Overview

Every dataset used in machine learning is a sample from some larger data-generating process. Simple random sampling gives every subset of size $n$ the same probability of selection, making it the gold standard for unbiasedness. Stratified sampling divides the population into subgroups (strata) and samples within each, guaranteeing representation and reducing variance when strata differ. Cluster sampling selects whole groups and surveys everyone in them, dramatically reducing cost for geographically dispersed populations. Systematic sampling picks every $k$-th individual after a random start; it is simple but vulnerable to periodicity in the list order. The sampling distribution of a statistic describes how it varies across all possible samples, and the standard error ($SE = \sigma/\sqrt{n}$ for the sample mean) quantifies that variability. The Central Limit Theorem guarantees that sample means are approximately normal for large $n$, regardless of the shape of the population distribution, provided its variance is finite.
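The four designs described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a survey-grade implementation: the population of 1,000 unit ids, the four strata, the 50 clusters of 20 units, and the seed are all hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Hypothetical population: 1,000 unit ids, 4 strata, 50 clusters of 20 units.
N = 1000
ids = np.arange(N)
strata = ids % 4       # assumed stratum label for each unit
clusters = ids // 20   # assumed cluster label for each unit
n = 100                # target sample size

# Simple random sampling: n units drawn without replacement.
srs = rng.choice(ids, size=n, replace=False)

# Stratified sampling: an equal-size SRS inside every stratum.
stratified = np.concatenate([
    rng.choice(ids[strata == h], size=n // 4, replace=False)
    for h in np.unique(strata)
])

# Cluster sampling: pick 5 whole clusters and keep every unit in them.
chosen = rng.choice(np.unique(clusters), size=5, replace=False)
cluster_sample = ids[np.isin(clusters, chosen)]

# Systematic sampling: random start in [0, k), then every k-th unit.
k = N // n
start = rng.integers(k)
systematic = ids[start::k]
```

Each design returns 100 unit ids here, but the stratified sample is guaranteed to contain 25 units per stratum, while the cluster sample touches only 5 of the 50 clusters.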


Key Concepts

Key Estimators

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \quad s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2, \quad \hat{p} = \frac{\text{successes}}{n}$$

Here,

  • $\bar{X}$ = Sample mean, an unbiased estimator of $\mu$
  • $s^2$ = Sample variance (Bessel-corrected), an unbiased estimator of $\sigma^2$
  • $\hat{p}$ = Sample proportion, an unbiased estimator of $\pi$
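As a quick check of the three estimators, the sketch below computes them with NumPy on made-up data (the arrays `x` and `successes` are illustrative, not from the lesson). Note that `ddof=1` is what selects the $n-1$ divisor; NumPy's default is the population formula with divisor $n$.

```python
import numpy as np

x = np.array([4.0, 7.0, 6.0, 3.0, 5.0])  # made-up measurements

x_bar = x.mean()      # sample mean -> 5.0
s2 = x.var(ddof=1)    # Bessel-corrected sample variance -> 2.5

# Made-up 0/1 outcomes: 1 marks a "success".
successes = np.array([1, 0, 1, 1, 0, 1, 0, 1])
p_hat = successes.mean()  # sample proportion -> 0.625

print(x_bar, s2, p_hat)
```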

Standard Error of the Mean

$$SE(\bar{X}) = \frac{\sigma}{\sqrt{n}}$$

Here,

  • $\sigma$ = Population standard deviation
  • $n$ = Sample size
  • $SE$ = Standard deviation of the sampling distribution of $\bar{X}$
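One way to see that the formula describes the spread of sample means is to simulate it. The sketch below assumes a normal population with $\sigma = 4000$ and an arbitrary mean of 50,000 (both illustrative), draws many samples of size 64, and compares the empirical standard deviation of the sample means with $\sigma/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducibility

sigma, n, reps = 4000.0, 64, 20_000  # illustrative values

# Each row is one sample of size n from the assumed normal population.
samples = rng.normal(loc=50_000.0, scale=sigma, size=(reps, n))
sample_means = samples.mean(axis=1)

se_theory = sigma / np.sqrt(n)            # 500.0
se_empirical = sample_means.std(ddof=1)   # spread of the simulated means

print(f"theoretical SE: {se_theory:.1f}")
print(f"empirical SE:   {se_empirical:.1f}")
```

The two numbers should agree to within about a percent; the small gap is simulation noise, which shrinks as `reps` grows.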

Stratified Estimator

$$\bar{X}_{st} = \sum_{h=1}^{H} W_h \bar{X}_h, \quad \text{Var}(\bar{X}_{st}) = \sum_{h=1}^{H} W_h^2 \frac{\sigma_h^2}{n_h}$$

Here,

  • $H$ = Number of strata
  • $W_h = N_h / N$ = Population weight of stratum $h$
  • $\bar{X}_h$ = Sample mean within stratum $h$
  • $n_h$ = Sample size allocated to stratum $h$
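The stratified estimator can be sketched as a small function. The helper name `stratified_mean_and_variance` and the example data are hypothetical. One assumption is worth flagging: the formula uses the true stratum variances $\sigma_h^2$, which are rarely known, so this sketch plugs in the sample variances $s_h^2$ instead and therefore returns an estimated variance.

```python
import numpy as np

def stratified_mean_and_variance(stratum_sizes, samples):
    """Stratified mean and its estimated variance.

    stratum_sizes: population size N_h of each stratum.
    samples: one sequence of observations per stratum.
    Assumption: sample variances s_h^2 stand in for sigma_h^2.
    """
    N_h = np.asarray(stratum_sizes, dtype=float)
    W = N_h / N_h.sum()                                   # weights W_h
    means = np.array([np.mean(s) for s in samples])       # X-bar_h
    s2 = np.array([np.var(s, ddof=1) for s in samples])   # s_h^2
    n_h = np.array([len(s) for s in samples])             # n_h
    estimate = np.sum(W * means)
    variance = np.sum(W**2 * s2 / n_h)
    return estimate, variance

# Made-up example: two strata holding 600 and 400 population units.
est, var = stratified_mean_and_variance(
    [600, 400],
    [[10, 12, 14], [20, 22, 24, 26]],
)
print(est, var)
```

In this example the stratum means are 12 and 23 with weights 0.6 and 0.4, so the stratified mean is 16.4.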

Sample Size for Mean

$$n = \left(\frac{z_{\alpha/2} \cdot \sigma}{E}\right)^2$$

Here,

  • $E$ = Desired margin of error
  • $z_{\alpha/2}$ = Critical value (1.96 for 95%)
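A minimal sketch of this calculation, using only the Python standard library. The function name `sample_size_for_mean` is hypothetical, and rounding up with `ceil` is a convention assumed here so that the achieved margin of error never exceeds $E$.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma, margin, confidence=0.95):
    """Smallest n whose margin of error for the mean is at most `margin`."""
    # Two-sided critical value z_{alpha/2}, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / margin) ** 2)

# With sigma = 4000, a margin of 500 needs 246 people; a margin of 1000 needs 62.
print(sample_size_for_mean(4000, 500))   # 246
print(sample_size_for_mean(4000, 1000))  # 62
```

Halving the margin from 1000 to 500 roughly quadruples the required sample, as the squared term in the formula implies.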

Sample Size for Proportion

$$n = \frac{z_{\alpha/2}^2 \cdot \hat{p}(1 - \hat{p})}{E^2}$$

Here,

  • $\hat{p}$ = Prior estimate of proportion (use 0.5 if unknown)
  • $E$ = Desired margin of error
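The same idea for proportions, again as a hedged sketch with a hypothetical function name. The default `p=0.5` follows the advice above: it maximizes $\hat{p}(1-\hat{p})$ and so gives the most conservative (largest) sample size.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(margin, p=0.5, confidence=0.95):
    """Smallest n whose margin of error for a proportion is at most `margin`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z**2 * p * (1 - p) / margin**2)

# A margin of 3 percentage points at 95% confidence needs 1,068 respondents;
# relaxing it to 5 points cuts the requirement to 385.
print(sample_size_for_proportion(0.03))  # 1068
print(sample_size_for_proportion(0.05))  # 385
```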

Central Limit Theorem

$$\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} N(0, 1) \quad \text{as } n \to \infty$$

Here,

  • $\bar{X}_n$ = Sample mean of $n$ observations
  • $\mu$ = Population mean
  • $\sigma$ = Population standard deviation
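The convergence can be watched numerically. The sketch below assumes an exponential population with scale 1 (so $\mu = \sigma = 1$), a deliberately skewed choice, and tracks the skewness of the standardized sample mean as $n$ grows. The helper names and the seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)  # fixed seed for reproducibility

mu = sigma = 1.0  # an exponential with scale 1 has mean 1 and sd 1
reps = 20_000     # number of simulated samples per n

def standardized_means(n):
    """Draw `reps` samples of size n and standardize each sample mean."""
    draws = rng.exponential(scale=1.0, size=(reps, n))
    return (draws.mean(axis=1) - mu) / (sigma / np.sqrt(n))

def skewness(z):
    """Moment-based skewness; 0 for a symmetric distribution."""
    return np.mean((z - z.mean()) ** 3) / z.std() ** 3

for n in (2, 10, 100):
    z = standardized_means(n)
    print(f"n={n:>3}  skewness of standardized mean = {skewness(z):.2f}")
```

For this population the skewness of the mean is $2/\sqrt{n}$ in theory, so the printed values should fall from roughly 1.4 toward 0.2, approaching the 0 of a normal distribution.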

Sampling Methods Comparison

| Method | How It Works | Best For | Key Advantage | Key Disadvantage |
| --- | --- | --- | --- | --- |
| Simple Random (SRS) | Equal chance for every subset | Homogeneous populations | Unbiased, easy to analyze | Requires complete sampling frame |
| Stratified | Sample within each known subgroup | Heterogeneous populations | Lower variance than SRS | Requires prior knowledge of strata |
| Cluster | Sample clusters, survey everyone | Geographically dispersed | Cheaper than SRS | Higher variance due to intra-cluster correlation (ICC) |
| Systematic | Every k-th after random start | Ordered lists | Simple to implement | Vulnerable to periodicity |
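The variance penalty of cluster sampling is often summarized by Kish's design effect, $1 + (m - 1)\rho$, for clusters of equal size $m$ and intra-cluster correlation $\rho$. The lesson does not state this formula, so treat the sketch below as a supplementary illustration under that equal-cluster-size assumption; the function names and input values are hypothetical.

```python
def design_effect(cluster_size, icc):
    """Kish's approximation: variance inflation relative to SRS.

    Assumes every cluster contains `cluster_size` units.
    """
    return 1.0 + (cluster_size - 1) * icc

def effective_sample_size(n, cluster_size, icc):
    """Size of the SRS that would give the same precision."""
    return n / design_effect(cluster_size, icc)

# 1,000 respondents in clusters of 20 with a modest ICC of 0.05:
deff = design_effect(20, 0.05)
n_eff = effective_sample_size(1000, 20, 0.05)
print(f"design effect = {deff:.2f}, effective n = {n_eff:.0f}")
```

Even a small ICC nearly doubles the variance here, so the 1,000 clustered respondents carry about as much information as 513 drawn by simple random sampling.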

Sampling Bias Types

| Type | Description | Example |
| --- | --- | --- |
| Selection bias | Sampling method excludes groups | Voluntary response surveys |
| Non-response bias | Selected individuals decline | Phone surveys missing workers |
| Survivorship bias | Only "surviving" cases observed | Studying successful companies only |
| Undercoverage bias | Some members have zero selection chance | Online-only surveys |
| Convenience sampling | Easiest-to-reach individuals | Surveying friends and family |

Quick Example

📝 Standard Error and Sample Size

The standard deviation of monthly incomes is $4,000. You sample 64 people.

$$SE = \frac{\sigma}{\sqrt{n}} = \frac{4000}{\sqrt{64}} = \frac{4000}{8} = 500$$

The SE is \$500: the sample mean typically deviates from the true mean by about \$500. To halve it to \$250, you need 4× the sample size (256 people), because $SE \propto 1/\sqrt{n}$.

📝 Non-Response Bias

In a survey, 1,000 people are selected; response rate = 40%. Respondent mean income = \$55,000. Non-respondent mean = \$38,000.

True population mean: $\mu = 0.4 \times 55000 + 0.6 \times 38000 = \$44{,}800$.

Bias = $55{,}000 - 44{,}800 = \$10{,}200$, a 22.8% overestimate.
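The same arithmetic in Python, as a small sketch using the numbers from this example (the variable names are illustrative). In practice the non-respondent mean is unknown, which is exactly why this bias is hard to detect.

```python
response_rate = 0.40
mean_respondents = 55_000
mean_nonrespondents = 38_000  # known here only because the example supplies it

# The true mean weights each group by its share of the selected sample.
true_mean = (response_rate * mean_respondents
             + (1 - response_rate) * mean_nonrespondents)

# A naive analysis reports the respondent mean as if it were the truth.
bias = mean_respondents - true_mean
relative_bias = bias / true_mean

print(f"true mean = {true_mean:,.0f}")
print(f"bias = {bias:,.0f} ({relative_bias:.1%} overestimate)")
```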


Key Takeaways

📋 Summary: Sampling Methods

  • SE shrinks in proportion to $1/\sqrt{n}$: quadrupling $n$ halves the standard error. This is the fundamental law of statistical precision.
  • Probability sampling (SRS, stratified, cluster, systematic) is required for valid inference. Convenience samples have unknown selection probabilities, so their bias cannot be measured or reliably corrected.
  • Stratified > SRS when strata differ substantially in the outcome. It controls for known heterogeneity and reduces variance.
  • Non-response bias occurs when non-respondents differ systematically from respondents. Track response rates and apply weighting corrections.
  • CLT convergence depends on population skewness: as a rule of thumb, symmetric distributions need about $n \geq 10$; heavily skewed ones may need $n \geq 50$–$100$.
  • Finite population correction applies when $n/N > 5\%$: $SE_{adj} = SE \cdot \sqrt{(N-n)/(N-1)}$.
  • To halve margin of error, quadruple $n$: The $1/\sqrt{n}$ rate means precision improvement is expensive.
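The finite population correction from the list above can be sketched as follows. The function name and the population size of 640 are hypothetical; the other values reuse the quick example ($\sigma = 4000$, $n = 64$), so the sampling fraction is 10% and the correction applies.

```python
import math

def fpc_standard_error(sigma, n, N):
    """Standard error of the mean with the finite population correction.

    sigma: population standard deviation
    n: sample size
    N: population size (assumed known, with n <= N)
    """
    se = sigma / math.sqrt(n)
    return se * math.sqrt((N - n) / (N - 1))

se_plain = 4000 / math.sqrt(64)              # 500.0, no correction
se_adj = fpc_standard_error(4000, 64, 640)   # about 475 with n/N = 10%
print(f"uncorrected SE = {se_plain:.1f}, corrected SE = {se_adj:.1f}")
```

When $N$ is very large relative to $n$, the correction factor approaches 1 and the adjusted value matches the plain formula.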

Deep Dive

For detailed explanations, worked examples, and Python implementations, explore the dedicated statistics lessons:

Population and Sample

  • Population vs Sample — Parameters vs. statistics, sampling frames, and the goal of statistical inference

Data Collection

Sampling Techniques

  • Sampling Techniques — SRS, stratified, cluster, and systematic sampling with formulas, examples, and allocation strategies

Bias and Errors

  • Sampling Bias and Errors — Selection bias, non-response bias, survivorship bias, famous polling failures, and mitigation strategies
