Sampling Methods & Distributions
Why It Matters
In the real world, examining an entire population is almost never feasible: it is too expensive, too time-consuming, or literally impossible. Sampling is the disciplined art of selecting a subset to learn about the whole. Done well, a random sample of 1,000 people can estimate a candidate's vote share to within about 3 percentage points at 95% confidence. Done poorly, even a million data points can mislead. Understanding sampling methods, bias, and sampling distributions is the bedrock of statistics, A/B testing, clinical trials, and machine learning.
Overview
Every dataset used in machine learning is a sample from some larger data-generating process. Simple random sampling (SRS) gives every subset of size $n$ the same probability of selection, making it the gold standard for unbiasedness. Stratified sampling divides the population into subgroups (strata) and samples within each, guaranteeing representation and reducing variance when strata differ. Cluster sampling selects groups and surveys everyone in them, dramatically reducing cost for geographically dispersed populations. Systematic sampling picks every $k$-th individual after a random start, which is simple but vulnerable to periodicity in the list. The sampling distribution of a statistic describes how it varies across all possible samples, and the standard error ($\sigma/\sqrt{n}$ for the sample mean) quantifies that variability. The Central Limit Theorem guarantees that sample means are approximately normal for large $n$, regardless of the shape of the population distribution, provided its variance is finite.
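The four designs described above can be sketched in a few lines. This is a minimal illustration using only Python's standard library; the population of unit IDs, the two strata, the ten clusters, and all function names are hypothetical and chosen only to make the mechanics visible.

```python
import random

def simple_random_sample(population, n, rng):
    """Draw n units without replacement; every size-n subset is equally likely."""
    return rng.sample(population, n)

def systematic_sample(population, n, rng):
    """Take every k-th unit after a random start in [0, k)."""
    k = len(population) // n          # sampling interval
    start = rng.randrange(k)          # random start
    return [population[start + i * k] for i in range(n)]

def stratified_sample(strata, n_per_stratum, rng):
    """Draw an SRS of fixed size inside each stratum (dict: name -> list of units)."""
    return {name: rng.sample(units, n_per_stratum) for name, units in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and keep every unit in the chosen clusters."""
    chosen = rng.sample(list(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]

rng = random.Random(0)                               # fixed seed for reproducibility
population = list(range(100))                        # hypothetical unit IDs 0..99
strata = {"low": population[:50], "high": population[50:]}
clusters = {c: population[c * 10:(c + 1) * 10] for c in range(10)}

srs = simple_random_sample(population, 10, rng)
sys_s = systematic_sample(population, 10, rng)
strat = stratified_sample(strata, 5, rng)
clus = cluster_sample(clusters, 2, rng)
```

Note that the systematic sample is fully determined by its random start, which is why a list with a period equal to the interval can produce a badly unrepresentative sample.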
Key Concepts
Key Estimators
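$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad S^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2, \qquad \hat{p} = \frac{\text{number of successes}}{n}$$
Here,
- $\bar{X}$ = Sample mean, an unbiased estimator of $\mu$
- $S^2$ = Sample variance (Bessel-corrected), an unbiased estimator of $\sigma^2$
- $\hat{p}$ = Sample proportion, an unbiased estimator of $p$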
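As a small illustration of the three estimators listed above, the sketch below computes them with Python's `statistics` module. The two data lists are made-up values, not figures from this lesson.

```python
import statistics

incomes = [42, 55, 38, 61, 47, 50]       # hypothetical sample, in $1,000s
owns_home = [1, 0, 1, 1, 0, 1]           # hypothetical 0/1 outcomes

x_bar = statistics.fmean(incomes)        # sample mean
s2 = statistics.variance(incomes)        # sample variance, divides by n - 1
p_hat = sum(owns_home) / len(owns_home)  # sample proportion of 1s
```

`statistics.variance` applies the Bessel correction, whereas `statistics.pvariance` divides by $n$ and is intended for a complete population.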
Standard Error of the Mean
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
Here,
- $\sigma$ = Population standard deviation
- $n$ = Sample size
- $\sigma_{\bar{X}}$ = Standard deviation of the sampling distribution of $\bar{X}$
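One way to see what the standard error measures is to simulate many samples and look at the spread of their means. The sketch below borrows $\sigma = 4000$ and $n = 64$ from the Quick Example; the normal population, its mean of 50,000, the seed, and the number of repetitions are assumptions made for illustration.

```python
import math
import random
import statistics

rng = random.Random(42)
sigma, n, reps = 4000.0, 64, 5000   # reps is an arbitrary simulation size

# Draw many samples of size n from a hypothetical normal population
# and record the mean of each one.
sample_means = [
    statistics.fmean(rng.gauss(50000.0, sigma) for _ in range(n))
    for _ in range(reps)
]

theoretical_se = sigma / math.sqrt(n)           # 4000 / 8 = 500
empirical_se = statistics.stdev(sample_means)   # spread of the simulated means
```

With these settings the empirical value should land close to the theoretical 500, differing only by simulation noise.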
Stratified Estimator
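$$\bar{X}_{st} = \sum_{h=1}^{H} W_h \bar{X}_h, \qquad \bar{X}_h = \frac{1}{n_h}\sum_{i=1}^{n_h} X_{hi}$$
Here,
- $H$ = Number of strata
- $W_h$ = Population weight of stratum $h$
- $\bar{X}_h$ = Sample mean within stratum $h$
- $n_h$ = Sample size allocated to stratum $h$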
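The stratified estimator is a weighted sum of within-stratum sample means, with weights equal to each stratum's share of the population. A minimal sketch follows; the stratum names, weights, and sampled values are hypothetical.

```python
import statistics

# Hypothetical strata: population weight W_h and the values sampled from each stratum.
strata = {
    "urban": {"weight": 0.6, "sample": [52.0, 48.0, 55.0, 50.0]},
    "rural": {"weight": 0.4, "sample": [31.0, 35.0, 33.0]},
}

def stratified_mean(strata):
    """Weighted sum of within-stratum sample means (weights are assumed to sum to 1)."""
    return sum(s["weight"] * statistics.fmean(s["sample"]) for s in strata.values())

x_st = stratified_mean(strata)   # 0.6 * 51.25 + 0.4 * 33.0
```

Because the weights come from the population rather than from the sample sizes, the estimate stays correct even when a stratum is over- or under-sampled.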
Sample Size for Mean
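$$n = \left(\frac{z_{\alpha/2}\,\sigma}{E}\right)^2$$
Here,
- $E$ = Desired margin of error
- $z_{\alpha/2}$ = Critical value (1.96 for 95%)
- $\sigma$ = Population standard deviation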
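The sample size needed for a given margin of error on a mean can be computed directly. This sketch assumes $\sigma$ is known and uses the standard-library `NormalDist` for the critical value; the function name and the example inputs are illustrative.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma, margin, confidence=0.95):
    """Smallest n with z * sigma / sqrt(n) <= margin (sigma assumed known)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96 for 95%
    return math.ceil((z * sigma / margin) ** 2)

n_500 = sample_size_for_mean(sigma=4000, margin=500)
n_250 = sample_size_for_mean(sigma=4000, margin=250)
```

The result is rounded up, since rounding down would leave the margin of error slightly above the target.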
Sample Size for Proportion
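$$n = \frac{z_{\alpha/2}^2\,\hat{p}\,(1-\hat{p})}{E^2}$$
Here,
- $\hat{p}$ = Prior estimate of proportion (use 0.5 if unknown)
- $E$ = Desired margin of error
- $z_{\alpha/2}$ = Critical value (1.96 for 95%)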
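For a proportion, the same planning calculation uses $\hat{p}(1-\hat{p})$ in place of $\sigma^2$. The sketch below defaults to the conservative prior of 0.5 mentioned above; the function name and the margins tried are illustrative.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(margin, p=0.5, confidence=0.95):
    """Smallest n with z * sqrt(p * (1 - p) / n) <= margin."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96 for 95%
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

n_3pts = sample_size_for_proportion(margin=0.03)   # plus or minus 3 percentage points
n_5pts = sample_size_for_proportion(margin=0.05)   # plus or minus 5 percentage points
```

Using 0.5 maximizes $\hat{p}(1-\hat{p})$, so the resulting $n$ is sufficient whatever the true proportion turns out to be.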
Central Limit Theorem
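$$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} \mathcal{N}(0, 1) \quad \text{as } n \to \infty$$
Here,
- $\bar{X}_n$ = Sample mean of $n$ observations
- $\mu$ = Population mean
- $\sigma$ = Population standard deviation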
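The theorem can be checked empirically on a skewed population. The sketch below draws repeated samples from an Exponential(1) distribution, standardizes each sample mean, and measures how often it falls within $\pm 1.96$, which would be about 95% under normality. The choice of population, $n = 50$, the seed, and the number of repetitions are all assumptions for illustration.

```python
import math
import random
import statistics

rng = random.Random(7)
n, reps = 50, 4000
mu = sigma = 1.0   # an Exponential(1) population has mean 1 and standard deviation 1

def standardized_mean(n):
    """Mean of n exponential draws, centered and scaled as in the CLT statement."""
    x_bar = statistics.fmean(rng.expovariate(1.0) for _ in range(n))
    return (x_bar - mu) / (sigma / math.sqrt(n))

z_scores = [standardized_mean(n) for _ in range(reps)]
coverage = sum(abs(z) <= 1.96 for z in z_scores) / reps   # share within +/- 1.96
```

Rerunning with a much smaller `n` (say 5) is a quick way to see the approximation degrade for a skewed population.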
Sampling Methods Comparison
| Method | How It Works | Best For | Key Advantage | Key Disadvantage |
|---|---|---|---|---|
| Simple Random | Every subset of size $n$ is equally likely | Homogeneous populations | Unbiased, easy to analyze | Requires complete sampling frame |
| Stratified | Sample within each known subgroup | Heterogeneous populations | Lower variance than SRS | Requires prior knowledge of strata |
| Cluster | Sample clusters, survey everyone in them | Geographically dispersed | Cheaper than SRS | Higher variance due to intra-cluster correlation (ICC) |
| Systematic | Every k-th after random start | Ordered lists | Simple to implement | Vulnerable to periodicity |
Sampling Bias Types
| Type | Description | Example |
|---|---|---|
| Selection bias | Sampling method excludes groups | Voluntary response surveys |
| Non-response bias | Selected individuals decline | Phone surveys missing workers |
| Survivorship bias | Only "surviving" cases observed | Studying successful companies only |
| Undercoverage bias | Some members have zero selection chance | Online-only surveys |
| Convenience sampling | Easiest-to-reach individuals | Surveying friends and family |
Quick Example
Standard Error and Sample Size
The standard deviation of monthly incomes is \$4,000. You sample 64 people.
The SE is $4000/\sqrt{64} = 500$ dollars. To halve the margin of error, you need $n = 4 \times 64 = 256$ people, because $\text{SE} \propto 1/\sqrt{n}$.
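The arithmetic in this example can be reproduced in a few lines. The helper `n_for_target_se` is a hypothetical name introduced here; it simply inverts $\text{SE} = \sigma/\sqrt{n}$.

```python
import math

sigma, n = 4000.0, 64
se = sigma / math.sqrt(n)                  # 4000 / 8 = 500.0

def n_for_target_se(sigma, target_se):
    """Smallest n whose standard error does not exceed target_se."""
    return math.ceil((sigma / target_se) ** 2)

n_half = n_for_target_se(sigma, se / 2)    # sample size that halves the SE
```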
Non-Response Bias
In a survey, 1,000 people are selected and the response rate is 40%. Respondent mean income = \$55,000; non-respondent mean income = \$38,000.
True population mean: $\mu = 0.4 \times 55000 + 0.6 \times 38000 = \$44{,}800$.
Bias = $55{,}000 - 44{,}800 = \$10{,}200$, a 22.8% overestimate.
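The same calculation, written out so the inputs can be varied. The figures are the ones in the formula above; the variable names are illustrative.

```python
response_rate = 0.40
mean_respondents = 55_000       # mean income among those who answered
mean_nonrespondents = 38_000    # mean income among those who did not

# The true mean weights each group by its share of the selected sample.
true_mean = (response_rate * mean_respondents
             + (1 - response_rate) * mean_nonrespondents)
bias = mean_respondents - true_mean    # error from reporting respondents only
relative_bias = bias / true_mean       # as a fraction of the true mean
```

In practice the non-respondent mean is unobserved, which is why follow-up surveys and weighting adjustments are needed to estimate it.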
Key Takeaways
Summary: Sampling Methods
- SE decreases as $1/\sqrt{n}$: quadrupling $n$ halves the standard error. This is the fundamental law of statistical precision.
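- Probability sampling (SRS, stratified, cluster, systematic) is required for valid design-based inference, because every unit has a known, nonzero chance of selection. Convenience samples lack this property, so their bias cannot be measured or reliably corrected.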
- Stratified > SRS when strata differ substantially in the outcome. It controls for known heterogeneity and reduces variance.
- Non-response bias occurs when non-respondents differ systematically from respondents. Track response rates and apply weighting corrections.
- CLT convergence depends on population skewness: symmetric distributions reach approximate normality at small $n$, while heavily skewed ones may need a much larger $n$.
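- Finite population correction applies when the sample is a non-negligible fraction of the population (a common rule of thumb is $n/N > 0.05$): $\text{SE} = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}$.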
- To halve margin of error, quadruple $n$: the $1/\sqrt{n}$ rate means precision improvement is expensive.
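The finite population correction mentioned in the takeaways is conventionally written as $\sqrt{(N-n)/(N-1)}$ for simple random sampling without replacement. The sketch below uses that textbook form, which is an assumption about what this lesson intends; the function name and the population sizes are illustrative.

```python
import math

def se_with_fpc(sigma, n, N):
    """Standard error of the mean for an SRS of n units drawn without replacement from N."""
    fpc = math.sqrt((N - n) / (N - 1))    # shrinks toward 0 as n approaches N
    return sigma / math.sqrt(n) * fpc

se_small_pop = se_with_fpc(sigma=4000, n=64, N=640)         # sample is 10% of the population
se_large_pop = se_with_fpc(sigma=4000, n=64, N=1_000_000)   # correction is negligible
se_census = se_with_fpc(sigma=4000, n=64, N=64)             # full census: no sampling error
```

When the population is very large relative to the sample, the corrected value is practically identical to the uncorrected $\sigma/\sqrt{n}$.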
Deep Dive
For detailed explanations, worked examples, and Python implementations, explore the dedicated statistics lessons:
Population and Sample
- Population vs Sample: Parameters vs. statistics, sampling frames, and the goal of statistical inference
Data Collection
- Data Collection Methods: Surveys, experiments, observational studies, and their trade-offs
Sampling Techniques
- Sampling Techniques: SRS, stratified, cluster, and systematic sampling with formulas, examples, and allocation strategies
Bias and Errors
- Sampling Bias and Errors: Selection bias, non-response bias, survivorship bias, famous polling failures, and mitigation strategies
Related Topics
- Central Limit Theorem: Why sample means are approximately normal for large $n$
- Standard Error: Quantifying the variability of a statistic across samples
- Confidence Intervals for the Mean: Using SE to build interval estimates
- Sample Size Determination: Formulas for planning studies
- Bootstrap Methods: Distribution-free alternative when CLT conditions are uncertain