Population vs Sample — The Foundation of Statistical Inference


Population vs Sample

Every statistical study starts with a fundamental question: who or what are we studying?


Population

A population is the complete set of all individuals, objects, or measurements of interest for a particular study.

Study                           | Population
--------------------------------|----------------------------------------
Approval rating of a president  | All eligible voters in the country
Average height of NBA players   | All current NBA players
Effectiveness of a drug         | All people who could ever take the drug
Quality control of chips        | All chips produced by the factory

Populations can be:

  • Finite: all 7,500 employees at a company
  • Infinite: all possible measurements a machine could produce
  • Hypothetical: all people who could take an experimental drug

Sample

A sample is a subset of the population that is actually observed and measured.

Why sample instead of studying the whole population?

Reason                    | Example
--------------------------|-----------------------------------------------------
Cost                      | Surveying 2,000 people costs far less than 2 million
Time                      | Census takes years; a survey takes months
Destructive testing       | Testing a lightbulb to failure destroys it
Infinite population       | You cannot measure every future product
Practical impossibility   | Can't reach every person on Earth

Parameters vs Statistics

Term                   | Symbol     | Description                 | Example
-----------------------|------------|-----------------------------|---------------------------------
Population mean        | μ (mu)     | True average of population  | μ = average height of all adults
Population std dev     | σ (sigma)  | True spread of population   | σ = spread of heights
Population proportion  | π or p     | True fraction with property | π = true % who prefer brand A
Sample mean            | x̄ (x-bar)  | Average of sample           | x̄ = avg height in our sample
Sample std dev         | s          | Sample spread               | s = spread in our sample
Sample proportion      | p̂ (p-hat)  | Fraction in sample          | p̂ = % who prefer A in sample

Key insight: Parameters are fixed but unknown. Statistics are known but variable (different samples give different values).
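
For reference, the quantities in the table can be written out as formulas. This is a sketch under the usual textbook definitions; N (population size), n (sample size), K (number of population units with the property), and k (number of sampled units with the property) are notation assumed here, not taken from the table.

```latex
\begin{aligned}
\mu &= \frac{1}{N}\sum_{i=1}^{N} x_i, &
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i, \\
\sigma &= \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i-\mu\right)^2}, &
s &= \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2}, \\
\pi &= \frac{K}{N}, &
\hat{p} &= \frac{k}{n}.
\end{aligned}
```

The N versus n − 1 divisors are what the ddof=0 and ddof=1 arguments select in the simulation code.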

import numpy as np

# Simulate a population (in reality, we wouldn't have this)
np.random.seed(42)
population = np.random.normal(loc=170, scale=10, size=10_000)  # 10,000 adults

# True population parameters
mu = population.mean()
sigma = population.std(ddof=0)  # ddof=0 for population
print(f"Population Parameter μ = {mu:.4f} cm")
print(f"Population Parameter σ = {sigma:.4f} cm")

print("\n--- Drawing samples of different sizes ---")
for n in [10, 30, 100, 500]:
    sample = np.random.choice(population, size=n, replace=False)
    x_bar = sample.mean()
    s = sample.std(ddof=1)  # ddof=1 for sample (Bessel's correction; s² is unbiased for σ²)
    se = s / np.sqrt(n)
    print(f"n={n:4d}: x̄={x_bar:.3f}, s={s:.3f}, SE={se:.3f} | Error = {abs(x_bar-mu):.3f}")

Output:

Population Parameter μ = 170.0694 cm
Population Parameter σ = 10.0048 cm

--- Drawing samples of different sizes ---
n=  10: x̄=169.847, s=10.042, SE=3.175 | Error = 0.222
n=  30: x̄=170.591, s=10.381, SE=1.895 | Error = 0.522
n= 100: x̄=170.204, s= 9.983, SE=0.998 | Error = 0.135
n= 500: x̄=170.082, s=10.017, SE=0.448 | Error = 0.013

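Notice: Larger samples give a smaller standard error, so x̄ tends to land closer to the true parameter. The trend holds on average, not for every draw: in this run the n=30 sample missed μ by more than the n=10 sample did.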
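
The shrinking SE column follows from SE = σ/√n. The short sketch below makes the diminishing returns explicit; it assumes independent draws and a known σ of 10 cm (an illustrative value matching the scale of the simulation), and `standard_error` and `sigma_assumed` are names introduced here for illustration.

```python
import math

def standard_error(sigma: float, n: int) -> float:
    """Standard error of the sample mean for independent draws: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

sigma_assumed = 10.0  # assumed population standard deviation in cm (illustrative)

# Each step quadruples n, which only halves the standard error
for n in [25, 100, 400, 1600]:
    print(f"n={n:5d}: SE = {standard_error(sigma_assumed, n):.3f}")

# Prints:
# n=   25: SE = 2.000
# n=  100: SE = 1.000
# n=  400: SE = 0.500
# n= 1600: SE = 0.250
```

Going from 25 to 100 observations buys a full centimetre of precision; going from 400 to 1,600 costs far more data and buys only a quarter of a centimetre.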


The Sampling Distribution

If we draw many different samples and compute a statistic each time, the distribution of those statistics is the sampling distribution.

import matplotlib.pyplot as plt

# Sampling distribution of the mean (n=30)
sample_means = []
for _ in range(10_000):
    sample = np.random.choice(population, size=30, replace=False)
    sample_means.append(sample.mean())

sample_means = np.array(sample_means)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Population distribution
axes[0].hist(population, bins=50, color='steelblue', alpha=0.7, density=True)
axes[0].axvline(mu, color='red', linewidth=2, label=f'μ = {mu:.1f}')
axes[0].set_title(f'Population Distribution\n(N=10,000, μ={mu:.1f}, σ={sigma:.1f})')
axes[0].legend()

# Sampling distribution of x̄
axes[1].hist(sample_means, bins=50, color='coral', alpha=0.7, density=True)
axes[1].axvline(mu, color='red', linewidth=2, label=f'μ = {mu:.1f}')
axes[1].set_title('Sampling Distribution of x̄\n(10,000 samples, n=30)')
axes[1].set_xlabel('Sample Mean')
axes[1].legend()

print(f"Mean of sample means = {sample_means.mean():.4f} ≈ μ = {mu:.4f}")
print(f"Std of sample means  = {sample_means.std():.4f} ≈ σ/√n = {sigma/np.sqrt(30):.4f}")

plt.tight_layout()
plt.show()
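
The same idea applies to other statistics, such as the sample proportion p̂. The sketch below is a minimal illustration under assumed values: a hypothetical population of 10,000 people, of whom exactly 4,000 prefer brand A, and samples of n=100. The names `brand_pop`, `n_prefer`, and `p_hats` are introduced here. It compares the simulated spread of p̂ with the textbook approximation √(p(1−p)/n).

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

N = 10_000         # hypothetical population size
n_prefer = 4_000   # hypothetical number who prefer brand A
true_p = n_prefer / N

# Population encoded as 0/1 indicators (1 = prefers brand A)
brand_pop = np.zeros(N, dtype=int)
brand_pop[:n_prefer] = 1

# Draw many samples and record the sample proportion each time
n = 100
p_hats = np.array([
    rng.choice(brand_pop, size=n, replace=False).mean()
    for _ in range(5_000)
])

# Approximate standard error of p-hat (ignores the finite population correction)
theoretical_se = np.sqrt(true_p * (1 - true_p) / n)

print(f"Mean of p-hats = {p_hats.mean():.4f} (population p = {true_p:.4f})")
print(f"Std of p-hats  = {p_hats.std():.4f} (sqrt(p(1-p)/n) = {theoretical_se:.4f})")
```

Because each sample is drawn without replacement from a finite population, the simulated spread should come out slightly below the approximation; with n only 1% of N the difference is small.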

Census vs Sample

A census attempts to measure the entire population.

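Aspect        | Census            | Sample
--------------|-------------------|-----------------------
Coverage      | All units         | Subset
Cost          | Very high         | Lower
Time          | Long              | Shorter
Accuracy      | No sampling error | Sampling error present
Feasibility   | Limited           | Broad
Non-response  | Larger problem    | Manageable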

The US Census Bureau conducts a decennial census; it takes years, costs billions of dollars, and still has coverage errors.
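
To see why "no sampling error" does not mean "no error", here is a small simulation under invented assumptions: a population made of an easy-to-reach group and a hard-to-reach group with different mean heights, and a census that misses half of the second group. All group sizes, means, and variable names are hypothetical and chosen only to illustrate coverage error; they do not describe any real census.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, for reproducibility only

# Hypothetical population: two groups with different mean heights (cm)
easy_to_reach = rng.normal(loc=170, scale=10, size=8_000)
hard_to_reach = rng.normal(loc=160, scale=10, size=2_000)
full_population = np.concatenate([easy_to_reach, hard_to_reach])
true_mean = full_population.mean()

# A "census" that misses half of the hard-to-reach group (coverage error)
census = np.concatenate([easy_to_reach, hard_to_reach[:1_000]])
census_error = census.mean() - true_mean

# A simple random sample of 1,000 from the full population (sampling error only)
srs = rng.choice(full_population, size=1_000, replace=False)
srs_error = srs.mean() - true_mean

print(f"True mean            = {true_mean:.3f}")
print(f"Undercounting census = {census.mean():.3f} (error {census_error:+.3f})")
print(f"Random sample n=1000 = {srs.mean():.3f} (error {srs_error:+.3f})")
```

The census error here is systematic: repeating the same flawed census would reproduce it, whereas the random sample's error varies from draw to draw and shrinks as n grows.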


Key Takeaways

  1. Population = the complete group of interest; Sample = subset we actually measure
  2. Parameters describe populations (Greek letters: μ, σ, π); Statistics describe samples (Latin: x̄, s, p̂)
  3. We use statistics to estimate parameters — the core engine of inferential statistics
  4. Larger samples → smaller sampling error, but with diminishing returns: standard error shrinks like 1/√n, so halving it takes four times the sample size
  5. The sampling distribution of a statistic tells us how it varies across repeated samples
  6. Every inference has uncertainty — quantifying that uncertainty is the job of statistics
