Population vs Sample
Every statistical study starts with a fundamental question: who or what are we studying?
Population
A population is the complete set of all individuals, objects, or measurements of interest for a particular study.
| Study | Population |
|---|---|
| Approval rating of a president | All eligible voters in the country |
| Average height of NBA players | All current NBA players |
| Effectiveness of a drug | All people who could ever take the drug |
| Quality control of chips | All chips produced by the factory |
Populations can be:
- Finite: all 7,500 employees at a company
- Infinite: all possible measurements a machine could produce
- Hypothetical: all people who could take an experimental drug
Sample
A sample is a subset of the population that is actually observed and measured.
Why sample instead of studying the whole population?
| Reason | Example |
|---|---|
| Cost | Surveying 2,000 people costs far less than 2 million |
| Time | A census takes years; a survey takes months |
| Destructive testing | Testing a lightbulb to failure destroys it |
| Infinite population | You cannot measure every future product |
| Practical impossibility | Can't reach every person on Earth |
Parameters vs Statistics
| Term | Symbol | Description | Example |
|---|---|---|---|
| Population mean | μ (mu) | True average of population | μ = average height of all adults |
| Population std dev | σ (sigma) | True spread of population | σ = spread of heights |
| Population proportion | π or p | True fraction with property | π = true % who prefer brand A |
| Sample mean | x̄ (x-bar) | Average of sample | x̄ = avg height in our sample |
| Sample std dev | s | Sample spread | s = spread in our sample |
| Sample proportion | p̂ (p-hat) | Fraction in sample | p̂ = % who prefer A in sample |
Key insight: Parameters are fixed but unknown. Statistics are known but variable (different samples give different values).
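As a minimal sketch of that contrast for a proportion, the snippet below builds a hypothetical population of 10,000 brand preferences in which π is 0.40 by construction, then draws five samples of 100. The population, the seed, and the names `preferences`, `true_pi`, and `p_hats` are illustrative assumptions, not data from a real survey.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, only for repeatability

# Hypothetical population: 1 = prefers brand A, 0 = does not
preferences = np.array([1] * 4_000 + [0] * 6_000)

# The parameter is a single fixed number (normally unknown to us)
true_pi = preferences.mean()
print(f"Parameter π = {true_pi:.2f}")

# The statistic changes from sample to sample
p_hats = [
    rng.choice(preferences, size=100, replace=False).mean()
    for _ in range(5)
]
for i, p_hat in enumerate(p_hats, start=1):
    print(f"Sample {i}: p̂ = {p_hat:.2f}")
```

Every run of the loop estimates the same π, yet the five p̂ values will generally differ from one another.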
```python
import numpy as np

# Simulate a population (in reality, we wouldn't have this)
np.random.seed(42)
population = np.random.normal(loc=170, scale=10, size=10_000)  # 10,000 adults

# True population parameters
mu = population.mean()
sigma = population.std(ddof=0)  # ddof=0: divide by N for a full population

print(f"Population Parameter μ = {mu:.4f} cm")
print(f"Population Parameter σ = {sigma:.4f} cm")

print("\n--- Drawing samples of different sizes ---")
for n in [10, 30, 100, 500]:
    sample = np.random.choice(population, size=n, replace=False)
    x_bar = sample.mean()
    s = sample.std(ddof=1)  # ddof=1: divide by n-1 (unbiased variance estimate)
    se = s / np.sqrt(n)     # estimated standard error of the mean
    print(f"n={n:4d}: x̄={x_bar:.3f}, s={s:.3f}, SE={se:.3f} | Error = {abs(x_bar-mu):.3f}")
```
Output:

```
Population Parameter μ = 170.0694 cm
Population Parameter σ = 10.0048 cm

--- Drawing samples of different sizes ---
n=  10: x̄=169.847, s=10.042, SE=3.175 | Error = 0.222
n=  30: x̄=170.591, s=10.381, SE=1.895 | Error = 0.522
n= 100: x̄=170.204, s=9.983, SE=0.998 | Error = 0.135
n= 500: x̄=170.082, s=10.017, SE=0.448 | Error = 0.013
```
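Notice: Larger samples → smaller standard error → estimates that tend to land closer to the true parameter. The trend holds on average, not for every draw: here the n=30 sample happens to miss μ by more than the n=10 sample does.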
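The shrinking standard error follows the formula SE = σ/√n. The sketch below evaluates it for a few sample sizes, assuming σ = 10 cm to match the simulated population; the helper name `standard_error` is illustrative.

```python
import math

sigma = 10.0  # assumed population standard deviation in cm


def standard_error(sigma: float, n: int) -> float:
    """Theoretical standard error of the sample mean for sample size n."""
    return sigma / math.sqrt(n)


# Each step quadruples n, which only halves the standard error:
# SE = 2.0, 1.0, 0.5, 0.25
for n in [25, 100, 400, 1600]:
    print(f"n={n:5d}: SE = {standard_error(sigma, n):.3f}")
```

Because the standard error falls with the square root of n, each extra unit of precision costs more observations than the last.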
The Sampling Distribution
If we repeatedly draw samples of the same size n from the population and compute a statistic (such as x̄) each time, the distribution of those values is the sampling distribution of that statistic.
```python
import matplotlib.pyplot as plt

# Reuses np, population, mu, and sigma from the previous block

# Sampling distribution of the mean (n=30)
sample_means = []
for _ in range(10_000):
    sample = np.random.choice(population, size=30, replace=False)
    sample_means.append(sample.mean())
sample_means = np.array(sample_means)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Population distribution
axes[0].hist(population, bins=50, color='steelblue', alpha=0.7, density=True)
axes[0].axvline(mu, color='red', linewidth=2, label=f'μ = {mu:.1f}')
axes[0].set_title(f'Population Distribution\n(N=10,000, μ={mu:.1f}, σ={sigma:.1f})')
axes[0].legend()

# Sampling distribution of x̄
axes[1].hist(sample_means, bins=50, color='coral', alpha=0.7, density=True)
axes[1].axvline(mu, color='red', linewidth=2, label=f'μ = {mu:.1f}')
axes[1].set_title('Sampling Distribution of x̄\n(10,000 samples, n=30)')
axes[1].set_xlabel('Sample Mean')
axes[1].legend()

print(f"Mean of sample means = {sample_means.mean():.4f} ≈ μ = {mu:.4f}")
print(f"Std of sample means = {sample_means.std():.4f} ≈ σ/√n = {sigma/np.sqrt(30):.4f}")

plt.tight_layout()
plt.show()
```
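The simulation above samples without replacement (`replace=False`). Strictly, the standard deviation of x̄ is then σ/√n multiplied by the finite population correction √((N−n)/(N−1)). With N=10,000 and n=30 that factor is about 0.9985, so σ/√n is a very good approximation. The sketch below uses a much smaller hypothetical population (N=200, n=50) so that the correction becomes visible. The seed, the sizes, and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

# Hypothetical small population: the sample is a quarter of it
N, n = 200, 50
small_population = rng.normal(loc=170, scale=10, size=N)
sigma_small = small_population.std(ddof=0)

# Empirical sampling distribution of the mean, drawn without replacement
means = np.array([
    rng.choice(small_population, size=n, replace=False).mean()
    for _ in range(5_000)
])

se_naive = sigma_small / np.sqrt(n)  # ignores the finite population
fpc = np.sqrt((N - n) / (N - 1))     # finite population correction
se_fpc = se_naive * fpc

print(f"Empirical std of sample means = {means.std():.4f}")
print(f"σ/√n                          = {se_naive:.4f}")
print(f"σ/√n × FPC                    = {se_fpc:.4f}")
```

In this setup the empirical spread should sit close to the corrected value and noticeably below the uncorrected σ/√n.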
Census vs Sample
A census attempts to measure the entire population.
| | Census | Sample |
|---|---|---|
| Coverage | All units | Subset |
| Cost | Very high | Lower |
| Time | Long | Shorter |
| Accuracy | No sampling error | Sampling error present |
| Feasibility | Limited | Broad |
| Non-response | Larger problem | Manageable |
The US Census Bureau conducts a decennial census (once every ten years). It takes years to plan and carry out, costs billions of dollars, and still has coverage errors such as undercounts and overcounts.
Key Takeaways
- Population = the complete group of interest; Sample = subset we actually measure
- Parameters describe populations (Greek letters: μ, σ, π); Statistics describe samples (Latin: x̄, s, p̂)
- We use statistics to estimate parameters — the core engine of inferential statistics
- Larger samples → smaller standard error, but with diminishing returns: the standard error shrinks like 1/√n, so quadrupling n only halves it
- The sampling distribution of a statistic tells us how it varies across repeated samples
- Every inference has uncertainty — quantifying that uncertainty is the job of statistics