Population vs Sample — The Foundation of Statistical Inference

Foundations of Statistics | Sampling Theory | Updated: 2026-07-14

Every Study Starts With One Question: Who Are We Measuring?

Every statistical study starts with a fundamental question: who or what are we studying? The answer determines everything — from which statistics you calculate to what conclusions you can draw.

Define your population — Know exactly who or what your conclusions apply to
Understand sampling — Learn why studying a subset is often the only option
Parameters vs statistics — Master the Greek letters that separate populations from samples
Census vs sample — Discover why even a full count can be wrong

Get this distinction right, and inferential statistics becomes logical. Get it wrong, and your conclusions stand on sand.

What is Population vs Sample?

Definition

A population is the complete set of all individuals, objects, or measurements of interest. A sample is a subset of that population that is actually observed and measured.

Understanding the distinction between population and sample is the foundation of statistical inference.

[Figure: population vs sample diagram]

Population

Definition

A population is the complete set of all individuals, objects, or measurements of interest for a particular study.

Study	Population	Size
Approval rating of a president	All eligible voters in the country	~250 million
Average height of NBA players	All current NBA players	~450
Effectiveness of a drug	All people who could ever take the drug	Unbounded (hypothetical)
Quality control of chips	All chips produced by the factory	~1 million/day

Populations can be:

Finite: all 7,500 employees at a company
Infinite: all possible measurements a machine could produce
Hypothetical: all people who could take an experimental drug

Sample

Why sample instead of study the whole population?

Reason	Example
Cost	Surveying 2,000 people costs far less than 2 million
Time	A census takes years; a survey takes months
Destructive testing	Testing a lightbulb to failure destroys it
Infinite population	You cannot measure every future product
Practical impossibility	Can't reach every person on Earth

Parameters vs Statistics

Population Parameters

Parameter	Symbol	Formula
Mean	μ	μ = (1/N)Σxᵢ
Std Dev	σ	σ = √[(1/N)Σ(xᵢ-μ)²]
Proportion	π	π = X/N

Parameters are fixed but usually unknown, so we estimate them with sample statistics.

Sample Statistics

Statistic	Symbol	Formula
Mean	x̄	x̄ = (1/n)Σxᵢ
Std Dev	s	s = √[(1/(n-1))Σ(xᵢ-x̄)²]
Proportion	p̂	p̂ = x/n

Statistics are known once the sample is drawn, but they vary: different samples give different values.
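
The n-1 in the formula for s can be checked on a population small enough to enumerate. The sketch below uses a made-up four-value population (an assumption for illustration, not data from this lesson) and averages both variance estimators over every possible sample of size 2 drawn with replacement. Dividing by n underestimates σ² on average, while dividing by n-1 recovers it.

```python
from itertools import product

import numpy as np

# Hypothetical four-value population, chosen so the arithmetic is easy to check.
population = np.array([2.0, 4.0, 6.0, 8.0])
sigma2 = population.var(ddof=0)  # population variance: divide by N

# Every ordered sample of size n=2 drawn with replacement (4 * 4 = 16 samples).
samples = [np.array(pair) for pair in product(population, repeat=2)]

# Average each estimator over all possible samples.
mean_var_n = np.mean([s.var(ddof=0) for s in samples])          # divide by n
mean_var_n_minus_1 = np.mean([s.var(ddof=1) for s in samples])  # divide by n-1

print(f"Population variance       = {sigma2:.2f}")              # 5.00
print(f"Average 1/n estimator     = {mean_var_n:.2f}")          # 2.50
print(f"Average 1/(n-1) estimator = {mean_var_n_minus_1:.2f}")  # 5.00
```

This demonstrates unbiasedness for the variance s². The square root s is still slightly biased for σ, though the effect shrinks quickly as n grows.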

import numpy as np

# Simulate a population (in reality, we wouldn't have this)
np.random.seed(42)
population = np.random.normal(loc=170, scale=10, size=10_000)  # 10,000 adults

# True population parameters
mu = population.mean()
sigma = population.std(ddof=0)  # ddof=0 for population
print(f"Population Parameter μ = {mu:.4f} cm")
print(f"Population Parameter σ = {sigma:.4f} cm")

print("\n--- Drawing samples of different sizes ---")
for n in [10, 30, 100, 500]:
    sample = np.random.choice(population, size=n, replace=False)
    x_bar = sample.mean()
    s = sample.std(ddof=1)  # ddof=1 for sample (unbiased)
    se = s / np.sqrt(n)
    print(f"n={n:4d}: x̄={x_bar:.3f}, s={s:.3f}, SE={se:.3f} | Error = {abs(x_bar-mu):.3f}")

Output:

Population Parameter μ = 170.0694 cm
Population Parameter σ = 10.0048 cm

--- Drawing samples of different sizes ---
n=  10: x̄=169.847, s=10.042, SE=3.175 | Error = 0.222
n=  30: x̄=170.591, s=10.381, SE=1.895 | Error = 0.522
n= 100: x̄=170.204, s= 9.983, SE=0.998 | Error = 0.135
n= 500: x̄=170.082, s=10.017, SE=0.448 | Error = 0.013

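Notice: larger samples give a smaller standard error, so the sample mean tends to land closer to the true parameter. The trend holds on average, not for every draw: in the output above, the n=30 sample misses μ by more than the n=10 sample does.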
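
The standard error shrinks with the square root of the sample size, so quadrupling n only halves it. A minimal sketch, assuming a known σ of 10 to match the scale of the simulation (the helper name standard_error is illustrative, not from the lesson):

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard error of the sample mean for a population with std dev sigma."""
    return sigma / math.sqrt(n)


sigma = 10.0  # assumed population standard deviation

for n in [25, 100, 400]:
    print(f"n={n:3d}: SE = {standard_error(sigma, n):.2f}")
# n= 25: SE = 2.00
# n=100: SE = 1.00
# n=400: SE = 0.50
```

In practice σ is unknown, which is why the simulation above plugs in the sample value s instead.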

The Sampling Distribution

The sampling distribution of a statistic is the distribution of the values it takes across repeated samples of the same size. The simulation below draws 10,000 samples of n=30 and plots their means next to the population itself.

import matplotlib.pyplot as plt

# Reuses np, population, mu, and sigma from the previous example.
# Sampling distribution of the mean (n=30)
sample_means = []
for _ in range(10_000):
    sample = np.random.choice(population, size=30, replace=False)
    sample_means.append(sample.mean())

sample_means = np.array(sample_means)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Population distribution
axes[0].hist(population, bins=50, color='steelblue', alpha=0.7, density=True)
axes[0].axvline(mu, color='red', linewidth=2, label=f'μ = {mu:.1f}')
axes[0].set_title(f'Population Distribution\n(N=10,000, μ={mu:.1f}, σ={sigma:.1f})')
axes[0].legend()

# Sampling distribution of x̄
axes[1].hist(sample_means, bins=50, color='coral', alpha=0.7, density=True)
axes[1].axvline(mu, color='red', linewidth=2, label=f'μ = {mu:.1f}')
axes[1].set_title('Sampling Distribution of x̄\n(10,000 samples, n=30)')
axes[1].set_xlabel('Sample Mean')
axes[1].legend()

print(f"Mean of sample means = {sample_means.mean():.4f} ≈ μ = {mu:.4f}")
print(f"Std of sample means  = {sample_means.std():.4f} ≈ σ/√n = {sigma/np.sqrt(30):.4f}")

plt.tight_layout()
plt.show()
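
The same idea applies to the sample proportion p̂. The following sketch assumes a hypothetical true proportion π = 0.30 and models each sample as n independent draws (a binomial model, which treats the population as effectively infinite). It is an illustration under those assumptions, not part of the lesson's own simulation.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator so the run is repeatable

pi = 0.30      # assumed true population proportion (hypothetical value)
n = 100        # sample size
reps = 20_000  # number of repeated samples

# Each draw counts the "successes" in one sample of size n; dividing by n gives p̂.
p_hats = rng.binomial(n, pi, size=reps) / n

# Theoretical standard error of p̂ under the binomial model.
theoretical_se = np.sqrt(pi * (1 - pi) / n)

print(f"Mean of p̂ values = {p_hats.mean():.4f} (π = {pi})")
print(f"Std of p̂ values  = {p_hats.std():.4f} (theory: {theoretical_se:.4f})")
```

The mean of the p̂ values should sit close to π, and their spread close to √(π(1-π)/n), mirroring what the histogram shows for x̄.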

Census vs Sample

Aspect	Census	Sample
Coverage	All units	Subset
Cost	Very high	Lower
Time	Long	Shorter
Accuracy	No sampling error	Sampling error present
Feasibility	Limited	Broad
Non-response	Larger problem	Manageable

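The US Census Bureau conducts a census every ten years. It takes years of work and billions of dollars, and it still has coverage errors: some people are missed and others are counted more than once. A census eliminates sampling error, not every other source of error.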
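
A toy simulation can show how a full count with a coverage gap may end up further from the truth than a modest random sample. Every number here is made up for illustration (the group sizes, the group means, the 40% miss rate), and the code is not a model of any real census.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: about 20% of units are hard to reach and have
# systematically lower values than the rest. All figures are assumptions.
N = 100_000
hard_to_reach = rng.random(N) < 0.20
values = np.where(hard_to_reach,
                  rng.normal(30, 5, size=N),
                  rng.normal(50, 5, size=N))
true_mean = values.mean()

# "Census": attempts a full count but misses 40% of the hard-to-reach group.
missed = hard_to_reach & (rng.random(N) < 0.40)
census_mean = values[~missed].mean()

# Simple random sample of 1,000 units, assuming everyone selected responds.
sample_mean = rng.choice(values, size=1_000, replace=False).mean()

print(f"True mean       = {true_mean:.2f}")
print(f"Census estimate = {census_mean:.2f} (error {abs(census_mean - true_mean):.2f})")
print(f"Sample estimate = {sample_mean:.2f} (error {abs(sample_mean - true_mean):.2f})")
```

The census error here is a bias: it does not shrink by counting more of the units that were already easy to reach. The sample error is random and shrinks as n grows.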

Population vs Sample in Machine Learning

Statistics Term	ML Equivalent	What It Means
Population	All possible data	Everything the model could ever see
Sample	Training set	What the model actually learns from
Parameter (μ, σ)	Model weights (W, b)	True values we want to learn; the fitted weights are estimates of them
Statistic (x̄, s)	Loss/Accuracy on train	What we measure from our sample
Sampling error	Generalization gap	Difference between train and test performance

Example — Train/Test Split as Sampling:

from sklearn.model_selection import train_test_split
import numpy as np

# Population: synthetic data standing in for all house data
np.random.seed(42)
n_total = 1000
X = np.random.randn(n_total, 3)  # 3 features
y = 2*X[:,0] + 3*X[:,1] - X[:,2] + np.random.randn(n_total)*0.5

# Sample: training set (80% of population)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Population size: {n_total}")
print(f"Training sample: {len(X_train)}")
print(f"Test sample: {len(X_test)}")

# Statistics from sample (training set)
print(f"\nSample mean of X[:,0]: {X_train[:,0].mean():.3f}")
print(f"Population mean of X[:,0]: {X[:,0].mean():.3f}")
print(f"Sampling error: {abs(X_train[:,0].mean() - X[:,0].mean()):.3f}")

Output:

Population size: 1000
Training sample: 800
Test sample: 200

Sample mean of X[:,0]: 0.018
Population mean of X[:,0]: 0.003
Sampling error: 0.015
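
To connect the table's weights and generalization-gap rows to something concrete, the sketch below fits a least-squares model on a training sample and compares the fitted weights with the coefficients that generated the data. It is a NumPy-only stand-in under stated assumptions: the data follow the same rule as the example above (y = 2x₀ + 3x₁ - x₂ plus noise), and the split is done by shuffling indices instead of calling train_test_split.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic "population" with known generating weights (the "parameters").
true_w = np.array([2.0, 3.0, -1.0])
X = rng.normal(size=(1000, 3))
y = X @ true_w + rng.normal(scale=0.5, size=1000)

# 80/20 split by shuffling row indices.
idx = rng.permutation(1000)
train, test = idx[:800], idx[800:]

# Least-squares fit on the training sample only (no intercept, as in the rule above).
w_hat, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)


def mse(rows):
    """Mean squared error of the fitted weights on the given rows."""
    return float(np.mean((X[rows] @ w_hat - y[rows]) ** 2))


print("True weights:     ", true_w)
print("Estimated weights:", np.round(w_hat, 3))
print(f"Train MSE = {mse(train):.3f}, Test MSE = {mse(test):.3f}")
```

The estimated weights should land close to, but not exactly on, the true ones, in the same way that x̄ lands close to μ. The train and test errors will usually differ a little, and which one is lower can vary with the split.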

Key Takeaways

Population vs sample: a population is every unit of interest; a sample is the subset actually measured.
Parameters vs statistics: parameters (μ, σ, π) are fixed but usually unknown; statistics (x̄, s, p̂) are computed from a sample and vary from sample to sample.
Sample size: larger samples shrink the standard error, so estimates tend to land closer to the parameter.
Census vs sample: a census removes sampling error but can still suffer from coverage and non-response errors.
Machine learning: the training set plays the role of the sample, and all possible data plays the role of the population.
