Variance
Variance measures the average squared deviation from the mean.
Population variance: σ² = (1/N) Σ(xᵢ − μ)²
Sample variance (Bessel's correction): s² = (1/(n−1)) Σ(xᵢ − x̄)²
We divide by n−1 to produce an unbiased estimator of σ².
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
data = np.array([4, 7, 13, 2, 1, 8, 11, 6, 9, 5])
n = len(data)
mean = np.mean(data)
deviations = data - mean
squared_devs = deviations ** 2
pop_var = squared_devs.sum() / n
sample_var = squared_devs.sum() / (n - 1)
print(f"Mean: {mean:.2f}")
print(f"Deviations: {deviations}")
print(f"Squared deviations: {squared_devs}")
print(f"Population variance σ² (÷n): {pop_var:.4f}")
print(f"Sample variance s² (÷n−1): {sample_var:.4f}")
print(f"NumPy ddof=0 (population): {np.var(data, ddof=0):.4f}")
print(f"NumPy ddof=1 (sample): {np.var(data, ddof=1):.4f}")
Why Divide by n−1?
The sample mean x̄ is computed from the data and is the "closest" center to the sample. This causes squared deviations from x̄ to systematically underestimate those from the true μ. Dividing by n−1 corrects this bias.
np.random.seed(42)
pop = np.random.normal(50, 10, 10000)
true_var = np.var(pop)
biased, unbiased = [], []
for _ in range(5000):
sample = np.random.choice(pop, size=10)
biased.append(np.var(sample, ddof=0))
unbiased.append(np.var(sample, ddof=1))
print(f"True population variance: {true_var:.4f}")
print(f"Average biased estimate (÷n): {np.mean(biased):.4f} ← UNDERESTIMATES")
print(f"Average unbiased estimate (÷n−1): {np.mean(unbiased):.4f} ← CORRECT")
Computational Shortcut
sum_x = data.sum()
sum_x2 = (data**2).sum()
s2_comp = (sum_x2 - sum_x**2/n) / (n-1)
print(f"Computational formula: {s2_comp:.4f}")
print(f"Standard formula: {sample_var:.4f}") # identical
Properties of Variance
a, b = 3, 10
x = np.random.normal(50, 10, 1000)
# Shifting does NOT change variance
assert abs(np.var(x+b, ddof=1) - np.var(x, ddof=1)) < 1e-6
# Scaling multiplies variance by a²
assert abs(np.var(a*x, ddof=1) - a**2*np.var(x, ddof=1)) < 1e-4
print("Var(X+b) = Var(X): True")
print("Var(aX) = a²Var(X): True")
# Var(X+Y) = Var(X)+Var(Y) for independent X,Y
y = np.random.normal(30, 5, 1000)
print(f"Var(X+Y) ≈ Var(X)+Var(Y): {abs(np.var(x+y,ddof=1)-(np.var(x,ddof=1)+np.var(y,ddof=1))) < 200}")
Key Takeaways
- Variance = average squared deviation — squaring makes all deviations positive
- Always use ddof=1 for samples (Bessel's correction) in Python
- Variance units are squared — use standard deviation (√variance) for interpretation
- Variance is additive for independent variables: Var(X+Y) = Var(X) + Var(Y)
- Scaling by a multiplies variance by a²: Var(aX) = a²Var(X)
- Adding a constant does not change variance: Var(X+b) = Var(X)