P-Values: The Most Misunderstood Number in Statistics
The p-value is simultaneously the most used and most misused concept in statistics. Let's get it exactly right.
The Exact Definition
The p-value is the probability of obtaining a test statistic as extreme as, or more extreme than, the observed value, assuming the null hypothesis is true.
In plain English: "If H₀ were true, how often would we see results at least this extreme just by chance?"
- Small p-value → the data would be surprising if H₀ were true → evidence against H₀
- Large p-value → the data are consistent with H₀ → no strong evidence against H₀
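The definition can be made concrete with a simulation: generate many datasets from a world in which H₀ is true, compute the test statistic for each, and count how often it is at least as extreme as the one observed. The sketch below is only an illustration under assumed values (a null mean of 100, a known standard deviation of 15, n = 25, and an observed mean of 106, none of which come from the text), and it uses a z statistic so the simulated answer can be compared with the exact normal tail area.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: all values are illustrative assumptions
mu_0, sigma, n = 100.0, 15.0, 25   # null mean, known population sd, sample size
observed_mean = 106.0              # the sample mean we pretend to have observed

se = sigma / math.sqrt(n)
z_obs = (observed_mean - mu_0) / se   # observed test statistic

# Simulate 100,000 studies in a world where H0 is true
n_sims = 100_000
sim_means = rng.normal(mu_0, sigma, size=(n_sims, n)).mean(axis=1)
z_sim = (sim_means - mu_0) / se

# p-value = share of simulated statistics at least as extreme as the observed one
p_simulated = np.mean(np.abs(z_sim) >= abs(z_obs))

# Exact two-sided p-value for a z statistic: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
p_exact = math.erfc(abs(z_obs) / math.sqrt(2))

print(f"z_obs = {z_obs:.2f}")
print(f"simulated p = {p_simulated:.4f}, exact p = {p_exact:.4f}")
```

The two numbers should agree up to simulation noise, which is the whole point: a p-value is a long-run frequency computed inside the hypothetical world where H₀ holds.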
Computing P-Values in Python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
np.random.seed(42)
# One-sample t-test example
# H₀: μ = 70 (exam scores average 70)
# H₁: μ ≠ 70
sample = np.array([74, 69, 78, 71, 76, 72, 80, 68, 75, 77,
                   73, 79, 65, 82, 71, 74, 76, 70, 78, 73])
n = len(sample)
x_bar = sample.mean()
s = sample.std(ddof=1)
mu_0 = 70
t_stat = (x_bar - mu_0) / (s / np.sqrt(n))
p_two = 2 * stats.t.sf(abs(t_stat), df=n-1)
print(f"Sample: n={n}, x̄={x_bar:.2f}, s={s:.2f}")
print(f"H₀: μ = {mu_0}")
print(f"t-statistic = {t_stat:.4f}")
print(f"p-value (two-tailed) = {p_two:.4f}")
# Visualize p-value
x = np.linspace(-5, 5, 1000)
t_dist = stats.t(df=n-1)
y = t_dist.pdf(x)
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(x, y, 'b-', linewidth=2)
ax.fill_between(x, y, where=x >= abs(t_stat), alpha=0.4, color='red', label=f'Right tail p/2={p_two/2:.4f}')
ax.fill_between(x, y, where=x <= -abs(t_stat), alpha=0.4, color='red', label=f'Left tail p/2={p_two/2:.4f}')
ax.axvline(t_stat, color='red', linewidth=2, linestyle='--', label=f't_obs={t_stat:.3f}')
ax.axvline(-t_stat, color='red', linewidth=2, linestyle='--')
ax.set_title(f'P-value = {p_two:.4f} (shaded area = probability under H₀)')
ax.legend()
plt.tight_layout()
plt.savefig('pvalue_visualization.png', dpi=150)
plt.show()
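As a sanity check, the hand-computed statistic can be compared with `scipy.stats.ttest_1samp`, which runs the same two-sided test by default. This is a minimal sketch that repeats the sample so it runs on its own; the one-sided line at the end is an added illustration of how the tail choice changes the p-value, not part of the original example.

```python
import numpy as np
from scipy import stats

sample = np.array([74, 69, 78, 71, 76, 72, 80, 68, 75, 77,
                   73, 79, 65, 82, 71, 74, 76, 70, 78, 73])
mu_0 = 70

# Manual computation: t = (x̄ - μ0) / (s / √n), two-tailed area from the t distribution
n = len(sample)
t_manual = (sample.mean() - mu_0) / (sample.std(ddof=1) / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Library computation of the same test
result = stats.ttest_1samp(sample, popmean=mu_0)

print(f"manual: t = {t_manual:.4f}, p = {p_manual:.6f}")
print(f"scipy : t = {result.statistic:.4f}, p = {result.pvalue:.6f}")

# One-sided alternative H1: μ > 70 uses only the upper tail
p_greater = stats.t.sf(t_manual, df=n - 1)
print(f"one-sided p (H1: mu > {mu_0}) = {p_greater:.6f}")
```

Because the observed t is positive here, the one-sided p-value is exactly half the two-sided one. The choice of tail belongs to the hypothesis, so it should be fixed before looking at the data.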
What a P-Value IS and IS NOT
| Statement | Correct? |
|---|---|
| "p = 0.03 means there's a 3% chance H₀ is true" | ❌ WRONG |
| "p = 0.03 means there's a 97% chance H₁ is true" | ❌ WRONG |
| "p = 0.03 means: if H₀ were true, only 3% of samples would yield a result at least this extreme" | ✅ CORRECT |
| "p = 0.03 means the result is practically important" | ❌ WRONG |
| "p = 0.03 means the study will replicate 97% of the time" | ❌ WRONG |
| "p > 0.05 proves H₀ is true" | ❌ WRONG |
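The first row of the table mixes up two different conditional probabilities: P(p < α | H₀ true) and P(H₀ true | p < α). A rough simulation shows how far apart they can be. Everything in this sketch is an assumption chosen for illustration (80% of tested hypotheses are truly null, real effects are 0.5 standard deviations, n = 20 per study); different assumptions give different numbers, but the two quantities generally do not coincide.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical research landscape (all values are illustrative assumptions)
n_studies = 10_000    # number of independent studies
n = 20                # observations per study
prior_null = 0.8      # share of studies where H0 (mu = 0) is really true
true_effect = 0.5     # true mean when H0 is false, in sd units
alpha = 0.05

h0_is_true = rng.random(n_studies) < prior_null
means = np.where(h0_is_true, 0.0, true_effect)

# One row per study; each row is a sample of size n with sd = 1
data = rng.normal(loc=means[:, None], scale=1.0, size=(n_studies, n))
p_values = stats.ttest_1samp(data, popmean=0.0, axis=1).pvalue

significant = p_values < alpha
false_positive_rate = significant[h0_is_true].mean()    # P(p < alpha | H0 true)
share_null_given_sig = h0_is_true[significant].mean()   # P(H0 true | p < alpha)

print(f"P(p < {alpha} | H0 true) = {false_positive_rate:.3f}")
print(f"P(H0 true | p < {alpha}) = {share_null_given_sig:.3f}")
```

The first number sits near α by construction. The second depends on how many tested hypotheses are null and on statistical power, which is why a p-value alone cannot tell you the probability that H₀ is true.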
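# Show: the p-value does NOT measure effect size or practical importance
from scipy.stats import ttest_1samp
mu_null = 0   # separate name so mu_0 = 70 from the exam example stays intact
alpha = 0.05
def verdict(p):
    """Label a p-value from the data instead of hard-coding the outcome."""
    return "'Significant'" if p < alpha else "'Not significant'"
# Tiny effect, huge n: true mean is 0.02 sd from the null.
# Expected t ≈ 0.02 / (1 / sqrt(100000)) ≈ 6.3, so this is almost always significant.
tiny_effect = np.random.normal(0.02, 1, 100000)
t1, p1 = ttest_1samp(tiny_effect, popmean=mu_null)
print(f"Tiny effect (Δ≈0.02), n=100,000: t={t1:.2f}, p={p1:.2e} ← {verdict(p1)}!")
# Much larger effect, small n: true mean is 2.0 / 5 = 0.4 sd from the null.
# With n=8 the test has low power, so this is usually (not always) non-significant.
large_effect = np.random.normal(2.0, 5, 8)
t2, p2 = ttest_1samp(large_effect, popmean=mu_null)
print(f"Large effect (Δ≈2.0), n=8: t={t2:.2f}, p={p2:.4f} ← {verdict(p2)}!")
print("\nConclusion: the p-value confounds effect size with sample size!")
print("Always report an effect size (e.g. Cohen's d) alongside the p-value.")
# Cohen's d for a one-sample test: (sample mean - null value) / sample sd
d1 = (tiny_effect.mean() - mu_null) / tiny_effect.std(ddof=1)
d2 = (large_effect.mean() - mu_null) / large_effect.std(ddof=1)
print(f"Cohen's d (tiny effect): {d1:.4f}")
print(f"Cohen's d (large effect): {d2:.4f}")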
The P-Value Controversy
The American Statistical Association's 2016 statement on p-values set out six principles. Among its key points:
- p-values do not measure the probability that H₀ is true
- p-values do not measure the probability that the data were produced by random chance alone
- Statistical significance does not equal scientific or practical significance
- p-values do not measure the size of an effect
- A p > 0.05 does not mean there is no effect
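The last point is easy to check by simulation: when a real effect exists but studies are small, most of them still fail to reach p < 0.05. The sketch below assumes a true effect of 0.4 standard deviations and n = 15 per study; both values are hypothetical and chosen only to illustrate low power.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical scenario: a real effect exists in every study, but each study is small
true_mean = 0.4     # assumed true effect in sd units (H0 says 0)
n = 15              # observations per study
n_studies = 10_000
alpha = 0.05

data = rng.normal(loc=true_mean, scale=1.0, size=(n_studies, n))
p_values = stats.ttest_1samp(data, popmean=0.0, axis=1).pvalue

# Power = share of studies that detect the (real) effect
power = np.mean(p_values < alpha)
print(f"Studies with p <  {alpha}: {power:.1%}")
print(f"Studies with p >= {alpha}: {1 - power:.1%} (the effect is real in every one)")
```

Under these assumptions a clear majority of studies return p ≥ 0.05 even though H₀ is false in all of them, so a single non-significant result is weak grounds for concluding "no effect".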
Recommended reporting:
# Best practice: report confidence interval + effect size + p-value
from scipy import stats
mu_0 = 70  # null value from the exam-score test
effect_size_d = (x_bar - mu_0) / s
ci = stats.t.interval(0.95, df=n-1, loc=x_bar, scale=s/np.sqrt(n))
# Cohen's conventional benchmarks: 0.2 = small, 0.5 = medium, 0.8 = large
abs_d = abs(effect_size_d)
label = 'negligible' if abs_d < 0.2 else 'small' if abs_d < 0.5 else 'medium' if abs_d < 0.8 else 'large'
print(f"Result: t({n-1}) = {t_stat:.3f}, p = {p_two:.4f}")
print(f"95% CI for μ: ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"Effect size: d = {effect_size_d:.3f} ({label})")
Key Takeaways
- p-value = P(data at least this extreme | H₀ true); it says nothing directly about H₁
- Reject H₀ when p < α, but remember that α = 0.05 is a convention, not a magical boundary
- Statistical significance ≠ practical significance — a tiny difference can be "significant" with enough data
- Always report effect sizes and confidence intervals alongside p-values
- p > 0.05 does not mean "no effect" — it means "insufficient evidence against H₀"
- Pre-register your hypotheses to avoid p-hacking and false discoveries
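To see why pre-registration matters, consider one simple form of p-hacking: measuring several outcomes and reporting whichever gives the smallest p-value. The sketch below assumes 10 independent outcomes per study and a true null for all of them (both are illustrative assumptions), so every "discovery" it counts is false.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical setup: H0 is true for every outcome in every study
n_studies = 5_000
n_outcomes = 10     # outcomes measured per study (assumed independent)
n = 30              # observations per outcome
alpha = 0.05

# Shape: (study, outcome, observation); every true mean is 0
data = rng.normal(0.0, 1.0, size=(n_studies, n_outcomes, n))
p_values = stats.ttest_1samp(data, popmean=0.0, axis=2).pvalue  # (study, outcome)

# Pre-registered analysis: only the first outcome, chosen in advance, is tested
rate_preregistered = np.mean(p_values[:, 0] < alpha)

# p-hacked analysis: report whichever outcome gave the smallest p-value
rate_hacked = np.mean(p_values.min(axis=1) < alpha)

# Theoretical rate for independent tests: 1 - (1 - alpha)^k
expected_hacked = 1 - (1 - alpha) ** n_outcomes

print(f"False positive rate, pre-registered outcome: {rate_preregistered:.3f}")
print(f"False positive rate, best of {n_outcomes} outcomes:   {rate_hacked:.3f}")
print(f"Theoretical rate for best of {n_outcomes}:           {expected_hacked:.3f}")
```

The pre-registered analysis keeps the false positive rate near α, while picking the best of ten pushes it to roughly 40%. Real outcomes are usually correlated, which lowers the inflation somewhat, but the direction of the problem is the same.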