P-Values — What They Mean, What They Don't, and Common Misconceptions

P-Values: The Most Misunderstood Number in Statistics

The p-value is simultaneously the most used and most misused concept in statistics. Let's get it exactly right.


The Exact Definition

The p-value is the probability of obtaining a test statistic as extreme as, or more extreme than, the observed value, assuming the null hypothesis is true.

$$p = P(|T| \geq |t_{\text{obs}}| \mid H_0 \text{ is true})$$

In plain English: "If H₀ were true, how often would we see results this extreme just by chance?"

  • Small p-value → data is surprising if H₀ were true → evidence against H₀
  • Large p-value → data is consistent with H₀ → no strong evidence against H₀
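The plain-English reading above can be checked by brute force. The following is a minimal sketch, not part of the original lesson: it simulates many samples from a world where H₀ is true and counts how often the t-statistic is at least as extreme as an observed one. The values of `n`, `mu_0`, `sigma`, and `t_obs` are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Hypothetical setup (assumptions, not taken from the lesson's data)
n = 20         # sample size
mu_0 = 70.0    # mean under H0
sigma = 5.0    # population SD used to generate the null samples
t_obs = 2.5    # an observed t-statistic we want a p-value for

# Draw many samples from a world where H0 is true; compute t for each one
n_sims = 50_000
sims = rng.normal(mu_0, sigma, size=(n_sims, n))
t_null = (sims.mean(axis=1) - mu_0) / (sims.std(axis=1, ddof=1) / np.sqrt(n))

# Simulated p-value: share of null t-statistics at least as extreme as t_obs
p_sim = np.mean(np.abs(t_null) >= abs(t_obs))

# Analytic two-sided p-value from the t distribution with n-1 degrees of freedom
p_exact = 2 * stats.t.sf(abs(t_obs), df=n - 1)

print(f"simulated p = {p_sim:.4f}")
print(f"analytic  p = {p_exact:.4f}")
```

The two numbers should agree up to simulation noise, which is the sense in which a p-value answers "how often would this happen under H₀?".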

Computing P-Values in Python

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

np.random.seed(42)

# One-sample t-test example
# H₀: μ = 70 (exam scores average 70)
# H₁: μ ≠ 70

sample = np.array([74, 69, 78, 71, 76, 72, 80, 68, 75, 77,
                   73, 79, 65, 82, 71, 74, 76, 70, 78, 73])
n = len(sample)
x_bar = sample.mean()
s = sample.std(ddof=1)
mu_0 = 70

t_stat = (x_bar - mu_0) / (s / np.sqrt(n))
p_two = 2 * stats.t.sf(abs(t_stat), df=n-1)

print(f"Sample: n={n}, x̄={x_bar:.2f}, s={s:.2f}")
print(f"H₀: μ = {mu_0}")
print(f"t-statistic = {t_stat:.4f}")
print(f"p-value (two-tailed) = {p_two:.4f}")

# Visualize p-value
x = np.linspace(-5, 5, 1000)
t_dist = stats.t(df=n-1)
y = t_dist.pdf(x)

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(x, y, 'b-', linewidth=2)
ax.fill_between(x, y, where=x >= abs(t_stat), alpha=0.4, color='red', label=f'Right tail p/2={p_two/2:.4f}')
ax.fill_between(x, y, where=x <= -abs(t_stat), alpha=0.4, color='red', label=f'Left tail p/2={p_two/2:.4f}')
ax.axvline(t_stat, color='red', linewidth=2, linestyle='--', label=f't_obs={t_stat:.3f}')
ax.axvline(-t_stat, color='red', linewidth=2, linestyle='--')
ax.set_title(f'P-value = {p_two:.4f} (shaded area = probability under H₀)')
ax.legend()
plt.tight_layout()
plt.savefig('pvalue_visualization.png', dpi=150)
plt.show()
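As a sanity check on the manual calculation, the same test can be run through `scipy.stats.ttest_1samp`. This is a small sketch that repeats the sample so it runs on its own; the one-sided p-value at the end is an extra illustration (for the alternative μ > 70) and is not part of the lesson's two-sided test.

```python
import numpy as np
from scipy import stats

sample = np.array([74, 69, 78, 71, 76, 72, 80, 68, 75, 77,
                   73, 79, 65, 82, 71, 74, 76, 70, 78, 73])
mu_0 = 70
n = len(sample)

# Manual computation, as in the lesson
t_manual = (sample.mean() - mu_0) / (sample.std(ddof=1) / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Library computation (two-sided by default)
res = stats.ttest_1samp(sample, popmean=mu_0)

print(f"manual: t = {t_manual:.4f}, p = {p_manual:.6f}")
print(f"scipy : t = {res.statistic:.4f}, p = {res.pvalue:.6f}")

# One-sided p-value for H1: mu > 70 is the upper-tail area only
p_greater = stats.t.sf(t_manual, df=n - 1)
print(f"one-sided (mu > {mu_0}): p = {p_greater:.6f}")
```

Because the observed t-statistic is positive here, the one-sided p-value is half the two-sided one.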

What a P-Value IS and IS NOT

Statement | Correct?
"p = 0.03 means there's a 3% chance H₀ is true" | ❌ WRONG
"p = 0.03 means there's a 97% chance H₁ is true" | ❌ WRONG
"p = 0.03 means: if H₀ were true, only 3% of samples would yield a result at least this extreme" | ✅ CORRECT
"p = 0.03 means the result is practically important" | ❌ WRONG
"p = 0.03 means the study will replicate 97% of the time" | ❌ WRONG
"p > 0.05 proves H₀ is true" | ❌ WRONG

# Show: p-value does NOT measure effect size or practical importance
from scipy.stats import ttest_1samp

mu_0 = 0

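# Tiny but "significant" effect (huge n)
# A true mean of 0.02 SD gives an expected t of about 0.02 * sqrt(100000) ≈ 6.3,
# so this is reliably significant (a true mean of 0.001 would give t ≈ 0.3)
tiny_effect = np.random.normal(0.02, 1, 100000)
t1, p1 = ttest_1samp(tiny_effect, popmean=mu_0)
print(f"Tiny effect (Δ≈0.02), n=100,000: t={t1:.2f}, p={p1:.6f} ← {'Significant' if p1 < 0.05 else 'Not significant'}!")

# Large but usually non-significant effect (small n)
# Expected t is only about 2.0 / (5 / sqrt(8)) ≈ 1.1, so most draws miss p < 0.05
large_effect = np.random.normal(2.0, 5, 8)
t2, p2 = ttest_1samp(large_effect, popmean=mu_0)
print(f"Large effect (Δ≈2.0), n=8: t={t2:.2f}, p={p2:.4f} ← {'Significant' if p2 < 0.05 else 'Not significant'}!")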

print("\nConclusion: p-value confounds effect size with sample size!")
print("Always report effect size (Cohen's d) alongside p-value.")

# Cohen's d for a one-sample test: (sample mean - null mean) / sample SD
d1 = (tiny_effect.mean() - mu_0) / tiny_effect.std(ddof=1)
d2 = (large_effect.mean() - mu_0) / large_effect.std(ddof=1)
print(f"Cohen's d (tiny effect): {d1:.4f}")
print(f"Cohen's d (large effect): {d2:.4f}")
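The first row of the table ("a 3% chance H₀ is true") can also be probed by simulation. The sketch below is illustrative only: it assumes a hypothetical research field in which 80% of tested hypotheses are truly null (`prior_null`) and real effects are 0.5 SD (`true_effect`). Neither number comes from the lesson. It then asks what share of the "significant" results came from a true H₀.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed

# Hypothetical assumptions about a research field (not from the lesson)
n_studies = 20_000
n = 20              # observations per study
prior_null = 0.8    # assumed share of studies where H0 is actually true
true_effect = 0.5   # assumed effect size (in SD units) when H0 is false
alpha = 0.05

# Decide which studies have a true H0, then generate each study's data
null_is_true = rng.random(n_studies) < prior_null
means = np.where(null_is_true, 0.0, true_effect)
data = rng.normal(means[:, None], 1.0, size=(n_studies, n))

# One-sample t-test per study (one test per row)
res = stats.ttest_1samp(data, popmean=0.0, axis=1)
significant = res.pvalue < alpha

# Among significant results, how many came from a true H0?
share_null_among_sig = null_is_true[significant].mean()
print(f"Significant studies: {significant.sum()}")
print(f"Share of significant results where H0 was true: {share_null_among_sig:.3f}")
```

Under these assumed inputs the share comes out well above 5%, and it moves when `prior_null` or `true_effect` changes. The probability that H₀ is true depends on information the p-value does not contain.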

The p-value Controversy

The American Statistical Association issued a statement on p-values in 2016. Its key points include:

  1. p-values do not measure the probability that H₀ is true
  2. p-values do not measure the probability that the data were produced by random chance alone
  3. Statistical significance does not equal scientific or practical significance
  4. p-values do not measure the size of an effect
  5. A p > 0.05 does not mean there is no effect

Recommended reporting:

# Best practice: report confidence interval + effect size + p-value
from scipy import stats

# Reset the null mean for the exam-score example; the effect-size demo reused mu_0 = 0
mu_0 = 70
effect_size_d = (x_bar - mu_0) / s
ci = stats.t.interval(0.95, df=n-1, loc=x_bar, scale=s/np.sqrt(n))

print(f"Result: t({n-1}) = {t_stat:.3f}, p = {p_two:.4f}")
print(f"95% CI for μ: ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"Effect size: d = {effect_size_d:.3f} ({'small' if abs(effect_size_d)<0.5 else 'medium' if abs(effect_size_d)<0.8 else 'large'})")

Key Takeaways

  1. p-value = P(data this extreme | H₀ true) — it says nothing directly about H₁
  2. p < α is the threshold for rejection — but α = 0.05 is arbitrary, not magical
  3. Statistical significance ≠ practical significance — a tiny difference can be "significant" with enough data
  4. Always report effect sizes and confidence intervals alongside p-values
  5. p > 0.05 does not mean "no effect" — it means "insufficient evidence against H₀"
  6. Pre-register your hypotheses to avoid p-hacking and false discoveries
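To see why the last takeaway matters, here is a rough sketch of one form of p-hacking: measuring many outcomes and reporting whichever one reaches p < 0.05. All data are generated with H₀ true, so every "discovery" is a false positive. The counts (`n_outcomes = 20`, `n = 30`) are hypothetical choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)  # arbitrary seed

# Hypothetical setup (assumptions for illustration)
n_experiments = 5_000
n_outcomes = 20   # outcomes tested per experiment
n = 30            # observations per outcome
alpha = 0.05

# Every outcome has true mean 0, so H0 holds for every single test
data = rng.normal(0.0, 1.0, size=(n_experiments, n_outcomes, n))
pvals = stats.ttest_1samp(data, popmean=0.0, axis=2).pvalue

# False-positive rate for one pre-specified outcome
single_test_rate = np.mean(pvals[:, 0] < alpha)

# False-positive rate when any of the outcomes may be reported
any_hit_rate = np.mean((pvals < alpha).any(axis=1))

# Theoretical rate for independent tests: 1 - (1 - alpha)^k
expected = 1 - (1 - alpha) ** n_outcomes

print(f"One pre-specified test:  {single_test_rate:.3f}")
print(f"Best of {n_outcomes} outcomes:     {any_hit_rate:.3f}")
print(f"Theory, 1-(1-α)^{n_outcomes}:      {expected:.3f}")
```

Committing to a single outcome in advance keeps the false-positive rate near α, while picking the best of many independent outcomes pushes it toward 1 − (1 − α)^k.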
