The Multiple Testing Problem
If you run 20 hypothesis tests at α = 0.05 when all H₀s are true, you expect 1 false positive by chance. Run 1000 tests? Expect 50 false discoveries.
Family-Wise Error Rate (FWER)
FWER = P(at least one false rejection) = 1 − (1−α)^m for independent tests
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests
import matplotlib.pyplot as plt
# Show FWER inflation
alpha = 0.05
m_tests = np.arange(1, 201)
fwer = 1 - (1 - alpha)**m_tests
plt.figure(figsize=(10, 5))
plt.plot(m_tests, fwer, 'b-', linewidth=2)
plt.axhline(0.05, color='red', linestyle='--', label='α=0.05 (desired level)')
plt.axhline(0.50, color='orange', linestyle=':', label='50% false positive rate')
plt.xlabel('Number of Tests (m)')
plt.ylabel('FWER (P at least one false positive)')
plt.title('Multiple Testing Problem: FWER Inflation')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('multiple_testing.png', dpi=150)
plt.show()
print(f"FWER with {5} tests: {1-(1-0.05)**5:.4f}")
print(f"FWER with {20} tests: {1-(1-0.05)**20:.4f}")
print(f"FWER with {100} tests: {1-(1-0.05)**100:.4f}")
# ==========================================
# CORRECTIONS
# ==========================================
np.random.seed(42)
n_tests = 20
n_truly_null = 15 # 15 of 20 are truly null
n_true_alt = 5 # 5 truly have effects
# Simulate p-values
null_p = np.random.uniform(0, 1, n_truly_null) # null: uniform
alt_p = np.random.beta(0.3, 5, n_true_alt) # alternative: small p-values
all_p = np.concatenate([null_p, alt_p])
np.random.shuffle(all_p)
print(f"
Uncorrected: {(all_p < 0.05).sum()} rejections (some are false)")
# Bonferroni correction
bonf_alpha = 0.05 / n_tests
print(f"
Bonferroni (α/m = {bonf_alpha:.4f}): {(all_p < bonf_alpha).sum()} rejections")
# Holm-Bonferroni (uniformly more powerful than Bonferroni)
reject_holm, pvals_corrected_holm, _, _ = multipletests(all_p, alpha=0.05, method='holm')
print(f"Holm-Bonferroni: {reject_holm.sum()} rejections")
# Benjamini-Hochberg (controls FDR, not FWER — more power for many tests)
reject_bh, pvals_corrected_bh, _, _ = multipletests(all_p, alpha=0.05, method='fdr_bh')
print(f"Benjamini-Hochberg (FDR): {reject_bh.sum()} rejections")
print("
B-H procedure: controls the EXPECTED proportion of false discoveries, not probability of any")
print(" FDR ≤ α (expected false discovery rate)")
print(" More powerful than Bonferroni for many simultaneous tests")
FWER vs FDR
| Method | Controls | Power | Use When |
|---|---|---|---|
| Bonferroni | FWER (strong) | Lowest | Few tests, any false positive is catastrophic |
| Holm | FWER (strong) | Higher than Bonferroni | Few-moderate tests |
| Benjamini-Hochberg | FDR | Highest | Many tests (genomics, fMRI), some false positives acceptable |
Key Takeaways
- Run m tests at α → expect m×α false positives when all nulls are true
- Bonferroni: divide α by m — simple, conservative, appropriate for few tests
- Benjamini-Hochberg controls FDR — used in genomics (thousands of tests)
- Pre-specify your comparisons before seeing data — reduces multiple testing inflation
- Report all tests conducted — selective reporting is a major source of false discoveries