Multiple Testing Problem — FWER and False Discovery Rate

Hypothesis TestingAdvanced TopicsFree Lesson

Advertisement

The Multiple Testing Problem

If you run 20 hypothesis tests at α = 0.05 when all H₀s are true, you expect 1 false positive by chance. Run 1000 tests? Expect 50 false discoveries.

Family-Wise Error Rate (FWER)

FWER = P(at least one false rejection) = 1 − (1−α)^m for independent tests

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests
import matplotlib.pyplot as plt

# Show FWER inflation
alpha = 0.05
m_tests = np.arange(1, 201)
fwer = 1 - (1 - alpha)**m_tests

plt.figure(figsize=(10, 5))
plt.plot(m_tests, fwer, 'b-', linewidth=2)
plt.axhline(0.05, color='red', linestyle='--', label='α=0.05 (desired level)')
plt.axhline(0.50, color='orange', linestyle=':', label='50% false positive rate')
plt.xlabel('Number of Tests (m)')
plt.ylabel('FWER (P at least one false positive)')
plt.title('Multiple Testing Problem: FWER Inflation')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('multiple_testing.png', dpi=150)
plt.show()

print(f"FWER with {5} tests:   {1-(1-0.05)**5:.4f}")
print(f"FWER with {20} tests:  {1-(1-0.05)**20:.4f}")
print(f"FWER with {100} tests: {1-(1-0.05)**100:.4f}")

# ==========================================
# CORRECTIONS
# ==========================================
np.random.seed(42)
n_tests = 20
n_truly_null = 15  # 15 of 20 are truly null
n_true_alt = 5    # 5 truly have effects

# Simulate p-values
null_p = np.random.uniform(0, 1, n_truly_null)       # null: uniform
alt_p = np.random.beta(0.3, 5, n_true_alt)           # alternative: small p-values
all_p = np.concatenate([null_p, alt_p])
np.random.shuffle(all_p)

print(f"
Uncorrected: {(all_p < 0.05).sum()} rejections (some are false)")

# Bonferroni correction
bonf_alpha = 0.05 / n_tests
print(f"
Bonferroni (α/m = {bonf_alpha:.4f}): {(all_p < bonf_alpha).sum()} rejections")

# Holm-Bonferroni (uniformly more powerful than Bonferroni)
reject_holm, pvals_corrected_holm, _, _ = multipletests(all_p, alpha=0.05, method='holm')
print(f"Holm-Bonferroni: {reject_holm.sum()} rejections")

# Benjamini-Hochberg (controls FDR, not FWER — more power for many tests)
reject_bh, pvals_corrected_bh, _, _ = multipletests(all_p, alpha=0.05, method='fdr_bh')
print(f"Benjamini-Hochberg (FDR): {reject_bh.sum()} rejections")

print("
B-H procedure: controls the EXPECTED proportion of false discoveries, not probability of any")
print("  FDR ≤ α (expected false discovery rate)")
print("  More powerful than Bonferroni for many simultaneous tests")

FWER vs FDR

MethodControlsPowerUse When
BonferroniFWER (strong)LowestFew tests, any false positive is catastrophic
HolmFWER (strong)Higher than BonferroniFew-moderate tests
Benjamini-HochbergFDRHighestMany tests (genomics, fMRI), some false positives acceptable

Key Takeaways

  1. Run m tests at α → expect m×α false positives when all nulls are true
  2. Bonferroni: divide α by m — simple, conservative, appropriate for few tests
  3. Benjamini-Hochberg controls FDR — used in genomics (thousands of tests)
  4. Pre-specify your comparisons before seeing data — reduces multiple testing inflation
  5. Report all tests conducted — selective reporting is a major source of false discoveries

Advertisement

Need Expert Statistics Help?

Get personalized tutoring, dissertation support, or statistical consulting.

Advertisement