Hypothesis Testing
ℹ️ Why It Matters
Hypothesis testing is the backbone of scientific discovery and data-driven decision making. Whether validating a clinical trial, tuning a machine learning model, or measuring the impact of a new feature, hypothesis testing provides the formal framework to distinguish real effects from random noise. Without it, every observed difference — no matter how small or how likely to occur by chance — could be mistaken for a meaningful finding.
Overview
Every hypothesis test begins by formulating two competing statements about a population parameter. The null hypothesis ($H_0$) is the default assumption of no effect. The alternative hypothesis ($H_1$) is the claim that an effect exists. A test statistic measures how far observed data deviates from $H_0$. The p-value quantifies the probability of seeing results at least as extreme if $H_0$ is true. We reject $H_0$ when the p-value falls below the significance level $\alpha$. Two types of errors are possible: Type I (false positive, probability $\alpha$) and Type II (false negative, probability $\beta$). Power ($1 - \beta$) is the probability of detecting a real effect, and increases with effect size, sample size, and $\alpha$.
Key Concepts
Test Statistic (Z-Test)
$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$
Here,
- $\bar{x}$ = Sample mean
- $\mu_0$ = Hypothesized population mean
- $\sigma$ = Population standard deviation
- $n$ = Sample size
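As a minimal sketch, the z statistic (sample mean minus hypothesized mean, divided by $\sigma / \sqrt{n}$) and its two-sided p-value can be computed with Python's standard library. The function name `z_test` and the input numbers are hypothetical, chosen only for illustration, and the sketch assumes the population standard deviation is known.

```python
from math import sqrt
from statistics import NormalDist


def z_test(sample_mean, mu0, sigma, n):
    """Return (z, two_sided_p) for a one-sample z-test with known sigma."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Two-sided p-value: probability of |Z| >= |z| when H0 is true
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical inputs, not taken from the lesson
z, p = z_test(sample_mean=103.0, mu0=100.0, sigma=15.0, n=100)
print(f"z = {z:.2f}, p = {p:.4f}")  # z = 2.00, p = 0.0455
```

With these made-up numbers the p-value falls just below 0.05, so a test at that level would reject the null hypothesis.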
P-Value
$$p = P(\text{result at least as extreme as observed} \mid H_0)$$
Here,
- $p$ = Probability of observing results this extreme under $H_0$
Power of a Test
$$\text{Power} = 1 - \beta$$
Here,
- $\beta$ = Type II error probability
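To make the dependence of power on effect size, sample size, and significance level concrete, here is a rough sketch for one specific case: a two-sided, one-sample z-test with known standard deviation, where `d` is the true mean shift in standard-deviation units. The function name `z_test_power` is hypothetical, and other tests (for example t-tests) need a different calculation.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(d, n, alpha=0.05):
    """Power of a two-sided one-sample z-test for standardized effect d."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n)  # mean of the z statistic when the true effect is d
    # Probability that the statistic lands in either rejection region
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Power grows with sample size for a fixed medium effect
for n in (10, 30, 100):
    print(n, round(z_test_power(d=0.5, n=n), 3))
```

When `d` is 0 the function returns the significance level itself, which matches the definition of the Type I error rate.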
Sample Size for Power
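$$n_{\text{per group}} = \frac{2\,(z_{\alpha/2} + z_{\beta})^2}{d^2}$$
Here,
- $d$ = Cohen's d (effect size)
- $z_{\alpha/2}$ = Critical value for significance level
- $z_{\beta}$ = Critical value for desired power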
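The sample-size calculation can be sketched in a few lines, assuming a two-sided comparison of two group means under the normal approximation $n = 2\,(z_{\alpha/2} + z_{\beta})^2 / d^2$ per group. The function name `n_per_group` is hypothetical, and the exact form of the formula is an assumption based on the definitions above.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group sample size (normal approximation, two-sided)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = std_normal.inv_cdf(power)           # critical value for power
    # Round up: a fractional participant cannot be recruited
    return ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)


for d in (0.2, 0.5, 0.8):
    print(d, n_per_group(d))
```

For $d = 0.5$ this sketch returns 63, one fewer than the figure of approximately 64 that t-based power calculations give, because the normal approximation slightly understates the required sample.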
Cohen's d (Effect Size)
$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}$$
Here,
- $s_p$ = Pooled standard deviation
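A small sketch of Cohen's d for two independent samples, assuming the usual pooled standard deviation that weights each sample variance by its degrees of freedom. The function name `cohens_d` and the sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group1, group2):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    # statistics.variance is the sample variance (n - 1 denominator)
    pooled_var = (
        (n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)
    ) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / sqrt(pooled_var)


# Hypothetical measurements, for illustration only
treatment = [5.1, 5.8, 6.2, 5.5, 6.0]
control = [4.9, 5.2, 5.0, 5.6, 4.8]
print(round(cohens_d(treatment, control), 2))
```

The sign of the result follows the order of the arguments: a positive value means the first group has the larger mean.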
Error Matrix
| | $H_0$ is True | $H_0$ is False |
|---|---|---|
| Reject $H_0$ | Type I Error ($\alpha$) — false positive | Power ($1 - \beta$) — true positive |
| Fail to Reject $H_0$ | Correct — true negative | Type II Error ($\beta$) — false negative |
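The top-left cell of the matrix can be checked by simulation. The sketch below repeatedly draws samples for which the null hypothesis is true by construction and counts how often a two-sided z-test rejects anyway. The function name `false_positive_rate`, the seed, and the trial count are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def false_positive_rate(n=30, alpha=0.05, trials=5000, seed=0):
    """Fraction of true-null experiments that a two-sided z-test rejects."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        # H0 is true by construction: data come from N(0, 1) and mu0 = 0
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = (sum(sample) / n) / (1.0 / sqrt(n))
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials


print(false_positive_rate())
```

The printed rate should land close to the chosen significance level, which is what it means for that level to be the Type I error probability.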
Effect Size Benchmarks (Cohen's d)
| Effect | Cohen's d | Interpretation |
|---|---|---|
| Small | 0.2 | Subtle, hard to detect |
| Medium | 0.5 | Noticeable practical effect |
| Large | 0.8 | Strong, clearly visible |
P-Value Interpretation
| P-Value | Evidence Against $H_0$ |
|---|---|
| $p < 0.01$ | Very strong |
| $0.01 \le p < 0.05$ | Strong |
| $0.05 \le p < 0.10$ | Weak |
| $p \ge 0.10$ | Little or none |
Quick Example
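📝 One-Sample T-Test
A researcher claims average response time is 200 ms, so $H_0: \mu = 200$ and $H_1: \mu \neq 200$. From a sample of size $n$ with mean $\bar{x}$ and standard deviation $s$, the test statistic is $t = (\bar{x} - 200) / (s / \sqrt{n})$.
At significance level $\alpha$, the two-tailed critical value is $t_{\alpha/2,\,n-1}$. Since $|t|$ exceeds the critical value, reject $H_0$.
There is sufficient evidence that the mean response time differs from 200 ms.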
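📝 Power Analysis
To detect a medium effect ($d = 0.5$) at $\alpha = 0.05$ with power = 0.80:
$$n = \frac{2\,(z_{\alpha/2} + z_{\beta})^2}{d^2} = \frac{2\,(1.96 + 0.84)^2}{0.5^2} \approx 63$$
The normal approximation gives about 63 per group; the exact $t$-based calculation raises this slightly, so you need approximately 64 participants per group.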
Key Takeaways
📋 Summary: Hypothesis Testing
- Decision Rule: Reject $H_0$ if p-value ≤ $\alpha$. Never say "accept $H_0$" — say "fail to reject."
- p-value: Probability of results this extreme given $H_0$ is true. NOT the probability that $H_0$ is true.
- Type I vs Type II: Type I = false positive ($\alpha$); Type II = false negative ($\beta$). Reducing one increases the other for fixed $n$.
- Power: Increases with effect size, sample size, and $\alpha$. Always conduct power analysis before collecting data.
- Effect Size: A tiny effect can be "significant" with large $n$. Always report Cohen's d alongside p-values.
- One vs Two Tailed: Use two-tailed as the default. One-tailed requires a strong a priori directional prediction.
- Multiple Comparisons: Many tests inflate family-wise error. Use Bonferroni, Holm, or FDR correction.
- Statistical vs Practical: Statistical significance ≠ practical significance. Always consider effect size and context.
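The multiple-comparisons point can be illustrated with a short sketch of two of the corrections named above, Bonferroni and Holm. The function names and the p-values are hypothetical, and FDR procedures such as Benjamini-Hochberg are not shown.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject each test whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm step-down procedure; returns decisions in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The (rank + 1)-th smallest p-value is compared to alpha / (m - rank)
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first failure; larger p-values are kept
    return reject


p_vals = [0.001, 0.015, 0.02, 0.04]  # hypothetical p-values
print(bonferroni(p_vals))  # [True, False, False, False]
print(holm(p_vals))        # [True, True, True, True]
```

Both procedures control the family-wise error rate, but Holm never rejects fewer hypotheses than Bonferroni, which is why it rejects all four here while Bonferroni rejects only one.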
Deep Dive
For detailed explanations, worked examples, and Python implementations, explore the dedicated statistics lessons:
Hypothesis Formulation
- Null and Alternative Hypothesis — How to formulate $H_0$ and $H_1$, one-sided vs. two-sided, and common patterns
Errors and Significance
- Type I and Type II Errors — Error matrix, trade-off, and real-world consequences
- P-Values — Calculation, interpretation, and common misinterpretations
- Significance Levels — Choosing $\alpha$, multiple testing, and when to use 0.01 vs 0.05
Power and Effect Size
- Power of a Test — Factors affecting power, a priori power analysis, and underpowered studies
- Effect Size — Cohen's d, Hedges' g, eta-squared, and why practical significance matters
Related Topics
- Confidence Intervals — CIs and hypothesis tests are duals: same information, different format
- t-Tests — Applying the hypothesis testing framework to compare means
- Chi-Square Tests — Hypothesis testing for categorical data
- Multiple Testing Problem — Why running many tests inflates false positives