Statistics

Hypothesis Testing

Master hypothesis testing, p-values, Type I/II errors, power analysis, and real-world applications in AI/ML.

ℹ️ Why It Matters

Hypothesis testing is the backbone of scientific discovery and data-driven decision making. Whether validating a clinical trial, tuning a machine learning model, or measuring the impact of a new feature, hypothesis testing provides the formal framework to distinguish real effects from random noise. Without it, every observed difference — no matter how small or how likely to occur by chance — could be mistaken for a meaningful finding.


Overview

Every hypothesis test begins by formulating two competing statements about a population parameter. The null hypothesis ($H_0$) is the default assumption of no effect. The alternative hypothesis ($H_1$) is the claim that an effect exists. A test statistic measures how far the observed data deviate from what $H_0$ predicts. The p-value quantifies the probability of seeing results at least as extreme if $H_0$ is true. We reject $H_0$ when the p-value is at or below the significance level $\alpha$. Two types of errors are possible: Type I (false positive, probability $\alpha$) and Type II (false negative, probability $\beta$). Power ($1 - \beta$) is the probability of detecting a real effect, and it increases with effect size, sample size, and $\alpha$.


Key Concepts

Test Statistic (Z-Test)

$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$

Here,

  • $\bar{x}$ = Sample mean
  • $\mu_0$ = Hypothesized population mean
  • $\sigma$ = Population standard deviation
  • $n$ = Sample size
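As a minimal sketch of the formula above, the following computes the z statistic and compares it with the two-sided critical value. The function name `z_statistic` and the input numbers are illustrative assumptions, not values from the lesson; only the Python standard library is used.

```python
from math import sqrt
from statistics import NormalDist

def z_statistic(sample_mean, mu0, sigma, n):
    """Standardized distance between the sample mean and the hypothesized mean."""
    return (sample_mean - mu0) / (sigma / sqrt(n))

# Illustrative numbers, not taken from the lesson.
z = z_statistic(sample_mean=103, mu0=100, sigma=9, n=36)

# Two-sided critical value for alpha = 0.05 (about 1.96).
alpha = 0.05
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

reject = abs(z) > z_crit
print(f"z = {z:.2f}, critical value = {z_crit:.3f}, reject H0: {reject}")
# prints: z = 2.00, critical value = 1.960, reject H0: True
```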

P-Value

$$p = P(\text{data or more extreme} \mid H_0 \text{ is true})$$

Here,

  • $p$ = Probability of observing results at least this extreme under $H_0$
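For a z statistic, the definition above reduces to a tail area of the standard normal distribution. The sketch below assumes the test statistic is standard normal under $H_0$; the function name `p_value_from_z` and its `alternative` argument are hypothetical names chosen for illustration.

```python
from statistics import NormalDist

def p_value_from_z(z, alternative="two-sided"):
    """Tail probability of a standard normal test statistic under H0."""
    std_normal = NormalDist()
    if alternative == "two-sided":
        # Both tails: results at least as far from zero as the observed z.
        return 2 * (1 - std_normal.cdf(abs(z)))
    if alternative == "greater":
        return 1 - std_normal.cdf(z)
    if alternative == "less":
        return std_normal.cdf(z)
    raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")

print(f"{p_value_from_z(2.0):.4f}")             # prints 0.0455
print(f"{p_value_from_z(2.0, 'greater'):.4f}")  # prints 0.0228
```

The one-sided p-value is half the two-sided one here, which is why the choice of alternative should be fixed before looking at the data.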

Power of a Test

$$\text{Power} = 1 - \beta = P(\text{Reject } H_0 \mid H_0 \text{ is false})$$

Here,

  • $\beta$ = Type II error probability
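To make the definition concrete, here is a sketch of the power of a two-sided, one-sample z-test as a function of the standardized effect size $d$ and sample size $n$. It assumes a known standard deviation and a normal test statistic; `z_test_power` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(d, n, alpha=0.05):
    """Approximate power of a two-sided one-sample z-test for standardized effect d."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n)  # mean of the z statistic when the true effect is d
    # Probability that the statistic lands in either rejection region.
    upper_tail = 1 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail

for n in (16, 32, 64):
    print(n, round(z_test_power(0.5, n), 3))
```

With `d = 0` the function returns $\alpha$, since every rejection is then a Type I error; power grows as either `d` or `n` increases.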

Sample Size for Power

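$$n = \left(\frac{z_{1-\alpha/2} + z_{1-\beta}}{d}\right)^2$$

Here,

  • $d$ = Cohen's d (effect size)
  • $z_{1-\alpha/2}$ = Critical value for the two-sided significance level
  • $z_{1-\beta}$ = Critical value for the desired power
  • $n$ = Sample size for a one-sample (or paired) test; for two independent groups of equal size, the per-group size is $2n$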
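The sample-size formula can be sketched in a few lines. This is a normal-approximation calculation, not an exact t-based one; the helper name `required_n` and the `two_sample` flag (which doubles the result to give a per-group size for two independent groups) are assumptions introduced for illustration.

```python
from math import ceil
from statistics import NormalDist

def required_n(d, alpha=0.05, power=0.80, two_sample=False):
    """Normal-approximation sample size for a two-sided test of effect size d."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = std_normal.inv_cdf(power)           # about 0.842 for power = 0.80
    n = ((z_alpha + z_beta) / d) ** 2
    if two_sample:
        n *= 2  # per-group size when comparing two independent means
    return ceil(n)  # round up so the target power is not undershot

print(required_n(0.5))                   # prints 32
print(required_n(0.5, two_sample=True))  # prints 63
```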

Cohen's d (Effect Size)

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}$$

Here,

  • $\bar{x}_1, \bar{x}_2$ = Sample means of the two groups
  • $s_p$ = Pooled standard deviation
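The lesson does not spell out how $s_p$ is computed. The sketch below assumes the common definition, a weighted average of the two sample variances with $n - 1$ denominators; the data are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group1, group2):
    """Cohen's d using the pooled standard deviation of two independent samples."""
    n1, n2 = len(group1), len(group2)
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / sqrt(pooled_var)

# Made-up data: the means differ by 1 and both groups have the same spread.
treatment = [2, 4, 6, 8]
control = [1, 3, 5, 7]
print(round(cohens_d(treatment, control), 3))  # prints 0.387
```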

Error Matrix

| Decision | $H_0$ is True | $H_0$ is False |
|---|---|---|
| Reject $H_0$ | Type I error ($\alpha$): false positive | Power ($1-\beta$): true positive |
| Fail to reject $H_0$ | Correct: true negative | Type II error ($\beta$): false negative |

Effect Size Benchmarks (Cohen's d)

| Effect | Cohen's d | Interpretation |
|---|---|---|
| Small | 0.2 | Subtle, hard to detect |
| Medium | 0.5 | Noticeable practical effect |
| Large | 0.8 | Strong, clearly visible |

P-Value Interpretation

| P-Value | Evidence Against $H_0$ |
|---|---|
| $p < 0.01$ | Very strong |
| $p < 0.05$ | Strong |
| $p < 0.10$ | Weak |
| $p \geq 0.10$ | Little or none |

Quick Example

📝One-Sample T-Test

A researcher claims the average response time is 200 ms, so $H_0: \mu = 200$. Sample: $n = 25$, $\bar{x} = 215$, $s = 30$.

$$t = \frac{215 - 200}{30/\sqrt{25}} = \frac{15}{6} = 2.5$$

With $df = 24$, the two-sided critical value at $\alpha = 0.05$ is $t_{0.025, 24} = 2.064$. Since $|t| = 2.5 > 2.064$, reject $H_0$.

There is sufficient evidence that the mean response time differs from 200ms.
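The arithmetic of this example can be reproduced directly from the summary statistics. In this sketch the critical value 2.064 is copied from the example rather than computed, because the Python standard library has no t distribution; `t_statistic` is a hypothetical helper name.

```python
from math import sqrt

def t_statistic(sample_mean, mu0, s, n):
    """One-sample t statistic from summary statistics."""
    return (sample_mean - mu0) / (s / sqrt(n))

t = t_statistic(sample_mean=215, mu0=200, s=30, n=25)
t_crit = 2.064  # two-sided critical value for df = 24, alpha = 0.05 (from the example)

print(f"t = {t:.2f}, reject H0: {abs(t) > t_crit}")
# prints: t = 2.50, reject H0: True
```

When the raw observations are available rather than summary statistics, SciPy's `scipy.stats.ttest_1samp` returns the t statistic and p-value in one call.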

📝Power Analysis

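To detect a medium effect ($d = 0.5$) at $\alpha = 0.05$ (two-sided) with power = 0.80 when comparing two independent groups, the per-group sample size is twice the one-sample value:

$$n = 2\left(\frac{1.96 + 0.842}{0.5}\right)^2 = 2\left(\frac{2.802}{0.5}\right)^2 \approx 62.8$$

You need approximately 63 participants per group under this normal approximation; an exact t-based calculation gives 64.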
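A planned sample size can be sanity-checked by simulation. The sketch below is a Monte Carlo estimate under simplifying assumptions: both groups are normal with known standard deviation 1, and a two-sample z-test is used instead of a t-test. The function name, seed, and number of simulations are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

def simulated_power(d, n_per_group, alpha=0.05, n_sims=2000, seed=0):
    """Fraction of simulated two-group experiments that reject H0 at level alpha."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(2 / n_per_group)  # SE of the difference in means when sigma = 1
    rejections = 0
    for _ in range(n_sims):
        treatment = [rng.gauss(d, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        diff = sum(treatment) / n_per_group - sum(control) / n_per_group
        if abs(diff / se) > z_crit:
            rejections += 1
    return rejections / n_sims

print(simulated_power(d=0.5, n_per_group=64))  # should land near 0.80
print(simulated_power(d=0.0, n_per_group=64))  # should land near alpha = 0.05
```

Setting `d=0.0` turns the same loop into a check of the Type I error rate, since every rejection is then a false positive.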


Key Takeaways

📋Summary: Hypothesis Testing

  • Decision Rule: Reject $H_0$ if p-value ≤ $\alpha$. Never say "accept $H_0$"; say "fail to reject."
  • p-value: Probability of results this extreme given $H_0$ is true. NOT the probability that $H_0$ is true.
  • Type I vs Type II: Type I = false positive ($\alpha$); Type II = false negative ($\beta$). Reducing one increases the other for fixed $n$.
  • Power: Increases with effect size, sample size, and $\alpha$. Always conduct power analysis before collecting data.
  • Effect Size: A tiny effect can be "significant" with large $n$. Always report Cohen's d alongside p-values.
  • One vs Two Tailed: Use two-tailed as the default. One-tailed requires a strong a priori directional prediction.
  • Multiple Comparisons: Many tests inflate family-wise error. Use Bonferroni, Holm, or FDR correction.
  • Statistical vs Practical: Statistical significance ≠ practical significance. Always consider effect size and context.
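The multiple-comparisons point in the summary above can be illustrated with two of the corrections it names. This is a minimal sketch of the Bonferroni and Holm procedures on a made-up list of p-values; the function names are hypothetical.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject each hypothesis whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: test sorted p-values against growing thresholds."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The (rank + 1)-th smallest p-value is compared with alpha / (m - rank).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

p_values = [0.01, 0.04, 0.02, 0.005]  # made-up p-values from four tests
print(bonferroni_reject(p_values))  # prints [True, False, False, True]
print(holm_reject(p_values))        # prints [True, True, True, True]
```

Both procedures control the family-wise error rate at `alpha`, but Holm never rejects fewer hypotheses than Bonferroni, as the example shows.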

Deep Dive

For detailed explanations, worked examples, and Python implementations, explore the dedicated statistics lessons:

Hypothesis Formulation

Errors and Significance

Power and Effect Size

  • Power of a Test — Factors affecting power, a priori power analysis, and underpowered studies
  • Effect Size — Cohen's d, Hedges' g, eta-squared, and why practical significance matters
