Why This Matters
Statistical testing is how we make decisions under uncertainty. Instead of guessing whether a drug works, a feature helps, or a difference is real, we quantify the evidence and let the data guide our conclusions.
The Hypothesis Testing Framework
Step 1: State hypotheses
H0 (null): "There is no effect / no difference"
H1 (alternative): "There IS an effect / difference"
Step 2: Collect data
Step 3: Calculate test statistic
(How far is our data from what H0 predicts?)
Step 4: Calculate p-value
(How likely is data at least this extreme if H0 is true?)
Step 5: Make decision
If p-value < alpha -> Reject H0
If p-value >= alpha -> Fail to reject H0
Visual: Rejection Regions
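Distribution under H0 (null hypothesis is true), two-tailed test with alpha = 0.05:
| Region | Test statistic | Decision |
|---|---|---|
| Left tail (shaded) | below -1.96 | Reject H0 |
| Center (white) | between -1.96 and 1.96 | Fail to reject H0 |
| Right tail (shaded) | above 1.96 | Reject H0 |
The critical values -1.96 and 1.96 cut off 2.5% of the distribution in each tail, so the two rejection regions together have probability alpha = 0.05 under H0.
If test statistic falls in shaded area -> Reject H0
If test statistic falls in white area -> Fail to reject H0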
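The critical values can be recovered from the standard normal distribution rather than memorized. The following is a minimal sketch, assuming a two-tailed z-test; the helper name `decide` and the example statistics 2.5 and -1.2 are illustrative and not part of the original material.

```python
from scipy import stats

alpha = 0.05

# Two-tailed test: split alpha evenly between the two tails.
z_crit = stats.norm.ppf(1 - alpha / 2)


def decide(z_stat, crit=z_crit):
    """Return the decision for a two-tailed z-test (hypothetical helper)."""
    return "Reject H0" if abs(z_stat) > crit else "Fail to reject H0"


print(f"Critical values: -{z_crit:.2f} and +{z_crit:.2f}")
print(decide(2.5))   # falls in the right rejection region
print(decide(-1.2))  # falls between the critical values
```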
Type I and Type II Errors
| Decision | H0 true (no effect) | H0 false (effect exists) |
|---|---|---|
| Reject H0 (say effect exists) | TYPE I ERROR (false positive), probability alpha = 0.05 | CORRECT (true positive), probability = power = 1 - beta |
| Fail to reject H0 (say no effect) | CORRECT (true negative) | TYPE II ERROR (false negative), probability beta |
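One way to make the Type I error rate concrete is to simulate many experiments in which H0 really is true and count how often the test rejects anyway. This is a rough sketch under assumed settings (2,000 simulated samples of size 30 from a standard normal population); the rejection rate should land near alpha, not exactly on it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_sims, n = 2000, 30

# Every simulated sample comes from a population where H0 (mu = 0) is true.
samples = rng.normal(loc=0.0, scale=1.0, size=(n_sims, n))

# One t-test per row; any rejection here is a false positive.
_, p_values = stats.ttest_1samp(samples, popmean=0.0, axis=1)
type_i_rate = np.mean(p_values < alpha)

print(f"Observed Type I error rate: {type_i_rate:.3f} (alpha = {alpha})")
```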
The p-value: What It Actually Means
Formal Definition of p-value
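$$p = P\left(|T| \ge |t_{\text{obs}}| \mid H_0\right) \quad \text{(two-sided form)}$$
Here,
- $T$ = test statistic under the null distribution
- $t_{\text{obs}}$ = observed test statistic from the sample
- $H_0$ = null hypothesis is true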
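The definition can be checked by brute force: simulate the test statistic many times under H0 and count how often it is at least as extreme as the observed value. The sketch below assumes a one-sample t statistic with n = 20 and a made-up observed value of 2.1, then compares the simulated share with the analytic two-sided p-value from the t distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n_sims = 20, 20000
t_obs = 2.1  # hypothetical observed test statistic

# Simulate the null distribution of the one-sample t statistic (true mean = 0).
null_samples = rng.normal(loc=0.0, scale=1.0, size=(n_sims, n))
null_t = null_samples.mean(axis=1) / (null_samples.std(axis=1, ddof=1) / np.sqrt(n))

# Share of null statistics at least as extreme as the observed one (two-sided).
p_simulated = np.mean(np.abs(null_t) >= abs(t_obs))

# Analytic counterpart from the t distribution with n - 1 degrees of freedom.
p_analytic = 2 * stats.t.sf(abs(t_obs), df=n - 1)

print(f"Simulated p-value: {p_simulated:.4f}")
print(f"Analytic p-value:  {p_analytic:.4f}")
```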
Common Misconceptions:
WRONG: "p = 0.03 means there's a 3% chance H0 is true"
RIGHT: "If H0 were true, there's a 3% chance of seeing data at least this extreme"
WRONG: "p < 0.05 means the effect is large"
RIGHT: "p < 0.05 means data this extreme would be unlikely under H0"
(A tiny effect can be significant with large n)
WRONG: "p > 0.05 means no effect exists"
RIGHT: "p > 0.05 means we don't have enough evidence to reject H0"
Statistical vs practical significance: A p-value measures how surprising the data would be if H0 were true. It does NOT measure the size or importance of an effect. With n = 1,000,000, even a 0.1 cm difference in mean height can yield p < 0.001 (assuming a standard deviation of about 8 cm), although the corresponding Cohen's d is only about 0.01. Always report effect sizes (Cohen's d, Cramér's V) alongside p-values.
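A quick simulation shows how a very large sample turns a negligible difference into a "significant" one. The numbers here are assumptions chosen for illustration: one million simulated heights with a true mean 0.1 cm above the hypothesized 170 cm and a standard deviation of 8 cm.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
mu_0 = 170.0

# Assumed population: true mean only 0.1 cm above mu_0, SD of 8 cm.
heights = rng.normal(loc=170.1, scale=8.0, size=1_000_000)

t_stat, p_value = stats.ttest_1samp(heights, mu_0)
cohens_d = (heights.mean() - mu_0) / heights.std(ddof=1)

print(f"p-value:   {p_value:.2e}")   # tiny, so H0 is rejected
print(f"Cohen's d: {cohens_d:.4f}")  # far below the 0.2 'small effect' mark
```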
One-Sample t-test
$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$
Here,
- $\bar{x}$ = sample mean
- $\mu_0$ = hypothesized population mean
- $s$ = sample standard deviation
- $n$ = sample size
- $s / \sqrt{n}$ = standard error of the mean
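To see each term of the statistic at work, the calculation can be done by hand and compared with SciPy. This is a small sketch on made-up height values; the two-sided p-value is taken from the t distribution with n - 1 degrees of freedom.

```python
import numpy as np
from scipy import stats

# Made-up sample of heights (cm), for illustration only.
sample = np.array([172.1, 168.4, 175.3, 170.9, 169.2, 174.8, 171.5, 173.0])
mu_0 = 170.0

n = sample.size
x_bar = sample.mean()
s = sample.std(ddof=1)        # sample standard deviation
se = s / np.sqrt(n)           # standard error of the mean
t_manual = (x_bar - mu_0) / se
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# SciPy should agree with the hand calculation.
t_scipy, p_scipy = stats.ttest_1samp(sample, mu_0)
print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```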
Effect Size: Cohen's d
Cohen's d Effect Size
$$d = \frac{\bar{x} - \mu_0}{s}$$
Here,
- $\bar{x}$ = sample mean
- $\mu_0$ = hypothesized population mean
- $s$ = sample standard deviation
|d| Interpretation:
0.2 -> Small effect
0.5 -> Medium effect
0.8 -> Large effect
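The thresholds above can be wrapped in a small helper for reporting. The function name `interpret_cohens_d` and the "negligible" label for values below 0.2 are assumptions added here; the cut-offs themselves are conventions, not hard rules.

```python
def interpret_cohens_d(d):
    """Label |d| using the conventional 0.2 / 0.5 / 0.8 cut-offs (hypothetical helper)."""
    magnitude = abs(d)
    if magnitude < 0.2:
        return "negligible"
    if magnitude < 0.5:
        return "small"
    if magnitude < 0.8:
        return "medium"
    return "large"


for d in (0.1, 0.3, -0.6, 1.2):
    print(f"d = {d:+.1f} -> {interpret_cohens_d(d)}")
```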
Complete Example
import numpy as np
from scipy import stats
# Simulate 30 heights (cm) from a population with true mean 172
np.random.seed(42)
heights = np.random.normal(loc=172, scale=8, size=30)
print(f"Sample mean: {heights.mean():.2f} cm")
print(f"Sample std: {heights.std(ddof=1):.2f} cm")
# H0: the population mean is 170 cm (two-sided test)
mu_0 = 170
t_stat, p_value = stats.ttest_1samp(heights, mu_0)
print(f"\nH0: mu = {mu_0} cm")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")
# Effect size: mean difference in units of the sample standard deviation
cohens_d = (heights.mean() - mu_0) / heights.std(ddof=1)
print(f"Cohen's d: {cohens_d:.4f}")
# Decision at the 5% significance level
alpha = 0.05
if p_value < alpha:
    print(f"\nResult: Reject H0 (p < {alpha})")
else:
    print(f"\nResult: Fail to reject H0 (p >= {alpha})")
Assumptions
1. Independence: Observations are independent
2. Normality: Data is approximately normally distributed
- Check: Shapiro-Wilk test, Q-Q plot
- Robust: t-test is robust to mild non-normality for n > 30
3. Continuous: The dependent variable is continuous
The Central Limit Theorem helps here: even if the underlying data are not normal, the sampling distribution of the mean approaches normality as n increases. As a rule of thumb, for n > 30 the t-test tolerates moderate departures from normality, although heavily skewed or heavy-tailed data may need larger samples.
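The normality checks mentioned above can be run without any plotting. The sketch below uses simulated data as a stand-in for a real sample: `stats.shapiro` gives the Shapiro-Wilk test, and `stats.probplot` returns the Q-Q points together with the correlation `r` of the fitted line, where values close to 1 suggest the points follow a straight line.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
data = rng.normal(loc=172, scale=8, size=50)  # simulated stand-in for real data

# Shapiro-Wilk: H0 is "the data come from a normal distribution".
w_stat, shapiro_p = stats.shapiro(data)
print(f"Shapiro-Wilk W = {w_stat:.4f}, p = {shapiro_p:.4f}")

# Q-Q plot coordinates plus the least-squares line through them.
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(data, dist="norm")
print(f"Q-Q correlation r = {r:.4f}")
```

A large Shapiro-Wilk p-value does not prove normality; it only means the test found no strong evidence against it.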
Two-Sample t-test
Welch's t-test (Unequal Variances)
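$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
Here,
- $\bar{x}_1, \bar{x}_2$ = sample means of groups 1 and 2
- $s_1, s_2$ = sample standard deviations
- $n_1, n_2$ = sample sizes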
Welch-Satterthwaite Degrees of Freedom
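$$\nu = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{\left(s_1^2 / n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2 / n_2\right)^2}{n_2 - 1}}$$
Here,
- $s_1, s_2$ = sample standard deviations
- $n_1, n_2$ = sample sizes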
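Both Welch quantities can be computed by hand and checked against SciPy. This sketch uses simulated groups with deliberately unequal variances and sizes (an assumption made for the illustration), then derives the two-sided p-value from the t distribution with the Welch-Satterthwaite degrees of freedom.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
group1 = rng.normal(loc=175, scale=7, size=30)
group2 = rng.normal(loc=163, scale=12, size=20)  # different spread and size

# Per-group variance of the mean: s^2 / n
v1 = group1.var(ddof=1) / group1.size
v2 = group2.var(ddof=1) / group2.size

t_manual = (group1.mean() - group2.mean()) / np.sqrt(v1 + v2)
df = (v1 + v2) ** 2 / (v1 ** 2 / (group1.size - 1) + v2 ** 2 / (group2.size - 1))
p_manual = 2 * stats.t.sf(abs(t_manual), df)

t_scipy, p_scipy = stats.ttest_ind(group1, group2, equal_var=False)
print(f"manual: t = {t_manual:.4f}, df = {df:.2f}, p = {p_manual:.3g}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.3g}")
```

The degrees of freedom are generally not an integer, and they always fall between the smaller of n1 - 1 and n2 - 1 and the pooled value n1 + n2 - 2.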
np.random.seed(42)
men_heights = np.random.normal(loc=175, scale=7, size=30)
women_heights = np.random.normal(loc=163, scale=6, size=30)
t_stat, p_value = stats.ttest_ind(men_heights, women_heights, equal_var=False)
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")
# Pooled SD as the root mean of the two variances (valid here because both groups have n = 30)
pooled_std = np.sqrt((men_heights.std(ddof=1)**2 + women_heights.std(ddof=1)**2) / 2)
cohens_d = (men_heights.mean() - women_heights.mean()) / pooled_std
print(f"Cohen's d: {cohens_d:.4f}")
Paired Samples
np.random.seed(42)
before = np.random.normal(loc=65, scale=10, size=25)
after = before + np.random.normal(loc=5, scale=8, size=25)
t_stat, p_value = stats.ttest_rel(after, before)
t_stat_one, p_value_one = stats.ttest_rel(after, before, alternative='greater')
print("Paired t-test:")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value (two-tailed): {p_value:.4f}")
print(f"p-value (one-tailed): {p_value_one:.4f}")
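A paired t-test is a one-sample t-test on the within-pair differences, tested against zero. The sketch below checks that equivalence on simulated before/after scores; the data-generating values are assumptions for the demonstration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
before = rng.normal(loc=65, scale=10, size=25)
after = before + rng.normal(loc=5, scale=8, size=25)

# Paired test on the two columns...
t_paired, p_paired = stats.ttest_rel(after, before)

# ...matches a one-sample test of the differences against 0.
differences = after - before
t_diff, p_diff = stats.ttest_1samp(differences, 0.0)

print(f"paired:      t = {t_paired:.4f}, p = {p_paired:.4f}")
print(f"differences: t = {t_diff:.4f}, p = {p_diff:.4f}")
```

This is why the normality assumption for a paired test applies to the differences, not to the two raw measurements.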
Chi-Square Test
$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$
Here,
- $O_i$ = observed frequency (actual data)
- $E_i$ = expected frequency (if variables were independent)
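The expected frequencies come from the table margins: each cell's expected count is (row total × column total) / grand total. As a sketch, here is the statistic computed by hand for a 2×3 table of language preference by gender (rows Male/Female; columns Python, Java, R), checked against SciPy.

```python
import numpy as np
from scipy import stats

# Rows: Male, Female; columns: Python, Java, R
observed = np.array([[50, 30, 20],
                     [35, 45, 20]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# Expected counts if row and column variables were independent.
expected = row_totals * col_totals / n
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof)

chi2_scipy, p_scipy, _, _ = stats.chi2_contingency(observed)
print(f"manual: chi2 = {chi2_manual:.4f}, dof = {dof}, p = {p_manual:.4f}")
print(f"scipy:  chi2 = {chi2_scipy:.4f}, p = {p_scipy:.4f}")
```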
import pandas as pd
contingency = pd.DataFrame({
'Python': [50, 35],
'Java': [30, 45],
'R': [20, 20]
}, index=['Male', 'Female'])
chi2, p_value, dof, expected = stats.chi2_contingency(contingency)
print("Contingency Table:")
print(contingency)
print(f"\nChi-square statistic: {chi2:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")
Effect Size: Cramér's V
Cramér's V Effect Size
$$V = \sqrt{\frac{\chi^2}{n\,(k - 1)}}$$
Here,
- $\chi^2$ = chi-square statistic
- $n$ = total sample size
- $k$ = min(number of rows, number of columns)
|V| Interpretation:
0.1 -> Small association
0.3 -> Medium association
0.5 -> Large association
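SciPy's `chi2_contingency` does not return Cramér's V, so it is usually computed from the chi-square statistic in a line or two. A minimal sketch on a 2×3 table of language preference by gender:

```python
import numpy as np
from scipy import stats

# Rows: Male, Female; columns: Python, Java, R
observed = np.array([[50, 30, 20],
                     [35, 45, 20]])

chi2, _, _, _ = stats.chi2_contingency(observed)
n = observed.sum()
k = min(observed.shape)  # smaller of (number of rows, number of columns)

cramers_v = np.sqrt(chi2 / (n * (k - 1)))
print(f"Cramér's V: {cramers_v:.3f}")  # about 0.17, a small association
```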
ANOVA (Analysis of Variance)
Why Not Use Multiple t-tests?
Comparing 3 groups: A, B, C
Multiple t-tests approach:
A vs B -> test 1
A vs C -> test 2
B vs C -> test 3
Problem: With alpha=0.05 per test:
P(at least one false positive) = 1 - (1-0.05)^3 = 0.143
ANOVA approach:
Single test: Are ANY groups different?
Then post-hoc: WHICH groups are different?
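The inflation grows quickly with the number of tests. The helper below (the name `familywise_error_rate` is illustrative) applies the same formula for any number of independent tests m.

```python
def familywise_error_rate(m, alpha=0.05):
    """P(at least one false positive) across m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m


for m in (1, 3, 10):
    print(f"{m:>2} tests -> {familywise_error_rate(m):.3f}")
```

The formula assumes the tests are independent; pairwise comparisons among the same groups are not strictly independent, so the figure is an approximation.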
One-Way ANOVA
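$$F = \frac{MS_{\text{between}}}{MS_{\text{within}}}$$
Here,
- $MS_{\text{between}}$ = mean square between groups (signal)
- $MS_{\text{within}}$ = mean square within groups (noise)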
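The two mean squares are sums of squares divided by their degrees of freedom: k - 1 between groups and N - k within groups. The sketch below builds the F statistic by hand on three simulated groups (the group parameters are assumptions for the demonstration) and compares it with `stats.f_oneway`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
groups = [rng.normal(loc=75, scale=10, size=30),
          rng.normal(loc=80, scale=10, size=30),
          rng.normal(loc=72, scale=10, size=30)]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n_total = len(groups), all_values.size

# Signal: how far each group mean sits from the grand mean.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Noise: spread of observations around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual: F = {f_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  F = {f_scipy:.4f}, p = {p_scipy:.4f}")
```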
np.random.seed(42)
method_a = np.random.normal(loc=75, scale=10, size=30)
method_b = np.random.normal(loc=80, scale=10, size=30)
method_c = np.random.normal(loc=72, scale=10, size=30)
f_stat, p_value = stats.f_oneway(method_a, method_b, method_c)
print(f"F-statistic: {f_stat:.4f}")
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Result: At least one method differs significantly")
Post-Hoc: Tukey's HSD
from statsmodels.stats.multicomp import pairwise_tukeyhsd
all_scores = np.concatenate([method_a, method_b, method_c])
groups = ['A']*30 + ['B']*30 + ['C']*30
tukey = pairwise_tukeyhsd(all_scores, groups, alpha=0.05)
print(tukey)
Non-Parametric Tests
| Parametric | Non-Parametric | When to Use |
|---|---|---|
| One-sample t | Wilcoxon signed-rank | Small sample, non-normal |
| Independent t | Mann-Whitney U | Non-normal or ordinal data, independent groups |
| Paired t | Wilcoxon signed-rank (paired) | Paired, non-normal differences |
| One-way ANOVA | Kruskal-Wallis | Non-normal, 3+ groups |
| Pearson r | Spearman rho | Non-linear monotonic relationship |
# Mann-Whitney U (non-parametric independent t-test)
stat, p_value = stats.mannwhitneyu(method_a, method_b, alternative='two-sided')
print(f"Mann-Whitney U: {stat:.4f}, p-value: {p_value:.4f}")
# Kruskal-Wallis (non-parametric ANOVA)
stat, p_value = stats.kruskal(method_a, method_b, method_c)
print(f"Kruskal-Wallis H: {stat:.4f}, p-value: {p_value:.4f}")
Multiple Comparisons Problem
Solutions
from statsmodels.stats.multitest import multipletests
p_values = [0.01, 0.04, 0.03, 0.85, 0.12, 0.02, 0.06]
# Bonferroni correction
reject, pvals_corrected, _, _ = multipletests(p_values, method='bonferroni')
print("Bonferroni corrected:", pvals_corrected)
# FDR (Benjamini-Hochberg)
reject_fdr, pvals_fdr, _, _ = multipletests(p_values, method='fdr_bh')
print("FDR corrected:", pvals_fdr)
When to use which correction: Use Bonferroni when you have few comparisons (m < 10) and the cost of a false positive is high. Use FDR when you have many comparisons (m > 10) and you want to maximize discoveries (e.g., genomics, exploratory data analysis).
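To see what the two corrections actually do to each p-value, they can be computed directly with NumPy. This is a sketch of the standard adjusted-p-value formulas, not a replacement for a library routine: Bonferroni multiplies every p-value by m, while Benjamini-Hochberg multiplies the i-th smallest p-value by m / i and then enforces monotonicity.

```python
import numpy as np

p_values = np.array([0.01, 0.04, 0.03, 0.85, 0.12, 0.02, 0.06])
m = p_values.size

# Bonferroni: scale every p-value by the number of tests, capped at 1.
bonferroni = np.minimum(p_values * m, 1.0)

# Benjamini-Hochberg: scale the i-th smallest p-value by m / i ...
order = np.argsort(p_values)
scaled = p_values[order] * m / np.arange(1, m + 1)
# ... then take a running minimum from the largest p-value downward.
adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
bh = np.empty(m)
bh[order] = np.minimum(adjusted_sorted, 1.0)

print("Bonferroni:", np.round(bonferroni, 3))
print("BH (FDR):  ", np.round(bh, 3))
```

With these seven p-values, the smallest adjusted value is 0.07 under both methods, so neither correction leaves a result below 0.05, even though four of the raw p-values were.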
Power Analysis
Statistical Power
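$$\text{Power} = 1 - \beta = P(\text{reject } H_0 \mid H_1 \text{ is true})$$
Here,
- $\beta$ = probability of Type II error
- $\alpha$ = significance level (typically 0.05)
- $n$ = sample size per group
For a fixed effect size, power increases as $n$ or $\alpha$ increases.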
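Power can also be estimated by simulation: generate many experiments in which the effect really exists and count how often the test rejects H0. The sketch below assumes a true effect of d = 0.5, 64 subjects per group, and 2,000 simulated experiments; the estimate should come out near 0.80 but will vary with the random seed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
alpha, d, n, n_sims = 0.05, 0.5, 64, 2000

# Two groups with unit SD whose means differ by d, so Cohen's d equals d.
group_a = rng.normal(loc=0.0, scale=1.0, size=(n_sims, n))
group_b = rng.normal(loc=d, scale=1.0, size=(n_sims, n))

# One Welch t-test per simulated experiment (row).
_, p_values = stats.ttest_ind(group_a, group_b, axis=1, equal_var=False)

# Power = share of experiments that correctly reject H0.
power_estimate = np.mean(p_values < alpha)
print(f"Estimated power: {power_estimate:.3f}")
```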
from statsmodels.stats.power import TTestIndPower
power_analysis = TTestIndPower()
# How many samples needed to detect d=0.5 with 80% power?
n = power_analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80, ratio=1.0)
print(f"Required n per group: {int(np.ceil(n))}")
# What power do we have with n=50 per group?
power = power_analysis.power(effect_size=0.5, alpha=0.05, nobs1=50, ratio=1.0)
print(f"Power with n=50: {power:.2f}")
Quick Reference: Which Test to Use
What are you comparing?
|
+-- One group vs known value?
| +-- Continuous, normal? -> One-sample t-test
| +-- Continuous, non-normal? -> Wilcoxon signed-rank
| +-- Categorical? -> Chi-square goodness of fit
|
+-- Two groups?
| +-- Independent?
| | +-- Continuous, normal? -> Independent t-test (Welch's)
| | +-- Continuous, non-normal? -> Mann-Whitney U
| | +-- Categorical? -> Chi-square test of independence
| +-- Paired?
| +-- Continuous, normal? -> Paired t-test
| +-- Continuous, non-normal? -> Wilcoxon signed-rank (paired)
|
+-- Three+ groups?
+-- One factor?
| +-- Continuous, normal? -> One-way ANOVA
| +-- Continuous, non-normal? -> Kruskal-Wallis
+-- Two+ factors?
+-- Continuous, normal? -> Two-way ANOVA
+-- Non-parametric? -> Friedman test (repeated-measures or blocked designs)
Key Takeaways
Summary: Statistical Testing Deep Dive
- Always state H0 and H1 before testing. Hypothesis testing is a structured framework, not a fishing expedition.
- p-value is NOT the probability H0 is true. It is the probability of seeing data this extreme if H0 were true.
- Statistical significance ≠ practical significance. Always report effect sizes (Cohen's d, Cramér's V) alongside p-values.
- Use Welch's t-test by default (safer than Student's). It does not assume equal variances.
- Correct for multiple comparisons using Bonferroni (conservative, controls the family-wise error rate) or FDR (less conservative, controls the expected share of false discoveries).
- Always check assumptions (normality, independence, homoscedasticity). Use non-parametric tests when assumptions fail.
- Use power analysis to determine sample size before collecting data.
Practice Exercises
- Drug Trial: Blood pressure in 40 patients after a new drug. Historical mean 120 mmHg. Sample mean 115, std=12. Is the drug effective?
- A/B Test: Website A conversion 12.3% (n=5000), Website B 13.1% (n=5000). Is B significantly better?
- Survey Analysis: Association between education level and preferred news source (n=200). Chi-square test?
- Experiment Design: Detect medium effect (d=0.5) with 90% power. How many subjects per group?