Why This Matters
Statistical testing is how we make decisions under uncertainty. Instead of guessing whether a drug works, a feature helps, or a difference is real, we quantify the evidence and let data guide our conclusions.
DfStatistical Inference
The process of drawing conclusions about a population based on a sample. Since we rarely have access to entire populations, we use sample statistics (means, proportions, variances) to estimate population parameters. Statistical inference quantifies the uncertainty inherent in this process through confidence intervals and hypothesis tests.
The Hypothesis Testing Framework
The Scientific Approach to Data
DfNull Hypothesis (H₀)
A statement of "no effect" or "no difference" that serves as the default assumption in hypothesis testing. The null hypothesis is assumed true until the data provide sufficient evidence to reject it. Formally, H₀ specifies a particular value for the population parameter (e.g., μ = 0 or μ₁ − μ₂ = 0).
Step 1: State hypotheses
H0 (null): "There is no effect / no difference"
H1 (alternative): "There IS an effect / difference"
Step 2: Collect data
Step 3: Calculate test statistic
(How far is our data from what H0 predicts?)
Step 4: Calculate p-value
(How likely is data at least this extreme if H0 is true?)
Step 5: Make decision
If p-value < alpha -> Reject H0
If p-value >= alpha -> Fail to reject H0
Visual: Rejection Regions
Distribution under H0 (null hypothesis is true), two-tailed test with alpha = 0.05:
Rejection region (left tail): test statistic < -1.96
Non-rejection region: -1.96 <= test statistic <= 1.96
Rejection region (right tail): test statistic > 1.96
The values -1.96 and 1.96 are the critical values.
If the test statistic falls in a rejection region -> Reject H0
If the test statistic falls in the non-rejection region -> Fail to reject H0
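As a small illustration of where the ±1.96 critical values come from, the sketch below derives them from the standard normal distribution and applies the decision rule. It assumes a two-tailed z-test; the helper name `decide` is made up for this example.

```python
from scipy import stats

alpha = 0.05

# Two-tailed test: alpha is split equally between the two tails.
z_crit = stats.norm.ppf(1 - alpha / 2)

def decide(z_stat, critical=z_crit):
    """Decision rule for a two-tailed z-test (hypothetical helper)."""
    return "Reject H0" if abs(z_stat) > critical else "Fail to reject H0"

print(f"Critical values: -{z_crit:.2f} and +{z_crit:.2f}")
print("z = 2.5 ->", decide(2.5))
print("z = 1.2 ->", decide(1.2))
```

Changing `alpha` to 0.01 moves the critical values further out, which shrinks the rejection regions.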
Type I and Type II Errors
DfType I Error (False Positive)
Rejecting the null hypothesis when it is actually true. The probability of making a Type I error is denoted by α (alpha) and is set by the researcher before the test (typically 0.05). If α = 0.05, we accept a 5% chance of concluding an effect exists when it does not.
DfType II Error (False Negative)
Failing to reject the null hypothesis when it is actually false. The probability of making a Type II error is denoted by β. Statistical power is defined as 1 − β, the probability of correctly detecting a real effect. For a fixed effect size and α, lowering β (raising power) requires a larger sample.
Decision vs. Actual Reality
| Decision | H0 True (No Effect) | H0 False (Effect Exists) |
|---|---|---|
| Reject H0 (say effect exists) | TYPE I ERROR (False Positive), probability = alpha = 0.05 | CORRECT (True Positive), probability = power = 1 - beta |
| Fail to reject H0 (say no effect) | CORRECT (True Negative) | TYPE II ERROR (False Negative), probability = beta |
Tradeoff:
Increasing alpha (0.05 -> 0.10):
+ More power to detect effects
- More false positives
Decreasing alpha (0.05 -> 0.01):
+ Fewer false positives
- More missed effects (lower power)
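One way to see what alpha controls is to simulate many experiments in which H0 is true and count how often the test rejects anyway. The sketch below is only an illustration: the population values (mean 170, standard deviation 8) and the number of simulated experiments are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n, mu_0 = 5000, 30, 170

# Each row is one simulated experiment drawn from a population where H0 is true.
samples = rng.normal(loc=mu_0, scale=8, size=(n_sims, n))
_, p_values = stats.ttest_1samp(samples, mu_0, axis=1)

# Every rejection here is a false positive, so the rejection rate estimates
# the Type I error rate at each alpha.
for alpha in (0.01, 0.05, 0.10):
    rate = np.mean(p_values < alpha)
    print(f"alpha={alpha:.2f}: false positive rate = {rate:.3f}")

type1_rate = np.mean(p_values < 0.05)
```

The observed false positive rate should sit close to whichever alpha is chosen, which is the tradeoff listed above: a larger alpha buys power at the price of more false alarms.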
The p-value: What It Actually Means
Dfp-value
The probability of observing a test statistic at least as extreme as the one computed from the sample, assuming the null hypothesis is true. It is NOT the probability that H₀ is true, and it is NOT the probability that the effect is real. A small p-value indicates that the observed data would be very unlikely under H₀, providing evidence against it.
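Formal Definition of p-value
$$p = P\left(|T| \ge |t_{\text{obs}}| \mid H_0\right)$$
Here,
- $T$ = the test statistic, treated as a random variable
- $t_{\text{obs}}$ = the value of the test statistic computed from the sample
- $H_0$ = the null hypothesis
This is the two-sided form; a one-sided test uses $P(T \ge t_{\text{obs}} \mid H_0)$ or $P(T \le t_{\text{obs}} \mid H_0)$.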
Common Misconceptions:
WRONG: "p = 0.03 means there's a 3% chance H0 is true"
RIGHT: "If H0 were true, there's a 3% chance of seeing data this extreme"
WRONG: "p < 0.05 means the effect is large"
RIGHT: "p < 0.05 means the effect is unlikely under H0"
(A tiny effect can be significant with large n)
WRONG: "p > 0.05 means no effect exists"
RIGHT: "p > 0.05 means we don't have enough evidence to reject H0"
Statistical vs practical significance: A p-value measures how surprised you should be if H₀ were true. It does NOT measure the size or importance of an effect. With n = 1,000,000, a 0.01 cm height difference between groups can yield p < 0.001. Always report effect sizes (Cohen's d, Cramér's V) alongside p-values to convey practical importance.
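The large-sample scenario can be checked directly from summary statistics. The sketch below assumes a within-group standard deviation of 2 cm and group means of 170.00 and 170.01 cm; those numbers are illustrative assumptions, not values from a real study.

```python
from scipy import stats

n = 1_000_000                      # observations per group
sd = 2.0                           # assumed within-group standard deviation (cm)
mean_a, mean_b = 170.00, 170.01    # assumed means: a 0.01 cm difference

# Welch's t-test computed from summary statistics instead of raw data.
t_stat, p_value = stats.ttest_ind_from_stats(
    mean_b, sd, n, mean_a, sd, n, equal_var=False
)
cohens_d = (mean_b - mean_a) / sd

print(f"t = {t_stat:.3f}, p = {p_value:.5f}")
print(f"Cohen's d = {cohens_d:.4f}")
```

Under these assumptions the p-value falls below 0.001 while Cohen's d is about 0.005, far below even the "small" threshold of 0.2.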
One-Sample t-test
When to Use
Compare a sample mean to a known population value.
Example: Is the average height of students different from 170cm?
H0: mu = 170cm (population mean)
H1: mu != 170cm (two-tailed)
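Formula
$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$
where $\bar{x}$ is the sample mean, $\mu_0$ the hypothesized mean, $s$ the sample standard deviation, and $n$ the sample size.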
Intuition:
t = (Observed difference) / (Expected variability)
Large t -> The difference is large relative to variability
Small t -> The difference could easily be due to chance
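To make the "difference over variability" reading concrete, the sketch below computes the standard one-sample statistic t = (sample mean - mu_0) / (s / sqrt(n)) by hand and compares it with SciPy. The eight height values are made up for illustration.

```python
import numpy as np
from scipy import stats

# Made-up sample of student heights (cm).
heights = np.array([168.0, 172.5, 175.1, 169.4, 171.8, 174.2, 170.6, 173.3])
mu_0 = 170

n = len(heights)
standard_error = heights.std(ddof=1) / np.sqrt(n)      # expected variability of the mean
t_manual = (heights.mean() - mu_0) / standard_error    # observed difference / variability
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)     # two-tailed p-value

t_scipy, p_scipy = stats.ttest_1samp(heights, mu_0)

print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```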
Effect Size: Cohen's d
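Cohen's d Effect Size
$$d = \frac{\bar{x} - \mu_0}{s}$$
Here,
- $\bar{x}$ = sample mean
- $\mu_0$ = hypothesized population mean
- $s$ = sample standard deviation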
|d| Interpretation:
0.2 -> Small effect
0.5 -> Medium effect
0.8 -> Large effect
Why effect size matters:
With a large enough n (say n=10,000), even a fraction-of-a-centimeter difference can be "significant"
Effect size tells you if the difference PRACTICALLY matters
Complete Example
One-Sample t-test: Testing Student Heights
import numpy as np
from scipy import stats
# Sample data: heights of 30 students
np.random.seed(42)
heights = np.random.normal(loc=172, scale=8, size=30)
print(f"Sample mean: {heights.mean():.2f} cm")
print(f"Sample std: {heights.std(ddof=1):.2f} cm")
print(f"Sample size: {len(heights)}")
# One-sample t-test: Is mean different from 170cm?
mu_0 = 170
t_stat, p_value = stats.ttest_1samp(heights, mu_0)
print(f"\nH0: mu = {mu_0} cm")
print(f"H1: mu != {mu_0} cm")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")
# Effect size
cohens_d = (heights.mean() - mu_0) / heights.std(ddof=1)
print(f"Cohen's d: {cohens_d:.4f}")
# Decision
alpha = 0.05
if p_value < alpha:
    print(f"\nResult: Reject H0 (p < {alpha})")
    print("Conclusion: Student heights differ from 170cm")
else:
    print(f"\nResult: Fail to reject H0 (p >= {alpha})")
    print("Conclusion: No significant difference from 170cm")
Assumptions
1. Independence: Observations are independent
2. Normality: Data is approximately normally distributed
- Check: Shapiro-Wilk test, Q-Q plot
- Robust: t-test is robust to mild non-normality for n > 30
3. Continuous: The dependent variable is continuous
The Central Limit Theorem saves you: Even if the underlying data is not normal, the sampling distribution of the mean approaches normality as n increases (CLT). For n > 30, the t-test is robust to moderate departures from normality. For smaller samples, check normality with a Shapiro-Wilk test or Q-Q plot before proceeding.
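A quick sketch of the normality checks mentioned above, run on two synthetic samples: one drawn from a normal distribution and one from a strongly skewed (exponential) distribution. The sample sizes and distribution parameters are arbitrary, and the correlation from `stats.probplot` is used here as a numeric stand-in for eyeballing a Q-Q plot.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
samples = {
    "normal": rng.normal(loc=170, scale=8, size=200),
    "skewed": rng.exponential(scale=8, size=200),
}

shapiro_p = {}
for name, sample in samples.items():
    w_stat, p_value = stats.shapiro(sample)
    # probplot returns the Q-Q points plus a straight-line fit;
    # r close to 1 means the points lie near the line.
    _, (slope, intercept, r) = stats.probplot(sample, dist="norm")
    shapiro_p[name] = p_value
    print(f"{name}: Shapiro-Wilk W = {w_stat:.3f}, p = {p_value:.4g}, Q-Q r = {r:.3f}")
```

A small Shapiro-Wilk p-value is evidence against normality. With large samples the test flags even mild departures, so it is usually read together with the Q-Q plot rather than on its own.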
Two-Sample t-test
Independent Samples
Compare means of two independent groups.
Example: Do men and women have different average heights?
H0: mu_men = mu_women
H1: mu_men != mu_women
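Formula (Welch's t-test, unequal variances)
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
Degrees of freedom (Welch-Satterthwaite):
Welch-Satterthwaite Degrees of Freedom
$$\nu = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$$
Here,
- $s_1^2, s_2^2$ = sample variances of the two groups
- $n_1, n_2$ = sample sizes of the two groups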
Why Welch's over Student's t-test:
- Student's t-test assumes equal variances (homoscedasticity)
- Welch's works regardless of variance equality
- Use Welch's by default (safer)
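The sketch below works through Welch's test by hand, using the usual statistic (difference in means divided by the square root of s1²/n1 + s2²/n2) and the Welch-Satterthwaite degrees of freedom, then checks the result against SciPy. The two small groups are made-up numbers with deliberately different spreads and sizes.

```python
import numpy as np
from scipy import stats

# Made-up data: unequal sizes and unequal spreads.
group_1 = np.array([175.2, 181.0, 169.8, 177.5, 183.1, 172.4, 178.9, 174.0])
group_2 = np.array([162.1, 164.5, 163.0, 165.2, 161.8, 163.9])

n1, n2 = len(group_1), len(group_2)
v1 = group_1.var(ddof=1) / n1   # squared standard error of mean 1
v2 = group_2.var(ddof=1) / n2   # squared standard error of mean 2

t_manual = (group_1.mean() - group_2.mean()) / np.sqrt(v1 + v2)
df_welch = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
p_manual = 2 * stats.t.sf(abs(t_manual), df_welch)

t_scipy, p_scipy = stats.ttest_ind(group_1, group_2, equal_var=False)

print(f"manual: t = {t_manual:.4f}, df = {df_welch:.2f}, p = {p_manual:.6f}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.6f}")
```

Note that the Welch degrees of freedom are generally not an integer and are never larger than n1 + n2 - 2.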
Complete Example
# Two independent groups
np.random.seed(42)
men_heights = np.random.normal(loc=175, scale=7, size=30)
women_heights = np.random.normal(loc=163, scale=6, size=30)
print(f"Men: mean={men_heights.mean():.2f}, std={men_heights.std(ddof=1):.2f}")
print(f"Women: mean={women_heights.mean():.2f}, std={women_heights.std(ddof=1):.2f}")
# Welch's t-test: equal_var=False (SciPy's default, equal_var=True, is Student's t-test)
t_stat, p_value = stats.ttest_ind(men_heights, women_heights, equal_var=False)
print(f"\nt-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")
# Effect size
pooled_std = np.sqrt((men_heights.std(ddof=1)**2 + women_heights.std(ddof=1)**2) / 2)
cohens_d = (men_heights.mean() - women_heights.mean()) / pooled_std
print(f"Cohen's d: {cohens_d:.4f}")
if p_value < 0.05:
    print("Result: Significant difference in heights")
Paired Samples
Compare means when the same subjects are measured twice.
Example: Does a training program improve test scores?
H0: mean_diff = 0 (no improvement)
H1: mean_diff > 0 (improvement)
# Paired samples: same students before/after training
np.random.seed(42)
before = np.random.normal(loc=65, scale=10, size=25)
after = before + np.random.normal(loc=5, scale=8, size=25) # Training effect
print(f"Before: mean={before.mean():.2f}")
print(f"After: mean={after.mean():.2f}")
# Paired t-test
t_stat, p_value = stats.ttest_rel(after, before)
# Alternative: one-tailed for improvement
t_stat_one, p_value_one = stats.ttest_rel(after, before, alternative='greater')
print(f"\nPaired t-test:")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value (two-tailed): {p_value:.4f}")
print(f"p-value (one-tailed, improvement): {p_value_one:.4f}")
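A useful way to think about the paired test: it is a one-sample t-test on the per-subject differences against zero. The sketch below checks that equivalence on a small made-up set of before/after scores.

```python
import numpy as np
from scipy import stats

# Made-up scores for the same eight students before and after training.
before = np.array([62.0, 70.5, 58.3, 66.1, 73.4, 60.2, 68.8, 64.7])
after = np.array([68.5, 72.0, 63.9, 65.0, 80.1, 66.3, 71.2, 70.4])

differences = after - before

t_paired, p_paired = stats.ttest_rel(after, before)
t_diff, p_diff = stats.ttest_1samp(differences, 0)

print(f"paired t-test:          t = {t_paired:.4f}, p = {p_paired:.4f}")
print(f"one-sample on the diff: t = {t_diff:.4f}, p = {p_diff:.4f}")
```

This is why the normality assumption for a paired test applies to the differences, not to the before and after scores separately.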
Chi-Square Test
Test of Independence
Determine if two categorical variables are related.
Example: Is there an association between gender and preferred programming language?
H0: Gender and language preference are independent
H1: Gender and language preference are related
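Formula
$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i}$$
where $O_i$ and $E_i$ are the observed and expected counts in cell $i$.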
Expected frequency formula:
E = (Row Total * Column Total) / Grand Total
Intuition:
Large chi2 -> Observed differs greatly from Expected
Small chi2 -> Observed is close to Expected
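The expected-frequency rule and the statistic chi2 = sum((O - E)² / E) can be reproduced in a few lines. The 2x3 table of counts below is hypothetical, chosen only so that every expected count is comfortably above 5.

```python
import numpy as np
from scipy import stats

# Hypothetical observed counts: 2 groups (rows) x 3 categories (columns).
observed = np.array([[30, 20, 10],
                     [20, 25, 15]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

# E = (Row Total * Column Total) / Grand Total, for every cell at once.
expected = row_totals * col_totals / grand_total

chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof)

chi2_scipy, p_scipy, dof_scipy, expected_scipy = stats.chi2_contingency(
    observed, correction=False
)

print(f"manual: chi2 = {chi2_manual:.4f}, dof = {dof}, p = {p_manual:.4f}")
print(f"scipy:  chi2 = {chi2_scipy:.4f}, dof = {dof_scipy}, p = {p_scipy:.4f}")
```

For 2x2 tables `chi2_contingency` applies Yates' continuity correction by default, so a hand calculation only matches when `correction=False` is passed.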
Complete Example
import pandas as pd
# Contingency table
data = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Male', 'Male',
               'Female', 'Female', 'Female', 'Female'],
    'Language': ['Python', 'Java', 'Python', 'Python',
                 'Java', 'Python', 'Python', 'Java']
})
# Create contingency table
contingency = pd.crosstab(data['Gender'], data['Language'])
print("Contingency Table:")
print(contingency)
# Expected frequencies
print("\nExpected Frequencies (if independent):")
chi2, p_value, dof, expected = stats.chi2_contingency(contingency)
print(pd.DataFrame(expected, index=contingency.index, columns=contingency.columns))
# Test results
print(f"\nChi-square statistic: {chi2:.4f}")
print(f"p-value: {p_value:.4f}")
print(f"Degrees of freedom: {dof}")
if p_value < 0.05:
    print("Result: Gender and language preference are related")
else:
    print("Result: No significant association found")
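Effect Size: Cramér's V
Cramér's V Effect Size
$$V = \sqrt{\frac{\chi^2}{n\,(k - 1)}}$$
Here,
- $\chi^2$ = chi-square statistic
- $n$ = total number of observations
- $k$ = the smaller of the number of rows and the number of columns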
|V| Interpretation:
0.1 -> Small association
0.3 -> Medium association
0.5 -> Large association
# Cramér's V
n = len(data)
k = min(contingency.shape)
cramers_v = np.sqrt(chi2 / (n * (k - 1)))
print(f"Cramér's V: {cramers_v:.4f}")
ANOVA (Analysis of Variance)
Why Not Use Multiple t-tests?
DfMultiple Comparisons Problem
The phenomenon where conducting multiple hypothesis tests simultaneously inflates the family-wise error rate (FWER). With k independent tests at α = 0.05 each, the probability of at least one false positive is $1 - (1 - \alpha)^k$. For 3 tests: $1 - 0.95^3 \approx 0.143$. For 10 tests: $1 - 0.95^{10} \approx 0.401$. ANOVA solves this by testing all groups simultaneously with a single test.
Comparing 3 groups: A, B, C
Multiple t-tests approach:
A vs B -> test 1
A vs C -> test 2
B vs C -> test 3
Problem: With alpha=0.05 per test:
P(at least one false positive) = 1 - (1-0.05)^3 = 0.143
With 10 groups (45 tests): P(at least one false) = 0.901!
ANOVA approach:
Single test: Are ANY groups different?
Then post-hoc: WHICH groups are different?
One-Way ANOVA
Intuition:
F = Signal / Noise
Signal: How much do group means differ?
Noise: How much do individuals vary within groups?
Large F -> Groups differ more than individuals vary
Small F -> Group differences could be chance
F = 1 -> Between-group variation = Within-group variation
F > 1 -> Between-group variation > Within-group variation
ANOVA assumptions matter more than you think: ANOVA assumes (1) independence, (2) normality within each group, and (3) homogeneity of variances (homoscedasticity). Violation of assumption 3 can inflate Type I error rates. Levene's test checks this assumption. If it is violated, use Welch's ANOVA (for example statsmodels' anova_oneway with use_var='unequal'; stats.f_oneway assumes equal variances by default) or the non-parametric Kruskal-Wallis test.
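The assumption check described above might look like the following sketch. The three synthetic groups are given deliberately different spreads (standard deviations 5, 10, and 25) so that Levene's test has something to find, and the fallback to Kruskal-Wallis simply follows the suggestion in the note.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group_a = rng.normal(loc=75, scale=5, size=40)
group_b = rng.normal(loc=80, scale=10, size=40)
group_c = rng.normal(loc=72, scale=25, size=40)

# Levene's test: H0 is that all groups share the same variance.
levene_stat, levene_p = stats.levene(group_a, group_b, group_c)
print(f"Levene: W = {levene_stat:.3f}, p = {levene_p:.4g}")

if levene_p < 0.05:
    # Equal-variance assumption looks violated: avoid the classic F-test.
    stat, p_value = stats.kruskal(group_a, group_b, group_c)
    test_used = "Kruskal-Wallis"
else:
    stat, p_value = stats.f_oneway(group_a, group_b, group_c)
    test_used = "one-way ANOVA"

print(f"{test_used}: statistic = {stat:.3f}, p = {p_value:.4g}")
```

Choosing the test based on a preliminary test is a pragmatic shortcut rather than a rigorous procedure; defaulting to a variance-robust method is a common alternative.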
# Three groups: three different teaching methods
np.random.seed(42)
method_a = np.random.normal(loc=75, scale=10, size=30)
method_b = np.random.normal(loc=80, scale=10, size=30)
method_c = np.random.normal(loc=72, scale=10, size=30)
print(f"Method A: mean={method_a.mean():.2f}")
print(f"Method B: mean={method_b.mean():.2f}")
print(f"Method C: mean={method_c.mean():.2f}")
# One-way ANOVA
f_stat, p_value = stats.f_oneway(method_a, method_b, method_c)
print(f"\nF-statistic: {f_stat:.4f}")
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Result: At least one method differs significantly")
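To connect the output of `f_oneway` to the signal/noise intuition, the sketch below rebuilds F from the between-group and within-group mean squares. It generates its own three synthetic groups (same means and spread as the teaching-method example, but a different random generator, so the numbers will not match the output above).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
groups = [rng.normal(loc=m, scale=10, size=30) for m in (75, 80, 72)]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n_total = len(groups), len(all_values)

# Signal: how far each group mean sits from the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Noise: how far individuals sit from their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)

f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)

print(f"manual: F = {f_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  F = {f_scipy:.4f}, p = {p_scipy:.4f}")
```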
Post-Hoc: Tukey's HSD
DfTukey's Honestly Significant Difference
A post-hoc test used after a significant ANOVA to determine which specific group pairs differ. It controls the family-wise error rate across all pairwise comparisons by adjusting the critical value based on the studentized range distribution. The test computes adjusted p-values for each pair that account for the multiple comparisons.
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# Combine data
all_scores = np.concatenate([method_a, method_b, method_c])
groups = ['A']*30 + ['B']*30 + ['C']*30
# Tukey's HSD
tukey = pairwise_tukeyhsd(all_scores, groups, alpha=0.05)
print(tukey)
Non-Parametric Tests
When the normality assumption fails, use distribution-free alternatives.
| Parametric | Non-Parametric | When to Use |
|---|---|---|
| One-sample t | Wilcoxon signed-rank | Small sample, non-normal |
| Independent t | Mann-Whitney U | Non-normal or ordinal data |
| Paired t | Wilcoxon signed-rank (paired) | Paired, non-normal differences |
| One-way ANOVA | Kruskal-Wallis | Non-normal, 3+ groups |
| Pearson r | Spearman rho | Non-linear monotonic relationship |
DfNon-Parametric Test
A statistical test that does not assume a specific distribution (e.g., normal) for the data. Instead of testing parameters (means, variances), non-parametric tests typically operate on ranks or signs of the data. They are more robust to outliers and non-normality but have less statistical power when the parametric assumptions are met.
# Mann-Whitney U (non-parametric independent t-test)
stat, p_value = stats.mannwhitneyu(method_a, method_b, alternative='two-sided')
print(f"Mann-Whitney U: {stat:.4f}, p-value: {p_value:.4f}")
# Kruskal-Wallis (non-parametric ANOVA)
stat, p_value = stats.kruskal(method_a, method_b, method_c)
print(f"Kruskal-Wallis H: {stat:.4f}, p-value: {p_value:.4f}")
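Because rank-based tests only look at the ordering of the values, an extreme outlier has limited influence on them. The sketch below uses two tiny made-up samples and replaces the largest value with one ten times bigger: the ranks are unchanged, so the Mann-Whitney result is identical, while Welch's t-test changes.

```python
import numpy as np
from scipy import stats

# Made-up samples; 19.5 is already the largest value overall.
x = np.array([12.1, 14.3, 15.2, 16.8, 19.5])
y = np.array([10.4, 11.9, 13.0, 13.7, 15.9])

# Same data, but the largest value becomes an extreme outlier.
x_outlier = x.copy()
x_outlier[-1] = 195.0

mw_clean = stats.mannwhitneyu(x, y, alternative='two-sided')
mw_outlier = stats.mannwhitneyu(x_outlier, y, alternative='two-sided')

t_clean = stats.ttest_ind(x, y, equal_var=False)
t_outlier = stats.ttest_ind(x_outlier, y, equal_var=False)

print(f"Mann-Whitney p: clean = {mw_clean.pvalue:.4f}, with outlier = {mw_outlier.pvalue:.4f}")
print(f"Welch t-test p: clean = {t_clean.pvalue:.4f}, with outlier = {t_outlier.pvalue:.4f}")
```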
Multiple Comparisons Problem
The Problem
With k groups and alpha = 0.05:
Number of pairwise comparisons = k * (k-1) / 2
k=3: 3 comparisons -> P(false positive) = 14.3%
k=5: 10 comparisons -> P(false positive) = 40.1%
k=10: 45 comparisons -> P(false positive) = 90.1%
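The percentages above follow from 1 - (1 - alpha)^m, where m is the number of pairwise comparisons. A minimal sketch, assuming the tests are independent (which pairwise comparisons on shared groups are not exactly, so treat the numbers as approximations); the function name is made up for this example.

```python
from math import comb

def familywise_error_rate(n_tests, alpha=0.05):
    """P(at least one false positive) across n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests

for k in (3, 5, 10):
    m = comb(k, 2)  # number of pairwise comparisons among k groups
    print(f"k={k}: {m} comparisons -> FWER = {familywise_error_rate(m):.1%}")
```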
Solutions
DfBonferroni Correction
A conservative method for controlling the family-wise error rate when performing multiple tests. Each p-value is multiplied by the number of tests: $p_{\text{adj}} = \min(m \cdot p, 1)$, where $m$ is the number of comparisons. This guarantees that the family-wise error rate ≤ α, but can be overly conservative (low power) when m is large.
DfFalse Discovery Rate (FDR)
The expected proportion of false positives among all rejected hypotheses. The Benjamini-Hochberg procedure controls FDR by ordering the p-values from smallest to largest and comparing each to $\frac{i}{m}\alpha$, where $i$ is the rank and $m$ is the total number of tests. FDR control is less conservative than Bonferroni and provides more power to detect real effects.
from statsmodels.stats.multitest import multipletests
# Raw p-values from multiple tests
p_values = [0.01, 0.04, 0.03, 0.85, 0.12, 0.02, 0.06]
# Bonferroni correction (most conservative)
reject, pvals_corrected, _, _ = multipletests(p_values, method='bonferroni')
print("Bonferroni corrected p-values:", pvals_corrected)
# False Discovery Rate (FDR) - less conservative
reject_fdr, pvals_fdr, _, _ = multipletests(p_values, method='fdr_bh')
print("FDR corrected p-values:", pvals_fdr)
Bonferroni: p_corrected = p * number_of_tests
Pro: Controls family-wise error rate
Con: Very conservative (misses real effects)
FDR (Benjamini-Hochberg): Controls false discovery rate
Pro: More power to detect real effects
Con: Allows more false positives
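For readers who want to see what the two corrections do to each p-value, here is a minimal NumPy-only sketch of both adjustments, applied to the same seven raw p-values used above. The function names are made up for this example, and the Benjamini-Hochberg version follows the usual "adjusted p-value" formulation (scale by m / rank, then enforce monotonicity).

```python
import numpy as np

def bonferroni(p_values):
    """Bonferroni-adjusted p-values: p * m, capped at 1."""
    p = np.asarray(p_values, dtype=float)
    return np.minimum(p * len(p), 1.0)

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                          # indices from smallest to largest p
    scaled = p[order] * m / np.arange(1, m + 1)    # p_(i) * m / i
    # Each adjusted value is the minimum of itself and all larger-ranked ones.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted

p_values = [0.01, 0.04, 0.03, 0.85, 0.12, 0.02, 0.06]
p_bonf = bonferroni(p_values)
p_bh = benjamini_hochberg(p_values)

print("Bonferroni:", np.round(p_bonf, 3))
print("BH (FDR):  ", np.round(p_bh, 3))
```

With these seven values no adjusted p-value drops below 0.05 under either method, even though four raw p-values were below 0.05. The FDR-adjusted values are never larger than the Bonferroni ones.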
When to use which correction: Use Bonferroni when you have few comparisons (m < 10) and the cost of a false positive is high (e.g., clinical trials). Use FDR when you have many comparisons (m > 10) and you want to maximize discoveries (e.g., genomics, exploratory data analysis). FDR is the standard in high-dimensional testing because Bonferroni becomes impossibly conservative with thousands of tests.
Power Analysis
Why It Matters
Before collecting data, ask:
"How many samples do I need to detect an effect of size X?"
Too few samples -> Can't detect real effects (waste of time)
Too many samples -> Waste of resources (ethical issue)
The Three Variables
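Statistical Power
$$\text{Power} = 1 - \beta = P(\text{reject } H_0 \mid H_1 \text{ is true})$$
Here,
- $\beta$ = probability of a Type II error
- $H_0$ = null hypothesis
- $H_1$ = alternative hypothesis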
Given any 3, solve for the 4th:
Effect size (Cohen's d)
Significance level (alpha, usually 0.05)
Sample size (n)
Power (usually 0.80)
Power Analysis: Determining Sample Size
from statsmodels.stats.power import TTestIndPower
# Power analysis for independent t-test
power_analysis = TTestIndPower()
# How many samples needed to detect d=0.5 with 80% power?
n = power_analysis.solve_power(
    effect_size=0.5,  # Cohen's d
    alpha=0.05,
    power=0.80,
    ratio=1.0         # Equal group sizes
)
print(f"Required n per group: {int(np.ceil(n))}")
# What power do we have with n=50 per group?
power = power_analysis.power(
    effect_size=0.5,
    alpha=0.05,
    nobs1=50,
    ratio=1.0
)
print(f"Power with n=50: {power:.2f}")
# What effect size can we detect with n=30?
es = power_analysis.solve_power(alpha=0.05, power=0.80, nobs1=30, ratio=1.0)
print(f"Minimum detectable effect size: {es:.3f}")
The sample size–power–effect size triangle: For a fixed α, these three quantities are locked in a tradeoff. If you want higher power (fewer missed effects), you need either a larger sample or a larger effect to detect. A power analysis before data collection prevents the two most common experimental design failures: (1) underpowered studies that waste resources by detecting nothing, and (2) overpowered studies that detect trivially small effects and waste resources unnecessarily.
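The tradeoff can also be computed from first principles. The sketch below evaluates the power of a two-sided, equal-variance two-sample t-test with equal group sizes, using SciPy's noncentral t distribution, then searches for the smallest per-group n that reaches 80% power for d = 0.5. The function name is made up for this example, and the equal-variance, equal-n setup is an assumption.

```python
import numpy as np
from scipy import stats

def ttest_ind_power(effect_size, n_per_group, alpha=0.05):
    """Power of a two-sided two-sample t-test (equal variances, equal n)."""
    df = 2 * n_per_group - 2
    noncentrality = effect_size * np.sqrt(n_per_group / 2)
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # Probability that |T| exceeds the critical value when the effect is real.
    return (stats.nct.sf(t_crit, df, noncentrality)
            + stats.nct.cdf(-t_crit, df, noncentrality))

# Smallest per-group sample size reaching 80% power for a medium effect.
n_required = next(n for n in range(2, 1000) if ttest_ind_power(0.5, n) >= 0.80)
print(f"Required n per group for d=0.5: {n_required}")

# Power grows with sample size for a fixed effect size and alpha.
for n in (20, 50, 100):
    print(f"n={n}: power = {ttest_ind_power(0.5, n):.2f}")
```

Halving the effect size to d = 0.25 roughly quadruples the required sample, which is why the minimum effect of interest should be decided before collecting data.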
Quick Reference: Which Test to Use
What are you comparing?
|
+-- One group vs known value?
| +-- Continuous, normal? -> One-sample t-test
| +-- Continuous, non-normal? -> Wilcoxon signed-rank
| +-- Categorical? -> Chi-square goodness of fit
|
+-- Two groups?
| +-- Independent?
| | +-- Continuous, normal? -> Independent t-test (Welch's)
| | +-- Continuous, non-normal? -> Mann-Whitney U
| | +-- Categorical? -> Chi-square test of independence
| +-- Paired?
| +-- Continuous, normal? -> Paired t-test
| +-- Continuous, non-normal? -> Wilcoxon signed-rank (paired)
|
+-- Three+ groups?
+-- One factor?
| +-- Continuous, normal? -> One-way ANOVA
| +-- Continuous, non-normal? -> Kruskal-Wallis
+-- Two+ factors?
+-- Continuous, normal? -> Two-way ANOVA
+-- Non-parametric? -> Friedman test (repeated-measures or blocked designs)
Key Takeaways
Summary: Statistical Testing Deep Dive
- Always state H0 and H1 before testing. Hypothesis testing is a structured framework for decision-making under uncertainty, not a fishing expedition for significant results.
- p-value is NOT the probability H0 is true. It is the probability of seeing data this extreme if H₀ were true. This subtle distinction is the most common misconception in statistics.
- Statistical significance ≠ Practical significance. Always report effect sizes (Cohen's d, Cramér's V) alongside p-values. A tiny effect can be "significant" with a large enough sample.
- Use Welch's t-test by default (safer than Student's). It does not assume equal variances and performs nearly as well when variances are equal.
- Correct for multiple comparisons using Bonferroni (conservative, few tests) or FDR (liberal, many tests). Ignoring multiple testing inflates false positive rates to unacceptably high levels.
- Always check assumptions (normality, independence, homoscedasticity). Violated assumptions lead to incorrect p-values and invalid conclusions. Use non-parametric tests when assumptions fail.
- Use power analysis to determine sample size before collecting data. The minimum detectable effect size is a function of sample size and desired power, so design your experiment accordingly.
- Non-parametric tests are alternatives when normality fails. They trade some power for robustness, making them ideal for small samples or skewed distributions.
Practice Exercises
- Drug Trial: You measure blood pressure in 40 patients after a new drug. Historical mean is 120 mmHg. Your sample mean is 115 mmHg, std=12. Is the drug effective? (One-sample t-test)
- A/B Test: Website A has conversion rate 12.3% (n=5000), Website B has 13.1% (n=5000). Is B significantly better? (Two-proportion z-test)
- Survey Analysis: A survey of 200 people shows the relationship between education level (High School, Bachelor, Master, PhD) and preferred news source (TV, Online, Print). Is there an association? (Chi-square)
- Experiment Design: You want to detect a medium effect (d=0.5) with 90% power. How many subjects do you need per group? (Power analysis)