Statistical Testing: Hypothesis, t-tests and Chi-square

Why This Matters

Statistical testing is how we make decisions under uncertainty. Instead of guessing whether a drug works, a feature helps, or a difference is real, we quantify the evidence and let the data guide our conclusions.

The Hypothesis Testing Framework

Visual: Rejection Regions

Type I and Type II Errors

The p-value: What It Actually Means

❌ WRONG

"p = 0.03 means there's a 3% chance H₀ is true"

✅ RIGHT

"If H₀ were true, there's a 3% chance of seeing data at least this extreme"

❌ WRONG

"p < 0.05 means the effect is large"

✅ RIGHT

"p < 0.05 means data this extreme would be unlikely if H₀ were true" (A tiny effect can be significant with large n)

❌ WRONG

"p > 0.05 means no effect exists"

✅ RIGHT

"p > 0.05 means we don't have enough evidence to reject H₀"

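The remark that a tiny effect can be significant with large n can be checked with a short simulation. This is only an illustrative sketch: the sample size, the 0.01 shift, and the variable names are assumptions chosen for the demonstration, not values from the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Assumed illustration: a true mean of 0.01 standard deviations and a very large sample
n = 1_000_000
sample = rng.normal(loc=0.01, scale=1.0, size=n)

# Test H0: mu = 0
t_stat, p_value = stats.ttest_1samp(sample, 0.0)

# One-sample Cohen's d: mean difference divided by the sample standard deviation
cohens_d = sample.mean() / sample.std(ddof=1)

print(f"p-value:   {p_value:.2e}")
print(f"Cohen's d: {cohens_d:.4f}")
```

The p-value should come out far below 0.05 while Cohen's d stays near 0.01, which is why a significance test is usually reported together with an effect size.
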
One-Sample t-test

Effect Size: Cohen's d

| \|d\| Value | Interpretation |
|-----------|---------------|
| 0.2 | Small effect |
| 0.5 | Medium effect |
| 0.8 | Large effect |

Complete Example

import numpy as np
from scipy import stats

np.random.seed(42)
heights = np.random.normal(loc=172, scale=8, size=30)

print(f"Sample mean: {heights.mean():.2f} cm")
print(f"Sample std:  {heights.std(ddof=1):.2f} cm")

mu_0 = 170
t_stat, p_value = stats.ttest_1samp(heights, mu_0)

print(f"\nH0: mu = {mu_0} cm")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value:     {p_value:.4f}")

cohens_d = (heights.mean() - mu_0) / heights.std(ddof=1)
print(f"Cohen's d:   {cohens_d:.4f}")

alpha = 0.05
if p_value < alpha:
    print(f"\nResult: Reject H0 (p < {alpha})")
else:
    print(f"\nResult: Fail to reject H0 (p >= {alpha})")
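
To make the link between the t-statistic and the p-value concrete, the sketch below recomputes both by hand and compares them with `stats.ttest_1samp`. It reuses the same seed and parameters as the example above; the names `t_manual` and `p_manual` are introduced here only for illustration.

```python
import numpy as np
from scipy import stats

np.random.seed(42)
heights = np.random.normal(loc=172, scale=8, size=30)
mu_0 = 170

n = len(heights)
standard_error = heights.std(ddof=1) / np.sqrt(n)

# t = (sample mean - hypothesized mean) / standard error
t_manual = (heights.mean() - mu_0) / standard_error

# Two-sided p-value: P(|T| >= |t|) under H0, with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(heights, mu_0)
print(f"manual: t = {t_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```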

Assumptions

Assumption	Description	How to Check
Independence	Observations are independent	Study design, random sampling
Normality	Data is approximately normal	Shapiro-Wilk test, Q-Q plot
Continuous	Dependent variable is continuous	Data type inspection

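Note: By the Central Limit Theorem, the t-test is fairly robust to mild non-normality once n exceeds roughly 30. This is a rule of thumb; strongly skewed or heavy-tailed data may need a larger sample or a non-parametric test.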
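
The Shapiro-Wilk check listed in the table might look like the sketch below. The two samples are made up for the demonstration (one normal, one exponential), and the sample size of 200 is an arbitrary assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Assumed illustration data: one roughly normal sample, one strongly skewed sample
normal_sample = rng.normal(loc=172, scale=8, size=200)
skewed_sample = rng.exponential(scale=8, size=200)

# Shapiro-Wilk: H0 is that the data come from a normal distribution,
# so a small p-value is evidence against normality
w_normal, p_normal = stats.shapiro(normal_sample)
w_skewed, p_skewed = stats.shapiro(skewed_sample)

print(f"normal sample: W = {w_normal:.4f}, p = {p_normal:.4f}")
print(f"skewed sample: W = {w_skewed:.4f}, p = {p_skewed:.2e}")
```

A non-significant result does not prove normality; with small samples the test has little power, so a Q-Q plot is a useful companion.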

Two-Sample t-test

Welch's t-test (Unequal Variances)

np.random.seed(42)
men_heights = np.random.normal(loc=175, scale=7, size=30)
women_heights = np.random.normal(loc=163, scale=6, size=30)

t_stat, p_value = stats.ttest_ind(men_heights, women_heights, equal_var=False)

print(f"t-statistic: {t_stat:.4f}")
print(f"p-value:     {p_value:.4f}")

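# Pooled SD as the root of the averaged variances (appropriate here because both groups have n = 30)
pooled_std = np.sqrt((men_heights.std(ddof=1)**2 + women_heights.std(ddof=1)**2) / 2)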
cohens_d = (men_heights.mean() - women_heights.mean()) / pooled_std
print(f"Cohen's d:   {cohens_d:.4f}")
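
What `equal_var=False` changes is the standard error and the degrees of freedom. The sketch below reproduces Welch's statistic by hand using the Welch-Satterthwaite approximation and compares it with SciPy; the helper names (`se1_sq`, `df_welch`, and so on) are illustrative, not part of any library.

```python
import numpy as np
from scipy import stats

np.random.seed(42)
men_heights = np.random.normal(loc=175, scale=7, size=30)
women_heights = np.random.normal(loc=163, scale=6, size=30)

n1, n2 = len(men_heights), len(women_heights)

# Squared standard error contributed by each group (variances are not pooled)
se1_sq = men_heights.var(ddof=1) / n1
se2_sq = women_heights.var(ddof=1) / n2

t_manual = (men_heights.mean() - women_heights.mean()) / np.sqrt(se1_sq + se2_sq)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch = (se1_sq + se2_sq) ** 2 / (se1_sq**2 / (n1 - 1) + se2_sq**2 / (n2 - 1))
p_manual = 2 * stats.t.sf(abs(t_manual), df=df_welch)

t_scipy, p_scipy = stats.ttest_ind(men_heights, women_heights, equal_var=False)
print(f"manual: t = {t_manual:.4f}, df = {df_welch:.2f}")
print(f"scipy:  t = {t_scipy:.4f}")
```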

Paired Samples

np.random.seed(42)
before = np.random.normal(loc=65, scale=10, size=25)
after = before + np.random.normal(loc=5, scale=8, size=25)

t_stat, p_value = stats.ttest_rel(after, before)
t_stat_one, p_value_one = stats.ttest_rel(after, before, alternative='greater')

print("Paired t-test:")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value (two-tailed): {p_value:.4f}")
print(f"p-value (one-tailed): {p_value_one:.4f}")
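
One way to see what the paired test does: it is equivalent to a one-sample t-test on the per-subject differences against zero. The sketch below checks this on the same simulated data; `diff` is a name introduced here for illustration.

```python
import numpy as np
from scipy import stats

np.random.seed(42)
before = np.random.normal(loc=65, scale=10, size=25)
after = before + np.random.normal(loc=5, scale=8, size=25)

# Per-subject change; pairing removes between-subject variability
diff = after - before

t_rel, p_rel = stats.ttest_rel(after, before)
t_one, p_one = stats.ttest_1samp(diff, 0.0)

print(f"ttest_rel:            t = {t_rel:.4f}, p = {p_rel:.4f}")
print(f"ttest_1samp on diffs: t = {t_one:.4f}, p = {p_one:.4f}")
```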

Chi-Square Test

import pandas as pd

contingency = pd.DataFrame({
    'Python': [50, 35],
    'Java': [30, 45],
    'R': [20, 20]
}, index=['Male', 'Female'])

chi2, p_value, dof, expected = stats.chi2_contingency(contingency)

print("Contingency Table:")
print(contingency)
print(f"\nChi-square statistic: {chi2:.4f}")
print(f"p-value:             {p_value:.4f}")
print(f"Degrees of freedom:  {dof}")
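
The `expected` array returned above holds the counts that would be expected if gender and language were independent. A minimal sketch of that calculation follows, using a plain NumPy array with the same counts; the `*_manual` names are illustrative.

```python
import numpy as np
from scipy import stats

# Same counts as the contingency table: columns are Python, Java, R
observed = np.array([[50, 30, 20],   # Male
                     [35, 45, 20]])  # Female

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# Expected count under independence: (row total * column total) / grand total
expected_manual = row_totals * col_totals / n

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print(f"manual chi2 = {chi2_manual:.4f}, scipy chi2 = {chi2:.4f}, dof = {dof_manual}")
```

Note that `chi2_contingency` applies Yates' continuity correction by default only when there is one degree of freedom (a 2x2 table), so the manual and SciPy values agree for this 2x3 table.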

Effect Size: Cramér's V

| V Value | Interpretation |
|----------|---------------|
| 0.1 | Small association |
| 0.3 | Medium association |
| 0.5 | Large association |
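
Cramér's V can be computed from the chi-square statistic as V = sqrt(χ² / (n · (k − 1))), where n is the total count and k is the smaller of the number of rows and columns. The sketch below applies that formula to the gender-by-language counts; it is a hand calculation for illustration, with variable names chosen here.

```python
import numpy as np
from scipy import stats

# Same counts as the contingency table: columns are Python, Java, R
observed = np.array([[50, 30, 20],   # Male
                     [35, 45, 20]])  # Female

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

n = observed.sum()
k = min(observed.shape)  # smaller of (number of rows, number of columns)

cramers_v = np.sqrt(chi2 / (n * (k - 1)))
print(f"Cramér's V: {cramers_v:.4f}")
```

For these counts V comes out at about 0.17, which falls between the small and medium thresholds in the table.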

ANOVA (Analysis of Variance)

Why Not Use Multiple t-tests?

Multiple t-tests: A vs B (test 1), A vs C (test 2), B vs C (test 3). P(≥1 false positive) = 14.3%.

ANOVA: a single test asks whether any groups differ, then a post-hoc test identifies which ones. FWER = α = 0.05.
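
The 14.3% figure follows from 1 − (1 − α)^m with α = 0.05 and m = 3 tests. The sketch below generalizes this to more groups. It assumes the tests are independent, which pairwise comparisons on shared groups are not exactly, so the numbers are best read as an approximation; the function name is hypothetical.

```python
from math import comb


def family_wise_error_rate(n_groups: int, alpha: float = 0.05) -> float:
    """Approximate P(at least one false positive) across all pairwise tests.

    Assumes independent tests, each run at significance level alpha.
    """
    n_tests = comb(n_groups, 2)  # number of distinct pairs of groups
    return 1 - (1 - alpha) ** n_tests


for k in (3, 4, 5, 10):
    print(f"{k} groups -> {comb(k, 2)} tests -> FWER ~ {family_wise_error_rate(k):.1%}")
```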

One-Way ANOVA

np.random.seed(42)
method_a = np.random.normal(loc=75, scale=10, size=30)
method_b = np.random.normal(loc=80, scale=10, size=30)
method_c = np.random.normal(loc=72, scale=10, size=30)

f_stat, p_value = stats.f_oneway(method_a, method_b, method_c)

print(f"F-statistic: {f_stat:.4f}")
print(f"p-value:     {p_value:.4f}")

if p_value < 0.05:
    print("Result: At least one method differs significantly")
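
The F-statistic is the ratio of between-group variance to within-group variance. As a rough sketch of what `f_oneway` computes, the block below builds F from sums of squares on the same simulated data and compares the result; the `ss_*` and `*_manual` names are illustrative.

```python
import numpy as np
from scipy import stats

np.random.seed(42)
method_a = np.random.normal(loc=75, scale=10, size=30)
method_b = np.random.normal(loc=80, scale=10, size=30)
method_c = np.random.normal(loc=72, scale=10, size=30)

groups = [method_a, method_b, method_c]
all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()
k, n_total = len(groups), len(all_scores)

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(method_a, method_b, method_c)
print(f"manual: F = {f_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  F = {f_scipy:.4f}, p = {p_scipy:.4f}")
```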

Post-Hoc: Tukey's HSD

from statsmodels.stats.multicomp import pairwise_tukeyhsd

all_scores = np.concatenate([method_a, method_b, method_c])
groups = ['A']*30 + ['B']*30 + ['C']*30

tukey = pairwise_tukeyhsd(all_scores, groups, alpha=0.05)
print(tukey)

Non-Parametric Tests

Parametric	Non-Parametric	When to Use
One-sample t	Wilcoxon signed-rank	Small sample, non-normal
Independent t	Mann-Whitney U	Non-normal or ordinal data
Paired t	Wilcoxon signed-rank (paired)	Paired, non-normal differences
One-way ANOVA	Kruskal-Wallis	Non-normal, 3+ groups
Pearson r	Spearman rho	Non-linear monotonic relationship

# Mann-Whitney U (non-parametric independent t-test)
stat, p_value = stats.mannwhitneyu(method_a, method_b, alternative='two-sided')
print(f"Mann-Whitney U: {stat:.4f}, p-value: {p_value:.4f}")

# Kruskal-Wallis (non-parametric ANOVA)
stat, p_value = stats.kruskal(method_a, method_b, method_c)
print(f"Kruskal-Wallis H: {stat:.4f}, p-value: {p_value:.4f}")
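
The other two rows of the table, Wilcoxon signed-rank and Spearman rho, could be sketched as follows. The paired data reuse the before/after simulation from the paired t-test example; the exponential relationship used for the correlation is an assumption picked to show a monotonic but non-linear pattern.

```python
import numpy as np
from scipy import stats

np.random.seed(42)
before = np.random.normal(loc=65, scale=10, size=25)
after = before + np.random.normal(loc=5, scale=8, size=25)

# Wilcoxon signed-rank (non-parametric paired t-test): works on ranks of the differences
w_stat, p_wilcoxon = stats.wilcoxon(after, before)
print(f"Wilcoxon W: {w_stat:.4f}, p-value: {p_wilcoxon:.4f}")

# Assumed illustration: y grows monotonically but non-linearly with x
x = np.linspace(0, 5, 50)
y = np.exp(x)

rho, _ = stats.spearmanr(x, y)  # rank-based, captures any monotonic relationship
r, _ = stats.pearsonr(x, y)     # measures linear association only
print(f"Spearman rho: {rho:.4f}, Pearson r: {r:.4f}")
```

Because Spearman works on ranks, it reports a perfect correlation for any strictly increasing relationship, while Pearson is pulled below 1 by the curvature.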

Multiple Comparisons Problem

Solutions

from statsmodels.stats.multitest import multipletests

p_values = [0.01, 0.04, 0.03, 0.85, 0.12, 0.02, 0.06]

# Bonferroni correction
reject, pvals_corrected, _, _ = multipletests(p_values, method='bonferroni')
print("Bonferroni corrected:", pvals_corrected)

# FDR (Benjamini-Hochberg)
reject_fdr, pvals_fdr, _, _ = multipletests(p_values, method='fdr_bh')
print("FDR corrected:", pvals_fdr)
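
To show what the two corrections do to each p-value, here is a hand-rolled NumPy sketch of the Bonferroni and Benjamini-Hochberg adjustments applied to the same list. It is meant for illustration rather than as a replacement for `multipletests`, and the variable names are chosen here.

```python
import numpy as np

p_values = np.array([0.01, 0.04, 0.03, 0.85, 0.12, 0.02, 0.06])
m = len(p_values)

# Bonferroni: multiply each p-value by the number of tests, capped at 1
bonferroni = np.minimum(p_values * m, 1.0)

# Benjamini-Hochberg: scale the i-th smallest p-value by m / i,
# then enforce monotonicity with a running minimum from the largest rank down
order = np.argsort(p_values)
scaled = p_values[order] * m / np.arange(1, m + 1)
monotone = np.minimum.accumulate(scaled[::-1])[::-1]

bh = np.empty(m)
bh[order] = np.minimum(monotone, 1.0)  # restore the original ordering

print("Bonferroni:", np.round(bonferroni, 3))
print("BH (FDR):  ", np.round(bh, 3))
```

For this particular list, no adjusted p-value falls below 0.05 under either method, even though four of the raw p-values did.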

Power Analysis

from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()

# How many samples needed to detect d=0.5 with 80% power?
n = power_analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80, ratio=1.0)
print(f"Required n per group: {int(np.ceil(n))}")

# What power do we have with n=50 per group?
power = power_analysis.power(effect_size=0.5, alpha=0.05, nobs1=50, ratio=1.0)
print(f"Power with n=50: {power:.2f}")
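
An analytic power figure can be sanity-checked by simulation: generate many experiments with a known effect and count how often the test rejects H₀. The sketch below assumes n = 64 per group (the commonly quoted requirement for d = 0.5, α = 0.05, and 80% power) and 5,000 simulated experiments; both numbers and all variable names are choices made for this illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_per_group, effect_size, alpha, n_sims = 64, 0.5, 0.05, 5000

# Each row is one simulated experiment; group B is shifted by d standard deviations
group_a = rng.normal(0.0, 1.0, size=(n_sims, n_per_group))
group_b = rng.normal(effect_size, 1.0, size=(n_sims, n_per_group))

# One independent-samples t-test per row
_, p_values = stats.ttest_ind(group_a, group_b, axis=1)

# Power = fraction of experiments in which H0 is (correctly) rejected
estimated_power = np.mean(p_values < alpha)
print(f"Estimated power: {estimated_power:.3f}")
```

The estimate should land close to 0.80, up to simulation noise.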

Quick Reference: Which Test to Use

Which Statistical Test Should I Use?

What are you comparing?	Normal	Non-normal
1 group vs known value	One-sample t	Wilcoxon signed-rank
2 groups, independent	Welch's t-test	Mann-Whitney U
2 groups, paired	Paired t-test	Wilcoxon paired
3+ groups, 1 factor	One-way ANOVA	Kruskal-Wallis
3+ groups, 2+ factors	Two-way ANOVA	Friedman

Quick Reference
• Categorical data? → Chi-square test
• Correlation? → Pearson (normal) or Spearman (non-linear)
• Always check: Independence, Normality, Homoscedasticity

Key Takeaways

Practice Exercises

Drug Trial: Blood pressure in 40 patients after a new drug. Historical mean 120 mmHg. Sample mean 115, std=12. Is the drug effective?
A/B Test: Website A conversion 12.3% (n=5000), Website B 13.1% (n=5000). Is B significantly better?
Survey Analysis: Association between education level and preferred news source (n=200). Chi-square test?
Experiment Design: Detect medium effect (d=0.5) with 90% power. How many subjects per group?
