
Statistical Testing: Hypothesis, t-tests and Chi-square

Module 4: Statistics & Probability


Why This Matters

Statistical testing is how we make decisions under uncertainty. Instead of guessing whether a drug works, a feature helps, or a difference is real, we quantify the evidence and let the data guide our conclusions.

The Hypothesis Testing Framework

Step 1: State hypotheses
        H0 (null): "There is no effect / no difference"
        H1 (alternative): "There IS an effect / difference"

Step 2: Collect data

Step 3: Calculate test statistic
        (How far is our data from what H0 predicts?)

Step 4: Calculate p-value
        (How likely is data at least this extreme if H0 is true?)

Step 5: Make decision
        If p-value < alpha -> Reject H0
        If p-value >= alpha -> Fail to reject H0
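
The five steps above can be sketched by hand for a one-sample test. This is a minimal illustration, not a prescribed workflow: the sample values, `mu_0 = 100`, and the choice of a two-sided t-test are all assumptions made for the example.

```python
import numpy as np
from scipy import stats

# Step 1: H0: mu = 100, H1: mu != 100 (illustrative values, not from the lesson)
mu_0 = 100.0
alpha = 0.05

# Step 2: collect data (a made-up sample for illustration)
sample = np.array([102.1, 98.4, 105.3, 101.7, 99.8, 104.2, 103.5, 100.9])

# Step 3: test statistic = distance of the sample mean from mu_0, in standard errors
n = sample.size
std_err = sample.std(ddof=1) / np.sqrt(n)
t_stat = (sample.mean() - mu_0) / std_err

# Step 4: two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

# Step 5: compare with alpha and decide
decision = "Reject H0" if p_value < alpha else "Fail to reject H0"
print(f"t = {t_stat:.3f}, p = {p_value:.4f} -> {decision}")
```

The same numbers should come out of `stats.ttest_1samp(sample, mu_0)`, which wraps steps 3 and 4.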

Visual: Rejection Regions

Distribution under H0 (null hypothesis is true):

                    H0 is true
                        |
        Rejection      |      Rejection
         Region        |       Region
    |:::::::::|--------+--------|:::::::::|
   -3       -2        0        2         3
              -1.96          1.96
                 ^              ^
                 |              |
              Critical      Critical
              Value         Value
              (alpha=0.05)

    If test statistic falls in shaded area -> Reject H0
    If test statistic falls in white area -> Fail to reject H0
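
The critical values in the diagram can be recovered from the standard normal distribution. The sketch below assumes a two-sided z-test; `in_rejection_region` is a hypothetical helper name used only for illustration.

```python
from scipy import stats

alpha = 0.05

# Two-sided test: alpha is split evenly between the two tails
z_crit = stats.norm.ppf(1 - alpha / 2)
print(f"Reject H0 if |z| > {z_crit:.2f}")


def in_rejection_region(z, alpha=0.05):
    """Return True if the test statistic z falls in either tail region."""
    return abs(z) > stats.norm.ppf(1 - alpha / 2)


print(in_rejection_region(2.5))   # beyond the critical value
print(in_rejection_region(1.0))   # inside the non-rejection region
```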

Type I and Type II Errors

                              Actual Reality
                   H0 True            |  H0 False
                   (No Effect)        |  (Effect Exists)
              +--------------------+--------------------+
 Decision:    |                    |                    |
 Reject H0    |  TYPE I ERROR      |  CORRECT           |
 (Say effect  |  (False Positive)  |  (True Positive)   |
  exists)     |  Alpha = 0.05      |  Power = 1 - beta  |
              +--------------------+--------------------+
 Fail to      |  CORRECT           |  TYPE II ERROR     |
 Reject H0    |  (True Negative)   |  (False Negative)  |
 (Say no      |                    |  Beta              |
  effect)     |                    |                    |
              +--------------------+--------------------+
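
A small simulation can make the two error rates concrete. This is a sketch under assumed settings (normal data, n = 30, a true effect of 0.5 standard deviations, 2000 repetitions); `rejection_rate` is a hypothetical helper, and the exact rates depend on the random seed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, n_sims = 0.05, 30, 2000


def rejection_rate(true_mean):
    """Fraction of simulated samples in which H0: mu = 0 is rejected."""
    samples = rng.normal(loc=true_mean, scale=1.0, size=(n_sims, n))
    p_values = stats.ttest_1samp(samples, 0.0, axis=1).pvalue
    return np.mean(p_values < alpha)


# H0 is true: every rejection is a Type I error, so the rate should sit near alpha
type_1_rate = rejection_rate(true_mean=0.0)

# H0 is false: every rejection is correct, so the rate estimates power (1 - beta)
power = rejection_rate(true_mean=0.5)

print(f"Estimated Type I error rate: {type_1_rate:.3f}")
print(f"Estimated power:             {power:.3f}")
print(f"Estimated Type II rate:      {1 - power:.3f}")
```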

The p-value: What It Actually Means

Formal Definition of p-value

$$p\text{-value} = P(T \geq t_{\text{obs}} \mid H_0)$$

Here,

  • $T$ = test statistic under the null distribution
  • $t_{\text{obs}}$ = observed test statistic from the sample
  • $H_0$ = null hypothesis is true

This is the one-sided (upper-tail) form. For a two-sided test, compare $|T|$ with $|t_{\text{obs}}|$.

Common Misconceptions:
  WRONG: "p = 0.03 means there's a 3% chance H0 is true"
  RIGHT: "If H0 were true, there's a 3% chance of seeing data at least this extreme"

  WRONG: "p < 0.05 means the effect is large"
  RIGHT: "p < 0.05 means data this extreme would be unlikely under H0"
  (A tiny effect can be significant with large n)

  WRONG: "p > 0.05 means no effect exists"
  RIGHT: "p > 0.05 means we don't have enough evidence to reject H0"

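Statistical vs practical significance: A p-value measures how surprised you should be by the data if H0 were true. It does NOT measure the size or importance of an effect. With a large enough sample, even a trivial difference becomes statistically significant: at n = 1,000,000 and a standard deviation of 8 cm, a mean height shift of only 0.05 cm (Cohen's d ≈ 0.006) gives an expected t of about 6, well past the p < 0.001 threshold. Always report effect sizes (Cohen's d, Cramér's V) alongside p-values.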
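
The gap between statistical and practical significance can be checked with a few lines of arithmetic. The standard deviation (8 cm) and mean shift (0.05 cm) below are assumed values chosen for illustration, and the p-value uses a normal approximation, which is reasonable at this sample size.

```python
import numpy as np
from scipy import stats

n = 1_000_000
sd = 8.0        # assumed standard deviation in cm
shift = 0.05    # assumed true mean shift in cm

cohens_d = shift / sd                      # effect size: tiny
expected_t = cohens_d * np.sqrt(n)         # same as shift / (sd / sqrt(n))
p_value = 2 * stats.norm.sf(expected_t)    # two-sided, normal approximation

print(f"Cohen's d:  {cohens_d:.5f}")
print(f"Expected t: {expected_t:.2f}")
print(f"p-value:    {p_value:.2e}")
```

The effect size stays negligible while the p-value collapses, because the standard error shrinks with the square root of n.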

One-Sample t-test

One-Sample t-Statistic
$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

Here,

  • $\bar{x}$ = sample mean
  • $\mu_0$ = hypothesized population mean
  • $s$ = sample standard deviation
  • $n$ = sample size
  • $s / \sqrt{n}$ = standard error of the mean

Effect Size: Cohen's d

Cohen's d Effect Size

$$d = \frac{\bar{x} - \mu_0}{s}$$

Here,

  • $\bar{x}$ = sample mean
  • $\mu_0$ = hypothesized population mean
  • $s$ = sample standard deviation

|d| Interpretation:
  0.2  -> Small effect
  0.5  -> Medium effect
  0.8  -> Large effect

Complete Example

import numpy as np
from scipy import stats

np.random.seed(42)
heights = np.random.normal(loc=172, scale=8, size=30)

print(f"Sample mean: {heights.mean():.2f} cm")
print(f"Sample std:  {heights.std(ddof=1):.2f} cm")

mu_0 = 170
t_stat, p_value = stats.ttest_1samp(heights, mu_0)

print(f"\nH0: mu = {mu_0} cm")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value:     {p_value:.4f}")

cohens_d = (heights.mean() - mu_0) / heights.std(ddof=1)
print(f"Cohen's d:   {cohens_d:.4f}")

alpha = 0.05
if p_value < alpha:
    print(f"\nResult: Reject H0 (p < {alpha})")
else:
    print(f"\nResult: Fail to reject H0 (p >= {alpha})")

Assumptions

1. Independence: Observations are independent
2. Normality: Data is approximately normally distributed
   - Check: Shapiro-Wilk test, Q-Q plot
   - Robust: t-test is robust to mild non-normality for n > 30
3. Continuous: The dependent variable is continuous
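
The normality checks listed above can be run with SciPy. This is a minimal sketch on simulated data (so normality holds by construction); the data parameters are assumptions, and the Q-Q information is summarized numerically rather than plotted.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(loc=172, scale=8, size=30)   # simulated sample, normal by design

# Shapiro-Wilk: H0 is "the data come from a normal distribution",
# so a small p-value is evidence AGAINST normality
w_stat, p_shapiro = stats.shapiro(data)
print(f"Shapiro-Wilk W = {w_stat:.4f}, p = {p_shapiro:.4f}")

# Q-Q plot ingredients: theoretical quantiles vs ordered sample values.
# r close to 1 means the points lie near a straight line.
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(data, dist="norm")
print(f"Q-Q correlation r = {r:.4f}")
```

To draw the plot itself, the two arrays can be passed to a scatter plot in any plotting library.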

The Central Limit Theorem saves you: Even if the underlying data are not normal, the sampling distribution of the mean approaches normality as n increases (CLT). As a rule of thumb, n > 30 makes the t-test robust to moderate departures from normality; strongly skewed or heavy-tailed data may need a larger sample.
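
The CLT claim can be illustrated by simulation. The sketch below assumes exponential data (strongly right-skewed) and compares the skewness of the sample mean at two sample sizes; `skew_of_sample_means` is a hypothetical helper and the repetition count is arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)


def skew_of_sample_means(n, n_reps=10_000):
    """Skewness of the distribution of sample means for exponential data."""
    samples = rng.exponential(scale=1.0, size=(n_reps, n))
    return stats.skew(samples.mean(axis=1))


skew_small = skew_of_sample_means(5)    # small n: means are still visibly skewed
skew_large = skew_of_sample_means(50)   # larger n: means are much closer to symmetric

print(f"Skewness of means, n = 5:  {skew_small:.3f}")
print(f"Skewness of means, n = 50: {skew_large:.3f}")
```

A normal distribution has skewness 0, so the drop in skewness as n grows is the CLT at work.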

Two-Sample t-test

Welch's t-test (Unequal Variances)

Welch's t-Statistic
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

Here,

  • $\bar{x}_1, \bar{x}_2$ = sample means of groups 1 and 2
  • $s_1, s_2$ = sample standard deviations
  • $n_1, n_2$ = sample sizes

Welch-Satterthwaite Degrees of Freedom

$$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$$

Here,

  • $s_1, s_2$ = sample standard deviations
  • $n_1, n_2$ = sample sizes

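import numpy as np
from scipy import stats

# Simulated heights in cm for two independent groups
np.random.seed(42)
men_heights = np.random.normal(loc=175, scale=7, size=30)
women_heights = np.random.normal(loc=163, scale=6, size=30)

# equal_var=False selects Welch's t-test instead of Student's
t_stat, p_value = stats.ttest_ind(men_heights, women_heights, equal_var=False)

print(f"t-statistic: {t_stat:.4f}")
print(f"p-value:     {p_value:.4f}")

# Cohen's d using the root mean square of the two standard deviations
# (a plain average of the variances is appropriate here because n1 == n2)
pooled_std = np.sqrt((men_heights.std(ddof=1)**2 + women_heights.std(ddof=1)**2) / 2)
cohens_d = (men_heights.mean() - women_heights.mean()) / pooled_std
print(f"Cohen's d:   {cohens_d:.4f}")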
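
The Welch statistic and the Welch-Satterthwaite degrees of freedom can also be computed directly from the formulas and compared with SciPy. This sketch uses its own simulated groups with unequal sizes (an assumption made so that the degrees of freedom are not trivially symmetric).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
group_1 = rng.normal(loc=175, scale=7, size=30)
group_2 = rng.normal(loc=163, scale=6, size=25)

# Per-group variance of the mean: s^2 / n
v1 = group_1.var(ddof=1) / group_1.size
v2 = group_2.var(ddof=1) / group_2.size

# Welch's t-statistic
t_manual = (group_1.mean() - group_2.mean()) / np.sqrt(v1 + v2)

# Welch-Satterthwaite degrees of freedom (usually not an integer)
df = (v1 + v2) ** 2 / (v1 ** 2 / (group_1.size - 1) + v2 ** 2 / (group_2.size - 1))

# Two-sided p-value from the t distribution with df degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df)

t_scipy, p_scipy = stats.ttest_ind(group_1, group_2, equal_var=False)
print(f"manual: t = {t_manual:.4f}, df = {df:.2f}, p = {p_manual:.3e}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.3e}")
```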

Paired Samples

np.random.seed(42)
before = np.random.normal(loc=65, scale=10, size=25)
after = before + np.random.normal(loc=5, scale=8, size=25)

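# Two-sided test: is the mean of (after - before) different from zero?
t_stat, p_value = stats.ttest_rel(after, before)
# One-sided test: is the mean of (after - before) greater than zero?
t_stat_one, p_value_one = stats.ttest_rel(after, before, alternative='greater')

print("Paired t-test:")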
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value (two-tailed): {p_value:.4f}")
print(f"p-value (one-tailed): {p_value_one:.4f}")
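
A paired t-test is equivalent to a one-sample t-test on the per-subject differences against zero. The sketch below checks this on simulated data (the data parameters mirror the example above but use a separate random generator, which is an assumption of this illustration).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
before = rng.normal(loc=65, scale=10, size=25)
after = before + rng.normal(loc=5, scale=8, size=25)

# Paired test on the two measurements
paired = stats.ttest_rel(after, before)

# One-sample test on the differences, with H0: mean difference = 0
differences = after - before
one_sample = stats.ttest_1samp(differences, 0.0)

print(f"paired:     t = {paired.statistic:.4f}, p = {paired.pvalue:.4f}")
print(f"one-sample: t = {one_sample.statistic:.4f}, p = {one_sample.pvalue:.4f}")
```

This is why pairing helps: between-subject variability cancels out in the differences.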

Chi-Square Test

Chi-Square Test Statistic
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Here,

  • $O$ = observed frequency (actual data)
  • $E$ = expected frequency (if variables were independent)

import pandas as pd
from scipy import stats

contingency = pd.DataFrame({
    'Python': [50, 35],
    'Java': [30, 45],
    'R': [20, 20]
}, index=['Male', 'Female'])

chi2, p_value, dof, expected = stats.chi2_contingency(contingency)

print("Contingency Table:")
print(contingency)
print(f"\nChi-square statistic: {chi2:.4f}")
print(f"p-value:             {p_value:.4f}")
print(f"Degrees of freedom:  {dof}")
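
The expected frequencies and the statistic can be reproduced by hand from the formula. This sketch uses the same counts as the contingency table above, entered as a plain NumPy array; under independence, each expected count is (row total × column total) / grand total.

```python
import numpy as np
from scipy import stats

# Rows: Male, Female; columns: Python, Java, R
observed = np.array([[50, 30, 20],
                     [35, 45, 20]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Sum of (O - E)^2 / E over all cells
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof)

print("Expected counts:")
print(expected)
print(f"chi-square = {chi2_manual:.3f}, dof = {dof}, p = {p_manual:.4f}")
```

For these counts the statistic is about 5.65 with p about 0.059, so H0 (independence) would not be rejected at alpha = 0.05.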

Effect Size: Cramér's V

Cramér's V Effect Size

$$V = \sqrt{\frac{\chi^2}{n \cdot (k - 1)}}$$

Here,

  • $\chi^2$ = chi-square statistic
  • $n$ = total sample size
  • $k$ = min(number of rows, number of columns)

|V| Interpretation:
  0.1  -> Small association
  0.3  -> Medium association
  0.5  -> Large association
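
Cramér's V follows directly from the chi-square output. The sketch below applies the formula to the language-preference counts used earlier, entered as a NumPy array.

```python
import numpy as np
from scipy import stats

# Rows: Male, Female; columns: Python, Java, R
observed = np.array([[50, 30, 20],
                     [35, 45, 20]])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

n = observed.sum()           # total sample size
k = min(observed.shape)      # smaller of (rows, columns)
cramers_v = np.sqrt(chi2 / (n * (k - 1)))

print(f"Cramér's V: {cramers_v:.3f}")
```

For these counts V is about 0.17, which falls between the "small" and "medium" thresholds listed above.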

ANOVA (Analysis of Variance)

Why Not Use Multiple t-tests?

Comparing 3 groups: A, B, C

Multiple t-tests approach:
  A vs B -> test 1
  A vs C -> test 2
  B vs C -> test 3

  Problem: With alpha=0.05 per test:
  P(at least one false positive) = 1 - (1-0.05)^3 ≈ 0.143 (treating the tests as independent)

ANOVA approach:
  Single test: Are ANY groups different?
  Then post-hoc: WHICH groups are different?
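
The inflation of the false-positive rate grows quickly with the number of groups. This sketch evaluates the same formula for several group counts, treating the pairwise tests as independent (an approximation, since pairwise tests share data); `familywise_error_rate` is a hypothetical helper name.

```python
from math import comb


def familywise_error_rate(alpha, m):
    """P(at least one false positive) across m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m


alpha = 0.05
for n_groups in (3, 4, 5, 10):
    m = comb(n_groups, 2)   # number of pairwise comparisons
    fwer = familywise_error_rate(alpha, m)
    print(f"{n_groups} groups -> {m:2d} pairwise tests -> FWER = {fwer:.3f}")
```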

One-Way ANOVA

F-Statistic for One-Way ANOVA
$$F = \frac{\text{Between-Group Variance}}{\text{Within-Group Variance}} = \frac{MS_{\text{between}}}{MS_{\text{within}}}$$

Here,

  • $MS_{\text{between}}$ = mean square between groups (signal)
  • $MS_{\text{within}}$ = mean square within groups (noise)

import numpy as np
from scipy import stats

# Simulated test scores for three teaching methods
np.random.seed(42)
method_a = np.random.normal(loc=75, scale=10, size=30)
method_b = np.random.normal(loc=80, scale=10, size=30)
method_c = np.random.normal(loc=72, scale=10, size=30)

f_stat, p_value = stats.f_oneway(method_a, method_b, method_c)

print(f"F-statistic: {f_stat:.4f}")
print(f"p-value:     {p_value:.4f}")

if p_value < 0.05:
    print("Result: At least one method differs significantly")
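
The F-statistic can be built from the two mean squares by hand and checked against `stats.f_oneway`. This is a sketch on separately simulated groups (same means and spread as above, but a different random generator, which is an assumption of this illustration).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
groups = [rng.normal(loc=mean, scale=10, size=30) for mean in (75, 80, 72)]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = all_values.size  # total number of observations

# Signal: how far each group mean sits from the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ms_between = ss_between / (k - 1)

# Noise: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ms_within = ss_within / (n_total - k)

f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual: F = {f_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  F = {f_scipy:.4f}, p = {p_scipy:.4f}")
```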

Post-Hoc: Tukey's HSD

from statsmodels.stats.multicomp import pairwise_tukeyhsd

all_scores = np.concatenate([method_a, method_b, method_c])
groups = ['A']*30 + ['B']*30 + ['C']*30

tukey = pairwise_tukeyhsd(all_scores, groups, alpha=0.05)
print(tukey)

Non-Parametric Tests

Parametric      | Non-Parametric                 | When to Use
----------------+--------------------------------+-----------------------------------
One-sample t    | Wilcoxon signed-rank           | Small sample, non-normal
Independent t   | Mann-Whitney U                 | Non-normal or ordinal data
Paired t        | Wilcoxon signed-rank (paired)  | Paired, non-normal differences
One-way ANOVA   | Kruskal-Wallis                 | Non-normal, 3+ groups
Pearson r       | Spearman rho                   | Non-linear monotonic relationship

# Mann-Whitney U (non-parametric independent t-test)
stat, p_value = stats.mannwhitneyu(method_a, method_b, alternative='two-sided')
print(f"Mann-Whitney U: {stat:.4f}, p-value: {p_value:.4f}")

# Kruskal-Wallis (non-parametric ANOVA)
stat, p_value = stats.kruskal(method_a, method_b, method_c)
print(f"Kruskal-Wallis H: {stat:.4f}, p-value: {p_value:.4f}")
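
The remaining rows of the table can be sketched the same way. The example below assumes simulated paired data for the Wilcoxon signed-rank test and an exponential curve for the Spearman comparison; both data sets are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Wilcoxon signed-rank (non-parametric paired t-test)
before = rng.normal(loc=65, scale=10, size=25)
after = before + rng.normal(loc=5, scale=8, size=25)
w_stat, p_wilcoxon = stats.wilcoxon(after, before)
print(f"Wilcoxon signed-rank: W = {w_stat:.1f}, p-value: {p_wilcoxon:.4f}")

# Spearman rho vs Pearson r on a monotonic but strongly non-linear relationship
x = np.linspace(0.1, 5, 40)
y = np.exp(x)
rho, _ = stats.spearmanr(x, y)
r, _ = stats.pearsonr(x, y)
print(f"Spearman rho: {rho:.4f}, Pearson r: {r:.4f}")
```

Spearman works on ranks, so any strictly increasing relationship gives rho = 1, while Pearson r is pulled below 1 by the curvature.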

Multiple Comparisons Problem

Solutions

from statsmodels.stats.multitest import multipletests

p_values = [0.01, 0.04, 0.03, 0.85, 0.12, 0.02, 0.06]

# Bonferroni correction
reject, pvals_corrected, _, _ = multipletests(p_values, method='bonferroni')
print("Bonferroni corrected:", pvals_corrected)

# FDR (Benjamini-Hochberg)
reject_fdr, pvals_fdr, _, _ = multipletests(p_values, method='fdr_bh')
print("FDR corrected:", pvals_fdr)

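When to use which correction: Bonferroni controls the family-wise error rate (the chance of even one false positive), so it suits a small number of comparisons (roughly m < 10) where a false positive is costly. FDR (Benjamini-Hochberg) controls the expected share of false positives among the rejected hypotheses, so it suits many comparisons (roughly m > 10) where you want to maximize discoveries (e.g., genomics, exploratory data analysis).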
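
To see what the two corrections actually do to each p-value, they can be computed by hand with NumPy. This sketch uses the same seven p-values as above and implements the standard adjusted-p-value forms of Bonferroni and Benjamini-Hochberg; it is meant as an illustration, not a replacement for `multipletests`.

```python
import numpy as np

p_values = np.array([0.01, 0.04, 0.03, 0.85, 0.12, 0.02, 0.06])
m = p_values.size

# Bonferroni: multiply every p-value by the number of tests, cap at 1
bonferroni = np.minimum(p_values * m, 1.0)

# Benjamini-Hochberg: scale the i-th smallest p-value by m / i,
# then enforce monotonicity by taking a running minimum from the largest down
order = np.argsort(p_values)
scaled = p_values[order] * m / np.arange(1, m + 1)
scaled = np.minimum.accumulate(scaled[::-1])[::-1]
bh = np.empty(m)
bh[order] = np.minimum(scaled, 1.0)

print("Bonferroni adjusted:", np.round(bonferroni, 3))
print("BH (FDR) adjusted:  ", np.round(bh, 3))
```

For these inputs the smallest adjusted p-value is 0.07 under both methods, so nothing is rejected at 0.05 even though five raw p-values were below 0.05. The BH values are never larger than the Bonferroni ones, which is why FDR control finds more discoveries in general.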

Power Analysis

Statistical Power

$$\text{Power} = 1 - \beta = f(\text{effect size}, \alpha, n)$$

Here,

  • $\beta$ = probability of Type II error
  • $\alpha$ = significance level (typically 0.05)
  • $n$ = sample size per group

import numpy as np
from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()

# How many samples needed to detect d=0.5 with 80% power?
n = power_analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80, ratio=1.0)
print(f"Required n per group: {int(np.ceil(n))}")

# What power do we have with n=50 per group?
power = power_analysis.power(effect_size=0.5, alpha=0.05, nobs1=50, ratio=1.0)
print(f"Power with n=50: {power:.2f}")
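
The dependence of sample size on effect size, alpha, and power can be made explicit with the usual normal-approximation formula for a two-sided, two-sample comparison, n ≈ 2((z_{1-α/2} + z_{power}) / d)^2 per group. This is a sketch of that approximation, not what `TTestIndPower` computes internally; `n_per_group` is a hypothetical helper name.

```python
import math
from scipy import stats


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-sample test."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # critical value for the test
    z_power = stats.norm.ppf(power)           # quantile matching the target power
    return 2 * ((z_alpha + z_power) / effect_size) ** 2


n_approx = n_per_group(0.5)
print(f"Approximate n per group for d = 0.5: {math.ceil(n_approx)}")

# Halving the effect size roughly quadruples the required sample size
print(f"Approximate n per group for d = 0.25: {math.ceil(n_per_group(0.25))}")
```

Because it uses the normal rather than the t distribution, this approximation tends to come out slightly below the t-based answer from `solve_power`.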

Quick Reference: Which Test to Use

What are you comparing?
  |
  +-- One group vs known value?
  |     +-- Continuous, normal? -> One-sample t-test
  |     +-- Continuous, non-normal? -> Wilcoxon signed-rank
  |     +-- Categorical? -> Chi-square goodness of fit
  |
  +-- Two groups?
  |     +-- Independent?
  |     |     +-- Continuous, normal? -> Independent t-test (Welch's)
  |     |     +-- Continuous, non-normal? -> Mann-Whitney U
  |     |     +-- Categorical? -> Chi-square test of independence
  |     +-- Paired?
  |           +-- Continuous, normal? -> Paired t-test
  |           +-- Continuous, non-normal? -> Wilcoxon signed-rank (paired)
  |
  +-- Three+ groups?
        +-- One factor?
        |     +-- Continuous, normal? -> One-way ANOVA
        |     +-- Continuous, non-normal? -> Kruskal-Wallis
        +-- Two+ factors?
              +-- Continuous, normal? -> Two-way ANOVA
              +-- Non-parametric? -> Friedman test (repeated measures / blocked designs only)

Key Takeaways

Summary: Statistical Testing Deep Dive

  1. Always state H0 and H1 before testing. Hypothesis testing is a structured framework, not a fishing expedition.
  2. p-value is NOT the probability H0 is true. It is the probability of seeing data at least this extreme if H0 were true.
  3. Statistical significance ≠ practical significance. Always report effect sizes (Cohen's d, Cramér's V) alongside p-values.
  4. Use Welch's t-test by default (safer than Student's). It does not assume equal variances.
  5. Correct for multiple comparisons using Bonferroni (conservative, controls the family-wise error rate) or FDR (less conservative, controls the false discovery rate).
  6. Always check assumptions (normality, independence, homoscedasticity). Use non-parametric tests when assumptions fail.
  7. Use power analysis to determine sample size before collecting data.

Practice Exercises

  1. Drug Trial: Blood pressure in 40 patients after a new drug. Historical mean 120 mmHg. Sample mean 115, std=12. Is the drug effective?
  2. A/B Test: Website A conversion 12.3% (n=5000), Website B 13.1% (n=5000). Is B significantly better?
  3. Survey Analysis: Association between education level and preferred news source (n=200). Chi-square test?
  4. Experiment Design: Detect medium effect (d=0.5) with 90% power. How many subjects per group?
