Hypothesis Testing

Principles of Hypothesis Testing

Hypothesis testing provides the statistical framework for making decisions about populations based on sample data. This formal approach specifies competing claims, evaluates evidence from data, and draws conclusions with known error probabilities. Understanding hypothesis testing principles enables appropriate application and correct interpretation across diverse analytical contexts.

Hypothesis testing emerged from statistical theory in the early twentieth century, with major contributions from Ronald Fisher, Jerzy Neyman, and Egon Pearson. Their work established the conceptual framework and mathematical foundations that remain central to statistical practice. While debates continue about appropriate interpretation and use, hypothesis testing remains fundamental to evidence-based decision making.

Hypothesis testing serves as the primary inferential framework for evaluating research claims across scientific disciplines. From clinical trials testing treatment efficacy to A/B tests evaluating product changes, formal hypothesis testing provides the structure for objective evaluation. Data scientists routinely apply these methods to make data-driven business decisions.

Formulating Hypotheses

Hypothesis formulation requires translating research questions into statistical terms with precisely specified null and alternative hypotheses. This translation determines what conclusions the test can support and shapes the entire analytical approach.

Null and Alternative Hypotheses

The null hypothesis (H₀) represents the default or status quo claim. It typically states no effect, no difference, or no association. The alternative hypothesis (H₁) represents the research claim of interest. It states that an effect exists, a difference occurs, or an association holds.

The null hypothesis is the one subjected to formal testing. Evidence must be sufficiently strong to reject H₀ before we conclude that H₁ is supported. This framework protects against falsely claiming effects when none exist, at the cost of sometimes missing real effects.

Hypotheses can specify a direction (one-tailed tests) or allow any departure from H₀ (two-tailed tests). One-tailed tests are appropriate when theory or previous research predicts a specific direction. Two-tailed tests are appropriate when the direction is unspecified or when findings in either direction would be of interest.

Parameter Hypotheses

Hypotheses concern population parameters, not sample statistics. Common parameter hypotheses include population means (μ), proportions (p), differences (μ₁ - μ₂), and correlations (ρ). The parameter of interest must be clearly specified.

Hypotheses can be stated as equalities, inequalities, or composite statements. Simple hypotheses specify exact parameter values. Composite hypotheses specify parameter ranges. Most tests compare parameter values to null values (typically zero for effects).

The practical hypothesis often differs from the statistical hypothesis. Practical significance considers real-world importance beyond statistical significance. A statistically significant effect might be too small to matter practically, or a practically important effect might require large samples to achieve statistical significance.
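
The gap between statistical and practical significance can be illustrated with a small sketch. It assumes a one-sample z statistic with a known standard deviation and a sample mean that sits exactly 0.01 units above the null value; the function name and all numbers are hypothetical, chosen only for illustration.

```python
import math
from statistics import NormalDist


def two_sided_p_for_shift(shift, sigma, n):
    """Return (z, two-sided p-value) when the sample mean is `shift` above the null value."""
    z = shift / (sigma / math.sqrt(n))
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical numbers: a shift of 0.01 standard deviations, which is tiny in practice.
shift, sigma = 0.01, 1.0
for n in (100, 10_000, 1_000_000):
    z, p_value = two_sided_p_for_shift(shift, sigma, n)
    print(f"n={n:>9,d}  z={z:6.2f}  p={p_value:.4f}")
```

The same negligible shift is far from significant at n = 100 and overwhelmingly significant at n = 1,000,000, which is why effect size should be reported alongside the test result.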

Test Statistics and Sampling Distributions

Test statistics summarize sample evidence about hypotheses. Their distributions under the null hypothesis determine critical values and p-values. Understanding test statistics and their distributions is essential for hypothesis testing.

Test Statistic Construction

Test statistics measure discrepancy between sample data and null hypothesis expectations. Larger discrepancies provide stronger evidence against H₀. The specific form depends on the parameter tested and distribution assumptions.

Common test statistics include z-statistics following standard normal distributions, t-statistics following t-distributions, chi-square statistics following chi-square distributions, and F-statistics following F-distributions. Each applies to specific testing situations.

The general form compares observed values to expected values under H₀, scaled by estimates of variability. This comparison reveals whether observed patterns are unusually large relative to sampling variability.

Sampling Distributions Under Null

The null distribution defines the probability distribution of test statistics when H₀ is true. This distribution determines what values are likely under H₀ and what values would be unusual.

Critical values cut off specified tail areas of the null distribution and so define the rejection region. A test statistic that falls in the rejection region (beyond the critical value in the direction specified by H₁) leads to rejecting H₀. The critical value approach compares the test statistic to these cutoffs rather than computing a p-value.

P-values provide an alternative measure of evidence. The p-value is the probability, assuming H₀ is true, of observing a test statistic at least as extreme as the one actually observed. Smaller p-values indicate stronger evidence against H₀, and H₀ is rejected when the p-value falls below the chosen significance level.
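
The two approaches can be compared side by side in a minimal sketch. It assumes a test statistic that is standard normal under H₀; the observed value 2.10 and the significance level 0.05 are hypothetical, and the function names are illustrative only.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal null distribution


def two_sided_p_value(z):
    """Probability of a value at least as extreme as z in either tail."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def upper_tail_p_value(z):
    """One-tailed p-value for an alternative that predicts a larger value."""
    return 1.0 - STD_NORMAL.cdf(z)


alpha = 0.05
z_observed = 2.10  # hypothetical observed test statistic

# Critical value approach: cutoff that leaves alpha/2 in each tail (about 1.96).
critical_value = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
reject_by_critical_value = abs(z_observed) > critical_value

# P-value approach: tail area beyond the observed statistic.
p_value = two_sided_p_value(z_observed)
reject_by_p_value = p_value < alpha

print(f"critical value = {critical_value:.3f}")
print(f"two-sided p-value = {p_value:.4f}")
print(f"reject H0: {reject_by_critical_value} / {reject_by_p_value}")
```

For the same significance level and the same tail specification, the two approaches always lead to the same decision; the p-value additionally conveys how far into the tail the statistic falls.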

Type I and Type II Errors

Hypothesis testing can result in two types of errors: rejecting a true null hypothesis (Type I) or failing to reject a false null hypothesis (Type II). Understanding these errors guides test design and result interpretation.

Error Types and Implications

Type I error (false positive) occurs when we reject H₀ although it is actually true, so we conclude that an effect exists when none does. The significance level (α) is the probability of a Type I error when H₀ is true, that is, the long-run rate of false rejections across repeated tests of true null hypotheses.

Type II error (false negative) occurs when we fail to reject H₀ when it is actually false. This error misses real effects. The power (1-β) measures the probability of correctly rejecting a false null. The Type II error rate equals β.

These errors involve a fundamental tradeoff: for a fixed sample size, reducing one typically increases the other. More stringent significance levels (smaller α) increase Type II error rates. Larger sample sizes reduce the Type II error rate at a given α, so both error rates can be kept low only by collecting more data.
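
The tradeoff can be made concrete with a small sketch. It assumes an upper-tailed z-test with known standard deviation; the effect size, standard deviation, and sample sizes are hypothetical, and `power_one_sided_z` is an illustrative name.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_one_sided_z(effect, sigma, n, alpha):
    """Power of an upper-tailed z-test of H0: mu = mu0 when the true mean is mu0 + effect."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha)
    shift = effect / (sigma / math.sqrt(n))  # true shift in standard-error units
    return 1.0 - STD_NORMAL.cdf(z_crit - shift)


effect, sigma = 0.5, 1.0  # hypothetical true effect and known standard deviation
for n in (20, 40):
    for alpha in (0.05, 0.01):
        beta = 1.0 - power_one_sided_z(effect, sigma, n, alpha)
        print(f"n={n:2d}  alpha={alpha:.2f}  beta={beta:.3f}")
```

Within each sample size, lowering α raises β; moving from n = 20 to n = 40 lowers β at both significance levels.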

Error Rate Control

The significance level (α) sets the Type I error rate controlled by the test. Common choices include α = 0.05 and α = 0.01. Lower α provides stronger protection against false positives but requires stronger evidence to reject H₀.

Power analysis determines the sample size needed to detect an effect of practical importance with a desired power. Larger effects are easier to detect (higher power for the same sample size). For a given effect size and α, higher power requires larger samples.
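
A common approximation for the required sample size in a two-sided one-sample z-test is n = ((z₁₋α/₂ + z₁₋β) σ / δ)², where δ is the smallest effect of practical importance. The sketch below applies that approximation; it assumes a known standard deviation, ignores the negligible far-tail rejection probability, and uses hypothetical inputs and an illustrative function name.

```python
import math
from statistics import NormalDist


def sample_size_one_sample_z(effect, sigma, alpha=0.05, power=0.80):
    """Approximate n for a two-sided one-sample z-test to reach the requested power."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_power) * sigma / effect) ** 2)


# Hypothetical planning values: sigma = 1, effects of 0.5 and 0.25 standard deviations.
print(sample_size_one_sample_z(effect=0.5, sigma=1.0))
print(sample_size_one_sample_z(effect=0.25, sigma=1.0))
print(sample_size_one_sample_z(effect=0.5, sigma=1.0, power=0.90))
```

Halving the effect size roughly quadruples the required sample, because n scales with 1/δ².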

Balancing Type I and Type II errors depends on consequences of each error type. In medical screening, missing disease (Type II) might be worse than false alarms (Type I). In confirmatory trials, false claims of efficacy (Type I) might be worse than missing effective treatments.

One-Sample Tests

One-sample tests evaluate claims about single population parameters. These tests compare sample statistics to hypothesized population values.

One-Sample Mean Test (Z-test)

The Z-test for a population mean applies when population standard deviation is known or sample size is large. The test statistic equals (x̄ - μ₀) / (σ/√n), following a standard normal distribution under H₀.

Assumptions include random sampling (or as-if random for other designs), known or large-sample estimated population standard deviation, and approximately normal population or large sample size. With small samples from non-normal populations, alternative methods are needed.

The test evaluates whether the sample mean differs significantly from the hypothesized mean. A significant result provides evidence that the population mean differs from the hypothesized value.
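
As a worked illustration of the Z-test, the sketch below computes (x̄ - μ₀) / (σ/√n) and a two-sided p-value. The sample, the hypothesized mean of 100, and the "known" σ of 2 are hypothetical, and the function name is illustrative.

```python
import math
from statistics import NormalDist, fmean


def one_sample_z_test(sample, mu0, sigma):
    """Return (z, two-sided p-value) for H0: mu = mu0 with known sigma."""
    n = len(sample)
    z = (fmean(sample) - mu0) / (sigma / math.sqrt(n))
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical measurements; sigma is assumed known from prior data.
sample = [102.1, 99.8, 101.5, 103.2, 100.9, 102.7, 101.1, 100.4, 102.9]
z, p_value = one_sample_z_test(sample, mu0=100.0, sigma=2.0)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```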

One-Sample Mean Test (t-test)

The t-test for a population mean applies when the population standard deviation is unknown and estimated from data. The test statistic equals (x̄ - μ₀) / (s/√n), where s is the sample standard deviation, and follows a t-distribution with n-1 degrees of freedom under H₀.

Assumptions include random sampling, unknown population standard deviation estimated from data, and approximately normal population distribution. The t-test is fairly robust to normality violations with moderate sample sizes, but severe non-normality requires nonparametric alternatives.

The t-test is more appropriate than the Z-test in most practical situations where population standard deviation is unknown.
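
The t statistic can be computed with the standard library alone, as in the sketch below. Because the standard library has no t-distribution, the decision here uses a critical value copied from a standard t-table (2.262 for a two-sided test at α = 0.05 with 9 degrees of freedom). The data and function name are hypothetical.

```python
import math
from statistics import fmean, stdev


def one_sample_t_statistic(sample, mu0):
    """Return (t, degrees of freedom) for H0: mu = mu0 with sigma estimated by s."""
    n = len(sample)
    s = stdev(sample)  # sample standard deviation (n - 1 in the denominator)
    t = (fmean(sample) - mu0) / (s / math.sqrt(n))
    return t, n - 1


sample = [5.1, 4.9, 5.6, 5.8, 5.3, 5.5, 4.7, 5.9, 5.2, 5.4]  # hypothetical data
t, df = one_sample_t_statistic(sample, mu0=5.0)

T_CRITICAL_DF9 = 2.262  # tabulated two-sided 5% critical value for 9 degrees of freedom
print(f"t = {t:.3f} on {df} df, reject H0: {abs(t) > T_CRITICAL_DF9}")
```

In practice a statistics library would supply the exact p-value; the table lookup is used here only to keep the sketch dependency-free.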

One-Sample Proportion Test

Tests for population proportions use the binomial distribution or its normal approximation. The test statistic equals (p̂ - p₀) / √(p₀(1-p₀)/n), following an approximately standard normal distribution for large samples.

Sample size requirements ensure adequate expected counts under H₀. The rule of thumb requires np₀ ≥ 5 and n(1-p₀) ≥ 5 for normal approximation adequacy.

The proportion test applies to binary outcomes such as yes/no responses, success/failure counts, and categorical outcomes with two categories.
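
A minimal sketch of the large-sample proportion test follows. It checks the np₀ ≥ 5 and n(1-p₀) ≥ 5 rule of thumb before using the normal approximation; the counts (58 successes in 100 trials against p₀ = 0.5) and the function name are hypothetical.

```python
import math
from statistics import NormalDist


def one_sample_proportion_z_test(successes, n, p0):
    """Return (z, two-sided p-value) for H0: p = p0 using the normal approximation."""
    if n * p0 < 5 or n * (1 - p0) < 5:
        raise ValueError("expected counts too small for the normal approximation")
    p_hat = successes / n
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


z, p_value = one_sample_proportion_z_test(successes=58, n=100, p0=0.5)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

Note that the standard error uses p₀ rather than p̂, because the null distribution is computed assuming H₀ is true.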

Two-Sample Tests

Two-sample tests compare parameters across two populations. Common applications include treatment versus control comparisons and group difference evaluations.

Independent Samples t-Test

The independent samples t-test compares means from two independent groups. The test statistic compares the difference in sample means to the standard error of the difference.

Two versions exist: equal variance (pooled) t-test and unequal variance (Welch) t-test. The equal variance version assumes equal population variances. The Welch version does not assume equal variances and is generally preferred.

The test evaluates whether the difference between means is statistically significant. A significant result provides evidence that the population means differ.
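
The two versions differ in their standard error and degrees of freedom. The sketch below computes both statistics, using the Welch-Satterthwaite approximation for the Welch degrees of freedom; the data and function names are hypothetical.

```python
import math
from statistics import fmean, variance


def welch_t(a, b):
    """Return (t, approximate df) without assuming equal variances."""
    va, vb = variance(a) / len(a), variance(b) / len(b)  # squared standard errors
    t = (fmean(a) - fmean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def pooled_t(a, b):
    """Return (t, df) assuming equal population variances."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (fmean(a) - fmean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, na + nb - 2


treatment = [12.1, 11.8, 12.5, 12.9, 12.3, 12.0]  # hypothetical data
control = [11.2, 11.9, 10.8, 11.5, 11.1, 11.6, 11.0]
print("Welch :", welch_t(treatment, control))
print("Pooled:", pooled_t(treatment, control))
```

When group sizes and sample variances are equal, the two statistics coincide; they diverge as the variances or sample sizes become unbalanced.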

Paired Samples t-Test

The paired samples t-test compares means from related samples, such as before/after measurements on the same subjects. The analysis focuses on within-pair differences.

The test statistic equals the mean difference divided by the standard error of the mean difference, following a t-distribution with n-1 degrees of freedom where n is the number of pairs.

Paired designs often provide more powerful comparisons than independent samples designs because they control for individual differences. They require matching or repeated measurements.
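
A paired analysis reduces to a one-sample t-test on the within-pair differences, as the sketch below shows. The before/after values and the function name are hypothetical, and the critical value 2.365 is the tabulated two-sided 5% value for 7 degrees of freedom.

```python
import math
from statistics import fmean, stdev


def paired_t_statistic(before, after):
    """Return (t, df) for H0: mean within-pair difference is zero."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for b, a in zip(before, after)]  # after minus before
    n = len(diffs)
    t = fmean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


before = [140, 152, 138, 147, 160, 155, 149, 143]  # hypothetical measurements
after = [135, 150, 136, 140, 158, 149, 147, 141]
t, df = paired_t_statistic(before, after)

T_CRITICAL_DF7 = 2.365  # tabulated two-sided 5% critical value for 7 degrees of freedom
print(f"t = {t:.3f} on {df} df, reject H0: {abs(t) > T_CRITICAL_DF7}")
```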

Two-Sample Proportion Test

Tests for differences in proportions between independent groups use the two-proportion z-test. The test statistic compares the difference in sample proportions to its standard error.

The test evaluates whether proportions significantly differ between groups. Large samples enable normal approximation; exact methods apply for small samples.
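
The sketch below shows one common form of the two-proportion z-test, in which the standard error under H₀ is computed from the pooled proportion. The pooled standard error is an assumption of this sketch rather than something the text above specifies, and the conversion counts and function name are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Return (z, two-sided p-value) for H0: p1 = p2 using a pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # common proportion estimated under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical A/B test: 120 of 1000 conversions versus 160 of 1000.
z, p_value = two_proportion_z_test(120, 1000, 160, 1000)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```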

Analysis of Variance (ANOVA)

ANOVA extends hypothesis testing to compare means across three or more groups. This approach tests overall group differences while controlling overall Type I error rate.

One-Way ANOVA

One-way ANOVA partitions total variance into between-group and within-group components. The F-statistic compares between-group variability to within-group variability. Significant F indicates group means differ.

The null hypothesis states that all group means are equal. Rejection indicates that at least one group mean differs but does not identify which. Post-hoc comparisons identify specific group differences.
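
The variance partition behind the F-statistic can be sketched directly. The three small groups and the function name are hypothetical; the critical value 5.14 is the tabulated 5% value for an F-distribution with 2 and 6 degrees of freedom.

```python
from statistics import fmean


def one_way_anova_f(groups):
    """Return (F, (df_between, df_within)) for a one-way layout."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = fmean([x for g in groups for x in g])

    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: observations around their own group mean.
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)

    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, (df_between, df_within)


groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # hypothetical data
f_stat, dfs = one_way_anova_f(groups)

F_CRITICAL_2_6 = 5.14  # tabulated 5% critical value for (2, 6) degrees of freedom
print(f"F = {f_stat:.2f} on {dfs} df, reject H0: {f_stat > F_CRITICAL_2_6}")
```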

ANOVA assumptions include independent observations, normally distributed observations within each group, and equal group variances. Diagnostic checks of these assumptions should precede interpretation of the results.

Two-Way ANOVA

Two-way ANOVA examines effects of two categorical factors simultaneously. Main effects test each factor separately. Interaction effects test whether the effect of one factor depends on the level of the other.

Designs can be balanced (equal cell sizes) or unbalanced (unequal cell sizes); a full factorial design includes every combination of factor levels. Analysis extends to incorporate covariates and blocking factors.

Two-way ANOVA provides more efficient designs than separate one-way ANOVAs when both factors are of interest. It also enables testing interactions that single-factor designs miss.

Repeated Measures ANOVA

Repeated measures ANOVA analyzes designs where the same subjects are measured under multiple conditions. This approach accounts for within-subject correlation and typically provides more powerful comparisons than between-subject designs.

Assumptions include sphericity (equal variances of the differences between every pair of conditions). Violations inflate the Type I error rate of the F-test. Corrections (Greenhouse-Geisser, Huynh-Feldt) adjust the degrees of freedom to compensate.

Chi-Square Tests

Chi-square tests evaluate categorical data, testing associations and goodness-of-fit to specified distributions.

Test of Independence

The test of independence evaluates whether two categorical variables are associated. It compares observed cell counts to expected counts under independence. Significant chi-square indicates association.

Expected counts under independence equal row total times column total divided by overall total. The chi-square statistic sums (observed - expected)² / expected across all cells.

The test applies to two-way contingency tables with any number of rows and columns, with (rows - 1)(columns - 1) degrees of freedom. Expected counts should be at least 5 in most cells for the chi-square approximation to be valid.
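
The computation can be sketched for a small table. The 2×2 counts and the function name are hypothetical, and no continuity correction is applied. The p-value line relies on the fact that a chi-square variable with 1 degree of freedom is a squared standard normal, so it is valid only for 2×2 tables.

```python
import math
from statistics import NormalDist


def chi_square_independence(table):
    """Return (statistic, df, expected counts) for a two-way table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected


table = [[30, 20], [20, 30]]  # hypothetical observed counts
statistic, df, expected = chi_square_independence(table)

# Valid for df == 1 only: P(chi2 > x) = 2 * (1 - Phi(sqrt(x))).
p_value = 2.0 * (1.0 - NormalDist().cdf(math.sqrt(statistic)))
print(f"chi-square = {statistic:.2f} on {df} df, p = {p_value:.4f}")
```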

Goodness-of-Fit Test

The goodness-of-fit test evaluates whether data follow a specified distribution. Observed frequencies in categories are compared to expected frequencies under the hypothesized distribution.

Applications include testing Hardy-Weinberg equilibrium in genetics, testing market basket distributions, and testing distributional assumptions in other contexts.
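
A goodness-of-fit computation can be sketched as follows. It assumes the hypothesized category probabilities are fully specified in advance, so the degrees of freedom equal the number of categories minus one; when parameters are estimated from the data, as in typical Hardy-Weinberg tests, the degrees of freedom are reduced further. The counts, probabilities, and function name are hypothetical. The p-value uses the closed form exp(-x/2), which holds only for 2 degrees of freedom.

```python
import math


def chi_square_goodness_of_fit(observed, expected_probs):
    """Return (statistic, df) comparing observed counts with hypothesized probabilities."""
    n = sum(observed)
    expected = [n * p for p in expected_probs]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, len(observed) - 1


observed = [50, 30, 20]           # hypothetical category counts
expected_probs = [0.4, 0.4, 0.2]  # hypothesized distribution
statistic, df = chi_square_goodness_of_fit(observed, expected_probs)

# Valid for df == 2 only: the chi-square upper tail is exp(-x / 2).
p_value = math.exp(-statistic / 2.0)
print(f"chi-square = {statistic:.2f} on {df} df, p = {p_value:.4f}")
```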

Nonparametric Tests

Nonparametric tests make fewer distributional assumptions than parametric tests. They are appropriate when assumptions are violated or when data are ordinal or ranked.

Wilcoxon Signed-Rank Test

The Wilcoxon signed-rank test is the nonparametric alternative to the one-sample and paired t-tests. It tests whether the median of the differences equals zero, using the signed ranks of the absolute differences.

The test requires a symmetric distribution of differences but not normality. It provides valid inference when the normality assumption fails and is nearly as powerful as the t-test when normality holds.
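
For small samples the signed-rank test can be carried out exactly by enumerating every possible assignment of signs to the ranks, since each assignment is equally likely under H₀. The sketch below does this; zero differences are dropped, tied absolute differences share their average rank, and the data and function names are hypothetical. Enumeration costs 2ⁿ steps, so it is only practical for small n.

```python
from itertools import product


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def wilcoxon_signed_rank(diffs):
    """Return (W+, W-, exact two-sided p-value) for H0: differences symmetric about zero."""
    nonzero = [d for d in diffs if d != 0]
    ranks = average_ranks([abs(d) for d in nonzero])
    w_plus = sum(r for d, r in zip(nonzero, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(nonzero, ranks) if d < 0)
    observed, total = min(w_plus, w_minus), sum(ranks)

    # Count sign assignments whose smaller rank sum is at least as extreme as observed.
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        plus = sum(r for s, r in zip(signs, ranks) if s)
        if min(plus, total - plus) <= observed:
            extreme += 1
    return w_plus, w_minus, extreme / 2 ** len(ranks)


diffs = [1.2, -0.4, 2.3, 0.8, 1.9, -0.6, 1.5, 2.8]  # hypothetical paired differences
w_plus, w_minus, p_value = wilcoxon_signed_rank(diffs)
print(f"W+ = {w_plus}, W- = {w_minus}, exact p = {p_value:.4f}")
```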

Mann-Whitney U Test

The Mann-Whitney U test is the nonparametric alternative to the independent samples t-test. It tests whether one distribution is stochastically greater than another using rank sums.

The test compares ranks across groups rather than means. It is appropriate when data are ordinal or when normality assumptions are seriously violated.
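
The U statistic has a direct interpretation: it counts the pairs in which an observation from one group exceeds an observation from the other, with ties counted as one half. The sketch below computes it that way and adds a large-sample normal approximation without tie or continuity correction. The data and function names are hypothetical, and for samples this small an exact table would normally be preferred over the approximation.

```python
import math
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Return (U_x, U_y), where U_x counts pairs with x > y (ties count one half)."""
    u_x = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    return u_x, len(x) * len(y) - u_x


def mann_whitney_normal_approx(u, n1, n2):
    """Return (z, two-sided p-value) from the normal approximation, assuming no ties."""
    mean = n1 * n2 / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean) / sd
    return z, 2.0 * (1.0 - NormalDist().cdf(abs(z)))


x = [12.1, 11.8, 12.5, 12.9, 12.3, 12.0]        # hypothetical data
y = [11.2, 11.9, 10.8, 11.5, 11.1, 11.6, 11.0]
u_x, u_y = mann_whitney_u(x, y)
z, p_value = mann_whitney_normal_approx(u_x, len(x), len(y))
print(f"U_x = {u_x}, U_y = {u_y}, z = {z:.3f}, p = {p_value:.4f}")
```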

Kruskal-Wallis Test

The Kruskal-Wallis test is the nonparametric alternative to one-way ANOVA. It tests whether distributions differ across groups using rank-based methods.

The test extends the Mann-Whitney logic to multiple groups. It is appropriate when ANOVA assumptions are violated or when data are ordinal.
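
The Kruskal-Wallis statistic is H = 12 / (N(N+1)) · Σ Rᵢ²/nᵢ - 3(N+1), where Rᵢ is the rank sum of group i in the pooled ranking. The sketch below computes H under the simplifying assumption of no tied values (no tie correction is applied). The data and function name are hypothetical, and the p-value uses the chi-square approximation with the closed form exp(-H/2), which holds only for three groups (2 degrees of freedom) and is rough for samples this small.

```python
import math


def kruskal_wallis_h(groups):
    """Return (H, df) using pooled ranks; this sketch assumes no tied values."""
    pooled = sorted(x for g in groups for x in g)
    if len(set(pooled)) != len(pooled):
        raise ValueError("this sketch assumes no tied values")
    rank_of = {value: i + 1 for i, value in enumerate(pooled)}
    n_total = len(pooled)
    # Sum over groups of (rank sum squared) / (group size).
    sum_term = sum(sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups)
    h = 12.0 / (n_total * (n_total + 1)) * sum_term - 3.0 * (n_total + 1)
    return h, len(groups) - 1


groups = [[2.1, 3.4, 1.9], [4.8, 5.1, 4.2], [7.3, 6.8, 8.0]]  # hypothetical data
h, df = kruskal_wallis_h(groups)

# Valid for df == 2 only: chi-square upper tail is exp(-x / 2).
p_value = math.exp(-h / 2.0)
print(f"H = {h:.2f} on {df} df, approximate p = {p_value:.4f}")
```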

Key Takeaways

Hypothesis testing formalizes evidence evaluation for population parameters
Type I (false positive) and Type II (false negative) errors involve a fundamental tradeoff
Various tests address different data structures and hypotheses
ANOVA extends comparison to three or more groups
Chi-square tests evaluate categorical data associations
Nonparametric tests provide alternatives when parametric assumptions fail