Statistical Testing: Hypothesis, t-tests, Chi-square

Module 1: Foundations

Why This Matters

Statistical testing is how we make decisions under uncertainty. Instead of guessing whether a drug works, a feature helps, or a difference is real, we quantify the evidence and let data guide our conclusions.

Definition: Statistical Inference

The process of drawing conclusions about a population based on a sample. Since we rarely have access to entire populations, we use sample statistics (means, proportions, variances) to estimate population parameters. Statistical inference quantifies the uncertainty inherent in this process through confidence intervals and hypothesis tests.


The Hypothesis Testing Framework

The Scientific Approach to Data

Definition: Null Hypothesis (H₀)

A statement of "no effect" or "no difference" that serves as the default assumption in hypothesis testing. The null hypothesis is assumed true until the data provide sufficient evidence to reject it. Formally, H₀ specifies a particular value for the population parameter (e.g., μ = 0 or μ₁ − μ₂ = 0).

Step 1: State hypotheses
        H0 (null): "There is no effect / no difference"
        H1 (alternative): "There IS an effect / difference"

Step 2: Collect data

Step 3: Calculate test statistic
        (How far is our data from what H0 predicts?)

Step 4: Calculate p-value
        (How likely is data at least this extreme if H0 is true?)

Step 5: Make decision
        If p-value < alpha -> Reject H0
        If p-value >= alpha -> Fail to reject H0

Visual: Rejection Regions

Distribution under H0 (null hypothesis is true):

                    H0 is true
                        |
        Rejection      |      Rejection
         Region        |       Region
    |:::::::::|--------+--------|:::::::::|
   -3       -2        0        2         3
              -1.96          1.96
                 ^              ^
                 |              |
              Critical      Critical
              Value         Value
              (alpha=0.05)

    If test statistic falls in shaded area -> Reject H0
    If test statistic falls in white area -> Fail to reject H0
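
The ±1.96 cutoffs in the diagram come from the standard normal distribution. The following is a minimal sketch, assuming a two-tailed z-test at alpha = 0.05; the helper name `in_rejection_region` is illustrative and not part of any library.

```python
from scipy import stats

alpha = 0.05

# Two-tailed test: split alpha evenly between the two tails
z_crit = stats.norm.ppf(1 - alpha / 2)

def in_rejection_region(z, z_crit=z_crit):
    """Return True if the test statistic falls in either tail."""
    return abs(z) > z_crit

print(f"Critical values: -{z_crit:.2f} and +{z_crit:.2f}")
print(in_rejection_region(2.5))   # beyond the upper critical value
print(in_rejection_region(-0.7))  # inside the central (white) area
```

For a one-tailed test the whole alpha would sit in one tail, so the cutoff would be `stats.norm.ppf(1 - alpha)` instead.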

Type I and Type II Errors

Definition: Type I Error (False Positive)

Rejecting the null hypothesis when it is actually true. The probability of making a Type I error is denoted by α (alpha) and is set by the researcher before the test (typically 0.05). If α = 0.05, we accept a 5% chance of concluding an effect exists when it does not.

Definition: Type II Error (False Negative)

Failing to reject the null hypothesis when it is actually false. The probability of making a Type II error is denoted by β. Statistical power is defined as 1 − β: the probability of correctly detecting a real effect. For a fixed effect size and α, lower β (higher power) requires a larger sample size.

                          Actual Reality
                    H0 True      |    H0 False
                  (No Effect)    |  (Effect Exists)
              +------------------+------------------+
 Decision:    |                  |                  |
 Reject H0    |  TYPE I ERROR    |  CORRECT         |
 (Say effect  |  (False Positive)|  (True Positive) |
  exists)     |  Alpha = 0.05    |  Power = 1-beta  |
              +------------------+------------------+
 Fail to      |  CORRECT         |  TYPE II ERROR   |
 Reject H0    |  (True Negative) |  (False Negative)|
 (Say no      |                  |  Beta            |
  effect)     |                  |                  |
              +------------------+------------------+

Tradeoff:
  Increasing alpha (0.05 -> 0.10):
    + More power to detect effects
    - More false positives
  
  Decreasing alpha (0.05 -> 0.01):
    + Fewer false positives
    - More missed effects (lower power)
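
The tradeoff above can be put into numbers. This is a rough sketch, assuming a two-sided, equal-variance two-sample t-test with d = 0.5 and 64 subjects per group (both values chosen for illustration); `two_sample_power` is a hypothetical helper built on SciPy's noncentral t distribution.

```python
import numpy as np
from scipy import stats

def two_sample_power(d, n_per_group, alpha):
    """Power of a two-sided, equal-n, equal-variance two-sample t-test."""
    df = 2 * n_per_group - 2
    nc = d * np.sqrt(n_per_group / 2)          # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # Probability the statistic lands in either rejection tail when H1 is true
    return stats.nct.sf(t_crit, df, nc) + stats.nct.cdf(-t_crit, df, nc)

powers = {a: two_sample_power(d=0.5, n_per_group=64, alpha=a)
          for a in (0.01, 0.05, 0.10)}

for a, p in powers.items():
    print(f"alpha={a:.2f} -> power={p:.3f}, beta={1 - p:.3f}")
```

With everything else held fixed, raising alpha raises power and lowers beta, which is the tradeoff listed above.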

The p-value: What It Actually Means

Definition: p-value

The probability of observing a test statistic at least as extreme as the one computed from the sample, assuming the null hypothesis is true. It is NOT the probability that H₀ is true, and it is NOT the probability that the effect is real. A small p-value indicates that the observed data would be very unlikely under H₀, providing evidence against it.

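Formal Definition of p-value

$$p\text{-value} = P(T \geq t_{\text{obs}} \mid H_0)$$

Here,

  • $T$ = the test statistic, treated as a random variable
  • $t_{\text{obs}}$ = the value of the test statistic computed from the sample
  • $H_0$ = the null hypothesis

This form is for an upper-tailed test; for a two-tailed test, use $P(|T| \geq |t_{\text{obs}}| \mid H_0)$.

Common Misconceptions: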
  WRONG: "p = 0.03 means there's a 3% chance H0 is true"
  RIGHT: "If H0 were true, there's a 3% chance of seeing data this extreme"

  WRONG: "p < 0.05 means the effect is large"
  RIGHT: "p < 0.05 means data this extreme would be unlikely under H0"
  (A tiny effect can be significant with large n)

  WRONG: "p > 0.05 means no effect exists"
  RIGHT: "p > 0.05 means we don't have enough evidence to reject H0"
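
To make the definition concrete, the p-value can be computed by hand from the t distribution and checked against SciPy. This is a small sketch; the sample values are made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical sample; the values are made up for illustration
sample = np.array([171.2, 168.5, 174.1, 169.9, 176.3, 172.8, 170.4, 173.6])
mu_0 = 170.0

n = sample.size
t_obs = (sample.mean() - mu_0) / (sample.std(ddof=1) / np.sqrt(n))

# Two-sided p-value: P(|T| >= |t_obs|) under H0, with T ~ t(n - 1)
p_manual = 2 * stats.t.sf(abs(t_obs), df=n - 1)

# Cross-check against SciPy's built-in test
t_scipy, p_scipy = stats.ttest_1samp(sample, mu_0)
print(f"t_obs = {t_obs:.4f}, manual p = {p_manual:.4f}, scipy p = {p_scipy:.4f}")
```

The survival function `stats.t.sf` gives the upper-tail area, so doubling it covers both tails.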

Statistical vs practical significance: A p-value measures how surprised you should be if H₀ were true. It does NOT measure the size or importance of an effect. With n = 1,000,000 per group, a 0.01 cm height difference between groups can yield p < 0.001 when the within-group spread is small (for example, a standard deviation of about 2 cm). Always report effect sizes (Cohen's d, Cramér's V) alongside p-values to convey practical importance.
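
A quick numerical check of this point, working from summary statistics rather than raw data. The group size, the 2 cm standard deviation, and the 0.01 cm gap are all illustrative assumptions.

```python
from scipy import stats

# Illustrative assumptions: 1,000,000 people per group, SD of 2 cm,
# and group means that differ by only 0.01 cm
n = 1_000_000
mean_a, mean_b, sd = 170.00, 170.01, 2.0

# Welch's t-test computed directly from means, SDs, and group sizes
t_stat, p_value = stats.ttest_ind_from_stats(
    mean_b, sd, n, mean_a, sd, n, equal_var=False
)
cohens_d = (mean_b - mean_a) / sd

print(f"t = {t_stat:.2f}, p = {p_value:.5f}")
print(f"Cohen's d = {cohens_d:.4f}")
```

The p-value clears the usual 0.05 threshold easily, yet Cohen's d is far below the 0.2 level conventionally labeled "small".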


One-Sample t-test

When to Use

Compare a sample mean to a known population value.

Example: Is the average height of students different from 170cm?

  H0: mu = 170cm (population mean)
  H1: mu != 170cm (two-tailed)

Formula

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

Intuition:
  t = (Observed difference) / (Expected variability)
  
  Large t -> The difference is large relative to variability
  Small t -> The difference could easily be due to chance

Effect Size: Cohen's d

Cohen's d Effect Size

$$d = \frac{\bar{x} - \mu_0}{s}$$

Here,

  • $\bar{x}$ = sample mean
  • $\mu_0$ = hypothesized population mean under H0
  • $s$ = sample standard deviation

|d| Interpretation:
  0.2  -> Small effect
  0.5  -> Medium effect
  0.8  -> Large effect

Why effect size matters:
  With a large enough n (tens of thousands), even a 0.1cm difference can be "significant"
  Effect size tells you if the difference PRACTICALLY matters

Complete Example

One-Sample t-test: Testing Student Heights

import numpy as np
from scipy import stats

# Sample data: heights of 30 students
np.random.seed(42)
heights = np.random.normal(loc=172, scale=8, size=30)

print(f"Sample mean: {heights.mean():.2f} cm")
print(f"Sample std:  {heights.std(ddof=1):.2f} cm")
print(f"Sample size: {len(heights)}")

# One-sample t-test: Is mean different from 170cm?
mu_0 = 170
t_stat, p_value = stats.ttest_1samp(heights, mu_0)

print(f"\nH0: mu = {mu_0} cm")
print(f"H1: mu != {mu_0} cm")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value:     {p_value:.4f}")

# Effect size
cohens_d = (heights.mean() - mu_0) / heights.std(ddof=1)
print(f"Cohen's d:   {cohens_d:.4f}")

# Decision
alpha = 0.05
if p_value < alpha:
    print(f"\nResult: Reject H0 (p < {alpha})")
    print("Conclusion: Student heights differ from 170cm")
else:
    print(f"\nResult: Fail to reject H0 (p >= {alpha})")
    print("Conclusion: No significant difference from 170cm")

Assumptions

1. Independence: Observations are independent
2. Normality: Data is approximately normally distributed
   - Check: Shapiro-Wilk test, Q-Q plot
   - Robust: t-test is robust to mild non-normality for n > 30
3. Continuous: The dependent variable is continuous

The Central Limit Theorem saves you: Even if the underlying data is not normal, the sampling distribution of the mean approaches normality as n increases (CLT). For n > 30, the t-test is robust to moderate departures from normality. For smaller samples, check normality with a Shapiro-Wilk test or Q-Q plot before proceeding.
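
The CLT claim can be checked with a small simulation. This sketch assumes an exponential population (chosen only because it is strongly skewed) and a sample size of 40; it also shows a Shapiro-Wilk check on one small sample.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Strongly right-skewed population (exponential), clearly not normal
raw = rng.exponential(scale=1.0, size=5000)

# Sampling distribution of the mean: 2,000 samples of size n = 40
sample_means = rng.exponential(scale=1.0, size=(2000, 40)).mean(axis=1)

skew_raw = stats.skew(raw)
skew_means = stats.skew(sample_means)
print(f"Skewness of raw data:     {skew_raw:.2f}")
print(f"Skewness of sample means: {skew_means:.2f}")

# Shapiro-Wilk on one small sample (H0: the data are normally distributed)
w_stat, p_shapiro = stats.shapiro(raw[:30])
print(f"Shapiro-Wilk on 30 raw values: W={w_stat:.3f}, p={p_shapiro:.4f}")
```

The sample means are much closer to symmetric than the raw values, which is why the t-test tolerates moderate non-normality at larger n.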


Two-Sample t-test

Independent Samples

Compare means of two independent groups.

Example: Do men and women have different average heights?

  H0: mu_men = mu_women
  H1: mu_men != mu_women

Formula (Welch's t-test, unequal variances)

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

Degrees of freedom (Welch-Satterthwaite):

$$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$$

Here,

  • $s_1, s_2$ = sample standard deviations of the two groups
  • $n_1, n_2$ = sample sizes of the two groups

Why Welch's over Student's t-test:
  - Student's t-test assumes equal variances (homoscedasticity)
  - Welch's works regardless of variance equality
  - Use Welch's by default (safer)
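
As a sanity check on the Welch formulas, the statistic and the Welch-Satterthwaite degrees of freedom can be computed by hand and compared with SciPy. This is a sketch on simulated data; the group parameters are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group1 = rng.normal(loc=175, scale=7, size=30)   # illustrative data
group2 = rng.normal(loc=163, scale=12, size=20)  # different spread and size

n1, n2 = group1.size, group2.size
v1 = group1.var(ddof=1) / n1   # s1^2 / n1
v2 = group2.var(ddof=1) / n2   # s2^2 / n2

t_manual = (group1.mean() - group2.mean()) / np.sqrt(v1 + v2)
df_manual = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
p_manual = 2 * stats.t.sf(abs(t_manual), df_manual)

# SciPy's Welch test for comparison
t_scipy, p_scipy = stats.ttest_ind(group1, group2, equal_var=False)
print(f"manual: t={t_manual:.4f}, df={df_manual:.2f}, p={p_manual:.3g}")
print(f"scipy:  t={t_scipy:.4f}, p={p_scipy:.3g}")
```

The Welch degrees of freedom are usually not an integer and always lie between the smaller of n1 − 1 and n2 − 1 and the pooled value n1 + n2 − 2.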

Complete Example

# Two independent groups
np.random.seed(42)
men_heights = np.random.normal(loc=175, scale=7, size=30)
women_heights = np.random.normal(loc=163, scale=6, size=30)

print(f"Men:   mean={men_heights.mean():.2f}, std={men_heights.std(ddof=1):.2f}")
print(f"Women: mean={women_heights.mean():.2f}, std={women_heights.std(ddof=1):.2f}")

# Independent t-test: equal_var=False selects Welch's (SciPy's default is Student's)
t_stat, p_value = stats.ttest_ind(men_heights, women_heights, equal_var=False)

print(f"\nt-statistic: {t_stat:.4f}")
print(f"p-value:     {p_value:.4f}")

# Effect size
pooled_std = np.sqrt((men_heights.std(ddof=1)**2 + women_heights.std(ddof=1)**2) / 2)
cohens_d = (men_heights.mean() - women_heights.mean()) / pooled_std
print(f"Cohen's d:   {cohens_d:.4f}")

if p_value < 0.05:
    print("Result: Significant difference in heights")

Paired Samples

Compare means when the same subjects are measured twice.

Example: Does a training program improve test scores?

  H0: mean_diff = 0 (no improvement)
  H1: mean_diff > 0 (improvement)

# Paired samples: same students before/after training
np.random.seed(42)
before = np.random.normal(loc=65, scale=10, size=25)
after = before + np.random.normal(loc=5, scale=8, size=25)  # Training effect

print(f"Before: mean={before.mean():.2f}")
print(f"After:  mean={after.mean():.2f}")

# Paired t-test
t_stat, p_value = stats.ttest_rel(after, before)
# Alternative: one-tailed for improvement
t_stat_one, p_value_one = stats.ttest_rel(after, before, alternative='greater')

print(f"\nPaired t-test:")
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value (two-tailed): {p_value:.4f}")
print(f"p-value (one-tailed, improvement): {p_value_one:.4f}")
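
A paired t-test is the same calculation as a one-sample t-test on the per-subject differences, tested against zero. The sketch below checks that equivalence on simulated scores (the numbers are illustrative).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
before = rng.normal(loc=65, scale=10, size=25)         # illustrative scores
after = before + rng.normal(loc=5, scale=8, size=25)   # simulated improvement

# Per-subject change
diff = after - before

t_paired, p_paired = stats.ttest_rel(after, before)
t_onesample, p_onesample = stats.ttest_1samp(diff, 0.0)

print(f"paired:     t={t_paired:.4f}, p={p_paired:.4f}")
print(f"one-sample: t={t_onesample:.4f}, p={p_onesample:.4f}")
```

This is also why the normality assumption for a paired test applies to the differences, not to the before and after scores separately.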

Chi-Square Test

Test of Independence

Determine if two categorical variables are related.

Example: Is there an association between gender and preferred programming language?

  H0: Gender and language preference are independent
  H1: Gender and language preference are related

Formula

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Expected frequency formula:
  E = (Row Total * Column Total) / Grand Total

Intuition:
  Large chi2 -> Observed differs greatly from Expected
  Small chi2 -> Observed is close to Expected
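
The expected-frequency formula and the statistic can be worked through by hand. This sketch uses a hypothetical 2x3 table of counts and compares the result with SciPy.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x3 table of observed counts (rows: groups, columns: categories)
observed = np.array([[20, 15, 5],
                     [10, 25, 25]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.sum()

# E = (row total * column total) / grand total, for every cell at once
expected = np.outer(row_totals, col_totals) / grand_total
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof)

# correction=False disables Yates' continuity correction (only used when dof == 1)
chi2_scipy, p_scipy, dof_scipy, expected_scipy = stats.chi2_contingency(
    observed, correction=False
)
print(f"manual: chi2={chi2_manual:.4f}, p={p_manual:.4f}, dof={dof}")
print(f"scipy:  chi2={chi2_scipy:.4f}, p={p_scipy:.4f}, dof={dof_scipy}")
```

The degrees of freedom are (rows − 1) × (columns − 1), which is 2 for this table.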

Complete Example

import pandas as pd

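# Raw observations, one row per respondent (toy data: with only 8 rows the expected
# counts fall below 5, so treat the resulting p-value as illustrative only)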
data = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Male', 'Male',
               'Female', 'Female', 'Female', 'Female'],
    'Language': ['Python', 'Java', 'Python', 'Python',
                 'Java', 'Python', 'Python', 'Java']
})

# Create contingency table
contingency = pd.crosstab(data['Gender'], data['Language'])
print("Contingency Table:")
print(contingency)

# Expected frequencies
# Note: for 2x2 tables, chi2_contingency applies Yates' continuity correction by default
print("\nExpected Frequencies (if independent):")
chi2, p_value, dof, expected = stats.chi2_contingency(contingency)
print(pd.DataFrame(expected, index=contingency.index, columns=contingency.columns))

# Test results
print(f"\nChi-square statistic: {chi2:.4f}")
print(f"p-value:             {p_value:.4f}")
print(f"Degrees of freedom:  {dof}")

if p_value < 0.05:
    print("Result: Gender and language preference are related")
else:
    print("Result: No significant association found")

Effect Size: Cramér's V

Cramér's V Effect Size

$$V = \sqrt{\frac{\chi^2}{n \cdot (k - 1)}}$$

Here,

  • $\chi^2$ = chi-square statistic
  • $n$ = total number of observations
  • $k$ = the smaller of the number of rows and columns in the contingency table

|V| Interpretation:
  0.1  -> Small association
  0.3  -> Medium association
  0.5  -> Large association

# Cramér's V
n = len(data)
k = min(contingency.shape)
cramers_v = np.sqrt(chi2 / (n * (k - 1)))
print(f"Cramér's V: {cramers_v:.4f}")

ANOVA (Analysis of Variance)

Why Not Use Multiple t-tests?

Definition: Multiple Comparisons Problem

The phenomenon where conducting multiple hypothesis tests simultaneously inflates the family-wise error rate (FWER). With k independent tests at α = 0.05 each, the probability of at least one false positive is $1 - (1 - \alpha)^k$. For 3 tests: $1 - 0.95^3 = 14.3\%$. For the 45 pairwise tests among 10 groups: $1 - 0.95^{45} = 90.1\%$. ANOVA addresses this by testing all groups simultaneously with a single test.

Comparing 3 groups: A, B, C

Multiple t-tests approach:
  A vs B -> test 1
  A vs C -> test 2
  B vs C -> test 3

  Problem: With alpha=0.05 per test:
  P(at least one false positive) = 1 - (1-0.05)^3 = 0.143
  With 10 groups (45 tests): P(at least one false) = 0.901!

ANOVA approach:
  Single test: Are ANY groups different?
  Then post-hoc: WHICH groups are different?

One-Way ANOVA

$$F = \frac{\text{Between-Group Variance}}{\text{Within-Group Variance}} = \frac{MS_{\text{between}}}{MS_{\text{within}}}$$

Intuition:
  F = Signal / Noise

  Signal: How much do group means differ?
  Noise: How much do individuals vary within groups?

  Large F -> Groups differ more than individuals vary
  Small F -> Group differences could be chance

  F = 1 -> Between-group variation = Within-group variation
  F > 1 -> Between-group variation > Within-group variation
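
The signal-to-noise reading of F can be made explicit by computing the mean squares by hand. This sketch uses three simulated groups (parameters chosen for illustration) and compares the result with SciPy.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
groups = [rng.normal(loc=m, scale=10, size=30) for m in (75, 80, 72)]  # illustrative

k = len(groups)
n_total = sum(g.size for g in groups)
grand_mean = np.concatenate(groups).mean()

# Signal: spread of group means around the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Noise: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual: F={f_manual:.4f}, p={p_manual:.4f}")
print(f"scipy:  F={f_scipy:.4f}, p={p_scipy:.4f}")
```

The two degrees-of-freedom values, k − 1 and N − k, are what turn the sums of squares into the mean squares in the formula.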

ANOVA assumptions matter more than you think: ANOVA assumes (1) independence, (2) normality within each group, and (3) homogeneity of variances (homoscedasticity). Violating assumption 3 can inflate Type I error rates, especially when group sizes are unequal. Levene's test checks this assumption. If it is violated, use Welch's ANOVA (note that stats.f_oneway assumes equal variances by default; statsmodels' anova_oneway with use_var="unequal" is one implementation) or the non-parametric Kruskal-Wallis test.
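
One possible way to wire the variance check into an analysis is sketched below. The groups are simulated with deliberately unequal spread, and the 0.05 cutoff and the fallback to Kruskal-Wallis are illustrative choices rather than a fixed rule.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
# Illustrative groups: the third has a much larger spread than the others
g1 = rng.normal(loc=75, scale=3, size=50)
g2 = rng.normal(loc=80, scale=3, size=50)
g3 = rng.normal(loc=72, scale=15, size=50)

# Levene's test, H0: all groups share the same variance
w_stat, p_levene = stats.levene(g1, g2, g3)
print(f"Levene W={w_stat:.3f}, p={p_levene:.4g}")

if p_levene < 0.05:
    # Variances look unequal: fall back to the rank-based Kruskal-Wallis test
    stat, p_value = stats.kruskal(g1, g2, g3)
    chosen = "Kruskal-Wallis"
else:
    stat, p_value = stats.f_oneway(g1, g2, g3)
    chosen = "One-way ANOVA"

print(f"{chosen}: statistic={stat:.3f}, p={p_value:.4g}")
```

By default `stats.levene` centers each group on its median, which makes the check less sensitive to non-normal data.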

# Three groups: three different teaching methods
np.random.seed(42)
method_a = np.random.normal(loc=75, scale=10, size=30)
method_b = np.random.normal(loc=80, scale=10, size=30)
method_c = np.random.normal(loc=72, scale=10, size=30)

print(f"Method A: mean={method_a.mean():.2f}")
print(f"Method B: mean={method_b.mean():.2f}")
print(f"Method C: mean={method_c.mean():.2f}")

# One-way ANOVA
f_stat, p_value = stats.f_oneway(method_a, method_b, method_c)

print(f"\nF-statistic: {f_stat:.4f}")
print(f"p-value:     {p_value:.4f}")

if p_value < 0.05:
    print("Result: At least one method differs significantly")

Post-Hoc: Tukey's HSD

Definition: Tukey's Honestly Significant Difference

A post-hoc test used after a significant ANOVA to determine which specific group pairs differ. It controls the family-wise error rate across all pairwise comparisons by adjusting the critical value based on the studentized range distribution. The test computes adjusted p-values for each pair that account for the multiple comparisons.

from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Combine data
all_scores = np.concatenate([method_a, method_b, method_c])
groups = ['A']*30 + ['B']*30 + ['C']*30

# Tukey's HSD
tukey = pairwise_tukeyhsd(all_scores, groups, alpha=0.05)
print(tukey)

Non-Parametric Tests

When the normality assumption fails, use distribution-free alternatives.

Parametric       Non-Parametric                  When to Use
---------------  ------------------------------  ----------------------------------
One-sample t     Wilcoxon signed-rank            Small sample, non-normal
Independent t    Mann-Whitney U                  Non-normal or ordinal data
Paired t         Wilcoxon signed-rank (paired)   Paired, non-normal differences
One-way ANOVA    Kruskal-Wallis                  Non-normal, 3+ groups
Pearson r        Spearman rho                    Non-linear monotonic relationship

Definition: Non-Parametric Test

A statistical test that does not assume a specific distribution (e.g., normal) for the data. Instead of testing parameters (means, variances), non-parametric tests typically operate on ranks or signs of the data. They are more robust to outliers and non-normality but have less statistical power when the parametric assumptions are met.

# Mann-Whitney U (non-parametric alternative to the independent t-test)
stat, p_value = stats.mannwhitneyu(method_a, method_b, alternative='two-sided')
print(f"Mann-Whitney U: {stat:.4f}, p-value: {p_value:.4f}")

# Kruskal-Wallis (non-parametric alternative to one-way ANOVA)
stat, p_value = stats.kruskal(method_a, method_b, method_c)
print(f"Kruskal-Wallis H: {stat:.4f}, p-value: {p_value:.4f}")

Multiple Comparisons Problem

The Problem

With k groups and alpha = 0.05:
  Number of pairwise comparisons = k * (k-1) / 2

  k=3:  3 comparisons  -> P(false positive) = 14.3%
  k=5: 10 comparisons  -> P(false positive) = 40.1%
  k=10: 45 comparisons -> P(false positive) = 90.1%
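
The percentages in this table follow from the formula 1 − (1 − alpha)^m. The short sketch below reproduces them, under the simplifying assumption that the tests are independent; `family_wise_error_rate` is an illustrative helper name.

```python
from math import comb

def family_wise_error_rate(n_tests, alpha=0.05):
    """P(at least one false positive) across n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests

for k in (3, 5, 10):
    m = comb(k, 2)  # number of pairwise comparisons, k * (k - 1) / 2
    print(f"k={k:>2}: {m:>2} comparisons -> FWER = {family_wise_error_rate(m):.1%}")
```

Pairwise comparisons that share groups are not fully independent, so these figures are best read as approximations.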

Solutions

Definition: Bonferroni Correction

A conservative method for controlling the family-wise error rate when performing multiple tests. Each p-value is multiplied by the number of tests and capped at 1: $p_{\text{corrected}} = \min(p \times m, 1)$, where $m$ is the number of comparisons. This guarantees that the family-wise error rate ≤ α, but can be overly conservative (low power) when m is large.

Definition: False Discovery Rate (FDR)

The expected proportion of false positives among all rejected hypotheses. The Benjamini-Hochberg procedure controls FDR by ordering p-values from smallest to largest and comparing each to $\frac{i}{m} \cdot \alpha$, where $i$ is the rank and $m$ is the total number of tests. FDR control is less conservative than Bonferroni and provides more power to detect real effects.

from statsmodels.stats.multitest import multipletests

# Raw p-values from multiple tests
p_values = [0.01, 0.04, 0.03, 0.85, 0.12, 0.02, 0.06]

# Bonferroni correction (most conservative)
reject, pvals_corrected, _, _ = multipletests(p_values, method='bonferroni')
print("Bonferroni corrected p-values:", pvals_corrected)

# False Discovery Rate (FDR) - less conservative
reject_fdr, pvals_fdr, _, _ = multipletests(p_values, method='fdr_bh')
print("FDR corrected p-values:", pvals_fdr)

Bonferroni: p_corrected = p * number_of_tests
  Pro: Controls family-wise error rate
  Con: Very conservative (misses real effects)

FDR (Benjamini-Hochberg): Controls false discovery rate
  Pro: More power to detect real effects
  Con: Allows more false positives
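
To show what the two corrections do under the hood, here is a minimal NumPy-only sketch of both adjustments. The function names are illustrative, and this is a teaching sketch rather than a replacement for a library routine.

```python
import numpy as np

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    p = np.asarray(p_values, dtype=float)
    return np.minimum(p * p.size, 1.0)

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                         # indices that sort p ascending
    scaled = p[order] * m / np.arange(1, m + 1)   # p_(i) * m / i
    # Enforce monotonicity: each adjusted value is the minimum of itself
    # and all adjusted values at higher ranks
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted

p_values = [0.01, 0.04, 0.03, 0.85, 0.12, 0.02, 0.06]
p_bonf = bonferroni(p_values)
p_bh = benjamini_hochberg(p_values)
print("Bonferroni:", np.round(p_bonf, 3))
print("BH (FDR):  ", np.round(p_bh, 3))
```

For these seven p-values the smallest adjusted value is 0.07 under both methods, so nothing is rejected at alpha = 0.05 even though five raw p-values are below 0.05 or close to it.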

When to use which correction: Use Bonferroni when you have few comparisons (m < 10) and the cost of a false positive is high (e.g., clinical trials). Use FDR when you have many comparisons (m > 10) and you want to maximize discoveries (e.g., genomics, exploratory data analysis). FDR is the standard in high-dimensional testing because Bonferroni becomes impossibly conservative with thousands of tests.


Power Analysis

Why It Matters

Before collecting data, ask:
  "How many samples do I need to detect an effect of size X?"

  Too few samples -> Can't detect real effects (waste of time)
  Too many samples -> Waste of resources (ethical issue)

The Four Variables

Statistical Power

$$\text{Power} = 1 - \beta = f(\text{effect size}, \alpha, n)$$

Here,

  • $\beta$ = probability of a Type II error
  • $\alpha$ = significance level (probability of a Type I error)
  • $n$ = sample size

Given any 3, solve for the 4th:
  Effect size (Cohen's d)
  Significance level (alpha, usually 0.05)
  Sample size (n)
  Power (usually 0.80)

Power Analysis: Determining Sample Size

from statsmodels.stats.power import TTestIndPower

# Power analysis for independent t-test
power_analysis = TTestIndPower()

# How many samples needed to detect d=0.5 with 80% power?
n = power_analysis.solve_power(
    effect_size=0.5,    # Cohen's d
    alpha=0.05,
    power=0.80,
    ratio=1.0           # Equal group sizes
)
print(f"Required n per group: {int(np.ceil(n))}")

# What power do we have with n=50 per group?
power = power_analysis.power(
    effect_size=0.5,
    alpha=0.05,
    nobs1=50,
    ratio=1.0
)
print(f"Power with n=50: {power:.2f}")

# What effect size can we detect with n=30?
es = power_analysis.solve_power(alpha=0.05, power=0.80, nobs1=30, ratio=1.0)
print(f"Minimum detectable effect size: {es:.3f}")

The sample size–power–effect size triangle: For a fixed significance level, these three quantities are locked in a tradeoff. If you want higher power (fewer missed effects), you need either a larger sample or a larger effect size. A power analysis before data collection helps prevent two common experimental design failures: (1) underpowered studies that spend resources but are unlikely to detect a real effect, and (2) overpowered studies that collect more data than needed and flag trivially small effects as significant.


Quick Reference: Which Test to Use

What are you comparing?
  |
  +-- One group vs known value?
  |     +-- Continuous, normal? -> One-sample t-test
  |     +-- Continuous, non-normal? -> Wilcoxon signed-rank
  |     +-- Categorical? -> Chi-square goodness of fit
  |
  +-- Two groups?
  |     +-- Independent?
  |     |     +-- Continuous, normal? -> Independent t-test (Welch's)
  |     |     +-- Continuous, non-normal? -> Mann-Whitney U
  |     |     +-- Categorical? -> Chi-square test of independence
  |     +-- Paired?
  |           +-- Continuous, normal? -> Paired t-test
  |           +-- Continuous, non-normal? -> Wilcoxon signed-rank (paired)
  |
  +-- Three+ groups?
        +-- One factor?
        |     +-- Continuous, normal? -> One-way ANOVA
        |     +-- Continuous, non-normal? -> Kruskal-Wallis
        +-- Two+ factors?
              +-- Continuous, normal? -> Two-way ANOVA
              +-- Non-parametric (repeated measures)? -> Friedman test

Key Takeaways

Summary: Statistical Testing Deep Dive

  1. Always state H0 and H1 before testing. Hypothesis testing is a structured framework for decision-making under uncertainty, not a fishing expedition for significant results.
  2. p-value is NOT the probability H0 is true. It is the probability of seeing data this extreme if H₀ were true. This subtle distinction is the most common misconception in statistics.
  3. Statistical significance ≠ Practical significance. Always report effect sizes (Cohen's d, Cramér's V) alongside p-values. A tiny effect can be "significant" with a large enough sample.
  4. Use Welch's t-test by default (safer than Student's). It does not assume equal variances and performs nearly as well when variances are equal.
  5. Correct for multiple comparisons using Bonferroni (conservative, few tests) or FDR (liberal, many tests). Ignoring multiple testing inflates false positive rates to unacceptably high levels.
  6. Always check assumptions (normality, independence, homoscedasticity). Violated assumptions lead to incorrect p-values and invalid conclusions. Use non-parametric tests when assumptions fail.
  7. Use power analysis to determine sample size before collecting data. The minimum detectable effect size is a function of sample size and desired power, so design your experiment accordingly.
  8. Non-parametric tests are alternatives when normality fails. They trade some power for robustness, making them ideal for small samples or skewed distributions.

Practice Exercises

  1. Drug Trial: You measure blood pressure in 40 patients after a new drug. Historical mean is 120 mmHg. Your sample mean is 115 mmHg, std=12. Is the drug effective? (One-sample t-test)

  2. A/B Test: Website A has conversion rate 12.3% (n=5000), Website B has 13.1% (n=5000). Is B significantly better? (Two-proportion z-test)

  3. Survey Analysis: A survey of 200 people shows the relationship between education level (High School, Bachelor, Master, PhD) and preferred news source (TV, Online, Print). Is there an association? (Chi-square)

  4. Experiment Design: You want to detect a medium effect (d=0.5) with 90% power. How many subjects do you need per group? (Power analysis)
