A/B Testing and Experimentation

Module 4: Statistics & Probability

A/B Testing Fundamentals

A/B testing (also called split testing or randomized controlled experimentation) is the gold standard for making data-driven decisions. It involves randomly assigning users to two or more variants to measure the causal impact of a change.

Architecture Diagram
                      Traffic
                        |
            +----------+----------+
            |                     |
       Random Assignment     Random Assignment
            |                     |
            v                     v
     +------+------+       +------+------+
     |  Control   |       | Treatment   |
     |  (A)       |       |  (B)        |
     |  Original  |       |  New Version|
     +------+-----+       +------+------+
            |                     |
            v                     v
     +------+------+       +------+------+
     |  Metrics   |       |  Metrics    |
     |  Collect   |       |  Collect    |
     +------+-----+       +------+------+
            |                     |
            +----------+----------+
                       |
                       v
               Statistical Analysis
               (Is B significantly better?)

Key Concepts

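| Term | Definition |
|---|---|
| Control (A) | The baseline/original version |
| Treatment (B) | The modified version being tested |
| Unit of Analysis | The entity that is randomized and measured: user, session, pageview, etc. |
| Randomization | Random assignment of units to variants, so confounders are balanced across groups on average |
| Sample Size | Number of observations needed for reliable results |
| Statistical Significance | The observed effect would be unlikely (p-value below α) if there were no true difference |
| Effect Size | Magnitude of the difference between variants |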

Statistical Foundations

Hypothesis Testing Framework

Null Hypothesis

$$H_0: \mu_A = \mu_B \quad \text{(no difference)}$$

Here,

  • $H_0$ = Null hypothesis
  • $\mu_A$ = Mean of control group
  • $\mu_B$ = Mean of treatment group

Z-Test for Proportions

Z-Test Statistic for Proportions

$$z = \frac{\hat{p}_B - \hat{p}_A}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}}$$

Here,

  • $z$ = Test statistic
  • $\hat{p}_A$ = Sample proportion for control
  • $\hat{p}_B$ = Sample proportion for treatment
  • $\hat{p}$ = Pooled proportion
  • $n_A, n_B$ = Sample sizes for each group
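As a rough illustration of the statistic above, the sketch below computes z and a two-sided p-value from raw counts using only the standard library. The function name `two_proportion_z_test` and the example counts (1,000 of 10,000 control users converting versus 1,100 of 10,000 treatment users) are assumptions made for illustration, not values from the lesson.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled proportion: all conversions divided by all users
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 10.0% vs 11.0% conversion on 10,000 users each
z, p_value = two_proportion_z_test(conv_a=1000, n_a=10_000, conv_b=1100, n_b=10_000)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

With these counts the pooled proportion is 0.105, so z comes out a little above 2.3 and the p-value lands near 0.02.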

T-Test for Continuous Metrics

Welch's T-Test Statistic
$$t = \frac{\bar{X}_B - \bar{X}_A}{\sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}}$$

Here,

  • $t$ = Test statistic
  • $\bar{X}_A, \bar{X}_B$ = Sample means
  • $s_A, s_B$ = Sample standard deviations
  • $n_A, n_B$ = Sample sizes

Why Welch's over Student's t-test

Welch's t-test does not assume equal variances between groups, making it more robust for real-world A/B tests where control and treatment groups may have different variance structures.
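To make the unequal-variance point concrete, here is a minimal standard-library sketch that computes Welch's t statistic together with the Welch-Satterthwaite degrees of freedom. The helper `welch_t` and the two small samples are hypothetical; turning t and df into a p-value additionally requires a t-distribution CDF (for example from SciPy), which is left out here.

```python
from statistics import mean, variance

def welch_t(a, b):
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(a), len(b)
    # s^2 / n for each group, using the sample variance (n - 1 denominator)
    v_a, v_b = variance(a) / n_a, variance(b) / n_b
    t = (mean(b) - mean(a)) / (v_a + v_b) ** 0.5
    df = (v_a + v_b) ** 2 / (v_a ** 2 / (n_a - 1) + v_b ** 2 / (n_b - 1))
    return t, df

# Hypothetical data: treatment is much noisier than control
control = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2]
treatment = [10.9, 12.4, 9.1, 13.0, 8.7, 11.8, 12.9, 9.6]

t, df = welch_t(control, treatment)
print(f"t = {t:.3f}, df = {df:.1f} (Student's t would use df = {len(control) + len(treatment) - 2})")
```

When the variances differ this much, the Welch degrees of freedom fall below the n_A + n_B - 2 that Student's t-test would assume, which widens the reference distribution accordingly.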

Sample Size Calculation

For Conversion Rate Tests

Sample Size for Proportions
$$n = \frac{(z_{\alpha/2} + z_{\beta})^2 \cdot \left[p_A(1-p_A) + p_B(1-p_B)\right]}{(p_B - p_A)^2}$$

Here,

  • $n$ = Sample size per group
  • $z_{\alpha/2}$ = Z-value for significance level (1.96 for α = 0.05)
  • $z_{\beta}$ = Z-value for power (0.84 for power = 80%)
  • $p_A$ = Baseline conversion rate
  • $p_B$ = Expected conversion rate
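As a worked instance of this formula, take a baseline of 10% and an expected rate of 11% (a 10% relative lift), with the rounded z-values quoted above:

```latex
$$n = \frac{(1.96 + 0.84)^2 \cdot \left[0.10 \cdot 0.90 + 0.11 \cdot 0.89\right]}{(0.11 - 0.10)^2}
    = \frac{7.84 \cdot 0.1879}{0.0001}
    \approx 14{,}731$$
```

That is about 14,732 users per group after rounding up. Using unrounded z-values gives a slightly larger figure (roughly 14,750), so small differences between calculators are expected.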

For Continuous Metrics

Sample Size for Continuous Metrics

$$n = \frac{2(z_{\alpha/2} + z_{\beta})^2 \cdot \sigma^2}{\delta^2}$$

Here,

  • $n$ = Sample size per group
  • $\sigma$ = Standard deviation of the metric
  • $\delta$ = Minimum detectable effect (MDE)

Power Analysis Visualization

Power vs Sample Size
Power
1.0 |
    |                          ___________
    |                        /
0.8 |-----------------------/------------ Target Power
    |                    /
    |                  /
0.6 |               /
    |            /
    |         /
0.4 |      /
    |    /
    |  /
0.2 |/
    |
0.0 +----+----+----+----+----+----+----+----> Sample Size
    0   500  1000 1500 2000 2500 3000 3500

Key insight: Diminishing returns as n increases
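The shape of the curve can be reproduced numerically. The sketch below uses a normal approximation for a two-sided test on proportions (ignoring the negligible opposite tail); the function name `power_two_proportions` and the 10% vs 11% rates are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

def power_two_proportions(p_a, p_b, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test for a difference in proportions."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    # Standard error of the difference under the alternative (unpooled)
    se = sqrt((p_a * (1 - p_a) + p_b * (1 - p_b)) / n_per_group)
    return norm.cdf(abs(p_b - p_a) / se - z_alpha)

sizes = (2_000, 5_000, 10_000, 15_000, 20_000, 30_000)
powers = [power_two_proportions(0.10, 0.11, n) for n in sizes]
for n, pw in zip(sizes, powers):
    print(f"n = {n:>6,} per group: power = {pw:.2f}")
```

Each additional block of users buys less power than the previous one, which is the diminishing-returns pattern in the plot. Power reaches about 0.80 near 14,750 users per group for these rates.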

Sample Size Calculator

import numpy as np
from scipy import stats

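def sample_size_proportions(p_control, mde_relative, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided test on conversion rates.

    mde_relative is the relative lift to detect (0.10 means +10% over baseline).
    """
    p_treatment = p_control * (1 + mde_relative)
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # critical value for two-sided alpha
    z_beta = stats.norm.ppf(power)           # z-value for the target power
    n = (z_alpha + z_beta)**2 * (
        p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    ) / (p_treatment - p_control)**2
    return int(np.ceil(n))

def sample_size_continuous(std_dev, mde_absolute, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided test on a continuous metric.

    mde_absolute is the smallest difference in means worth detecting,
    in the same units as std_dev.
    """
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    n = 2 * (z_alpha + z_beta)**2 * std_dev**2 / mde_absolute**2
    return int(np.ceil(n))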

# Example: Conversion rate test
baseline_cr = 0.10  # 10% baseline conversion
lift = 0.10  # 10% relative lift (10% -> 11%)

n_per_group = sample_size_proportions(baseline_cr, lift)
print(f"Sample size per group: {n_per_group:,}")
print(f"Total sample needed: {n_per_group * 2:,}")
print(f"Test duration at 1000 visitors/day: {n_per_group * 2 / 1000:.1f} days")

Statistical Significance

P-Value Interpretation

| P-Value | Interpretation | Action |
|---|---|---|
| < 0.01 | Very strong evidence against H₀ | Ship with confidence |
| 0.01 - 0.05 | Strong evidence against H₀ | Likely ship |
| 0.05 - 0.10 | Weak evidence | Consider more data |
| > 0.10 | No significant evidence | Don't ship |

Confidence Intervals

Confidence Interval for Difference in Proportions

$$CI = (\hat{p}_B - \hat{p}_A) \pm z_{\alpha/2} \cdot \sqrt{\frac{\hat{p}_A(1-\hat{p}_A)}{n_A} + \frac{\hat{p}_B(1-\hat{p}_B)}{n_B}}$$

Here,

  • $CI$ = Confidence interval
  • $z_{\alpha/2}$ = Critical value for significance level

Confidence Intervals vs P-Values

Always report confidence intervals alongside p-values. A CI provides both statistical significance (does it exclude zero?) and practical significance (how large is the effect?).
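A small sketch of the interval calculation, under the same hypothetical counts of 1,000/10,000 control conversions and 1,100/10,000 treatment conversions. The helper name `diff_ci` is an assumption; note that the interval uses the unpooled standard error, as in the formula above.

```python
from math import sqrt
from statistics import NormalDist

def diff_ci(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Confidence interval for p_B - p_A using the unpooled standard error."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    diff = p_b - p_a
    return diff - z_crit * se, diff + z_crit * se

low, high = diff_ci(1000, 10_000, 1100, 10_000)
print(f"95% CI for the absolute lift: ({low:.4f}, {high:.4f})")
```

Here the interval runs from roughly 0.15 to 1.85 percentage points. It excludes zero, so the result is statistically significant, but the lower end is small enough that the practical value of the change is still uncertain.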

Effect Size Measures

Cohen's h for Proportions

$$h = 2 \arcsin(\sqrt{p_B}) - 2 \arcsin(\sqrt{p_A})$$

Here,

  • $h$ = Effect size
  • $p_A, p_B$ = Proportions for each group

Cohen's d for Continuous Metrics

$$d = \frac{\bar{X}_B - \bar{X}_A}{s_{pooled}}$$

Here,

  • $d$ = Effect size
  • $s_{pooled}$ = Pooled standard deviation
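
Cohen's conventional benchmarks are the same for both measures:

| Effect Size | Cohen's d | Cohen's h |
|---|---|---|
| Small | 0.2 | 0.2 |
| Medium | 0.5 | 0.5 |
| Large | 0.8 | 0.8 |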
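Both effect sizes are short to compute. The sketch below is one possible implementation using only the standard library; the function names and the sample inputs are illustrative assumptions, and Cohen's d uses the usual pooled sample variance weighted by degrees of freedom.

```python
from math import asin, sqrt
from statistics import mean, variance

def cohens_h(p_a, p_b):
    """Effect size for two proportions (arcsine-transformed difference)."""
    return 2 * asin(sqrt(p_b)) - 2 * asin(sqrt(p_a))

def cohens_d(a, b):
    """Effect size for two samples of a continuous metric."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(b) - mean(a)) / sqrt(pooled_var)

h = cohens_h(0.10, 0.11)               # 10% -> 11% conversion
d = cohens_d([1, 2, 3], [2, 3, 4])     # toy samples with equal spread
print(f"h = {h:.3f}, d = {d:.2f}")
```

A lift from 10% to 11% gives h of about 0.03, which shows why typical conversion experiments need large samples: the standardized effects involved are tiny.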

Complete A/B Test Implementation

import numpy as np
from scipy import stats
from dataclasses import dataclass

@dataclass
class ABTestResult:
    control_mean: float
    treatment_mean: float
    lift: float
    p_value: float
    confidence_interval: tuple
    significant: bool
    sample_size_control: int
    sample_size_treatment: int
    power: float

def run_ab_test(control_data, treatment_data, metric_type='continuous', alpha=0.05):
    n_c, n_t = len(control_data), len(treatment_data)
    
    if metric_type == 'proportion':
        p_c = np.mean(control_data)
        p_t = np.mean(treatment_data)
        p_pooled = (np.sum(control_data) + np.sum(treatment_data)) / (n_c + n_t)
        
        se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n_c + 1/n_t))
        z_stat = (p_t - p_c) / se
        p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))
        
        se_diff = np.sqrt(p_c*(1-p_c)/n_c + p_t*(1-p_t)/n_t)
        z_crit = stats.norm.ppf(1 - alpha/2)
        ci = ((p_t - p_c) - z_crit*se_diff, (p_t - p_c) + z_crit*se_diff)
        
        lift = (p_t - p_c) / p_c if p_c > 0 else float('inf')
        mean_c, mean_t = p_c, p_t
    else:
        mean_c, mean_t = np.mean(control_data), np.mean(treatment_data)
        std_c, std_t = np.std(control_data, ddof=1), np.std(treatment_data, ddof=1)
        
        se = np.sqrt(std_c**2/n_c + std_t**2/n_t)
        t_stat = (mean_t - mean_c) / se
        df = (std_c**2/n_c + std_t**2/n_t)**2 / (
            (std_c**2/n_c)**2/(n_c-1) + (std_t**2/n_t)**2/(n_t-1))
        p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))
        
        t_crit = stats.t.ppf(1 - alpha/2, df)
        ci = ((mean_t - mean_c) - t_crit*se, (mean_t - mean_c) + t_crit*se)
        lift = (mean_t - mean_c) / mean_c if mean_c != 0 else float('inf')
    
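    # Post-hoc (observed) power via a normal approximation, based on the
    # standardized effect size (difference divided by the average sample SD).
    effect_size = abs(mean_t - mean_c) / np.sqrt(
        (np.var(control_data, ddof=1) + np.var(treatment_data, ddof=1)) / 2)
    power = 1 - stats.norm.cdf(
        stats.norm.ppf(1 - alpha/2) - effect_size * np.sqrt(n_c * n_t / (n_c + n_t)))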
    
    return ABTestResult(
        control_mean=mean_c, treatment_mean=mean_t, lift=lift,
        p_value=p_value, confidence_interval=ci,
        significant=p_value < alpha,
        sample_size_control=n_c, sample_size_treatment=n_t, power=power
    )

# Example
np.random.seed(42)
control = np.random.binomial(1, 0.10, 10000)
treatment = np.random.binomial(1, 0.11, 10000)

result = run_ab_test(control, treatment, metric_type='proportion')
print(f"Control CR: {result.control_mean:.4f}")
print(f"Treatment CR: {result.treatment_mean:.4f}")
print(f"Lift: {result.lift:.2%}")
print(f"P-value: {result.p_value:.4f}")
print(f"Significant: {result.significant}")
print(f"95% CI: ({result.confidence_interval[0]:.4f}, {result.confidence_interval[1]:.4f})")

Common Pitfalls

1. Peeking Problem (Optional Stopping)

Problem: Checking results before reaching target sample size
                    *
           *       *  *
         *           *
       *               *
     *                   *
   *  <-- Stop early?    *
 *      "Looks good!"     *
+---------------------------> Time

Inflates false positive rate from 5% to 30%+

Peeking Problem

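Repeatedly checking results and stopping as soon as they look significant inflates the false positive rate well beyond the nominal 5%; with many looks it can reach 30% or more. Either use a sequential testing method designed for interim looks, or run the test to its planned sample size.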
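The inflation is easy to see in a simulated A/A test, where both arms share the same true conversion rate and every "significant" result is a false positive. This is a rough sketch under assumed settings (400 simulated experiments, 20 looks, 100 new users per arm per look, 10% conversion); the exact inflation depends on how many looks are taken.

```python
import random
from math import sqrt
from statistics import NormalDist

def aa_false_positive_rate(peek, n_experiments=400, n_looks=20, batch=100,
                           p=0.10, alpha=0.05, seed=0):
    """Share of A/A tests (no true difference) that get declared significant."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_experiments):
        conv_a = conv_b = n = 0
        for look in range(1, n_looks + 1):
            # Each look adds `batch` new users per arm, both with true rate p
            conv_a += sum(rng.random() < p for _ in range(batch))
            conv_b += sum(rng.random() < p for _ in range(batch))
            n += batch
            if not peek and look < n_looks:
                continue  # disciplined analyst: test only at the planned end
            pooled = (conv_a + conv_b) / (2 * n)
            se = sqrt(pooled * (1 - pooled) * 2 / n)
            if se > 0 and abs(conv_b - conv_a) / n / se > z_crit:
                false_positives += 1
                break  # peeking analyst stops at the first "significant" look
    return false_positives / n_experiments

fixed = aa_false_positive_rate(peek=False)
peeking = aa_false_positive_rate(peek=True)
print(f"Test once at the end: {fixed:.1%} false positives")
print(f"Test at every look:   {peeking:.1%} false positives")
```

Testing once at the planned sample size should stay close to 5%, while testing at every look and stopping on the first significant result typically produces a rate several times higher.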

2. Multiple Comparisons

Family-wise Error Rate

$$P(\text{at least one false positive}) = 1 - (1-\alpha)^k$$

Here,

  • $k$ = Number of comparisons
  • $\alpha$ = Significance level per test

| Comparisons (k) | Family-wise Error Rate (α = 0.05) |
|---|---|
| 2 | 9.75% |
| 5 | 22.6% |
| 10 | 40.1% |
| 20 | 64.2% |
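The table values follow directly from the formula. The sketch below reproduces them and also shows the Bonferroni correction, which is one common remedy that this lesson does not cover: each test is run at α/k so that the family-wise rate stays at or below α. Function names are illustrative.

```python
def family_wise_error_rate(k, alpha=0.05):
    """Probability of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k

def bonferroni_alpha(k, alpha=0.05):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / k

for k in (2, 5, 10, 20):
    uncorrected = family_wise_error_rate(k)
    corrected = family_wise_error_rate(k, bonferroni_alpha(k))
    print(f"k = {k:>2}: FWER = {uncorrected:.1%}, with Bonferroni = {corrected:.1%}")
```

Bonferroni is conservative, so it trades some power for the guarantee; it is shown here only to illustrate how the per-test threshold must shrink as comparisons are added.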

3. Novelty and Primacy Effects

Effect Size
    ^
    |  *
    |   *  *  Novelty effect (new wears off)
    |     *  *  *
    |           * * * * * * * True effect
    |                   * * * * * * * *
    +-------------------------------------> Time
    Launch

Novelty Effect

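New features often get a temporary boost from curiosity that fades over time (novelty effect). The reverse can also happen: existing users accustomed to the old version may respond worse at first and then adapt (primacy effect). Run tests for at least 1-2 full business cycles (typically 2 weeks) so both effects have time to settle.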

4. Simpson's Paradox

Overall: A beats B (5% vs 4.5%)
         |                 |
    +----+----+       +----+----+
    | Mobile  |       | Desktop |
    | A: 2%   |       | A: 8%   |
    | B: 3%   |       | B: 9%   |
    | B wins! |       | B wins! |
    +---------+       +---------+

Confounding variable: traffic mix across segments (A: 50% mobile, B: 75% mobile)

Simpson's Paradox

Always check results by key segments (device, traffic source, geography). Aggregate results can be misleading.
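A segment breakdown takes only a few lines. The counts below are hypothetical and chosen so that variant B wins inside every segment yet loses in aggregate, because B received a much larger share of low-converting mobile traffic.

```python
# (conversions, users) per variant and segment; hypothetical counts for illustration
data = {
    "A": {"mobile": (100, 5_000), "desktop": (400, 5_000)},
    "B": {"mobile": (225, 7_500), "desktop": (225, 2_500)},
}

def rate(conversions, users):
    return conversions / users

# Aggregate conversion rate per variant
overall = {
    variant: rate(sum(c for c, _ in segs.values()), sum(n for _, n in segs.values()))
    for variant, segs in data.items()
}
# Conversion rate per variant within each segment
by_segment = {
    segment: {variant: rate(*data[variant][segment]) for variant in data}
    for segment in ("mobile", "desktop")
}

print("Overall:", {v: f"{r:.1%}" for v, r in overall.items()})
for segment, rates in by_segment.items():
    print(f"{segment}:", {v: f"{r:.1%}" for v, r in rates.items()})
```

Overall A converts at 5.0% against 4.5% for B, while B leads 3.0% to 2.0% on mobile and 9.0% to 8.0% on desktop. In a properly randomized test the segment mix should be nearly identical across variants, so an imbalance like this one is itself a warning sign worth investigating.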

Multi-Armed Bandits

Epsilon-Greedy Strategy

Epsilon-Greedy Action Selection

$$\text{Action} = \begin{cases} \text{Random} & \text{with probability } \epsilon \\ \text{Best known} & \text{with probability } 1-\epsilon \end{cases}$$

Here,

  • $\epsilon$ = Exploration probability

Upper Confidence Bound (UCB)

Upper Confidence Bound

$$\text{UCB}_i = \hat{\mu}_i + \sqrt{\frac{2 \ln t}{n_i}}$$

Here,

  • $t$ = Total rounds
  • $n_i$ = Times arm i was pulled
  • $\hat{\mu}_i$ = Estimated mean reward for arm i
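A minimal sketch of the UCB rule above, written with the standard library only. The class name `UCB1`, the choice to pull every arm once before applying the formula (so that n_i is never zero), and the simulated reward probabilities are assumptions made for illustration.

```python
import math
import random

class UCB1:
    def __init__(self, n_arms):
        self.counts = [0] * n_arms    # n_i: times each arm was pulled
        self.values = [0.0] * n_arms  # running mean reward per arm

    def select_arm(self):
        # Pull each arm once first so the bonus term is defined
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        t = sum(self.counts)
        scores = [v + math.sqrt(2 * math.log(t) / n)
                  for v, n in zip(self.values, self.counts)]
        return scores.index(max(scores))

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental update of the running mean
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

rng = random.Random(42)
true_probs = [0.10, 0.15, 0.20, 0.05]  # hypothetical conversion rates per arm
bandit = UCB1(len(true_probs))
for _ in range(10_000):
    arm = bandit.select_arm()
    bandit.update(arm, int(rng.random() < true_probs[arm]))
print("Pulls per arm:", bandit.counts)
```

The bonus term shrinks as an arm accumulates pulls, so rarely tried arms keep getting revisited while clearly worse arms receive a shrinking share of traffic.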

Thompson Sampling

Thompson Sampling Posterior
$$\theta_i \mid \text{data} \sim \text{Beta}(\alpha_i + \text{successes}_i, \; \beta_i + \text{failures}_i)$$

Here,

  • $\theta_i$ = Unknown success probability of arm i
  • $\alpha_i$ = Prior successes for arm i
  • $\beta_i$ = Prior failures for arm i

Each round, draw one sample from every arm's posterior and play the arm with the largest sample.

Multi-Armed Bandit Implementation

import numpy as np

class EpsilonGreedy:
    def __init__(self, n_arms, epsilon=0.1):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)
    
    def select_arm(self):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_arms)
        return np.argmax(self.values)
    
    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] = self.values[arm] * (n-1)/n + reward/n

class ThompsonSampling:
    def __init__(self, n_arms):
        self.n_arms = n_arms
        self.alpha = np.ones(n_arms)
        self.beta = np.ones(n_arms)
    
    def select_arm(self):
        samples = [np.random.beta(self.alpha[i], self.beta[i]) 
                   for i in range(self.n_arms)]
        return np.argmax(samples)
    
    def update(self, arm, reward):
        if reward == 1:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1

# Simulation
def simulate_bandit(algorithm, true_probs, n_rounds=1000):
    rewards = []
    for _ in range(n_rounds):
        arm = algorithm.select_arm()
        reward = int(np.random.random() < true_probs[arm])  # 1 on success, 0 otherwise
        algorithm.update(arm, reward)
        rewards.append(reward)
    return np.array(rewards)

np.random.seed(42)
true_probs = [0.10, 0.15, 0.20, 0.05]

for name, algo in [('Epsilon-Greedy', EpsilonGreedy(4, 0.1)),
                    ('Thompson Sampling', ThompsonSampling(4))]:
    rewards = simulate_bandit(algo, true_probs, n_rounds=10000)
    print(f"{name}: Total reward = {rewards.sum():.0f}, Avg = {rewards.mean():.4f}")

Experimentation Platform Architecture

Architecture Diagram
+-------------------+     +-------------------+     +-------------------+
|   Experiment      |     |   Traffic         |     |   Assignment      |
|   Configuration   | --> |   Allocation      | --> |   Service         |
|   (YAML/DB)       |     |   (% by variant)  |     | (user -> variant) |
+-------------------+     +-------------------+     +-------------------+
           |                        |                        |
           v                        v                        v
+-------------------+     +-------------------+     +-------------------+
|   Hypothesis      |     |   Event           |     |   Statistical     |
|   Registry        |     |   Logging         |     |   Engine          |
+-------------------+     +-------------------+     +-------------------+
                                   |
                                   v
                          +-------------------+
                          |   Results         |
                          |   Dashboard       |
                          +-------------------+
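The top row of the diagram (configuration, traffic allocation, assignment) can be sketched in a few lines. This is a hypothetical design, not a description of any particular platform: the `Experiment` dataclass, the `assign` function, and the experiment name are all made up for illustration. Hashing the user ID together with the experiment name gives a stable assignment without storing state, and keeps assignments independent across experiments.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class Experiment:
    name: str
    allocation: dict  # variant name -> share of traffic, expected to sum to 1.0

def assign(experiment: Experiment, user_id: str) -> str:
    """Deterministically map a user to a variant according to the allocation."""
    digest = hashlib.sha256(f"{experiment.name}:{user_id}".encode("utf-8")).hexdigest()
    point = int(digest[:8], 16) / 16**8  # pseudo-uniform value in [0, 1)
    cumulative = 0.0
    for variant, share in experiment.allocation.items():
        cumulative += share
        if point < cumulative:
            return variant
    return variant  # float rounding guard: fall back to the last variant

exp = Experiment("new-checkout", {"control": 0.9, "treatment": 0.1})
counts = {"control": 0, "treatment": 0}
for i in range(10_000):
    counts[assign(exp, f"user-{i}")] += 1
print(counts)
```

The same user always receives the same variant for a given experiment, and the observed split should sit close to the configured 90/10 allocation.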

Key Takeaways

Summary: A/B Testing

  1. Randomization is critical: Without proper randomization, results are unreliable
  2. Sample size matters: Calculate before running the test — underpowered tests waste time
  3. Don't peek: Stopping as soon as interim results look significant inflates the false positive rate well beyond the nominal 5% (30% or more with many looks)
  4. Check segments: Aggregate results can be misleading (Simpson's paradox)
  5. Run long enough: Cover at least 1-2 full business cycles (typically 2 weeks)
  6. Multi-armed bandits are better for optimization; A/B tests are better for causal learning
  7. Pre-registration: Define hypotheses, metrics, and sample size before starting
  8. Report effect sizes and confidence intervals, not just p-values

Practice Exercises

  1. Calculate the required sample size for baseline conversion 5%, MDE 10% relative lift, α=0.05, power=80%
  2. Implement a Thompson Sampling algorithm and compare against epsilon-greedy
  3. Research and implement a sequential testing framework
  4. When would you choose a multi-armed bandit over a traditional A/B test?
  5. How do you handle experiments where the metric takes weeks to mature?
