A/B Testing and Experimentation

A/B Testing Fundamentals

A/B testing (also called split testing or randomized controlled experimentation) is the gold standard for measuring the causal impact of a change. Users are randomly assigned to two or more variants, so a difference in outcomes can be attributed to the change itself rather than to pre-existing differences between groups.

Key Concepts

Term	Definition
Control (A)	The baseline/original version
Treatment (B)	The modified version being tested
Unit of Analysis	User, session, pageview, etc.
Randomization	Random assignment of units to variants so that confounders are balanced across groups on average
Sample Size	Number of observations needed for reliable results
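Statistical Significance	Whether the observed effect would be unlikely if there were no true difference (p-value below the chosen α); it is not the probability that the effect is real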
Effect Size	Magnitude of the difference between variants
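
In practice, random assignment is often implemented as deterministic hashing so that a returning user always sees the same variant. The sketch below shows one common approach under that assumption; the function name `assign_variant`, the experiment name, and the 50/50 split are illustrative and not part of the original material.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'.

    Hashing the user ID together with the experiment name gives each
    experiment its own split while keeping a user's assignment stable
    across visits.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Interpret the first 8 hex digits as a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"

# Hypothetical user IDs, used only to check that the split is roughly even.
assignments = [assign_variant(f"user-{i}", "checkout-button-test") for i in range(10_000)]
share = assignments.count("treatment") / len(assignments)
print(f"Treatment share: {share:.3f}")
```

Including the experiment name in the hash key keeps assignments in one experiment from lining up with assignments in another.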

Statistical Foundations

Hypothesis Testing Framework

Z-Test for Proportions
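
For two conversion rates, the pooled two-proportion z-test has the standard form below, which is what the `run_ab_test` implementation later in this document computes for `metric_type='proportion'`. The notation is chosen here for illustration: $x_C, x_T$ are conversion counts, $n_C, n_T$ are group sizes, and $\Phi$ is the standard normal CDF.

```latex
\hat{p} = \frac{x_C + x_T}{n_C + n_T}, \qquad
z = \frac{\hat{p}_T - \hat{p}_C}{\sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{n_C} + \frac{1}{n_T}\right)}}, \qquad
p\text{-value} = 2\left(1 - \Phi(|z|)\right)
```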

T-Test for Continuous Metrics
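
For continuous metrics such as revenue per user, a common choice is Welch's t-test, which does not assume equal variances. The form below matches the continuous branch of `run_ab_test` later in this document; $\bar{x}$ and $s^2$ denote the sample mean and sample variance of each group, and $\nu$ is the Welch-Satterthwaite degrees of freedom (notation assumed here).

```latex
t = \frac{\bar{x}_T - \bar{x}_C}{\sqrt{\frac{s_C^2}{n_C} + \frac{s_T^2}{n_T}}}, \qquad
\nu = \frac{\left(\frac{s_C^2}{n_C} + \frac{s_T^2}{n_T}\right)^2}
           {\frac{\left(s_C^2/n_C\right)^2}{n_C - 1} + \frac{\left(s_T^2/n_T\right)^2}{n_T - 1}}
```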

Sample Size Calculation

For Conversion Rate Tests
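
The per-group sample size used by `sample_size_proportions` in the calculator below follows the usual normal-approximation formula, where $p_C$ is the baseline rate, $p_T$ is the rate under the minimum detectable effect, and $z_q$ is the $q$-quantile of the standard normal distribution (symbols chosen here for illustration).

```latex
n = \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2 \left[\,p_C(1-p_C) + p_T(1-p_T)\,\right]}{(p_T - p_C)^2}
```

As a worked example with $p_C = 0.10$, $p_T = 0.11$, $\alpha = 0.05$, and power $0.80$: $(1.960 + 0.842)^2 \approx 7.85$, the variance term is $0.09 + 0.0979 = 0.1879$, and the squared difference is $0.0001$, giving roughly 14,750 users per group.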

For Continuous Metrics
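
For a continuous metric with standard deviation $\sigma$ (assumed equal in both groups) and an absolute minimum detectable effect $\delta$, the corresponding per-group formula, as implemented by `sample_size_continuous` in the calculator below, is:

```latex
n = \frac{2\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2 \sigma^2}{\delta^2}
```

Because $n$ scales with $1/\delta^2$, halving the minimum detectable effect quadruples the required sample size.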

Power Analysis Visualization

[Figure: Power vs. Sample Size. Power (0.0 to 1.0) plotted against sample size (0 to 3,500). The curve reaches the 0.8 target at n ≈ 2,000, the minimum sample size; beyond that point each additional sample adds less power (diminishing returns).]
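
The shape of a power curve can be reproduced numerically. The sketch below uses the normal approximation for a two-sided two-proportion test and the 10% to 11% conversion example from the calculator that follows, so its numbers will not match the figure's axis values. `power_two_proportions` is an illustrative helper that is not part of the original material, and it relies only on the standard library.

```python
from statistics import NormalDist

def power_two_proportions(p_control, p_treatment, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test (equal group sizes)."""
    norm = NormalDist()
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    # Standard error of the difference under the alternative (unpooled).
    variance_sum = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    se = (variance_sum / n_per_group) ** 0.5
    # Probability of rejecting in the direction of the true effect;
    # the negligible opposite-tail term is ignored.
    return norm.cdf(abs(p_treatment - p_control) / se - z_alpha)

sample_sizes = [1_000, 5_000, 10_000, 15_000, 20_000, 30_000]
powers = [power_two_proportions(0.10, 0.11, n) for n in sample_sizes]
for n, pw in zip(sample_sizes, powers):
    print(f"n per group = {n:>6,}: power = {pw:.3f}")
```

Power rises steeply at first and then flattens, which is the diminishing-returns pattern the figure describes.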

Sample Size Calculator

import numpy as np
from scipy import stats

def sample_size_proportions(p_control, mde_relative, alpha=0.05, power=0.80):
    p_treatment = p_control * (1 + mde_relative)
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    n = (z_alpha + z_beta)**2 * (
        p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    ) / (p_treatment - p_control)**2
    return int(np.ceil(n))

def sample_size_continuous(std_dev, mde_absolute, alpha=0.05, power=0.80):
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    n = 2 * (z_alpha + z_beta)**2 * std_dev**2 / mde_absolute**2
    return int(np.ceil(n))

# Example: Conversion rate test
baseline_cr = 0.10  # 10% baseline conversion
lift = 0.10  # 10% relative lift (10% -> 11%)

n_per_group = sample_size_proportions(baseline_cr, lift)
print(f"Sample size per group: {n_per_group:,}")
print(f"Total sample needed: {n_per_group * 2:,}")
print(f"Test duration at 1000 visitors/day: {n_per_group * 2 / 1000:.1f} days")

Statistical Significance

P-Value Interpretation

P-Value	Interpretation	Action
< 0.01	Very strong evidence against H₀	Ship with confidence
0.01 - 0.05	Strong evidence against H₀	Likely ship
0.05 - 0.10	Weak evidence	Consider more data
> 0.10	No significant evidence	Don't ship

Confidence Intervals
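
A confidence interval reports a range of plausible values for the true difference rather than a single yes/no verdict. For a difference in conversion rates, the unpooled (Wald) interval computed by `run_ab_test` later in this document is shown below; the hat notation for observed rates is assumed here.

```latex
(\hat{p}_T - \hat{p}_C) \;\pm\; z_{1-\alpha/2}
\sqrt{\frac{\hat{p}_C(1-\hat{p}_C)}{n_C} + \frac{\hat{p}_T(1-\hat{p}_T)}{n_T}}
```

If the interval excludes zero, the difference is significant at level $\alpha$ under this approximation; its width shows how precisely the effect has been estimated.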

Effect Size Measures

Effect Size	Cohen's d	Cohen's h
Small	0.2	0.2
Medium	0.5	0.5
Large	0.8	0.8
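
Both measures are simple to compute. The sketch below uses the textbook definitions (pooled standard deviation for Cohen's d, arcsine transformation for Cohen's h); the function names and the small input lists are illustrative assumptions, not data from the original material.

```python
import math
from statistics import mean, variance

def cohens_d(control, treatment):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_c, n_t = len(control), len(treatment)
    pooled_var = ((n_c - 1) * variance(control) + (n_t - 1) * variance(treatment)) / (n_c + n_t - 2)
    return (mean(treatment) - mean(control)) / math.sqrt(pooled_var)

def cohens_h(p_control, p_treatment):
    """Difference between two proportions on the arcsine-transformed scale."""
    def phi(p):
        return 2 * math.asin(math.sqrt(p))
    return phi(p_treatment) - phi(p_control)

# Made-up values: both groups have sample variance 4 and means 3 apart.
d = cohens_d([10.0, 12.0, 14.0], [13.0, 15.0, 17.0])
# The 10% -> 11% conversion example used elsewhere in this document.
h = cohens_h(0.10, 0.11)
print(f"Cohen's d: {d:.2f}")
print(f"Cohen's h: {h:.4f}")
```

A lift from 10% to 11% gives an h of roughly 0.03, below even the "Small" row of the table, which is typical of online experiments and is why they need large samples.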

Complete A/B Test Implementation

import numpy as np
from scipy import stats
from dataclasses import dataclass

@dataclass
class ABTestResult:
    control_mean: float
    treatment_mean: float
    lift: float
    p_value: float
    confidence_interval: tuple
    significant: bool
    sample_size_control: int
    sample_size_treatment: int
    power: float

def run_ab_test(control_data, treatment_data, metric_type='continuous', alpha=0.05):
    n_c, n_t = len(control_data), len(treatment_data)
    
    if metric_type == 'proportion':
        p_c = np.mean(control_data)
        p_t = np.mean(treatment_data)
        p_pooled = (np.sum(control_data) + np.sum(treatment_data)) / (n_c + n_t)
        
        se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n_c + 1/n_t))
        z_stat = (p_t - p_c) / se
        # Two-sided p-value; sf() (the survival function) stays accurate for very large |z|.
        p_value = 2 * stats.norm.sf(abs(z_stat))
        
        se_diff = np.sqrt(p_c*(1-p_c)/n_c + p_t*(1-p_t)/n_t)
        z_crit = stats.norm.ppf(1 - alpha/2)
        ci = ((p_t - p_c) - z_crit*se_diff, (p_t - p_c) + z_crit*se_diff)
        
        lift = (p_t - p_c) / p_c if p_c > 0 else float('inf')
        mean_c, mean_t = p_c, p_t
    else:
        mean_c, mean_t = np.mean(control_data), np.mean(treatment_data)
        std_c, std_t = np.std(control_data, ddof=1), np.std(treatment_data, ddof=1)
        
        se = np.sqrt(std_c**2/n_c + std_t**2/n_t)
        t_stat = (mean_t - mean_c) / se
        df = (std_c**2/n_c + std_t**2/n_t)**2 / (
            (std_c**2/n_c)**2/(n_c-1) + (std_t**2/n_t)**2/(n_t-1))
        # Two-sided p-value from the t distribution with Welch-Satterthwaite df.
        p_value = 2 * stats.t.sf(abs(t_stat), df)
        
        t_crit = stats.t.ppf(1 - alpha/2, df)
        ci = ((mean_t - mean_c) - t_crit*se, (mean_t - mean_c) + t_crit*se)
        # Relative lift is undefined only when the control mean is exactly zero.
        lift = (mean_t - mean_c) / mean_c if mean_c != 0 else float('inf')
    
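    # Post-hoc power for the observed standardized effect (normal approximation).
    # Use sample variances (ddof=1) to stay consistent with the t-test branch above.
    pooled_sd = np.sqrt((np.var(control_data, ddof=1) + np.var(treatment_data, ddof=1)) / 2)
    effect_size = abs(mean_t - mean_c) / pooled_sd
    power = 1 - stats.norm.cdf(
        stats.norm.ppf(1 - alpha/2) - effect_size * np.sqrt(n_c * n_t / (n_c + n_t)))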
    
    return ABTestResult(
        control_mean=mean_c, treatment_mean=mean_t, lift=lift,
        p_value=p_value, confidence_interval=ci,
        significant=p_value < alpha,
        sample_size_control=n_c, sample_size_treatment=n_t, power=power
    )

# Example
np.random.seed(42)
control = np.random.binomial(1, 0.10, 10000)
treatment = np.random.binomial(1, 0.11, 10000)

result = run_ab_test(control, treatment, metric_type='proportion')
print(f"Control CR: {result.control_mean:.4f}")
print(f"Treatment CR: {result.treatment_mean:.4f}")
print(f"Lift: {result.lift:.2%}")
print(f"P-value: {result.p_value:.4f}")
print(f"Significant: {result.significant}")
print(f"95% CI: ({result.confidence_interval[0]:.4f}, {result.confidence_interval[1]:.4f})")

Common Pitfalls

1. Peeking Problem (Optional Stopping)

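[Figure: Peeking Problem (Optional Stopping). Test statistic plotted against sample size, with the α = 0.05 threshold marked. The true path briefly crosses the threshold early ("Looks good!"), tempting an early stop. Repeatedly checking and stopping at the first crossing inflates the false positive rate from 5% to 30% or more.]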
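
The inflation from repeated looks can be checked by simulation. The sketch below runs A/A tests (both groups share the same true conversion rate, so every significant result is a false positive) and compares stopping at the first significant look with testing once at the end. The function name, the number of looks, and the sample sizes are illustrative assumptions.

```python
import random
from statistics import NormalDist

def false_positive_rate(n_sims=500, n_per_group=1000, looks=10, p=0.10, alpha=0.05, seed=0):
    """Simulate A/A tests: peeking at every look vs. one test at the end."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    step = n_per_group // looks
    peeking_hits = final_hits = 0
    for _ in range(n_sims):
        conv_c = conv_t = 0
        rejected_at_any_look = False
        significant = False
        for look in range(1, looks + 1):
            # Both groups have the same true rate, so any "win" is a false positive.
            conv_c += sum(rng.random() < p for _ in range(step))
            conv_t += sum(rng.random() < p for _ in range(step))
            n = look * step
            pooled = (conv_c + conv_t) / (2 * n)
            se = (pooled * (1 - pooled) * 2 / n) ** 0.5
            significant = se > 0 and abs(conv_t - conv_c) / n / se > z_crit
            rejected_at_any_look = rejected_at_any_look or significant
        peeking_hits += rejected_at_any_look
        final_hits += significant  # outcome of the last look only
    return peeking_hits / n_sims, final_hits / n_sims

peeking_rate, fixed_rate = false_positive_rate()
print(f"False positive rate when stopping at the first significant look: {peeking_rate:.1%}")
print(f"False positive rate with a single test at the end: {fixed_rate:.1%}")
```

The single final test stays near the nominal 5%, while the peeking rate typically comes out several times higher; more frequent looks push it higher still.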

2. Multiple Comparisons

Comparisons (k)	Family-wise Error Rate (α = 0.05 per test, independent tests)
2	9.75%
5	22.6%
10	40.1%
20	64.2%
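
The values in the table follow from 1 − (1 − α)^k for k independent tests at α = 0.05. The sketch below reproduces them and shows the Bonferroni correction (testing each comparison at α / k) as one simple remedy; the function names are illustrative.

```python
def family_wise_error_rate(k, alpha=0.05):
    """Chance of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k

def bonferroni_alpha(k, alpha=0.05):
    """Per-comparison threshold that keeps the family-wise rate at or below alpha."""
    return alpha / k

for k in (2, 5, 10, 20):
    fwer = family_wise_error_rate(k)
    corrected = family_wise_error_rate(k, bonferroni_alpha(k))
    print(f"k={k:>2}: FWER={fwer:.1%}, per-test alpha={bonferroni_alpha(k):.4f}, "
          f"FWER after Bonferroni={corrected:.1%}")
```

Bonferroni is conservative, so it trades some power for the guarantee; it is shown here because it is the simplest correction to reason about.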

3. Novelty and Primacy Effects

[Figure: Novelty and Primacy Effects. Effect size plotted over time since launch. The novelty effect produces an inflated initial lift that peaks and wears off quickly, settling at the true effect, the stable long-term impact. Early winners may be novelty effects, so wait for a stable signal.]

4. Simpson's Paradox
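
Simpson's paradox occurs when a variant wins within every segment but loses in the pooled data (or the reverse), usually because the segment mix differs between variants. The counts below are hypothetical, constructed only to show the effect; they are not results from a real experiment.

```python
# Hypothetical counts: (conversions, users) per segment and variant.
data = {
    "desktop": {"control": (200, 1000), "treatment": (55, 250)},
    "mobile":  {"control": (10, 250),   "treatment": (50, 1000)},
}

segment_winners = {}
for segment, variants in data.items():
    cr_c = variants["control"][0] / variants["control"][1]
    cr_t = variants["treatment"][0] / variants["treatment"][1]
    segment_winners[segment] = "treatment" if cr_t > cr_c else "control"
    print(f"{segment:>8}: control {cr_c:.1%} vs treatment {cr_t:.1%}")

def pooled_rate(variant):
    """Conversion rate for a variant after summing across segments."""
    conversions = sum(data[s][variant][0] for s in data)
    users = sum(data[s][variant][1] for s in data)
    return conversions / users

overall_c, overall_t = pooled_rate("control"), pooled_rate("treatment")
print(f" overall: control {overall_c:.1%} vs treatment {overall_t:.1%}")
```

Treatment wins on both desktop (22% vs 20%) and mobile (5% vs 4%) yet loses overall (8.4% vs 16.8%), because most treatment traffic came from the low-converting mobile segment. In an A/B test this usually signals unbalanced assignment, for example a ramp-up that changed the traffic split partway through.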

Multi-Armed Bandits

Epsilon-Greedy Strategy

Upper Confidence Bound (UCB)

Thompson Sampling

Multi-Armed Bandit Implementation

import numpy as np

class EpsilonGreedy:
    def __init__(self, n_arms, epsilon=0.1):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)
    
    def select_arm(self):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_arms)
        return np.argmax(self.values)
    
    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] = self.values[arm] * (n-1)/n + reward/n

class ThompsonSampling:
    def __init__(self, n_arms):
        self.n_arms = n_arms
        self.alpha = np.ones(n_arms)
        self.beta = np.ones(n_arms)
    
    def select_arm(self):
        # Draw one sample per arm from its Beta posterior and play the best draw.
        samples = np.random.beta(self.alpha, self.beta)
        return np.argmax(samples)
    
    def update(self, arm, reward):
        if reward == 1:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1

# Simulation
def simulate_bandit(algorithm, true_probs, n_rounds=1000):
    rewards = []
    for _ in range(n_rounds):
        arm = algorithm.select_arm()
        reward = np.random.random() < true_probs[arm]
        algorithm.update(arm, reward)
        rewards.append(reward)
    return np.array(rewards)

np.random.seed(42)
true_probs = [0.10, 0.15, 0.20, 0.05]

for name, algo in [('Epsilon-Greedy', EpsilonGreedy(4, 0.1)),
                    ('Thompson Sampling', ThompsonSampling(4))]:
    rewards = simulate_bandit(algo, true_probs, n_rounds=10000)
    print(f"{name}: Total reward = {rewards.sum():.0f}, Avg = {rewards.mean():.4f}")
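
The implementation above covers epsilon-greedy and Thompson Sampling but not the Upper Confidence Bound strategy listed earlier. The following is a minimal sketch of the classic UCB1 rule, written with the same `select_arm`/`update` interface so it could be passed to `simulate_bandit`. The class name `UCB1` and the exploration constant 2 inside the square root are the textbook choices, assumed here rather than taken from the original material.

```python
import math
import random

class UCB1:
    """UCB1 bandit: pick the arm with the highest mean plus confidence bonus."""

    def __init__(self, n_arms):
        self.n_arms = n_arms
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms

    def select_arm(self):
        # Play every arm once before applying the confidence bound.
        for arm in range(self.n_arms):
            if self.counts[arm] == 0:
                return arm
        total = sum(self.counts)
        bounds = [self.values[a] + math.sqrt(2 * math.log(total) / self.counts[a])
                  for a in range(self.n_arms)]
        return max(range(self.n_arms), key=bounds.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental mean of the rewards observed for this arm.
        self.values[arm] += (float(reward) - self.values[arm]) / self.counts[arm]

# Standalone run using the same arm probabilities as the simulation above.
rng = random.Random(42)
true_probs = [0.10, 0.15, 0.20, 0.05]
ucb = UCB1(len(true_probs))
for _ in range(10_000):
    arm = ucb.select_arm()
    ucb.update(arm, rng.random() < true_probs[arm])
print("Pulls per arm:", ucb.counts)
```

UCB1 tends to keep exploring when arm means are close together, so its allocation after 10,000 rounds may be less concentrated on the best arm than Thompson Sampling's.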

Experimentation Platform Architecture

Key Takeaways

Practice Exercises

Calculate the required sample size for baseline conversion 5%, MDE 10% relative lift, α=0.05, power=80%
Implement a Thompson Sampling algorithm and compare against epsilon-greedy
Research and implement a sequential testing framework
When would you choose a multi-armed bandit over a traditional A/B test?
How do you handle experiments where the metric takes weeks to mature?