A/B Testing Fundamentals
A/B testing (also called split testing or randomized controlled experimentation) is the gold standard for making data-driven decisions. It involves randomly assigning users to two or more variants to measure the causal impact of a change.
```
                    Traffic
                       |
            +----------+----------+
            |                     |
    Random Assignment     Random Assignment
            |                     |
            v                     v
     +------+------+       +------+------+
     |   Control   |       |  Treatment  |
     |     (A)     |       |     (B)     |
     |  Original   |       | New Version |
     +------+------+       +------+------+
            |                     |
            v                     v
     +------+------+       +------+------+
     |   Metrics   |       |   Metrics   |
     |   Collect   |       |   Collect   |
     +------+------+       +------+------+
            |                     |
            +----------+----------+
                       |
                       v
             Statistical Analysis
         (Is B significantly better?)
```
Key Concepts
| Term | Definition |
|---|---|
| Control (A) | The baseline/original version |
| Treatment (B) | The modified version being tested |
| Unit of Analysis | User, session, pageview, etc. |
| Randomization | Random assignment to eliminate confounders |
| Sample Size | Number of observations needed for reliable results |
| Statistical Significance | Whether the observed effect would be unlikely to arise by chance alone if there were no true difference |
| Effect Size | Magnitude of the difference between variants |
Statistical Foundations
Hypothesis Testing Framework
Null Hypothesis
$$H_0: \mu_c = \mu_t$$
Here,
- $H_0$ = Null hypothesis
- $\mu_c$ = Mean of control group
- $\mu_t$ = Mean of treatment group
Z-Test for Proportions
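Z-Test Statistic for Proportions
$$z = \frac{\hat{p}_t - \hat{p}_c}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_c} + \frac{1}{n_t}\right)}}$$
Here,
- $z$ = Test statistic
- $\hat{p}_c$ = Sample proportion for control
- $\hat{p}_t$ = Sample proportion for treatment
- $\hat{p}$ = Pooled proportion (total conversions divided by $n_c + n_t$)
- $n_c, n_t$ = Sample sizes for each group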
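As a worked illustration of the pooled z-test, the sketch below computes the statistic and a two-sided p-value from raw counts. The function name `two_proportion_z_test` and the counts are hypothetical and chosen only for illustration.

```python
import math
from scipy import stats


def two_proportion_z_test(x_c, n_c, x_t, n_t):
    """Two-sided pooled z-test for a difference in conversion rates.

    x_c, x_t are conversion counts; n_c, n_t are group sizes.
    """
    p_c, p_t = x_c / n_c, x_t / n_t
    p_pool = (x_c + x_t) / (n_c + n_t)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    p_value = 2 * stats.norm.sf(abs(z))  # sf = 1 - cdf, more accurate in the tail
    return z, p_value


# Hypothetical counts: 1,000 of 10,000 control users convert vs 1,100 of 10,000 treated
z, p = two_proportion_z_test(1000, 10000, 1100, 10000)
print(f"z = {z:.3f}, p = {p:.4f}")
```

With these made-up counts the one-percentage-point difference clears the 5% threshold, but only because each group has 10,000 users; the same rates with a tenth of the traffic would not.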
T-Test for Continuous Metrics
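$$t = \frac{\bar{x}_t - \bar{x}_c}{\sqrt{\frac{s_c^2}{n_c} + \frac{s_t^2}{n_t}}}$$
Here,
- $t$ = Test statistic
- $\bar{x}_c, \bar{x}_t$ = Sample means
- $s_c, s_t$ = Sample standard deviations
- $n_c, n_t$ = Sample sizes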
Why Welch's over Student's t-test
Welch's t-test does not assume equal variances between groups, making it more robust for real-world A/B tests where control and treatment groups may have different variance structures.
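To see the difference in practice, the following sketch runs both tests on simulated data using SciPy's `ttest_ind`, where `equal_var=False` selects Welch's version. The revenue-like samples are hypothetical; the smaller group deliberately has the larger variance, which is the case where Student's pooled variance is most misleading.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical per-user metric: unequal group sizes and unequal variances
control = rng.normal(loc=50.0, scale=10.0, size=2000)
treatment = rng.normal(loc=51.0, scale=25.0, size=500)

welch = stats.ttest_ind(treatment, control, equal_var=False)
student = stats.ttest_ind(treatment, control, equal_var=True)

# Welch's statistic computed by hand from the unpooled standard error
se = np.sqrt(treatment.var(ddof=1) / treatment.size
             + control.var(ddof=1) / control.size)
t_manual = (treatment.mean() - control.mean()) / se

print(f"Welch:   t = {welch.statistic:.3f}, p = {welch.pvalue:.4f}")
print(f"Student: t = {student.statistic:.3f}, p = {student.pvalue:.4f}")
print(f"Manual Welch t = {t_manual:.3f}")
```

Both tests share the same numerator, so any gap between them comes from the standard error. Here Student's pooled estimate is dominated by the large, low-variance control group and understates the uncertainty, producing a larger test statistic than Welch's.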
Sample Size Calculation
For Conversion Rate Tests
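$$n = \frac{(z_{\alpha/2} + z_{\beta})^2 \left[p_c(1 - p_c) + p_t(1 - p_t)\right]}{(p_t - p_c)^2}$$
Here,
- $n$ = Sample size per group
- $z_{\alpha/2}$ = Z-value for significance level (1.96 for two-sided α=0.05)
- $z_{\beta}$ = Z-value for power (0.84 for power = 80%)
- $p_c$ = Baseline conversion rate
- $p_t$ = Expected conversion rate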
For Continuous Metrics
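Sample Size for Continuous Metrics
$$n = \frac{2\,(z_{\alpha/2} + z_{\beta})^2\,\sigma^2}{\delta^2}$$
Here,
- $n$ = Sample size per group
- $\sigma$ = Standard deviation of the metric
- $\delta$ = Minimum detectable effect (MDE)
- $z_{\alpha/2}, z_{\beta}$ = Z-values for significance level and power, as for conversion rate tests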
Power Analysis Visualization
Power vs Sample Size
```
Power
1.0 |
    |                      ___________
    |                    /
0.8 |------------------/----------------- Target Power
    |                /
    |              /
0.6 |            /
    |          /
    |        /
0.4 |      /
    |    /
    |  /
0.2 |/
    |
0.0 +----+----+----+----+----+----+----+----> Sample Size
    0   500  1000 1500 2000 2500 3000 3500
```
Key insight: Diminishing returns as n increases
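The diminishing returns can be checked numerically. The sketch below uses the normal approximation for a two-sided test on proportions and ignores the negligible chance of rejecting in the wrong direction. The helper name `power_two_proportions` and the 10% to 11% scenario are assumptions made for illustration.

```python
import math
from scipy import stats


def power_two_proportions(n, p_c, p_t, alpha=0.05):
    """Approximate power of a two-sided z-test with n users per group."""
    se = math.sqrt((p_c * (1 - p_c) + p_t * (1 - p_t)) / n)
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    return float(stats.norm.cdf(abs(p_t - p_c) / se - z_alpha))


# Hypothetical scenario: 10% baseline, 11% expected conversion rate
for n in (2000, 5000, 10000, 15000, 20000, 30000):
    print(f"n = {n:>6,}: power = {power_two_proportions(n, 0.10, 0.11):.3f}")
```

In this scenario, going from 5,000 to 15,000 users per group adds far more power than going from 20,000 to 30,000, which is the flattening the curve depicts.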
Sample Size Calculator
```python
import numpy as np
from scipy import stats


def sample_size_proportions(p_control, mde_relative, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided test on conversion rates."""
    p_treatment = p_control * (1 + mde_relative)
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    n = (z_alpha + z_beta)**2 * (
        p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    ) / (p_treatment - p_control)**2
    return int(np.ceil(n))


def sample_size_continuous(std_dev, mde_absolute, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided test on a continuous metric."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    n = 2 * (z_alpha + z_beta)**2 * std_dev**2 / mde_absolute**2
    return int(np.ceil(n))


# Example: Conversion rate test
baseline_cr = 0.10  # 10% baseline conversion
lift = 0.10         # 10% relative lift (10% -> 11%)
n_per_group = sample_size_proportions(baseline_cr, lift)
print(f"Sample size per group: {n_per_group:,}")
print(f"Total sample needed: {n_per_group * 2:,}")
print(f"Test duration at 1000 visitors/day: {n_per_group * 2 / 1000:.1f} days")
```
Statistical Significance
P-Value Interpretation
| P-Value | Interpretation | Action |
|---|---|---|
| < 0.01 | Very strong evidence against H₀ | Ship with confidence |
| 0.01 - 0.05 | Strong evidence against H₀ | Likely ship |
| 0.05 - 0.10 | Weak evidence | Consider more data |
| > 0.10 | No significant evidence | Don't ship |
Confidence Intervals
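Confidence Interval for Difference in Proportions
$$CI = (\hat{p}_t - \hat{p}_c) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_c(1 - \hat{p}_c)}{n_c} + \frac{\hat{p}_t(1 - \hat{p}_t)}{n_t}}$$
Here,
- $CI$ = Confidence interval
- $z_{\alpha/2}$ = Critical value for significance level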
Confidence Intervals vs P-Values
Always report confidence intervals alongside p-values. A CI provides both statistical significance (does it exclude zero?) and practical significance (how large is the effect?).
Effect Size Measures
Cohen's h for Proportions
$$h = 2\arcsin\sqrt{p_t} - 2\arcsin\sqrt{p_c}$$
Here,
- $h$ = Effect size
- $p_c, p_t$ = Proportions for each group
Cohen's d for Continuous Metrics
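$$d = \frac{\bar{x}_t - \bar{x}_c}{s_p}, \qquad s_p = \sqrt{\frac{(n_c - 1)\,s_c^2 + (n_t - 1)\,s_t^2}{n_c + n_t - 2}}$$
Here,
- $d$ = Effect size
- $s_p$ = Pooled standard deviation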
| Effect Size | Cohen's d | Cohen's h |
|---|---|---|
| Small | 0.2 | 0.2 |
| Medium | 0.5 | 0.5 |
| Large | 0.8 | 0.8 |
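Both effect sizes are short to compute. The sketch below is a minimal version assuming the standard definitions (arcsine transform for h, pooled standard deviation for d); the function names and sample values are hypothetical.

```python
import math
import numpy as np


def cohens_h(p_c, p_t):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p_t)) - 2 * math.asin(math.sqrt(p_c))


def cohens_d(control, treatment):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    control = np.asarray(control, dtype=float)
    treatment = np.asarray(treatment, dtype=float)
    n_c, n_t = control.size, treatment.size
    pooled_var = ((n_c - 1) * control.var(ddof=1)
                  + (n_t - 1) * treatment.var(ddof=1)) / (n_c + n_t - 2)
    return float((treatment.mean() - control.mean()) / math.sqrt(pooled_var))


h = cohens_h(0.10, 0.11)                       # 10% -> 11% conversion
d = cohens_d([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])  # toy samples
print(f"Cohen's h = {h:.4f}")
print(f"Cohen's d = {d:.4f}")
```

A lift from 10% to 11% gives an h of roughly 0.03. Effects that matter commercially in A/B tests are often much smaller than the conventional "small" threshold, which is why large samples are needed.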
Complete A/B Test Implementation
```python
import numpy as np
from scipy import stats
from dataclasses import dataclass


@dataclass
class ABTestResult:
    control_mean: float
    treatment_mean: float
    lift: float
    p_value: float
    confidence_interval: tuple
    significant: bool
    sample_size_control: int
    sample_size_treatment: int
    power: float


def run_ab_test(control_data, treatment_data, metric_type='continuous', alpha=0.05):
    control_data = np.asarray(control_data, dtype=float)
    treatment_data = np.asarray(treatment_data, dtype=float)
    n_c, n_t = len(control_data), len(treatment_data)

    if metric_type == 'proportion':
        p_c = np.mean(control_data)
        p_t = np.mean(treatment_data)
        # Pooled standard error for the z-test (valid under H0: p_c == p_t)
        p_pooled = (np.sum(control_data) + np.sum(treatment_data)) / (n_c + n_t)
        se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n_c + 1/n_t))
        z_stat = (p_t - p_c) / se
        p_value = 2 * stats.norm.sf(abs(z_stat))
        # Unpooled standard error for the confidence interval
        se_diff = np.sqrt(p_c*(1-p_c)/n_c + p_t*(1-p_t)/n_t)
        z_crit = stats.norm.ppf(1 - alpha/2)
        ci = ((p_t - p_c) - z_crit*se_diff, (p_t - p_c) + z_crit*se_diff)
        lift = (p_t - p_c) / p_c if p_c > 0 else float('inf')
        mean_c, mean_t = p_c, p_t
    else:
        # Welch's t-test with Welch-Satterthwaite degrees of freedom
        mean_c, mean_t = np.mean(control_data), np.mean(treatment_data)
        std_c, std_t = np.std(control_data, ddof=1), np.std(treatment_data, ddof=1)
        se = np.sqrt(std_c**2/n_c + std_t**2/n_t)
        t_stat = (mean_t - mean_c) / se
        df = (std_c**2/n_c + std_t**2/n_t)**2 / (
            (std_c**2/n_c)**2/(n_c-1) + (std_t**2/n_t)**2/(n_t-1))
        p_value = 2 * stats.t.sf(abs(t_stat), df)
        t_crit = stats.t.ppf(1 - alpha/2, df)
        ci = ((mean_t - mean_c) - t_crit*se, (mean_t - mean_c) + t_crit*se)
        lift = (mean_t - mean_c) / mean_c if mean_c > 0 else float('inf')

    # Observed (post-hoc) power from the standardized effect, normal approximation.
    # It is a function of the observed effect, so treat it as descriptive only.
    effect_size = abs(mean_t - mean_c) / np.sqrt(
        (np.var(control_data, ddof=1) + np.var(treatment_data, ddof=1)) / 2)
    power = 1 - stats.norm.cdf(
        stats.norm.ppf(1 - alpha/2) - effect_size * np.sqrt(n_c * n_t / (n_c + n_t)))

    return ABTestResult(
        control_mean=mean_c, treatment_mean=mean_t, lift=lift,
        p_value=p_value, confidence_interval=ci,
        significant=p_value < alpha,
        sample_size_control=n_c, sample_size_treatment=n_t, power=power
    )


# Example
np.random.seed(42)
control = np.random.binomial(1, 0.10, 10000)
treatment = np.random.binomial(1, 0.11, 10000)
result = run_ab_test(control, treatment, metric_type='proportion')
print(f"Control CR: {result.control_mean:.4f}")
print(f"Treatment CR: {result.treatment_mean:.4f}")
print(f"Lift: {result.lift:.2%}")
print(f"P-value: {result.p_value:.4f}")
print(f"Significant: {result.significant}")
print(f"95% CI: ({result.confidence_interval[0]:.4f}, {result.confidence_interval[1]:.4f})")
```
Common Pitfalls
1. Peeking Problem (Optional Stopping)
[Figure: the observed effect fluctuating over time, with an early peak labeled "Stop early? Looks good!"]
Peeking Problem
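Checking results repeatedly and stopping as soon as they look significant inflates the false positive rate well beyond the nominal 5%, to 30% or more when checks are frequent. Either use a sequential testing method designed for interim looks, or run the test to its planned sample size.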
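A small simulation makes the inflation concrete. The sketch below runs A/A tests (no true difference) and compares two decision rules: declare a winner at the first significant interim look, or test only once at the end. The function name, the number of looks, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats


def peeking_false_positive_rate(n_sims=2000, n_per_group=5000, n_peeks=10,
                                p=0.10, alpha=0.05, seed=0):
    """Return (FPR when stopping at any significant look, FPR at the final look)."""
    rng = np.random.default_rng(seed)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    # Equally spaced interim looks, the last one at the full sample size
    checkpoints = np.linspace(n_per_group / n_peeks, n_per_group, n_peeks).astype(int)
    any_look, final_look = 0, 0
    for _ in range(n_sims):
        # A/A test: both groups share the same true conversion rate
        conv_a = np.cumsum(rng.random(n_per_group) < p)[checkpoints - 1]
        conv_b = np.cumsum(rng.random(n_per_group) < p)[checkpoints - 1]
        pooled = (conv_a + conv_b) / (2 * checkpoints)
        se = np.sqrt(pooled * (1 - pooled) * 2 / checkpoints)
        z = np.abs(conv_b - conv_a) / checkpoints / se
        any_look += bool(np.any(z > z_crit))
        final_look += bool(z[-1] > z_crit)
    return any_look / n_sims, final_look / n_sims


peek_rate, final_rate = peeking_false_positive_rate()
print(f"False positive rate with 10 peeks: {peek_rate:.1%}")
print(f"False positive rate, single final test: {final_rate:.1%}")
```

The single final test should land near the nominal 5%, while the peeking rule comes out several times higher. Increasing `n_peeks` pushes the inflated rate up further.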
2. Multiple Comparisons
Family-wise Error Rate
$$\text{FWER} = 1 - (1 - \alpha)^k$$
Here,
- $k$ = Number of comparisons
- $\alpha$ = Significance level per test
| Comparisons (k) | Family-wise Error Rate |
|---|---|
| 2 | 9.75% |
| 5 | 22.6% |
| 10 | 40.1% |
| 20 | 64.2% |
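The table values follow directly from the family-wise error rate formula at α = 0.05, assuming independent tests. The sketch below reproduces them and also shows the Bonferroni correction (testing each comparison at α / k). Bonferroni is not part of the text above; it is included here as one common and conservative remedy.

```python
def family_wise_error_rate(k, alpha=0.05):
    """Probability of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k


def bonferroni_alpha(k, alpha=0.05):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / k


for k in (2, 5, 10, 20):
    uncorrected = family_wise_error_rate(k)
    corrected = family_wise_error_rate(k, bonferroni_alpha(k))
    print(f"k = {k:>2}: uncorrected = {uncorrected:.2%}, "
          f"per-test alpha = {bonferroni_alpha(k):.4f}, corrected = {corrected:.2%}")
```

With the correction applied, the family-wise rate stays at or just below 5% for every k, at the cost of lower power per comparison.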
3. Novelty and Primacy Effects
```
Effect Size
 ^
 |    *
 |   *  *             Novelty effect (new wears off)
 |  *     *  *
 | *           *  *  *  *  *  *  *   True effect
 |*
 +-------------------------------------> Time
 Launch
```
Novelty Effect
New features get a temporary boost that fades over time. Run tests for at least one full business cycle, ideally two (typically 1-2 weeks), so the boost has time to wear off before you read the result.
4. Simpson's Paradox
```
   Overall: A beats B (5% vs 4%)
         |                |
    +----+----+      +----+----+
    | Mobile  |      | Desktop |
    | A: 2%   |      | A: 8%   |
    | B: 3%   |      | B: 7%   |
    | B wins! |      | A wins! |
    +---------+      +---------+
Confounding variable: Traffic source distribution
```
Simpson's Paradox
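Always check results by key segments (device, traffic source, geography). Aggregate results can be misleading when the segment mix differs between variants: a variant can win overall simply because it received more traffic from a high-converting segment.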
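The strongest form of the paradox is a full reversal: one variant wins in every segment yet loses in aggregate. The sketch below constructs such a case with hypothetical counts in which A receives mostly desktop traffic and B mostly mobile traffic. All numbers are invented for illustration.

```python
# Hypothetical (conversions, users) per segment and variant
data = {
    "mobile":  {"A": (20, 1000),  "B": (270, 9000)},
    "desktop": {"A": (720, 9000), "B": (90, 1000)},
}

# Winner within each segment
segment_winners = {}
for segment, variants in data.items():
    rates = {v: conv / users for v, (conv, users) in variants.items()}
    segment_winners[segment] = max(rates, key=rates.get)
    print(f"{segment:>8}: A = {rates['A']:.1%}, B = {rates['B']:.1%} "
          f"-> {segment_winners[segment]} wins")

# Winner after pooling all traffic
overall = {}
for v in ("A", "B"):
    conv = sum(data[s][v][0] for s in data)
    users = sum(data[s][v][1] for s in data)
    overall[v] = conv / users
overall_winner = max(overall, key=overall.get)
print(f" overall: A = {overall['A']:.1%}, B = {overall['B']:.1%} "
      f"-> {overall_winner} wins")
```

B converts better on both mobile and desktop, yet A looks better overall because most of A's users come from the high-converting desktop segment. In a correctly randomized test the segment mix should be nearly identical across variants, so a mismatch this large is itself a sign that assignment or logging is broken.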
Multi-Armed Bandits
Epsilon-Greedy Strategy
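Epsilon-Greedy Action Selection
$$a = \begin{cases} \text{a uniformly random arm} & \text{with probability } \epsilon \\ \arg\max_i \hat{\mu}_i & \text{with probability } 1 - \epsilon \end{cases}$$
Here,
- $\epsilon$ = Exploration probability
- $\hat{\mu}_i$ = Estimated mean reward for arm i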
Upper Confidence Bound (UCB)
Upper Confidence Bound
$$\text{UCB}_i = \hat{\mu}_i + \sqrt{\frac{2 \ln N}{n_i}}$$
Here,
- $N$ = Total rounds
- $n_i$ = Times arm i was pulled
- $\hat{\mu}_i$ = Estimated mean reward for arm i
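A minimal sketch of this rule, assuming the common UCB1 variant with exploration bonus sqrt(2 ln N / n_i) and Bernoulli rewards. The class name `UCB1` and the true conversion rates are hypothetical; the interface mirrors a `select_arm` / `update` pair so it can be swapped with other bandit strategies.

```python
import math
import numpy as np


class UCB1:
    def __init__(self, n_arms):
        self.counts = np.zeros(n_arms, dtype=int)  # n_i: pulls per arm
        self.values = np.zeros(n_arms)             # estimated mean reward per arm

    def select_arm(self):
        # Play every arm once first so each n_i is positive
        untried = np.flatnonzero(self.counts == 0)
        if untried.size > 0:
            return int(untried[0])
        total_rounds = int(self.counts.sum())
        bonus = np.sqrt(2 * math.log(total_rounds) / self.counts)
        return int(np.argmax(self.values + bonus))

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental running mean
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


rng = np.random.default_rng(42)
true_probs = [0.10, 0.15, 0.20, 0.05]  # hypothetical conversion rates
bandit = UCB1(len(true_probs))
for _ in range(10000):
    arm = bandit.select_arm()
    bandit.update(arm, int(rng.random() < true_probs[arm]))
print("Pulls per arm:", bandit.counts)
print("Estimated means:", np.round(bandit.values, 3))
```

Unlike epsilon-greedy, there is no exploration parameter to tune: the bonus term shrinks for an arm as it is pulled more often, so exploration tapers off on its own.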
Thompson Sampling
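$$\theta_i \sim \text{Beta}(\alpha_i, \beta_i)$$
Here,
- $\theta_i$ = Sampled reward rate for arm i (the arm with the largest sample is played)
- $\alpha_i$ = Prior successes for arm i
- $\beta_i$ = Prior failures for arm i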
Multi-Armed Bandit Implementation
```python
import numpy as np


class EpsilonGreedy:
    def __init__(self, n_arms, epsilon=0.1):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)

    def select_arm(self):
        # Explore with probability epsilon, otherwise exploit the best estimate
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_arms)
        return np.argmax(self.values)

    def update(self, arm, reward):
        # Incremental update of the running mean reward
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] = self.values[arm] * (n-1)/n + reward/n


class ThompsonSampling:
    def __init__(self, n_arms):
        self.n_arms = n_arms
        # Beta(1, 1) is a uniform prior on each arm's success rate
        self.alpha = np.ones(n_arms)
        self.beta = np.ones(n_arms)

    def select_arm(self):
        samples = [np.random.beta(self.alpha[i], self.beta[i])
                   for i in range(self.n_arms)]
        return np.argmax(samples)

    def update(self, arm, reward):
        if reward == 1:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1


# Simulation
def simulate_bandit(algorithm, true_probs, n_rounds=1000):
    rewards = []
    for _ in range(n_rounds):
        arm = algorithm.select_arm()
        reward = int(np.random.random() < true_probs[arm])  # Bernoulli reward
        algorithm.update(arm, reward)
        rewards.append(reward)
    return np.array(rewards)


np.random.seed(42)
true_probs = [0.10, 0.15, 0.20, 0.05]
for name, algo in [('Epsilon-Greedy', EpsilonGreedy(4, 0.1)),
                   ('Thompson Sampling', ThompsonSampling(4))]:
    rewards = simulate_bandit(algo, true_probs, n_rounds=10000)
    print(f"{name}: Total reward = {rewards.sum():.0f}, Avg = {rewards.mean():.4f}")
```
Experimentation Platform Architecture
```
+-------------------+     +-------------------+     +-------------------+
| Experiment        |     | Traffic           |     | Assignment        |
| Configuration     | --> | Allocation        | --> | Service           |
| (YAML/DB)         |     | (% by variant)    |     | (user -> variant) |
+-------------------+     +-------------------+     +-------------------+
          |                         |                         |
          v                         v                         v
+-------------------+     +-------------------+     +-------------------+
| Hypothesis        |     | Event             |     | Statistical       |
| Registry          |     | Logging           |     | Engine            |
+-------------------+     +-------------------+     +-------------------+
                                                              |
                                                              v
                                                    +-------------------+
                                                    | Results           |
                                                    | Dashboard         |
                                                    +-------------------+
```
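One common way to build the assignment step (user -> variant) is deterministic hashing, so the same user always sees the same variant without storing any state. The sketch below is one possible approach and not a description of any particular platform. The function name `assign_variant`, the experiment ID, and the allocation format are all hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment_id, allocation):
    """Deterministically map a user to a variant.

    allocation: dict of variant name -> traffic share, with shares summing to 1.0.
    """
    # Salting with the experiment ID keeps assignments independent across experiments
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    # Map the first 32 bits of the hash to a number in [0, 1)
    bucket = int(hashlib.sha256(key).hexdigest()[:8], 16) / 0x100000000
    cumulative = 0.0
    for variant, share in allocation.items():
        cumulative += share
        if bucket < cumulative:
            return variant
    # Fallback for floating-point rounding when shares sum to slightly under 1.0
    return list(allocation)[-1]


allocation = {"control": 0.5, "treatment": 0.5}
counts = {"control": 0, "treatment": 0}
for uid in range(10000):
    counts[assign_variant(uid, "checkout_button_v2", allocation)] += 1
print(counts)
print(assign_variant(123, "checkout_button_v2", allocation))
```

Because the assignment depends only on the user and experiment IDs, it can be recomputed anywhere (client, server, or analysis job) and will always agree. Comparing the observed split against the configured allocation is also a cheap check for sample ratio mismatch.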
Key Takeaways
Summary: A/B Testing
- Randomization is critical: Without proper randomization, results are unreliable
- Sample size matters: Calculate it before running the test; underpowered tests waste time
- Don't peek: Checking results early inflates the false positive rate from 5% to 30%+
- Check segments: Aggregate results can be misleading (Simpson's paradox)
- Run long enough: Cover at least one full business cycle (typically 1-2 weeks)
- Multi-armed bandits are better for optimization; A/B tests are better for causal learning
- Pre-registration: Define hypotheses, metrics, and sample size before starting
- Report effect sizes and confidence intervals, not just p-values
Practice Exercises
- Calculate the required sample size for baseline conversion 5%, MDE 10% relative lift, α=0.05, power=80%
- Implement a Thompson Sampling algorithm and compare against epsilon-greedy
- Research and implement a sequential testing framework
- When would you choose a multi-armed bandit over a traditional A/B test?
- How do you handle experiments where the metric takes weeks to mature?