A/B Testing + Experimentation
Overview
A/B testing (also called split testing or online controlled experimentation) is the standard method for making data-driven decisions in product development, marketing, and business strategy. It randomly assigns users to two or more variants to measure the causal impact of a change. Rooted in Fisher's framework of hypothesis testing, an A/B test is a randomized controlled trial applied to a product, so it supports the same kind of causal claims that randomized controlled trials support in medicine.
A/B Testing Fundamentals
What is an A/B Test?
Definition: A/B Test
An A/B test is a randomized controlled experiment that compares two versions (control and treatment) to determine which performs better, using statistical inference to determine whether observed differences are due to the treatment or random chance.
                      Traffic
                         |
              +----------+----------+
              |                     |
      Random Assignment     Random Assignment
              |                     |
              v                     v
       +-------------+       +-------------+
       |   Control   |       |  Treatment  |
       |     (A)     |       |     (B)     |
       |  Original   |       | New Version |
       +------+------+       +------+------+
              |                     |
              v                     v
       +-------------+       +-------------+
       |   Metrics   |       |   Metrics   |
       |   Collect   |       |   Collect   |
       +------+------+       +------+------+
              |                     |
              +----------+----------+
                         |
                         v
                Statistical Analysis
            (Is B significantly better?)
ℹ️ Credibility Hierarchy
A/B tests sit near the top of the evidence credibility hierarchy. Unlike observational studies, proper randomization balances both observed and unobserved confounders across groups in expectation, which is what allows causal conclusions.
Key Concepts
| Term | Definition |
|---|---|
| Control (A) | The baseline/original version |
| Treatment (B) | The modified version being tested |
| Unit of Analysis | User, session, pageview, etc. |
| Randomization | Random assignment to eliminate confounders |
| Sample Size | Number of observations needed for reliable results |
| Statistical Significance | The observed effect would be unlikely if the null hypothesis were true (p-value below the chosen significance level) |
| Effect Size | Magnitude of the difference between variants |
Hypothesis Testing Framework
Null Hypothesis
$$H_0: \mu_C = \mu_T$$
Here,
- $H_0$ = Null hypothesis
- $\mu_C$ = Mean of control group
- $\mu_T$ = Mean of treatment group
Alternative Hypothesis (Two-sided)
$$H_1: \mu_C \neq \mu_T$$
Here,
- $H_1$ = Alternative hypothesis
Or one-sided:
Alternative Hypothesis (One-sided)
$$H_1: \mu_T > \mu_C$$
Here,
- $H_1$ = Alternative hypothesis
Types of A/B Tests
A/B Testing Types
  |
  +--> Conversion Rate Tests (click-through rate, sign-ups)
  |
  +--> Revenue Tests (average order value, LTV)
  |
  +--> Engagement Tests (time on page, bounce rate)
  |
  +--> Funnel Tests (multi-step conversions)
  |
  +--> Multivariate Tests (multiple changes simultaneously)
Statistical Foundations
Z-Test for Proportions
For conversion rate tests, the test statistic is:
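Z-Test Statistic for Proportions
$$Z = \frac{\hat{p}_T - \hat{p}_C}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_C} + \frac{1}{n_T}\right)}}$$
Here,
- $Z$ = Test statistic
- $\hat{p}_C$ = Sample proportion for control
- $\hat{p}_T$ = Sample proportion for treatment
- $\hat{p}$ = Pooled proportion
- $n_C, n_T$ = Sample sizes for each group
where the pooled proportion is:
Pooled Proportion
$$\hat{p} = \frac{x_C + x_T}{n_C + n_T}$$
Here,
- $x_C, x_T$ = Number of successes in each group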
T-Test for Continuous Metrics
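For continuous metrics (revenue, time on page):
$$t = \frac{\bar{x}_T - \bar{x}_C}{\sqrt{\frac{s_C^2}{n_C} + \frac{s_T^2}{n_T}}}$$
with degrees of freedom (Welch's approximation):
Welch-Satterthwaite Degrees of Freedom
$$\nu = \frac{\left(\frac{s_C^2}{n_C} + \frac{s_T^2}{n_T}\right)^2}{\frac{(s_C^2/n_C)^2}{n_C - 1} + \frac{(s_T^2/n_T)^2}{n_T - 1}}$$
Here,
- $\nu$ = Degrees of freedom
- $\bar{x}_C, \bar{x}_T$ = Sample means for each group
- $s_C^2, s_T^2$ = Sample variances for each group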
ℹ️ Why Welch's over Student's t-test
Welch's t-test does not assume equal variances between groups, making it more robust for real-world A/B tests where control and treatment groups may have different variance structures.
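As a minimal sketch, SciPy's `stats.ttest_ind` runs Welch's test when `equal_var=False`. The revenue samples below are simulated and purely illustrative; the group sizes and spreads are assumptions chosen so that the two tests visibly disagree.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical revenue samples with unequal sizes and unequal spreads
control = rng.normal(loc=50.0, scale=10.0, size=4000)
treatment = rng.normal(loc=51.0, scale=20.0, size=1000)

# equal_var=False selects Welch's t-test; equal_var=True is Student's t-test
welch = stats.ttest_ind(treatment, control, equal_var=False)
student = stats.ttest_ind(treatment, control, equal_var=True)

print(f"Welch:   t={welch.statistic:.3f}, p={welch.pvalue:.4f}")
print(f"Student: t={student.statistic:.3f}, p={student.pvalue:.4f}")
```

When group sizes are equal the two t-statistics coincide and only the degrees of freedom differ; the gap matters most when the smaller group also has the larger variance, as in this sketch.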
Sample Size Calculation
For Conversion Rate Tests
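The required sample size per group is:
$$n = \frac{(z_{1-\alpha/2} + z_{1-\beta})^2 \left[p_C(1 - p_C) + p_T(1 - p_T)\right]}{(p_T - p_C)^2}$$
For Continuous Metrics
Sample Size for Continuous Metrics
$$n = \frac{2 (z_{1-\alpha/2} + z_{1-\beta})^2 \sigma^2}{\delta^2}$$
Here,
- $n$ = Sample size per group
- $\sigma$ = Standard deviation of the metric
- $\delta$ = Minimum detectable effect (MDE)
Minimum Detectable Effect (MDE)
Minimum Detectable Effect
$$\delta = (z_{1-\alpha/2} + z_{1-\beta}) \sqrt{\frac{2\sigma^2}{n}}$$
Here,
- $\delta$ = Minimum detectable effect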
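The MDE relationship can also be used in reverse: given the traffic you can afford, what is the smallest effect the test can detect? The sketch below assumes a two-sided test on a continuous metric with equal group sizes; the function name `minimum_detectable_effect` is illustrative, not from any library.

```python
import numpy as np
from scipy import stats

def minimum_detectable_effect(std_dev, n_per_group, alpha=0.05, power=0.80):
    """Smallest absolute difference detectable with n_per_group users per arm."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    return (z_alpha + z_beta) * np.sqrt(2 * std_dev**2 / n_per_group)

# Hypothetical inputs: $50 standard deviation, 10,000 users per group
mde = minimum_detectable_effect(std_dev=50.0, n_per_group=10_000)
print(f"MDE with 10,000 users per group: ${mde:.2f}")

# Quadrupling the sample size only halves the MDE
mde_4x = minimum_detectable_effect(std_dev=50.0, n_per_group=40_000)
print(f"MDE with 40,000 users per group: ${mde_4x:.2f}")
```

Because the MDE shrinks with the square root of n, detecting an effect half as large needs four times the traffic.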
Power Analysis Visualization
Figure: Power vs Sample Size. Power (y-axis, 0.0 to 1.0) rises steeply with sample size (x-axis, 0 to 3,500), crosses the 0.8 target power line, and then flattens toward 1.0.
Key insight: Diminishing returns as n increases
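The shape of the power curve can be reproduced numerically. This is a rough sketch using the normal approximation for a two-sided test of two proportions; the 10% and 11% conversion rates and the helper name `power_two_proportions` are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def power_two_proportions(p_control, p_treatment, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test for two proportions."""
    se = np.sqrt(p_control * (1 - p_control) / n_per_group
                 + p_treatment * (1 - p_treatment) / n_per_group)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    # Ignores the negligible chance of rejecting in the wrong direction
    return stats.norm.cdf(abs(p_treatment - p_control) / se - z_crit)

sample_sizes = [1000, 5000, 10000, 15000, 20000, 30000]
powers = [power_two_proportions(0.10, 0.11, n) for n in sample_sizes]
for n, pw in zip(sample_sizes, powers):
    print(f"n={n:>6,} per group: power={pw:.2f}")
```

Each additional block of users buys less power than the previous one, which is the diminishing-returns pattern noted above.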
Sample Size Calculator
import numpy as np
from scipy import stats

def sample_size_proportions(p_control, mde_relative, alpha=0.05, power=0.80):
    """
    Calculate sample size for A/B test with conversion rates.

    Args:
        p_control: Baseline conversion rate
        mde_relative: Relative minimum detectable effect (e.g., 0.10 for 10% lift)
        alpha: Significance level
        power: Statistical power

    Returns:
        Sample size per group
    """
    p_treatment = p_control * (1 + mde_relative)
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    n = (z_alpha + z_beta)**2 * (
        p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    ) / (p_treatment - p_control)**2
    return int(np.ceil(n))

def sample_size_continuous(std_dev, mde_absolute, alpha=0.05, power=0.80):
    """
    Calculate sample size for A/B test with continuous metrics.

    Args:
        std_dev: Standard deviation of the metric
        mde_absolute: Minimum detectable effect (absolute)
        alpha: Significance level
        power: Statistical power

    Returns:
        Sample size per group
    """
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    n = 2 * (z_alpha + z_beta)**2 * std_dev**2 / mde_absolute**2
    return int(np.ceil(n))

# Example: Conversion rate test
baseline_cr = 0.10  # 10% baseline conversion
lift = 0.10         # Want to detect 10% relative lift (10% -> 11%)
n_per_group = sample_size_proportions(baseline_cr, lift)
print(f"Sample size per group: {n_per_group:,}")
print(f"Total sample needed: {n_per_group * 2:,}")
print(f"Test duration at 1000 visitors/day: {n_per_group * 2 / 1000:.1f} days")

# Example: Revenue test
revenue_std = 50.0  # $50 std dev in revenue
min_effect = 2.0    # Want to detect $2 increase
n_per_group_rev = sample_size_continuous(revenue_std, min_effect)
print(f"\nRevenue test - Sample per group: {n_per_group_rev:,}")
Statistical Significance
P-Value
Definition: P-Value
The p-value is the probability of observing a result as extreme as (or more extreme than) the data, assuming the null hypothesis is true. It quantifies the evidence against the null hypothesis.
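P-Value Definition
$$p = P\left(|T| \geq |t_{\text{obs}}| \mid H_0\right)$$
Here,
- $p$ = P-value
- $T$ = Test statistic under the null hypothesis
- $t_{\text{obs}}$ = Observed value of the test statistic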
Interpretation
| P-Value | Interpretation | Action |
|---|---|---|
| < 0.01 | Very strong evidence against $H_0$ | Ship with confidence |
| 0.01 - 0.05 | Strong evidence against $H_0$ | Likely ship |
| 0.05 - 0.10 | Weak evidence against $H_0$ | Consider more data |
| > 0.10 | No significant evidence against $H_0$ | Don't ship |
Confidence Intervals
For the difference in proportions:
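Confidence Interval for Difference in Proportions
$$CI = (\hat{p}_T - \hat{p}_C) \pm z_{1-\alpha/2} \sqrt{\frac{\hat{p}_C(1 - \hat{p}_C)}{n_C} + \frac{\hat{p}_T(1 - \hat{p}_T)}{n_T}}$$
Here,
- $CI$ = Confidence interval
- $z_{1-\alpha/2}$ = Critical value for significance level $\alpha$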
💡 Confidence Intervals vs P-Values
Always report confidence intervals alongside p-values. A CI conveys both statistical significance (does it exclude zero?) and practical significance (how large is the effect?). A narrow CI centered at zero is strong evidence that any effect is too small to matter in practice.
Effect Size Measures
Cohen's h for proportions:
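Cohen's h for Proportions
$$h = 2\arcsin\sqrt{\hat{p}_T} - 2\arcsin\sqrt{\hat{p}_C}$$
Here,
- $h$ = Effect size
- $\hat{p}_C, \hat{p}_T$ = Proportions for each group
Cohen's d for continuous metrics:
Cohen's d for Continuous Metrics
$$d = \frac{\bar{x}_T - \bar{x}_C}{s_p}, \qquad s_p = \sqrt{\frac{(n_C - 1)s_C^2 + (n_T - 1)s_T^2}{n_C + n_T - 2}}$$
Here,
- $d$ = Effect size
- $s_p$ = Pooled standard deviation
| Effect Size | Cohen's d | Cohen's h |
|---|---|---|
| Small | 0.2 | 0.2 |
| Medium | 0.5 | 0.5 |
| Large | 0.8 | 0.8 |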
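Both effect sizes take only a few lines with NumPy. This is a minimal sketch; the function names are illustrative, and the 10% versus 11% conversion rates are assumed example values.

```python
import numpy as np

def cohens_h(p_treatment, p_control):
    """Effect size for two proportions via the arcsine transformation."""
    return 2 * np.arcsin(np.sqrt(p_treatment)) - 2 * np.arcsin(np.sqrt(p_control))

def cohens_d(treatment, control):
    """Standardized mean difference using the pooled standard deviation."""
    n_t, n_c = len(treatment), len(control)
    pooled_var = ((n_t - 1) * np.var(treatment, ddof=1)
                  + (n_c - 1) * np.var(control, ddof=1)) / (n_t + n_c - 2)
    return (np.mean(treatment) - np.mean(control)) / np.sqrt(pooled_var)

h = cohens_h(0.11, 0.10)
d = cohens_d(np.array([2.0, 4.0, 6.0]), np.array([1.0, 3.0, 5.0]))
print(f"Cohen's h for 10% -> 11%: {h:.3f}")
print(f"Cohen's d for the toy samples: {d:.2f}")
```

A one-point lift on a 10% baseline gives an h of roughly 0.03, far below the conventional "small" threshold. Effects of this size are typical of product experiments, which is why they need large samples.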
Complete A/B Test Implementation
import numpy as np
import pandas as pd
from scipy import stats
from dataclasses import dataclass
from typing import Optional

@dataclass
class ABTestResult:
    """Results of an A/B test."""
    control_mean: float
    treatment_mean: float
    lift: float
    p_value: float
    confidence_interval: tuple
    significant: bool
    sample_size_control: int
    sample_size_treatment: int
    power: float

def run_ab_test(
    control_data: np.ndarray,
    treatment_data: np.ndarray,
    metric_type: str = 'continuous',
    alpha: float = 0.05,
    power_threshold: float = 0.80
) -> ABTestResult:
    """
    Run a comprehensive A/B test.

    Args:
        control_data: Metrics for control group
        treatment_data: Metrics for treatment group
        metric_type: 'continuous' or 'proportion'
        alpha: Significance level
        power_threshold: Minimum power required

    Returns:
        ABTestResult with all statistics
    """
    n_c, n_t = len(control_data), len(treatment_data)
    if metric_type == 'proportion':
        # Z-test for proportions (pooled standard error under H0)
        p_c = np.mean(control_data)
        p_t = np.mean(treatment_data)
        p_pooled = (np.sum(control_data) + np.sum(treatment_data)) / (n_c + n_t)
        se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n_c + 1/n_t))
        z_stat = (p_t - p_c) / se
        p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))
        # Confidence interval (unpooled standard error)
        se_diff = np.sqrt(p_c*(1-p_c)/n_c + p_t*(1-p_t)/n_t)
        z_crit = stats.norm.ppf(1 - alpha/2)
        ci = ((p_t - p_c) - z_crit*se_diff, (p_t - p_c) + z_crit*se_diff)
        lift = (p_t - p_c) / p_c if p_c > 0 else float('inf')
        mean_c, mean_t = p_c, p_t
    else:
        # Welch's t-test for continuous metrics
        mean_c, mean_t = np.mean(control_data), np.mean(treatment_data)
        std_c, std_t = np.std(control_data, ddof=1), np.std(treatment_data, ddof=1)
        se = np.sqrt(std_c**2/n_c + std_t**2/n_t)
        t_stat = (mean_t - mean_c) / se
        # Welch-Satterthwaite degrees of freedom
        df = (std_c**2/n_c + std_t**2/n_t)**2 / (
            (std_c**2/n_c)**2/(n_c-1) + (std_t**2/n_t)**2/(n_t-1)
        )
        p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))
        # Confidence interval
        t_crit = stats.t.ppf(1 - alpha/2, df)
        ci = ((mean_t - mean_c) - t_crit*se, (mean_t - mean_c) + t_crit*se)
        lift = (mean_t - mean_c) / mean_c if mean_c > 0 else float('inf')

    # Calculate observed (post hoc) power with a normal approximation
    effect_size = abs(mean_t - mean_c) / np.sqrt((np.var(control_data) + np.var(treatment_data)) / 2)
    power = 1 - stats.norm.cdf(
        stats.norm.ppf(1 - alpha/2) - effect_size * np.sqrt(n_c * n_t / (n_c + n_t))
    )
    return ABTestResult(
        control_mean=mean_c,
        treatment_mean=mean_t,
        lift=lift,
        p_value=p_value,
        confidence_interval=ci,
        significant=p_value < alpha,
        sample_size_control=n_c,
        sample_size_treatment=n_t,
        power=power
    )

# Example: Simulate and test
np.random.seed(42)
control = np.random.binomial(1, 0.10, 10000)    # 10% conversion
treatment = np.random.binomial(1, 0.11, 10000)  # 11% conversion
result = run_ab_test(control, treatment, metric_type='proportion')
print("=== A/B Test Results ===")
print(f"Control Conversion: {result.control_mean:.4f}")
print(f"Treatment Conversion: {result.treatment_mean:.4f}")
print(f"Lift: {result.lift:.2%}")
print(f"P-value: {result.p_value:.4f}")
print(f"Significant: {result.significant}")
print(f"95% CI: ({result.confidence_interval[0]:.4f}, {result.confidence_interval[1]:.4f})")
print(f"Power: {result.power:.2%}")
Common Pitfalls
1. Peeking Problem (Optional Stopping)
Problem: Checking results before reaching target sample size
Figure: the observed effect fluctuates over time; stopping at an early peak because it "looks good" mistakes noise for a real effect.
⚠️ Peeking Problem
Repeatedly checking results and stopping at the first significant one can inflate the false positive rate from the nominal 5% to 30% or more, depending on how often you look. Either use a sequential testing method designed for interim looks or run the test to its pre-computed sample size.
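The inflation from peeking can be checked with a small A/A simulation, where both groups share the same true conversion rate and every "significant" result is therefore a false positive. This is a sketch under assumed settings (10 evenly spaced looks, 10% conversion, 2,000 users per group); the exact rate depends on how many looks are taken.

```python
import numpy as np
from scipy import stats

def false_positive_rates(n_sims=2000, n_per_group=2000, n_looks=10,
                         alpha=0.05, seed=0):
    """Simulate A/A tests; compare stopping at any look vs. one final look."""
    rng = np.random.default_rng(seed)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    checkpoints = np.linspace(n_per_group // n_looks, n_per_group,
                              n_looks, dtype=int)
    peeking_hits = final_hits = 0
    for _ in range(n_sims):
        a = rng.binomial(1, 0.10, n_per_group)
        b = rng.binomial(1, 0.10, n_per_group)
        significant = []
        for n in checkpoints:
            p_pool = (a[:n].sum() + b[:n].sum()) / (2 * n)
            se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
            z = (b[:n].mean() - a[:n].mean()) / se if se > 0 else 0.0
            significant.append(abs(z) > z_crit)
        peeking_hits += int(any(significant))  # stop at first significant look
        final_hits += int(significant[-1])     # look once, at the end
    return peeking_hits / n_sims, final_hits / n_sims

peek_rate, final_rate = false_positive_rates()
print(f"False positive rate with peeking:  {peek_rate:.1%}")
print(f"False positive rate, single look:  {final_rate:.1%}")
```

The single-look rate stays near the nominal 5%, while the peeking rate is several times higher even though no real effect exists.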
2. Multiple Comparisons
Testing many variants inflates Type I error:
Family-wise Error Rate
$$\text{FWER} = 1 - (1 - \alpha)^k$$
Here,
- $k$ = Number of comparisons
- $\alpha$ = Significance level per test
| Comparisons (k) | Family-wise Error Rate |
|---|---|
| 2 | 9.75% |
| 5 | 22.6% |
| 10 | 40.1% |
| 20 | 64.2% |
Solutions:
- Bonferroni correction: $\alpha_{\text{adj}} = \alpha / k$
- Benjamini-Hochberg (FDR control)
- Sequential testing
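The first two corrections can be sketched in a few lines of NumPy. The p-values below are hypothetical, and the function names are illustrative rather than taken from a library.

```python
import numpy as np

def bonferroni(p_values, alpha=0.05):
    """Reject only p-values below alpha divided by the number of tests."""
    p = np.asarray(p_values)
    return p < alpha / len(p)

def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure that controls the false discovery rate."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest rank meeting its threshold
        reject[order[:k + 1]] = True
    return reject

p_vals = [0.001, 0.012, 0.025, 0.035, 0.20]  # hypothetical variant p-values
print("Bonferroni rejects:        ", bonferroni(p_vals).tolist())
print("Benjamini-Hochberg rejects:", benjamini_hochberg(p_vals).tolist())
```

Bonferroni controls the chance of any false positive and is the more conservative of the two; Benjamini-Hochberg tolerates a bounded share of false discoveries in exchange for more power.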
3. Novelty and Primacy Effects
Figure: Effect size over time after launch. The measured effect spikes right after launch (novelty effect, which wears off) and then settles at the lower, stable true effect.
ℹ️ Novelty Effect
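New features often get a temporary boost that fades over time (novelty). The reverse also happens: changes to a familiar workflow can start with a temporary dip while existing users adapt (primacy). Run tests for at least 1-2 full business cycles (typically 2 weeks).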
4. Simpson's Paradox
Aggregate results can contradict segment-level results:
        Overall: A beats B (5% vs 4%)
            |                  |
       +----+----+        +----+----+
       | Mobile  |        | Desktop |
       | A: 2%   |        | A: 8%   |
       | B: 3%   |        | B: 9%   |
       | B wins! |        | B wins! |
       +---------+        +---------+
Confounding variable: Traffic distribution across segments (A's traffic is half desktop, B's is mostly mobile)
⚠️ Simpson's Paradox
Always check results by key segments (device, traffic source, geography). Aggregate results can be misleading.
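A short arithmetic check shows how a reversal can arise. The rates and traffic counts below are hypothetical, chosen so that B converts better on every device yet worse overall because most of B's traffic is low-converting mobile.

```python
# Hypothetical per-segment conversion rates: B is better on both devices
rates = {
    "A": {"mobile": 0.02, "desktop": 0.08},
    "B": {"mobile": 0.03, "desktop": 0.09},
}
# Hypothetical traffic: A is half desktop, B is mostly mobile
users = {
    "A": {"mobile": 5000, "desktop": 5000},
    "B": {"mobile": 10000, "desktop": 2000},
}

overall = {}
for variant in rates:
    conversions = sum(rates[variant][seg] * users[variant][seg]
                      for seg in users[variant])
    overall[variant] = conversions / sum(users[variant].values())
    print(f"{variant}: overall conversion = {overall[variant]:.2%}")
```

In a correctly randomized test the segment mix should be nearly identical across groups, so a large imbalance like this one is itself a sign that assignment went wrong.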
5. Network Effects
Users in treatment interact with users in control, contaminating results.
ℹ️ Network Effects
Use cluster randomization (randomize by social group, geography) to avoid contamination between control and treatment.
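One way to sketch cluster randomization is to hash the cluster identifier instead of the user identifier, so everyone in a cluster lands in the same variant. The cities, user IDs, and experiment name below are hypothetical, and `assign_cluster` is an illustrative helper rather than a library function.

```python
import hashlib

def assign_cluster(cluster_id: str, experiment: str,
                   treatment_share: float = 0.5) -> str:
    """Deterministically map a whole cluster to one variant."""
    digest = hashlib.sha256(f"{experiment}:{cluster_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

# Hypothetical users tagged with the city (cluster) they belong to
users = [("u1", "berlin"), ("u2", "berlin"),
         ("u3", "lagos"), ("u4", "lagos"), ("u5", "lima")]
assignments = {user: assign_cluster(city, "new_feed_ranking")
               for user, city in users}
for user, city in users:
    print(f"{user} ({city}): {assignments[user]}")
```

The analysis then has to treat the cluster, not the user, as the unit of randomization, which reduces the effective sample size.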
6. Selection Bias
# Example: Detecting selection bias
import pandas as pd
import numpy as np

np.random.seed(42)
n = 10000
user_type = np.random.choice(['new', 'returning'], n, p=[0.3, 0.7])

# Proper randomization: assignment is independent of user type
assignment = np.random.choice(['A', 'B'], n)

# Biased assignment: new users are sent to treatment (B) 70% of the time
p_treatment = np.where(user_type == 'new', 0.7, 0.5)
biased_assignment = np.where(np.random.random(n) < p_treatment, 'B', 'A')

# Check balance: the user-type mix should match across groups
for label, groups in [('randomized', assignment), ('biased', biased_assignment)]:
    df = pd.DataFrame({'group': groups, 'user_type': user_type})
    print(f"Group balance check ({label}):")
    print(pd.crosstab(df['group'], df['user_type'], normalize='index'))
Multi-Armed Bandits
Concept
Instead of fixed 50/50 split, dynamically allocate more traffic to better-performing variants:
Figure: Traditional A/B test vs multi-armed bandit. The traditional A/B test keeps a fixed 50/50 split between A and B throughout; the multi-armed bandit uses adaptive allocation and sends more traffic to the winner over time.
Epsilon-Greedy Strategy
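Epsilon-Greedy Action Selection
$$a_t = \begin{cases} \text{random arm} & \text{with probability } \epsilon \\ \arg\max_i \hat{\mu}_i & \text{with probability } 1 - \epsilon \end{cases}$$
Here,
- $\epsilon$ = Exploration probability
Upper Confidence Bound (UCB)
Upper Confidence Bound
$$a_t = \arg\max_i \left[\hat{\mu}_i + \sqrt{\frac{2 \ln t}{n_i}}\right]$$
Here,
- $t$ = Total rounds
- $n_i$ = Times arm i was pulled
- $\hat{\mu}_i$ = Estimated mean reward for arm i
Thompson Sampling
Maintain posterior distribution for each arm and sample from them:
$$\theta_i \sim \text{Beta}(\alpha_i, \beta_i), \qquad a_t = \arg\max_i \theta_i$$
where $\alpha_i$ is the number of successes plus one and $\beta_i$ is the number of failures plus one for arm i.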
Multi-Armed Bandit Implementation
import numpy as np
from scipy import stats

class EpsilonGreedy:
    """Epsilon-Greedy multi-armed bandit."""
    def __init__(self, n_arms, epsilon=0.1):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)

    def select_arm(self):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_arms)
        return np.argmax(self.values)

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] = self.values[arm] * (n-1)/n + reward/n

class ThompsonSampling:
    """Thompson Sampling for binary rewards."""
    def __init__(self, n_arms):
        self.n_arms = n_arms
        self.alpha = np.ones(n_arms)  # successes + 1
        self.beta = np.ones(n_arms)   # failures + 1

    def select_arm(self):
        samples = [np.random.beta(self.alpha[i], self.beta[i])
                   for i in range(self.n_arms)]
        return np.argmax(samples)

    def update(self, arm, reward):
        if reward == 1:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1

class UCB1:
    """Upper Confidence Bound algorithm."""
    def __init__(self, n_arms):
        self.n_arms = n_arms
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)
        self.total_counts = 0

    def select_arm(self):
        # Ensure each arm is tried at least once
        for i in range(self.n_arms):
            if self.counts[i] == 0:
                return i
        ucb_values = self.values + np.sqrt(2 * np.log(self.total_counts) / self.counts)
        return np.argmax(ucb_values)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.total_counts += 1
        n = self.counts[arm]
        self.values[arm] = self.values[arm] * (n-1)/n + reward/n

# Simulation
def simulate_bandit(algorithm, true_probs, n_rounds=1000):
    """Simulate bandit algorithm performance."""
    rewards = []
    arms_selected = []
    for _ in range(n_rounds):
        arm = algorithm.select_arm()
        # Bernoulli reward stored as 0/1 so every update rule sees a number
        reward = int(np.random.random() < true_probs[arm])
        algorithm.update(arm, reward)
        rewards.append(reward)
        arms_selected.append(arm)
    return np.array(rewards), np.array(arms_selected)

# Compare algorithms
np.random.seed(42)
true_probs = [0.10, 0.15, 0.20, 0.05]  # True conversion rates
algorithms = {
    'Epsilon-Greedy (0.1)': EpsilonGreedy(4, epsilon=0.1),
    'Thompson Sampling': ThompsonSampling(4),
    'UCB1': UCB1(4)
}
for name, algo in algorithms.items():
    rewards, arms = simulate_bandit(algo, true_probs, n_rounds=10000)
    print(f"\n{name}:")
    print(f"  Total reward: {rewards.sum():.0f}")
    print(f"  Average reward: {rewards.mean():.4f}")
    print(f"  Most selected arm: {np.argmax(np.bincount(arms))} "
          f"(true best: {np.argmax(true_probs)})")
Real-World Example: E-commerce Button Test
import numpy as np
import pandas as pd
# proportions_ztest lives in statsmodels, not scipy.stats
from statsmodels.stats.proportion import proportions_ztest

np.random.seed(42)

# Simulate e-commerce A/B test
n_users = 50000

# User segments
segments = np.random.choice(
    ['new', 'returning', 'vip'],
    n_users,
    p=[0.5, 0.35, 0.15]
)

# Base conversion rates by segment
base_rates = {'new': 0.05, 'returning': 0.12, 'vip': 0.25}

# Treatment effect (relative lift) varies by segment
treatment_lift = {'new': 0.15, 'returning': 0.08, 'vip': 0.05}

# Assign groups
groups = np.random.choice(['control', 'treatment'], n_users)

# Generate conversions
conversions = np.zeros(n_users)
for i in range(n_users):
    seg = segments[i]
    if groups[i] == 'control':
        conversions[i] = np.random.random() < base_rates[seg]
    else:
        conversions[i] = np.random.random() < base_rates[seg] * (1 + treatment_lift[seg])

# Create DataFrame
df = pd.DataFrame({
    'user_id': range(n_users),
    'group': groups,
    'segment': segments,
    'converted': conversions.astype(int)
})

# Overall results
print("=== Overall Results ===")
ctrl = df[df['group'] == 'control']
treat = df[df['group'] == 'treatment']
ctrl_cr = ctrl['converted'].mean()
treat_cr = treat['converted'].mean()
lift = (treat_cr - ctrl_cr) / ctrl_cr
print(f"Control CR: {ctrl_cr:.4f}")
print(f"Treatment CR: {treat_cr:.4f}")
print(f"Lift: {lift:.2%}")

# Statistical test (two-sided z-test for two proportions)
z_stat, p_value = proportions_ztest(
    [treat['converted'].sum(), ctrl['converted'].sum()],
    [len(treat), len(ctrl)]
)
print(f"Z-statistic: {z_stat:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Significant at 0.05: {p_value < 0.05}")

# Segment-level results
print("\n=== Segment-Level Results ===")
for seg in ['new', 'returning', 'vip']:
    seg_data = df[df['segment'] == seg]
    seg_ctrl = seg_data[seg_data['group'] == 'control']['converted'].mean()
    seg_treat = seg_data[seg_data['group'] == 'treatment']['converted'].mean()
    seg_lift = (seg_treat - seg_ctrl) / seg_ctrl
    print(f"{seg:>10}: Control={seg_ctrl:.4f}, Treatment={seg_treat:.4f}, Lift={seg_lift:.2%}")
Experimentation Platform Architecture
+-------------------+     +-------------------+     +-------------------+
|    Experiment     |     |      Traffic      |     |    Assignment     |
|   Configuration   | --> |    Allocation     | --> |      Service      |
|     (YAML/DB)     |     |  (% by variant)   |     | (user -> variant) |
+-------------------+     +-------------------+     +-------------------+
          |                         |                         |
          v                         v                         v
+-------------------+     +-------------------+     +-------------------+
|    Hypothesis     |     |       Event       |     |    Statistical    |
|     Registry      |     |      Logging      |     |      Engine       |
+-------------------+     +-------------------+     +-------------------+
                                                              |
                                                              v
                                                    +-------------------+
                                                    |      Results      |
                                                    |     Dashboard     |
                                                    +-------------------+
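The traffic allocation and assignment steps are often implemented with deterministic hashing, so a user always sees the same variant without any stored state. The sketch below is one possible approach, not a description of a specific platform; the experiment name, the config layout, and `assign_variant` are all hypothetical.

```python
import hashlib
from collections import Counter

# Hypothetical experiment configuration (would normally come from YAML or a DB)
EXPERIMENT = {
    "name": "checkout_button_color",
    "variants": {"control": 0.5, "treatment": 0.5},  # traffic share per variant
}

def assign_variant(user_id: str, experiment: dict) -> str:
    """Map a user to a variant by hashing the experiment name and user ID."""
    key = f"{experiment['name']}:{user_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 10_000 / 10_000
    cumulative = 0.0
    for variant, share in experiment["variants"].items():
        cumulative += share
        if bucket < cumulative:
            return variant
    return variant  # guards against floating-point rounding in the shares

counts = Counter(assign_variant(f"user_{i}", EXPERIMENT) for i in range(10_000))
print(dict(counts))
```

Including the experiment name in the hash key keeps assignments independent across experiments, so a user in treatment for one test is not systematically in treatment for the next.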
Key Takeaways
📋Summary: A/B Testing
- Randomization is critical: Without proper randomization, results are unreliable — it balances both observed and unobserved confounders
- Sample size matters: Calculate before running the test — underpowered tests waste time and miss real effects
- Don't peek: Repeatedly checking results and stopping early can inflate the false positive rate from the nominal 5% to 30% or more
- Check segments: Aggregate results can be misleading (Simpson's paradox)
- Run long enough: Cover at least 1-2 full business cycles (typically 2 weeks) to avoid novelty effects
- Multi-armed bandits are better for optimization; A/B tests are better for causal learning
- Pre-registration: Define hypotheses, metrics, and sample size before starting to avoid p-hacking
- Report effect sizes and confidence intervals, not just p-values — they communicate practical significance
Practice Exercises
Exercise 1: Sample Size Calculation
Calculate the required sample size for:
- Baseline conversion: 5%
- Minimum detectable effect: 10% relative lift
- Significance level: 5%
- Power: 80%
Exercise 2: Analyze Real Data
# Given experiment data, run a complete A/B test analysis:
# 1. Check group balance
# 2. Calculate point estimates and CIs
# 3. Run significance tests
# 4. Check for novelty effects
# 5. Segment analysis
Exercise 3: Build a Bandit
Implement a Thompson Sampling algorithm and compare against epsilon-greedy on a simulation with 5 arms.
Exercise 4: Sequential Testing
Research and implement a sequential testing framework (e.g., always-valid p-values or group sequential design).
Discussion Questions
- When would you choose a multi-armed bandit over a traditional A/B test?
- How do you handle experiments where the metric takes weeks to mature (e.g., customer lifetime value)?
- What are the ethical considerations of running experiments on users?