A/B Testing + Experimentation

Module 4: Specialization + Career (Free Lesson)

Overview

A/B testing (also called split testing or randomized controlled experimentation) is the gold standard for making data-driven decisions in product development, marketing, and business strategy. It involves randomly assigning users to two or more variants to measure the causal impact of a change. Rooted in Fisher's framework of hypothesis testing, A/B tests provide the most credible evidence for causal claims outside of randomized controlled trials in medicine.


A/B Testing Fundamentals

What is an A/B Test?

Definition: A/B Test

An A/B test is a randomized controlled experiment that compares two versions (control and treatment) to find out which performs better, using statistical inference to judge whether observed differences are attributable to the treatment or to random chance.

Architecture Diagram
                     Traffic
                       |
           +----------+----------+
           |                     |
      Random Assignment     Random Assignment
           |                     |
           v                     v
    +------+------+       +------+------+
    |  Control   |       | Treatment   |
    |  (A)       |       |  (B)        |
    |  Original  |       |  New Version|
    +------+-----+       +------+------+
           |                     |
           v                     v
    +------+------+       +------+------+
    |  Metrics   |       |  Metrics    |
    |  Collect   |       |  Collect    |
    +------+-----+       +------+------+
           |                     |
           +----------+----------+
                      |
                      v
              Statistical Analysis
              (Is B significantly better?)

ℹ️ Credibility Hierarchy

A/B tests sit near the top of the evidence credibility hierarchy. Unlike observational studies, proper randomization balances both observed and unobserved confounders across groups in expectation, which is what permits causal conclusions.

Key Concepts

Term | Definition
Control (A) | The baseline/original version
Treatment (B) | The modified version being tested
Unit of Analysis | User, session, pageview, etc.
Randomization | Random assignment to eliminate confounders
Sample Size | Number of observations needed for reliable results
Statistical Significance | The observed effect would be unlikely (p-value below α) if there were truly no difference
Effect Size | Magnitude of the difference between variants

Hypothesis Testing Framework

Null Hypothesis

$$H_0: \mu_A = \mu_B \quad \text{(no difference)}$$

Here,

  • $H_0$ = Null hypothesis
  • $\mu_A$ = Mean of control group
  • $\mu_B$ = Mean of treatment group

Alternative Hypothesis (Two-sided)

$$H_1: \mu_A \neq \mu_B$$

Here,

  • $H_1$ = Alternative hypothesis

Or one-sided:

Alternative Hypothesis (One-sided)

$$H_1: \mu_B > \mu_A \quad \text{(B is better)}$$

Here,

  • $H_1$ = Alternative hypothesis

Types of A/B Tests

Architecture Diagram
A/B Testing Types
       |
       +--+--> Conversion Rate Tests (click-through rate, sign-ups)
       |
       +--+--> Revenue Tests (average order value, LTV)
       |
       +--+--> Engagement Tests (time on page, bounce rate)
       |
       +--+--> Funnel Tests (multi-step conversions)
       |
       +--+--> Multivariate Tests (multiple changes simultaneously)

Statistical Foundations

Z-Test for Proportions

For conversion rate tests, the test statistic is:

Z-Test Statistic for Proportions

$$z = \frac{\hat{p}_B - \hat{p}_A}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}}$$

Here,

  • $z$ = Test statistic
  • $\hat{p}_A$ = Sample proportion for control
  • $\hat{p}_B$ = Sample proportion for treatment
  • $\hat{p}$ = Pooled proportion
  • $n_A, n_B$ = Sample sizes for each group

where the pooled proportion is:

Pooled Proportion

$$\hat{p} = \frac{x_A + x_B}{n_A + n_B}$$

Here,

  • $x_A, x_B$ = Number of successes in each group
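As a minimal sketch of the two formulas above, the following computes the pooled z-statistic directly from conversion counts. The function name `two_proportion_ztest` and the example counts are illustrative assumptions, not data from a real experiment.

```python
import numpy as np
from scipy import stats

def two_proportion_ztest(x_a, n_a, x_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = x_a / n_a, x_b / n_b
    p_pool = (x_a + x_b) / (n_a + n_b)          # pooled proportion under H0
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_two_sided = 2 * stats.norm.sf(abs(z))     # sf = 1 - cdf, more stable in the tails
    return z, p_two_sided

# Hypothetical counts: 1,000/10,000 conversions in control, 1,100/10,000 in treatment
z, p = two_proportion_ztest(1000, 10000, 1100, 10000)
print(f"z = {z:.3f}, two-sided p = {p:.4f}")
```

For a one-sided alternative (B is better), the p-value would be `stats.norm.sf(z)` instead.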

T-Test for Continuous Metrics

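For continuous metrics (revenue, time on page), Welch's t-statistic is:

Welch's t-Statistic

$$t = \frac{\bar{X}_B - \bar{X}_A}{\sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}}$$

Here,

  • $t$ = Test statistic
  • $\bar{X}_A, \bar{X}_B$ = Sample means for control and treatment
  • $s_A^2, s_B^2$ = Sample variances for each group

with degrees of freedom (Welch's approximation):

Welch-Satterthwaite Degrees of Freedom

$$df = \frac{\left(\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}\right)^2}{\frac{(s_A^2/n_A)^2}{n_A-1} + \frac{(s_B^2/n_B)^2}{n_B-1}}$$

Here,

  • $df$ = Degrees of freedom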

ℹ️ Why Welch's over Student's t-test

Welch's t-test does not assume equal variances between groups, making it more robust for real-world A/B tests where control and treatment groups may have different variance structures.
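To make the Welch test concrete, here is a small sketch that computes the statistic and degrees of freedom by hand and compares them with SciPy's `ttest_ind(..., equal_var=False)`. The simulated data and its parameters are assumptions chosen only to give the two groups unequal variances.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated revenue-like data with unequal variances (illustrative only)
control = rng.normal(loc=50.0, scale=10.0, size=500)
treatment = rng.normal(loc=51.5, scale=20.0, size=500)

# Manual Welch statistic and Welch-Satterthwaite degrees of freedom
va = control.var(ddof=1) / len(control)
vb = treatment.var(ddof=1) / len(treatment)
t_manual = (treatment.mean() - control.mean()) / np.sqrt(va + vb)
df_manual = (va + vb) ** 2 / (
    va ** 2 / (len(control) - 1) + vb ** 2 / (len(treatment) - 1)
)
p_manual = 2 * stats.t.sf(abs(t_manual), df_manual)

# SciPy's Welch test: equal_var=False drops the pooled-variance assumption
t_scipy, p_scipy = stats.ttest_ind(treatment, control, equal_var=False)

print(f"manual: t = {t_manual:.4f}, df = {df_manual:.1f}, p = {p_manual:.4f}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```

Because the variances differ, the Welch degrees of freedom come out well below the `n_A + n_B - 2` that Student's t-test would use.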


Sample Size Calculation

For Conversion Rate Tests

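The required sample size per group is:

Sample Size for Conversion Rates

$$n = \frac{(z_{\alpha/2} + z_{\beta})^2 \left[p_A(1-p_A) + p_B(1-p_B)\right]}{(p_B - p_A)^2}$$

Here,

  • $n$ = Sample size per group
  • $p_A$ = Baseline (control) conversion rate
  • $p_B$ = Expected treatment conversion rate
  • $z_{\alpha/2}$ = Standard normal critical value for a two-sided test at level $\alpha$
  • $z_{\beta}$ = Standard normal quantile corresponding to the target power $1-\beta$

For Continuous Metrics

Sample Size for Continuous Metrics

$$n = \frac{2(z_{\alpha/2} + z_{\beta})^2 \cdot \sigma^2}{\delta^2}$$

Here,

  • $n$ = Sample size per group
  • $\sigma$ = Standard deviation of the metric
  • $\delta$ = Minimum detectable effect (MDE)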

Minimum Detectable Effect (MDE)

Minimum Detectable Effect

$$\text{MDE} = (z_{\alpha/2} + z_{\beta}) \cdot \sqrt{\frac{2\sigma^2}{n}}$$

Here,

  • $\text{MDE}$ = Minimum detectable effect
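The MDE formula is the continuous-metric sample size formula solved for the effect instead of for n. A minimal sketch, with a hypothetical function name and hypothetical inputs:

```python
import numpy as np
from scipy import stats

def minimum_detectable_effect(std_dev, n_per_group, alpha=0.05, power=0.80):
    """Smallest absolute difference in means detectable with n users per group."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # two-sided critical value
    z_beta = stats.norm.ppf(power)            # quantile for the target power
    return (z_alpha + z_beta) * np.sqrt(2 * std_dev**2 / n_per_group)

# Hypothetical metric: $50 standard deviation, 10,000 users per group
mde = minimum_detectable_effect(std_dev=50.0, n_per_group=10_000)
print(f"MDE: ${mde:.2f}")
```

This is useful when traffic is fixed: instead of asking how many users are needed, it answers how small an effect the available traffic can reliably detect.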

Power Analysis Visualization

Architecture Diagram
Power vs Sample Size
Power
1.0 |
    |                          ___________
    |                        /
0.8 |-----------------------/------------ Target Power
    |                    /
    |                  /
0.6 |               /
    |            /
    |         /
0.4 |      /
    |    /
    |  /
0.2 |/
    |
0.0 +----+----+----+----+----+----+----+----> Sample Size
    0   500  1000 1500 2000 2500 3000 3500

Key insight: Diminishing returns as n increases
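The shape of the curve can be reproduced numerically. The sketch below uses a normal approximation for a two-sided test of two proportions; the function name `power_two_proportions` and the 10% vs 11% rates are assumptions for illustration, so the numbers will not match the axis values in the chart above.

```python
import numpy as np
from scipy import stats

def power_two_proportions(p_a, p_b, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test for two proportions."""
    se = np.sqrt(p_a * (1 - p_a) / n_per_group + p_b * (1 - p_b) / n_per_group)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    # Probability of rejecting in the direction of the true effect;
    # the negligible opposite tail is ignored in this approximation.
    return stats.norm.cdf(abs(p_b - p_a) / se - z_crit)

sizes = [1_000, 5_000, 10_000, 15_000, 20_000, 30_000]
powers = [power_two_proportions(0.10, 0.11, n) for n in sizes]
for n, pw in zip(sizes, powers):
    print(f"n per group = {n:>6,}: power = {pw:.2f}")
```

Each additional block of users buys less power than the previous one, which is the diminishing-returns pattern noted above.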

Sample Size Calculator

import numpy as np
from scipy import stats

def sample_size_proportions(p_control, mde_relative, alpha=0.05, power=0.80):
    """
    Calculate sample size for A/B test with conversion rates.
    
    Args:
        p_control: Baseline conversion rate
        mde_relative: Relative minimum detectable effect (e.g., 0.10 for 10% lift)
        alpha: Significance level
        power: Statistical power
    
    Returns:
        Sample size per group
    """
    p_treatment = p_control * (1 + mde_relative)
    
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    
    # Sum of the per-group Bernoulli variances, evaluated at the expected rates
    
    n = (z_alpha + z_beta)**2 * (
        p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    ) / (p_treatment - p_control)**2
    
    return int(np.ceil(n))

def sample_size_continuous(std_dev, mde_absolute, alpha=0.05, power=0.80):
    """
    Calculate sample size for A/B test with continuous metrics.
    
    Args:
        std_dev: Standard deviation of the metric
        mde_absolute: Minimum detectable effect (absolute)
        alpha: Significance level
        power: Statistical power
    
    Returns:
        Sample size per group
    """
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    
    n = 2 * (z_alpha + z_beta)**2 * std_dev**2 / mde_absolute**2
    
    return int(np.ceil(n))

# Example: Conversion rate test
baseline_cr = 0.10  # 10% baseline conversion
lift = 0.10  # Want to detect 10% relative lift (10% -> 11%)

n_per_group = sample_size_proportions(baseline_cr, lift)
print(f"Sample size per group: {n_per_group:,}")
print(f"Total sample needed: {n_per_group * 2:,}")
print(f"Test duration at 1000 visitors/day: {n_per_group * 2 / 1000:.1f} days")

# Example: Revenue test
revenue_std = 50.0  # $50 std dev in revenue
min_effect = 2.0    # Want to detect $2 increase

n_per_group_rev = sample_size_continuous(revenue_std, min_effect)
print(f"\nRevenue test - Sample per group: {n_per_group_rev:,}")

Statistical Significance

P-Value

Definition: P-Value

The p-value is the probability of observing a result at least as extreme as the one in the data, assuming the null hypothesis is true. It quantifies the evidence against the null hypothesis; it is not the probability that the null hypothesis is true.

P-Value Definition

$$p\text{-value} = P(\text{observed effect or more extreme} \mid H_0 \text{ is true})$$

Here,

  • $p$ = P-value

Interpretation

P-Value | Interpretation | Action
< 0.01 | Very strong evidence against $H_0$ | Ship with confidence
0.01 - 0.05 | Strong evidence against $H_0$ | Likely ship
0.05 - 0.10 | Weak evidence | Consider more data
> 0.10 | No significant evidence | Don't ship

Confidence Intervals

For the difference in proportions:

Confidence Interval for Difference in Proportions

$$CI = (\hat{p}_B - \hat{p}_A) \pm z_{\alpha/2} \cdot \sqrt{\frac{\hat{p}_A(1-\hat{p}_A)}{n_A} + \frac{\hat{p}_B(1-\hat{p}_B)}{n_B}}$$

Here,

  • $CI$ = Confidence interval
  • $z_{\alpha/2}$ = Critical value for significance level $\alpha$

💡 Confidence Intervals vs P-Values

Always report confidence intervals alongside p-values. A CI provides both statistical significance (does it exclude zero?) and practical significance (how large is the effect?). A narrow CI centered at zero is strong evidence of no effect.

Effect Size Measures

Cohen's h for proportions:

Cohen's h for Proportions

$$h = 2 \arcsin(\sqrt{p_B}) - 2 \arcsin(\sqrt{p_A})$$

Here,

  • $h$ = Effect size
  • $p_A, p_B$ = Proportions for each group

Cohen's d for continuous metrics:

Cohen's d for Continuous Metrics

$$d = \frac{\bar{X}_B - \bar{X}_A}{s_{pooled}}$$

Here,

  • $d$ = Effect size
  • $s_{pooled}$ = Pooled standard deviation

Cohen's conventional benchmarks are the same for both measures:

Effect Size | Cohen's d | Cohen's h
Small | 0.2 | 0.2
Medium | 0.5 | 0.5
Large | 0.8 | 0.8
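Both effect sizes take only a few lines to compute. The sketch below uses hypothetical helper names (`cohens_h`, `cohens_d`) and toy inputs; the pooled standard deviation uses the usual sample-size-weighted form.

```python
import numpy as np

def cohens_h(p_a, p_b):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * np.arcsin(np.sqrt(p_b)) - 2 * np.arcsin(np.sqrt(p_a))

def cohens_d(control, treatment):
    """Cohen's d using the pooled sample standard deviation."""
    control = np.asarray(control, dtype=float)
    treatment = np.asarray(treatment, dtype=float)
    n_a, n_b = len(control), len(treatment)
    pooled_var = (
        (n_a - 1) * control.var(ddof=1) + (n_b - 1) * treatment.var(ddof=1)
    ) / (n_a + n_b - 2)
    return (treatment.mean() - control.mean()) / np.sqrt(pooled_var)

h = cohens_h(0.10, 0.11)
d = cohens_d([10, 12, 14, 16, 18], [12, 14, 16, 18, 20])
print(f"Cohen's h for 10% -> 11%: {h:.4f}")
print(f"Cohen's d for the toy samples: {d:.4f}")
```

Conversion-rate lifts that matter commercially often correspond to very small standardized effect sizes, which is one reason A/B tests on proportions need large samples.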

Complete A/B Test Implementation

import numpy as np
import pandas as pd
from scipy import stats
from dataclasses import dataclass
from typing import Optional

@dataclass
class ABTestResult:
    """Results of an A/B test."""
    control_mean: float
    treatment_mean: float
    lift: float
    p_value: float
    confidence_interval: tuple
    significant: bool
    sample_size_control: int
    sample_size_treatment: int
    power: float

def run_ab_test(
    control_data: np.ndarray,
    treatment_data: np.ndarray,
    metric_type: str = 'continuous',
    alpha: float = 0.05,
    power_threshold: float = 0.80
) -> ABTestResult:
    """
    Run a comprehensive A/B test.
    
    Args:
        control_data: Metrics for control group
        treatment_data: Metrics for treatment group
        metric_type: 'continuous' or 'proportion'
        alpha: Significance level
        power_threshold: Minimum power required
    
    Returns:
        ABTestResult with all statistics
    """
    n_c, n_t = len(control_data), len(treatment_data)
    
    if metric_type == 'proportion':
        # Z-test for proportions
        p_c = np.mean(control_data)
        p_t = np.mean(treatment_data)
        p_pooled = (np.sum(control_data) + np.sum(treatment_data)) / (n_c + n_t)
        
        se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n_c + 1/n_t))
        z_stat = (p_t - p_c) / se
        p_value = 2 * (1 - stats.norm.cdf(abs(z_stat)))
        
        # Confidence interval
        se_diff = np.sqrt(p_c*(1-p_c)/n_c + p_t*(1-p_t)/n_t)
        z_crit = stats.norm.ppf(1 - alpha/2)
        ci = ((p_t - p_c) - z_crit*se_diff, (p_t - p_c) + z_crit*se_diff)
        
        lift = (p_t - p_c) / p_c if p_c > 0 else float('inf')
        mean_c, mean_t = p_c, p_t
        
    else:
        # Welch's t-test for continuous metrics
        mean_c, mean_t = np.mean(control_data), np.mean(treatment_data)
        std_c, std_t = np.std(control_data, ddof=1), np.std(treatment_data, ddof=1)
        
        se = np.sqrt(std_c**2/n_c + std_t**2/n_t)
        t_stat = (mean_t - mean_c) / se
        
        # Welch-Satterthwaite degrees of freedom
        df = (std_c**2/n_c + std_t**2/n_t)**2 / (
            (std_c**2/n_c)**2/(n_c-1) + (std_t**2/n_t)**2/(n_t-1)
        )
        
        p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))
        
        # Confidence interval
        t_crit = stats.t.ppf(1 - alpha/2, df)
        ci = ((mean_t - mean_c) - t_crit*se, (mean_t - mean_c) + t_crit*se)
        
        lift = (mean_t - mean_c) / mean_c if mean_c > 0 else float('inf')
    
    # Calculate observed power
    effect_size = abs(mean_t - mean_c) / np.sqrt((np.var(control_data) + np.var(treatment_data)) / 2)
    power = 1 - stats.norm.cdf(
        stats.norm.ppf(1 - alpha/2) - effect_size * np.sqrt(n_c * n_t / (n_c + n_t))
    )
    
    return ABTestResult(
        control_mean=mean_c,
        treatment_mean=mean_t,
        lift=lift,
        p_value=p_value,
        confidence_interval=ci,
        significant=p_value < alpha,
        sample_size_control=n_c,
        sample_size_treatment=n_t,
        power=power
    )

# Example: Simulate and test
np.random.seed(42)
control = np.random.binomial(1, 0.10, 10000)  # 10% conversion
treatment = np.random.binomial(1, 0.11, 10000)  # 11% conversion

result = run_ab_test(control, treatment, metric_type='proportion')
print("=== A/B Test Results ===")
print(f"Control Conversion: {result.control_mean:.4f}")
print(f"Treatment Conversion: {result.treatment_mean:.4f}")
print(f"Lift: {result.lift:.2%}")
print(f"P-value: {result.p_value:.4f}")
print(f"Significant: {result.significant}")
print(f"95% CI: ({result.confidence_interval[0]:.4f}, {result.confidence_interval[1]:.4f})")
print(f"Power: {result.power:.2%}")

Common Pitfalls

1. Peeking Problem (Optional Stopping)

Architecture Diagram
Problem: Checking results before reaching target sample size
                    *
           *       *  *
         *           *
       *               *
     *                   *
   *  <-- Stop early?    *
 *      "Looks good!"     *
+---------------------------> Time

Inflates false positive rate from 5% to 30%+

⚠️ Peeking Problem

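Repeatedly checking results and stopping as soon as they look significant inflates the false positive rate well beyond the nominal 5%; with enough interim looks it can exceed 30%. Either use a sequential testing method designed for interim analyses or run the test to its planned sample size.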
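The inflation is easy to see in an A/A simulation, where both arms share the same true rate and every significant result is therefore a false positive. This is a sketch under assumed settings (2,000 simulated experiments, 10 evenly spaced looks, a 10% conversion rate); the exact rates depend on how often you look.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, n_per_group, n_looks = 2000, 2000, 10
p_true = 0.10                      # identical in both arms: any "win" is a false positive
z_crit = stats.norm.ppf(0.975)     # two-sided test at alpha = 0.05

# Per-group sample size at each interim look: 200, 400, ..., 2000
look_sizes = np.arange(1, n_looks + 1) * (n_per_group // n_looks)

# Cumulative conversions per simulated experiment, read off at each look
a = rng.binomial(1, p_true, size=(n_sims, n_per_group)).cumsum(axis=1)[:, look_sizes - 1]
b = rng.binomial(1, p_true, size=(n_sims, n_per_group)).cumsum(axis=1)[:, look_sizes - 1]

p_pool = (a + b) / (2 * look_sizes)
se = np.sqrt(p_pool * (1 - p_pool) * 2 / look_sizes)
z = (b - a) / look_sizes / se      # pooled z-statistic at every look

fpr_final_only = np.mean(np.abs(z[:, -1]) > z_crit)
fpr_with_peeking = np.mean((np.abs(z) > z_crit).any(axis=1))
print(f"False positive rate, single final analysis: {fpr_final_only:.3f}")
print(f"False positive rate, stop at first significant look: {fpr_with_peeking:.3f}")
```

The single final analysis stays near 5%, while stopping at the first significant look lands noticeably higher, and more frequent looks push it higher still.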

2. Multiple Comparisons

Testing many variants inflates Type I error:

Family-wise Error Rate

$$P(\text{at least one false positive}) = 1 - (1-\alpha)^k$$

Here,

  • $k$ = Number of independent comparisons
  • $\alpha$ = Significance level per test

For $\alpha = 0.05$:

Comparisons (k) | Family-wise Error Rate
2 | 9.75%
5 | 22.6%
10 | 40.1%
20 | 64.2%

Solutions:

  • Bonferroni correction: $\alpha_{adjusted} = \alpha / k$
  • Benjamini-Hochberg (FDR control)
  • Sequential testing
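The first two corrections in the list above can be sketched in plain NumPy. The function names and the example p-values are hypothetical; in practice a library routine may be preferable.

```python
import numpy as np

def bonferroni(p_values, alpha=0.05):
    """Reject H0 where p < alpha / k (controls the family-wise error rate)."""
    p = np.asarray(p_values, dtype=float)
    return p < alpha / len(p)

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    p = np.asarray(p_values, dtype=float)
    k = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, k + 1) / k   # (i / k) * alpha for the i-th smallest p
    below = p[order] <= thresholds
    reject = np.zeros(k, dtype=bool)
    if below.any():
        cutoff = np.max(np.nonzero(below)[0])      # largest rank that passes its threshold
        reject[order[:cutoff + 1]] = True          # reject every hypothesis up to that rank
    return reject

# Hypothetical p-values from five variant-vs-control comparisons
p_values = [0.001, 0.012, 0.025, 0.041, 0.200]
print("Bonferroni rejects:", bonferroni(p_values))
print("BH rejects:        ", benjamini_hochberg(p_values))
```

Bonferroni is the more conservative of the two: it guards against any false positive at all, while Benjamini-Hochberg tolerates a controlled fraction of false discoveries in exchange for more power.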

3. Novelty and Primacy Effects

Architecture Diagram
Effect Size
    ^
    |  *
    |   *  *  Novelty effect (new wears off)
    |     *  *  *
    |           * * * * * * * True effect
    |                   * * * * * * * *
    +-------------------------------------> Time
    Launch

ℹ️ Novelty Effect

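New features often get a temporary boost that fades as the novelty wears off. The primacy effect is the reverse: existing users may initially respond worse to a change simply because they are used to the old experience. Run tests for at least one full business cycle (typically 1-2 weeks) so these transient effects have time to settle.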

4. Simpson's Paradox

Aggregate results can contradict segment-level results. A variant can win in every segment and still lose overall when the segment mix differs between groups:

Architecture Diagram
Overall: A beats B (5% vs 4%)
         |                 |
    +----+----+       +----+----+
    | Mobile  |       | Desktop |
    | A: 2%   |       | A: 8%   |
    | B: 3%   |       | B: 9%   |
    | B wins! |       | B wins! |
    +---------+       +---------+

Confounding variable: Traffic mix across segments (A: 50% mobile, B: about 83% mobile)

⚠️ Simpson's Paradox

Always check results by key segments (device, traffic source, geography). Aggregate results can be misleading.
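A short numeric sketch shows how the reversal arises. The counts below are hypothetical and chosen so that B converts better within each device segment while A looks better overall, purely because B's traffic is weighted toward the low-converting segment.

```python
# Hypothetical (conversions, users) per group and device; the device mix differs by group
data = {
    "A": {"mobile": (100, 5000), "desktop": (400, 5000)},   # 50% mobile
    "B": {"mobile": (250, 8333), "desktop": (150, 1667)},   # about 83% mobile
}

def rate(conversions, users):
    return conversions / users

# Segment-level comparison
for device in ("mobile", "desktop"):
    ra = rate(*data["A"][device])
    rb = rate(*data["B"][device])
    print(f"{device:>8}: A = {ra:.1%}, B = {rb:.1%} -> {'B' if rb > ra else 'A'} wins")

# Aggregate comparison: pool conversions and users across devices
overall = {
    group: rate(sum(c for c, _ in segs.values()), sum(n for _, n in segs.values()))
    for group, segs in data.items()
}
winner = "B" if overall["B"] > overall["A"] else "A"
print(f" overall: A = {overall['A']:.1%}, B = {overall['B']:.1%} -> {winner} wins")
```

In a correctly randomized test the segment mix should be nearly identical across groups, so a reversal like this is also a signal to check the assignment mechanism.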

5. Network Effects

Users in treatment interact with users in control, contaminating results.

ℹ️ Network Effects

Use cluster randomization (randomize by social group, geography) to avoid contamination between control and treatment.

6. Selection Bias

# Example: Detecting selection bias
import pandas as pd
import numpy as np

np.random.seed(42)

# Simulate a biased experiment: new users are more likely to land in treatment
n = 10000
user_type = np.random.choice(['new', 'returning'], n, p=[0.3, 0.7])

# Fair assignment would use P(B) = 0.5 for everyone; here new users get P(B) = 0.7
p_treatment = np.where(user_type == 'new', 0.7, 0.5)
biased_assignment = np.where(np.random.random(n) < p_treatment, 'B', 'A')

# Check balance: under proper randomization the user_type mix matches across groups
df = pd.DataFrame({'group': biased_assignment, 'user_type': user_type})
print("Group balance check:")
print(pd.crosstab(df['group'], df['user_type'], normalize='index'))
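Eyeballing a crosstab works for large imbalances, but a formal check is more reliable. One option, sketched here with assumed simulation settings and a hypothetical helper `balance_p_value`, is a chi-square test of independence between group assignment and a pre-treatment covariate.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(7)
n = 10_000
user_type = rng.choice(["new", "returning"], size=n, p=[0.3, 0.7])

# Properly randomized assignment: independent of user_type
fair_group = rng.choice(["A", "B"], size=n)
# Hypothetical broken assignment: new users land in B 70% of the time
p_b = np.where(user_type == "new", 0.7, 0.5)
biased_group = np.where(rng.random(n) < p_b, "B", "A")

def balance_p_value(group, covariate):
    """Chi-square test of independence between assignment and a covariate."""
    table = pd.crosstab(
        pd.Series(group, name="group"), pd.Series(covariate, name="covariate")
    )
    chi2, p, dof, expected = stats.chi2_contingency(table)
    return p

p_fair = balance_p_value(fair_group, user_type)
p_biased = balance_p_value(biased_group, user_type)
print(f"Fair assignment:   p = {p_fair:.4f}")
print(f"Biased assignment: p = {p_biased:.2e}")
```

A very small p-value on a covariate that was fixed before the experiment started points to a problem with assignment or logging, not to a treatment effect.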

Multi-Armed Bandits

Concept

Instead of a fixed 50/50 split, dynamically allocate more traffic to better-performing variants:

Architecture Diagram
Traditional A/B Test          Multi-Armed Bandit
                              
Fixed 50/50 split            Adaptive allocation
     |                              |
  +--+--+                      +---+---+
  |  |  |                      |       |
  A  A  B  A  B  A  B       A  A  B  B  B  B  B
  (equal throughout)          (more to winner over time)

Epsilon-Greedy Strategy

Epsilon-Greedy Action Selection

$$\text{Action} = \begin{cases} \text{Random} & \text{with probability } \epsilon \\ \text{Best known} & \text{with probability } 1-\epsilon \end{cases}$$

Here,

  • $\epsilon$ = Exploration probability

Upper Confidence Bound (UCB)

Upper Confidence Bound

$$\text{UCB}_i = \hat{\mu}_i + \sqrt{\frac{2 \ln t}{n_i}}$$

Here,

  • $t$ = Total number of rounds so far
  • $n_i$ = Times arm i was pulled
  • $\hat{\mu}_i$ = Estimated mean reward for arm i

Thompson Sampling

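Maintain a posterior distribution over each arm's reward rate, draw one sample from each posterior, and play the arm with the largest sample. For binary rewards with a uniform Beta(1, 1) prior, the posterior for arm i is Beta(successes + 1, failures + 1), which is what the implementation below tracks.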

Multi-Armed Bandit Implementation

import numpy as np
from scipy import stats

class EpsilonGreedy:
    """Epsilon-Greedy multi-armed bandit."""
    
    def __init__(self, n_arms, epsilon=0.1):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)
    
    def select_arm(self):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.n_arms)
        return np.argmax(self.values)
    
    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] = self.values[arm] * (n-1)/n + reward/n

class ThompsonSampling:
    """Thompson Sampling for binary rewards."""
    
    def __init__(self, n_arms):
        self.n_arms = n_arms
        self.alpha = np.ones(n_arms)  # successes + 1
        self.beta = np.ones(n_arms)   # failures + 1
    
    def select_arm(self):
        samples = [np.random.beta(self.alpha[i], self.beta[i]) 
                   for i in range(self.n_arms)]
        return np.argmax(samples)
    
    def update(self, arm, reward):
        if reward == 1:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1

class UCB1:
    """Upper Confidence Bound algorithm."""
    
    def __init__(self, n_arms):
        self.n_arms = n_arms
        self.counts = np.zeros(n_arms)
        self.values = np.zeros(n_arms)
        self.total_counts = 0
    
    def select_arm(self):
        # Ensure each arm is tried at least once
        for i in range(self.n_arms):
            if self.counts[i] == 0:
                return i
        
        ucb_values = self.values + np.sqrt(2 * np.log(self.total_counts) / self.counts)
        return np.argmax(ucb_values)
    
    def update(self, arm, reward):
        self.counts[arm] += 1
        self.total_counts += 1
        n = self.counts[arm]
        self.values[arm] = self.values[arm] * (n-1)/n + reward/n

# Simulation
def simulate_bandit(algorithm, true_probs, n_rounds=1000):
    """Simulate bandit algorithm performance."""
    rewards = []
    arms_selected = []
    
    for _ in range(n_rounds):
        arm = algorithm.select_arm()
        reward = np.random.random() < true_probs[arm]
        algorithm.update(arm, reward)
        rewards.append(reward)
        arms_selected.append(arm)
    
    return np.array(rewards), np.array(arms_selected)

# Compare algorithms
np.random.seed(42)
true_probs = [0.10, 0.15, 0.20, 0.05]  # True conversion rates

algorithms = {
    'Epsilon-Greedy (0.1)': EpsilonGreedy(4, epsilon=0.1),
    'Thompson Sampling': ThompsonSampling(4),
    'UCB1': UCB1(4)
}

for name, algo in algorithms.items():
    rewards, arms = simulate_bandit(algo, true_probs, n_rounds=10000)
    print(f"\n{name}:")
    print(f"  Total reward: {rewards.sum():.0f}")
    print(f"  Average reward: {rewards.mean():.4f}")
    print(f"  Best arm selected: {np.argmax(np.bincount(arms))} "
          f"(true best: {np.argmax(true_probs)})")

Real-World Example: E-commerce Button Test

import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(42)

# Simulate e-commerce A/B test
n_users = 50000

# User segments
segments = np.random.choice(
    ['new', 'returning', 'vip'],
    n_users,
    p=[0.5, 0.35, 0.15]
)

# Base conversion rates by segment
base_rates = {'new': 0.05, 'returning': 0.12, 'vip': 0.25}

# Treatment effect varies by segment
treatment_lift = {'new': 0.15, 'returning': 0.08, 'vip': 0.05}

# Assign groups
groups = np.random.choice(['control', 'treatment'], n_users)

# Generate conversions
conversions = np.zeros(n_users)
for i in range(n_users):
    seg = segments[i]
    if groups[i] == 'control':
        conversions[i] = np.random.random() < base_rates[seg]
    else:
        conversions[i] = np.random.random() < base_rates[seg] * (1 + treatment_lift[seg])

# Create DataFrame
df = pd.DataFrame({
    'user_id': range(n_users),
    'group': groups,
    'segment': segments,
    'converted': conversions.astype(int)
})

# Overall results
print("=== Overall Results ===")
ctrl = df[df['group'] == 'control']
treat = df[df['group'] == 'treatment']

ctrl_cr = ctrl['converted'].mean()
treat_cr = treat['converted'].mean()
lift = (treat_cr - ctrl_cr) / ctrl_cr

print(f"Control CR: {ctrl_cr:.4f}")
print(f"Treatment CR: {treat_cr:.4f}")
print(f"Lift: {lift:.2%}")

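# Statistical test (proportions_ztest is provided by statsmodels, not scipy.stats)
from statsmodels.stats.proportion import proportions_ztest

z_stat, p_value = proportions_ztest(
    [treat['converted'].sum(), ctrl['converted'].sum()],
    [len(treat), len(ctrl)]
)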
print(f"Z-statistic: {z_stat:.4f}")
print(f"P-value: {p_value:.4f}")
print(f"Significant at 0.05: {p_value < 0.05}")

# Segment-level results
print("\n=== Segment-Level Results ===")
for seg in ['new', 'returning', 'vip']:
    seg_data = df[df['segment'] == seg]
    seg_ctrl = seg_data[seg_data['group'] == 'control']['converted'].mean()
    seg_treat = seg_data[seg_data['group'] == 'treatment']['converted'].mean()
    seg_lift = (seg_treat - seg_ctrl) / seg_ctrl
    print(f"{seg:>10}: Control={seg_ctrl:.4f}, Treatment={seg_treat:.4f}, Lift={seg_lift:.2%}")

Experimentation Platform Architecture

Architecture Diagram
+-------------------+     +-------------------+     +-------------------+
|   Experiment      |     |   Traffic         |     |   Assignment      |
|   Configuration   | --> |   Allocation      | --> |   Service         |
|   (YAML/DB)       |     |   (% by variant)  |     |   (user -> variant)|
+-------------------+     +-------------------+     +-------------------+
           |                        |                        |
           v                        v                        v
+-------------------+     +-------------------+     +-------------------+
|   Hypothesis      |     |   Event           |     |   Statistical     |
|   Registry        |     |   Logging         |     |   Engine          |
+-------------------+     +-------------------+     +-------------------+
                                   |
                                   v
                          +-------------------+
                          |   Results         |
                          |   Dashboard       |
                          +-------------------+
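The assignment service in the diagram is often implemented as a deterministic hash, so that a user always sees the same variant without any stored state. The following is a sketch of that idea only; the function name, the experiment ID, and the allocation format are assumptions, not part of any specific platform.

```python
import hashlib

def assign_variant(user_id, experiment_id, allocation):
    """Map a user to a variant deterministically from a hash of (experiment, user).

    allocation: dict of variant name -> traffic share; shares should sum to 1.
    """
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000   # value in [0, 1)
    cumulative = 0.0
    for variant, share in allocation.items():
        cumulative += share
        if bucket < cumulative:
            return variant
    return variant  # fallback if floating-point rounding leaves a tiny gap

# Hypothetical experiment configuration
allocation = {"control": 0.5, "treatment": 0.5}
counts = {"control": 0, "treatment": 0}
for uid in range(10_000):
    counts[assign_variant(uid, "checkout_button_v2", allocation)] += 1
print(counts)
```

Including the experiment ID in the hash input keeps assignments independent across experiments, so a user who lands in treatment for one test is not systematically in treatment for the next.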

Key Takeaways

📋 Summary: A/B Testing

  1. Randomization is critical: Without proper randomization, results are unreliable — it balances both observed and unobserved confounders
  2. Sample size matters: Calculate before running the test — underpowered tests waste time and miss real effects
  3. Don't peek: Stopping as soon as interim results look significant inflates the false positive rate well beyond 5% (30% or more with many looks)
  4. Check segments: Aggregate results can be misleading (Simpson's paradox)
  5. Run long enough: Cover at least one full business cycle (typically 1-2 weeks) to avoid novelty effects
  6. Multi-armed bandits are better for optimization; A/B tests are better for causal learning
  7. Pre-registration: Define hypotheses, metrics, and sample size before starting to avoid p-hacking
  8. Report effect sizes and confidence intervals, not just p-values — they communicate practical significance

Practice Exercises

Exercise 1: Sample Size Calculation

Calculate the required sample size for:

  • Baseline conversion: 5%
  • Minimum detectable effect: 10% relative lift
  • Significance level: 5%
  • Power: 80%

Exercise 2: Analyze Real Data

# Given experiment data, run a complete A/B test analysis:
# 1. Check group balance
# 2. Calculate point estimates and CIs
# 3. Run significance tests
# 4. Check for novelty effects
# 5. Segment analysis

Exercise 3: Build a Bandit

Implement a Thompson Sampling algorithm and compare against epsilon-greedy on a simulation with 5 arms.

Exercise 4: Sequential Testing

Research and implement a sequential testing framework (e.g., always-valid p-values or group sequential design).

Discussion Questions

  1. When would you choose a multi-armed bandit over a traditional A/B test?
  2. How do you handle experiments where the metric takes weeks to mature (e.g., customer lifetime value)?
  3. What are the ethical considerations of running experiments on users?
