The Interview Question
ℹ️
Question: You're a data scientist at Netflix testing whether a new recommendation algorithm increases user engagement. You run an A/B test with:
- Control group: 50,000 users, mean watch time = 45 min/day, std = 12 min
- Treatment group: 50,000 users, mean watch time = 47 min/day, std = 13 min
Your tasks:
- Set up the hypothesis test properly
- Calculate the p-value and interpret it
- Determine if the result is practically significant
- Identify the potential pitfalls and explain how to address them
Detailed Answer
1. Hypothesis Testing Framework
Hypothesis testing is the foundation of statistical inference. It provides a structured way to make decisions about population parameters based on sample data.
Step 1: Define Hypotheses
Null Hypothesis (H₀): μ_treatment - μ_control = 0
Alternative Hypothesis (H₁): μ_treatment - μ_control ≠ 0 (two-tailed)
OR: μ_treatment - μ_control > 0 (one-tailed)
💡
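Pro Tip: If you only care whether the new algorithm increases engagement (not whether it decreases it), a one-tailed test with H₁: μ_treatment > μ_control has more power. Choose the direction before looking at the data. A two-tailed test remains the safer default when a harmful effect would also matter.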
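To see what the choice of tail changes in practice, here is a minimal sketch that converts one z-statistic into both p-values. The value z = 1.8 is a hypothetical borderline case chosen for illustration, not the result of the Netflix test, and `p_values_from_z` is a helper name introduced here.

```python
from scipy import stats

def p_values_from_z(z):
    """Return (two_tailed, one_tailed_greater) p-values for a z-statistic."""
    two_tailed = 2 * stats.norm.sf(abs(z))  # H1: difference != 0
    one_tailed = stats.norm.sf(z)           # H1: difference > 0
    return two_tailed, one_tailed

# Hypothetical borderline z-statistic (assumption for illustration only)
two_sided, one_sided = p_values_from_z(1.8)
print(f"two-tailed p = {two_sided:.4f}")
print(f"one-tailed p = {one_sided:.4f}")
```

At α = 0.05 the same data would be significant under the one-tailed test but not under the two-tailed one, which is why the direction has to be fixed in advance.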
Step 2: Choose Significance Level (α)
The significance level is the probability of rejecting the null hypothesis when it's actually true (Type I error).
α = 0.05 (5%) — Standard for most tests
α = 0.01 (1%) — For high-stakes decisions
α = 0.10 (10%) — For exploratory analysis
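A quick way to make the meaning of α concrete is to simulate A/A tests, where both groups come from the same distribution and every rejection is a false positive. The sketch below assumes normally distributed watch times and hypothetical simulation sizes (5,000 experiments with 200 users per group); the false positive rate should land close to each α.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, n_users = 5000, 200  # hypothetical simulation sizes

# Two-tailed critical z-values for each significance level
alpha_levels = {0.10: 1.645, 0.05: 1.960, 0.01: 2.576}

# A/A test: both groups share the same distribution, so H0 is true by construction
control = rng.normal(45, 12, size=(n_sims, n_users))
treatment = rng.normal(45, 12, size=(n_sims, n_users))

# z-statistic for each simulated experiment
se = np.sqrt(control.var(axis=1, ddof=1) / n_users
             + treatment.var(axis=1, ddof=1) / n_users)
z = (treatment.mean(axis=1) - control.mean(axis=1)) / se

# Fraction of experiments that (wrongly) reject H0 at each level
false_positive_rate = {
    alpha: float(np.mean(np.abs(z) > z_crit))
    for alpha, z_crit in alpha_levels.items()
}
for alpha, rate in false_positive_rate.items():
    print(f"alpha = {alpha:.2f}: false positive rate = {rate:.3f}")
```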
2. Calculating the Test Statistic
For comparing two means with large samples, we use the z-test:
import numpy as np
from scipy import stats
# Given data
n_control = 50000
n_treatment = 50000
mean_control = 45
mean_treatment = 47
std_control = 12
std_treatment = 13
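# Standard error of the difference in means (unpooled: each group keeps its own variance).
# The variable keeps the name pooled_std because later snippets reuse it.
pooled_std = np.sqrt((std_control**2 / n_control) + (std_treatment**2 / n_treatment))
# Calculate z-statistic
z_stat = (mean_treatment - mean_control) / pooled_std
# Calculate p-value (two-tailed); norm.sf avoids the rounding to 0 that 1 - cdf gives for large |z|
p_value = 2 * stats.norm.sf(abs(z_stat))
print(f"Z-statistic: {z_stat:.4f}")
print(f"P-value: {p_value:.3e}")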
Mathematical Formula:
z = (x̄₁ - x̄₂) / √(s₁²/n₁ + s₂²/n₂)
where:
x̄₁, x̄₂ = sample means
s₁, s₂ = sample standard deviations
n₁, n₂ = sample sizes
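Plugging the numbers from the question into this formula gives the following worked calculation, taking x̄₁ as the treatment mean and x̄₂ as the control mean (values rounded):

```latex
\begin{aligned}
\mathrm{SE} &= \sqrt{\frac{13^2}{50000} + \frac{12^2}{50000}}
             = \sqrt{\frac{313}{50000}} \approx 0.0791 \\
z &= \frac{47 - 45}{0.0791} \approx 25.28
\end{aligned}
```

A z-statistic this large puts the p-value far below any conventional α, so the interesting question becomes whether a 2-minute lift matters, not whether it is statistically detectable.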
3. Interpreting the P-value
# Interpret the results
alpha = 0.05
print(f"Z-statistic: {z_stat:.4f}")
print(f"P-value: {p_value:.6f}")
print(f"Significance level (α): {alpha}")
print(f"\nInterpretation:")
if p_value < alpha:
    print(f"Since p-value ({p_value:.3e}) < α ({alpha})")
    print("We reject the null hypothesis")
    print("The new recommendation algorithm has a statistically significant effect")
else:
    print(f"Since p-value ({p_value:.3e}) >= α ({alpha})")
    print("We fail to reject the null hypothesis")
    print("There's insufficient evidence to conclude the algorithm has an effect")
Common Misconceptions About P-values:
| Misconception | Reality |
|---|---|
| "P-value is the probability H₀ is true" | P-value is P(data at least this extreme, given H₀), not P(H₀, given the data) |
| "P < 0.05 means the effect is real" | P < 0.05 means data this extreme would be rare if H₀ were true |
| "Larger p-value means no effect" | Large p-value means insufficient evidence, not proof of no effect |
| "P-value measures effect size" | P-value depends on sample size; large n can make tiny effects significant |
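The last row of the table can be checked numerically. The sketch below holds a tiny, hypothetical difference fixed (0.1 minutes with a standard deviation of 12, so Cohen's d stays under 0.01) and only varies the sample size; `two_sample_p_value` is a helper name introduced for this example.

```python
import numpy as np
from scipy import stats

def two_sample_p_value(diff, std, n_per_group):
    """Two-tailed z-test p-value for equal group sizes and a common std."""
    se = std * np.sqrt(2.0 / n_per_group)
    return 2 * stats.norm.sf(abs(diff) / se)

# Hypothetical tiny effect (assumption for illustration only)
diff, std = 0.1, 12.0
cohens_d = diff / std  # effect size does not depend on n

p_by_n = {n: two_sample_p_value(diff, std, n)
          for n in (1_000, 100_000, 1_000_000)}

print(f"Cohen's d = {cohens_d:.4f}")
for n, p in p_by_n.items():
    print(f"n per group = {n:>9,}: p = {p:.3e}")
```

The effect size is identical in every row, yet the p-value moves from clearly non-significant to overwhelmingly significant as n grows.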
4. Confidence Intervals
A confidence interval provides a range of plausible values for the true difference in means.
# Calculate 95% confidence interval for difference in means
diff = mean_treatment - mean_control
se_diff = pooled_std
# 95% CI: diff ± 1.96 * SE
ci_lower = diff - 1.96 * se_diff
ci_upper = diff + 1.96 * se_diff
print(f"Point estimate: {diff} minutes")
print(f"95% CI: ({ci_lower:.3f}, {ci_upper:.3f})")
print(f"CI width: {ci_upper - ci_lower:.3f} minutes")
# Interpretation
print(f"\nInterpretation:")
print(f"We are 95% confident that the true difference in mean watch time")
print(f"between treatment and control is between {ci_lower:.2f} and {ci_upper:.2f} minutes")
Mathematical Formula:
CI = (x̄₁ - x̄₂) ± z_(α/2) × √(s₁²/n₁ + s₂²/n₂)
For 95% CI: z_(α/2) = 1.96
For 99% CI: z_(α/2) = 2.576
For 90% CI: z_(α/2) = 1.645
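Rather than memorizing the critical values, they can be derived from the normal quantile function. The following sketch wraps the formula in a helper, `diff_ci` (a name introduced here), and applies it to the numbers from the question at three confidence levels.

```python
import numpy as np
from scipy import stats

def diff_ci(mean1, mean2, std1, std2, n1, n2, confidence=0.95):
    """Normal-approximation CI for mean1 - mean2."""
    z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)  # e.g. 1.96 for 95%
    se = np.sqrt(std1**2 / n1 + std2**2 / n2)
    diff = mean1 - mean2
    return diff - z_crit * se, diff + z_crit * se

# Treatment vs control, using the summary statistics from the question
for level in (0.90, 0.95, 0.99):
    lower, upper = diff_ci(47, 45, 13, 12, 50_000, 50_000, confidence=level)
    print(f"{level:.0%} CI: ({lower:.3f}, {upper:.3f})")
```

Higher confidence widens the interval, but with 50,000 users per group even the 99% interval stays well away from zero.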
5. Effect Size and Practical Significance
Statistical significance ≠ Practical significance. We need to measure effect size.
# Cohen's d for effect size
pooled_std_cohensd = np.sqrt(
    ((n_control - 1) * std_control**2 + (n_treatment - 1) * std_treatment**2) /
    (n_control + n_treatment - 2)
)
cohens_d = (mean_treatment - mean_control) / pooled_std_cohensd
print(f"Cohen's d: {cohens_d:.4f}")
print(f"\nEffect size interpretation:")
if abs(cohens_d) < 0.2:
    print("Negligible effect")
elif abs(cohens_d) < 0.5:
    print("Small effect")
elif abs(cohens_d) < 0.8:
    print("Medium effect")
else:
    print("Large effect")
# Practical significance calculation
revenue_per_minute = 0.05  # Assumption: each minute watched generates $0.05 in revenue
annual_impact_per_user = diff * revenue_per_minute * 365
total_annual_impact = annual_impact_per_user * 1000000 # 1M users
print(f"\nPractical Impact:")
print(f"Additional watch time per user per day: {diff} minutes")
print(f"Annual revenue impact per user: ${annual_impact_per_user:.2f}")
print(f"Total annual impact (1M users): ${total_annual_impact:,.0f}")
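One common way to turn "practically significant" into a decision rule is to compare the confidence interval with a minimum effect the business cares about. The sketch below assumes a hypothetical threshold of 1 minute per day and takes the 95% CI bounds from section 4 as inputs; `classify_result` and its labels are illustrative names, not a standard API.

```python
def classify_result(ci_lower, ci_upper, min_effect):
    """Label a result by where its CI sits relative to a practical threshold."""
    if ci_lower >= min_effect:
        return "statistically and practically significant"
    if ci_lower > 0 and ci_upper < min_effect:
        return "statistically significant but too small to matter"
    if ci_lower > 0:
        return "statistically significant, practical value uncertain"
    return "not statistically significant"

# 95% CI for the difference in means from section 4 (rounded)
ci_lower, ci_upper = 1.845, 2.155

# Hypothetical thresholds in minutes per day (assumptions for illustration)
verdict_low_bar = classify_result(ci_lower, ci_upper, min_effect=1.0)
verdict_high_bar = classify_result(ci_lower, ci_upper, min_effect=3.0)
print(f"Threshold 1.0 min: {verdict_low_bar}")
print(f"Threshold 3.0 min: {verdict_high_bar}")
```

The same interval leads to opposite recommendations depending on the threshold, so the threshold should come from the business case and be agreed before the test runs.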
6. Power Analysis
Power is the probability of correctly rejecting a false null hypothesis (1 - β).
from statsmodels.stats.power import TTestIndPower
# Calculate required sample size for 80% power
effect_size = cohens_d
alpha = 0.05
power = 0.80
analysis = TTestIndPower()
required_n = analysis.solve_power(
    effect_size=effect_size,
    alpha=alpha,
    power=power,
    ratio=1.0  # Equal group sizes
)
print(f"Required sample size per group: {int(np.ceil(required_n))}")
print(f"Current sample size: {n_control}")
print(f"Sufficient power: {'Yes' if n_control >= required_n else 'No'}")
# Calculate actual power with current sample size
actual_power = analysis.solve_power(
    effect_size=effect_size,
    alpha=alpha,
    nobs1=n_control,
    ratio=1.0
)
print(f"Actual power: {actual_power:.4f}")
Power Analysis Formula:
n = (z_(α/2) + z_β)² × 2σ² / δ²
where:
n = sample size per group
z_(α/2) = critical value for significance level
z_β = critical value for power (1 - β)
σ = standard deviation
δ = minimum detectable difference in means (in the same units as σ)
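The formula is simple enough to evaluate directly. The sketch below implements it with normal quantiles and applies it to the question's numbers, assuming σ is the root mean square of the two group standard deviations and δ = 2 minutes; `required_sample_size` is a helper name introduced here.

```python
import math
from scipy import stats

def required_sample_size(sigma, delta, alpha=0.05, power=0.80):
    """Per-group n for a two-tailed, two-sample z-test with equal group sizes."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # z_(alpha/2)
    z_beta = stats.norm.ppf(power)           # z_beta for power = 1 - beta
    n = (z_alpha + z_beta) ** 2 * 2 * sigma**2 / delta**2
    return math.ceil(n)

# Assumption: combine the two group standard deviations into one sigma
sigma = math.sqrt((12**2 + 13**2) / 2)

n_needed = required_sample_size(sigma, delta=2.0)
print(f"sigma = {sigma:.2f}, required n per group = {n_needed}")
```

Because it uses the normal rather than the t-distribution, the result should be close to, but not necessarily identical to, what `TTestIndPower` reports. Halving δ roughly quadruples the required sample size, since n scales with 1/δ².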
7. Potential Pitfalls and Solutions
# Pitfall 1: Multiple Comparisons
# When testing multiple metrics, lower the per-test alpha to control the family-wise error rate
n_tests = 5
bonferroni_alpha = 0.05 / n_tests
print(f"Bonferroni-corrected alpha: {bonferroni_alpha:.4f}")
# Pitfall 2: Peeking at Results
# Use group sequential testing or always-valid inference
# Alpha spending function: cumulative Type I error allowed at each interim look
def alpha_spending(spent_fraction, total_alpha=0.05):
    """O'Brien-Fleming-type (Lan-DeMets) spending function for 0 < spent_fraction <= 1"""
    z_half_alpha = stats.norm.ppf(1 - total_alpha / 2)
    return 2 * stats.norm.sf(z_half_alpha / np.sqrt(spent_fraction))
# Pitfall 3: Simpson's Paradox
# Check for confounding variables, e.g. a segment mix that differs between groups
print("\nExample of Simpson's Paradox:")
print("Overall: Treatment looks better")
print("But when stratified by user segment:")
print("- New users: Control better")
print("- Medium users: Control better")
print("- Power users: Control better")
print("Treatment only looks better overall because its group contains more power users")
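# Pitfall 4: Selection Bias
# Ensure random assignment by comparing pre-experiment metrics across groups.
# mean_control and mean_treatment are outcomes measured during the test, so they
# cannot be used for this check; substitute each group's pre-period mean instead.
print("\nChecking for selection bias:")
print("Pre-test characteristics should be similar across groups,")
print("for example mean watch time in the weeks before the experiment started")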
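Simpson's paradox is easier to recognize with numbers. The sketch below uses two hypothetical segments (all means and user counts are made up for illustration) in which control wins inside every segment, yet treatment wins overall because its group is dominated by heavy-watching power users.

```python
# Hypothetical (mean watch time in minutes, number of users) per segment and arm
segments = {
    "new users":   {"control": (30.0, 40_000), "treatment": (29.0, 10_000)},
    "power users": {"control": (90.0, 10_000), "treatment": (88.0, 40_000)},
}

def overall_mean(arm):
    """User-weighted mean watch time for one arm across all segments."""
    total_minutes = sum(seg[arm][0] * seg[arm][1] for seg in segments.values())
    total_users = sum(seg[arm][1] for seg in segments.values())
    return total_minutes / total_users

for name, seg in segments.items():
    print(f"{name}: control = {seg['control'][0]}, treatment = {seg['treatment'][0]}")

overall_control = overall_mean("control")
overall_treatment = overall_mean("treatment")
print(f"overall: control = {overall_control:.1f}, treatment = {overall_treatment:.1f}")
```

A segment mix this lopsided would itself be a red flag for broken randomization, so checking the mix between groups is a sensible first diagnostic.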
⚠️
Critical Warning: Never stop a test early just because you see "significant" results. This inflates Type I error. Use proper sequential testing methods instead.
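The size of that inflation can be estimated with a small A/A simulation. The sketch below assumes 10 evenly spaced looks, a known standard deviation of 1, and 2,000 simulated experiments (all hypothetical settings); it compares stopping at the first "significant" look with testing once at the end.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, n_looks, n_per_look = 2000, 10, 100  # hypothetical settings
z_crit = 1.96  # two-tailed critical value for alpha = 0.05

# A/A data: both arms come from the same distribution, so H0 is true
total_n = n_looks * n_per_look
control = rng.normal(0.0, 1.0, size=(n_sims, total_n))
treatment = rng.normal(0.0, 1.0, size=(n_sims, total_n))

# Cumulative sample size and difference in means at each look
look_sizes = np.arange(1, n_looks + 1) * n_per_look
cum_diff = np.cumsum(treatment - control, axis=1)[:, look_sizes - 1]
mean_diff = cum_diff / look_sizes

# z-statistic at each look (sigma = 1 is assumed known)
z = mean_diff / np.sqrt(2.0 / look_sizes)

# Peeking: declare a win if any look crosses the threshold
peeking_rate = float(np.mean(np.any(np.abs(z) > z_crit, axis=1)))
# Fixed horizon: test only once, at the final look
final_only_rate = float(np.mean(np.abs(z[:, -1]) > z_crit))

print(f"False positive rate with peeking:    {peeking_rate:.3f}")
print(f"False positive rate, final look only: {final_only_rate:.3f}")
```

Under these assumptions the peeking rate comes out several times higher than the nominal 5%, while the single final test stays close to it.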
8. Common Follow-Up Questions
Follow-up 1: What if the data isn't normally distributed?
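# With 50,000 users per group, the central limit theorem keeps the z-test on means
# reliable even for skewed data; rank-based and bootstrap methods matter most for
# small samples or heavy tails.
# Use non-parametric tests
from scipy.stats import mannwhitneyu
# treatment_group_data and control_group_data are arrays of per-user watch times
# (the raw observations, which the summary statistics in the question do not provide)
# Mann-Whitney U test (rank-based alternative to the t-test)
stat, p_value_mw = mannwhitneyu(
    treatment_group_data,
    control_group_data,
    alternative='greater'
)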
print(f"Mann-Whitney U test p-value: {p_value_mw:.6f}")
# Bootstrap confidence interval
def bootstrap_ci(data1, data2, n_bootstrap=10000, ci=0.95):
    """Calculate bootstrap confidence interval for difference in means"""
    boot_diffs = []
    for _ in range(n_bootstrap):
        boot1 = np.random.choice(data1, size=len(data1), replace=True)
        boot2 = np.random.choice(data2, size=len(data2), replace=True)
        boot_diffs.append(np.mean(boot1) - np.mean(boot2))
    lower = np.percentile(boot_diffs, (1-ci)/2 * 100)
    upper = np.percentile(boot_diffs, (1+ci)/2 * 100)
    return lower, upper
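To get a feel for how the bootstrap behaves on skewed data, the sketch below runs a percentile bootstrap on synthetic, exponentially distributed watch times and compares it with the normal-approximation interval. The data, sample sizes, and the helper name `bootstrap_diff_ci` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic right-skewed watch times (hypothetical data, not from the question)
control_sample = rng.exponential(scale=45.0, size=2000)
treatment_sample = rng.exponential(scale=47.0, size=2000)

def bootstrap_diff_ci(a, b, n_bootstrap=2000, ci=0.95):
    """Percentile bootstrap CI for mean(a) - mean(b)."""
    diffs = np.empty(n_bootstrap)
    for i in range(n_bootstrap):
        # Resample each group with replacement, keeping its original size
        diffs[i] = rng.choice(a, size=a.size).mean() - rng.choice(b, size=b.size).mean()
    tail = (1 - ci) / 2 * 100
    return np.percentile(diffs, tail), np.percentile(diffs, 100 - tail)

boot_lower, boot_upper = bootstrap_diff_ci(treatment_sample, control_sample)

# Normal-approximation CI on the same data, for comparison
diff = treatment_sample.mean() - control_sample.mean()
se = np.sqrt(treatment_sample.var(ddof=1) / treatment_sample.size
             + control_sample.var(ddof=1) / control_sample.size)
norm_lower, norm_upper = diff - 1.96 * se, diff + 1.96 * se

print(f"Bootstrap 95% CI: ({boot_lower:.2f}, {boot_upper:.2f})")
print(f"Normal 95% CI:    ({norm_lower:.2f}, {norm_upper:.2f})")
```

With a couple of thousand users per group the two intervals should nearly coincide even though the raw data are far from normal, which is the central limit theorem at work.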
Follow-up 2: How do you handle multiple metrics?
# Family-wise error rate control
from statsmodels.stats.multitest import multipletests
p_values = [0.02, 0.04, 0.08, 0.12, 0.03]
metric_names = ['Watch Time', 'Completion Rate', 'Searches', 'Downloads', 'Return Visits']
# Bonferroni correction
rejected, corrected_p, _, _ = multipletests(p_values, alpha=0.05, method='bonferroni')
print("Results with Bonferroni correction:")
for name, p, corr_p, rej in zip(metric_names, p_values, corrected_p, rejected):
    print(f"{name}: p={p:.4f}, corrected_p={corr_p:.4f}, significant={rej}")
# False Discovery Rate (less conservative)
rejected_fdr, corrected_p_fdr, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')
print("\nResults with FDR correction:")
for name, p, corr_p, rej in zip(metric_names, p_values, corrected_p_fdr, rejected_fdr):
    print(f"{name}: p={p:.4f}, corrected_p={corr_p:.4f}, significant={rej}")
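Interviewers sometimes ask what these corrections do under the hood. The sketch below recomputes the Bonferroni and Benjamini-Hochberg adjusted p-values by hand with NumPy for the same five p-values; it is meant as an illustration of the arithmetic, not a replacement for `multipletests`.

```python
import numpy as np

p_values = np.array([0.02, 0.04, 0.08, 0.12, 0.03])
m = len(p_values)

# Bonferroni: multiply every p-value by the number of tests, cap at 1
bonferroni_adj = np.minimum(p_values * m, 1.0)

# Benjamini-Hochberg: scale the i-th smallest p-value by m / i ...
order = np.argsort(p_values)
scaled = p_values[order] * m / np.arange(1, m + 1)
# ... then enforce monotonicity by taking running minima from the largest rank down
bh_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
bh_adj = np.empty(m)
bh_adj[order] = np.minimum(bh_sorted, 1.0)  # restore the original metric order

print("Bonferroni adjusted:", np.round(bonferroni_adj, 4))
print("BH adjusted:        ", np.round(bh_adj, 4))
print("Any significant at 0.05 (Bonferroni):", bool((bonferroni_adj < 0.05).any()))
print("Any significant at 0.05 (BH):        ", bool((bh_adj < 0.05).any()))
```

For this particular set of p-values neither method leaves a metric significant at 0.05, even though three raw p-values are below 0.05. BH is less conservative (its adjusted values are smaller), but it is not a guarantee of more discoveries.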
Company-Specific Tips
ℹ️
Google Tips:
- Google often asks about Bayesian vs Frequentist approaches
- Be prepared to explain p-values in business terms
- Know when to use z-test vs t-test vs chi-square
- Practice power analysis calculations
Netflix Tips:
- Netflix interviews lean heavily on A/B testing methodology
- Understand sequential testing and early stopping rules
- Know how to handle network effects and interference
- Be comfortable with regression-based analysis of experiments
Related Topics
- Bayesian Statistics — Alternative to frequentist hypothesis testing
- A/B Testing Design — Setting up proper experiments
- Power Analysis — Determining sample sizes
- Multiple Comparisons — Handling multiple tests