The Interview Question
"We're launching a new feature on [Netflix recommendation engine / Amazon Prime Video]. How would you measure its success?"
Metrics design is one of the highest-leverage questions in data science interviews. Amazon and Netflix use it to evaluate whether you can connect business strategy to measurable outcomes.
Why Companies Ask This
βΉοΈ
At Amazon, this question is directly tied to their Leadership Principle of "Deliver Results" β can you define what success looks like before building anything? At Netflix, it tests whether you understand the tension between engagement and satisfaction.
Interviewers are assessing:
- Strategic Thinking β Do you understand the business model?
- Metric Hierarchy β Can you build from north star β supporting β guardrail metrics?
- Trade-off Awareness β Do you understand Goodhart's Law and metric gaming?
- Practical Execution β Can you design a measurement plan with statistical rigor?
- Cross-functional Communication β Can you explain metrics to non-technical leaders?
The Metric Design Framework
Layer 1: North Star Metric
The single metric that best captures the core value delivered to users.
Layer 2: Supporting Metrics
Metrics that influence or decompose the north star into actionable components.
Layer 3: Guardrail Metrics
Metrics you monitor to ensure you're not inadvertently hurting other parts of the business.
Layer 4: Operational Metrics
Technical and system health metrics for the feature itself.
Example: Netflix "New Personalized Homepage" Feature
The Feature
A redesigned homepage that uses deep learning to personalize content rows based on viewing patterns, time of day, and device type.
North Star Metric
Hours of content streamed per unique viewer per week
Why this? Netflix's business model is subscription-based. Value creation = content consumption β retention β reduced churn.
Supporting Metrics
| Metric | Why It Matters | Target |
|---|---|---|
| Content discovery rate | % of streams from titles recommended by the new algorithm | +15% |
| Browse-to-play conversion | % of sessions where user starts playing content within 5 minutes | +8% |
| Session length | Average viewing duration per session | +12% |
| Return frequency | Number of app opens per week per user | +5% |
| Title diversity | Unique titles watched per user per month | +20% |
Guardrail Metrics
β οΈ
These are critical β without guardrails, your "success" might be destroying value elsewhere.
| Guardrail | Why It's a Guardrail |
|---|---|
| Churn rate | If churn increases, your feature is hurting retention |
| Customer support tickets | Feature confusion = bad UX |
| App crash rate | Performance regressions |
| Time to first play | If users take longer to start watching, you've added friction |
| Content creator satisfaction | Ensure smaller creators still get visibility |
Technical Implementation
import pandas as pd
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest
class MetricsExperiment:
def __init__(self, control_data, treatment_data):
self.control = control_data
self.treatment = treatment_data
def compute_north_star(self, data):
"""Compute hours streamed per unique viewer per week."""
return (
data.groupby(['user_id', 'week'])['watch_duration_minutes']
.sum()
.reset_index()
.groupby('user_id')['watch_duration_minutes']
.mean()
.mean() / 60 # Convert to hours
)
def compute_supporting_metrics(self, data):
"""Compute all supporting metrics."""
return {
'discovery_rate': (
data[data['is_recommended']]['streamed'].sum() /
data['streamed'].sum()
),
'browse_to_play': (
data[data['started_playing']]['session_duration_minutes'] <= 5
).mean(),
'avg_session_length': data.groupby('session_id')[
'watch_duration_minutes'
].sum().mean(),
'title_diversity': data.groupby('user_id')[
'title_id'
].nunique().mean(),
}
def statistical_test(self, metric_name, alpha=0.05):
"""Run two-proportion z-test for binary metrics."""
control_val = self.control[metric_name]
treatment_val = self.treatment[metric_name]
nobs = [len(self.control), len(self.treatment)]
count = [
control_val.sum() if hasattr(control_val, 'sum') else control_val,
treatment_val.sum() if hasattr(treatment_val, 'sum') else treatment_val
]
stat, p_value = proportions_ztest(count, nobs)
return {
'metric': metric_name,
'control': control_val.mean() if hasattr(control_val, 'mean') else control_val,
'treatment': treatment_val.mean() if hasattr(treatment_val, 'mean') else treatment_val,
'lift': (treatment_val.mean() - control_val.mean()) / control_val.mean(),
'p_value': p_value,
'significant': p_value < alpha,
}
def sample_size_calculation(self, baseline_rate, mde, alpha=0.05, power=0.8):
"""Calculate required sample size for A/B test."""
from scipy.stats import norm
z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)
p1 = baseline_rate
p2 = baseline_rate * (1 + mde)
p_avg = (p1 + p2) / 2
n = (
(z_alpha * np.sqrt(2 * p_avg * (1 - p_avg)) +
z_beta * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 /
(p2 - p1) ** 2
)
return int(np.ceil(n))
Netflix-Specific Metrics Considerations
The Engagement vs. Satisfaction Tension
π‘
Netflix interviewers love probing this: "What if your feature increases engagement but users report feeling 'addicted' or 'regretful'?" This tests whether you understand that raw engagement isn't always good.
Netflix distinguishes between:
- Active engagement β deliberate, enjoyable viewing
- Passive consumption β autoplay loops, doom-scrolling
- Satisfaction signals β post-viewing surveys, retention
Your metrics should capture this distinction:
satisfaction_metrics = {
'intentional_starts': content_started_by_user_action / total_starts,
'post_survey_score': average_satisfaction_rating,
'regret_watches': content_hidden_or_rated_low_after_viewing / total_views,
'return_next_day': users_returning_within_24h / active_users,
}
Amazon-Specific Metrics Considerations
The "Flywheel" Mentality
Amazon thinks in flywheels. Your metrics should show how the feature creates a self-reinforcing loop:
Better Recommendations β More Content Consumed β
Better Data for Algorithms β Even Better Recommendations β
Higher Retention β More Investment in Content β
More Content Available β Better Recommendations
Customer Obsession Metrics
Always frame metrics in terms of customer value:
- Customer Effort Score β How hard was it to find something to watch?
- Time to Value β How quickly did they find something they loved?
- Repeat Usage β Did they come back because of the feature?
Common Anti-Patterns to Avoid
1. The "Vanity Metric" Trap
# BAD: "Total streams increased by 10M!"
# This means nothing without context
# GOOD: "Streams per active user increased 12% (p<0.01)
# while maintaining churn rate at 2.1%"
2. The "Metric Overload" Problem
Don't propose 30 metrics. The interviewer wants to see you can prioritize.
3. The "Missing Guardrails" Mistake
Always ask: "What could go wrong?" and add guardrail metrics.
4. The "No Statistical Rigor" Omission
Always mention:
- Sample size calculations
- Statistical significance thresholds
- Multiple testing corrections
- Novelty and novelty effects
How to Structure Your Answer
Step 1: Clarify the feature and business context (2 minutes) Step 2: Define the north star metric and justify it (2 minutes) Step 3: Decompose into supporting metrics (3 minutes) Step 4: Identify guardrail metrics (2 minutes) Step 5: Discuss measurement methodology (2 minutes) Step 6: Address edge cases and trade-offs (2 minutes)