
Asked at Amazon & Netflix

Metrics Design

How Would You Measure Success for [Product]?

The Interview Question

"We're launching a new feature on [Netflix recommendation engine / Amazon Prime Video]. How would you measure its success?"

Metrics design is one of the highest-leverage questions in data science interviews. Amazon and Netflix use it to evaluate whether you can connect business strategy to measurable outcomes.

Why Companies Ask This

ℹ️

At Amazon, this question is directly tied to their Leadership Principle of "Deliver Results" — can you define what success looks like before building anything? At Netflix, it tests whether you understand the tension between engagement and satisfaction.

Interviewers are assessing:

Strategic Thinking — Do you understand the business model?
Metric Hierarchy — Can you build from north star → supporting → guardrail metrics?
Trade-off Awareness — Do you understand Goodhart's Law and metric gaming?
Practical Execution — Can you design a measurement plan with statistical rigor?
Cross-functional Communication — Can you explain metrics to non-technical leaders?

The Metric Design Framework

Layer 1: North Star Metric

The single metric that best captures the core value delivered to users.

Layer 2: Supporting Metrics

Metrics that influence or decompose the north star into actionable components.

Layer 3: Guardrail Metrics

Metrics you monitor to ensure you're not inadvertently hurting other parts of the business.

Layer 4: Operational Metrics

Technical and system health metrics for the feature itself.
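
One way to keep the four layers straight is to write them down as a small data structure before discussing any numbers. The sketch below is illustrative only: the `MetricPlan` class, the `max_supporting` limit of five, and the metric names (including `homepage_latency_ms`) are assumptions made for this example, not part of any company's tooling.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MetricPlan:
    """A four-layer metric hierarchy (hypothetical helper for illustration)."""
    north_star: str
    supporting: List[str] = field(default_factory=list)
    guardrails: List[str] = field(default_factory=list)
    operational: List[str] = field(default_factory=list)

    def validate(self, max_supporting: int = 5) -> List[str]:
        """Return a list of problems with the plan; an empty list means none found."""
        problems = []
        if not self.north_star:
            problems.append("missing north star metric")
        if not self.guardrails:
            problems.append("no guardrail metrics defined")
        if len(self.supporting) > max_supporting:
            problems.append("too many supporting metrics; prioritize")
        return problems


plan = MetricPlan(
    north_star="hours_streamed_per_viewer_per_week",
    supporting=["content_discovery_rate", "browse_to_play_conversion"],
    guardrails=["churn_rate", "app_crash_rate"],
    operational=["homepage_latency_ms"],
)
print(plan.validate())
```

A plan with no guardrails or with a long list of supporting metrics gets flagged, which mirrors the prioritization interviewers look for.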

Example: Netflix "New Personalized Homepage" Feature

The Feature

A redesigned homepage that uses deep learning to personalize content rows based on viewing patterns, time of day, and device type.

North Star Metric

Hours of content streamed per unique viewer per week

Why this? Netflix's business model is subscription-based, so value creation follows the chain content consumption → retention (reduced churn).

Supporting Metrics

Metric	Definition	Target
Content discovery rate	% of streams from titles recommended by the new algorithm	+15%
Browse-to-play conversion	% of sessions where user starts playing content within 5 minutes	+8%
Session length	Average viewing duration per session	+12%
Return frequency	Number of app opens per week per user	+5%
Title diversity	Unique titles watched per user per month	+20%

Guardrail Metrics

⚠️

These are critical — without guardrails, your "success" might be destroying value elsewhere.

Guardrail	Why It's a Guardrail
Churn rate	If churn increases, your feature is hurting retention
Customer support tickets	Feature confusion = bad UX
App crash rate	Performance regressions
Time to first play	If users take longer to start watching, you've added friction
Content creator satisfaction	Ensure smaller creators still get visibility
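
A guardrail is usually evaluated as a non-inferiority check rather than a test for improvement: the question is whether the feature could plausibly be making the metric worse by more than a tolerated margin. The following is a minimal sketch under stated assumptions. The function name `guardrail_check`, the `margin` parameter, and the churn counts are all hypothetical, and the interval uses a plain normal approximation.

```python
from math import sqrt
from statistics import NormalDist


def guardrail_check(control_events, control_n, treatment_events, treatment_n,
                    margin, alpha=0.05):
    """Non-inferiority check for a 'lower is better' rate such as churn.

    Passes only if the one-sided upper confidence bound on
    (treatment rate - control rate) stays below `margin`.
    """
    p_c = control_events / control_n
    p_t = treatment_events / treatment_n
    diff = p_t - p_c
    # Unpooled standard error of the difference in proportions.
    se = sqrt(p_c * (1 - p_c) / control_n + p_t * (1 - p_t) / treatment_n)
    upper = diff + NormalDist().inv_cdf(1 - alpha) * se
    return {"diff": diff, "upper_bound": upper, "passed": upper < margin}


# Made-up counts for illustration: churned users out of users exposed.
result = guardrail_check(control_events=2100, control_n=100_000,
                         treatment_events=2150, treatment_n=100_000,
                         margin=0.002)
print(result)
```

Choosing the margin is a business decision (how much extra churn is tolerable for the expected gain), so it is worth stating explicitly in the interview.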

Technical Implementation

import numpy as np  # needed by sample_size_calculation (np.sqrt, np.ceil)
import pandas as pd
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

class MetricsExperiment:
    def __init__(self, control_data, treatment_data):
        self.control = control_data
        self.treatment = treatment_data
    
    def compute_north_star(self, data):
        """Compute hours streamed per unique viewer per week.

        Weeks in which a user has no rows in `data` are skipped,
        not counted as zero hours.
        """
        return (
            data.groupby(['user_id', 'week'])['watch_duration_minutes']
            .sum()
            .reset_index()
            .groupby('user_id')['watch_duration_minutes']
            .mean()
            .mean() / 60  # Convert to hours
        )
    
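    def compute_supporting_metrics(self, data):
        """Compute supporting metrics from stream-level rows.

        Assumes a boolean `is_recommended` column and that `data` covers a
        single month. Return frequency needs app-open events and is not
        computed here.
        """
        return {
            # Share of streams that came from recommended titles.
            'discovery_rate': (
                data.loc[data['is_recommended'], 'streamed'].sum() /
                data['streamed'].sum()
            ),
            # Share of all sessions with a first play within 5 minutes.
            # `time_to_first_play_minutes` is an assumed column (NaN when
            # nothing was played), so sessions with no play count as misses.
            'browse_to_play': (
                data.groupby('session_id')['time_to_first_play_minutes']
                .min()
                .le(5)
                .mean()
            ),
            # Average minutes watched per session.
            'avg_session_length': data.groupby('session_id')[
                'watch_duration_minutes'
            ].sum().mean(),
            # Average number of distinct titles per user over the period.
            'title_diversity': data.groupby('user_id')[
                'title_id'
            ].nunique().mean(),
        }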
    
    def statistical_test(self, metric_name, alpha=0.05):
        """Run a two-sided two-proportion z-test on a binary (0/1) metric.

        Assumes `metric_name` is a 0/1 column with one row per
        randomization unit (for example, one row per user).
        """
        control_val = self.control[metric_name]
        treatment_val = self.treatment[metric_name]

        # Successes and sample sizes, treatment first so that a positive
        # z-statistic means the treatment rate is higher.
        count = [treatment_val.sum(), control_val.sum()]
        nobs = [len(treatment_val), len(control_val)]

        stat, p_value = proportions_ztest(count, nobs)
        control_rate = control_val.mean()
        treatment_rate = treatment_val.mean()
        return {
            'metric': metric_name,
            'control': control_rate,
            'treatment': treatment_rate,
            'lift': (treatment_rate - control_rate) / control_rate,
            'z_stat': stat,
            'p_value': p_value,
            'significant': p_value < alpha,
        }
    
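    def sample_size_calculation(self, baseline_rate, mde, alpha=0.05, power=0.8):
        """Per-group sample size for a two-sided two-proportion A/B test.

        `mde` is the minimum detectable effect as a relative lift
        (0.05 means the treatment rate is baseline_rate * 1.05).
        """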
        from scipy.stats import norm
        
        z_alpha = norm.ppf(1 - alpha / 2)
        z_beta = norm.ppf(power)
        
        p1 = baseline_rate
        p2 = baseline_rate * (1 + mde)
        p_avg = (p1 + p2) / 2
        
        n = (
            (z_alpha * np.sqrt(2 * p_avg * (1 - p_avg)) +
             z_beta * np.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 /
            (p2 - p1) ** 2
        )
        
        return int(np.ceil(n))
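
The `statistical_test` method above only covers binary metrics, while the north star (hours streamed per viewer) is continuous. A common choice for continuous metrics is a Welch-style comparison of means. The sketch below uses only the standard library and substitutes a normal approximation for the t-distribution, which is reasonable only for large samples; in practice `scipy.stats.ttest_ind(..., equal_var=False)` gives the exact Welch t-test. The function name `welch_z_test` and the toy data are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_z_test(control, treatment):
    """Two-sided test for a difference in means with unequal variances.

    Uses the Welch standard error with a normal approximation to the
    t-distribution, so it is only appropriate for large samples.
    """
    se = sqrt(variance(control) / len(control)
              + variance(treatment) / len(treatment))
    z = (mean(treatment) - mean(control)) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return {"lift": mean(treatment) / mean(control) - 1, "z": z, "p_value": p_value}


# Toy per-user weekly hours, far too few for the approximation to hold;
# real experiments would pass one value per user for thousands of users.
control_hours = [1.0, 2.0, 3.0, 4.0, 5.0]
treatment_hours = [2.0, 3.0, 4.0, 5.0, 6.0]
result = welch_z_test(control_hours, treatment_hours)
print(result)
```

Streaming hours tend to be heavily right-skewed, so it is also worth mentioning winsorizing or a bootstrap as alternatives when a few heavy viewers dominate the mean.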

Netflix-Specific Metrics Considerations

The Engagement vs. Satisfaction Tension

💡

Netflix interviewers love probing this: "What if your feature increases engagement but users report feeling 'addicted' or 'regretful'?" This tests whether you understand that raw engagement isn't always good.

Netflix distinguishes between:

Active engagement — deliberate, enjoyable viewing
Passive consumption — autoplay loops, doom-scrolling
Satisfaction signals — post-viewing surveys, retention

Your metrics should capture this distinction:

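def satisfaction_metrics(counts, average_satisfaction_rating):
    """Ratios separating deliberate viewing from passive consumption.

    `counts` maps event names to raw counts; all key names are illustrative.
    """
    return {
        'intentional_starts': counts['content_started_by_user_action'] / counts['total_starts'],
        'post_survey_score': average_satisfaction_rating,
        'regret_watches': counts['content_hidden_or_rated_low_after_viewing'] / counts['total_views'],
        'return_next_day': counts['users_returning_within_24h'] / counts['active_users'],
    }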

Amazon-Specific Metrics Considerations

The "Flywheel" Mentality

Amazon thinks in flywheels. Your metrics should show how the feature creates a self-reinforcing loop:


Better Recommendations → More Content Consumed → 
Better Data for Algorithms → Even Better Recommendations → 
Higher Retention → More Investment in Content → 
More Content Available → Better Recommendations

Customer Obsession Metrics

Always frame metrics in terms of customer value:

Customer Effort Score — How hard was it to find something to watch?
Time to Value — How quickly did they find something they loved?
Repeat Usage — Did they come back because of the feature?

Common Anti-Patterns to Avoid

1. The "Vanity Metric" Trap

# BAD: "Total streams increased by 10M!"
# This means nothing without context

# GOOD: "Streams per active user increased 12% (p<0.01) 
#         while maintaining churn rate at 2.1%"
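
To see why the raw total misleads, it helps to normalize by active users. The numbers below are made up purely for illustration: total streams grow by 10M, yet streams per active user fall because the user base grew faster. The helper name `streams_per_active_user` is likewise hypothetical.

```python
def streams_per_active_user(total_streams, active_users):
    """Normalize a raw stream count by the number of active users."""
    return total_streams / active_users


# Made-up numbers for illustration only.
before = streams_per_active_user(total_streams=100_000_000, active_users=10_000_000)
after = streams_per_active_user(total_streams=110_000_000, active_users=12_000_000)

# Relative change in the per-user metric despite +10M total streams.
change = after / before - 1
print(f"{change:+.1%}")
```

Here the headline number rises while the per-user metric drops by roughly 8%, which is the context a vanity metric hides.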

2. The "Metric Overload" Problem

Don't propose 30 metrics. The interviewer wants to see you can prioritize.

3. The "Missing Guardrails" Mistake

Always ask: "What could go wrong?" and add guardrail metrics.

4. The "No Statistical Rigor" Omission

Always mention:

Sample size calculations
Statistical significance thresholds
Multiple testing corrections
Novelty and primacy effects
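
With one north star and several supporting metrics, each tested at alpha = 0.05, the chance of at least one false positive grows quickly. As one possible multiple testing correction, the sketch below implements the Benjamini-Hochberg step-up procedure in plain Python. The function name and the p-values are hypothetical; they are not results from any real experiment.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return {metric: True/False} for significance under Benjamini-Hochberg.

    `p_values` maps metric names to raw p-values; `fdr` is the target
    false discovery rate.
    """
    ranked = sorted(p_values.items(), key=lambda kv: kv[1])
    m = len(ranked)
    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        if p <= rank / m * fdr:
            cutoff_rank = rank
    # Every metric at or below that rank is declared significant.
    return {name: rank <= cutoff_rank
            for rank, (name, _) in enumerate(ranked, start=1)}


# Made-up p-values for the five supporting metrics.
raw = {
    "discovery_rate": 0.001,
    "browse_to_play": 0.012,
    "session_length": 0.035,
    "return_frequency": 0.039,
    "title_diversity": 0.30,
}
decisions = benjamini_hochberg(raw)
print(decisions)
```

With these inputs the first four metrics are declared significant, whereas a Bonferroni threshold of 0.05 / 5 = 0.01 would keep only `discovery_rate`. Which correction to use depends on whether false positives or missed effects are more costly.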

How to Structure Your Answer

Step 1: Clarify the feature and business context (2 minutes)
Step 2: Define the north star metric and justify it (2 minutes)
Step 3: Decompose into supporting metrics (3 minutes)
Step 4: Identify guardrail metrics (2 minutes)
Step 5: Discuss measurement methodology (2 minutes)
Step 6: Address edge cases and trade-offs (2 minutes)
