Netflix, Spotify, SaaS Companies

Design Customer Churn Prediction System

Building predictive churn models for millions of subscribers with actionable insights

Interview Question

"Design a customer churn prediction system like Netflix or Spotify that can predict which customers are likely to cancel their subscription, with enough lead time to enable effective retention interventions, while handling millions of users and updating predictions in real-time."

Difficulty: Hard | Frequently asked at Netflix, Spotify, SaaS companies, Telecom, Banking

1. Requirements Gathering

Functional Requirements

Churn Prediction: Predict probability of customer churning within N days
Real-time Scoring: Update predictions as new behavior data arrives
Risk Segmentation: Categorize customers by churn risk level
Intervention Triggers: Trigger retention actions based on risk
Root Cause Analysis: Identify key factors driving churn
A/B Testing: Test different retention strategies
Reporting: Dashboards for business stakeholders

Non-Functional Requirements

Latency: < 1s for batch predictions, < 100ms for real-time updates
Throughput: Score millions of customers daily
Accuracy: AUC > 0.8, precision > 70% at top 10% risk
Freshness: Predictions update within hours of new data
Scalability: Handle 10x growth in customers
Explainability: All predictions must be explainable
Privacy: GDPR/CCPA compliant

ℹ️

Scale Perspective: Netflix has 260M+ subscribers. Even a 1% improvement in churn prediction can save millions in revenue. The system must identify at-risk customers early enough for effective intervention while maintaining prediction accuracy.

2. High-Level Architecture Overview

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                         DATA SOURCES                                        │
│  User Activity │ Subscription Data │ Support Tickets │ Payment Data │ NPS   │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                         DATA PIPELINE                                       │
│  Real-time Streaming │ Feature Computation │ Feature Store │ Batch Processing│
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                    ┌───────────────┼───────────────┐
                    ▼               ▼               ▼
┌────────────────────────┐ ┌───────────────┐ ┌──────────────────────┐
│  CHURN PREDICTION      │ │ SEGMENTATION  │ │ ROOT CAUSE           │
│  MODEL                 │ │ ENGINE        │ │ ANALYSIS             │
│  (GBDT + NN)           │ │               │ │                      │
└────────────────────────┘ └───────────────┘ └──────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                        DECISION ENGINE                                       │
│  Risk Scoring │ Intervention Selection │ Timing Optimization │ Attribution  │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                    ┌───────────────┼───────────────┐
                    ▼               ▼               ▼
┌────────────────────────┐ ┌───────────────┐ ┌──────────────────────┐
│  RETENTION             │ │ MARKETING     │ │ PRODUCT              │
│  INTERVENTIONS         │ │ CAMPAIGNS     │ │ IMPROVEMENTS         │
└────────────────────────┘ └───────────────┘ └──────────────────────┘

💡

Key Insight: Churn prediction is not just about building an accurate model. The real value is in identifying actionable insights and triggering effective interventions at the right time.

3. Data Pipeline Design

3.1 Customer Data Model

from dataclasses import dataclass
from typing import List, Dict, Optional
from datetime import datetime
from decimal import Decimal

@dataclass
class Customer:
    customer_id: str
    subscription_tier: str
    subscription_start: datetime
    monthly_revenue: Decimal
    payment_method: str
    billing_cycle: str
    
@dataclass
class CustomerActivity:
    customer_id: str
    timestamp: datetime
    activity_type: str  # login, watch, listen, search, etc.
    duration_minutes: float
    device_type: str
    content_type: Optional[str]
    
@dataclass
class ChurnLabel:
    customer_id: str
    churn_date: Optional[datetime]
    churn_reason: Optional[str]
    churn_type: str  # voluntary, involuntary, downgrade
    lifetime_value: Decimal

3.2 Feature Engineering

class ChurnFeatureExtractor:
    def __init__(self):
        self.feature_store = FeatureStore()
    
    async def extract_features(self, customer_id: str, prediction_date: datetime) -> Dict:
        features = {}
        
        # Engagement features
        engagement = await self.extract_engagement_features(customer_id, prediction_date)
        features.update(engagement)
        
        # Subscription features
        subscription = await self.extract_subscription_features(customer_id, prediction_date)
        features.update(subscription)
        
        # Support features
        support = await self.extract_support_features(customer_id, prediction_date)
        features.update(support)
        
        # Payment features
        payment = await self.extract_payment_features(customer_id, prediction_date)
        features.update(payment)
        
        # Trend features
        trends = await self.extract_trend_features(customer_id, prediction_date)
        features.update(trends)
        
        return features
    
    async def extract_engagement_features(self, customer_id, prediction_date):
        # Get activity data for different time windows
        windows = [7, 14, 30, 60, 90]
        features = {}
        
        for window in windows:
            activities = await self.get_activities(
                customer_id, 
                prediction_date, 
                days=window
            )
            
            features[f'login_count_{window}d'] = len([a for a in activities if a.activity_type == 'login'])
            features[f'active_days_{window}d'] = len(set(a.timestamp.date() for a in activities))
            features[f'total_duration_{window}d'] = sum(a.duration_minutes for a in activities)
            features[f'avg_session_length_{window}d'] = (
                features[f'total_duration_{window}d'] / max(features[f'login_count_{window}d'], 1)
            )
        
        # Trend features
        features['login_trend'] = (
            features['login_count_7d'] / max(features['login_count_30d'] / 4, 1)
        )
        features['duration_trend'] = (
            features['total_duration_7d'] / max(features['total_duration_30d'] / 4, 1)
        )
        
        return features
    
    async def extract_subscription_features(self, customer_id, prediction_date):
        customer = await self.get_customer(customer_id)
        
        subscription_days = (prediction_date - customer.subscription_start).days
        
        return {
            'subscription_age_days': subscription_days,
            'subscription_age_months': subscription_days / 30,
            'monthly_revenue': float(customer.monthly_revenue),
            'is_annual_plan': customer.billing_cycle == 'annual',
            'payment_method_encoded': self.encode_payment_method(customer.payment_method),
            'tenure_group': self.get_tenure_group(subscription_days)
        }

⚠️

Critical Feature Engineering Considerations:

Temporal features: Use sliding windows for engagement
Trend features: Capture changes in behavior over time
Recency features: Recent behavior is most predictive
Interaction features: Cross features between different data sources

4. Model Selection and Training

4.1 Multi-Model Architecture

class ChurnPredictionEnsemble:
    def __init__(self):
        self.models = {
            'gbdt': GradientBoostingModel(),
            'neural_net': NeuralNetworkModel(),
            'survival': SurvivalAnalysisModel()
        }
        self.meta_model = MetaLearner()
    
    async def predict(self, features: Dict) -> Dict:
        predictions = {}
        
        for name, model in self.models.items():
            pred = await model.predict(features)
            predictions[name] = pred
        
        # Meta-learner combines predictions
        meta_features = np.array([predictions[name] for name in predictions]).reshape(1, -1)
        final_prob = self.meta_model.predict(meta_features)[0][0]
        
        return {
            'churn_probability': float(final_prob),
            'component_predictions': predictions,
            'risk_level': self.get_risk_level(final_prob),
            'time_to_churn': await self.predict_time_to_churn(features)
        }

class SurvivalAnalysisModel:
    """Predict time until churn using survival analysis"""
    
    def __init__(self):
        self.model = CoxPHFitter()
    
    async def predict_time_to_churn(self, features):
        # Fit survival model
        self.model.fit(features, duration_col='tenure', event_col='churned')
        
        # Predict median survival time
        median_survival = self.model.predict_median(features)
        
        return {
            'median_time_to_churn_days': median_survival,
            'survival_function': self.model.predict_survival_function(features)
        }

4.2 Handling Class Imbalance

class ChurnImbalanceHandler:
    def __init__(self):
        pass
    
    def focal_loss(self, y_true, y_pred, alpha=0.25, gamma=2.0):
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1 - 1e-7)
        bce = -y_true * tf.math.log(y_pred) - (1 - y_true) * tf.math.log(1 - y_pred)
        p_t = y_true * y_pred + (1 - y_true) * (1 - y_pred)
        alpha_t = y_true * alpha + (1 - y_true) * (1 - alpha)
        focal_weight = alpha_t * tf.pow(1 - p_t, gamma)
        return focal_weight * bce
    
    def cost_sensitive_loss(self, y_true, y_pred, churn_cost=100, retention_cost=10):
        # Cost of missing a churner vs cost of unnecessary retention
        cost_matrix = np.array([
            [0, retention_cost],  # True negative, False positive
            [churn_cost, 0]      # False negative, True positive
        ])
        
        # Compute loss
        y_pred_classes = tf.cast(y_pred > 0.5, tf.float32)
        confusion = tf.math.confusion_matrix(y_true, y_pred_classes)
        
        return tf.reduce_sum(confusion * cost_matrix)

ℹ️

Training Strategy:

Use focal loss or cost-sensitive learning
Combine multiple data sources
Use survival analysis for time-to-churn prediction
Regular retraining with fresh data

5. Serving Architecture

5.1 Real-time Scoring Pipeline

Architecture Diagram

Customer Event → Feature Computation → Model Inference → Risk Update → Action Trigger
      (< 5ms)         (< 20ms)           (< 50ms)         (< 10ms)        (< 5ms)

5.2 Batch Scoring

class BatchScoringPipeline:
    def __init__(self):
        self.spark = SparkSession.builder \
            .appName("ChurnBatchScoring") \
            .getOrCreate()
    
    async def score_all_customers(self):
        # Read customer data
        customers = self.spark.read.parquet("s3://customers/")
        
        # Compute features
        features = self.compute_features_batch(customers)
        
        # Score customers
        predictions = self.model.predict_batch(features)
        
        # Write predictions
        predictions.write.mode("overwrite").parquet("s3://churn-predictions/")
        
        # Trigger interventions for high-risk customers
        high_risk = predictions.filter(col('churn_probability') > 0.7)
        await self.trigger_interventions(high_risk)

5.3 Intervention Engine

class InterventionEngine:
    def __init__(self):
        self.intervention_strategies = {
            'high_risk': ['personal_offer', 'success_call', 'feature_highlight'],
            'medium_risk': ['email_campaign', 'in_app_message'],
            'low_risk': ['newsletter', 'product_update']
        }
    
    async def select_intervention(self, customer_id, risk_level, churn_reasons):
        # Select intervention based on risk and root cause
        if churn_reason == 'price_sensitivity':
            return 'discount_offer'
        elif churn_reason == 'low_engagement':
            return 'feature_highlight'
        elif churn_reason == 'competitor_switch':
            return 'loyalty_reward'
        else:
            return self.intervention_strategies[risk_level][0]
    
    async def trigger_intervention(self, customer_id, intervention):
        # Log intervention
        await self.log_intervention(customer_id, intervention)
        
        # Execute intervention
        if intervention == 'discount_offer':
            await self.send_discount_offer(customer_id)
        elif intervention == 'success_call':
            await self.schedule_success_call(customer_id)
        
        # Track outcome
        await self.track_intervention_outcome(customer_id, intervention)

💡

Intervention Tips:

Time interventions appropriately (not too early, not too late)
Personalize interventions based on churn reasons
Track intervention effectiveness
Avoid over-communication

6. Monitoring and Observability

6.1 Key Metrics

class ChurnMetrics:
    MODEL_METRICS = ['auc_roc', 'precision_at_k', 'recall_at_k', 'calibration_error']
    BUSINESS_METRICS = ['churn_rate', 'retention_rate', 'intervention_success_rate', 'ltv']
    OPERATIONAL_METRICS = ['prediction_latency', 'throughput', 'feature_freshness']
    FAIRNESS_METRICS = ['demographic_parity', 'equal_opportunity']

7. Scale Considerations and Trade-offs

7.1 Horizontal Scaling

Architecture Diagram

Customer Data: Shard by customer_id
Feature Computation: Distributed processing with Spark
Model Serving: Horizontal scaling with load balancing
Intervention Engine: Async processing with message queue

7.2 Cost vs Performance Trade-offs

Dimension	Option A (Cost Optimized)	Option B (Performance Optimized)
Model Complexity	Simple GBDT	Deep ensemble
Feature Freshness	Daily batch	Real-time streaming
Scoring Frequency	Weekly	Daily
Intervention	Automated only	Human + automated

8. Advanced Topics

8.1 Causal Inference for Churn

class CausalChurnAnalyzer:
    def __init__(self):
        self.uplift_model = UpliftModel()
    
    async def estimate_treatment_effect(self, customer_id, intervention):
        # Estimate causal effect of intervention
        uplift = self.uplift_model.predict(customer_id, intervention)
        
        return {
            'uplift_score': uplift,
            'expected_incremental_retention': uplift,
            'confidence_interval': self.compute_confidence(uplift)
        }

8.2 Explainable Predictions

class ChurnExplainer:
    def __init__(self):
        self.shap_explainer = shap.TreeExplainer(self.model)
    
    async def explain_prediction(self, customer_id, features):
        shap_values = self.shap_explainer.shap_values(features)
        
        # Get top factors
        feature_importance = list(zip(self.feature_names, shap_values[0]))
        feature_importance.sort(key=lambda x: abs(x[1]), reverse=True)
        
        # Generate narrative
        narrative = self.generate_narrative(feature_importance[:5])
        
        return {
            'top_factors': feature_importance[:5],
            'narrative': narrative,
            'recommended_actions': self.get_recommended_actions(feature_importance[:3])
        }

9. Implementation Roadmap

Phase 1: Basic Model (Weeks 1-4)

Feature engineering pipeline
Basic GBDT model
Batch scoring pipeline

Phase 2: Advanced Models (Weeks 5-8)

Survival analysis
Meta-learner ensemble
Real-time scoring

Phase 3: Intervention System (Weeks 9-12)

Intervention engine
A/B testing framework
Explainability

Phase 4: Optimization (Weeks 13-16)

Causal inference
Cost optimization
Advanced monitoring

10. Summary and Key Takeaways

Architecture Recap

Feature engineering: Engagement, subscription, support, payment features
Multi-model ensemble: GBDT + Neural Network + Survival Analysis
Intervention engine: Automated retention actions
Explainability: Understand why customers churn

Key Metrics

AUC: > 0.8
Precision at top 10%: > 70%
Intervention success rate: > 20%

Common Interview Mistakes

Not discussing class imbalance
Ignoring time-to-churn prediction
Forgetting about intervention strategies
Not considering explainability

ℹ️

Final Interview Tip: Emphasize the business impact of churn prediction. Discuss how you'd identify actionable insights and trigger effective interventions. Show understanding of both ML techniques and retention strategies.

Design Customer Churn Prediction System

Design Customer Churn Prediction System

Interview Question

1. Requirements Gathering

Functional Requirements

Non-Functional Requirements

2. High-Level Architecture Overview

3. Data Pipeline Design

3.1 Customer Data Model

3.2 Feature Engineering

4. Model Selection and Training

4.1 Multi-Model Architecture

4.2 Handling Class Imbalance

5. Serving Architecture

5.1 Real-time Scoring Pipeline

5.2 Batch Scoring

5.3 Intervention Engine

6. Monitoring and Observability

6.1 Key Metrics

7. Scale Considerations and Trade-offs

7.1 Horizontal Scaling

7.2 Cost vs Performance Trade-offs

8. Advanced Topics

8.1 Causal Inference for Churn

8.2 Explainable Predictions

9. Implementation Roadmap

Phase 1: Basic Model (Weeks 1-4)

Phase 2: Advanced Models (Weeks 5-8)

Phase 3: Intervention System (Weeks 9-12)

Phase 4: Optimization (Weeks 13-16)

10. Summary and Key Takeaways

Architecture Recap

Key Metrics

Common Interview Mistakes

Further Reading