The Interview Question
"You need to build a recommendation system that processes 100M user events per hour. How would you design the data pipeline?"
This question tests whether you can design systems that work at scale β not just in a notebook, but in production with real constraints.
Why Companies Ask This
βΉοΈ
Amazon and Meta operate at scales most companies can't imagine. They need data scientists who understand distributed systems, can work with big data tools, and design pipelines that are reliable, efficient, and cost-effective.
Interviewers evaluate:
- System Design β Can you architect scalable solutions?
- Tool Knowledge β Do you understand distributed data tools?
- Trade-off Awareness β Can you balance latency, cost, and complexity?
- Production Mindset β Do you think about monitoring, failures, and recovery?
- Cost Consciousness β Do you optimize for both performance and cost?
The Pipeline Design Framework
Step 1: Understand Requirements
requirements_gathering = {
'data_volume': 'How much data per day/hour/minute?',
'latency': 'How quickly do we need results? (real-time, near-real-time, batch)',
'freshness': 'How stale can the data be?',
'reliability': 'What\'s the acceptable downtime?',
'cost': 'What\'s the budget for infrastructure?',
'complexity': 'What\'s the team\'s technical capability?',
}
Step 2: Choose Processing Paradigm
processing_paradigms = {
'batch': {
'description': 'Process data in large batches (hourly, daily)',
'tools': ['Apache Spark', 'Hive', 'BigQuery', 'Redshift'],
'latency': 'Minutes to hours',
'use_cases': [
'Daily reports',
'Model training',
'Historical analysis',
'Cost-sensitive workloads',
],
'pros': ['Efficient for large volumes', 'Easier to debug', 'Lower cost'],
'cons': ['Higher latency', 'Not suitable for real-time'],
},
'streaming': {
'description': 'Process data as it arrives',
'tools': ['Apache Kafka', 'Apache Flink', 'Spark Streaming', 'Pub/Sub'],
'latency': 'Milliseconds to seconds',
'use_cases': [
'Real-time recommendations',
'Fraud detection',
'Live dashboards',
'Alerting systems',
],
'pros': ['Low latency', 'Up-to-date results', 'Event-driven'],
'cons': ['Higher complexity', 'Higher cost', 'Harder to debug'],
},
'micro_batch': {
'description': 'Process data in small batches (every few seconds)',
'tools': ['Spark Structured Streaming', 'Kafka Streams'],
'latency': 'Seconds to minutes',
'use_cases': [
'Near-real-time analytics',
'Aggregations',
'Feature engineering',
],
'pros': ['Balance of latency and efficiency', 'Easier than pure streaming'],
'cons': ['Not truly real-time', 'Batch boundary issues'],
},
'lambda_architecture': {
'description': 'Combines batch and streaming for completeness',
'tools': ['Spark + Kafka + Cassandra'],
'latency': 'Both batch and real-time',
'use_cases': [
'Systems needing both historical and real-time views',
'Complex event processing',
],
'pros': ['Complete data view', 'Fault-tolerant'],
'cons': ['Complex to maintain', 'Two codebases'],
},
}
Example: Recommendation System Pipeline
Architecture
recommendation_pipeline = {
'data_sources': {
'user_events': {
'volume': '100M events/hour',
'format': 'JSON',
'source': 'Kafka topic',
},
'user_profiles': {
'volume': '500M users',
'format': 'Parquet',
'source': 'Data warehouse',
},
'item_catalog': {
'volume': '100M items',
'format': 'Protobuf',
'source': 'Service API',
},
},
'ingestion_layer': {
'tool': 'Apache Kafka',
'partitions': 100,
'replication_factor': 3,
'retention': '7 days',
},
'processing_layer': {
'real_time': {
'tool': 'Apache Flink',
'purpose': 'Real-time feature computation',
'features': [
'User\'s last 10 interactions',
'Session-level aggregations',
'Trending items',
],
},
'batch': {
'tool': 'Apache Spark',
'purpose': 'Model training and batch features',
'features': [
'User historical preferences',
'Item similarity scores',
'Collaborative filtering embeddings',
],
},
},
'storage_layer': {
'feature_store': {
'tool': 'Redis or Cassandra',
'purpose': 'Low-latency feature serving',
'latency': '< 10ms',
},
'data_warehouse': {
'tool': 'BigQuery or Redshift',
'purpose': 'Historical data and batch analytics',
},
'model_registry': {
'tool': 'MLflow or SageMaker',
'purpose': 'Model versioning and deployment',
},
},
'serving_layer': {
'tool': 'TensorFlow Serving or custom API',
'latency': '< 100ms for recommendations',
'scaling': 'Auto-scaling based on traffic',
},
}
Implementation
# Real-time feature computation with Flink (pseudocode)
class RealTimeFeatureEngine:
"""
Compute real-time features from user event stream.
"""
def __init__(self):
self.feature_store = RedisFeatureStore()
self.window_size = 3600 # 1 hour
def process_event(self, event):
"""
Process a single user event and update features.
"""
user_id = event['user_id']
item_id = event['item_id']
timestamp = event['timestamp']
# Update interaction history
self.feature_store.append_to_list(
f'user:{user_id}:recent_items',
item_id,
max_length=100,
expiry=self.window_size
)
# Update item interaction counts
self.feature_store.increment(
f'item:{item_id}:interaction_count',
expiry=self.window_size
)
# Compute session features
session_key = f'user:{user_id}:session'
self.feature_store.increment(session_key, expiry=3600)
# Compute trending score
self.update_trending_score(item_id, timestamp)
def get_features(self, user_id, item_id):
"""
Retrieve features for model serving.
"""
features = {
'user_recent_items': self.feature_store.get_list(
f'user:{user_id}:recent_items'
),
'item_interaction_count': self.feature_store.get(
f'item:{item_id}:interaction_count'
),
'user_item_similarity': self.compute_similarity(
user_id, item_id
),
'trending_score': self.feature_store.get(
f'item:{item_id}:trending_score'
),
}
return features
# Batch feature computation with Spark
def compute_batch_features(spark, user_events_df, item_catalog_df):
"""
Compute batch features for model training.
"""
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
# User-level aggregations
user_features = user_events_df.groupBy('user_id').agg(
F.count('*').alias('total_interactions'),
F.countDistinct('item_id').alias('unique_items'),
F.avg('rating').alias('avg_rating'),
F.stddev('rating').alias('rating_std'),
F.max('timestamp').alias('last_interaction'),
F.expr('percentile_approx(rating, 0.5)').alias('median_rating'),
)
# Item-level aggregations
item_features = user_events_df.groupBy('item_id').agg(
F.count('*').alias('total_interactions'),
F.countDistinct('user_id').alias('unique_users'),
F.avg('rating').alias('avg_rating'),
F.expr('percentile_approx(rating, 0.5)').alias('median_rating'),
)
# User-item interaction matrix for collaborative filtering
interaction_matrix = user_events_df.groupBy('user_id', 'item_id').agg(
F.count('*').alias('interaction_count'),
F.avg('rating').alias('avg_rating'),
)
return user_features, item_features, interaction_matrix
Real-time vs Batch Decision Framework
def choose_processing_paradigm(requirements):
"""
Help choose between real-time and batch processing.
"""
decision_factors = {
'latency_requirement': {
'< 1 second': 'streaming',
'1-60 seconds': 'micro_batch',
'1-60 minutes': 'batch',
'> 1 hour': 'batch',
},
'data_volume': {
'< 1GB/hour': 'micro_batch',
'1-100GB/hour': 'streaming or micro_batch',
'> 100GB/hour': 'streaming with careful design',
},
'cost_budget': {
'low': 'batch',
'medium': 'micro_batch',
'high': 'streaming',
},
'team_expertise': {
'batch_only': 'batch',
'some_streaming': 'micro_batch',
'expert': 'streaming',
},
}
recommendations = []
for factor, value in requirements.items():
if factor in decision_factors and value in decision_factors[factor]:
recommendations.append({
'factor': factor,
'recommendation': decision_factors[factor][value],
})
# Count recommendations
from collections import Counter
paradigm_counts = Counter(r['recommendation'] for r in recommendations)
return {
'primary_recommendation': paradigm_counts.most_common(1)[0][0],
'all_recommendations': recommendations,
}
Common Pipeline Patterns
1. Event-Driven Architecture
event_driven_architecture = {
'pattern': 'Events trigger downstream processing',
'example': 'User clicks item β Update features β Refresh recommendations',
'tools': ['Kafka', 'EventBridge', 'Pub/Sub'],
'pros': ['Loose coupling', 'Scalable', 'Auditable'],
'cons': ['Eventual consistency', 'Complex debugging'],
}
2. CDC (Change Data Capture)
cdc_pattern = {
'pattern': 'Capture database changes in real-time',
'example': 'User updates profile β Capture change β Update feature store',
'tools': ['Debezium', 'AWS DMS', 'Maxwell'],
'pros': ['Real-time sync', 'No impact on source DB'],
'cons': ['Schema evolution challenges', 'Ordering guarantees'],
}
3. Materialized Views
materialized_views = {
'pattern': 'Pre-compute aggregations for fast queries',
'example': 'Pre-compute daily active users by country',
'tools': ['BigQuery Materialized Views', 'Spark + Parquet', 'ClickHouse'],
'pros': ['Fast queries', 'Reduced compute cost'],
'cons': ['Staleness', 'Storage cost'],
}
Amazon-Specific Pipeline Considerations
The "Two-Pizza Team" Pattern
amazon_pipeline_design = {
'principle': 'Keep teams small and services decoupled',
'application': 'Each pipeline component should be owned by a small team',
'benefit': 'Independent scaling, deployment, and failure isolation',
}
AWS Services for Data Pipelines
aws_services = {
'ingestion': ['Kinesis', 'MSK (Managed Kafka)', 'DynamoDB Streams'],
'processing': ['EMR (Spark)', 'Glue', 'Lambda', 'Flink on KDA'],
'storage': ['S3', 'DynamoDB', 'Redshift', 'Athena'],
'orchestration': ['Step Functions', 'MWAA (Airflow)'],
'monitoring': ['CloudWatch', 'X-Ray'],
}
Meta-Specific Pipeline Considerations
Scale and Speed
meta_pipeline_scale = {
'daily_data': 'Petabytes',
'events_per_second': 'Millions',
'models_trained': 'Thousands per day',
'serving_requests': 'Billions per day',
}
The "Move Fast" Pipeline Philosophy
meta_philosophy = {
'principle': 'Ship quickly, iterate based on data',
'application': 'Build pipelines that can be modified quickly',
'trade_off': 'May sacrifice some efficiency for flexibility',
}
Common Mistakes to Avoid
β οΈ
These mistakes can cause pipeline failures at scale:
- Not planning for failure β Networks fail, disks crash, services go down
- Ignoring data quality β Bad data propagates through pipelines
- Over-engineering β Start simple, add complexity as needed
- Not monitoring β You can't fix what you can't see
- Ignoring cost β Cloud costs can spiral without optimization
- Not testing β Test with production-like data volumes