πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Scaling Pipelines: Handling Petabytes, Real-time vs Batch

Data Scientist Role InterviewScaling Data Pipelines & Infrastructure⭐ Premium

Advertisement

⚑

Asked at Amazon & Meta

Scaling Pipelines

Handling Petabytes, Real-time vs Batch

The Interview Question

"You need to build a recommendation system that processes 100M user events per hour. How would you design the data pipeline?"

This question tests whether you can design systems that work at scale β€” not just in a notebook, but in production with real constraints.


Why Companies Ask This

ℹ️

Amazon and Meta operate at scales most companies can't imagine. They need data scientists who understand distributed systems, can work with big data tools, and design pipelines that are reliable, efficient, and cost-effective.

Interviewers evaluate:

  1. System Design β€” Can you architect scalable solutions?
  2. Tool Knowledge β€” Do you understand distributed data tools?
  3. Trade-off Awareness β€” Can you balance latency, cost, and complexity?
  4. Production Mindset β€” Do you think about monitoring, failures, and recovery?
  5. Cost Consciousness β€” Do you optimize for both performance and cost?

The Pipeline Design Framework

Step 1: Understand Requirements

requirements_gathering = {
    'data_volume': 'How much data per day/hour/minute?',
    'latency': 'How quickly do we need results? (real-time, near-real-time, batch)',
    'freshness': 'How stale can the data be?',
    'reliability': 'What\'s the acceptable downtime?',
    'cost': 'What\'s the budget for infrastructure?',
    'complexity': 'What\'s the team\'s technical capability?',
}

Step 2: Choose Processing Paradigm

processing_paradigms = {
    'batch': {
        'description': 'Process data in large batches (hourly, daily)',
        'tools': ['Apache Spark', 'Hive', 'BigQuery', 'Redshift'],
        'latency': 'Minutes to hours',
        'use_cases': [
            'Daily reports',
            'Model training',
            'Historical analysis',
            'Cost-sensitive workloads',
        ],
        'pros': ['Efficient for large volumes', 'Easier to debug', 'Lower cost'],
        'cons': ['Higher latency', 'Not suitable for real-time'],
    },
    'streaming': {
        'description': 'Process data as it arrives',
        'tools': ['Apache Kafka', 'Apache Flink', 'Spark Streaming', 'Pub/Sub'],
        'latency': 'Milliseconds to seconds',
        'use_cases': [
            'Real-time recommendations',
            'Fraud detection',
            'Live dashboards',
            'Alerting systems',
        ],
        'pros': ['Low latency', 'Up-to-date results', 'Event-driven'],
        'cons': ['Higher complexity', 'Higher cost', 'Harder to debug'],
    },
    'micro_batch': {
        'description': 'Process data in small batches (every few seconds)',
        'tools': ['Spark Structured Streaming', 'Kafka Streams'],
        'latency': 'Seconds to minutes',
        'use_cases': [
            'Near-real-time analytics',
            'Aggregations',
            'Feature engineering',
        ],
        'pros': ['Balance of latency and efficiency', 'Easier than pure streaming'],
        'cons': ['Not truly real-time', 'Batch boundary issues'],
    },
    'lambda_architecture': {
        'description': 'Combines batch and streaming for completeness',
        'tools': ['Spark + Kafka + Cassandra'],
        'latency': 'Both batch and real-time',
        'use_cases': [
            'Systems needing both historical and real-time views',
            'Complex event processing',
        ],
        'pros': ['Complete data view', 'Fault-tolerant'],
        'cons': ['Complex to maintain', 'Two codebases'],
    },
}

Example: Recommendation System Pipeline

Architecture

recommendation_pipeline = {
    'data_sources': {
        'user_events': {
            'volume': '100M events/hour',
            'format': 'JSON',
            'source': 'Kafka topic',
        },
        'user_profiles': {
            'volume': '500M users',
            'format': 'Parquet',
            'source': 'Data warehouse',
        },
        'item_catalog': {
            'volume': '100M items',
            'format': 'Protobuf',
            'source': 'Service API',
        },
    },
    
    'ingestion_layer': {
        'tool': 'Apache Kafka',
        'partitions': 100,
        'replication_factor': 3,
        'retention': '7 days',
    },
    
    'processing_layer': {
        'real_time': {
            'tool': 'Apache Flink',
            'purpose': 'Real-time feature computation',
            'features': [
                'User\'s last 10 interactions',
                'Session-level aggregations',
                'Trending items',
            ],
        },
        'batch': {
            'tool': 'Apache Spark',
            'purpose': 'Model training and batch features',
            'features': [
                'User historical preferences',
                'Item similarity scores',
                'Collaborative filtering embeddings',
            ],
        },
    },
    
    'storage_layer': {
        'feature_store': {
            'tool': 'Redis or Cassandra',
            'purpose': 'Low-latency feature serving',
            'latency': '< 10ms',
        },
        'data_warehouse': {
            'tool': 'BigQuery or Redshift',
            'purpose': 'Historical data and batch analytics',
        },
        'model_registry': {
            'tool': 'MLflow or SageMaker',
            'purpose': 'Model versioning and deployment',
        },
    },
    
    'serving_layer': {
        'tool': 'TensorFlow Serving or custom API',
        'latency': '< 100ms for recommendations',
        'scaling': 'Auto-scaling based on traffic',
    },
}

Implementation

# Real-time feature computation with Flink (pseudocode)
class RealTimeFeatureEngine:
    """
    Compute real-time features from user event stream.
    """
    
    def __init__(self):
        self.feature_store = RedisFeatureStore()
        self.window_size = 3600  # 1 hour
    
    def process_event(self, event):
        """
        Process a single user event and update features.
        """
        user_id = event['user_id']
        item_id = event['item_id']
        timestamp = event['timestamp']
        
        # Update interaction history
        self.feature_store.append_to_list(
            f'user:{user_id}:recent_items',
            item_id,
            max_length=100,
            expiry=self.window_size
        )
        
        # Update item interaction counts
        self.feature_store.increment(
            f'item:{item_id}:interaction_count',
            expiry=self.window_size
        )
        
        # Compute session features
        session_key = f'user:{user_id}:session'
        self.feature_store.increment(session_key, expiry=3600)
        
        # Compute trending score
        self.update_trending_score(item_id, timestamp)
    
    def get_features(self, user_id, item_id):
        """
        Retrieve features for model serving.
        """
        features = {
            'user_recent_items': self.feature_store.get_list(
                f'user:{user_id}:recent_items'
            ),
            'item_interaction_count': self.feature_store.get(
                f'item:{item_id}:interaction_count'
            ),
            'user_item_similarity': self.compute_similarity(
                user_id, item_id
            ),
            'trending_score': self.feature_store.get(
                f'item:{item_id}:trending_score'
            ),
        }
        return features

# Batch feature computation with Spark
def compute_batch_features(spark, user_events_df, item_catalog_df):
    """
    Compute batch features for model training.
    """
    from pyspark.sql import functions as F
    from pyspark.ml.feature import VectorAssembler
    
    # User-level aggregations
    user_features = user_events_df.groupBy('user_id').agg(
        F.count('*').alias('total_interactions'),
        F.countDistinct('item_id').alias('unique_items'),
        F.avg('rating').alias('avg_rating'),
        F.stddev('rating').alias('rating_std'),
        F.max('timestamp').alias('last_interaction'),
        F.expr('percentile_approx(rating, 0.5)').alias('median_rating'),
    )
    
    # Item-level aggregations
    item_features = user_events_df.groupBy('item_id').agg(
        F.count('*').alias('total_interactions'),
        F.countDistinct('user_id').alias('unique_users'),
        F.avg('rating').alias('avg_rating'),
        F.expr('percentile_approx(rating, 0.5)').alias('median_rating'),
    )
    
    # User-item interaction matrix for collaborative filtering
    interaction_matrix = user_events_df.groupBy('user_id', 'item_id').agg(
        F.count('*').alias('interaction_count'),
        F.avg('rating').alias('avg_rating'),
    )
    
    return user_features, item_features, interaction_matrix

Real-time vs Batch Decision Framework

def choose_processing_paradigm(requirements):
    """
    Help choose between real-time and batch processing.
    """
    decision_factors = {
        'latency_requirement': {
            '< 1 second': 'streaming',
            '1-60 seconds': 'micro_batch',
            '1-60 minutes': 'batch',
            '> 1 hour': 'batch',
        },
        'data_volume': {
            '< 1GB/hour': 'micro_batch',
            '1-100GB/hour': 'streaming or micro_batch',
            '> 100GB/hour': 'streaming with careful design',
        },
        'cost_budget': {
            'low': 'batch',
            'medium': 'micro_batch',
            'high': 'streaming',
        },
        'team_expertise': {
            'batch_only': 'batch',
            'some_streaming': 'micro_batch',
            'expert': 'streaming',
        },
    }
    
    recommendations = []
    
    for factor, value in requirements.items():
        if factor in decision_factors and value in decision_factors[factor]:
            recommendations.append({
                'factor': factor,
                'recommendation': decision_factors[factor][value],
            })
    
    # Count recommendations
    from collections import Counter
    paradigm_counts = Counter(r['recommendation'] for r in recommendations)
    
    return {
        'primary_recommendation': paradigm_counts.most_common(1)[0][0],
        'all_recommendations': recommendations,
    }

Common Pipeline Patterns

1. Event-Driven Architecture

event_driven_architecture = {
    'pattern': 'Events trigger downstream processing',
    'example': 'User clicks item β†’ Update features β†’ Refresh recommendations',
    'tools': ['Kafka', 'EventBridge', 'Pub/Sub'],
    'pros': ['Loose coupling', 'Scalable', 'Auditable'],
    'cons': ['Eventual consistency', 'Complex debugging'],
}

2. CDC (Change Data Capture)

cdc_pattern = {
    'pattern': 'Capture database changes in real-time',
    'example': 'User updates profile β†’ Capture change β†’ Update feature store',
    'tools': ['Debezium', 'AWS DMS', 'Maxwell'],
    'pros': ['Real-time sync', 'No impact on source DB'],
    'cons': ['Schema evolution challenges', 'Ordering guarantees'],
}

3. Materialized Views

materialized_views = {
    'pattern': 'Pre-compute aggregations for fast queries',
    'example': 'Pre-compute daily active users by country',
    'tools': ['BigQuery Materialized Views', 'Spark + Parquet', 'ClickHouse'],
    'pros': ['Fast queries', 'Reduced compute cost'],
    'cons': ['Staleness', 'Storage cost'],
}

Amazon-Specific Pipeline Considerations

The "Two-Pizza Team" Pattern

amazon_pipeline_design = {
    'principle': 'Keep teams small and services decoupled',
    'application': 'Each pipeline component should be owned by a small team',
    'benefit': 'Independent scaling, deployment, and failure isolation',
}

AWS Services for Data Pipelines

aws_services = {
    'ingestion': ['Kinesis', 'MSK (Managed Kafka)', 'DynamoDB Streams'],
    'processing': ['EMR (Spark)', 'Glue', 'Lambda', 'Flink on KDA'],
    'storage': ['S3', 'DynamoDB', 'Redshift', 'Athena'],
    'orchestration': ['Step Functions', 'MWAA (Airflow)'],
    'monitoring': ['CloudWatch', 'X-Ray'],
}

Meta-Specific Pipeline Considerations

Scale and Speed

meta_pipeline_scale = {
    'daily_data': 'Petabytes',
    'events_per_second': 'Millions',
    'models_trained': 'Thousands per day',
    'serving_requests': 'Billions per day',
}

The "Move Fast" Pipeline Philosophy

meta_philosophy = {
    'principle': 'Ship quickly, iterate based on data',
    'application': 'Build pipelines that can be modified quickly',
    'trade_off': 'May sacrifice some efficiency for flexibility',
}

Common Mistakes to Avoid

⚠️

These mistakes can cause pipeline failures at scale:

  1. Not planning for failure β€” Networks fail, disks crash, services go down
  2. Ignoring data quality β€” Bad data propagates through pipelines
  3. Over-engineering β€” Start simple, add complexity as needed
  4. Not monitoring β€” You can't fix what you can't see
  5. Ignoring cost β€” Cloud costs can spiral without optimization
  6. Not testing β€” Test with production-like data volumes

Quiz: Test Your Understanding


Related Topics

Advertisement