Staff-Level DE System Design: End-to-End Platforms
Difficulty: Staff Level | Companies: Netflix, Uber, Airbnb, Stripe, Databricks
1. Design Framework
Architecture Diagram
Staff-Level Design Process:
├── Requirements (functional + non-functional)
├── Architecture (high-level components)
├── Deep Dives (bottlenecks, scaling)
├── Trade-offs (cost vs. reliability vs. latency)
├── Operational Excellence (monitoring, on-call)
└── Organizational Impact (team structure)
2. Design: Real-Time Analytics Platform
Requirements: 1M events/sec ingest, < 1 min data freshness, 99.99% availability
Key Design Decisions
| Decision | Choice | Trade-off |
|---|---|---|
| Streaming engine | Flink (stateful) | More complex to operate than Spark, but lower latency with exactly-once state via checkpointing |
| Storage format | Delta Lake | Better upserts, slightly slower than raw Parquet |
| Online store | Redis | Fast but limited storage |
| Batch processing | Spark | Best ecosystem, higher latency |
| Warehouse | Snowflake | Expensive but easy for analysts |
Capacity Planning
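import math

class CapacityPlanner:
    """Back-of-envelope sizing for the streaming tier."""

    def plan_kafka_cluster(self, events_per_sec, retention_days):
        events_per_day = events_per_sec * 86400
        total_events = events_per_day * retention_days
        avg_event_size_kb = 2
        replication_factor = 3
        # KB -> TB is three divisions by 1024; this is a single copy of the data.
        total_storage_tb = (total_events * avg_event_size_kb) / (1024 * 1024 * 1024)
        # Brokers store every replica, so size them on the replicated volume.
        replicated_storage_tb = total_storage_tb * replication_factor
        return {
            "partitions": math.ceil(events_per_sec / 1000) * 10,  # 10 partitions per 1K events/sec
            "brokers": max(3, math.ceil(replicated_storage_tb / 4)),  # 4TB per broker
            "replication_factor": replication_factor,
            "total_storage_tb": total_storage_tb,
        }

    def plan_flink_cluster(self, events_per_sec, state_size_gb):
        # state_size_gb is accepted but not used for sizing here; with RocksDB,
        # state lives on local disk, so it mainly drives disk and checkpoint storage.
        return {
            "taskmanagers": max(3, math.ceil(events_per_sec / 10000)),  # 1 per 10K events/sec
            "taskmanager_memory": "8GB",
            "checkpoint_interval_ms": 60000,
            "state_backend": "rocksdb",
        }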
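The planner above sizes brokers on storage. At 1M events/sec, network and disk write throughput can become the binding constraint first, so it is worth checking both. The sketch below reuses the 2 KB average event size and replication factor of 3 from the planner; the function name and the 100 MB/s sustained write ceiling per broker are hypothetical placeholders to be replaced with measured numbers, not Kafka constants.

```python
import math


def kafka_throughput_estimate(events_per_sec, avg_event_size_kb=2.0,
                              replication_factor=3,
                              broker_write_mb_per_sec=100.0):
    """Estimate brokers needed to absorb write throughput (not storage)."""
    # Producer ingress before replication, in MB/s.
    ingress_mb_per_sec = events_per_sec * avg_event_size_kb / 1024
    # Every replica is written to disk somewhere in the cluster.
    replicated_mb_per_sec = ingress_mb_per_sec * replication_factor
    # broker_write_mb_per_sec is an assumed ceiling; benchmark your hardware.
    brokers = max(3, math.ceil(replicated_mb_per_sec / broker_write_mb_per_sec))
    return {
        "ingress_mb_per_sec": ingress_mb_per_sec,
        "replicated_mb_per_sec": replicated_mb_per_sec,
        "brokers_for_throughput": brokers,
    }


estimate = kafka_throughput_estimate(1_000_000)
```

A reasonable rule is to take the larger of the storage-based and throughput-based broker counts, then add headroom for broker failures and rebalancing.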
3. Reliability Engineering
class ReliabilityDesign:
    def __init__(self):
        self.sla_targets = {
            "availability": 99.99,  # ~52.6 min downtime/year
            "freshness_minutes": 5,
            "recovery_time_minutes": 15,
        }

    def design_for_availability(self):
        return {
            "kafka": {"replication": 3, "min_insync_replicas": 2, "acks": "all"},
            "flink": {"checkpointing": True, "savepoints": True, "state_backend": "rocksdb"},
            "storage": {"replication": 3, "cross_region": True},
            "compute": {"auto_scaling": True, "multi_az": True},
        }

    def incident_response_plan(self):
        return {
            "detection": "automated_alerting (< 5 min)",
            "triage": "runbook + on-call (< 15 min)",
            "mitigation": "failover / throttle (< 30 min)",
            "resolution": "root_cause_fix (< 4 hours)",
            "post_mortem": "within 48 hours",
        }
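The availability target is easier to reason about as a downtime budget, and it applies to the whole pipeline rather than to each component. A minimal sketch of that arithmetic follows, assuming the four components from design_for_availability (kafka, flink, storage, compute) all sit on the critical path and fail independently. The helper names and the per-component figure of 99.99% are illustrative assumptions.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_budget_minutes(availability_pct):
    """Yearly downtime allowed by an availability target given in percent."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def serial_availability(component_pcts):
    """Availability of a chain in which every component must be up.

    Assumes independent failures, which is optimistic for components
    that share a region, network, or deploy pipeline.
    """
    result = 1.0
    for pct in component_pcts:
        result *= pct / 100
    return result * 100


budget = downtime_budget_minutes(99.99)
pipeline = serial_availability([99.99, 99.99, 99.99, 99.99])
```

Under these assumptions, four components at 99.99% each yield roughly 99.96% end to end, so hitting 99.99% for the platform requires either higher per-component targets or redundancy that takes components off the single critical path.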
Best Practice: Design for failure. Every component will fail; the question is how gracefully. Use circuit breakers, retries with backoff, and graceful degradation.
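The three techniques named above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not a production library: retry_with_backoff, CircuitBreaker, and flaky_read are hypothetical names, the thresholds are arbitrary, and the breaker is not thread-safe. The fallback callable is where graceful degradation happens, for example serving a cached or stale value.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay_s=0.1,
                       max_delay_s=5.0, sleep=time.sleep):
    """Call operation(), retrying failures with capped exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(random.uniform(0, delay))  # full jitter avoids synchronized retries


class CircuitBreaker:
    """Stop calling a failing dependency and serve a fallback instead."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                return fallback()  # open: degrade without touching the dependency
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        self.failures = 0
        self.opened_at = None
        return result


# Demo: a read that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_read():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "fresh"


# sleep is stubbed out so the demo does not actually wait.
retry_result = retry_with_backoff(flaky_read, sleep=lambda _s: None)
```

Retries suit short transient faults; the breaker covers sustained outages, where continued retries would only add load to a dependency that is already struggling.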
Follow-Up Questions
- Design a data platform that processes 10B events/day with 99.99% availability.
- How would you design a multi-region data platform?
- Design a data platform for a company that acquired 5 other companies.
- How would you handle disaster recovery for a petabyte-scale data lake?
- Design an organizational structure for a 50-person data engineering team.
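For the first question, a useful opening move is to convert the daily volume into a per-second rate before choosing components. The sketch below does that conversion; the helper name and the peak-to-average factor of 3 are hypothetical, since real traffic shape has to come from measurement.

```python
SECONDS_PER_DAY = 86_400


def daily_to_per_second(events_per_day, peak_to_average=3.0):
    """Convert a daily event volume into average and assumed-peak rates."""
    average = events_per_day / SECONDS_PER_DAY
    # peak_to_average is an assumed multiplier, not a measured value.
    return {"average_eps": average, "peak_eps": average * peak_to_average}


rates = daily_to_per_second(10_000_000_000)
```

Ten billion events per day averages out to roughly 116K events/sec, so capacity should be sized against the peak rate rather than the average.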