Design a Data Platform: End-to-End Architecture
System design for scalable data platforms
Interview Question
"Design a data platform for a ride-sharing company that processes 10TB of data daily. The platform must: (1) ingest real-time and batch data, (2) store data cost-effectively, (3) support ML model training, (4) provide BI dashboards, (5) handle 1000+ concurrent users. Include architecture, technology choices, and scaling strategies."
Difficulty: Hard | Frequently asked at Google, Netflix, Uber, Amazon
Theoretical Foundation
System Design Framework
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β System Design Framework β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β 1. Requirements Gathering β
β - Functional requirements β
β - Non-functional requirements β
β - Constraints β
β β
β 2. High-Level Design β
β - Architecture components β
β - Data flow β
β - Technology choices β
β β
β 3. Detailed Design β
β - Component deep dives β
β - Data models β
β - APIs β
β β
β 4. Scaling Strategies β
β - Horizontal scaling β
β - Vertical scaling β
β - Caching β
β β
β 5. Tradeoffs β
β - Cost vs performance β
β - Consistency vs availability β
β - Complexity vs simplicity β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Requirements Analysis
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Requirements Analysis β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Functional: β
β - Ingest 10TB/day (real-time + batch) β
β - Store data for 3 years β
β - Support ML model training β
β - Provide BI dashboards β
β - Handle 1000+ concurrent users β
β β
β Non-Functional: β
β - Availability: 99.9% β
β - Latency: < 5 seconds for dashboards β
β - Throughput: 100K queries/hour β
β - Data durability: 99.999999999% β
β β
β Constraints: β
β - Budget: $500K/year β
β - Team: 10 data engineers β
β - Timeline: 6 months β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
High-Level Architecture
Data Flow
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Data Flow β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Real-Time Path: β
β Mobile App βββΆ Kafka βββΆ Flink βββΆ Redis βββΆ API β
β (ride events) (ingest) (process) (serve) (recommend) β
β β
β Batch Path: β
β Database βββΆ Airflow βββΆ Spark βββΆ Delta Lake βββΆ dbt β
β (daily ETL) (orchestrate) (transform) (store) (model) β
β β
β ML Path: β
β Delta Lake βββΆ Spark βββΆ Feature Store βββΆ MLflow βββΆ API β
β (training data) (process) (features) (train) (serve) β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Code Implementation
Ingestion Layer
# ============================================================
# INGESTION LAYER
# ============================================================
from kafka import KafkaProducer
from pyspark.sql import SparkSession
import json
# Real-time ingestion
producer = KafkaProducer(
bootstrap_servers='kafka:9092',
value_serializer=lambda v: json.dumps(v).encode('utf-8'),
acks='all',
enable_idempotence=True
)
def ingest_ride_event(event):
"""Ingest ride event into Kafka"""
producer.send('ride_events', value=event)
# Batch ingestion
spark = SparkSession.builder \
.appName("BatchIngestion") \
.getOrCreate()
def ingest_batch_data(source_path, target_path):
"""Ingest batch data to Delta Lake"""
df = spark.read.parquet(source_path)
df.write \
.format("delta") \
.mode("append") \
.partitionBy("date") \
.save(target_path)
Storage Layer
# ============================================================
# STORAGE LAYER (LAKEHOUSE)
# ============================================================
# Bronze Layer: Raw data
def write_bronze(df, table_name):
"""Write raw data to Bronze layer"""
df.write \
.format("delta") \
.mode("append") \
.save(f"s3://lakehouse/bronze/{table_name}")
# Silver Layer: Cleaned data
def write_silver(df, table_name):
"""Write cleaned data to Silver layer"""
cleaned_df = df \
.filter(df.amount > 0) \
.filter(df.user_id.isNotNull()) \
.withColumn("processed_at", current_timestamp())
cleaned_df.write \
.format("delta") \
.mode("overwrite") \
.partitionBy("date") \
.save(f"s3://lakehouse/silver/{table_name}")
# Gold Layer: Business aggregates
def write_gold(df, table_name):
"""Write business aggregates to Gold layer"""
df.write \
.format("delta") \
.mode("overwrite") \
.save(f"s3://lakehouse/gold/{table_name}")
Processing Layer
# ============================================================
# PROCESSING LAYER
# ============================================================
# Spark batch processing
def transform_orders(spark):
"""Transform orders for BI"""
# Read from Silver
orders = spark.read.format("delta").load("s3://lakehouse/silver/orders")
customers = spark.read.format("delta").load("s3://lakehouse/silver/customers")
# Join and aggregate
result = orders \
.join(customers, "customer_id") \
.groupBy("customer_segment", "order_date") \
.agg(
count("*").alias("order_count"),
sum("amount").alias("total_revenue")
)
# Write to Gold
write_gold(result, "daily_revenue")
# dbt SQL transformations
# models/marts/fct_orders.sql
"""
SELECT
order_id,
customer_id,
amount,
status,
order_date
FROM {{ ref('stg_orders') }}
WHERE status = 'completed'
"""
Serving Layer
# ============================================================
# SERVING LAYER
# ============================================================
# Redis for real-time features
import redis
class FeatureStore:
def __init__(self):
self.redis = redis.RedisCluster(
startup_nodes=[{"host": "redis", "port": 7000}],
decode_responses=True
)
def get_user_features(self, user_id):
"""Get real-time user features"""
return self.redis.hgetall(f"user:{user_id}:features")
def update_user_features(self, user_id, features):
"""Update user features"""
self.redis.hset(f"user:{user_id}:features", mapping=features)
# Snowflake for BI queries
import snowflake.connector
class QueryEngine:
def __init__(self):
self.conn = snowflake.connector.connect(
account='company.snowflakecomputing.com',
user='analyst',
password='password',
database='ANALYTICS',
schema='GOLD'
)
def execute_query(self, query):
"""Execute query on Snowflake"""
cursor = self.conn.cursor()
cursor.execute(query)
return cursor.fetchall()
API Layer
# ============================================================
# API LAYER
# ============================================================
from fastapi import FastAPI
from pydantic import BaseModel
import redis
import snowflake.connector
app = FastAPI()
# Dependencies
feature_store = FeatureStore()
query_engine = QueryEngine()
class RecommendationRequest(BaseModel):
user_id: str
num_recommendations: int = 5
@app.post("/recommendations")
async def get_recommendations(request: RecommendationRequest):
"""Get real-time recommendations"""
# Get user features
features = feature_store.get_user_features(request.user_id)
# Get recommendations from model
recommendations = await get_recommendations_from_model(features)
return {
"user_id": request.user_id,
"recommendations": recommendations
}
@app.get("/analytics/revenue")
async def get_revenue_analytics(start_date: str, end_date: str):
"""Get revenue analytics"""
query = f"""
SELECT
order_date,
SUM(total_revenue) as revenue
FROM daily_revenue
WHERE order_date BETWEEN '{start_date}' AND '{end_date}'
GROUP BY order_date
ORDER BY order_date
"""
results = query_engine.execute_query(query)
return {
"data": results,
"start_date": start_date,
"end_date": end_date
}
Orchestration
# ============================================================
# ORCHESTRATION (AIRFLOW)
# ============================================================
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago
with DAG(
'data_platform_pipeline',
schedule_interval='0 2 * * *',
start_date=days_ago(1),
catchup=False,
) as dag:
# Ingestion
ingest_batch = PythonOperator(
task_id='ingest_batch',
python_callable=ingest_batch_data,
)
# Bronze
write_bronze = PythonOperator(
task_id='write_bronze',
python_callable=write_bronze,
)
# Silver
write_silver = PythonOperator(
task_id='write_silver',
python_callable=write_silver,
)
# Gold (dbt)
run_dbt = BashOperator(
task_id='run_dbt',
bash_command='cd /opt/dbt && dbt run --select marts',
)
# Testing
run_tests = BashOperator(
task_id='run_tests',
bash_command='cd /opt/dbt && dbt test',
)
# Dependencies
ingest_batch >> write_bronze >> write_silver >> run_dbt >> run_tests
Scaling Strategies
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Scaling Strategies β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Horizontal Scaling: β
β - Add more nodes to Kafka cluster β
β - Add more Spark executors β
β - Add more Snowflake warehouses β
β β
β Vertical Scaling: β
β - Upgrade instance types β
β - Increase memory/CPU β
β β
β Caching: β
β - Redis for real-time features β
β - Snowflake result cache β
β - CDN for static assets β
β β
β Partitioning: β
β - Partition data by date β
β - Partition by region β
β - Partition by user β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Cost Estimation
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Cost Estimation β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Storage (10TB/day Γ 3 years): β
β - S3: 10TB Γ 365 Γ 3 Γ $0.023 = $25,287/year β
β - Delta Lake metadata: $5,000/year β
β - Total: $30,287/year β
β β
β Compute: β
β - Kafka: 3 nodes Γ $500/month Γ 12 = $18,000/year β
β - Spark: 10 nodes Γ $1,000/month Γ 12 = $120,000/year β
β - Snowflake: $200,000/year β
β - Redis: $30,000/year β
β - Total: $368,000/year β
β β
β Other: β
β - Airflow: $20,000/year β
β - Monitoring: $10,000/year β
β - Total: $30,000/year β
β β
β Grand Total: $428,287/year β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
π‘
Production Tip: Always start with a simpler architecture and iterate. Don't over-engineer from day one. Use managed services (Kafka, Snowflake) to reduce operational overhead. Focus on business value, not technology.
Common Follow-Up Questions
Q1: How do you handle data quality?
# Data quality checks
def validate_data_quality(df, rules):
"""Validate data quality"""
for column, rule in rules.items():
violations = df.filter(~rule(F.col(column))).count()
if violations > 0:
raise ValueError(f"Quality check failed for {column}")
return True
# Use dbt tests
# models/marts/_marts__models.yml
"""
- name: fct_orders
tests:
- unique
- not_null
- dbt_utils.accepted_range:
min_value: 0
"""
Q2: How do you handle disaster recovery?
# Cross-region replication
def replicate_to_dr_region(source_region, dr_region):
"""Replicate data to DR region"""
# Replicate S3
aws s3 sync s3://source-bucket s3://dr-bucket --source-region us-east-1 --region us-west-2
# Replicate Snowflake
# Use Snowflake replication
Q3: How do you monitor the platform?
# Monitoring stack
- Prometheus: Metrics collection
- Grafana: Dashboards
- AlertManager: Alerting
- ELK: Logging
# Key metrics
- Kafka lag
- Spark job duration
- Snowflake query performance
- API latency
- Error rates
Q4: How do you handle security?
# Security measures
- Encryption at rest (S3, Snowflake)
- Encryption in transit (TLS)
- IAM roles and policies
- VPC and security groups
- Data masking for PII
- Audit logging
β οΈ
Critical Consideration: Data platform design is about tradeoffs. There's no perfect architecture. Focus on your specific requirements, constraints, and team capabilities. Start simple, measure, and iterate.
Company-Specific Tips
Google Interview Tips
- Discuss BigQuery for analytics
- Explain Dataflow for stream processing
- Mention Cloud Storage for data lake
- Talk about Vertex AI for ML
Netflix Interview Tips
- Focus on multi-region architecture
- Explain cost optimization strategies
- Mention real-time personalization
- Talk about data mesh implementation
Uber Interview Tips
- Discuss real-time ride matching
- Explain geospatial data processing
- Mention ML for dynamic pricing
- Talk about multi-tenant platforms
βΉοΈ
Final Takeaway: Data platform design is a complex system design problem. Start with requirements, design high-level architecture, then dive into details. Always consider tradeoffs, scalability, and cost. Remember: the best architecture is the one that meets your specific needs.