πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Design a Data Platform: End-to-End Architecture

Data EngineeringArchitecture Design⭐ Premium

Advertisement

Google & Netflix Interview

Design a Data Platform: End-to-End Architecture

System design for scalable data platforms

Interview Question

"Design a data platform for a ride-sharing company that processes 10TB of data daily. The platform must: (1) ingest real-time and batch data, (2) store data cost-effectively, (3) support ML model training, (4) provide BI dashboards, (5) handle 1000+ concurrent users. Include architecture, technology choices, and scaling strategies."

Difficulty: Hard | Frequently asked at Google, Netflix, Uber, Amazon


Theoretical Foundation

System Design Framework

Architecture Diagram
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              System Design Framework                        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                             β”‚
β”‚  1. Requirements Gathering                                 β”‚
β”‚     - Functional requirements                              β”‚
β”‚     - Non-functional requirements                          β”‚
β”‚     - Constraints                                          β”‚
β”‚                                                             β”‚
β”‚  2. High-Level Design                                      β”‚
β”‚     - Architecture components                              β”‚
β”‚     - Data flow                                            β”‚
β”‚     - Technology choices                                   β”‚
β”‚                                                             β”‚
β”‚  3. Detailed Design                                        β”‚
β”‚     - Component deep dives                                 β”‚
β”‚     - Data models                                          β”‚
β”‚     - APIs                                                 β”‚
β”‚                                                             β”‚
β”‚  4. Scaling Strategies                                     β”‚
β”‚     - Horizontal scaling                                   β”‚
β”‚     - Vertical scaling                                     β”‚
β”‚     - Caching                                              β”‚
β”‚                                                             β”‚
β”‚  5. Tradeoffs                                              β”‚
β”‚     - Cost vs performance                                  β”‚
β”‚     - Consistency vs availability                          β”‚
β”‚     - Complexity vs simplicity                             β”‚
β”‚                                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Requirements Analysis

Architecture Diagram
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              Requirements Analysis                          β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                             β”‚
β”‚  Functional:                                                β”‚
β”‚  - Ingest 10TB/day (real-time + batch)                     β”‚
β”‚  - Store data for 3 years                                  β”‚
β”‚  - Support ML model training                               β”‚
β”‚  - Provide BI dashboards                                   β”‚
β”‚  - Handle 1000+ concurrent users                           β”‚
β”‚                                                             β”‚
β”‚  Non-Functional:                                            β”‚
β”‚  - Availability: 99.9%                                     β”‚
β”‚  - Latency: < 5 seconds for dashboards                     β”‚
β”‚  - Throughput: 100K queries/hour                           β”‚
β”‚  - Data durability: 99.999999999%                          β”‚
β”‚                                                             β”‚
β”‚  Constraints:                                               β”‚
β”‚  - Budget: $500K/year                                      β”‚
β”‚  - Team: 10 data engineers                                 β”‚
β”‚  - Timeline: 6 months                                      β”‚
β”‚                                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

High-Level Architecture

Data Platform ArchitectureSources β†’ Ingestion β†’ Storage β†’ Processing β†’ ServingIngestion Layer: Kafka (real-time) | Airflow (batch) | Fivetran (managed)↓Storage Layer (Lakehouse): Bronze (Raw) β†’ Silver (Cleaned) β†’ Gold (Aggregates)↓Processing Layer: Spark (batch) | Flink (stream) | dbt (SQL)↓Serving Layer: Snowflake (BI) | Redis (real-time) | MLflow (ML)

Data Flow

Architecture Diagram
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Data Flow                                β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                             β”‚
β”‚  Real-Time Path:                                            β”‚
β”‚  Mobile App ──▢ Kafka ──▢ Flink ──▢ Redis ──▢ API          β”‚
β”‚  (ride events)  (ingest) (process) (serve) (recommend)     β”‚
β”‚                                                             β”‚
β”‚  Batch Path:                                                β”‚
β”‚  Database ──▢ Airflow ──▢ Spark ──▢ Delta Lake ──▢ dbt     β”‚
β”‚  (daily ETL)   (orchestrate) (transform) (store) (model)   β”‚
β”‚                                                             β”‚
β”‚  ML Path:                                                   β”‚
β”‚  Delta Lake ──▢ Spark ──▢ Feature Store ──▢ MLflow ──▢ API β”‚
β”‚  (training data) (process) (features)    (train)   (serve) β”‚
β”‚                                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Code Implementation

Ingestion Layer

# ============================================================
# INGESTION LAYER
# ============================================================

from kafka import KafkaProducer
from pyspark.sql import SparkSession
import json

# Real-time ingestion
producer = KafkaProducer(
    bootstrap_servers='kafka:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    acks='all',
    enable_idempotence=True
)

def ingest_ride_event(event):
    """Ingest ride event into Kafka"""
    producer.send('ride_events', value=event)

# Batch ingestion
spark = SparkSession.builder \
    .appName("BatchIngestion") \
    .getOrCreate()

def ingest_batch_data(source_path, target_path):
    """Ingest batch data to Delta Lake"""
    df = spark.read.parquet(source_path)
    
    df.write \
        .format("delta") \
        .mode("append") \
        .partitionBy("date") \
        .save(target_path)

Storage Layer

# ============================================================
# STORAGE LAYER (LAKEHOUSE)
# ============================================================

# Bronze Layer: Raw data
def write_bronze(df, table_name):
    """Write raw data to Bronze layer"""
    df.write \
        .format("delta") \
        .mode("append") \
        .save(f"s3://lakehouse/bronze/{table_name}")

# Silver Layer: Cleaned data
def write_silver(df, table_name):
    """Write cleaned data to Silver layer"""
    cleaned_df = df \
        .filter(df.amount > 0) \
        .filter(df.user_id.isNotNull()) \
        .withColumn("processed_at", current_timestamp())
    
    cleaned_df.write \
        .format("delta") \
        .mode("overwrite") \
        .partitionBy("date") \
        .save(f"s3://lakehouse/silver/{table_name}")

# Gold Layer: Business aggregates
def write_gold(df, table_name):
    """Write business aggregates to Gold layer"""
    df.write \
        .format("delta") \
        .mode("overwrite") \
        .save(f"s3://lakehouse/gold/{table_name}")

Processing Layer

# ============================================================
# PROCESSING LAYER
# ============================================================

# Spark batch processing
def transform_orders(spark):
    """Transform orders for BI"""
    
    # Read from Silver
    orders = spark.read.format("delta").load("s3://lakehouse/silver/orders")
    customers = spark.read.format("delta").load("s3://lakehouse/silver/customers")
    
    # Join and aggregate
    result = orders \
        .join(customers, "customer_id") \
        .groupBy("customer_segment", "order_date") \
        .agg(
            count("*").alias("order_count"),
            sum("amount").alias("total_revenue")
        )
    
    # Write to Gold
    write_gold(result, "daily_revenue")

# dbt SQL transformations
# models/marts/fct_orders.sql
"""
SELECT
    order_id,
    customer_id,
    amount,
    status,
    order_date
FROM {{ ref('stg_orders') }}
WHERE status = 'completed'
"""

Serving Layer

# ============================================================
# SERVING LAYER
# ============================================================

# Redis for real-time features
import redis

class FeatureStore:
    def __init__(self):
        self.redis = redis.RedisCluster(
            startup_nodes=[{"host": "redis", "port": 7000}],
            decode_responses=True
        )
    
    def get_user_features(self, user_id):
        """Get real-time user features"""
        return self.redis.hgetall(f"user:{user_id}:features")
    
    def update_user_features(self, user_id, features):
        """Update user features"""
        self.redis.hset(f"user:{user_id}:features", mapping=features)

# Snowflake for BI queries
import snowflake.connector

class QueryEngine:
    def __init__(self):
        self.conn = snowflake.connector.connect(
            account='company.snowflakecomputing.com',
            user='analyst',
            password='password',
            database='ANALYTICS',
            schema='GOLD'
        )
    
    def execute_query(self, query):
        """Execute query on Snowflake"""
        cursor = self.conn.cursor()
        cursor.execute(query)
        return cursor.fetchall()

API Layer

# ============================================================
# API LAYER
# ============================================================

from fastapi import FastAPI
from pydantic import BaseModel
import redis
import snowflake.connector

app = FastAPI()

# Dependencies
feature_store = FeatureStore()
query_engine = QueryEngine()

class RecommendationRequest(BaseModel):
    user_id: str
    num_recommendations: int = 5

@app.post("/recommendations")
async def get_recommendations(request: RecommendationRequest):
    """Get real-time recommendations"""
    
    # Get user features
    features = feature_store.get_user_features(request.user_id)
    
    # Get recommendations from model
    recommendations = await get_recommendations_from_model(features)
    
    return {
        "user_id": request.user_id,
        "recommendations": recommendations
    }

@app.get("/analytics/revenue")
async def get_revenue_analytics(start_date: str, end_date: str):
    """Get revenue analytics"""
    
    query = f"""
        SELECT 
            order_date,
            SUM(total_revenue) as revenue
        FROM daily_revenue
        WHERE order_date BETWEEN '{start_date}' AND '{end_date}'
        GROUP BY order_date
        ORDER BY order_date
    """
    
    results = query_engine.execute_query(query)
    
    return {
        "data": results,
        "start_date": start_date,
        "end_date": end_date
    }

Orchestration

# ============================================================
# ORCHESTRATION (AIRFLOW)
# ============================================================

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

with DAG(
    'data_platform_pipeline',
    schedule_interval='0 2 * * *',
    start_date=days_ago(1),
    catchup=False,
) as dag:
    
    # Ingestion
    ingest_batch = PythonOperator(
        task_id='ingest_batch',
        python_callable=ingest_batch_data,
    )
    
    # Bronze
    write_bronze = PythonOperator(
        task_id='write_bronze',
        python_callable=write_bronze,
    )
    
    # Silver
    write_silver = PythonOperator(
        task_id='write_silver',
        python_callable=write_silver,
    )
    
    # Gold (dbt)
    run_dbt = BashOperator(
        task_id='run_dbt',
        bash_command='cd /opt/dbt && dbt run --select marts',
    )
    
    # Testing
    run_tests = BashOperator(
        task_id='run_tests',
        bash_command='cd /opt/dbt && dbt test',
    )
    
    # Dependencies
    ingest_batch >> write_bronze >> write_silver >> run_dbt >> run_tests

Scaling Strategies

Architecture Diagram
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              Scaling Strategies                             β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                             β”‚
β”‚  Horizontal Scaling:                                        β”‚
β”‚  - Add more nodes to Kafka cluster                         β”‚
β”‚  - Add more Spark executors                                β”‚
β”‚  - Add more Snowflake warehouses                           β”‚
β”‚                                                             β”‚
β”‚  Vertical Scaling:                                          β”‚
β”‚  - Upgrade instance types                                  β”‚
β”‚  - Increase memory/CPU                                     β”‚
β”‚                                                             β”‚
β”‚  Caching:                                                   β”‚
β”‚  - Redis for real-time features                            β”‚
β”‚  - Snowflake result cache                                  β”‚
β”‚  - CDN for static assets                                   β”‚
β”‚                                                             β”‚
β”‚  Partitioning:                                              β”‚
β”‚  - Partition data by date                                  β”‚
β”‚  - Partition by region                                     β”‚
β”‚  - Partition by user                                       β”‚
β”‚                                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Cost Estimation

Architecture Diagram
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              Cost Estimation                                β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                             β”‚
β”‚  Storage (10TB/day Γ— 3 years):                             β”‚
β”‚  - S3: 10TB Γ— 365 Γ— 3 Γ— $0.023 = $25,287/year            β”‚
β”‚  - Delta Lake metadata: $5,000/year                        β”‚
β”‚  - Total: $30,287/year                                     β”‚
β”‚                                                             β”‚
β”‚  Compute:                                                   β”‚
β”‚  - Kafka: 3 nodes Γ— $500/month Γ— 12 = $18,000/year        β”‚
β”‚  - Spark: 10 nodes Γ— $1,000/month Γ— 12 = $120,000/year    β”‚
β”‚  - Snowflake: $200,000/year                                β”‚
β”‚  - Redis: $30,000/year                                     β”‚
β”‚  - Total: $368,000/year                                    β”‚
β”‚                                                             β”‚
β”‚  Other:                                                     β”‚
β”‚  - Airflow: $20,000/year                                   β”‚
β”‚  - Monitoring: $10,000/year                                β”‚
β”‚  - Total: $30,000/year                                     β”‚
β”‚                                                             β”‚
β”‚  Grand Total: $428,287/year                                β”‚
β”‚                                                             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ’‘

Production Tip: Always start with a simpler architecture and iterate. Don't over-engineer from day one. Use managed services (Kafka, Snowflake) to reduce operational overhead. Focus on business value, not technology.


Common Follow-Up Questions

Q1: How do you handle data quality?

# Data quality checks
def validate_data_quality(df, rules):
    """Validate data quality"""
    for column, rule in rules.items():
        violations = df.filter(~rule(F.col(column))).count()
        if violations > 0:
            raise ValueError(f"Quality check failed for {column}")
    return True

# Use dbt tests
# models/marts/_marts__models.yml
"""
- name: fct_orders
  tests:
    - unique
    - not_null
    - dbt_utils.accepted_range:
        min_value: 0
"""

Q2: How do you handle disaster recovery?

# Cross-region replication
def replicate_to_dr_region(source_region, dr_region):
    """Replicate data to DR region"""
    
    # Replicate S3
    aws s3 sync s3://source-bucket s3://dr-bucket --source-region us-east-1 --region us-west-2
    
    # Replicate Snowflake
    # Use Snowflake replication

Q3: How do you monitor the platform?

# Monitoring stack
- Prometheus: Metrics collection
- Grafana: Dashboards
- AlertManager: Alerting
- ELK: Logging

# Key metrics
- Kafka lag
- Spark job duration
- Snowflake query performance
- API latency
- Error rates

Q4: How do you handle security?

# Security measures
- Encryption at rest (S3, Snowflake)
- Encryption in transit (TLS)
- IAM roles and policies
- VPC and security groups
- Data masking for PII
- Audit logging

⚠️

Critical Consideration: Data platform design is about tradeoffs. There's no perfect architecture. Focus on your specific requirements, constraints, and team capabilities. Start simple, measure, and iterate.


Company-Specific Tips

Google Interview Tips

  • Discuss BigQuery for analytics
  • Explain Dataflow for stream processing
  • Mention Cloud Storage for data lake
  • Talk about Vertex AI for ML

Netflix Interview Tips

  • Focus on multi-region architecture
  • Explain cost optimization strategies
  • Mention real-time personalization
  • Talk about data mesh implementation

Uber Interview Tips

  • Discuss real-time ride matching
  • Explain geospatial data processing
  • Mention ML for dynamic pricing
  • Talk about multi-tenant platforms

ℹ️

Final Takeaway: Data platform design is a complex system design problem. Start with requirements, design high-level architecture, then dive into details. Always consider tradeoffs, scalability, and cost. Remember: the best architecture is the one that meets your specific needs.

Advertisement