Google & Netflix Interview

Design a Data Platform: End-to-End Architecture

System design for scalable data platforms

Interview Question

"Design a data platform for a ride-sharing company that processes 10TB of data daily. The platform must: (1) ingest real-time and batch data, (2) store data cost-effectively, (3) support ML model training, (4) provide BI dashboards, (5) handle 1000+ concurrent users. Include architecture, technology choices, and scaling strategies."

Difficulty: Hard | Frequently asked at Google, Netflix, Uber, Amazon

Theoretical Foundation

System Design Framework

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│              System Design Framework                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. Requirements Gathering                                 │
│     - Functional requirements                              │
│     - Non-functional requirements                          │
│     - Constraints                                          │
│                                                             │
│  2. High-Level Design                                      │
│     - Architecture components                              │
│     - Data flow                                            │
│     - Technology choices                                   │
│                                                             │
│  3. Detailed Design                                        │
│     - Component deep dives                                 │
│     - Data models                                          │
│     - APIs                                                 │
│                                                             │
│  4. Scaling Strategies                                     │
│     - Horizontal scaling                                   │
│     - Vertical scaling                                     │
│     - Caching                                              │
│                                                             │
│  5. Tradeoffs                                              │
│     - Cost vs performance                                  │
│     - Consistency vs availability                          │
│     - Complexity vs simplicity                             │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Requirements Analysis

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│              Requirements Analysis                          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Functional:                                                │
│  - Ingest 10TB/day (real-time + batch)                     │
│  - Store data for 3 years                                  │
│  - Support ML model training                               │
│  - Provide BI dashboards                                   │
│  - Handle 1000+ concurrent users                           │
│                                                             │
│  Non-Functional:                                            │
│  - Availability: 99.9%                                     │
│  - Latency: < 5 seconds for dashboards                     │
│  - Throughput: 100K queries/hour                           │
│  - Data durability: 99.999999999%                          │
│                                                             │
│  Constraints:                                               │
│  - Budget: $500K/year                                      │
│  - Team: 10 data engineers                                 │
│  - Timeline: 6 months                                      │
│                                                             │
└─────────────────────────────────────────────────────────────┘

High-Level Architecture

Data Flow

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│                    Data Flow                                │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Real-Time Path:                                            │
│  Mobile App ──▶ Kafka ──▶ Flink ──▶ Redis ──▶ API          │
│  (ride events)  (ingest) (process) (serve) (recommend)     │
│                                                             │
│  Batch Path:                                                │
│  Database ──▶ Airflow ──▶ Spark ──▶ Delta Lake ──▶ dbt     │
│  (daily ETL)   (orchestrate) (transform) (store) (model)   │
│                                                             │
│  ML Path:                                                   │
│  Delta Lake ──▶ Spark ──▶ Feature Store ──▶ MLflow ──▶ API │
│  (training data) (process) (features)    (train)   (serve) │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Code Implementation

Ingestion Layer

# ============================================================
# INGESTION LAYER
# ============================================================

from kafka import KafkaProducer
from pyspark.sql import SparkSession
import json

# Real-time ingestion
producer = KafkaProducer(
    bootstrap_servers='kafka:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    acks='all',
    enable_idempotence=True
)

def ingest_ride_event(event):
    """Ingest ride event into Kafka"""
    producer.send('ride_events', value=event)

# Batch ingestion
spark = SparkSession.builder \
    .appName("BatchIngestion") \
    .getOrCreate()

def ingest_batch_data(source_path, target_path):
    """Ingest batch data to Delta Lake"""
    df = spark.read.parquet(source_path)
    
    df.write \
        .format("delta") \
        .mode("append") \
        .partitionBy("date") \
        .save(target_path)

Storage Layer

# ============================================================
# STORAGE LAYER (LAKEHOUSE)
# ============================================================

# Bronze Layer: Raw data
def write_bronze(df, table_name):
    """Write raw data to Bronze layer"""
    df.write \
        .format("delta") \
        .mode("append") \
        .save(f"s3://lakehouse/bronze/{table_name}")

# Silver Layer: Cleaned data
def write_silver(df, table_name):
    """Write cleaned data to Silver layer"""
    cleaned_df = df \
        .filter(df.amount > 0) \
        .filter(df.user_id.isNotNull()) \
        .withColumn("processed_at", current_timestamp())
    
    cleaned_df.write \
        .format("delta") \
        .mode("overwrite") \
        .partitionBy("date") \
        .save(f"s3://lakehouse/silver/{table_name}")

# Gold Layer: Business aggregates
def write_gold(df, table_name):
    """Write business aggregates to Gold layer"""
    df.write \
        .format("delta") \
        .mode("overwrite") \
        .save(f"s3://lakehouse/gold/{table_name}")

Processing Layer

# ============================================================
# PROCESSING LAYER
# ============================================================

# Spark batch processing
def transform_orders(spark):
    """Transform orders for BI"""
    
    # Read from Silver
    orders = spark.read.format("delta").load("s3://lakehouse/silver/orders")
    customers = spark.read.format("delta").load("s3://lakehouse/silver/customers")
    
    # Join and aggregate
    result = orders \
        .join(customers, "customer_id") \
        .groupBy("customer_segment", "order_date") \
        .agg(
            count("*").alias("order_count"),
            sum("amount").alias("total_revenue")
        )
    
    # Write to Gold
    write_gold(result, "daily_revenue")

# dbt SQL transformations
# models/marts/fct_orders.sql
"""
SELECT
    order_id,
    customer_id,
    amount,
    status,
    order_date
FROM {{ ref('stg_orders') }}
WHERE status = 'completed'
"""

Serving Layer

# ============================================================
# SERVING LAYER
# ============================================================

# Redis for real-time features
import redis

class FeatureStore:
    def __init__(self):
        self.redis = redis.RedisCluster(
            startup_nodes=[{"host": "redis", "port": 7000}],
            decode_responses=True
        )
    
    def get_user_features(self, user_id):
        """Get real-time user features"""
        return self.redis.hgetall(f"user:{user_id}:features")
    
    def update_user_features(self, user_id, features):
        """Update user features"""
        self.redis.hset(f"user:{user_id}:features", mapping=features)

# Snowflake for BI queries
import snowflake.connector

class QueryEngine:
    def __init__(self):
        self.conn = snowflake.connector.connect(
            account='company.snowflakecomputing.com',
            user='analyst',
            password='password',
            database='ANALYTICS',
            schema='GOLD'
        )
    
    def execute_query(self, query):
        """Execute query on Snowflake"""
        cursor = self.conn.cursor()
        cursor.execute(query)
        return cursor.fetchall()

API Layer

# ============================================================
# API LAYER
# ============================================================

from fastapi import FastAPI
from pydantic import BaseModel
import redis
import snowflake.connector

app = FastAPI()

# Dependencies
feature_store = FeatureStore()
query_engine = QueryEngine()

class RecommendationRequest(BaseModel):
    user_id: str
    num_recommendations: int = 5

@app.post("/recommendations")
async def get_recommendations(request: RecommendationRequest):
    """Get real-time recommendations"""
    
    # Get user features
    features = feature_store.get_user_features(request.user_id)
    
    # Get recommendations from model
    recommendations = await get_recommendations_from_model(features)
    
    return {
        "user_id": request.user_id,
        "recommendations": recommendations
    }

@app.get("/analytics/revenue")
async def get_revenue_analytics(start_date: str, end_date: str):
    """Get revenue analytics"""
    
    query = f"""
        SELECT 
            order_date,
            SUM(total_revenue) as revenue
        FROM daily_revenue
        WHERE order_date BETWEEN '{start_date}' AND '{end_date}'
        GROUP BY order_date
        ORDER BY order_date
    """
    
    results = query_engine.execute_query(query)
    
    return {
        "data": results,
        "start_date": start_date,
        "end_date": end_date
    }

Orchestration

# ============================================================
# ORCHESTRATION (AIRFLOW)
# ============================================================

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

with DAG(
    'data_platform_pipeline',
    schedule_interval='0 2 * * *',
    start_date=days_ago(1),
    catchup=False,
) as dag:
    
    # Ingestion
    ingest_batch = PythonOperator(
        task_id='ingest_batch',
        python_callable=ingest_batch_data,
    )
    
    # Bronze
    write_bronze = PythonOperator(
        task_id='write_bronze',
        python_callable=write_bronze,
    )
    
    # Silver
    write_silver = PythonOperator(
        task_id='write_silver',
        python_callable=write_silver,
    )
    
    # Gold (dbt)
    run_dbt = BashOperator(
        task_id='run_dbt',
        bash_command='cd /opt/dbt && dbt run --select marts',
    )
    
    # Testing
    run_tests = BashOperator(
        task_id='run_tests',
        bash_command='cd /opt/dbt && dbt test',
    )
    
    # Dependencies
    ingest_batch >> write_bronze >> write_silver >> run_dbt >> run_tests

Scaling Strategies

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│              Scaling Strategies                             │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Horizontal Scaling:                                        │
│  - Add more nodes to Kafka cluster                         │
│  - Add more Spark executors                                │
│  - Add more Snowflake warehouses                           │
│                                                             │
│  Vertical Scaling:                                          │
│  - Upgrade instance types                                  │
│  - Increase memory/CPU                                     │
│                                                             │
│  Caching:                                                   │
│  - Redis for real-time features                            │
│  - Snowflake result cache                                  │
│  - CDN for static assets                                   │
│                                                             │
│  Partitioning:                                              │
│  - Partition data by date                                  │
│  - Partition by region                                     │
│  - Partition by user                                       │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Cost Estimation

Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│              Cost Estimation                                │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Storage (10TB/day × 3 years):                             │
│  - S3: 10TB × 365 × 3 × $0.023 = $25,287/year            │
│  - Delta Lake metadata: $5,000/year                        │
│  - Total: $30,287/year                                     │
│                                                             │
│  Compute:                                                   │
│  - Kafka: 3 nodes × $500/month × 12 = $18,000/year        │
│  - Spark: 10 nodes × $1,000/month × 12 = $120,000/year    │
│  - Snowflake: $200,000/year                                │
│  - Redis: $30,000/year                                     │
│  - Total: $368,000/year                                    │
│                                                             │
│  Other:                                                     │
│  - Airflow: $20,000/year                                   │
│  - Monitoring: $10,000/year                                │
│  - Total: $30,000/year                                     │
│                                                             │
│  Grand Total: $428,287/year                                │
│                                                             │
└─────────────────────────────────────────────────────────────┘

💡

Production Tip: Always start with a simpler architecture and iterate. Don't over-engineer from day one. Use managed services (Kafka, Snowflake) to reduce operational overhead. Focus on business value, not technology.

Common Follow-Up Questions

Q1: How do you handle data quality?

# Data quality checks
def validate_data_quality(df, rules):
    """Validate data quality"""
    for column, rule in rules.items():
        violations = df.filter(~rule(F.col(column))).count()
        if violations > 0:
            raise ValueError(f"Quality check failed for {column}")
    return True

# Use dbt tests
# models/marts/_marts__models.yml
"""
- name: fct_orders
  tests:
    - unique
    - not_null
    - dbt_utils.accepted_range:
        min_value: 0
"""

Q2: How do you handle disaster recovery?

# Cross-region replication
def replicate_to_dr_region(source_region, dr_region):
    """Replicate data to DR region"""
    
    # Replicate S3
    aws s3 sync s3://source-bucket s3://dr-bucket --source-region us-east-1 --region us-west-2
    
    # Replicate Snowflake
    # Use Snowflake replication

Q3: How do you monitor the platform?

# Monitoring stack
- Prometheus: Metrics collection
- Grafana: Dashboards
- AlertManager: Alerting
- ELK: Logging

# Key metrics
- Kafka lag
- Spark job duration
- Snowflake query performance
- API latency
- Error rates

Q4: How do you handle security?

# Security measures
- Encryption at rest (S3, Snowflake)
- Encryption in transit (TLS)
- IAM roles and policies
- VPC and security groups
- Data masking for PII
- Audit logging

⚠️

Critical Consideration: Data platform design is about tradeoffs. There's no perfect architecture. Focus on your specific requirements, constraints, and team capabilities. Start simple, measure, and iterate.

Company-Specific Tips

Google Interview Tips

Discuss BigQuery for analytics
Explain Dataflow for stream processing
Mention Cloud Storage for data lake
Talk about Vertex AI for ML

Netflix Interview Tips

Focus on multi-region architecture
Explain cost optimization strategies
Mention real-time personalization
Talk about data mesh implementation

Uber Interview Tips

Discuss real-time ride matching
Explain geospatial data processing
Mention ML for dynamic pricing
Talk about multi-tenant platforms

ℹ️

Final Takeaway: Data platform design is a complex system design problem. Start with requirements, design high-level architecture, then dive into details. Always consider tradeoffs, scalability, and cost. Remember: the best architecture is the one that meets your specific needs.