πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Data Lake vs Data Warehouse vs Lakehouse

Data EngineeringData Architecture⭐ Premium

Advertisement

Databricks & Snowflake Interview

Data Lake vs Data Warehouse vs Lakehouse

Understanding modern data architecture paradigms

Interview Question

"The CTO asks: 'Should we build a data lake, a data warehouse, or a lakehouse?' The company has: (1) 50TB of semi-structured logs, (2) 10TB of structured transactional data, (3) ML team needs raw data access, (4) BI team needs fast SQL queries, (5) budget is limited. Design the architecture and justify your choice."

Difficulty: Hard | Frequently asked at Databricks, Snowflake, Uber, Airbnb


Theoretical Foundation

Data Warehouse

A data warehouse is a centralized repository for structured, processed data optimized for analytical queries.

Characteristics:

  • Schema-on-write: Data must conform to a predefined schema before loading
  • Structured data only: Rows, columns, types
  • ACID transactions: Full compliance guarantees
  • Optimized for reads: Columnar storage, indexing, compression
  • ETL required: Transform before loading
DATA WAREHOUSESources β†’ ETL β†’ Schema Validation β†’ Storage β†’ Query EngineSales (Fact)Inventory (Fact)Customers (Dim)Products (Dim)Query Engine: Presto, Trino, Snowflake, BigQueryStorage: Columnar (Parquet, ORC, proprietary)

Pros:

  • Fast SQL queries
  • Data quality guaranteed
  • ACID compliance
  • Strong governance

Cons:

  • Expensive (storage + compute)
  • Inflexible schema
  • Long development cycles
  • Can't handle unstructured data

Data Lake

A data lake stores raw data in its native format, schema-on-read.

Characteristics:

  • Schema-on-read: Schema applied when reading, not writing
  • All data types: Structured, semi-structured, unstructured
  • Low-cost storage: S3, ADLS, GCS
  • ELT pattern: Load first, transform later
  • Supports ML: Direct access to raw data
DATA LAKESources β†’ Raw Ingestion β†’ Storage β†’ Processing β†’ ServingRaw Zonelogs/ json/ csv/parquet/ avro/Curated Zoneclean/ validated/enriched/Consumer Zonefeatures/ aggregates/ml_ready/Storage: S3, ADLS, GCS (object storage)Processing: Spark, Presto, Athena

Pros:

  • Low cost
  • Flexible schema
  • Supports all data types
  • Good for ML/AI
  • Scalable

Cons:

  • No ACID transactions (with raw files)
  • Data quality issues
  • No built-in governance
  • Query performance varies

Lakehouse

A lakehouse combines data lake storage with data warehouse management.

Characteristics:

  • Open formats: Parquet, Delta Lake, Iceberg, Hudi
  • ACID on data lakes: Transaction support on object storage
  • Schema enforcement: Optional schema validation
  • Time travel: Version control for data
  • Unified analytics: SQL + ML + Streaming
LAKEHOUSE ARCHITECTURELakehouse ArchitectureIngestion: Kafka, Kinesis, Spark Streaming↓Storage: S3/ADLS + Delta Lake/Iceberg↓Serving: SQL (Trino), ML (MLflow), BI (Tableau)Table Format: Delta Lake / Apache Iceberg / Apache HudiTransaction Log: _delta_log/ or metadata/ | File Format: Parquet

Pros:

  • Best of both worlds
  • Open standards
  • ACID on cheap storage
  • Time travel
  • Unified platform

Cons:

  • Newer technology (less mature)
  • Requires expertise
  • Tooling still evolving

Comparison Matrix

FeatureData WarehouseData LakeLakehouse
Data TypesStructured onlyAll typesAll types
SchemaOn-writeOn-readFlexible
Storage CostHighLowLow
Query PerformanceFast (optimized)VariableFast (improving)
ACIDYesNoYes
Time TravelLimitedNoYes
ML SupportLimitedExcellentGood
GovernanceStrongWeakImproving
ScalabilityGoodExcellentExcellent
Vendor Lock-inHighLowLow

The Evolution of Data Architecture

2000s: Data Warehouses dominate

Architecture Diagram
ETL β†’ Warehouse β†’ BI Tools

2010s: Data Lakes emerge

Architecture Diagram
ELT β†’ Data Lake β†’ Spark/Presto β†’ BI/ML

2020s: Lakehouse becomes mainstream

Architecture Diagram
Streaming β†’ Lakehouse β†’ Unified Analytics

ℹ️

Key Insight: The lakehouse isn't just a combination of lake + warehouseβ€”it's a new paradigm that uses open table formats (Delta Lake, Iceberg, Hudi) to bring warehouse-like features to data lakes. The key innovation is the transaction log that enables ACID, time travel, and schema enforcement on object storage.

Open Table Formats Deep Dive

Delta Lake

# Delta Lake: ACID transactions on S3
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .getOrCreate()

# Create Delta table
spark.sql("""
    CREATE TABLE sales (
        sale_id BIGINT,
        product_id INT,
        quantity INT,
        amount DECIMAL(10,2),
        sale_date DATE
    ) USING DELTA
    LOCATION 's3://data-lake/sales/'
""")

# ACID transaction
spark.sql("""
    INSERT INTO sales VALUES
    (1, 101, 5, 49.99, '2024-01-15'),
    (2, 102, 3, 29.99, '2024-01-15')
""")

# Time travel
spark.sql("SELECT * FROM sales VERSION AS OF 0")  # Initial state
spark.sql("SELECT * FROM sales TIMESTAMP AS OF '2024-01-15 10:00:00'")

Apache Iceberg

# Iceberg: Hidden partitioning + partition evolution
spark.sql("""
    CREATE TABLE catalog.db.events (
        event_id BIGINT,
        event_time TIMESTAMP,
        user_id STRING,
        event_type STRING
    ) USING iceberg
    PARTITIONED BY (days(event_time))
""")

# Partition evolution (no data rewrite)
spark.sql("""
    ALTER TABLE catalog.db.events
    DROP PARTITION FIELD days(event_time)
""")

spark.sql("""
    ALTER TABLE catalog.db.events
    PARTITION BY (hours(event_time), user_id)
""")

Apache Hudi

# Hudi: Incremental processing + CDC
from hudi import DeltaStreamer

# Hudi table with upsert support
spark.sql("""
    CREATE TABLE hudi_events (
        uuid STRING,
        user_id STRING,
        event_time TIMESTAMP,
        event_type STRING
    ) USING hudi
    OPTIONS (
        'hoodie.table.name' = 'events',
        'hoodie.datasource.write.recordkey.field' = 'uuid',
        'hoodie.datasource.write.precombine.field' = 'event_time',
        'hoodie.datasource.write.operation' = 'upsert'
    )
""")

Architecture Patterns

Pattern 1: Lambda Architecture

Lambda ArchitectureSpeed LayerKafka β†’ Spark Streaming β†’ Real-time ViewsBatch LayerKafka β†’ Spark Batch β†’ Batch ViewsServing LayerMerge real-time + batch β†’ Query API

Pros: Low latency + high throughput Cons: Complex, duplicate logic

Pattern 2: Kappa Architecture

Kappa ArchitectureKafka (log)→Spark Streaming→ViewsAll processing is streaming. Batch = special case of streaming with large windows.

Pros: Simpler, single codebase Cons: Requires robust streaming infrastructure

Pattern 3: Lakehouse Architecture

Lakehouse ArchitectureIngestion: Kafka + Kinesis + Debezium (CDC)↓Storage: Delta Lake on S3 (Bronze β†’ Silver β†’ Gold)↓Processing: Spark + Flink + dbt↓Serving: Trino + MLflow + TableauBronze: RawSilver: CleanedGold: Aggregates

ℹ️

Best Practice: The Medallion Architecture (Bronze β†’ Silver β†’ Gold) is the most popular pattern for lakehouses. Each layer adds value: Bronze preserves raw data, Silver ensures quality, Gold provides business-ready aggregates.


Code Implementation

Building a Lakehouse with Delta Lake

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from delta.tables import DeltaTable

spark = SparkSession.builder \
    .appName("LakehouseArchitecture") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", 
            "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# ============================================================
# BRONZE LAYER: Raw data ingestion
# ============================================================

# Read raw data from various sources
raw_logs = spark.read.text("s3://data-lake/raw/logs/")
raw_transactions = spark.read.json("s3://data-lake/raw/transactions/")
raw_customers = spark.read.parquet("s3://data-lake/raw/customers/")

# Write to Bronze (raw, no transformations)
raw_logs.write.format("delta") \
    .mode("append") \
    .save("s3://lakehouse/bronze/logs/")

raw_transactions.write.format("delta") \
    .mode("append") \
    .save("s3://lakehouse/bronze/transactions/")

raw_customers.write.format("delta") \
    .mode("append") \
    .save("s3://lakehouse/bronze/customers/")

# ============================================================
# SILVER LAYER: Cleaned and validated
# ============================================================

# Read from Bronze
bronze_transactions = spark.read.format("delta") \
    .load("s3://lakehouse/bronze/transactions/")

# Clean and validate
silver_transactions = bronze_transactions \
    .filter(F.col("amount").isNotNull()) \
    .filter(F.col("amount") > 0) \
    .filter(F.col("transaction_date").isNotNull()) \
    .withColumn("amount", F.round("amount", 2)) \
    .withColumn("transaction_date", F.to_date("transaction_date")) \
    .withColumn("processed_timestamp", F.current_timestamp())

# Write to Silver with merge (upsert)
if DeltaTable.isDeltaTable(spark, "s3://lakehouse/silver/transactions/"):
    delta_table = DeltaTable.forPath(spark, "s3://lakehouse/silver/transactions/")
    
    delta_table.alias("target").merge(
        silver_transactions.alias("source"),
        "target.transaction_id = source.transaction_id"
    ).whenMatchedUpdateAll() \
     .whenNotMatchedInsertAll() \
     .execute()
else:
    silver_transactions.write.format("delta") \
        .mode("overwrite") \
        .partitionBy("transaction_date") \
        .save("s3://lakehouse/silver/transactions/")

# ============================================================
# GOLD LAYER: Business aggregates
# ============================================================

# Read from Silver
silver_tx = spark.read.format("delta") \
    .load("s3://lakehouse/silver/transactions/")
silver_customers = spark.read.format("delta") \
    .load("s3://lakehouse/silver/customers/")

# Create business aggregates
gold_customer_summary = silver_tx \
    .join(silver_customers, "customer_id") \
    .groupBy("customer_id", "customer_name", "segment") \
    .agg(
        F.count("*").alias("total_transactions"),
        F.sum("amount").alias("total_amount"),
        F.avg("amount").alias("avg_amount"),
        F.min("transaction_date").alias("first_transaction"),
        F.max("transaction_date").alias("last_transaction")
    )

# Write to Gold
gold_customer_summary.write.format("delta") \
    .mode("overwrite") \
    .save("s3://lakehouse/gold/customer_summary/")

Time Travel and Versioning

# ============================================================
# TIME TRAVEL: Query historical data
# ============================================================

# Query by version
df_v0 = spark.read.format("delta") \
    .option("versionAsOf", 0) \
    .load("s3://lakehouse/silver/transactions/")

# Query by timestamp
df_historical = spark.read.format("delta") \
    .option("timestampAsOf", "2024-01-15") \
    .load("s3://lakehouse/silver/transactions/")

# Compare versions
current = spark.read.format("delta").load("s3://lakehouse/silver/transactions/")
previous = spark.read.format("delta").option("versionAsOf", 5) \
    .load("s3://lakehouse/silver/transactions/")

# Find changes
changes = current.alias("curr") \
    .join(previous.alias("prev"), 
          F.col("curr.transaction_id") == F.col("prev.transaction_id"), 
          "left") \
    .filter(F.col("prev.transaction_id").isNull() | 
            (F.col("curr.amount") != F.col("prev.amount")))

# Get Delta table history
delta_table = DeltaTable.forPath(spark, "s3://lakehouse/silver/transactions/")
history = delta_table.history()
history.show(truncate=False)

Data Quality Framework

# ============================================================
# DATA QUALITY: Validate at each layer
# ============================================================

def validate_data_quality(df, rules):
    """
    Validate DataFrame against quality rules.
    
    Args:
        df: Input DataFrame
        rules: Dict of {column: [rule_functions]}
    """
    results = {}
    
    for column, column_rules in rules.items():
        for rule in column_rules:
            rule_name = rule.__name__
            violations = df.filter(~rule(F.col(column))).count()
            total = df.count()
            quality_score = 1 - (violations / total) if total > 0 else 1
            
            results[f"{column}.{rule_name}"] = {
                "violations": violations,
                "total": total,
                "quality_score": quality_score,
                "passed": violations == 0
            }
    
    return results

# Define quality rules
quality_rules = {
    "amount": [
        lambda col: col.isNotNull(),
        lambda col: col > 0,
        lambda col: col < 1000000,
    ],
    "transaction_date": [
        lambda col: col.isNotNull(),
        lambda col: col <= F.current_date(),
    ],
    "customer_id": [
        lambda col: col.isNotNull(),
        lambda col: col.rlike("^CUST[0-9]{6}$"),
    ]
}

# Validate Silver layer
silver_df = spark.read.format("delta").load("s3://lakehouse/silver/transactions/")
quality_results = validate_data_quality(silver_df, quality_rules)

# Check results
for check, result in quality_results.items():
    if not result["passed"]:
        print(f"FAILED: {check} - {result['violations']} violations")
        # Alert or quarantine

Cost Analysis

# ============================================================
# COST ANALYSIS: Lakehouse vs Warehouse vs Lake
# ============================================================

# Assumptions
data_size_tb = 50  # 50TB
queries_per_day = 1000
avg_query_gb = 10

# Option 1: Traditional Warehouse (e.g., Redshift)
# Storage: $0.025/GB/month Γ— 50TB = $1,250/month
# Compute: 8-node ra3.4xlarge = $26.68/hr Γ— 730hr = $19,477/month
# Total: ~$20,727/month

# Option 2: Data Lake (S3 + Athena)
# Storage: $0.023/GB/month Γ— 50TB = $1,150/month
# Query: $5/TB scanned Γ— 10GB Γ— 1000 = $50/day = $1,500/month
# Total: ~$2,650/month

# Option 3: Lakehouse (S3 + Delta Lake + Spark)
# Storage: $0.023/GB/month Γ— 50TB = $1,150/month
# Compute: EMR cluster (auto-scaling) = ~$5,000/month
# Total: ~$6,150/month

# Option 4: Snowflake
# Storage: $23/TB/month Γ— 50TB = $1,150/month
# Compute: 10 credits/hr Γ— $3/credit Γ— 730hr = $21,900/month
# Total: ~$23,050/month

print("Cost Comparison (monthly):")
print(f"Warehouse:  ${20727:,}")
print(f"Data Lake:  ${2650:,}")
print(f"Lakehouse:  ${6150:,}")
print(f"Snowflake:  ${23050:,}")

⚠️

Critical Consideration: Cost analysis depends heavily on usage patterns. Data lakes are cheapest for storage-heavy workloads. Warehouses are cheapest for compute-heavy analytical workloads. Lakehouses offer the best balance for mixed workloads.


Common Follow-Up Questions

Q1: When would you choose a warehouse over a lakehouse?

Choose a warehouse when:

  • All data is structured
  • SQL performance is critical
  • Team has strong SQL skills
  • Governance requirements are strict
  • Budget allows for premium pricing

Q2: How do you migrate from a warehouse to a lakehouse?

# Migration strategy: Big Bang vs Phased
# Phased approach:

# Phase 1: Replicate to lakehouse (dual-write)
# Phase 2: Migrate read workloads one by one
# Phase 3: Stop writes to warehouse
# Phase 4: Decommission warehouse

# Example: Migrate a table
def migrate_table(spark, warehouse_table, lakehouse_path):
    """Migrate a table from warehouse to lakehouse"""
    
    # Read from warehouse
    df = spark.sql(f"SELECT * FROM {warehouse_table}")
    
    # Write to lakehouse with same schema
    df.write.format("delta") \
        .mode("overwrite") \
        .save(lakehouse_path)
    
    # Validate row counts
    warehouse_count = spark.sql(f"SELECT COUNT(*) FROM {warehouse_table}").collect()[0][0]
    lakehouse_count = spark.read.format("delta").load(lakehouse_path).count()
    
    assert warehouse_count == lakehouse_count, f"Count mismatch: {warehouse_count} vs {lakehouse_count}"
    
    print(f"Successfully migrated {warehouse_table} ({warehouse_count} rows)")

Q3: How do you handle governance in a lakehouse?

# Unity Catalog for governance (Databricks)
spark.sql("""
    -- Create catalog and schema
    CREATE CATALOG IF NOT EXISTS production;
    CREATE SCHEMA IF NOT EXISTS production.sales;
    
    -- Create table with governance
    CREATE TABLE production.sales.transactions (
        transaction_id BIGINT,
        customer_id STRING,
        amount DECIMAL(10,2),
        transaction_date DATE
    ) COMMENT 'Sales transactions'
    TBLPROPERTIES (
        'quality' = 'gold',
        'owner' = 'sales-team',
        'pii' = 'customer_id'
    );
    
    -- Grant access
    GRANT SELECT ON production.sales.transactions TO `sales-analysts`;
    GRANT MODIFY ON production.sales.transactions TO `sales-engineers`;
""")

Q4: What about real-time in a lakehouse?

# Real-time ingestion into lakehouse
# Using Spark Structured Streaming + Delta Lake

streaming_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "kafka:9092") \
    .option("subscribe", "transactions") \
    .load()

# Parse and transform
parsed_df = streaming_df \
    .select(F.from_json(F.col("value").cast("string"), schema).alias("data")) \
    .select("data.*")

# Write to Delta Lake (micro-batch)
query = parsed_df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "s3://checkpoints/transactions/") \
    .start("s3://lakehouse/silver/transactions/")

πŸ’‘

Interview Tip: When discussing data architecture, always ask about: (1) data volume, (2) query patterns, (3) latency requirements, (4) team skills, (5) budget. The "right" architecture depends on these factors. There's no one-size-fits-all solution.


Company-Specific Tips

Databricks Interview Tips

  • Emphasize Delta Lake and Lakehouse architecture
  • Discuss Medallion Architecture (Bronze/Silver/Gold)
  • Be ready to explain time travel and ACID on S3
  • Talk about Photon engine and performance optimization

Snowflake Interview Tips

  • Focus on data warehouse strengths
  • Discuss separation of storage and compute
  • Mention time travel and Zero-Copy Cloning
  • Talk about Snowpark for ML workloads

Netflix Interview Tips

  • Discuss content recommendation architecture
  • Explain multi-tenant data platforms
  • Mention cost optimization strategies
  • Talk about real-time personalization pipelines

ℹ️

Final Takeaway: The lakehouse is becoming the default choice for new data platforms. It offers the flexibility of data lakes with the governance and performance of warehouses. However, traditional warehouses still make sense for purely SQL workloads with strict governance requirements. The key is understanding your use case and choosing the architecture that fits.

Advertisement