πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Data Lakehouse Architecture: Delta Lake, Iceberg, Hudi

Data EngineeringData Architecture⭐ Premium

Advertisement

Data Lakehouse Architecture: Delta Lake, Iceberg, Hudi

Difficulty: Senior Level | Companies: Databricks, Netflix, Uber, Airbnb, Apple

1. What is the Data Lakehouse?

The lakehouse combines data lake flexibility with data warehouse reliability:

Architecture Diagram
Traditional Stack:
Data Lake (S3/GCS) β†’ ETL β†’ Data Warehouse β†’ Analytics
  (Cheap, flexible)   (Slow)  (Expensive, fast)

Lakehouse:
Data Lake + ACID + Schema Enforcement + Time Travel
  (All in one layer)

Key Lakehouse Properties

FeatureData LakeData WarehouseLakehouse
Storage CostLowHighLow
Schema EnforcementNoneStrictFlexible
ACID TransactionsNoYesYes
Time TravelNoLimitedYes
FormatParquet/ORCProprietaryOpen (Parquet)
Streaming SupportGoodPoorGood

ℹ️

Key Insight: The lakehouse eliminates the need for separate lake and warehouse by adding warehouse-like features directly on top of open file formats.

2. Delta Lake

ACID Transactions on Data Lakes

from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DeltaLakehouse") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Create Delta table with ACID
df = spark.read.parquet("s3://raw/events/")
df.write.format("delta") \
    .mode("overwrite") \
    .save("s3://lakehouse/events")

# MERGE (upsert) β€” impossible with plain Parquet
delta_table = DeltaTable.forPath(spark, "s3://lakehouse/events")

new_data = spark.read.parquet("s3://raw/events_new/")

delta_table.alias("target").merge(
    new_data.alias("source"),
    "target.event_id = source.event_id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()

# Time Travel β€” query historical data
df_yesterday = spark.read.format("delta") \
    .option("versionAsOf", 123) \
    .load("s3://lakehouse/events")

# Or by timestamp
df_jan1 = spark.read.format("delta") \
    .option("timestampAsOf", "2024-01-01") \
    .load("s3://lakehouse/events")

# VACUUM β€” clean up old files
delta_table.vacuum(retentionHours=168)  # 7 days

Delta Lake Internals

Architecture Diagram
Delta Table on Storage:
s3://lakehouse/events/
β”œβ”€β”€ _delta_log/
β”‚   β”œβ”€β”€ 00000000000000000000.json  (commit 0)
β”‚   β”œβ”€β”€ 00000000000000000001.json  (commit 1)
β”‚   └── 00000000000000000002.json  (commit 2)
β”œβ”€β”€ part-00000-...-.parquet
β”œβ”€β”€ part-00001-...-.parquet
└── _change_data/                  (CDC files)
    β”œβ”€β”€ cdc-00000-...-.parquet

3. Apache Iceberg

Table Format

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("IcebergLakehouse") \
    .config("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.iceberg.type", "hadoop") \
    .config("spark.sql.catalog.iceberg.warehouse", "s3://lakehouse/iceberg") \
    .getOrCreate()

# Create Iceberg table
spark.sql("""
    CREATE TABLE iceberg.db.events (
        event_id BIGINT,
        user_id BIGINT,
        event_type STRING,
        event_timestamp TIMESTAMP,
        properties MAP<STRING, STRING>
    ) USING iceberg
    PARTITIONED BY (days(event_timestamp))
    TBLPROPERTIES (
        'format-version' = '2',
        'write.parquet.compression-codec' = 'zstd'
    )
""")

# Schema evolution (safe, column-level)
spark.sql("""
    ALTER TABLE iceberg.db.events ADD COLUMN new_field STRING AFTER user_id
""")

# Hidden partitioning (auto-pruning)
spark.sql("""
    SELECT * FROM iceberg.db.events
    WHERE event_timestamp >= '2024-01-01'
    LIMIT 100
""")

# Time travel with snapshots
spark.sql("""
    SELECT * FROM iceberg.db.events
    TIMESTAMP AS OF '2024-01-15 10:00:00'
""")

# Snapshot inspection
spark.sql("SELECT * FROM iceberg.db.events.snapshots")

Iceberg Internals

Architecture Diagram
Iceberg Table Metadata:
s3://lakehouse/iceberg/db/events/
β”œβ”€β”€ metadata/
β”‚   β”œβ”€β”€ v1.metadata.json          (table metadata)
β”‚   β”œβ”€β”€ v2.metadata.json
β”‚   β”œβ”€β”€ snap-1234567890.avro      (snapshot list)
β”‚   └── 00000-1-uuid.manifest     (manifest file)
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ 00000-0-uuid.parquet      (data files)
β”‚   └── 00001-1-uuid.parquet
└── schema.json

4. Apache Hudi

Upsert-Optimized Tables

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("HudiLakehouse") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.sql.hive.convertMetastoreParquet", "false") \
    .getOrCreate()

hudi_options = {
    'hoodie.table.name': 'events',
    'hoodie.datasource.write.recordkey.field': 'event_id',
    'hoodie.datasource.write.precombine.field': 'event_timestamp',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.compact.inline.max.delta.commits': '5',
}

# Upsert data
df.write.format("hudi") \
    .options(**hudi_options) \
    .mode("append") \
    .save("s3://lakehouse/hudi/events")

# Read with time travel
spark.read.format("hudi") \
    .option("as.of.instant", "20240115000000") \
    .load("s3://lakehouse/hudi/events")

# Incremental read
spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "incremental") \
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000") \
    .option("hoodie.datasource.read.end.instanttime", "20240131000000") \
    .load("s3://lakehouse/hudi/events")

5. Lakehouse Comparison

FeatureDelta LakeIcebergHudi
ACID TransactionsYesYesYes
Schema EvolutionYesYes (better)Yes
Time TravelYesYesYes
Hidden PartitioningNoYesYes
Upsert PerformanceGoodGoodExcellent
Streaming ReadsYesYesYes
Open SourceYes (OSS)Yes (ASF)Yes (ASF)
Primary BackerDatabricksNetflixUber

⚠️

Common Interview Trap: Don't say "Iceberg is better than Delta" or vice versa. Each has tradeoffs β€” Iceberg has better hidden partitioning, Delta has better streaming, Hudi has better upserts.

6. Lakehouse Best Practices

File Optimization

# Optimize file sizes (aim for 128MB-1GB per file)
spark.sql("""
    OPTIMIZE delta.`s3://lakehouse/events`
    ZORDER BY (user_id, event_timestamp)
""")

# Iceberg compaction
spark.sql("""
    CALL iceberg.system.rewrite_data_files(
        table => 'db.events',
        options => map('target-file-size-bytes', '134217728')
    )
""")

Partition Strategy

Architecture Diagram
Partition by date (not user_id):
βœ… Date pruning for time-range queries
βœ… Even file sizes (events distributed by time)
❌ User-specific queries scan all partitions

Partition by user_id:
βœ… User lookups are fast
❌ Hot partitions (power users)
❌ Time-range queries scan all partitions

Best: Partition by date, Z-order by user_id

ℹ️

Best Practice: Use Z-ORDER or data clustering for multi-dimensional queries. This creates zone maps that enable skipping irrelevant data blocks.

Follow-Up Questions

  1. How would you migrate from Parquet to Delta Lake without downtime?
  2. Compare write amplification between COW and MOR table types.
  3. Design a lakehouse architecture for 10PB of streaming + batch data.
  4. How do you handle GDPR deletions in a lakehouse?
  5. Design a cost optimization strategy for a lakehouse on S3.

Advertisement