Delta Lake on Azure: Auto Loader & Databricks
ACID transactions, time travel, and streaming with Delta Lake on Azure Databricks
Delta Lake Architecture
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β DELTA LAKE ARCHITECTURE β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β DELTA TABLE LAYOUT β β
β β β β
β β /curated/fact_sales/ β β
β β βββ _delta_log/ β β
β β β βββ 00000000000000000000.json (version 0) β β
β β β βββ 00000000000000000001.json (version 1) β β
β β β βββ 00000000000000000002.json (version 2) β β
β β βββ part-00000-xxxx.parquet β β
β β βββ part-00001-xxxx.parquet β β
β β βββ part-00002-xxxx.parquet β β
β β β β
β β TRANSACTION LOG: β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β { β β β
β β β "add": { β β β
β β β "path": "part-00000-xxxx.parquet", β β β
β β β "size": 1048576, β β β
β β β "modificationTime": 1705305600000, β β β
β β β "dataChange": true, β β β
β β β "stats": "{\"numRecords\":10000}" β β β
β β β } β β β
β β β } β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β KEY FEATURES: β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β’ ACID Transactions (serializable isolation) β β
β β β’ Schema Enforcement & Evolution β β
β β β’ Time Travel (versioning) β β
β β β’ Data Skipping (Z-ORDER) β β
β β β’ Auto Compaction (OPTIMIZE) β β
β β β’ Streaming (Structured Streaming + Delta) β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Auto Loader (Structured Streaming)
# Auto Loader for incremental ingestion
df = spark.readStream \
.format("cloudFiles") \
.option("cloudFiles.format", "parquet") \
.option("cloudFiles.schemaLocation", "abfss://schema@stdatalake001.dfs.core.windows.net/sales") \
.option("cloudFiles.inferColumnTypes", "true") \
.option("cloudFiles.schemaEvolutionMode", "addNewColumns") \
.load("abfss://raw@stdatalake001.dfs.core.windows.net/sales/")
# Write to Delta with merge
df.writeStream \
.format("delta") \
.outputMode("append") \
.option("checkpointLocation", "abfss://checkpoints@stdatalake001.dfs.core.windows.net/sales") \
.option("mergeSchema", "true") \
.trigger(availableNow=True) \
.toTable("sales.fact_sales")
Time Travel
# Query previous version
df_v0 = spark.read.format("delta").option("versionAsOf", 0).load("/delta/fact_sales")
# Compare versions
spark.read.format("delta").option("versionAsOf", 0).load("/delta/fact_sales") \
.exceptAll(spark.read.format("delta").option("versionAsOf", 2).load("/delta/fact_sales"))
# Restore previous version
delta_table = DeltaTable.forPath(spark, "/delta/fact_sales")
delta_table.restoreToVersion(0)
OPTIMIZE and Z-ORDER
# Optimize table (compaction)
spark.sql("OPTIMIZE delta.`/delta/fact_sales`")
# Z-ORDER on frequently queried columns
spark.sql("OPTIMIZE delta.`/delta/fact_sales` ZORDER BY (customer_key, sale_date)")
# Vacuum old files (keep 7 days)
spark.sql("VACUUM delta.`/delta/fact_sales` RETAIN 168 HOURS")
βΉοΈ
Pro Tip: Use Auto Loader for streaming ingestion with schema evolution. It handles file discovery, schema inference, and checkpoint management automatically.
Interview Questions
Q1: What is the difference between Auto Loader and manual file detection? A: Auto Loader uses cloud file notification (Event Grid) for near-real-time file detection with automatic checkpointing. Manual detection requires polling or custom logic, which is more complex and less reliable.
Q2: How do you handle schema evolution in Delta Lake? A: Use mergeSchema option for additive changes, overwriteSchema for breaking changes. Auto Loader supports automatic schema evolution with cloudFiles.schemaEvolutionMode.
Q3: What is the VACUUM command and when should you use it? A: VACUUM removes old files no longer referenced by the transaction log. Use it to reclaim storage space. Default retention is 7 days. Never VACUUM before ensuring all readers have finished.