πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Delta Lake on Azure: Auto Loader & Databricks

Azure Data EngineeringDelta Lake⭐ Premium

Advertisement

Delta Lake on Azure: Auto Loader & Databricks

ACID transactions, time travel, and streaming with Delta Lake on Azure Databricks

Delta Lake Architecture

Architecture Diagram
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    DELTA LAKE ARCHITECTURE                           β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚                    DELTA TABLE LAYOUT                        β”‚   β”‚
β”‚  β”‚                                                               β”‚   β”‚
β”‚  β”‚  /curated/fact_sales/                                        β”‚   β”‚
β”‚  β”‚  β”œβ”€β”€ _delta_log/                                            β”‚   β”‚
β”‚  β”‚  β”‚   β”œβ”€β”€ 00000000000000000000.json  (version 0)            β”‚   β”‚
β”‚  β”‚  β”‚   β”œβ”€β”€ 00000000000000000001.json  (version 1)            β”‚   β”‚
β”‚  β”‚  β”‚   └── 00000000000000000002.json  (version 2)            β”‚   β”‚
β”‚  β”‚  β”œβ”€β”€ part-00000-xxxx.parquet                               β”‚   β”‚
β”‚  β”‚  β”œβ”€β”€ part-00001-xxxx.parquet                               β”‚   β”‚
β”‚  β”‚  └── part-00002-xxxx.parquet                               β”‚   β”‚
β”‚  β”‚                                                               β”‚   β”‚
β”‚  β”‚  TRANSACTION LOG:                                            β”‚   β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚   β”‚
β”‚  β”‚  β”‚ {                                                     β”‚    β”‚   β”‚
β”‚  β”‚  β”‚   "add": {                                            β”‚    β”‚   β”‚
β”‚  β”‚  β”‚     "path": "part-00000-xxxx.parquet",               β”‚    β”‚   β”‚
β”‚  β”‚  β”‚     "size": 1048576,                                  β”‚    β”‚   β”‚
β”‚  β”‚  β”‚     "modificationTime": 1705305600000,                β”‚    β”‚   β”‚
β”‚  β”‚  β”‚     "dataChange": true,                               β”‚    β”‚   β”‚
β”‚  β”‚  β”‚     "stats": "{\"numRecords\":10000}"                 β”‚    β”‚   β”‚
β”‚  β”‚  β”‚   }                                                   β”‚    β”‚   β”‚
β”‚  β”‚  β”‚ }                                                     β”‚    β”‚   β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                     β”‚
β”‚  KEY FEATURES:                                                      β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚ β€’ ACID Transactions (serializable isolation)                β”‚   β”‚
β”‚  β”‚ β€’ Schema Enforcement & Evolution                            β”‚   β”‚
β”‚  β”‚ β€’ Time Travel (versioning)                                  β”‚   β”‚
β”‚  β”‚ β€’ Data Skipping (Z-ORDER)                                   β”‚   β”‚
β”‚  β”‚ β€’ Auto Compaction (OPTIMIZE)                                β”‚   β”‚
β”‚  β”‚ β€’ Streaming (Structured Streaming + Delta)                  β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Auto Loader (Structured Streaming)

# Auto Loader for incremental ingestion
df = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "parquet") \
    .option("cloudFiles.schemaLocation", "abfss://schema@stdatalake001.dfs.core.windows.net/sales") \
    .option("cloudFiles.inferColumnTypes", "true") \
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns") \
    .load("abfss://raw@stdatalake001.dfs.core.windows.net/sales/")

# Write to Delta with merge
df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "abfss://checkpoints@stdatalake001.dfs.core.windows.net/sales") \
    .option("mergeSchema", "true") \
    .trigger(availableNow=True) \
    .toTable("sales.fact_sales")

Time Travel

# Query previous version
df_v0 = spark.read.format("delta").option("versionAsOf", 0).load("/delta/fact_sales")

# Compare versions
spark.read.format("delta").option("versionAsOf", 0).load("/delta/fact_sales") \
    .exceptAll(spark.read.format("delta").option("versionAsOf", 2).load("/delta/fact_sales"))

# Restore previous version
delta_table = DeltaTable.forPath(spark, "/delta/fact_sales")
delta_table.restoreToVersion(0)

OPTIMIZE and Z-ORDER

# Optimize table (compaction)
spark.sql("OPTIMIZE delta.`/delta/fact_sales`")

# Z-ORDER on frequently queried columns
spark.sql("OPTIMIZE delta.`/delta/fact_sales` ZORDER BY (customer_key, sale_date)")

# Vacuum old files (keep 7 days)
spark.sql("VACUUM delta.`/delta/fact_sales` RETAIN 168 HOURS")

ℹ️

Pro Tip: Use Auto Loader for streaming ingestion with schema evolution. It handles file discovery, schema inference, and checkpoint management automatically.

Interview Questions

Q1: What is the difference between Auto Loader and manual file detection? A: Auto Loader uses cloud file notification (Event Grid) for near-real-time file detection with automatic checkpointing. Manual detection requires polling or custom logic, which is more complex and less reliable.

Q2: How do you handle schema evolution in Delta Lake? A: Use mergeSchema option for additive changes, overwriteSchema for breaking changes. Auto Loader supports automatic schema evolution with cloudFiles.schemaEvolutionMode.

Q3: What is the VACUUM command and when should you use it? A: VACUUM removes old files no longer referenced by the transaction log. Use it to reclaim storage space. Default retention is 7 days. Never VACUUM before ensuring all readers have finished.

Advertisement