Data Lakehouse Architecture with PySpark
Architecture Diagram: Lakehouse Stack
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β DATA LAKEHOUSE ARCHITECTURE β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ£
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β CONSUMPTION LAYER β β
β β ββββββββββββ ββββββββββββ ββββββββββββ ββββββββββββ βββββββββββ β β
β β β BI & β β ML β β Ad-Hoc β β Real- β β Data β β β
β β β Reports β β Models β β Query β β Time β β Apps β β β
β β ββββββββββββ ββββββββββββ ββββββββββββ ββββββββββββ βββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β² β
β β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β SEMANTIC / GOVERNANCE LAYER β β
β β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ ββββββββββββββ β β
β β β Unity β β Data β β Column β β Access β β β
β β β Catalog β β Lineage β β Masking β β Control β β β
β β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ ββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β² β
β β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β TABLE FORMAT LAYER (ACID) β β
β β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ ββββββββββββββ β β
β β β Delta Lake β β Apache β β Apache β β Hudi β β β
β β β (Databricks)β β Iceberg β β Hudi β β (Uber) β β β
β β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ ββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β² β
β β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β STORAGE LAYER β β
β β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ ββββββββββββββ β β
β β β S3 / ADLS β β GCS β β MinIO β β HDFS β β β
β β β Object β β Object β β S3-Compat β β Distributedβ β β
β β β Store β β Store β β Store β β FS β β β
β β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ ββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Architecture Diagram: Medallion Pattern (Bronze β Silver β Gold)
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β MEDALLION MULTI-HOP ARCHITECTURE β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β EXTERNAL DATA SOURCES β
β ββββββββββ ββββββββββ ββββββββββ ββββββββββ ββββββββββ ββββββββββ β
β β IoT β β ERP β β CRM β β Logs β β APIs β β Files β β
β βSensors β β System β β System β β Stream β βFeeds β β (CSV) β β
β βββββ¬βββββ βββββ¬βββββ βββββ¬βββββ βββββ¬βββββ βββββ¬βββββ βββββ¬βββββ β
β β β β β β β β
β ββββββββββββ΄βββββββββββ΄βββββββ¬ββββ΄βββββββββββ΄βββββββββββ β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β BRONZE LAYER (Raw Ingestion) β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β β’ Append-only snapshots of source data β β β
β β β β’ Schema-on-read, minimal transformation β β β
β β β β’ Full audit trail with ingestion timestamps β β β
β β β β’ Data quality: Raw, unvalidated, untransformed β β β
β β β β’ Retention: Permanent archival for compliance β β β
β β β β’ Format: Parquet/ORC with Delta Lake transaction log β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βββββββ΄ββββββ β
β β Bronze β β β
β β Silver β β
β βββββββ¬ββββββ β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β SILVER LAYER (Cleansed & Conformed) β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β β’ Deduplicated, cleansed, validated data β β β
β β β β’ Standardized schemas and naming conventions β β β
β β β β’ Conformed dimensions for cross-domain joins β β β
β β β β’ Data quality enforcement via constraints β β β
β β β β’ Type-2 SCD for temporal tracking β β β
β β β β’ Query-optimized with Z-ORDER / data skipping β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βββββββ΄ββββββ β
β β Silver β β β
β β Gold β β
β βββββββ¬ββββββ β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β GOLD LAYER (Business-Level Aggregates) β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β β’ Business-level aggregates and KPIs β β β
β β β β’ Dimensionally modeled star/snowflake schemas β β β
β β β β’ Ready for direct consumption by BI/ML tools β β β
β β β β’ Materialized views for low-latency queries β β β
β β β β’ Pre-computed joins and aggregations β β β
β β β β’ Consistent business definitions across domains β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βββββββββββββββββΌββββββββββββββββ β
β βΌ βΌ βΌ β
β ββββββββββββ ββββββββββββ ββββββββββββ β
β β BI β β ML β β Data β β
β β Dashboardsβ β Pipelinesβ β Apps β β
β ββββββββββββ ββββββββββββ ββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Architecture Diagram: ACID Transaction Flow on Data Lake
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β DELTA LAKE ACID TRANSACTION LIFECYCLE β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β WRITER 1 ββββββ WRITER 2 ββββββ WRITER 3 ββββββ β
β (INSERT) β (UPDATE) β (DELETE) β β
β β β β β β β β
β βΌ β βΌ β βΌ β β
β βββββββββ β βββββββββ β βββββββββ β β
β β Acquireβ β β Acquireβ β β Acquireβ β β
β β LOCK β β β LOCK β β β LOCK β β β
β βββββ¬ββββ β βββββ¬ββββ β βββββ¬ββββ β β
β β β β β β β β
β βΌ β βΌ β βΌ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β OPTIMISTIC CONCURRENCY CONTROL β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β
β β β 1. Read current snapshot (read timestamp) β β β β
β β β 2. Build transaction plan (read set + write set) β β β β
β β β 3. Validate: No conflicting writes since read time β β β β
β β β β’ Check _delta_log/ for new commits β β β β
β β β β’ Compare read-set versions β β β β
β β β 4. If conflict β ABORT, retry with new snapshot β β β β
β β β 5. If no conflict β PROCEED to commit β β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β β β β β β
β βΌ β βΌ β βΌ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β WRITE TO TRANSACTION LOG β β β
β β _delta_log/ β β β
β β βββ 00000000000000000000.json (Commit 0: Initial) β β β
β β βββ 00000000000000000001.json (Commit 1: Writer 1) β β β
β β βββ 00000000000000000002.json (Commit 2: Writer 2) β β β
β β βββ 00000000000000000003.json (Commit 3: Writer 3) β β β
β β βββ _last_checkpoint (Checkpoint pointer) β β β
β β βββ CHECKPOINT/ (Parquet checkpoints for fast reads) β β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β β β β β β
β βΌ β βΌ β βΌ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β DATA FILES (Parquet on Object Store) β β β
β β βββββββββββ βββββββββββ βββββββββββ βββββββββββ β β β
β β β part- β β part- β β part- β β part- β β β β
β β β 00000 β β 00001 β β 00002 β β 00003 β β β β
β β β .snappy β β .snappy β β .snappy β β .snappy β β β β
β β β .parquetβ β .parquetβ β .parquetβ β .parquetβ β β β
β β βββββββββββ βββββββββββ βββββββββββ βββββββββββ β β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Detailed Explanation
The Data Lakehouse represents a paradigm shift in enterprise data architecture, merging the flexibility and cost-efficiency of data lakes with the transactional guarantees and performance of data warehouses. This convergence eliminates the traditional "two-system" problem where organizations maintained separate systems for analytical and operational workloads, each with its own copy of data, governance model, and query engine.
At its foundation, the Lakehouse relies on an open table format (Delta Lake, Apache Iceberg, or Apache Hudi) that provides ACID transactions, schema enforcement, time travel, and metadata management directly on top of cheap object storage (S3, ADLS, GCS). The transaction log (_delta_log/ for Delta Lake) serves as the single source of truth, recording every mutation as an immutable, ordered JSON entry. This log enables optimistic concurrency control β multiple writers can operate simultaneously, with conflicts detected and resolved at commit time through a compare-and-swap mechanism on read-set versions.
The Medallion Architecture organizes data into three distinct tiers (Bronze, Silver, Gold), each with specific quality guarantees and transformation requirements. The Bronze layer ingests raw, unvalidated data from all sources with minimal transformation, preserving the original format for auditability and replay capability. The Silver layer applies cleansing, deduplication, standardization, and schema enforcement, producing a "single source of truth" that conforms to enterprise data models. The Gold layer contains business-level aggregates, dimensional models, and materialized views optimized for consumption by BI tools, ML pipelines, and applications.
Each layer provides specific data quality guarantees: Bronze accepts any valid data regardless of quality; Silver enforces NOT NULL constraints, data type validation, referential integrity checks, and temporal correctness; Gold guarantees that aggregations are idempotent, dimensions are conformed, and business rules are applied consistently across all domains.
Time travel (also called data versioning) is a critical capability enabled by the transaction log. Every commit creates a snapshot that can be queried as of a specific timestamp or version number. This enables use cases such as regulatory audit trails (reproduce any historical state), debugging data quality issues (compare current vs. previous state), point-in-time joins (join two tables as of the same logical moment), and rollback capabilities (revert to any previous consistent state).
Schema evolution and enforcement work together to maintain data quality while accommodating changing business requirements. Schema enforcement validates that incoming data matches the existing schema and rejects incompatible writes. Schema evolution allows controlled modifications (adding columns, changing data types, renaming fields) through explicit ALTER commands or write-mode configurations (mergeSchema option).
The table format layer also provides advanced optimization features: data skipping through file-level statistics (min/max values stored in Parquet footers and the transaction log), Z-ORDER multi-dimensional clustering for efficient point queries on correlated columns, compaction (OPTIMIZE command) to merge small files into larger ones for better read performance, and liquid clustering for adaptive, automatic data layout optimization.
Mathematical Foundations
Definition: Lakehouse Architecture
A lakehouse combines data lake storage (cheap, scalable object storage) with data warehouse semantics (ACID, schema enforcement, governance) such that:
Medallion Pattern Flow
Data flows through three tiers:
Each tier applies transformations : where .
ACID on Data Lake Theorem
Transaction log ensures atomicity and isolation for concurrent operations on lake storage:
Consistency is maintained because each operation is a commit that atomically adds/removes file references.
Storage Cost Model
Total cost for lakehouse with tiers:
Typical ratio: .
Data Freshness
End-to-end freshness from ingestion to Gold tier:
Target: .
Key Insight
The lakehouse pattern eliminates the "copy tax" of traditional architectures where data must be moved between systems. Delta Lake/Iceberg/Hudi provide the transaction layer that makes ACID possible on object storage.
Summary
The lakehouse unifies lake storage with warehouse semantics through transaction logs. The medallion pattern organizes data by quality tiers. Storage costs follow tiered pricing models, and freshness is the sum of pipeline stage latencies. ACID guarantees eliminate the need for separate OLAP systems.
Key Concepts Table
| Concept | Delta Lake | Apache Iceberg | Apache Hudi | Description |
|---|---|---|---|---|
| ACID Transactions | Yes (MVCC) | Yes (snapshot isolation) | Yes (copy-on-write / merge-on-read) | Transactional semantics on object storage |
| Schema Evolution | Yes (mergeSchema) | Yes (evolve schema) | Yes (schema evolution) | Safe schema changes without rewriting data |
| Time Travel | Yes (VERSION AS OF) | Yes (snapshots) | Yes (timeline) | Query historical data states |
| Partition Evolution | Yes (dynamic) | Yes (hidden partitions) | Yes (evolvable) | Change partitioning without rewriting |
| Conflict Resolution | Optimistic concurrency | Optimistic concurrency | Optimistic concurrency | Handle concurrent writers gracefully |
| Open Format | Yes (Parquet + JSON log) | Yes (Parquet + metadata) | Yes (Parquet + metadata) | No vendor lock-in |
| Streaming Support | Yes (Structured Streaming) | Yes (Spark/Flink) | Yes (Spark/Flink) | Real-time ingestion and processing |
| DML Operations | INSERT, UPDATE, DELETE, MERGE | INSERT, UPDATE, DELETE, MERGE | INSERT, UPDATE, DELETE, UPSERT | Full SQL-compatible data manipulation |
| Caching | Yes (Delta Cache) | No (relies on storage) | No (relies on storage) | Local caching for performance |
| Statistics | Column-level stats | Partition-level stats | Column-level stats | Data skipping for query optimization |
Code Examples
Example 1: Setting Up a Delta Lake Table with Medallion Pattern
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from delta import configure_spark_with_delta_pip
# Initialize Spark with Delta Lake
builder = (
SparkSession.builder
.appName("Lakehouse-Medallion")
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
.config("spark.databricks.delta.schema.autoMerge.enabled", "true")
.config("spark.sql.adaptive.enabled", "true")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()
# βββ BRONZE LAYER: Raw Ingestion βββ
# Ingest raw transaction data with full audit metadata
raw_transactions = (
spark.read
.format("json")
.option("mode", "PERMISSIVE")
.option("columnNameOfCorruptRecord", "_corrupt_record")
.load("/mnt/raw/transactions/*.json")
.withColumn("_ingestion_timestamp", current_timestamp())
.withColumn("_source_system", lit("erp_system"))
.withColumn("_batch_id", lit("batch_20260101"))
)
# Write to Bronze with append-only semantics
(
raw_transactions
.write
.format("delta")
.mode("append")
.partitionBy("_source_system")
.save("/mnt/lakehouse/bronze/transactions")
)
# βββ SILVER LAYER: Cleansed & Conformed βββ
# Read Bronze, apply cleansing rules, write to Silver
bronze_df = spark.read.format("delta").load("/mnt/lakehouse/bronze/transactions")
silver_transactions = (
bronze_df
.filter(col("_corrupt_record").isNull()) # Remove corrupt records
.drop("_corrupt_record", "_ingestion_timestamp")
.withColumn("amount", col("amount").cast(DecimalType(18, 2)))
.withColumn("transaction_date", to_date(col("timestamp")))
.withColumn("customer_id", upper(trim(col("customer_id"))))
.dropDuplicates(["transaction_id"])
.withColumn("_quality_score",
when(col("amount") > 0, lit(1.0))
.when(col("amount") < 0, lit(0.8))
.otherwise(lit(0.5))
)
)
# Merge into Silver (upsert for idempotency)
(
spark.read.format("delta").load("/mnt/lakehouse/silver/transactions")
.alias("target")
.merge(
silver_transactions.alias("source"),
"target.transaction_id = source.transaction_id"
)
.whenMatchedUpdateAll()
.whenNotMatchedInsertAll()
.execute()
)
# βββ GOLD LAYER: Business Aggregates βββ
# Compute daily revenue metrics per region
gold_daily_revenue = (
spark.read.format("delta").load("/mnt/lakehouse/silver/transactions")
.groupBy(
col("transaction_date"),
col("region"),
col("product_category")
)
.agg(
count("*").alias("transaction_count"),
sum("amount").alias("total_revenue"),
avg("amount").alias("avg_transaction_value"),
approx_count_distinct("customer_id").alias("unique_customers"),
stddev("amount").alias("revenue_stddev")
)
.withColumn("revenue_per_customer", col("total_revenue") / col("unique_customers"))
)
(
gold_daily_revenue
.write
.format("delta")
.mode("overwrite")
.partitionBy("transaction_date")
.option("overwriteSchema", "true")
.save("/mnt/lakehouse/gold/daily_revenue")
)
Example 2: Advanced ACID Operations and Time Travel
from delta.tables import DeltaTable
from pyspark.sql.functions import *
# Load existing Delta table
transactions_table = DeltaTable.forPath(spark, "/mnt/lakehouse/silver/transactions")
# βββ MERGE (UPSERT) with Complex Logic βββ
new_data = spark.read.format("delta").load("/mnt/staging/transactions_updates")
(
transactions_table.alias("target")
.merge(
new_data.alias("source"),
"target.transaction_id = source.transaction_id"
)
.whenMatchedUpdate(
condition="source.update_type = 'CORRECTION'",
set={
"amount": col("source.amount"),
"description": col("source.description"),
"_last_modified": current_timestamp()
}
)
.whenMatchedDelete(
condition="source.update_type = 'REVERSAL'"
)
.whenNotMatchedInsert(
values={
"transaction_id": col("source.transaction_id"),
"customer_id": col("source.customer_id"),
"amount": col("source.amount"),
"product_category": col("source.product_category"),
"region": col("source.region"),
"_created_at": current_timestamp(),
"_last_modified": current_timestamp()
}
)
.execute()
)
# βββ TIME TRAVEL: Query Historical States βββ
# Query as of a specific timestamp
historical_df = (
spark.read
.format("delta")
.option("timestampAsOf", "2026-01-01")
.load("/mnt/lakehouse/silver/transactions")
)
# Query as of a specific version
version_df = (
spark.read
.format("delta")
.option("versionAsOf", 42)
.load("/mnt/lakehouse/silver/transactions")
)
# Compare two versions to detect changes
current_df = spark.read.format("delta").load("/mnt/lakehouse/silver/transactions")
changes_df = (
current_df.alias("current")
.join(
version_df.alias("old"),
"transaction_id",
"full_outer"
)
.withColumn("change_type",
when(col("old.transaction_id").isNull(), lit("INSERTED"))
.when(col("current.transaction_id").isNull(), lit("DELETED"))
.when(col("current.amount") != col("old.amount"), lit("UPDATED"))
.otherwise(lit("UNCHANGED"))
)
.filter(col("change_type") != "UNCHANGED")
)
# βββ VACUUM: Remove Old Files βββ
# Set retention interval (default 7 days, minimum 24 hours)
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
transactions_table.vacuum(retentionHours=168) # 7 days
# βββ OPTIMIZE: Compact Small Files βββ
(
transactions_table.optimize()
.executeCompaction()
)
# Z-ORDER for multi-dimensional clustering
(
transactions_table.optimize()
.executeZOrderBy("customer_id", "product_category")
)
Example 3: Apache Iceberg Integration with PySpark
# Configure Spark for Iceberg
spark = (
SparkSession.builder
.appName("Iceberg-Lakehouse")
.config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
.config("spark.sql.catalog.my_catalog.type", "hadoop")
.config("spark.sql.catalog.my_catalog.warehouse", "/mnt/lakehouse/iceberg")
.config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.SparkSessionExtensions")
.getOrCreate()
)
# Create Iceberg table with advanced features
spark.sql("""
CREATE TABLE IF NOT EXISTS my_catalog.sales (
transaction_id STRING,
customer_id STRING,
amount DECIMAL(18, 2),
product_category STRING,
region STRING,
transaction_date DATE,
_created_at TIMESTAMP
)
USING iceberg
PARTITIONED BY (transaction_date)
TBLPROPERTIES (
'format-version' = '2',
'write.format.default' = 'parquet',
'write.parquet.compression-codec' = 'zstd',
'write.target-file-size-bytes' = '134217728',
'commit.manifest-merge.enabled' = 'true'
)
""")
# Schema evolution: add a new column
spark.sql("""
ALTER TABLE my_catalog.sales
ADD COLUMN payment_method STRING AFTER region
""")
# Partition evolution (no data rewrite required)
spark.sql("""
ALTER TABLE my_catalog.sales
DROP PARTITION FIELD transaction_date
""")
spark.sql("""
ALTER TABLE my_catalog.sales
ADD PARTITION FIELD bucket(16, customer_id)
""")
# Snapshot-based time travel
spark.sql("""
SELECT * FROM my_catalog.sales
TIMESTAMP AS OF '2026-01-01 00:00:00'
WHERE region = 'US'
""")
# Expire old snapshots
spark.sql("""
CALL my_catalog.system.expire_snapshots(
table => 'sales',
older_than => TIMESTAMP '2025-12-01 00:00:00',
retain_last => 5
)
""")
Performance Metrics
| Metric | Before Optimization | After Optimization | Improvement |
|---|---|---|---|
| Average File Size | 1 MB (small files) | 128 MB (optimized) | 128x fewer files |
| Query Latency (P50) | 45 seconds | 2.3 seconds | 95% reduction |
| Query Latency (P99) | 320 seconds | 8.7 seconds | 97% reduction |
| Storage Cost (per TB) | 12/month | 48% reduction | |
| Write Throughput | 50 MB/s | 450 MB/s | 9x improvement |
| Concurrent Writers | 2 (frequent conflicts) | 12 (rare conflicts) | 6x scaling |
| Time Travel Query | N/A | 3.1 seconds | New capability |
| Compaction Duration | N/A | 8.5 min/100GB | New capability |
| Data Skipped (%) | 0% | 85% (Z-ORDER) | New capability |
| Snapshot Retention | Unlimited | Configurable | Space savings |
Best Practices
-
Always use the Medallion Architecture β Maintain clear separation between Bronze (raw), Silver (cleansed), and Gold (aggregated) layers with explicit quality gates between them.
-
Enable Adaptive Query Execution (AQE) β Set
spark.sql.adaptive.enabled=trueto allow Spark to dynamically optimize shuffle partitions, join strategies, and skew handling at runtime. -
Configure file compaction proactively β Run
OPTIMIZEregularly on Silver/Gold tables to merge small files; target file sizes of 128β512 MB for optimal read performance. -
Apply Z-ORDER clustering on frequently filtered columns β Use
OPTIMIZE ZORDER BYon columns commonly used in WHERE clauses, JOINs, and GROUP BY operations for data skipping. -
Set appropriate vacuum retention β Configure vacuum retention to match your compliance requirements; use
VACUUMto reclaim storage from old files while maintaining the ability to time-travel. -
Use MERGE for idempotent writes β Always use
MERGE(upsert) instead ofINSERTfor Silver-layer writes to ensure idempotency and prevent duplicate records on job retries. -
Monitor file statistics β Use
DESCRIBE DETAILandDESCRIBE HISTORYto monitor table health, file counts, and commit frequency; set alerts for excessive small file counts. -
Partition wisely β Partition by low-cardinality, high-filtration columns (date, region); avoid partitioning by high-cardinality columns (customer_id, transaction_id) which create excessive small files.
-
Leverage schema evolution carefully β Use
mergeSchemaoption sparingly and always validate schema changes against downstream consumers before applying; maintain a schema registry for governance. -
Implement data quality gates β Use tools like Great Expectations, Deequ, or custom validation in the Silver layer to ensure data meets quality standards before promotion to Gold.
-
Version your data models β Tag Gold-layer tables with version numbers and maintain backward compatibility; use schema evolution to add columns rather than modifying existing ones.
-
Optimize for your workload β For read-heavy workloads, prioritize compaction and Z-ORDER; for write-heavy workloads, use larger file target sizes and consider liquid clustering for automatic optimization.
See also: Snowflake Time Travel (snowflake/02), Kafka CDC (kafka/04), Airflow DAGs (airflow/02)