Data Lakehouse Architecture with PySpark

Free Lesson

Advertisement

Data Lakehouse Architecture with PySpark

Architecture Diagram: Lakehouse Stack

Architecture Diagram
╔══════════════════════════════════════════════════════════════════════════════════╗
β•‘                         DATA LAKEHOUSE ARCHITECTURE                              β•‘
╠══════════════════════════════════════════════════════════════════════════════════╣
β•‘                                                                                  β•‘
β•‘  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β•‘
β•‘  β”‚                        CONSUMPTION LAYER                                β”‚     β•‘
β•‘  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚     β•‘
β•‘  β”‚  β”‚  BI &    β”‚  β”‚  ML      β”‚  β”‚  Ad-Hoc  β”‚  β”‚  Real-   β”‚  β”‚  Data   β”‚  β”‚     β•‘
β•‘  β”‚  β”‚  Reports β”‚  β”‚  Models  β”‚  β”‚  Query   β”‚  β”‚  Time    β”‚  β”‚  Apps   β”‚  β”‚     β•‘
β•‘  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚     β•‘
β•‘  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β•‘
β•‘                                      β–²                                            β•‘
β•‘                                      β”‚                                            β•‘
β•‘  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β•‘
β•‘  β”‚                       SEMANTIC / GOVERNANCE LAYER                        β”‚     β•‘
β•‘  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚     β•‘
β•‘  β”‚  β”‚   Unity      β”‚  β”‚   Data       β”‚  β”‚   Column     β”‚  β”‚   Access   β”‚  β”‚     β•‘
β•‘  β”‚  β”‚   Catalog    β”‚  β”‚   Lineage    β”‚  β”‚   Masking    β”‚  β”‚   Control  β”‚  β”‚     β•‘
β•‘  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚     β•‘
β•‘  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β•‘
β•‘                                      β–²                                            β•‘
β•‘                                      β”‚                                            β•‘
β•‘  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β•‘
β•‘  β”‚                       TABLE FORMAT LAYER (ACID)                          β”‚     β•‘
β•‘  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚     β•‘
β•‘  β”‚  β”‚  Delta Lake  β”‚  β”‚   Apache     β”‚  β”‚   Apache     β”‚  β”‚   Hudi     β”‚  β”‚     β•‘
β•‘  β”‚  β”‚  (Databricks)β”‚  β”‚   Iceberg    β”‚  β”‚   Hudi       β”‚  β”‚   (Uber)   β”‚  β”‚     β•‘
β•‘  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚     β•‘
β•‘  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β•‘
β•‘                                      β–²                                            β•‘
β•‘                                      β”‚                                            β•‘
β•‘  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β•‘
β•‘  β”‚                       STORAGE LAYER                                      β”‚     β•‘
β•‘  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚     β•‘
β•‘  β”‚  β”‚   S3 / ADLS  β”‚  β”‚   GCS        β”‚  β”‚   MinIO      β”‚  β”‚   HDFS     β”‚  β”‚     β•‘
β•‘  β”‚  β”‚   Object     β”‚  β”‚   Object     β”‚  β”‚   S3-Compat  β”‚  β”‚   Distributedβ”‚ β”‚     β•‘
β•‘  β”‚  β”‚   Store      β”‚  β”‚   Store      β”‚  β”‚   Store      β”‚  β”‚   FS       β”‚  β”‚     β•‘
β•‘  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚     β•‘
β•‘  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β•‘
β•‘                                                                                  β•‘
β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•

Architecture Diagram: Medallion Pattern (Bronze β†’ Silver β†’ Gold)

Architecture Diagram
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                        MEDALLION MULTI-HOP ARCHITECTURE                              β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                                     β”‚
β”‚   EXTERNAL DATA SOURCES                                                             β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”               β”‚
β”‚   β”‚  IoT   β”‚ β”‚  ERP   β”‚ β”‚  CRM   β”‚ β”‚  Logs  β”‚ β”‚  APIs  β”‚ β”‚  Files β”‚               β”‚
β”‚   β”‚Sensors β”‚ β”‚ System β”‚ β”‚ System β”‚ β”‚ Stream β”‚ β”‚Feeds   β”‚ β”‚ (CSV)  β”‚               β”‚
β”‚   β””β”€β”€β”€β”¬β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”¬β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”¬β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”¬β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”¬β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”¬β”€β”€β”€β”€β”˜               β”‚
β”‚       β”‚          β”‚          β”‚          β”‚          β”‚          β”‚                      β”‚
β”‚       β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                      β”‚
β”‚                                    β–Ό                                                 β”‚
β”‚   ╔═══════════════════════════════════════════════════════════════════╗             β”‚
β”‚   β•‘  BRONZE LAYER (Raw Ingestion)                                     β•‘             β”‚
β”‚   β•‘  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β•‘             β”‚
β”‚   β•‘  β”‚ β€’ Append-only snapshots of source data                      β”‚  β•‘             β”‚
β”‚   β•‘  β”‚ β€’ Schema-on-read, minimal transformation                    β”‚  β•‘             β”‚
β”‚   β•‘  β”‚ β€’ Full audit trail with ingestion timestamps                β”‚  β•‘             β”‚
β”‚   β•‘  β”‚ β€’ Data quality: Raw, unvalidated, untransformed             β”‚  β•‘             β”‚
β”‚   β•‘  β”‚ β€’ Retention: Permanent archival for compliance              β”‚  β•‘             β”‚
β”‚   β•‘  β”‚ β€’ Format: Parquet/ORC with Delta Lake transaction log       β”‚  β•‘             β”‚
β”‚   β•‘  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β•‘             β”‚
β”‚   β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•             β”‚
β”‚                                    β”‚                                                 β”‚
β”‚                              β”Œβ”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”                                          β”‚
β”‚                              β”‚  Bronze β†’  β”‚                                          β”‚
β”‚                              β”‚  Silver    β”‚                                          β”‚
β”‚                              β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜                                          β”‚
β”‚                                    β–Ό                                                 β”‚
β”‚   ╔═══════════════════════════════════════════════════════════════════╗             β”‚
β”‚   β•‘  SILVER LAYER (Cleansed & Conformed)                              β•‘             β”‚
β”‚   β•‘  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β•‘             β”‚
β”‚   β•‘  β”‚ β€’ Deduplicated, cleansed, validated data                    β”‚  β•‘             β”‚
β”‚   β•‘  β”‚ β€’ Standardized schemas and naming conventions               β”‚  β•‘             β”‚
β”‚   β•‘  β”‚ β€’ Conformed dimensions for cross-domain joins                β”‚  β•‘             β”‚
β”‚   β•‘  β”‚ β€’ Data quality enforcement via constraints                  β”‚  β•‘             β”‚
β”‚   β•‘  β”‚ β€’ Type-2 SCD for temporal tracking                          β”‚  β•‘             β”‚
β”‚   β•‘  β”‚ β€’ Query-optimized with Z-ORDER / data skipping              β”‚  β•‘             β”‚
β”‚   β•‘  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β•‘             β”‚
β”‚   β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•             β”‚
β”‚                                    β”‚                                                 β”‚
β”‚                              β”Œβ”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”                                          β”‚
β”‚                              β”‚  Silver β†’  β”‚                                          β”‚
β”‚                              β”‚  Gold      β”‚                                          β”‚
β”‚                              β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜                                          β”‚
β”‚                                    β–Ό                                                 β”‚
β”‚   ╔═══════════════════════════════════════════════════════════════════╗             β”‚
β”‚   β•‘  GOLD LAYER (Business-Level Aggregates)                           β•‘             β”‚
β”‚   β•‘  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β•‘             β”‚
β”‚   β•‘  β”‚ β€’ Business-level aggregates and KPIs                        β”‚  β•‘             β”‚
β”‚   β•‘  β”‚ β€’ Dimensionally modeled star/snowflake schemas              β”‚  β•‘             β”‚
β”‚   β•‘  β”‚ β€’ Ready for direct consumption by BI/ML tools               β”‚  β•‘             β”‚
β”‚   β•‘  β”‚ β€’ Materialized views for low-latency queries                β”‚  β•‘             β”‚
β”‚   β•‘  β”‚ β€’ Pre-computed joins and aggregations                       β”‚  β•‘             β”‚
β”‚   β•‘  β”‚ β€’ Consistent business definitions across domains            β”‚  β•‘             β”‚
β”‚   β•‘  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β•‘             β”‚
β”‚   β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•             β”‚
β”‚                                    β”‚                                                 β”‚
β”‚                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                β”‚
β”‚                    β–Ό               β–Ό               β–Ό                                β”‚
β”‚              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                             β”‚
β”‚              β”‚   BI     β”‚   β”‚   ML     β”‚   β”‚   Data   β”‚                             β”‚
β”‚              β”‚  Dashboardsβ”‚  β”‚ Pipelinesβ”‚   β”‚  Apps    β”‚                             β”‚
β”‚              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                             β”‚
β”‚                                                                                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Architecture Diagram: ACID Transaction Flow on Data Lake

Architecture Diagram
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    DELTA LAKE ACID TRANSACTION LIFECYCLE                             β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                                     β”‚
β”‚  WRITER 1 ─────┐          WRITER 2 ─────┐          WRITER 3 ─────┐                β”‚
β”‚  (INSERT)       β”‚          (UPDATE)       β”‚          (DELETE)       β”‚                β”‚
β”‚      β”‚          β”‚              β”‚          β”‚              β”‚          β”‚                β”‚
β”‚      β–Ό          β”‚              β–Ό          β”‚              β–Ό          β”‚                β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”      β”‚          β”Œβ”€β”€β”€β”€β”€β”€β”€β”      β”‚          β”Œβ”€β”€β”€β”€β”€β”€β”€β”      β”‚                β”‚
β”‚  β”‚ Acquireβ”‚     β”‚          β”‚ Acquireβ”‚     β”‚          β”‚ Acquireβ”‚     β”‚                β”‚
β”‚  β”‚  LOCK  β”‚     β”‚          β”‚  LOCK  β”‚     β”‚          β”‚  LOCK  β”‚     β”‚                β”‚
β”‚  β””β”€β”€β”€β”¬β”€β”€β”€β”˜      β”‚          β””β”€β”€β”€β”¬β”€β”€β”€β”˜      β”‚          β””β”€β”€β”€β”¬β”€β”€β”€β”˜      β”‚                β”‚
β”‚      β”‚          β”‚              β”‚          β”‚              β”‚          β”‚                β”‚
β”‚      β–Ό          β”‚              β–Ό          β”‚              β–Ό          β”‚                β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚                β”‚
β”‚  β”‚              OPTIMISTIC CONCURRENCY CONTROL                    β”‚  β”‚                β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚  β”‚                β”‚
β”‚  β”‚  β”‚  1. Read current snapshot (read timestamp)              β”‚  β”‚  β”‚                β”‚
β”‚  β”‚  β”‚  2. Build transaction plan (read set + write set)       β”‚  β”‚  β”‚                β”‚
β”‚  β”‚  β”‚  3. Validate: No conflicting writes since read time    β”‚  β”‚  β”‚                β”‚
β”‚  β”‚  β”‚     β€’ Check _delta_log/ for new commits                 β”‚  β”‚  β”‚                β”‚
β”‚  β”‚  β”‚     β€’ Compare read-set versions                         β”‚  β”‚  β”‚                β”‚
β”‚  β”‚  β”‚  4. If conflict β†’ ABORT, retry with new snapshot        β”‚  β”‚  β”‚                β”‚
β”‚  β”‚  β”‚  5. If no conflict β†’ PROCEED to commit                  β”‚  β”‚  β”‚                β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚  β”‚                β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚                β”‚
β”‚      β”‚          β”‚              β”‚          β”‚              β”‚          β”‚                β”‚
β”‚      β–Ό          β”‚              β–Ό          β”‚              β–Ό          β”‚                β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚                β”‚
β”‚  β”‚                 WRITE TO TRANSACTION LOG                       β”‚  β”‚                β”‚
β”‚  β”‚  _delta_log/                                                  β”‚  β”‚                β”‚
β”‚  β”‚  β”œβ”€β”€ 00000000000000000000.json  (Commit 0: Initial)          β”‚  β”‚                β”‚
β”‚  β”‚  β”œβ”€β”€ 00000000000000000001.json  (Commit 1: Writer 1)         β”‚  β”‚                β”‚
β”‚  β”‚  β”œβ”€β”€ 00000000000000000002.json  (Commit 2: Writer 2)         β”‚  β”‚                β”‚
β”‚  β”‚  β”œβ”€β”€ 00000000000000000003.json  (Commit 3: Writer 3)         β”‚  β”‚                β”‚
β”‚  β”‚  β”œβ”€β”€ _last_checkpoint  (Checkpoint pointer)                   β”‚  β”‚                β”‚
β”‚  β”‚  └── CHECKPOINT/  (Parquet checkpoints for fast reads)        β”‚  β”‚                β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚                β”‚
β”‚      β”‚          β”‚              β”‚          β”‚              β”‚          β”‚                β”‚
β”‚      β–Ό          β”‚              β–Ό          β”‚              β–Ό          β”‚                β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚                β”‚
β”‚  β”‚              DATA FILES (Parquet on Object Store)              β”‚  β”‚                β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”‚  β”‚                β”‚
β”‚  β”‚  β”‚ part-   β”‚  β”‚ part-   β”‚  β”‚ part-   β”‚  β”‚ part-   β”‚          β”‚  β”‚                β”‚
β”‚  β”‚  β”‚ 00000   β”‚  β”‚ 00001   β”‚  β”‚ 00002   β”‚  β”‚ 00003   β”‚          β”‚  β”‚                β”‚
β”‚  β”‚  β”‚ .snappy β”‚  β”‚ .snappy β”‚  β”‚ .snappy β”‚  β”‚ .snappy β”‚          β”‚  β”‚                β”‚
β”‚  β”‚  β”‚ .parquetβ”‚  β”‚ .parquetβ”‚  β”‚ .parquetβ”‚  β”‚ .parquetβ”‚          β”‚  β”‚                β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β”‚  β”‚                β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚                β”‚
β”‚                                                                                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Detailed Explanation

The Data Lakehouse represents a paradigm shift in enterprise data architecture, merging the flexibility and cost-efficiency of data lakes with the transactional guarantees and performance of data warehouses. This convergence eliminates the traditional "two-system" problem where organizations maintained separate systems for analytical and operational workloads, each with its own copy of data, governance model, and query engine.

At its foundation, the Lakehouse relies on an open table format (Delta Lake, Apache Iceberg, or Apache Hudi) that provides ACID transactions, schema enforcement, time travel, and metadata management directly on top of cheap object storage (S3, ADLS, GCS). The transaction log (_delta_log/ for Delta Lake) serves as the single source of truth, recording every mutation as an immutable, ordered JSON entry. This log enables optimistic concurrency control β€” multiple writers can operate simultaneously, with conflicts detected and resolved at commit time through a compare-and-swap mechanism on read-set versions.

The Medallion Architecture organizes data into three distinct tiers (Bronze, Silver, Gold), each with specific quality guarantees and transformation requirements. The Bronze layer ingests raw, unvalidated data from all sources with minimal transformation, preserving the original format for auditability and replay capability. The Silver layer applies cleansing, deduplication, standardization, and schema enforcement, producing a "single source of truth" that conforms to enterprise data models. The Gold layer contains business-level aggregates, dimensional models, and materialized views optimized for consumption by BI tools, ML pipelines, and applications.

Each layer provides specific data quality guarantees: Bronze accepts any valid data regardless of quality; Silver enforces NOT NULL constraints, data type validation, referential integrity checks, and temporal correctness; Gold guarantees that aggregations are idempotent, dimensions are conformed, and business rules are applied consistently across all domains.

Time travel (also called data versioning) is a critical capability enabled by the transaction log. Every commit creates a snapshot that can be queried as of a specific timestamp or version number. This enables use cases such as regulatory audit trails (reproduce any historical state), debugging data quality issues (compare current vs. previous state), point-in-time joins (join two tables as of the same logical moment), and rollback capabilities (revert to any previous consistent state).

Schema evolution and enforcement work together to maintain data quality while accommodating changing business requirements. Schema enforcement validates that incoming data matches the existing schema and rejects incompatible writes. Schema evolution allows controlled modifications (adding columns, changing data types, renaming fields) through explicit ALTER commands or write-mode configurations (mergeSchema option).

The table format layer also provides advanced optimization features: data skipping through file-level statistics (min/max values stored in Parquet footers and the transaction log), Z-ORDER multi-dimensional clustering for efficient point queries on correlated columns, compaction (OPTIMIZE command) to merge small files into larger ones for better read performance, and liquid clustering for adaptive, automatic data layout optimization.

Mathematical Foundations

Definition: Lakehouse Architecture

A lakehouse combines data lake storage LL (cheap, scalable object storage) with data warehouse semantics WW (ACID, schema enforcement, governance) such that:

Lakehouse=L∩W={(s,ACID(s),Schema(s),Govern(s)):s∈L}\text{Lakehouse} = L \cap W = \{(s, \text{ACID}(s), \text{Schema}(s), \text{Govern}(s)) : s \in L\}

Medallion Pattern Flow

Data flows through three tiers:

Bronze(Draw)β†’cleanSilver(Dcurated)β†’aggregateGold(Dbusiness)\text{Bronze}(D_{\text{raw}}) \xrightarrow{\text{clean}} \text{Silver}(D_{\text{curated}}) \xrightarrow{\text{aggregate}} \text{Gold}(D_{\text{business}})

Each tier applies transformations TiT_i: Di+1=Ti(Di)D_{i+1} = T_i(D_i) where Quality(Di+1)β‰₯Quality(Di)\text{Quality}(D_{i+1}) \geq \text{Quality}(D_i).

ACID on Data Lake Theorem

Transaction log LL ensures atomicity and isolation for concurrent operations on lake storage:

βˆ€Β concurrentΒ O1,O2:LΒ ordersΒ themΒ asΒ O1<O2Β orΒ O2<O1\forall\ \text{concurrent } O_1, O_2: L \text{ orders them as } O_1 < O_2 \text{ or } O_2 < O_1

Consistency is maintained because each operation is a commit that atomically adds/removes file references.

Storage Cost Model

Total cost for lakehouse with tiers:

Ctotal=ChotΓ—βˆ£Dgold∣⏟GoldΒ tier+CwarmΓ—βˆ£Dsilver∣⏟SilverΒ tier+CcoldΓ—βˆ£Dbronze∣⏟BronzeΒ tier+CcomputeC_{\text{total}} = \underbrace{C_{\text{hot}} \times |D_{\text{gold}}|}_{\text{Gold tier}} + \underbrace{C_{\text{warm}} \times |D_{\text{silver}}|}_{\text{Silver tier}} + \underbrace{C_{\text{cold}} \times |D_{\text{bronze}}|}_{\text{Bronze tier}} + C_{\text{compute}}

Typical ratio: Ccold:Cwarm:Chot=1:5:20C_{\text{cold}} : C_{\text{warm}} : C_{\text{hot}} = 1 : 5 : 20.

Data Freshness

End-to-end freshness from ingestion to Gold tier:

Fe2e=Tingest+Tbronze+Tsilver+TgoldF_{\text{e2e}} = T_{\text{ingest}} + T_{\text{bronze}} + T_{\text{silver}} + T_{\text{gold}}

Target: Fe2e<SLAfreshnessF_{\text{e2e}} < \text{SLA}_{\text{freshness}}.

Key Insight

The lakehouse pattern eliminates the "copy tax" of traditional architectures where data must be moved between systems. Delta Lake/Iceberg/Hudi provide the transaction layer that makes ACID possible on object storage.

Summary

The lakehouse unifies lake storage with warehouse semantics through transaction logs. The medallion pattern organizes data by quality tiers. Storage costs follow tiered pricing models, and freshness is the sum of pipeline stage latencies. ACID guarantees eliminate the need for separate OLAP systems.

Key Concepts Table

ConceptDelta LakeApache IcebergApache HudiDescription
ACID TransactionsYes (MVCC)Yes (snapshot isolation)Yes (copy-on-write / merge-on-read)Transactional semantics on object storage
Schema EvolutionYes (mergeSchema)Yes (evolve schema)Yes (schema evolution)Safe schema changes without rewriting data
Time TravelYes (VERSION AS OF)Yes (snapshots)Yes (timeline)Query historical data states
Partition EvolutionYes (dynamic)Yes (hidden partitions)Yes (evolvable)Change partitioning without rewriting
Conflict ResolutionOptimistic concurrencyOptimistic concurrencyOptimistic concurrencyHandle concurrent writers gracefully
Open FormatYes (Parquet + JSON log)Yes (Parquet + metadata)Yes (Parquet + metadata)No vendor lock-in
Streaming SupportYes (Structured Streaming)Yes (Spark/Flink)Yes (Spark/Flink)Real-time ingestion and processing
DML OperationsINSERT, UPDATE, DELETE, MERGEINSERT, UPDATE, DELETE, MERGEINSERT, UPDATE, DELETE, UPSERTFull SQL-compatible data manipulation
CachingYes (Delta Cache)No (relies on storage)No (relies on storage)Local caching for performance
StatisticsColumn-level statsPartition-level statsColumn-level statsData skipping for query optimization

Code Examples

Example 1: Setting Up a Delta Lake Table with Medallion Pattern

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from delta import configure_spark_with_delta_pip

# Initialize Spark with Delta Lake
builder = (
    SparkSession.builder
    .appName("Lakehouse-Medallion")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.databricks.delta.schema.autoMerge.enabled", "true")
    .config("spark.sql.adaptive.enabled", "true")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# ─── BRONZE LAYER: Raw Ingestion ───
# Ingest raw transaction data with full audit metadata
raw_transactions = (
    spark.read
    .format("json")
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .load("/mnt/raw/transactions/*.json")
    .withColumn("_ingestion_timestamp", current_timestamp())
    .withColumn("_source_system", lit("erp_system"))
    .withColumn("_batch_id", lit("batch_20260101"))
)

# Write to Bronze with append-only semantics
(
    raw_transactions
    .write
    .format("delta")
    .mode("append")
    .partitionBy("_source_system")
    .save("/mnt/lakehouse/bronze/transactions")
)

# ─── SILVER LAYER: Cleansed & Conformed ───
# Read Bronze, apply cleansing rules, write to Silver
bronze_df = spark.read.format("delta").load("/mnt/lakehouse/bronze/transactions")

silver_transactions = (
    bronze_df
    .filter(col("_corrupt_record").isNull())  # Remove corrupt records
    .drop("_corrupt_record", "_ingestion_timestamp")
    .withColumn("amount", col("amount").cast(DecimalType(18, 2)))
    .withColumn("transaction_date", to_date(col("timestamp")))
    .withColumn("customer_id", upper(trim(col("customer_id"))))
    .dropDuplicates(["transaction_id"])
    .withColumn("_quality_score", 
        when(col("amount") > 0, lit(1.0))
        .when(col("amount") < 0, lit(0.8))
        .otherwise(lit(0.5))
    )
)

# Merge into Silver (upsert for idempotency)
(
    spark.read.format("delta").load("/mnt/lakehouse/silver/transactions")
    .alias("target")
    .merge(
        silver_transactions.alias("source"),
        "target.transaction_id = source.transaction_id"
    )
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# ─── GOLD LAYER: Business Aggregates ───
# Compute daily revenue metrics per region
gold_daily_revenue = (
    spark.read.format("delta").load("/mnt/lakehouse/silver/transactions")
    .groupBy(
        col("transaction_date"),
        col("region"),
        col("product_category")
    )
    .agg(
        count("*").alias("transaction_count"),
        sum("amount").alias("total_revenue"),
        avg("amount").alias("avg_transaction_value"),
        approx_count_distinct("customer_id").alias("unique_customers"),
        stddev("amount").alias("revenue_stddev")
    )
    .withColumn("revenue_per_customer", col("total_revenue") / col("unique_customers"))
)

(
    gold_daily_revenue
    .write
    .format("delta")
    .mode("overwrite")
    .partitionBy("transaction_date")
    .option("overwriteSchema", "true")
    .save("/mnt/lakehouse/gold/daily_revenue")
)

Example 2: Advanced ACID Operations and Time Travel

from delta.tables import DeltaTable
from pyspark.sql.functions import *

# Load existing Delta table
transactions_table = DeltaTable.forPath(spark, "/mnt/lakehouse/silver/transactions")

# ─── MERGE (UPSERT) with Complex Logic ───
new_data = spark.read.format("delta").load("/mnt/staging/transactions_updates")

(
    transactions_table.alias("target")
    .merge(
        new_data.alias("source"),
        "target.transaction_id = source.transaction_id"
    )
    .whenMatchedUpdate(
        condition="source.update_type = 'CORRECTION'",
        set={
            "amount": col("source.amount"),
            "description": col("source.description"),
            "_last_modified": current_timestamp()
        }
    )
    .whenMatchedDelete(
        condition="source.update_type = 'REVERSAL'"
    )
    .whenNotMatchedInsert(
        values={
            "transaction_id": col("source.transaction_id"),
            "customer_id": col("source.customer_id"),
            "amount": col("source.amount"),
            "product_category": col("source.product_category"),
            "region": col("source.region"),
            "_created_at": current_timestamp(),
            "_last_modified": current_timestamp()
        }
    )
    .execute()
)

# ─── TIME TRAVEL: Query Historical States ───
# Query as of a specific timestamp
historical_df = (
    spark.read
    .format("delta")
    .option("timestampAsOf", "2026-01-01")
    .load("/mnt/lakehouse/silver/transactions")
)

# Query as of a specific version
version_df = (
    spark.read
    .format("delta")
    .option("versionAsOf", 42)
    .load("/mnt/lakehouse/silver/transactions")
)

# Compare two versions to detect changes
current_df = spark.read.format("delta").load("/mnt/lakehouse/silver/transactions")
changes_df = (
    current_df.alias("current")
    .join(
        version_df.alias("old"),
        "transaction_id",
        "full_outer"
    )
    .withColumn("change_type",
        when(col("old.transaction_id").isNull(), lit("INSERTED"))
        .when(col("current.transaction_id").isNull(), lit("DELETED"))
        .when(col("current.amount") != col("old.amount"), lit("UPDATED"))
        .otherwise(lit("UNCHANGED"))
    )
    .filter(col("change_type") != "UNCHANGED")
)

# ─── VACUUM: Remove Old Files ───
# Set retention interval (default 7 days, minimum 24 hours)
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
transactions_table.vacuum(retentionHours=168)  # 7 days

# ─── OPTIMIZE: Compact Small Files ───
(
    transactions_table.optimize()
    .executeCompaction()
)

# Z-ORDER for multi-dimensional clustering
(
    transactions_table.optimize()
    .executeZOrderBy("customer_id", "product_category")
)

Example 3: Apache Iceberg Integration with PySpark

# Configure Spark for Iceberg
spark = (
    SparkSession.builder
    .appName("Iceberg-Lakehouse")
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.my_catalog.type", "hadoop")
    .config("spark.sql.catalog.my_catalog.warehouse", "/mnt/lakehouse/iceberg")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.SparkSessionExtensions")
    .getOrCreate()
)

# Create Iceberg table with advanced features
spark.sql("""
    CREATE TABLE IF NOT EXISTS my_catalog.sales (
        transaction_id STRING,
        customer_id STRING,
        amount DECIMAL(18, 2),
        product_category STRING,
        region STRING,
        transaction_date DATE,
        _created_at TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (transaction_date)
    TBLPROPERTIES (
        'format-version' = '2',
        'write.format.default' = 'parquet',
        'write.parquet.compression-codec' = 'zstd',
        'write.target-file-size-bytes' = '134217728',
        'commit.manifest-merge.enabled' = 'true'
    )
""")

# Schema evolution: add a new column
spark.sql("""
    ALTER TABLE my_catalog.sales
    ADD COLUMN payment_method STRING AFTER region
""")

# Partition evolution (no data rewrite required)
spark.sql("""
    ALTER TABLE my_catalog.sales
    DROP PARTITION FIELD transaction_date
""")

spark.sql("""
    ALTER TABLE my_catalog.sales
    ADD PARTITION FIELD bucket(16, customer_id)
""")

# Snapshot-based time travel
spark.sql("""
    SELECT * FROM my_catalog.sales
    TIMESTAMP AS OF '2026-01-01 00:00:00'
    WHERE region = 'US'
""")

# Expire old snapshots
spark.sql("""
    CALL my_catalog.system.expire_snapshots(
        table => 'sales',
        older_than => TIMESTAMP '2025-12-01 00:00:00',
        retain_last => 5
    )
""")

Performance Metrics

MetricBefore OptimizationAfter OptimizationImprovement
Average File Size1 MB (small files)128 MB (optimized)128x fewer files
Query Latency (P50)45 seconds2.3 seconds95% reduction
Query Latency (P99)320 seconds8.7 seconds97% reduction
Storage Cost (per TB)23/month∣23/month |12/month48% reduction
Write Throughput50 MB/s450 MB/s9x improvement
Concurrent Writers2 (frequent conflicts)12 (rare conflicts)6x scaling
Time Travel QueryN/A3.1 secondsNew capability
Compaction DurationN/A8.5 min/100GBNew capability
Data Skipped (%)0%85% (Z-ORDER)New capability
Snapshot RetentionUnlimitedConfigurableSpace savings

Best Practices

  1. Always use the Medallion Architecture β€” Maintain clear separation between Bronze (raw), Silver (cleansed), and Gold (aggregated) layers with explicit quality gates between them.

  2. Enable Adaptive Query Execution (AQE) β€” Set spark.sql.adaptive.enabled=true to allow Spark to dynamically optimize shuffle partitions, join strategies, and skew handling at runtime.

  3. Configure file compaction proactively β€” Run OPTIMIZE regularly on Silver/Gold tables to merge small files; target file sizes of 128–512 MB for optimal read performance.

  4. Apply Z-ORDER clustering on frequently filtered columns β€” Use OPTIMIZE ZORDER BY on columns commonly used in WHERE clauses, JOINs, and GROUP BY operations for data skipping.

  5. Set appropriate vacuum retention β€” Configure vacuum retention to match your compliance requirements; use VACUUM to reclaim storage from old files while maintaining the ability to time-travel.

  6. Use MERGE for idempotent writes β€” Always use MERGE (upsert) instead of INSERT for Silver-layer writes to ensure idempotency and prevent duplicate records on job retries.

  7. Monitor file statistics β€” Use DESCRIBE DETAIL and DESCRIBE HISTORY to monitor table health, file counts, and commit frequency; set alerts for excessive small file counts.

  8. Partition wisely β€” Partition by low-cardinality, high-filtration columns (date, region); avoid partitioning by high-cardinality columns (customer_id, transaction_id) which create excessive small files.

  9. Leverage schema evolution carefully β€” Use mergeSchema option sparingly and always validate schema changes against downstream consumers before applying; maintain a schema registry for governance.

  10. Implement data quality gates β€” Use tools like Great Expectations, Deequ, or custom validation in the Silver layer to ensure data meets quality standards before promotion to Gold.

  11. Version your data models β€” Tag Gold-layer tables with version numbers and maintain backward compatibility; use schema evolution to add columns rather than modifying existing ones.

  12. Optimize for your workload β€” For read-heavy workloads, prioritize compaction and Z-ORDER; for write-heavy workloads, use larger file target sizes and consider liquid clustering for automatic optimization.

See also: Snowflake Time Travel (snowflake/02), Kafka CDC (kafka/04), Airflow DAGs (airflow/02)

Advertisement

Need Expert PySpark Help?

Get personalized Spark optimization, cluster tuning, or production data pipeline consulting.

Advertisement