14. Merge & Upsert Operations in PySpark

Free Lesson

Advertisement

14. Merge & Upsert Operations in PySpark

DfMerge (Upsert)

A merge operation combines INSERT, UPDATE, and DELETE operations in a single atomic transaction based on a match condition. It is the foundation of Change Data Capture (CDC) and slowly changing dimension (SCD) patterns in Delta Lake.

DfChange Data Capture (CDC)

CDC is a technique that captures and tracks changes (inserts, updates, deletes) made to data in a source system, enabling downstream systems to apply the same changes to maintain synchronized copies.

Tmerge=Tscan+Tmatch+Tapply+TwriteT_{merge} = T_{scan} + T_{match} + T_{apply} + T_{write}

Merge File Pruning Efficiency

Eprune=1βˆ’fracNfilesscannedNfilestotalE_{prune} = 1 - \\frac{N_{files\\_scanned}}{N_{files\\_total}}

Here,

  • EpruneE_{prune}=Fraction of files skipped via data skipping (target > 0.9)
  • Nfiles_scannedN_{files\_scanned}=Number of Parquet files that need to be read
  • Nfiles_totalN_{files\_total}=Total number of Parquet files in the target table

Delta Lake merge uses data skipping via file-level min/max statistics to prune irrelevant files. To maximize pruning, ensure your merge key is clustered in the target table using ZORDER BY.

For high-frequency upserts, use MERGE with OPTIMIZE ZORDER BY on the merge key to maintain file-level statistics. Run VACUUM periodically to remove old files and prevent metadata bloat.

ThMerge Atomicity

Theorem: Delta Lake merge is atomic at the file level β€” it reads existing files, applies transformations, writes new files, and atomically updates the Delta log in a single commit. If any step fails, the transaction is rolled back and the table remains in its previous consistent state.

  • Merge = atomic INSERT + UPDATE + DELETE based on match condition
  • CDC captures source changes for downstream synchronization
  • File pruning via min/max statistics is critical for merge performance
  • Use OPTIMIZE ZORDER BY on merge keys; use VACUUM to clean old files

πŸ—οΈ Merge Operation Architecture

Architecture Diagram
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    MERGE OPERATION ARCHITECTURE                          β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚                    SOURCE DATA                                   β”‚   β”‚
β”‚  β”‚                                                                   β”‚   β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”‚   β”‚
β”‚  β”‚  β”‚  New Records β”‚    β”‚  Updated     β”‚    β”‚  Deleted     β”‚       β”‚   β”‚
β”‚  β”‚  β”‚  (INSERT)    β”‚    β”‚  Records     β”‚    β”‚  Records     β”‚       β”‚   β”‚
β”‚  β”‚  β”‚              β”‚    β”‚  (UPDATE)    β”‚    β”‚  (DELETE)    β”‚       β”‚   β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜       β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                              β”‚                                          β”‚
β”‚                              β–Ό                                          β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚                    MERGE ENGINE                                  β”‚   β”‚
β”‚  β”‚                                                                   β”‚   β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚   β”‚
β”‚  β”‚  β”‚  1. MATCH DETECTION                                     β”‚    β”‚   β”‚
β”‚  β”‚  β”‚     β€’ Join condition evaluation                         β”‚    β”‚   β”‚
β”‚  β”‚  β”‚     β€’ Duplicate detection                               β”‚    β”‚   β”‚
β”‚  β”‚  β”‚     β€’ Conflict resolution                               β”‚    β”‚   β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚   β”‚
β”‚  β”‚                         β”‚                                        β”‚   β”‚
β”‚  β”‚                         β–Ό                                        β”‚   β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚   β”‚
β”‚  β”‚  β”‚  2. ACTION DETERMINATION                                 β”‚    β”‚   β”‚
β”‚  β”‚  β”‚     β€’ INSERT (new records)                               β”‚    β”‚   β”‚
β”‚  β”‚  β”‚     β€’ UPDATE (matched records)                           β”‚    β”‚   β”‚
β”‚  β”‚  β”‚     β€’ DELETE (matched records)                           β”‚    β”‚   β”‚
β”‚  β”‚  β”‚     β€’ MATCHED BY SOURCE (for CDC)                        β”‚    β”‚   β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚   β”‚
β”‚  β”‚                         β”‚                                        β”‚   β”‚
β”‚  β”‚                         β–Ό                                        β”‚   β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚   β”‚
β”‚  β”‚  β”‚  3. TRANSACTION EXECUTION                               β”‚    β”‚   β”‚
β”‚  β”‚  β”‚     β€’ Atomic write                                      β”‚    β”‚   β”‚
β”‚  β”‚  β”‚     β€’ Version management                                β”‚    β”‚   β”‚
β”‚  β”‚  β”‚     β€’ Conflict resolution                               β”‚    β”‚   β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                              β”‚                                          β”‚
β”‚                              β–Ό                                          β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚                    TARGET TABLE (DELTA)                          β”‚   β”‚
β”‚  β”‚                                                                   β”‚   β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚   β”‚
β”‚  β”‚  β”‚  Version N: [old data]                                  β”‚    β”‚   β”‚
β”‚  β”‚  β”‚  Version N+1: [merged data] (after merge operation)     β”‚    β”‚   β”‚
β”‚  β”‚  β”‚  Version N+2: [further updates]                         β”‚    β”‚   β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ”„ CDC Pattern Diagram

Architecture Diagram
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    CHANGE DATA CAPTURE (CDC) PATTERNS                    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚                    CDC SOURCE (Database)                         β”‚   β”‚
β”‚  β”‚                                                                   β”‚   β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚   β”‚
β”‚  β”‚  β”‚  Operation Log                                          β”‚    β”‚   β”‚
β”‚  β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚    β”‚   β”‚
β”‚  β”‚  β”‚  β”‚ INSERT  β”‚ UPDATE  β”‚ DELETE  β”‚ INSERT  β”‚ UPDATE  β”‚   β”‚    β”‚   β”‚
β”‚  β”‚  β”‚  β”‚ id=1    β”‚ id=2    β”‚ id=3    β”‚ id=4    β”‚ id=5    β”‚   β”‚    β”‚   β”‚
β”‚  β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚    β”‚   β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                              β”‚                                          β”‚
β”‚                              β–Ό                                          β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚                    CDC PROCESSING                                β”‚   β”‚
β”‚  β”‚                                                                   β”‚   β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                     β”‚   β”‚
β”‚  β”‚  β”‚  Extract        │───▢│  Transform      β”‚                     β”‚   β”‚
β”‚  β”‚  β”‚  β€’ Log parsing  β”‚    β”‚  β€’ Schema map   β”‚                     β”‚   β”‚
β”‚  β”‚  β”‚  β€’ Filtering    β”‚    β”‚  β€’ Enrichment   β”‚                     β”‚   β”‚
β”‚  β”‚  β”‚  β€’ Validation   β”‚    β”‚  β€’ Dedup        β”‚                     β”‚   β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                     β”‚   β”‚
β”‚  β”‚                              β”‚                                   β”‚   β”‚
β”‚  β”‚                              β–Ό                                   β”‚   β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                     β”‚   β”‚
β”‚  β”‚  β”‚  Load           │◀───│  Merge          β”‚                     β”‚   β”‚
β”‚  β”‚  β”‚  β€’ Delta write  β”‚    β”‚  β€’ Match        β”‚                     β”‚   β”‚
β”‚  β”‚  β”‚  β€’ Versioning   β”‚    β”‚  β€’ Apply        β”‚                     β”‚   β”‚
β”‚  β”‚  β”‚  β€’ Cleanup      β”‚    β”‚  β€’ Resolve      β”‚                     β”‚   β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                     β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚                    CDC DATA FLOW                                 β”‚   β”‚
β”‚  β”‚                                                                   β”‚   β”‚
β”‚  β”‚  Source ──▢ Extract ──▢ Transform ──▢ Merge ──▢ Target          β”‚   β”‚
β”‚  β”‚                                                                   β”‚   β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”           β”‚   β”‚
β”‚  β”‚  β”‚ DB Log  β”‚  β”‚ Stream  β”‚  β”‚ Buffer  β”‚  β”‚ Delta   β”‚           β”‚   β”‚
β”‚  β”‚  β”‚ (WAL)   β”‚  β”‚ (Kafka) β”‚  β”‚ (Memory)β”‚  β”‚ Table   β”‚           β”‚   β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜           β”‚   β”‚
β”‚  β”‚                                                                   β”‚   β”‚
β”‚  β”‚  Latency:  seconds ──▢ minutes ──▢ seconds ──▢ seconds          β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

⚑ Upsert Strategies Diagram

Architecture Diagram
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    UPSERT STRATEGIES COMPARISON                          β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚                    STRATEGY 1: MERGE (Default)                   β”‚   β”‚
β”‚  β”‚                                                                   β”‚   β”‚
β”‚  β”‚  Source ──▢ [Join] ──▢ [Match] ──▢ [Apply Actions]             β”‚   β”‚
β”‚  β”‚                                                                   β”‚   β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚   β”‚
β”‚  β”‚  β”‚  β€’ Matches on key                                       β”‚    β”‚   β”‚
β”‚  β”‚  β”‚  β€’ Applies INSERT/UPDATE/DELETE based on conditions     β”‚    β”‚   β”‚
β”‚  β”‚  β”‚  β€’ Handles all CDC operations                           β”‚    β”‚   β”‚
β”‚  β”‚  β”‚  β€’ Performance: Moderate (requires join)                β”‚    β”‚   β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚                    STRATEGY 2: INSERT OVERWRITE                  β”‚   β”‚
β”‚  β”‚                                                                   β”‚   β”‚
β”‚  β”‚  Source ──▢ [Replace Partition] ──▢ [Write]                     β”‚   β”‚
β”‚  β”‚                                                                   β”‚   β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚   β”‚
β”‚  β”‚  β”‚  β€’ Replaces entire partition                            β”‚    β”‚   β”‚
β”‚  β”‚  β”‚  β€’ No partial updates                                   β”‚    β”‚   β”‚
β”‚  β”‚  β”‚  β€’ Simple but destructive                               β”‚    β”‚   β”‚
β”‚  β”‚  β”‚  β€’ Performance: Fast (no join required)                 β”‚    β”‚   β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚                    STRATEGY 3: APPEND + DEDUP                    β”‚   β”‚
β”‚  β”‚                                                                   β”‚   β”‚
β”‚  β”‚  Source ──▢ [Append] ──▢ [Deduplicate] ──▢ [Compact]           β”‚   β”‚
β”‚  β”‚                                                                   β”‚   β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚   β”‚
β”‚  β”‚  β”‚  β€’ Appends all records                                  β”‚    β”‚   β”‚
β”‚  β”‚  β”‚  β€’ Deduplicates on read                                 β”‚    β”‚   β”‚
β”‚  β”‚  β”‚  β€’ Requires dedup logic                                 β”‚    β”‚   β”‚
β”‚  β”‚  β”‚  β€’ Performance: Fast write, slow read                   β”‚    β”‚   β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                         β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚                    PERFORMANCE COMPARISON                        β”‚   β”‚
β”‚  β”‚                                                                   β”‚   β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚   β”‚
β”‚  β”‚  β”‚  Strategy        β”‚ Write  β”‚ Read   β”‚ Storage β”‚ Complexityβ”‚    β”‚   β”‚
β”‚  β”‚  │──────────────────┼────────┼────────┼─────────┼───────────│    β”‚   β”‚
β”‚  β”‚  β”‚  MERGE           β”‚ Medium β”‚ Fast   β”‚ Low     β”‚ High      β”‚    β”‚   β”‚
β”‚  β”‚  β”‚  INSERT OVERWRITEβ”‚ Fast   β”‚ Fast   β”‚ Medium  β”‚ Low       β”‚    β”‚   β”‚
β”‚  β”‚  β”‚  APPEND + DEDUP  β”‚ Fast   β”‚ Slow   β”‚ High    β”‚ Medium    β”‚    β”‚   β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ“š Detailed Explanation

Merge and upsert operations are fundamental patterns in modern data lakes, enabling efficient updates, inserts, and deletes on large datasets. In PySpark, these operations are primarily implemented through Delta Lake's merge functionality, which provides ACID transactions, versioning, and efficient change data capture (CDC) processing.

The merge operation is the most flexible and powerful upsert strategy. It allows you to specify match conditions and define separate actions for matched and unmatched records. This makes it ideal for CDC scenarios where you need to handle inserts, updates, and deletes in a single operation. The merge operation uses a join to match source records with target records, then applies the specified actions based on the match results.

CDC patterns represent a critical use case for merge operations. In CDC, changes from a source system (typically a database) are captured and applied to a target system (usually a data lake). The merge operation handles the complexity of applying these changes efficiently, including handling late-arriving data, resolving conflicts, and maintaining data consistency.

The choice of upsert strategy depends on several factors: data volume, update patterns, latency requirements, and complexity constraints. The merge operation is the most flexible but has higher overhead due to the join. Insert overwrite is faster but doesn't support partial updates. Append with dedup is the fastest for writes but has higher storage costs and slower reads.

Performance optimization for merge operations involves several techniques: partitioning the target table to reduce the join scope, using efficient join strategies, optimizing the merge conditions, and managing file sizes through compaction. The target table's partition strategy is particularly important, as it directly affects the join performance.

Conflict resolution is a critical aspect of merge operations. When multiple writers attempt to merge into the same table simultaneously, conflicts can occur. Delta Lake provides optimistic concurrency control to handle these conflicts, automatically retrying operations when conflicts are detected. Understanding and configuring conflict resolution is essential for production systems.

Data quality considerations include handling schema evolution during merge operations, validating data types and constraints, and ensuring referential integrity. Delta Lake's schema enforcement and evolution capabilities help maintain data quality during merge operations.

Monitoring and observability are important for maintaining merge operation health. Key metrics include merge duration, number of records processed, file compaction rates, and conflict frequencies. These metrics help identify performance bottlenecks and optimize the merge strategy.

Advanced techniques include conditional merges, where different actions are applied based on complex conditions, and merge with aggregation, where source records are aggregated before merging. These techniques enable sophisticated data processing patterns while maintaining the benefits of ACID transactions.

πŸ“Š Key Concepts Table

ConceptDescriptionUse Case
Merge OperationJoin-based upsert with conditional actionsCDC, complex updates
CDC (Change Data Capture)Capturing and applying database changesData synchronization
UpsertInsert or update based on existenceReal-time data ingestion
Partition PruningReducing join scope through partitioningPerformance optimization
Optimistic ConcurrencyConflict resolution for concurrent writesMulti-writer scenarios
Schema EvolutionHandling schema changes during mergeEvolving data structures
File CompactionMerging small files for better performanceStorage optimization
Time TravelAccessing historical versionsAuditing, debugging

πŸ’» Code Examples

Basic Merge Operation

from pyspark.sql import SparkSession
from delta.tables import DeltaTable
from pyspark.sql.functions import *

spark = SparkSession.builder \
    .appName("MergeOperation") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Read target Delta table
target_table = DeltaTable.forPath(spark, "/path/to/target/table")

# Read source data (e.g., from Kafka, files, etc.)
source_df = spark.read \
    .format("json") \
    .load("/path/to/source/data")

# Perform merge operation
target_table.alias("target").merge(
    source_df.alias("source"),
    "target.id = source.id"  # Match condition
).whenMatchedUpdate(
    condition="target.last_updated < source.last_updated",  # Only update if source is newer
    set={
        "name": "source.name",
        "email": "source.email",
        "last_updated": "source.last_updated"
    }
).whenNotMatchedInsert(
    values={
        "id": "source.id",
        "name": "source.name",
        "email": "source.email",
        "last_updated": "source.last_updated",
        "created_at": current_timestamp()
    }
).execute()

print("Merge operation completed successfully")

Advanced Merge with Delete

from pyspark.sql import SparkSession
from delta.tables import DeltaTable
from pyspark.sql.functions import *

spark = SparkSession.builder \
    .appName("AdvancedMerge") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Read target Delta table
target_table = DeltaTable.forPath(spark, "/path/to/target/table")

# Read source data with operation column
source_df = spark.read \
    .format("delta") \
    .load("/path/to/cdc/source") \
    .withColumn("operation", col("op"))  # o=insert, u=update, d=delete

# Perform merge with delete operation
target_table.alias("target").merge(
    source_df.alias("source"),
    "target.id = source.id"
).whenMatchedUpdate(
    condition="source.operation IN ('o', 'u')",  # Insert or Update
    set={
        "name": "source.name",
        "email": "source.email",
        "status": "source.status",
        "last_updated": "source.timestamp"
    }
).whenMatchedDelete(
    condition="source.operation = 'd'"  # Delete operation
).whenNotMatchedInsert(
    condition="source.operation IN ('o', 'u')",  # Only insert for new records
    values={
        "id": "source.id",
        "name": "source.name",
        "email": "source.email",
        "status": "source.status",
        "last_updated": "source.timestamp",
        "created_at": "source.timestamp"
    }
).execute()

print("CDC merge operation completed successfully")

Merge with Partition Pruning

from pyspark.sql import SparkSession
from delta.tables import DeltaTable
from pyspark.sql.functions import *

spark = SparkSession.builder \
    .appName("PartitionPruningMerge") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Read target Delta table (partitioned by date)
target_table = DeltaTable.forPath(spark, "/path/to/partitioned/table")

# Read source data for specific date
source_df = spark.read \
    .format("json") \
    .load("/path/to/source/data") \
    .filter(col("event_date") == "2024-01-15")  # Filter to specific partition

# Perform merge with partition pruning
target_table.alias("target").merge(
    source_df.alias("source"),
    "target.id = source.id AND target.event_date = source.event_date"  # Include partition key
).whenMatchedUpdate(
    set={
        "value": "source.value",
        "updated_at": current_timestamp()
    }
).whenNotMatchedInsert(
    values={
        "id": "source.id",
        "event_date": "source.event_date",
        "value": "source.value",
        "created_at": current_timestamp(),
        "updated_at": current_timestamp()
    }
).execute()

print("Partition-pruned merge completed successfully")

Merge with Aggregation

from pyspark.sql import SparkSession
from delta.tables import DeltaTable
from pyspark.sql.functions import *

spark = SparkSession.builder \
    .appName("MergeWithAggregation") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Read target Delta table
target_table = DeltaTable.forPath(spark, "/path/to/target/table")

# Read source data and aggregate
source_df = spark.read \
    .format("json") \
    .load("/path/to/source/data") \
    .groupBy("user_id", "event_date") \
    .agg(
        sum("amount").alias("total_amount"),
        count("*").alias("event_count"),
        max("timestamp").alias("last_event_time")
    )

# Perform merge with aggregated data
target_table.alias("target").merge(
    source_df.alias("source"),
    "target.user_id = source.user_id AND target.event_date = source.event_date"
).whenMatchedUpdate(
    set={
        "total_amount": "target.total_amount + source.total_amount",
        "event_count": "target.event_count + source.event_count",
        "last_event_time": "GREATEST(target.last_event_time, source.last_event_time)",
        "updated_at": current_timestamp()
    }
).whenNotMatchedInsert(
    values={
        "user_id": "source.user_id",
        "event_date": "source.event_date",
        "total_amount": "source.total_amount",
        "event_count": "source.event_count",
        "last_event_time": "source.last_event_time",
        "created_at": current_timestamp(),
        "updated_at": current_timestamp()
    }
).execute()

print("Merge with aggregation completed successfully")

πŸ“ˆ Performance Metrics

MetricOptimizedTypicalPoorOptimization
Merge Duration< 30s30s-5min> 5minPartition pruning, file optimization
Records Processed/sec> 100K10K-100K< 10KParallelism, join optimization
File Compaction< 100 files100-1000 files> 1000 filesRegular compaction, optimize command
Conflict Rate< 1%1-5%> 5%Concurrency control, partitioning
Storage Efficiency> 80%60-80%< 60%Compaction, z-ordering

πŸ† Best Practices

  1. Partition strategically - Partition by high-cardinality columns used in join conditions
  2. Use partition pruning - Always include partition keys in merge conditions
  3. Optimize file sizes - Aim for 128MB-1GB files for optimal performance
  4. Regular compaction - Run OPTIMIZE command to merge small files
  5. Z-order indexing - Create z-order indexes on frequently filtered columns
  6. Monitor conflicts - Track and minimize merge conflicts
  7. Handle schema evolution - Use mergeSchema option for schema changes
  8. Test with realistic data - Validate performance with production-like volumes
  9. Use checkpointing - For streaming merges, configure appropriate checkpoint intervals
  10. Backup before major operations - Always backup data before large merge operations

πŸ”— Related Topics

  • 11-structured-streaming.mdx: Streaming architecture for real-time merges
  • 12-state-management.mdx: Stateful operations for merge patterns
  • 13-window-operations.mdx: Windowed aggregations before merge
  • 16-schema-evolution.mdx: Schema changes during merge operations

See Also

  • Kafka Streams (kafka/03): Stream-table duality for upsert semantics
  • Data Engineering Streaming (data-engineering/022): CDC merge patterns in streaming pipelines

Advertisement

Need Expert PySpark Help?

Get personalized Spark optimization, cluster tuning, or production data pipeline consulting.

Advertisement