πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Databricks Spark: Photon, Adaptive Query Execution & Optimization

Azure Data EngineeringDatabricks Spark⭐ Premium

Advertisement

Databricks Spark: Photon, Adaptive Query Execution & Optimization

Maximize Spark performance with Photon engine, AQE, and advanced optimization techniques

Spark Optimization Stack

Architecture Diagram
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    SPARK OPTIMIZATION STACK                          β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                     β”‚
β”‚  LAYER 4: APPLICATION                                               β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚ β€’ Partition pruning (predicate pushdown)                     β”‚   β”‚
β”‚  β”‚ β€’ Broadcast joins for small tables                           β”‚   β”‚
β”‚  β”‚ β€’ Avoid UDFs (use built-in functions)                       β”‚   β”‚
β”‚  β”‚ β€’ Cache frequently accessed DataFrames                       β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                     β”‚
β”‚  LAYER 3: SPARK ENGINE                                              β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚ β€’ Adaptive Query Execution (AQE)                             β”‚   β”‚
β”‚  β”‚ β€’ Dynamic Partition Pruning                                   β”‚   β”‚
β”‚  β”‚ β€’ Skew Join Optimization                                      β”‚   β”‚
β”‚  β”‚ β€’ Coalesce Partitions                                         β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                     β”‚
β”‚  LAYER 2: PHOTON ENGINE                                             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚ β€’ Vectorized C++ execution                                   β”‚   β”‚
β”‚  β”‚ β€’ CPU cache-optimized data processing                        β”‚   β”‚
β”‚  β”‚ β€’ Multi-threaded execution                                   β”‚   β”‚
β”‚  β”‚ β€’ Automatic optimization for Spark SQL operations            β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                     β”‚
β”‚  LAYER 1: DELTA LAKE                                                β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚ β€’ Data skipping (statistics-based)                           β”‚   β”‚
β”‚  β”‚ β€’ Z-ORDER indexing                                           β”‚   β”‚
β”‚  β”‚ β€’ Auto compaction (OPTIMIZE)                                 β”‚   β”‚
β”‚  β”‚ β€’ Delta Cache (local SSD caching)                            β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Photon Engine

# Photon is enabled by default on DBR 11.3+
spark.conf.set("spark.databricks.photon.enabled", "true")

# Check Photon status
spark.sql("SET spark.databricks.photon.enabled").show()

# Monitor Photon metrics
spark.sql("""
    SELECT 
        operator,
        totalTimeMs,
        photonTimeMs,
        photonTimeMs / totalTimeMs * 100 AS photon_pct
    FROM system.query.operator_metrics
    WHERE queryId = 'your-query-id'
""")

Adaptive Query Execution (AQE)

# Enable AQE (default on DBR 11.3+)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")

# Monitor AQE decisions
spark.sql("""
    SELECT 
        queryId,
        optimizationReason,
        partitionsCoalesced,
        skewPartitionsOptimized
    FROM system.query.aqe_metrics
""")

Delta Cache

# Enable Delta Cache (default on DBR 11.3+)
spark.conf.set("spark.databricks.io.cache.enabled", "true")
spark.conf.set("spark.databricks.io.cache.maxDiskUsage", "50g")
spark.conf.set("spark.databricks.io.cache.maxCachedMetadataFileSize", "1g")

# Monitor cache hit rate
spark.sql("""
    SELECT 
        path,
        bytesRead,
        cacheHits,
        cacheMisses,
        cacheHits / (cacheHits + cacheMisses) * 100 AS hit_rate_pct
    FROM system.query.io_metrics
""")

Performance Monitoring

# Query history with execution details
spark.sql("""
    SELECT 
        query_id,
        query_text,
        start_time,
        duration / 1000 AS duration_sec,
        shuffle_bytes_written,
        scan_bytes_read
    FROM system.query.history
    WHERE start_time > current_timestamp() - interval 1 hour
    ORDER BY duration DESC
""")

# Cluster utilization
spark.sql("""
    SELECT 
        executor_id,
        total_tasks,
        failed_tasks,
        shuffle_write_bytes / 1024 / 1024 / 1024 AS shuffle_gb,
        gc_time / execution_time * 100 AS gc_pct
    FROM system.cluster.executor_metrics
""")

ℹ️

Pro Tip: Monitor the Spark UI (Spark tab in Databricks) to identify shuffle-heavy stages and optimize partition strategies. Use explain() to inspect query plans.

Interview Questions

Q1: How does Photon differ from standard Spark execution? A: Photon uses vectorized C++ execution (processing batches of rows) while standard Spark uses JVM-based row-at-a-time execution. Photon provides 2-8x performance improvement for SQL operations.

Q2: When should you disable AQE? A: Rarelyβ€”AQE is beneficial in most cases. Disable only when you need deterministic query plans for reproducibility (e.g., performance regression testing) or when AQE decisions are suboptimal for specific patterns.

Q3: How do you optimize Spark jobs for large datasets? A: 1) Use Delta Lake with Z-ORDER, 2) Enable Photon and AQE, 3) Optimize partition counts (aim for 128MB-256MB per partition), 4) Use broadcast joins for small tables, 5) Avoid UDFs, 6) Cache frequently accessed DataFrames.

Advertisement