πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

EMR Spark Deep Dive

AWS Data EngineeringDynamic Allocation & Shuffle Optimization⭐ Premium

Advertisement

πŸš€ EMR Spark Deep Dive

Master EMR Spark dynamic allocation, shuffle optimization, and performance tuning.

Module: AWS Data Engineering β€’ Topic 41 of 65 β€’ Premium Content

Spark on EMR Architecture

Architecture Diagram
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    SPARK ON EMR ARCHITECTURE                                  β”‚
β”‚                                                                             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚  DRIVER β†’ EXECUTORS (distributed across core/task nodes)            β”‚    β”‚
β”‚  β”‚                                                                     β”‚    β”‚
β”‚  β”‚  Driver: SparkContext, task scheduling, result aggregation         β”‚    β”‚
β”‚  β”‚  Executors: Task execution, data caching, shuffle                  β”‚    β”‚
β”‚  β”‚                                                                     β”‚    β”‚
β”‚  β”‚  DYNAMIC ALLOCATION:                                                β”‚    β”‚
β”‚  β”‚  β€’ Min/Max executors based on workload                              β”‚    β”‚
β”‚  β”‚  β€’ Scale-out when tasks are pending                                 β”‚    β”‚
β”‚  β”‚  β€’ Scale-in when executors are idle                                 β”‚    β”‚
β”‚  β”‚  β€’ External shuffle service for state preservation                 β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                                                                             β”‚
β”‚  SHUFFLE OPTIMIZATION:                                                      β”‚
β”‚  β€’ Partition count: Default 200, tune based on data size                   β”‚
β”‚  β€’ Adaptive Query Execution (AQE): Auto-optimize at runtime                β”‚
β”‚  β€’ Skew join handling: Detect and split skewed partitions                   β”‚
β”‚  β€’ Coalesce partitions: Merge small partitions automatically               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Spark Configuration

spark_config = {
    # Dynamic Allocation
    'spark.dynamicAllocation.enabled': 'true',
    'spark.shuffle.service.enabled': 'true',
    'spark.dynamicAllocation.minExecutors': '2',
    'spark.dynamicAllocation.maxExecutors': '100',
    'spark.dynamicAllocation.executorIdleTimeout': '60s',
    
    # Adaptive Query Execution
    'spark.sql.adaptive.enabled': 'true',
    'spark.sql.adaptive.coalescePartitions.enabled': 'true',
    'spark.sql.adaptive.skewJoin.enabled': 'true',
    'spark.sql.adaptive.skewJoin.skewedPartitionFactor': '5',
    
    # Shuffle Optimization
    'spark.sql.shuffle.partitions': '200',
    'spark.shuffle.compress': 'true',
    'spark.shuffle.spill.compress': 'true',
    'spark.io.compression.codec': 'snappy',
    
    # Memory Management
    'spark.executor.memory': '16g',
    'spark.executor.memoryFraction': '0.8',
    'spark.executor.memoryOverhead': '4g',
    'spark.driver.memory': '8g',
    'spark.memory.fraction': '0.8'
}

Interview Q&A

Q1: How does dynamic allocation work?

Answer: Executors are added when tasks are pending and removed when idle. The external shuffle service preserves shuffle state when executors are removed.

Q2: What is the optimal shuffle partition count?

Answer: Start with 200. Tune based on data size: ~128MB per partition. Use AQE for automatic optimization.

Q3: What is Adaptive Query Execution?

Answer: AQE optimizes queries at runtime based on actual data statistics: auto-coalesce partitions, handle skew joins, optimize join strategies.

Summary

  • Dynamic Allocation: Auto-scale executors based on workload
  • AQE: Runtime query optimization based on data statistics
  • Shuffle: Tune partition count, enable compression
  • Memory: Proper memory allocation prevents OOM errors
  • Cost: Use spot instances for task nodes

Advertisement