Apache Spark: RDDs, DataFrames, and the Catalyst Optimizer

Apache Spark: Unified Analytics Engine for Big Data

Apache Spark is a unified analytics engine for large-scale data processing, providing APIs in Python (PySpark), Scala, Java, and R.

Why Spark Dominates Big Data Processing

Key Innovations:

Resilient Distributed Dataset (RDD): an immutable, partitioned collection processed in parallel
DataFrame/Dataset API backed by the Catalyst query optimizer
SQL-style optimizations for DataFrame and SQL expressions (Python UDFs stay opaque to the optimizer)

Spark vs MapReduce Advantages:

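In-memory computation: intermediate results can stay in memory instead of being written to HDFS between jobs (shuffle data is still written to local disk)
DAG execution engine: pipelines narrow transformations into stages and schedules the whole job graph at once
Catalyst optimizer: rewrites DataFrame and SQL queries for performance
Unified API: batch, streaming, SQL, ML, and graph processing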

Performance Comparison:

Feature	Apache Spark	MapReduce
Processing Model	In-memory, spilling to disk	Disk-based (HDFS between jobs)
Speed	Up to 100x faster for in-memory workloads	Baseline
Ease of Use	High (Python, Scala, SQL)	Low (primarily Java)
Iterative Processing	Excellent (cache)	Poor (re-reads from disk each iteration)
Real-Time	Near real-time (Structured Streaming)	No
SQL Support	Spark SQL	Hive
ML Library	MLlib	Mahout
Fault Tolerance	Lineage-based recomputation	Task re-execution over replicated HDFS data

Key Insight: Spark can run up to 100x faster than MapReduce on workloads that fit in memory, such as iterative algorithms, mainly because intermediate data stays in memory. The gain is smaller for single-pass, disk-bound jobs.

Spark Application Architecture

Key Concepts

Concept	Description	API
RDD	Immutable distributed collection	`sc.parallelize(data)`, `rdd.map(f)`
DataFrame	Distributed table with schema	`spark.read.parquet(path)`
Dataset	Type-safe DataFrame (Scala/Java only)	`ds.map(f)`
Transformation	Lazy operation building DAG	`.map()`, `.filter()`, `.join()`, `.groupBy()`
Action	Triggers computation	`.count()`, `.collect()`, `.show()`, `.write.save(path)`
Partition	Unit of parallelism	`df.repartition(n)`, `df.coalesce(n)`
Shuffle	Data redistribution across partitions	Triggered by `groupBy`, `join`, `repartition`
Broadcast	Send small DataFrame to all executors	`F.broadcast(small_df)`
Accumulator	Shared variable that tasks can only add to; read on the driver	`sc.accumulator(0)`
Broadcast Variable	Read-only shared variable	`sc.broadcast(variable)`
Cache/Persist	Store DataFrame in memory/disk	`.cache()`, `.persist(StorageLevel.MEMORY_AND_DISK)`
Checkpoint	Write data to durable storage and truncate lineage	`spark.sparkContext.setCheckpointDir(path)`, `.checkpoint()`
SparkSession	Entry point for Spark operations	`SparkSession.builder.appName("app").getOrCreate()`
Catalog	Metadata store for tables, databases	`spark.catalog.listDatabases()`
UDF	User-defined function	`@udf(returnType=StringType())`
Pandas UDF	Vectorized UDF using Pandas	`@pandas_udf(IntegerType())`
Schema	StructType defining column types	`StructType([StructField("col", StringType())])`

Production Code

Optimized DataFrame ETL Pipeline

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType
from pyspark.sql.window import Window
import logging

logger = logging.getLogger(__name__)


def create_spark_session(app_name: str = "ETL-Pipeline") -> SparkSession:
    """Create a production-optimized SparkSession."""
    return (
        SparkSession.builder
        .appName(app_name)
        .config("spark.sql.adaptive.enabled", "true")
        .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
        .config("spark.sql.adaptive.skewJoin.enabled", "true")
        .config("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))  # 10MB
        .config("spark.sql.shuffle.partitions", "200")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.sql.parquet.compression.codec", "snappy")
        .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
        .config("spark.dynamicAllocation.enabled", "true")
        .config("spark.dynamicAllocation.minExecutors", "2")
        .config("spark.dynamicAllocation.maxExecutors", "50")
        .config("spark.sql.execution.arrow.pyspark.enabled", "true")
        .getOrCreate()
    )


def run_optimized_etl(spark: SparkSession, input_path: str, output_path: str):
    """Run an optimized ETL pipeline with best practices."""
    # Read source data with schema enforcement
    schema = StructType([
        StructField("transaction_id", StringType(), False),
        StructField("customer_id", StringType(), False),
        StructField("product_id", StringType(), False),
        StructField("amount", DoubleType(), True),
        StructField("quantity", IntegerType(), True),
        StructField("event_date", StringType(), False),
    ])

    raw_df = (
        spark.read
        .schema(schema)
        .parquet(input_path)
        .repartition(F.col("customer_id"))  # Shuffle once by the window key; the broadcast join below needs no shuffle
    )

    # Cache because raw_df is evaluated twice: by count() here and by the write below
    raw_df.cache()
    logger.info("Raw record count: %d", raw_df.count())

    # Filter early to reduce data volume before joins
    filtered_df = (
        raw_df
        .filter(F.col("amount").isNotNull())
        .filter(F.col("amount") > 0)
        .filter(F.col("event_date") >= "2024-01-01")
    )

    # Broadcast join with small dimension table
    customer_df = (
        spark.read
        .parquet("s3://data-lake/dimensions/customers")
        .select("customer_id", "name", "tier", "segment")
    )
    enriched_df = filtered_df.join(
        F.broadcast(customer_df), on="customer_id", how="left"
    )

    # The per-customer window can reuse the customer_id partitioning from the read step;
    # the per-segment ranking below still requires a shuffle on segment
    window_spec = Window.partitionBy("customer_id").orderBy("event_date")
    result_df = (
        enriched_df
        .withColumn(
            "running_total",
            F.sum("amount").over(window_spec),
        )
        .withColumn(
            "rank_in_segment",
            F.row_number().over(
                Window.partitionBy("segment").orderBy(F.desc("amount"))
            ),
        )
    )

    # Write with dynamic partition overwrite
    (
        result_df
        .write
        .mode("overwrite")
        .partitionBy("event_date")
        .parquet(output_path)
    )

    raw_df.unpersist()
    logger.info(f"ETL completed. Output: {output_path}")


if __name__ == "__main__":
    spark = create_spark_session("production-etl")
    run_optimized_etl(spark, "s3://raw/transactions", "s3://curated/transactions_enriched")
    spark.stop()

Custom UDF and Pandas UDF Comparison

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, pandas_udf, col
from pyspark.sql.types import StringType, DoubleType
import pandas as pd
import numpy as np
from typing import Iterator

# Reuses create_spark_session from the ETL pipeline listing above
spark = create_spark_session("udf-benchmark")

# Row-at-a-time Python UDF (slow - serialization overhead)
@udf(returnType=StringType())
def categorize_amount_slow(amount: float) -> str:
    """Categorize transaction amount (row-at-a-time UDF)."""
    if amount is None:
        return "unknown"
    elif amount < 10:
        return "micro"
    elif amount < 100:
        return "small"
    elif amount < 1000:
        return "medium"
    else:
        return "large"


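# Vectorized Pandas UDF (fast - Arrow serialization, whole-batch operations)
@pandas_udf(StringType())
def categorize_amount_fast(amounts: pd.Series) -> pd.Series:
    """Categorize transaction amount (vectorized Pandas UDF)."""
    # Conditions are evaluated in order over the whole batch; the first match wins.
    # A per-row Series.apply here would reintroduce a Python-level loop.
    conditions = [
        amounts.isna().to_numpy(),
        (amounts < 10).to_numpy(),
        (amounts < 100).to_numpy(),
        (amounts < 1000).to_numpy(),
    ]
    labels = ["unknown", "micro", "small", "medium"]
    categories = np.select(conditions, labels, default="large")
    return pd.Series(categories, index=amounts.index)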


# Grouped-map Pandas function for per-group transformations
def compute_z_score(pdf: pd.DataFrame) -> pd.DataFrame:
    """Compute the z-score of amount within each customer group."""
    # Return only the columns declared in the output schema
    out = pdf[["transaction_id", "customer_id", "amount"]].copy()
    # std() is NaN for single-row groups and 0 for constant groups;
    # z_score is NaN in both cases
    out["z_score"] = (out["amount"] - out["amount"].mean()) / out["amount"].std()
    return out


# Benchmark: Compare UDF performance
df = spark.read.parquet("s3://raw/transactions")

# Slow: Row-at-a-time UDF
result_slow = df.withColumn("category", categorize_amount_slow(col("amount")))

# Fast: Vectorized Pandas UDF
result_fast = df.withColumn("category", categorize_amount_fast(col("amount")))

# Grouped transformation; the schema string must match the returned columns
result_zscore = df.groupBy("customer_id").applyInPandas(
    compute_z_score,
    schema="transaction_id string, customer_id string, amount double, z_score double",
)

# Performance comparison (via Spark UI or time measurement)
import time

start = time.time()
result_slow.write.mode("overwrite").parquet("/tmp/output_slow")
slow_time = time.time() - start

start = time.time()
result_fast.write.mode("overwrite").parquet("/tmp/output_fast")
fast_time = time.time() - start

print(f"Row UDF time: {slow_time:.2f}s")
print(f"Pandas UDF time: {fast_time:.2f}s")
print(f"Speedup: {slow_time / fast_time:.1f}x")
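
Because the body of a Series-to-Series Pandas UDF is ordinary pandas code, the categorization logic can be exercised on a plain Series before it is wrapped with `@pandas_udf`. The sketch below assumes the same thresholds as the UDFs above; the function name `categorize_amounts` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd


def categorize_amounts(amounts: pd.Series) -> pd.Series:
    """Bucket amounts with whole-Series operations instead of a per-row loop."""
    # Conditions are checked in order; the first matching label is used
    conditions = [
        amounts.isna().to_numpy(),
        (amounts < 10).to_numpy(),
        (amounts < 100).to_numpy(),
        (amounts < 1000).to_numpy(),
    ]
    labels = ["unknown", "micro", "small", "medium"]
    categories = np.select(conditions, labels, default="large")
    return pd.Series(categories, index=amounts.index)


# Boundary values and a missing amount
sample = pd.Series([5.0, 10.0, 99.9, 100.0, 1000.0, np.nan])
result = categorize_amounts(sample)
print(result.tolist())
```

Testing the function this way needs no Spark session, which keeps unit tests for UDF logic fast.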

Best Practices

Prefer DataFrame/Dataset API over RDDs — the Catalyst optimizer provides significant performance improvements with minimal developer effort.
Broadcast small DataFrames (< 10MB) for joins to avoid expensive shuffles on the large side.
Filter early — push predicates as close to the source as possible to minimize data processed in subsequent stages.
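Prefer built-in functions where they exist; when custom logic is required, use Pandas UDFs instead of row-at-a-time Python UDFs. Vectorized execution over Arrow batches can yield speedups from roughly 3x to 100x, depending on the function.
Enable AQE (spark.sql.adaptive.enabled=true, the default since Spark 3.2) for runtime optimization of shuffle partitions, join strategies, and skew handling.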
Cache frequently reused DataFrames with .cache() or .persist(). Monitor cache usage in the Spark UI.
Set spark.sql.shuffle.partitions based on data volume. Target 128MB-256MB per partition after shuffle.
Checkpoint long lineages (every 10-20 transformations) to prevent stack overflow and reduce recovery time.
Use dynamic partition overwrite (spark.sql.sources.partitionOverwriteMode=dynamic) to avoid deleting unrelated partitions.
Monitor executor GC time in the Spark UI. If > 10%, increase executor memory or reduce cache usage.
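
The sizing guidance for `spark.sql.shuffle.partitions` reduces to simple arithmetic. The helper below is a rough sketch, assuming the shuffle volume has already been estimated (for example from the shuffle write size of a previous run in the Spark UI); the function name and defaults are illustrative, not part of Spark.

```python
import math


def suggest_shuffle_partitions(
    shuffle_bytes: int,
    target_partition_bytes: int = 128 * 1024**2,  # 128 MiB, low end of the target range
    min_partitions: int = 1,
) -> int:
    """Return a partition count that keeps each shuffle partition near the target size."""
    if target_partition_bytes <= 0:
        raise ValueError("target_partition_bytes must be positive")
    return max(min_partitions, math.ceil(shuffle_bytes / target_partition_bytes))


# 100 GiB of shuffle data at 128 MiB per partition
print(suggest_shuffle_partitions(100 * 1024**3))
```

With AQE enabled, the configured value acts as a starting point, since small shuffle partitions can be coalesced at runtime.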

Spark Configuration Quick Reference

Configuration	Default	Recommended	Impact
`spark.sql.shuffle.partitions`	200	50-500	More partitions = more parallelism
`spark.sql.autoBroadcastJoinThreshold`	10MB	10-50MB	Larger = more broadcast joins
`spark.sql.adaptive.enabled`	true (false before 3.2)	true	Runtime optimization
`spark.executor.memory`	1g	4-16g	Cache and shuffle memory
`spark.executor.cores`	1 (YARN/Kubernetes); all cores (standalone)	4-8	Concurrent tasks per executor
`spark.driver.memory`	1g	2-8g	Driver-side operations
`spark.sql.execution.arrow.pyspark.enabled`	false	true	Arrow-based `toPandas()` and `createDataFrame(pandas_df)`
`spark.dynamicAllocation.enabled`	false	true	Auto-scale executors
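
Launch-time settings such as executor and driver memory are typically supplied when the application is submitted rather than changed on a running session. As a small sketch, the helper below renders a dictionary of settings into `spark-submit --conf` flags; the helper name, the script path `etl_job.py`, and the chosen values are assumptions for illustration.

```python
import shlex


def spark_submit_command(app_path: str, conf: dict[str, str]) -> str:
    """Build a spark-submit command line with one --conf flag per setting."""
    parts = ["spark-submit"]
    # Sort keys so the generated command is stable across runs
    for key in sorted(conf):
        parts += ["--conf", f"{key}={conf[key]}"]
    parts.append(app_path)
    # Quote each argument so the result is safe to paste into a shell
    return " ".join(shlex.quote(part) for part in parts)


launch_conf = {
    "spark.executor.memory": "8g",
    "spark.executor.cores": "4",
    "spark.driver.memory": "4g",
    "spark.dynamicAllocation.enabled": "true",
}
command = spark_submit_command("etl_job.py", launch_conf)
print(command)
```

Keeping the settings in one dictionary makes it easier to review them against the table above before a deployment.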