Spark Fundamentals for Big Data

Module 15: Data Engineering & MLOps

Apache Spark is a unified analytics engine for large-scale data processing. This module covers the core abstractions (RDDs, DataFrames, Spark SQL, and the Catalyst optimizer), followed by partitioning, MLlib pipelines, and performance tuning.

Spark Architecture

Figure: Spark application architecture. The driver program (JVM) holds the SparkContext / SparkSession and talks to a cluster manager (YARN / Mesos / K8s / Standalone). The cluster manager allocates executors (JVMs) on the worker nodes (three in the figure). Each executor holds cache / broadcast data and runs tasks in parallel on data partitions.

1. RDDs (Resilient Distributed Datasets)

RDDs are Spark's foundational abstraction — immutable, partitioned collections of elements that can be operated on in parallel.

Key Properties

  • Resilient: Fault-tolerant via lineage graph
  • Distributed: Data split across partitions on multiple nodes
  • Dataset: A collection of records (primitive values or arbitrary objects) organized into partitions
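To make the "fault-tolerant via lineage" property concrete, here is a minimal pure-Python sketch. `ToyRDD` is a hypothetical class invented for this illustration and is not part of Spark. It only assumes that a dataset remembers its parent and the function that derives it, so a lost partition can be recomputed rather than restored from a replica.

```python
from typing import Callable, Dict, List, Optional


class ToyRDD:
    """Hypothetical stand-in for an RDD: it stores how to compute each partition."""

    def __init__(self,
                 source: Optional[List[List[int]]] = None,
                 parent: Optional["ToyRDD"] = None,
                 fn: Optional[Callable[[List[int]], List[int]]] = None):
        self.source = source                   # base partitions (root dataset only)
        self.parent = parent                   # lineage pointer to the parent dataset
        self.fn = fn                           # per-partition function applied to the parent
        self.cache: Dict[int, List[int]] = {}  # partition index -> materialized data

    def map(self, f: Callable[[int], int]) -> "ToyRDD":
        # Record the step in the lineage; nothing is computed here.
        return ToyRDD(parent=self, fn=lambda part: [f(x) for x in part])

    def num_partitions(self) -> int:
        if self.parent is None:
            return len(self.source)
        return self.parent.num_partitions()

    def compute(self, i: int) -> List[int]:
        # Serve from cache if present, otherwise walk the lineage upward.
        if i in self.cache:
            return self.cache[i]
        if self.parent is None:
            data = list(self.source[i])
        else:
            data = self.fn(self.parent.compute(i))
        self.cache[i] = data
        return data

    def collect(self) -> List[int]:
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]


base = ToyRDD(source=[[1, 2], [3, 4], [5]])
squared = base.map(lambda x: x ** 2)
first = squared.collect()

# Simulate losing the materialized copy of partition 1 on a failed node.
squared.cache.pop(1)
base.cache.pop(1)

# Only the lost partition is rebuilt, from the source data and the recorded function.
recovered = squared.compute(1)
```

Only partition 1 is recomputed here; partitions 0 and 2 stay cached. The same idea lets a real cluster rebuild just the lost partitions after an executor fails.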

Transformations vs Actions

from pyspark import SparkContext

sc = SparkContext("local", "RDDExample")

# Create an RDD from a local Python list
rdd = sc.parallelize([1, 2, 3, 4, 5])

# Transformations (lazy): these only record lineage; nothing runs yet
squared = rdd.map(lambda x: x ** 2)
filtered = squared.filter(lambda x: x > 10)

# Actions (trigger computation)
result = filtered.collect()  # [16, 25]
total = filtered.reduce(lambda a, b: a + b)  # 41

Narrow vs Wide Dependencies

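Figure: Narrow vs wide dependencies.

  • Narrow dependency: each parent partition feeds at most one child partition (P0 → P0', P1 → P1', P2 → P2'), so no data moves between nodes. Examples: map, filter, union.
  • Wide dependency: each parent partition can feed many child partitions, which requires a shuffle. Examples: reduceByKey, groupByKey, join.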
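The difference can be sketched in plain Python, without Spark. The helper names (`partition_of`, `map_values`, `reduce_by_key`) and the sample data are made up for this illustration, and the CRC-based partitioner is an assumption chosen only to keep the output deterministic. The point is that the narrow step touches each partition in isolation, while the wide step has to regroup records by key across partitions.

```python
import zlib
from collections import defaultdict

# Three input partitions of (key, value) records (illustrative data).
partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("c", 6)],
]


def partition_of(key: str, num_partitions: int) -> int:
    # Deterministic hash partitioner (an assumption for this sketch).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def map_values(parts, f):
    # Narrow: output partition i depends only on input partition i.
    return [[(k, f(v)) for k, v in part] for part in parts]


def reduce_by_key(parts, f, num_out):
    # Wide: every input partition may send records to every output partition.
    buckets = [defaultdict(list) for _ in range(num_out)]
    for part in parts:
        for k, v in part:
            buckets[partition_of(k, num_out)][k].append(v)  # the "shuffle"
    result = []
    for bucket in buckets:
        reduced = {}
        for k, values in bucket.items():
            acc = values[0]
            for v in values[1:]:
                acc = f(acc, v)
            reduced[k] = acc
        result.append(reduced)
    return result


doubled = map_values(partitions, lambda v: v * 2)
totals = reduce_by_key(doubled, lambda a, b: a + b, num_out=2)
merged = {k: v for part in totals for k, v in part.items()}
```

After the shuffle, each key lives in exactly one output partition, which is what makes the per-key reduction possible and also what makes the step expensive.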

2. DataFrames and Spark SQL

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg, count, window

spark = SparkSession.builder \
    .appName("DataEngineering") \
    .config("spark.sql.shuffle.partitions", 200) \
    .getOrCreate()

df = spark.read.parquet("s3://bucket/events/")

# DataFrame API
result = df \
    .filter(col("event_type") == "purchase") \
    .groupBy(window("timestamp", "1 hour")) \
    .agg(
        avg("amount").alias("avg_amount"),
        count("*").alias("total_purchases")
    )

# Spark SQL
df.createOrReplaceTempView("events")
hourly_sql = spark.sql("""
    SELECT
        date_trunc('hour', timestamp) AS hour,
        AVG(amount) AS avg_amount,
        COUNT(*) AS total_purchases
    FROM events
    WHERE event_type = 'purchase'
    GROUP BY 1
    ORDER BY 1
""")
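Both queries above group purchases into one-hour buckets. As a rough illustration of what that grouping computes, the following plain-Python sketch reproduces it on a handful of made-up events. The sample rows and the `hour_window` helper are assumptions for this example, not part of the Spark API.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Illustrative sample rows; the real data would come from the Parquet source.
events = [
    {"timestamp": datetime(2024, 1, 1, 9, 5), "event_type": "purchase", "amount": 10.0},
    {"timestamp": datetime(2024, 1, 1, 9, 50), "event_type": "purchase", "amount": 30.0},
    {"timestamp": datetime(2024, 1, 1, 9, 55), "event_type": "view", "amount": 0.0},
    {"timestamp": datetime(2024, 1, 1, 10, 10), "event_type": "purchase", "amount": 5.0},
]


def hour_window(ts: datetime):
    """Return the [start, end) bounds of the one-hour bucket containing ts."""
    start = ts.replace(minute=0, second=0, microsecond=0)
    return start, start + timedelta(hours=1)


# Filter to purchases, then collect amounts per window start.
amounts = defaultdict(list)
for e in events:
    if e["event_type"] == "purchase":
        start, _end = hour_window(e["timestamp"])
        amounts[start].append(e["amount"])

# Aggregate each bucket, ordered by hour.
hourly = {
    start: {"avg_amount": sum(v) / len(v), "total_purchases": len(v)}
    for start, v in sorted(amounts.items())
}
```

One practical difference between the two Spark versions: `window("timestamp", "1 hour")` produces a struct column with both `start` and `end`, whereas `date_trunc('hour', timestamp)` yields only the bucket start.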

3. Lazy Evaluation and Catalyst Optimizer

Figure: Catalyst optimizer pipeline.

Unresolved Plan → Analysis (resolve references) → Logical Plan (optimize) → Physical Plan (select algorithms) → Code Generation

Key optimizations:

  • Predicate pushdown
  • Column pruning
  • Constant folding
  • Join reordering
  • Whole-stage code generation
  • Tungsten execution

Because transformations are lazy, Spark sees the whole plan and optimizes it before anything executes.
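As a toy illustration of one rule-based rewrite, the sketch below applies constant folding to a small expression tree. The tuple-based tree format and the `fold` function are invented for this example. Catalyst's real implementation operates on its own tree classes in Scala, so this shows only the idea, not the actual API.

```python
import operator

# Supported binary operators for the toy expression language.
OPS = {"+": operator.add, "*": operator.mul, ">": operator.gt}


def fold(expr):
    """Recursively replace sub-expressions made only of literals with their value.

    An expression is ("lit", value), ("col", name), or (op, left, right).
    """
    if expr[0] in ("lit", "col"):
        return expr
    op, left, right = expr
    left, right = fold(left), fold(right)
    if left[0] == "lit" and right[0] == "lit":
        # Both sides are known before execution, so evaluate once at plan time.
        return ("lit", OPS[op](left[1], right[1]))
    return (op, left, right)


# amount > 10 * (2 + 3)
predicate = (">", ("col", "amount"), ("*", ("lit", 10), ("+", ("lit", 2), ("lit", 3))))
optimized = fold(predicate)  # (">", ("col", "amount"), ("lit", 50))
```

The rewrite is only possible because the full expression is available before any row is processed, which is the practical payoff of lazy evaluation.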

4. Partitioning Strategy

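# Repartition by key (full shuffle); returns a new DataFrame
df_by_user = df.repartition(100, "user_id")

# Coalesce to reduce partitions (avoids a full shuffle); returns a new DataFrame
df_small = df.coalesce(10)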

# Bucketing for frequent joins
df.write \
    .bucketBy(256, "user_id") \
    .sortBy("user_id") \
    .saveAsTable("bucketed_events")

Partition Tuning Rules

Scenario                    Recommendation
Small files problem         Coalesce before write
Large shuffle operations    Increase spark.sql.shuffle.partitions
Join on key                 Partition by join key
Time-series range           Partition by date
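One way to turn these rules into numbers is to derive a partition count from the data volume. The helper below is a hypothetical sketch. The 128 MiB target per partition is an assumed rule of thumb, not a value from this lesson, and should be tuned for the actual workload and cluster.

```python
import math


def estimate_partitions(total_bytes: int,
                        target_partition_bytes: int = 128 * 1024**2,  # assumed target
                        min_partitions: int = 1) -> int:
    """Estimate how many partitions keep each one near the target size."""
    return max(min_partitions, math.ceil(total_bytes / target_partition_bytes))


# A 50 GiB shuffle at about 128 MiB per partition
shuffle_partitions = estimate_partitions(50 * 1024**3)  # 400

# A 300 MiB output written as a few reasonably sized files
output_files = estimate_partitions(300 * 1024**2)  # 3
```

The first value could be passed to `spark.conf.set("spark.sql.shuffle.partitions", ...)` and the second to `df.coalesce(...)` before a write.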

5. Spark MLlib

from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml import Pipeline

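# feature_cols (list of numeric column names), train_df, and test_df
# are assumed to be defined elsewhere.
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")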
indexer = StringIndexer(inputCol="label", outputCol="label_idx")
rf = RandomForestClassifier(featuresCol="features", labelCol="label_idx")

pipeline = Pipeline(stages=[indexer, assembler, rf])
model = pipeline.fit(train_df)
predictions = model.transform(test_df)

6. Performance Tuning Checklist

  • Cache wisely: df.cache() for data reused across several actions; df.persist(storageLevel) to choose the storage level explicitly
  • Avoid shuffles: Use broadcast joins for small tables (broadcast(small_df))
  • Tune memory: spark.executor.memory, spark.driver.memory
  • AQE: Adaptive Query Execution (spark.sql.adaptive.enabled=true; on by default since Spark 3.2)
  • Broadcast threshold: spark.sql.autoBroadcastJoinThreshold sets the maximum table size that is broadcast automatically
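To see why a broadcast join avoids a shuffle, consider this plain-Python sketch of the underlying idea. The function `broadcast_hash_join` and the sample rows are hypothetical, and the sketch assumes join keys are unique in the small table. The small table becomes an in-memory lookup that is copied to every task, so each partition of the large table is joined where it already sits.

```python
def broadcast_hash_join(large_partitions, small_rows, key):
    """Inner-join each partition of the large side against a shared lookup table.

    Assumes `key` is unique within small_rows (a simplification for this sketch).
    """
    # Build the lookup once; in a cluster this is what gets shipped to every executor.
    lookup = {row[key]: row for row in small_rows}
    joined = []
    for part in large_partitions:
        out = []
        for row in part:
            match = lookup.get(row[key])
            if match is not None:
                out.append({**match, **row})
        joined.append(out)  # partition layout of the large side is unchanged
    return joined


# Illustrative data: a partitioned fact table and a small dimension table.
events = [
    [{"user_id": 1, "amount": 10.0}, {"user_id": 2, "amount": 5.0}],
    [{"user_id": 1, "amount": 7.5}, {"user_id": 3, "amount": 1.0}],
]
users = [{"user_id": 1, "country": "DE"}, {"user_id": 2, "country": "US"}]

joined = broadcast_hash_join(events, users, "user_id")
```

No rows of the large table move between partitions. The cost is holding a full copy of the small table in every executor's memory, which is why the size threshold matters.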

Key Takeaways

  • RDDs provide low-level control; DataFrames leverage Catalyst optimization
  • Lazy evaluation enables whole-plan optimization before execution
  • Partitioning is the primary lever for performance tuning
  • Shuffle is expensive — design pipelines to minimize data movement
