Spark Fundamentals for Big Data
Apache Spark is a unified analytics engine for large-scale data processing. This module covers the core abstractions: RDDs, DataFrames, Spark SQL, and the Catalyst optimizer.
Spark Architecture
1. RDDs (Resilient Distributed Datasets)
RDDs are Spark's foundational abstraction — immutable, partitioned collections of elements that can be operated on in parallel.
Key Properties
- Resilient: Fault-tolerant via lineage graph
- Distributed: Data split across partitions on multiple nodes
- Dataset: Collection of partitioned records, which can be primitive values or arbitrary objects such as tuples
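To make the "resilient via lineage" property concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's implementation: `ToyRDD`, `map_partitions`, and `recover` are hypothetical names invented for this illustration. The idea it shows is that a derived dataset remembers its parent and the function that produced it, so a lost partition can be recomputed instead of restored from a replica.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ToyRDD:
    """Toy stand-in for an RDD: partitions plus the recipe (lineage) to rebuild them."""
    partitions: List[Optional[list]]              # None marks a lost partition
    parent: Optional["ToyRDD"] = None             # upstream dataset, if any
    fn: Optional[Callable[[list], list]] = None   # per-partition function applied to the parent

    def map_partitions(self, fn: Callable[[list], list]) -> "ToyRDD":
        # Keep the lineage (parent + function) alongside the computed data.
        return ToyRDD([fn(p) for p in self.partitions], parent=self, fn=fn)

    def recover(self, index: int) -> list:
        # Rebuild one lost partition from the parent rather than from a replica.
        if self.partitions[index] is None:
            if self.parent is None or self.fn is None:
                raise RuntimeError("source partition lost and no lineage to rebuild it")
            self.partitions[index] = self.fn(self.parent.recover(index))
        return self.partitions[index]


base = ToyRDD([[1, 2], [3, 4], [5]])
squared = base.map_partitions(lambda part: [x ** 2 for x in part])

squared.partitions[1] = None     # simulate losing one partition
rebuilt = squared.recover(1)     # recomputed from base via the recorded lineage
```

Only the lost partition is recomputed; the other two are left untouched, which is the main reason lineage-based recovery is cheap when dependencies are narrow.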
Transformations vs Actions
from pyspark import SparkContext
sc = SparkContext("local", "RDDExample")
# Create an RDD from a local collection
rdd = sc.parallelize([1, 2, 3, 4, 5])
# Transformations (lazy): these only record lineage, nothing runs yet
squared = rdd.map(lambda x: x ** 2)
filtered = squared.filter(lambda x: x > 10)
# Action (triggers computation)
result = filtered.collect() # [16, 25]
total = filtered.reduce(lambda a, b: a + b) # 41
Narrow vs Wide Dependencies
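In a narrow dependency, each output partition is computed from a single parent partition (as with `map` or `filter`). In a wide dependency, an output partition may need records from many parent partitions, which forces a shuffle (as with `groupByKey` or `reduceByKey`). The sketch below models both in plain Python under simplifying assumptions: partitions are lists, `shuffle_by_key` is a hypothetical helper, and CRC32 stands in for whatever hash function a real partitioner would use.

```python
from collections import defaultdict
from zlib import crc32

partitions = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("c", 6)],
]

# Narrow: each output partition is built from exactly one input partition.
doubled = [[(k, v * 2) for k, v in part] for part in partitions]


def shuffle_by_key(parts, num_out):
    """Wide: any output partition may receive records from any input partition."""
    out = [defaultdict(list) for _ in range(num_out)]
    sources = [set() for _ in range(num_out)]   # which inputs fed each output
    for src, part in enumerate(parts):
        for key, value in part:
            target = crc32(key.encode()) % num_out   # deterministic hash of the key
            out[target][key].append(value)
            sources[target].add(src)
    return [dict(d) for d in out], sources


grouped, sources = shuffle_by_key(doubled, num_out=2)

# Flatten for inspection: every key ends up in exactly one output partition.
merged = {k: v for part in grouped for k, v in part.items()}
```

The `sources` sets show the cost of the wide step: any output partition that holds a key had to read from more than one input partition, which is the data movement a shuffle pays for.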
2. DataFrames and Spark SQL
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, count, window
spark = SparkSession.builder \
    .appName("DataEngineering") \
    .config("spark.sql.shuffle.partitions", 200) \
    .getOrCreate()
df = spark.read.parquet("s3://bucket/events/")
# DataFrame API
result = df \
    .filter(col("event_type") == "purchase") \
    .groupBy(window("timestamp", "1 hour")) \
    .agg(
        avg("amount").alias("avg_amount"),
        count("*").alias("total_purchases")
    )
# Spark SQL
df.createOrReplaceTempView("events")
hourly_df = spark.sql("""
SELECT
date_trunc('hour', timestamp) AS hour,
AVG(amount) AS avg_amount,
COUNT(*) AS total_purchases
FROM events
WHERE event_type = 'purchase'
GROUP BY 1
ORDER BY 1
""")
3. Lazy Evaluation and Catalyst Optimizer
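Transformations only describe work; nothing executes until an action is called, which gives the engine a chance to inspect the whole plan first. The following toy pipeline illustrates that deferral in plain Python. It is a sketch under stated assumptions: `LazyPlan` and its methods are hypothetical names for this example, and it does not model Catalyst's actual rule-based or cost-based optimization.

```python
from typing import Any, Callable, List, Tuple


class LazyPlan:
    """Toy lazy pipeline: transformations are recorded, an action runs them."""

    def __init__(self, data: List[Any], steps: Tuple[Tuple[str, Callable], ...] = ()):
        self._data = data
        self._steps = steps

    def map(self, fn: Callable) -> "LazyPlan":
        return LazyPlan(self._data, self._steps + (("map", fn),))

    def filter(self, pred: Callable) -> "LazyPlan":
        return LazyPlan(self._data, self._steps + (("filter", pred),))

    def explain(self) -> List[str]:
        # The full plan is visible before any data is touched.
        return [kind for kind, _ in self._steps]

    def collect(self) -> List[Any]:
        # Action: run every recorded step in a single pass over the data.
        out = []
        for item in self._data:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []


def square(x):
    calls.append(x)   # side effect used only to observe when work happens
    return x ** 2


plan = LazyPlan([1, 2, 3, 4, 5]).map(square).filter(lambda x: x > 10)
calls_before_action = len(calls)   # nothing has executed yet
result = plan.collect()            # the action triggers the whole pipeline
```

Because the plan exists as data before it runs, an optimizer can rewrite it (for example by pruning columns or pushing filters toward the source) without changing the result. In real Spark, `df.explain()` prints the plans Catalyst produces.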
4. Partitioning Strategy
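# Repartition by key (full shuffle); DataFrames are immutable, so keep the result
by_user = df.repartition(100, "user_id")
# Coalesce to reduce partitions without a full shuffle; also returns a new DataFrame
compacted = df.coalesce(10)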
# Bucketing for frequent joins
df.write \
    .bucketBy(256, "user_id") \
    .sortBy("user_id") \
    .saveAsTable("bucketed_events")
Partition Tuning Rules
| Scenario | Recommendation |
|---|---|
| Small files problem | Coalesce before write |
| Large shuffle operations | Increase spark.sql.shuffle.partitions |
| Join on key | Partition by join key |
| Time-series range | Partition by date |
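The "partition by join key" rule can be illustrated with a small plain-Python model. The assumptions are labeled in the code: `partition_by_key` is a hypothetical helper, CRC32 stands in for the real partitioner, and the partition count is arbitrary. When both sides use the same key, hash function, and partition count, matching rows always land in the same partition index, so each pair of partitions can be joined locally.

```python
from zlib import crc32

NUM_PARTITIONS = 4   # hypothetical value; both sides must use the same count


def partition_by_key(rows, key_index=0, num_partitions=NUM_PARTITIONS):
    """Assign each row to a partition based on a deterministic hash of its key."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        key = str(row[key_index]).encode()
        parts[crc32(key) % num_partitions].append(row)
    return parts


users = [(1, "ana"), (2, "bo"), (3, "cy"), (4, "di")]
events = [(1, "view"), (3, "purchase"), (1, "purchase"), (4, "view")]

user_parts = partition_by_key(users)
event_parts = partition_by_key(events)

# Join partition i of one side with partition i of the other; no cross-partition reads.
joined = []
for u_part, e_part in zip(user_parts, event_parts):
    names = dict(u_part)
    joined.extend((uid, names[uid], ev) for uid, ev in e_part if uid in names)
```

This is roughly the motivation behind bucketing both tables on the join key: the partitioning work is paid once at write time rather than as a shuffle on every join.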
5. Spark MLlib
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml import Pipeline
# feature_cols, train_df, and test_df are assumed to be defined earlier
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
indexer = StringIndexer(inputCol="label", outputCol="label_idx")
rf = RandomForestClassifier(featuresCol="features", labelCol="label_idx")
pipeline = Pipeline(stages=[indexer, assembler, rf])
model = pipeline.fit(train_df)
predictions = model.transform(test_df)
6. Performance Tuning Checklist
- Cache wisely: df.cache() for repeated use; df.persist() with storage level
- Avoid shuffles: Use broadcast joins for small tables (broadcast(small_df))
- Tune memory: spark.executor.memory, spark.driver.memory
- AQE: Adaptive Query Execution (spark.sql.adaptive.enabled=true)
- Broadcast threshold: spark.sql.autoBroadcastJoinThreshold
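The broadcast-join item in the checklist can be sketched without Spark. In this toy model (the table contents and the `broadcast_join` helper are hypothetical), the small table is copied to every partition of the large table and the join happens in place, so the large table's rows never move between partitions.

```python
small_table = {"US": "United States", "DE": "Germany"}   # assumed small enough to copy everywhere

big_partitions = [
    [("o1", "US"), ("o2", "DE")],
    [("o3", "US"), ("o4", "FR")],   # "FR" has no match in the small table
]


def broadcast_join(partitions, lookup):
    """Inner-join each partition against its own copy of the lookup table."""
    result = []
    for part in partitions:
        local_copy = dict(lookup)   # stands in for the broadcast copy held by one executor
        result.append(
            [(oid, cc, local_copy[cc]) for oid, cc in part if cc in local_copy]
        )
    return result


joined_parts = broadcast_join(big_partitions, small_table)
```

The output keeps the large table's original partitioning, which is why a broadcast join avoids the shuffle that a sort-merge join would need. The trade-off is memory: every executor holds a full copy of the small side.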
Key Takeaways
- RDDs provide low-level control; DataFrames leverage Catalyst optimization
- Lazy evaluation enables whole-plan optimization before execution
- Partitioning is one of the main levers for performance tuning: it sets the degree of parallelism and how much data must be shuffled
- Shuffle is expensive — design pipelines to minimize data movement