PySpark RDD Fundamentals

Narrow vs Wide Transformation Partition Mapping

RDD Lineage Fault Tolerance

RDD Architecture Overview

RDD Transformation DAG

Narrow vs Wide Transformations

Code Examples

Basic RDD Operations

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("RDD_Basics").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Create RDD from collection
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rdd = sc.parallelize(data, 4)  # 4 partitions
print(f"Partitions: {rdd.getNumPartitions()}")

# Narrow transformations (no shuffle)
mapped = rdd.map(lambda x: x * 2)
filtered = rdd.filter(lambda x: x > 5)
flattened = rdd.flatMap(lambda x: [x, x * 10])

# Wide transformation (shuffle)
pairs = rdd.map(lambda x: (x % 3, x))
grouped = pairs.groupByKey()
reduced = pairs.reduceByKey(lambda a, b: a + b)

# Actions (trigger execution)
print(f"Count: {rdd.count()}")
print(f"First: {rdd.first()}")
print(f"Take: {rdd.take(3)}")
print(f"Sum: {rdd.reduce(lambda a, b: a + b)}")

# Check lineage (use an RDD with a wide transformation so the shuffle stage shows up)
print(reduced.toDebugString().decode())

sc.stop()

RDD Persistence Levels

from pyspark import StorageLevel

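# Assumes `rdd` was created from a SparkContext that is still running.

# Cache in memory; equivalent to persist(StorageLevel.MEMORY_ONLY).
# PySpark always stores records in pickled (serialized) form.
rdd.cache()

# An RDD holds one storage level at a time. Assigning a different level
# to an already-persisted RDD raises an error, so unpersist first.
rdd.unpersist()
rdd.persist(StorageLevel.MEMORY_AND_DISK)  # spill to disk when memory runs out

# Other levels are applied the same way, one at a time:
#   StorageLevel.DISK_ONLY  (disk only)
#   StorageLevel.OFF_HEAP   (off-heap memory)
# StorageLevel.MEMORY_ONLY_SER is not defined in PySpark 3.x; because Python
# records are already pickled, MEMORY_ONLY covers that case.

# Unpersist when done
rdd.unpersist()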

Key Concepts Table

Concept	Description	Example
Partition	Logical chunk of data for parallel processing	`rdd.getNumPartitions()`
Lineage	DAG of transformations for fault tolerance	`rdd.toDebugString()`
Narrow Transform	Each input partition feeds at most one output partition, no shuffle	`map()`, `filter()`, `flatMap()`
Wide Transform	Output partitions depend on many input partitions, requires shuffle	`groupByKey()`, `reduceByKey()`, `join()`
Lazy Evaluation	Transforms built but not executed until action	Build DAG → Action triggers execution
Action	Triggers computation, returns result	`collect()`, `count()`, `first()`
Cache/Persist	Store RDD in memory/disk for reuse	`rdd.cache()` or `rdd.persist()`
Checkpoint	Write RDD to reliable storage, truncate lineage; requires a checkpoint directory to be set first	`sc.setCheckpointDir(path)`, then `rdd.checkpoint()`
Broadcast	Read-only variable cached on each executor	`sc.broadcast(variable)`
Accumulator	Add-only shared variable for aggregations; tasks add to it, only the driver reads the value	`sc.accumulator(0)`

Best Practices

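Prefer DataFrames over RDDs: DataFrames go through the Catalyst optimizer and the Tungsten execution engine, so queries are optimized automatically, whereas RDD code runs as written
Use reduceByKey over groupByKey: reduceByKey combines values within each partition before the shuffle, which reduces network I/O
Cache wisely: only cache RDDs that are reused across multiple actions, and unpersist them when they are no longer needed
Partition appropriately: as a rule of thumb, aim for roughly 128MB–200MB per partition
Avoid collect() on large datasets: it pulls every record into driver memory; use take(n) to sample or foreach() to process records on the executors instead
Use coalesce() to reduce partitions: it merges existing partitions and avoids the full shuffle that repartition() performs
Enable Kryo serialization: the Spark tuning guide describes it as often as much as 10x faster than Java serialization; in PySpark, RDD records are pickled by Python, so the gain mainly applies to JVM-side data
Monitor shuffle spill: spilling to disk during a shuffle indicates memory pressure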
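
To make the reduceByKey recommendation concrete, here is a toy model in plain Python (no Spark required) that counts how many records would cross the shuffle boundary with and without combining values inside each partition first. The partition layout and the helper names `shuffle_records_group`, `shuffle_records_reduce`, and `final_sums` are illustrative assumptions, not Spark internals:

```python
from collections import defaultdict

# Four hypothetical input partitions of (key, value) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("a", 1), ("b", 1)],
    [("b", 1), ("b", 1), ("b", 1)],
    [("a", 1), ("c", 1), ("c", 1)],
]

def shuffle_records_group(parts):
    """groupByKey-style: every input record is sent across the shuffle."""
    return sum(len(p) for p in parts)

def shuffle_records_reduce(parts):
    """reduceByKey-style: pre-aggregate per key in each partition,
    then send one record per distinct key per partition."""
    sent = 0
    for p in parts:
        local = defaultdict(int)
        for key, value in p:
            local[key] += value
        sent += len(local)
    return sent

def final_sums(parts):
    """Per-key totals, which are the same under either strategy."""
    totals = defaultdict(int)
    for p in parts:
        for key, value in p:
            totals[key] += value
    return dict(totals)

print("records shuffled without combining:", shuffle_records_group(partitions))
print("records shuffled with combining:   ", shuffle_records_reduce(partitions))
print("per-key sums:", final_sums(partitions))
```

In this model the saving grows with the number of repeated keys per partition. If every key appears only once per partition, both strategies send the same number of records.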
