⚡ PySpark UDF Optimization

UDF Execution Flow: Row-by-Row vs Batch

🏗️ Architecture Diagram

📚 Detailed Explanation

1. What is a UDF?

A User Defined Function (UDF) is a function defined by the user that can be used in Spark SQL and DataFrame operations.

Types of UDFs in PySpark:

Python UDF: Standard Python function processed row-by-row
Pandas UDF: Vectorized function using Pandas/NumPy operations
Grouped Map UDF: Processes groups of rows as Pandas DataFrames

2. Python UDF Performance Characteristics

Python UDFs have significant performance overhead.

Serialization Overhead:

Each row is serialized from JVM to Python (Py4J)
Python processes the row
Result is serialized back to JVM
This happens for EVERY row

Process Switch Overhead:

Context switching between JVM and Python
Memory copying between processes
Garbage collection in both JVM and Python

Python Execution Overhead:

Python is interpreted (no JIT compilation)
GIL limits true parallelism
Object creation/destruction overhead

Performance Impact:

~10-100x slower than native Spark operations
Significant memory overhead (Python object duplication)
High GC pressure in both JVM and Python

3. Pandas UDF Performance Characteristics

Pandas UDFs (introduced in Spark 2.3) dramatically improve performance.

Vectorized Processing:

Process batches of rows at once (default: 1000 rows)
Use Pandas/NumPy for vectorized operations
Avoid per-row function call overhead

Arrow Serialization:

Zero-copy data transfer between JVM and Python
Columnar format (cache-friendly)
No per-row serialization/deserialization

Performance Improvement:

~10x faster than Python UDFs
Lower memory usage (Arrow format)
Better GC behavior

4. When to Use UDFs

Scenario	Recommendation
Built-in function available	Use built-in (fastest, JVM-native)
Complex logic, vectorizable	Use Pandas UDF (10x faster than Python UDF)
Simple row-level logic	Use Python UDF (slowest but flexible)
Need external services	Use Python UDF

Use Built-in Functions When Possible:

# GOOD: Built-in function
from pyspark.sql.functions import col, upper, concat
df.withColumn("name_upper", upper(col("name")))

# BAD: UDF for same thing
@udf(StringType())
def my_upper(s):
    return s.upper() if s else None
df.withColumn("name_upper", my_upper(col("name")))

5. Pandas UDF Types

Scalar Pandas UDF:

@pandas_udf(DoubleType())
def multiply_by_two(pdf: pd.Series) -> pd.Series:
    return pdf * 2

Grouped Map Pandas UDF:

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def normalize(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf['value'] = (pdf['value'] - pdf['value'].mean()) / pdf['value'].std()
    return pdf

Grouped Aggregate Pandas UDF:

@pandas_udf(DoubleType(), PandasUDFType.GROUPED_AGG)
def mean_udf(v: pd.Series) -> float:
    return v.mean()

6. Arrow Configuration

Apache Arrow enables efficient data transfer between JVM and Python.

Configuration:

spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "1000")
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

Batch Size Tuning:

Default: 1000 rows per batch
Larger batches: Better throughput, more memory
Smaller batches: Lower latency, less memory

7. UDF Testing and Debugging

Testing UDFs:

# Unit test Python UDF
def test_my_udf():
    assert my_udf("hello") == "HELLO"
    assert my_udf(None) is None

# Test with Spark
df = spark.createDataFrame([("hello",)], ["text"])
result = df.withColumn("upper", my_udf(col("text"))).collect()
assert result[0]["upper"] == "HELLO"

Debugging UDFs:

import logging
logging.basicConfig(level=logging.DEBUG)

@udf(StringType())
def debug_udf(x):
    logging.debug(f"Input: {x}")
    result = x.upper() if x else None
    logging.debug(f"Output: {result}")
    return result

8. Common UDF Pitfalls

Pitfall	Example	Solution
Using global variables	`LOOKUP = {...}` not serialized to executors	Use broadcast variables
Not handling nulls	`x * 2` fails if x is None	Add null checks
Using Python UDF when Pandas UDF works	Row-by-row processing	Use vectorized Pandas UDF

Pitfall 1: Using global variables

# BAD: Global variable not broadcast
LOOKUP = {"a": 1, "b": 2}

@udf(IntegerType())
def lookup_value(key):
    return LOOKUP.get(key)  # Error on executors!

# GOOD: Use broadcast variable
broadcast_lookup = spark.sparkContext.broadcast({"a": 1, "b": 2})

@udf(IntegerType())
def lookup_value(key):
    return broadcast_lookup.value.get(key)

Pitfall 2: Not handling nulls

# BAD: No null handling
@udf(DoubleType())
def process_value(x):
    return x * 2  # Error if x is None

# GOOD: Handle nulls
@udf(DoubleType())
def process_value(x):
    return x * 2 if x is not None else None

Golden Rule: Always prefer built-in functions. Use UDFs only when built-in functions cannot express the required logic.

🔑 Key Concepts Table

UDF Type	Processing	Serialization	Speed	Memory	Use Case
Built-in	JVM native	None	★★★★★	★★★★★	Always prefer
Pandas UDF	Batch (Vectorized)	Arrow	★★★★☆	★★★★☆	Complex logic
Python UDF	Row-by-row	Py4J	★★☆☆☆	★★☆☆☆	Simple logic
Grouped Map	Group batch	Arrow	★★★★☆	★★★★☆	Group operations

💻 Code Examples

Example 1: Python UDF vs Pandas UDF

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, pandas_udf, col
from pyspark.sql.types import DoubleType
import pandas as pd

spark = SparkSession.builder.appName("UDFOptimization").getOrCreate()

# Create test data
df = spark.range(1000000).withColumn("value", col("id") * 1.0)

# Python UDF (slow)
# Parameter: returnType — Spark SQL return type
# Processes one row at a time via Py4J serialization
@udf(DoubleType())
def python_double(x):
    return x * 2 if x is not None else None

# Pandas UDF (fast)
# Parameter: returnType — Spark SQL return type
# Processes batches via Arrow, vectorized pandas operations
@pandas_udf(DoubleType())
def pandas_double(pdf: pd.Series) -> pd.Series:
    return pdf * 2

# Benchmark
import time

# Python UDF
start = time.time()
result_python = df.withColumn("doubled", python_double(col("value")))
result_python.count()
python_time = time.time() - start

# Pandas UDF
start = time.time()
result_pandas = df.withColumn("doubled", pandas_double(col("value")))
result_pandas.count()
pandas_time = time.time() - start

print(f"Python UDF: {python_time:.2f}s")
print(f"Pandas UDF: {pandas_time:.2f}s")
print(f"Speedup: {python_time / pandas_time:.1f}x")

Example 2: Grouped Map Pandas UDF

from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *
import pandas as pd

# Define schema
# Parameter: StructType — defines the output schema
schema = StructType([
    StructField("group", IntegerType()),
    StructField("value", DoubleType()),
    StructField("normalized", DoubleType())
])

# Grouped Map UDF
# Parameters:
#   schema — output schema for each group
#   PandasUDFType.GROUPED_MAP — processes each group as a DataFrame
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def normalize(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf['normalized'] = (pdf['value'] - pdf['value'].mean()) / pdf['value'].std()
    return pdf

# Create data
df = spark.range(100000).withColumn(
    "group", col("id") % 10
).withColumn(
    "value", col("id") * 1.0
)

# Apply UDF
# groupby("group") partitions data by group
# apply(normalize) applies UDF to each group
result = df.groupby("group").apply(normalize)
result.show()

Example 3: Arrow Configuration

# Enable Arrow for better performance
# Parameter: spark.sql.execution.arrow.enabled — enable Arrow for all UDFs
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
# Parameter: spark.sql.execution.arrow.maxRecordsPerBatch — batch size (default 1000)
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "1000")
# Parameter: spark.sql.execution.arrow.pyspark.enabled — enable for PySpark
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Pandas UDF with Arrow
@pandas_udf(DoubleType())
def efficient_udf(pdf: pd.Series) -> pd.Series:
    return pdf * 2

# This will use Arrow for efficient data transfer
result = df.withColumn("doubled", efficient_udf(col("value")))

Example 4: UDF with Broadcast Variable

from pyspark.sql.functions import udf, broadcast
from pyspark.sql.types import IntegerType

# Create lookup table
lookup_data = {i: i * 10 for i in range(100)}
# Broadcast to all executors
broadcast_lookup = spark.sparkContext.broadcast(lookup_data)

# UDF using broadcast
@udf(IntegerType())
def lookup_value(key):
    # Access broadcast value via .value
    return broadcast_lookup.value.get(key)

# Apply UDF
df = spark.range(1000).withColumn("key", col("id") % 100)
result = df.withColumn("lookup_result", lookup_value(col("key")))
result.show()

📊 Performance Metrics

Method	1M Rows	10M Rows	Memory	GC	Complexity
Built-in	100ms	800ms	50MB	Low	N/A
Pandas UDF	500ms	4s	100MB	Medium	Medium
Python UDF	5s	50s	200MB	High	High
Arrow Batch	400ms	3.5s	80MB	Low	Medium
Grouped Map	800ms	7s	150MB	Medium	High

✅ Best Practices

1. Prefer Built-in Functions

# GOOD: Use built-in functions
from pyspark.sql.functions import upper, concat, when
df.withColumn("name_upper", upper(col("name")))

# BAD: UDF for same thing
@udf(StringType())
def my_upper(s):
    return s.upper() if s else None

2. Use Pandas UDF When Possible

# BAD: Python UDF
@udf(DoubleType())
def process(x):
    return x * 2

# GOOD: Pandas UDF
@pandas_udf(DoubleType())
def process(pdf: pd.Series) -> pd.Series:
    return pdf * 2

3. Enable Arrow

spark.conf.set("spark.sql.execution.arrow.enabled", "true")
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "1000")

4. Handle Nulls

@pandas_udf(DoubleType())
def safe_process(pdf: pd.Series) -> pd.Series:
    return pdf * 2  # Pandas handles NaN automatically

5. Use Broadcast for Large Lookups

lookup = spark.sparkContext.broadcast(large_dict)

@udf(IntegerType())
def lookup_value(key):
    return lookup.value.get(key)

6. Tune Batch Size

# Larger batches for better throughput
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "5000")

# Smaller batches for lower latency
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "100")

PySpark UDF Optimization: UDFs, Pandas UDFs, and Performance