Partitioning & Bucketing in Data Lakes
Optimizing data layout for query performance and cost
Interview Question
"You have a 50TB Parquet dataset in S3 storing clickstream data. Queries typically filter by user_id and timestamp. Users often ask 'show me all clicks for user X in the last 7 days.' Design an optimal partitioning and bucketing strategy. What are the tradeoffs?"
Difficulty: Hard | Frequently asked at Databricks, AWS, Netflix, Uber
Theoretical Foundation
Why Partitioning Matters
Partitioning divides data into logical groups based on column values. The goal is partition pruningβreading only relevant files during queries.
Without partitioning:
Query: WHERE user_id = 'abc' AND date = '2024-01-15'
β Must scan ALL 50TB β
With partitioning:
Query: WHERE user_id = 'abc' AND date = '2024-01-15'
β Reads only 1 directory β
Types of Partitioning
1. Directory-Based Partitioning (Hive-Style)
Data is physically stored in directories named by partition column values:
Pros:
- Simple to understand and debug
- Works with all file formats (Parquet, ORC, JSON)
- Compatible with Spark, Hive, Presto, Athena
Cons:
- Limited partition columns (too many = too many small files)
- Partition values are fixed at write time
- Requires explicit partition discovery
2. Iceberg Partitioning
Apache Iceberg uses hidden partitioning with partition transforms:
-- Iceberg table with partition transforms
CREATE TABLE catalog.db.clickstream (
event_id BIGINT,
user_id STRING,
event_time TIMESTAMP,
event_type STRING,
page_url STRING
) USING iceberg
PARTITIONED BY (
days(event_time), -- Daily partitioning
user_id -- Bucket by user_id
);
Partition transforms:
bucket(N, col)- Hash into N bucketstruncate(L, col)- Truncate to L charactersyears(col)- Partition by yearmonths(col)- Partition by monthdays(col)- Partition by dayhours(col)- Partition by hour
Pros:
- Partition evolution (change partitioning without rewrite)
- Hidden partitions (users don't need to know partition columns)
- Automatic partition pruning
- Time-travel with partition pruning
Cons:
- Requires Iceberg format
- More complex metadata management
3. Delta Lake Partitioning
Delta Lake supports the same Hive-style partitioning but adds:
- Z-Ordering (multi-dimensional clustering)
- Optimize command (compaction + sorting)
- Liquid Clustering (automatic optimization)
# Delta Lake: Optimize with Z-Order
spark.sql("""
OPTIMIZE clickstream
ZORDER BY (user_id, event_time)
""")
4. Kafka Partitioning
Kafka partitions data by key:
# Producer: Partition by user_id
producer = KafkaProducer(
bootstrap_servers='kafka:9092',
key_serializer=lambda k: k.encode('utf-8'),
value_serializer=lambda v: json.dumps(v).encode('utf-8')
)
# Partition by user_id ensures same user goes to same partition
producer.send('clickstream', key=user_id, value=click_event)
Hash Partitioning
Uses a hash function to distribute data across partitions:
where is the number of partitions.
Properties:
- Even distribution (uniform hash)
- Same key always goes to same partition
- Adding/removing partitions requires rehashing (expensive)
Hash functions:
- MurmurHash3: Fast, good distribution (Spark default)
- CityHash: Fast, good for strings
- MD5/SHA: Cryptographic, slower, uniform
Range Partitioning
Partitions data based on value ranges:
Example:
partition_0: date < 2024-01-01
partition_1: 2024-01-01 <= date < 2024-02-01
partition_2: 2024-02-01 <= date < 2024-03-01
Pros:
- Natural for time-series data
- Easy range queries
- Partition pruning for date filters
Cons:
- Data skew (recent data is "hot")
- Requires knowledge of value distribution
Bucketing
Bucketing distributes data into a fixed number of files based on a hash of the bucket column:
Key difference from partitioning:
- Partitioning: Creates directories
- Bucketing: Creates files within directories
When to use bucketing:
- High-cardinality columns (user_id with millions of values)
- Frequent JOINs on the bucket column
- Preventing data skew in aggregations
Mathematical Analysis: Partition Size
For a dataset with:
- = total data size (50TB)
- = query selectivity (fraction of data read)
- = number of partitions
- = average file size
Without partitioning:
With partitioning on column with cardinality :
For date partitioning (365 days, uniform distribution):
For user_id partitioning (10M users):
βΉοΈ
Key Insight: The optimal partition key is the column most frequently used in WHERE clauses. For clickstream data, user_id is typically better than date because: (1) higher cardinality = smaller partitions, (2) user_id is more selective than date range.
The Small Files Problem
Partitioning too aggressively creates too many small files:
For 50TB with 128MB target file size:
If partitioned by user_id (10M users) and date (365 days):
This is way too many partitions! The NameNode (HDFS) or S3 listing would be overwhelmed.
Optimal partition count: 1,000 - 100,000 partitions
β οΈ
Common Interview Trap: Candidates often recommend partitioning by both user_id AND date for clickstream data. This creates a combinatorial explosion: 10M users Γ 365 days = 3.65B partitions. Each partition would have ~14KB of data, resulting in millions of tiny files. Instead, bucket by user_id and partition by date.
Partition Pruning
Query engines use partition pruning to skip irrelevant partitions:
-- Query with partition pruning
SELECT COUNT(*)
FROM clickstream
WHERE date = '2024-01-15'
AND user_id = 'abc';
-- Without pruning: Reads 50TB
-- With date partition pruning: Reads 137GB (1 day)
-- With user_id bucket pruning: Reads ~5MB (1 user, 1 day)
How pruning works:
- Parse WHERE clause
- Match columns to partition columns
- Calculate which partition values satisfy the filter
- Only read those partitions
Multi-Dimensional Clustering (Z-Order)
Z-Order interleaves bits of multiple columns to create a space-filling curve:
Z-Order curve for (x, y):
3---7
| |
2---6
| |
1---5
| |
0---4
Implementation in Delta Lake:
# Z-Order by user_id and event_time
spark.sql("""
OPTIMIZE clickstream
ZORDER BY (user_id, event_time)
""")
Benefits:
- Data is sorted by multiple dimensions simultaneously
- Queries filtering on Z-ordered columns read fewer files
- Automatic in Databricks Photon
Code Implementation
Partitioning Strategy for Clickstream Data
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder \
.appName("ClickstreamPartitioning") \
.config("spark.sql.shuffle.partitions", "200") \
.config("spark.sql.files.maxPartitionBytes", "134217728") # 128MB
.getOrCreate()
# Read raw clickstream data
raw_clicks = spark.read.parquet("s3://data-lake/raw/clickstream/")
# ============================================================
# STRATEGY 1: Hive-Style Partitioning (Date + Bucket by User)
# ============================================================
# Step 1: Add date partition column
clicks_with_date = raw_clicks \
.withColumn("event_date", F.to_date("event_time"))
# Step 2: Bucket by user_id (1000 buckets for 10M users)
# This ensures same user's events are co-located
clicks_bucketed = clicks_with_date \
.repartition(1000, "user_id") \
.sortWithinPartitions("user_id", "event_time")
# Step 3: Write partitioned by date
clicks_bucketed.write \
.format("parquet") \
.partitionBy("event_date") \
.option("compression", "snappy") \
.mode("overwrite") \
.save("s3://data-warehouse/clickstream_partitioned/")
# Result structure:
# s3://data-warehouse/clickstream_partitioned/
# βββ event_date=2024-01-01/
# β βββ part-00000.parquet (multiple users, same bucket)
# β βββ part-00001.parquet
# β βββ ... (1000 files per date)
# βββ event_date=2024-01-02/
# βββ ... (365 directories)
Delta Lake with Z-Order Optimization
# ============================================================
# STRATEGY 2: Delta Lake with Z-Order
# ============================================================
# Write as Delta Lake
raw_clicks.write \
.format("delta") \
.partitionBy("event_date") \
.mode("overwrite") \
.save("s3://data-warehouse/clickstream_delta/")
# Optimize with Z-Order on frequently queried columns
spark.sql("""
OPTIMIZE delta.`s3://data-warehouse/clickstream_delta/`
ZORDER BY (user_id, event_time)
""")
# Result: Data is both partitioned by date AND clustered by user_id
# Queries like "user X in last 7 days" are extremely fast
Iceberg with Hidden Partitioning
# ============================================================
# STRATEGY 3: Apache Iceberg
# ============================================================
spark.sql("""
CREATE TABLE catalog.db.clickstream (
event_id BIGINT,
user_id STRING,
event_time TIMESTAMP,
event_type STRING,
page_url STRING,
session_id STRING,
device_type STRING
) USING iceberg
PARTITIONED BY (
days(event_time), -- Daily partitions
bucket(1000, user_id) -- Hash into 1000 buckets
)
TBLPROPERTIES (
'write.target-file-size-bytes' = '134217728', -- 128MB
'write.parquet.compression-codec' = 'snappy'
)
""")
# Insert data
spark.sql("""
INSERT INTO catalog.db.clickstream
SELECT
event_id,
user_id,
event_time,
event_type,
page_url,
session_id,
device_type
FROM raw_clickstream
""")
Analyzing Partition Distribution
# Analyze partition sizes to detect skew
partition_stats = spark.sql("""
SELECT
event_date,
COUNT(*) as row_count,
SUM(file_size) / 1024 / 1024 / 1024 as size_gb,
COUNT(DISTINCT file_path) as file_count
FROM (
SELECT
input_file_name() as file_path,
size(input_file_name()) as file_size,
event_date
FROM clickstream
)
GROUP BY event_date
ORDER BY event_date
""")
partition_stats.show(100, truncate=False)
# Find partitions with skew (size > 2x average)
avg_size = partition_stats.agg(F.avg("size_gb")).collect()[0][0]
skewed = partition_stats.filter(F.col("size_gb") > 2 * avg_size)
skewed.show()
Repartitioning Strategy
# ============================================================
# REPARTITIONING FOR OPTIMAL PERFORMANCE
# ============================================================
# Problem: Data arrives with random partitioning
# Solution: Repartition to optimal layout
def optimal_repartition(df, partition_col, bucket_col=None, n_buckets=1000):
"""
Repartition DataFrame for optimal query performance.
Args:
df: Input DataFrame
partition_col: Column to partition by (e.g., date)
bucket_col: Column to bucket by (e.g., user_id)
n_buckets: Number of buckets
"""
if bucket_col:
# Bucket first, then partition
df = df.repartition(n_buckets, bucket_col) \
.sortWithinPartitions(bucket_col)
# Then partition by the partition column
df = df.repartition(partition_col)
return df
# Apply to clickstream data
optimized_clicks = optimal_repartition(
raw_clicks,
partition_col="event_date",
bucket_col="user_id",
n_buckets=1000
)
optimized_clicks.write \
.format("delta") \
.partitionBy("event_date") \
.mode("overwrite") \
.save("s3://data-warehouse/clickstream_optimized/")
Monitoring Partition Health
# Monitor partition health over time
def monitor_partitions(table_path):
"""Monitor partition distribution and skew"""
# Get partition stats
stats = spark.sql(f"""
SELECT
event_date,
COUNT(*) as records,
COUNT(DISTINCT input_file_name()) as files,
SUM(LENGTH(input_file_name())) as total_bytes
FROM delta.`{table_path}`
GROUP BY event_date
""")
# Calculate metrics
metrics = stats.select(
F.count("*").alias("total_partitions"),
F.avg("records").alias("avg_records_per_partition"),
F.stddev("records").alias("std_records"),
F.min("records").alias("min_records"),
F.max("records").alias("max_records"),
F.avg("files").alias("avg_files_per_partition")
).collect()[0]
# Calculate coefficient of variation (skew metric)
cv = metrics.std_records / metrics.avg_records_per_partition
print(f"Total partitions: {metrics.total_partitions}")
print(f"Avg records/partition: {metrics.avg_records_per_partition:,.0f}")
print(f"Skew (CV): {cv:.2f}")
print(f"Min/Max: {metrics.min_records:,.0f} / {metrics.max_records:,.0f}")
if cv > 0.5:
print("WARNING: High skew detected. Consider repartitioning.")
return metrics
monitor_partitions("s3://data-warehouse/clickstream_delta/")
π‘
Production Tip: Use Delta Lake's DESCRIBE DETAIL command to monitor file sizes and partition distribution:
DESCRIBE DETAIL delta.`s3://data-warehouse/clickstream_delta/`;
This shows: numFiles, sizeInBytes, min/max file sizes, and partition columns.
Dynamic Partition Pruning
# Enable dynamic partition pruning in Spark
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
# Query that benefits from dynamic partition pruning
spark.sql("""
WITH recent_users AS (
SELECT DISTINCT user_id
FROM clicks
WHERE event_date >= DATE_SUB(CURRENT_DATE(), 7)
)
SELECT
c.event_date,
COUNT(DISTINCT c.user_id) as active_users
FROM clicks c
JOIN recent_users ru ON c.user_id = ru.user_id
WHERE c.event_date >= DATE_SUB(CURRENT_DATE(), 30)
GROUP BY c.event_date
""")
# Without DPP: Reads 30 days of data
# With DPP: Only reads partitions for users active in last 7 days
β οΈ
Critical Consideration: Dynamic partition pruning requires the filtering subquery to run first. In distributed systems, this can cause a bottleneck. For very large tables, consider materializing the filter result in a separate table.
Common Follow-Up Questions
Q1: How do you choose between partitioning and bucketing?
| Scenario | Recommendation |
|---|---|
| Low cardinality (< 10K values) | Partition |
| High cardinality (> 1M values) | Bucket |
| Time-series data | Partition by date, bucket by entity |
| Frequent JOINs on column | Bucket by JOIN key |
| Range queries on column | Partition by column |
Q2: How do you handle skew in partitioned data?
# Handle skew with salting
def salt_join(df1, df2, join_key, n_salt=10):
"""
Salt a skewed join to distribute load evenly.
"""
# Add salt to both DataFrames
import random
salted1 = df1.withColumn("salt", (F.rand() * n_salt).cast("int"))
salted2 = df2.withColumn("salt_key", F.array([F.lit(i) for i in range(n_salt)])) \
.withColumn("salt", F.explode("salt_key"))
# Join on original key + salt
return salted1.join(salted2,
(salted1[join_key] == salted2[join_key]) &
(salted1.salt == salted2.salt))
Q3: What's the difference between partition pruning and predicate pushdown?
- Partition pruning: Skips entire partitions (directories)
- Predicate pushdown: Filters data within files using column statistics (min/max)
Both work together:
Query: WHERE date = '2024-01-15' AND user_id = 'abc'
1. Partition pruning: Skip all dates except 2024-01-15
2. Predicate pushdown: Within 2024-01-15, skip files where
user_id min > 'abc' OR user_id max < 'abc'
Q4: How does partitioning affect write performance?
Each partition creates a separate file writer. Too many partitions = too many small files = slow writes.
Optimal write strategy:
- Repartition data in Spark before writing
- Use
coalesce()to reduce partitions if needed - Write in append mode for streaming
- Compact small files periodically
π‘
Interview Tip: Always mention the tradeoffs: partitioning improves read performance but can hurt write performance and create small files problems. The optimal strategy depends on: (1) query patterns, (2) data volume, (3) cardinality of partition columns, and (4) write frequency.
Company-Specific Tips
Databricks Interview Tips
- Discuss Delta Lake and Liquid Clustering
- Mention Photon engine's automatic optimization
- Be ready to explain Z-Order and Hilbert curves
- Talk about auto-optimize and auto-compaction
AWS Interview Tips
- Focus on Athena and Redshift partitioning
- Discuss S3 prefix optimization
- Mention Glue partition indexes
- Talk about cost implications of partition strategies
Netflix Interview Tips
- Discuss content recommendation partitioning strategies
- Explain how viewing history affects partition design
- Mention A/B testing data partitioning
- Talk about multi-tenant partition isolation
βΉοΈ
Final Takeaway: There's no one-size-fits-all partitioning strategy. Start with your query patterns, understand your data distribution, and iterate. Monitor partition health regularly and optimize as data grows. The best partitioning strategy is one that matches your most frequent query patterns.