πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Partitioning & Bucketing in Data Lakes

Data EngineeringData Architecture⭐ Premium

Advertisement

Databricks & AWS Interview

Partitioning & Bucketing in Data Lakes

Optimizing data layout for query performance and cost

Interview Question

"You have a 50TB Parquet dataset in S3 storing clickstream data. Queries typically filter by user_id and timestamp. Users often ask 'show me all clicks for user X in the last 7 days.' Design an optimal partitioning and bucketing strategy. What are the tradeoffs?"

Difficulty: Hard | Frequently asked at Databricks, AWS, Netflix, Uber


Theoretical Foundation

Why Partitioning Matters

Partitioning divides data into logical groups based on column values. The goal is partition pruningβ€”reading only relevant files during queries.

Without partitioning:

s3://data-lake/clickstream/β”œβ”€β”€ part-00000.parquet (500GB)β”œβ”€β”€ part-00001.parquet (500GB)└── ... (100 files) β†’ Must scan ALL 50TB

Query: WHERE user_id = 'abc' AND date = '2024-01-15' β†’ Must scan ALL 50TB ❌

With partitioning:

s3://data-lake/clickstream/β”œβ”€β”€ date=2024-01-01/β”‚ β”œβ”€β”€ user_id=abc/ (only user abc's data)β”‚ └── user_id=def/β”œβ”€β”€ date=2024-01-02/β”‚ └── ... (365 directories)

Query: WHERE user_id = 'abc' AND date = '2024-01-15' β†’ Reads only 1 directory βœ“

Types of Partitioning

1. Directory-Based Partitioning (Hive-Style)

Data is physically stored in directories named by partition column values:

table_name/β”œβ”€β”€ partition_column=value1/β”‚ └── data filesβ”œβ”€β”€ partition_column=value2/β”‚ └── data files

Pros:

  • Simple to understand and debug
  • Works with all file formats (Parquet, ORC, JSON)
  • Compatible with Spark, Hive, Presto, Athena

Cons:

  • Limited partition columns (too many = too many small files)
  • Partition values are fixed at write time
  • Requires explicit partition discovery

2. Iceberg Partitioning

Apache Iceberg uses hidden partitioning with partition transforms:

-- Iceberg table with partition transforms
CREATE TABLE catalog.db.clickstream (
    event_id BIGINT,
    user_id STRING,
    event_time TIMESTAMP,
    event_type STRING,
    page_url STRING
) USING iceberg
PARTITIONED BY (
    days(event_time),      -- Daily partitioning
    user_id                -- Bucket by user_id
);

Partition transforms:

  • bucket(N, col) - Hash into N buckets
  • truncate(L, col) - Truncate to L characters
  • years(col) - Partition by year
  • months(col) - Partition by month
  • days(col) - Partition by day
  • hours(col) - Partition by hour

Pros:

  • Partition evolution (change partitioning without rewrite)
  • Hidden partitions (users don't need to know partition columns)
  • Automatic partition pruning
  • Time-travel with partition pruning

Cons:

  • Requires Iceberg format
  • More complex metadata management

3. Delta Lake Partitioning

Delta Lake supports the same Hive-style partitioning but adds:

  • Z-Ordering (multi-dimensional clustering)
  • Optimize command (compaction + sorting)
  • Liquid Clustering (automatic optimization)
# Delta Lake: Optimize with Z-Order
spark.sql("""
    OPTIMIZE clickstream
    ZORDER BY (user_id, event_time)
""")

4. Kafka Partitioning

Kafka partitions data by key:

# Producer: Partition by user_id
producer = KafkaProducer(
    bootstrap_servers='kafka:9092',
    key_serializer=lambda k: k.encode('utf-8'),
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

# Partition by user_id ensures same user goes to same partition
producer.send('clickstream', key=user_id, value=click_event)

Hash Partitioning

Uses a hash function to distribute data across partitions:

partition=hash(key)mod  N\text{partition} = \text{hash}(key) \mod N

where NN is the number of partitions.

Properties:

  • Even distribution (uniform hash)
  • Same key always goes to same partition
  • Adding/removing partitions requires rehashing (expensive)

Hash functions:

  • MurmurHash3: Fast, good distribution (Spark default)
  • CityHash: Fast, good for strings
  • MD5/SHA: Cryptographic, slower, uniform

Range Partitioning

Partitions data based on value ranges:

partitioni={x:loweri≀x<upperi}\text{partition}_i = \{x : \text{lower}_i \leq x < \text{upper}_i\}

Example:

Architecture Diagram
partition_0: date < 2024-01-01
partition_1: 2024-01-01 <= date < 2024-02-01
partition_2: 2024-02-01 <= date < 2024-03-01

Pros:

  • Natural for time-series data
  • Easy range queries
  • Partition pruning for date filters

Cons:

  • Data skew (recent data is "hot")
  • Requires knowledge of value distribution

Bucketing

Bucketing distributes data into a fixed number of files based on a hash of the bucket column:

bucket=hash(col)mod  Nbuckets\text{bucket} = \text{hash}(col) \mod N_{\text{buckets}}

Key difference from partitioning:

  • Partitioning: Creates directories
  • Bucketing: Creates files within directories
clickstream/β”œβ”€β”€ date=2024-01-15/β”‚ β”œβ”€β”€ bucket-00000.parquetβ”‚ β”œβ”€β”€ bucket-00001.parquet

When to use bucketing:

  • High-cardinality columns (user_id with millions of values)
  • Frequent JOINs on the bucket column
  • Preventing data skew in aggregations

Mathematical Analysis: Partition Size

For a dataset with:

  • DD = total data size (50TB)
  • QQ = query selectivity (fraction of data read)
  • PP = number of partitions
  • FF = average file size

Without partitioning:

DataΒ read=D=50TB\text{Data read} = D = 50\text{TB}

With partitioning on column with cardinality CC:

DataΒ read=DΓ—1C\text{Data read} = D \times \frac{1}{C}

For date partitioning (365 days, uniform distribution):

DataΒ read=50TBΓ—1365β‰ˆ137GB\text{Data read} = 50\text{TB} \times \frac{1}{365} \approx 137\text{GB}

For user_id partitioning (10M users):

DataΒ read=50TBΓ—110Mβ‰ˆ5MB\text{Data read} = 50\text{TB} \times \frac{1}{10M} \approx 5\text{MB}

ℹ️

Key Insight: The optimal partition key is the column most frequently used in WHERE clauses. For clickstream data, user_id is typically better than date because: (1) higher cardinality = smaller partitions, (2) user_id is more selective than date range.

The Small Files Problem

Partitioning too aggressively creates too many small files:

Nfiles=DFtargetN_{\text{files}} = \frac{D}{F_{\text{target}}}

For 50TB with 128MB target file size:

Nfiles=50Γ—106MB128MBβ‰ˆ390,000Β filesN_{\text{files}} = \frac{50 \times 10^6 \text{MB}}{128 \text{MB}} \approx 390,000 \text{ files}

If partitioned by user_id (10M users) and date (365 days):

Npartitions=10MΓ—365=3.65BΒ partitionsN_{\text{partitions}} = 10M \times 365 = 3.65B \text{ partitions}

This is way too many partitions! The NameNode (HDFS) or S3 listing would be overwhelmed.

Optimal partition count: 1,000 - 100,000 partitions

⚠️

Common Interview Trap: Candidates often recommend partitioning by both user_id AND date for clickstream data. This creates a combinatorial explosion: 10M users Γ— 365 days = 3.65B partitions. Each partition would have ~14KB of data, resulting in millions of tiny files. Instead, bucket by user_id and partition by date.

Partition Pruning

Query engines use partition pruning to skip irrelevant partitions:

-- Query with partition pruning
SELECT COUNT(*) 
FROM clickstream 
WHERE date = '2024-01-15' 
  AND user_id = 'abc';

-- Without pruning: Reads 50TB
-- With date partition pruning: Reads 137GB (1 day)
-- With user_id bucket pruning: Reads ~5MB (1 user, 1 day)

How pruning works:

  1. Parse WHERE clause
  2. Match columns to partition columns
  3. Calculate which partition values satisfy the filter
  4. Only read those partitions

Multi-Dimensional Clustering (Z-Order)

Z-Order interleaves bits of multiple columns to create a space-filling curve:

Architecture Diagram
Z-Order curve for (x, y):
    3---7
    |   |
    2---6
    |   |
    1---5
    |   |
    0---4

Implementation in Delta Lake:

# Z-Order by user_id and event_time
spark.sql("""
    OPTIMIZE clickstream
    ZORDER BY (user_id, event_time)
""")

Benefits:

  • Data is sorted by multiple dimensions simultaneously
  • Queries filtering on Z-ordered columns read fewer files
  • Automatic in Databricks Photon

Code Implementation

Partitioning Strategy for Clickstream Data

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder \
    .appName("ClickstreamPartitioning") \
    .config("spark.sql.shuffle.partitions", "200") \
    .config("spark.sql.files.maxPartitionBytes", "134217728")  # 128MB
    .getOrCreate()

# Read raw clickstream data
raw_clicks = spark.read.parquet("s3://data-lake/raw/clickstream/")

# ============================================================
# STRATEGY 1: Hive-Style Partitioning (Date + Bucket by User)
# ============================================================

# Step 1: Add date partition column
clicks_with_date = raw_clicks \
    .withColumn("event_date", F.to_date("event_time"))

# Step 2: Bucket by user_id (1000 buckets for 10M users)
# This ensures same user's events are co-located
clicks_bucketed = clicks_with_date \
    .repartition(1000, "user_id") \
    .sortWithinPartitions("user_id", "event_time")

# Step 3: Write partitioned by date
clicks_bucketed.write \
    .format("parquet") \
    .partitionBy("event_date") \
    .option("compression", "snappy") \
    .mode("overwrite") \
    .save("s3://data-warehouse/clickstream_partitioned/")

# Result structure:
# s3://data-warehouse/clickstream_partitioned/
# β”œβ”€β”€ event_date=2024-01-01/
# β”‚   β”œβ”€β”€ part-00000.parquet (multiple users, same bucket)
# β”‚   β”œβ”€β”€ part-00001.parquet
# β”‚   └── ... (1000 files per date)
# β”œβ”€β”€ event_date=2024-01-02/
# └── ... (365 directories)

Delta Lake with Z-Order Optimization

# ============================================================
# STRATEGY 2: Delta Lake with Z-Order
# ============================================================

# Write as Delta Lake
raw_clicks.write \
    .format("delta") \
    .partitionBy("event_date") \
    .mode("overwrite") \
    .save("s3://data-warehouse/clickstream_delta/")

# Optimize with Z-Order on frequently queried columns
spark.sql("""
    OPTIMIZE delta.`s3://data-warehouse/clickstream_delta/`
    ZORDER BY (user_id, event_time)
""")

# Result: Data is both partitioned by date AND clustered by user_id
# Queries like "user X in last 7 days" are extremely fast

Iceberg with Hidden Partitioning

# ============================================================
# STRATEGY 3: Apache Iceberg
# ============================================================

spark.sql("""
    CREATE TABLE catalog.db.clickstream (
        event_id BIGINT,
        user_id STRING,
        event_time TIMESTAMP,
        event_type STRING,
        page_url STRING,
        session_id STRING,
        device_type STRING
    ) USING iceberg
    PARTITIONED BY (
        days(event_time),      -- Daily partitions
        bucket(1000, user_id)  -- Hash into 1000 buckets
    )
    TBLPROPERTIES (
        'write.target-file-size-bytes' = '134217728',  -- 128MB
        'write.parquet.compression-codec' = 'snappy'
    )
""")

# Insert data
spark.sql("""
    INSERT INTO catalog.db.clickstream
    SELECT 
        event_id,
        user_id,
        event_time,
        event_type,
        page_url,
        session_id,
        device_type
    FROM raw_clickstream
""")

Analyzing Partition Distribution

# Analyze partition sizes to detect skew
partition_stats = spark.sql("""
    SELECT 
        event_date,
        COUNT(*) as row_count,
        SUM(file_size) / 1024 / 1024 / 1024 as size_gb,
        COUNT(DISTINCT file_path) as file_count
    FROM (
        SELECT 
            input_file_name() as file_path,
            size(input_file_name()) as file_size,
            event_date
        FROM clickstream
    )
    GROUP BY event_date
    ORDER BY event_date
""")

partition_stats.show(100, truncate=False)

# Find partitions with skew (size > 2x average)
avg_size = partition_stats.agg(F.avg("size_gb")).collect()[0][0]
skewed = partition_stats.filter(F.col("size_gb") > 2 * avg_size)
skewed.show()

Repartitioning Strategy

# ============================================================
# REPARTITIONING FOR OPTIMAL PERFORMANCE
# ============================================================

# Problem: Data arrives with random partitioning
# Solution: Repartition to optimal layout

def optimal_repartition(df, partition_col, bucket_col=None, n_buckets=1000):
    """
    Repartition DataFrame for optimal query performance.
    
    Args:
        df: Input DataFrame
        partition_col: Column to partition by (e.g., date)
        bucket_col: Column to bucket by (e.g., user_id)
        n_buckets: Number of buckets
    """
    if bucket_col:
        # Bucket first, then partition
        df = df.repartition(n_buckets, bucket_col) \
               .sortWithinPartitions(bucket_col)
    
    # Then partition by the partition column
    df = df.repartition(partition_col)
    
    return df

# Apply to clickstream data
optimized_clicks = optimal_repartition(
    raw_clicks,
    partition_col="event_date",
    bucket_col="user_id",
    n_buckets=1000
)

optimized_clicks.write \
    .format("delta") \
    .partitionBy("event_date") \
    .mode("overwrite") \
    .save("s3://data-warehouse/clickstream_optimized/")

Monitoring Partition Health

# Monitor partition health over time
def monitor_partitions(table_path):
    """Monitor partition distribution and skew"""
    
    # Get partition stats
    stats = spark.sql(f"""
        SELECT 
            event_date,
            COUNT(*) as records,
            COUNT(DISTINCT input_file_name()) as files,
            SUM(LENGTH(input_file_name())) as total_bytes
        FROM delta.`{table_path}`
        GROUP BY event_date
    """)
    
    # Calculate metrics
    metrics = stats.select(
        F.count("*").alias("total_partitions"),
        F.avg("records").alias("avg_records_per_partition"),
        F.stddev("records").alias("std_records"),
        F.min("records").alias("min_records"),
        F.max("records").alias("max_records"),
        F.avg("files").alias("avg_files_per_partition")
    ).collect()[0]
    
    # Calculate coefficient of variation (skew metric)
    cv = metrics.std_records / metrics.avg_records_per_partition
    
    print(f"Total partitions: {metrics.total_partitions}")
    print(f"Avg records/partition: {metrics.avg_records_per_partition:,.0f}")
    print(f"Skew (CV): {cv:.2f}")
    print(f"Min/Max: {metrics.min_records:,.0f} / {metrics.max_records:,.0f}")
    
    if cv > 0.5:
        print("WARNING: High skew detected. Consider repartitioning.")
    
    return metrics

monitor_partitions("s3://data-warehouse/clickstream_delta/")

πŸ’‘

Production Tip: Use Delta Lake's DESCRIBE DETAIL command to monitor file sizes and partition distribution:

DESCRIBE DETAIL delta.`s3://data-warehouse/clickstream_delta/`;

This shows: numFiles, sizeInBytes, min/max file sizes, and partition columns.

Dynamic Partition Pruning

# Enable dynamic partition pruning in Spark
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

# Query that benefits from dynamic partition pruning
spark.sql("""
    WITH recent_users AS (
        SELECT DISTINCT user_id
        FROM clicks
        WHERE event_date >= DATE_SUB(CURRENT_DATE(), 7)
    )
    SELECT 
        c.event_date,
        COUNT(DISTINCT c.user_id) as active_users
    FROM clicks c
    JOIN recent_users ru ON c.user_id = ru.user_id
    WHERE c.event_date >= DATE_SUB(CURRENT_DATE(), 30)
    GROUP BY c.event_date
""")

# Without DPP: Reads 30 days of data
# With DPP: Only reads partitions for users active in last 7 days

⚠️

Critical Consideration: Dynamic partition pruning requires the filtering subquery to run first. In distributed systems, this can cause a bottleneck. For very large tables, consider materializing the filter result in a separate table.


Common Follow-Up Questions

Q1: How do you choose between partitioning and bucketing?

ScenarioRecommendation
Low cardinality (< 10K values)Partition
High cardinality (> 1M values)Bucket
Time-series dataPartition by date, bucket by entity
Frequent JOINs on columnBucket by JOIN key
Range queries on columnPartition by column

Q2: How do you handle skew in partitioned data?

# Handle skew with salting
def salt_join(df1, df2, join_key, n_salt=10):
    """
    Salt a skewed join to distribute load evenly.
    """
    # Add salt to both DataFrames
    import random
    
    salted1 = df1.withColumn("salt", (F.rand() * n_salt).cast("int"))
    salted2 = df2.withColumn("salt_key", F.array([F.lit(i) for i in range(n_salt)])) \
                 .withColumn("salt", F.explode("salt_key"))
    
    # Join on original key + salt
    return salted1.join(salted2, 
                        (salted1[join_key] == salted2[join_key]) & 
                        (salted1.salt == salted2.salt))

Q3: What's the difference between partition pruning and predicate pushdown?

  • Partition pruning: Skips entire partitions (directories)
  • Predicate pushdown: Filters data within files using column statistics (min/max)

Both work together:

Architecture Diagram
Query: WHERE date = '2024-01-15' AND user_id = 'abc'

1. Partition pruning: Skip all dates except 2024-01-15
2. Predicate pushdown: Within 2024-01-15, skip files where 
   user_id min > 'abc' OR user_id max < 'abc'

Q4: How does partitioning affect write performance?

Each partition creates a separate file writer. Too many partitions = too many small files = slow writes.

Optimal write strategy:

  1. Repartition data in Spark before writing
  2. Use coalesce() to reduce partitions if needed
  3. Write in append mode for streaming
  4. Compact small files periodically

πŸ’‘

Interview Tip: Always mention the tradeoffs: partitioning improves read performance but can hurt write performance and create small files problems. The optimal strategy depends on: (1) query patterns, (2) data volume, (3) cardinality of partition columns, and (4) write frequency.


Company-Specific Tips

Databricks Interview Tips

  • Discuss Delta Lake and Liquid Clustering
  • Mention Photon engine's automatic optimization
  • Be ready to explain Z-Order and Hilbert curves
  • Talk about auto-optimize and auto-compaction

AWS Interview Tips

  • Focus on Athena and Redshift partitioning
  • Discuss S3 prefix optimization
  • Mention Glue partition indexes
  • Talk about cost implications of partition strategies

Netflix Interview Tips

  • Discuss content recommendation partitioning strategies
  • Explain how viewing history affects partition design
  • Mention A/B testing data partitioning
  • Talk about multi-tenant partition isolation

ℹ️

Final Takeaway: There's no one-size-fits-all partitioning strategy. Start with your query patterns, understand your data distribution, and iterate. Monitor partition health regularly and optimize as data grows. The best partitioning strategy is one that matches your most frequent query patterns.

Advertisement