Parquet vs ORC vs Avro vs Delta Lake: Deep Dive

Difficulty: Senior/Staff Level | Companies: Google, Meta, Netflix, Uber, Airbnb, Databricks

This question tests your understanding of columnar and row-based storage formats, their internal structures, and when to use each in production data pipelines.

Core Concepts: Storage Format Taxonomy

Row-Based vs Columnar Storage

The fundamental distinction in data encoding formats is between row-oriented and column-oriented storage:

Row-based storage (e.g., Avro, CSV):

\text{Row Storage: } [R_1][R_2][R_3]...[R_n]

Each row is stored contiguously:

Architecture Diagram

| Name    | Age | City     |
|---------|-----|----------|
| Alice   | 30  | NYC      |
| Bob     | 25  | SF       |

Storage layout:

Architecture Diagram

[Alice|30|NYC][Bob|25|SF]...

Columnar storage (e.g., Parquet, ORC):

\text{Column Storage: } [C_1][C_2][C_3]...[C_m]

Each column is stored contiguously:

Architecture Diagram

Names:  [Alice][Bob]...
Ages:   [30][25]...
Cities: [NYC][SF]...

Compression Ratio Analysis

Columnar storage achieves better compression because adjacent values in a column often share similar data patterns:

\text{Compression Ratio} = \frac{\text{Uncompressed Size}}{\text{Compressed Size}}

For columnar formats:

CR_{columnar} = \frac{\sum_{i=1}^{m} |C_i|}{\sum_{i=1}^{m} \text{compressed}(|C_i|)}

For row-based formats:

CR_{row} = \frac{\sum_{j=1}^{n} |R_j|}{\sum_{j=1}^{n} \text{compressed}(|R_j|)}

Where typically $CR_{columnar} > CR_{row}$ due to better locality of similar values.

1. Apache Parquet

Internal Structure

Parquet uses a hybrid approach with row groups and column chunks:

Architecture Diagram

Parquet File
├── File Metadata
│   ├── Schema
│   ├── Row Group 1
│   │   ├── Column Chunk 1 (compressed)
│   │   ├── Column Chunk 2 (compressed)
│   │   └── ...
│   ├── Row Group 2
│   │   └── ...
│   └── ...
└── Page Indexes

Encoding Schemes in Parquet

import pyarrow as pa
import pyarrow.parquet as pq

# Create sample data with different data types
data = pa.table({
    'id': pa.array([1, 2, 3, 4, 5] * 1000000),
    'name': pa.array(['Alice', 'Bob', 'Charlie', 'David', 'Eve'] * 1000000),
    'score': pa.array([85.5, 92.3, 78.9, 95.1, 88.7] * 1000000),
    'category': pa.array(['A', 'B', 'A', 'C', 'B'] * 1000000)
})

# Write with different encoding strategies
pq.write_table(
    data,
    'data_plain.parquet',
    compression='none',
    write_statistics=True,
    encoding='PLAIN'
)

pq.write_table(
    data,
    'data_dict.parquet',
    compression='snappy',
    write_statistics=True,
    encoding='DELTA_BYTE_ARRAY'  # Dictionary encoding for low-cardinality
)

pq.write_table(
    data,
    'data_delta.parquet',
    compression='zstd',
    write_statistics=True,
    encoding='DELTA_BINARY_PACKED'  # Delta encoding for sequential integers
)

# Compare file sizes
import os
for f in ['data_plain.parquet', 'data_dict.parquet', 'data_delta.parquet']:
    print(f"{f}: {os.path.getsize(f) / 1024 / 1024:.2f} MB")

Parquet Page Types and Statistics

Each Parquet page includes statistics that enable predicate pushdown:

import pyarrow.parquet as pq

# Read metadata to inspect page statistics
parquet_file = pq.ParquetFile('data_dict.parquet')

for i, row_group in enumerate(parquet_file.metadata.row_groups):
    print(f"\nRow Group {i}:")
    for j, column in enumerate(row_group.columns):
        print(f"  Column {j} ({column.path_in_schema}):")
        print(f"    Min: {column.statistics.min}")
        print(f"    Max: {column.statistics.max}")
        print(f"    Null count: {column.statistics.null_count}")
        print(f"    Distinct values: {column.statistics.distinct_count}")

Bloom Filters in Parquet

Parquet supports optional Bloom filters for efficient equality checks:

P(\text{false positive}) = \left(1 - e^{-kn/m}\right)^k

Where:

$m$ = number of bits in the filter
$n$ = number of elements
$k$ = number of hash functions

# Writing Parquet with Bloom filter configuration
import pyarrow.parquet as pq
from pyarrow.parquet import BloomFilter

# Custom Bloom filter settings via ParquetWriter
pq.write_table(
    data,
    'data_bloom.parquet',
    write_page_index=True,
    write_statistics=True
)

# Note: Native Bloom filter support requires Arrow C++ >= 8.0
# or using parquet-mr writer

2. Apache ORC

ORC File Structure

ORC (Optimized Row Columnar) provides built-in indexes and statistics:

Architecture Diagram

ORC File
├── File Footer
│   ├── File Stats (min/max per column)
│   ├── Stripe 1
│   │   ├── Index (bloom filter + min/max per 10K rows)
│   │   ├── Data (column groups)
│   │   └── Statistics
│   ├── Stripe 2
│   │   └── ...
│   └── ...
└── PostScript (compression, version)

ORC vs Parquet Performance

# Example: ORC writing with Hive (via PyHive)
from pyhive import hive
import pandas as pd

# Simulated ORC write using pandas + pyorc
import pyorc

# Create ORC file with specific compression
data_orc = {
    'id': list(range(1000000)),
    'name': ['user_' + str(i % 1000) for i in range(1000000)],
    'value': [float(i * 0.1) for i in range(1000000)]
}

schema = "struct<id:int,name:string,value:double>"

with open('data.orc', 'wb') as f:
    writer = pyorc.Writer(f, schema, stripe_size=67108864)  # 64MB stripes
    for i in range(0, 1000000, 1000):
        batch = {
            'id': data_orc['id'][i:i+1000],
            'name': data_orc['name'][i:i+1000],
            'value': data_orc['value'][i:i+1000]
        }
        writer.write(batch)
    writer.close()

ORC's Built-in ACID Support

ORC provides native ACID transactions (Hive 3+):

\text{Transaction ID} = \text{Base Transaction} + \text{Delta Files}

-- ORC supports native ACID in Hive
CREATE TABLE transactions (
    id INT,
    amount DECIMAL(10,2),
    timestamp TIMESTAMP
) STORED AS ORC
TBLPROPERTIES (
    'orc.bloom.filter.columns'='id',
    'orc.create.index'='true'
);

-- ACID operations
INSERT INTO transactions VALUES (1, 100.00, current_timestamp());
UPDATE transactions SET amount = 150.00 WHERE id = 1;
DELETE FROM transactions WHERE id = 1;

3. Apache Avro

Avro Schema Evolution

Avro excels at schema evolution with reader/writer schema resolution:

// Writer schema (v1)
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"}
  ]
}

// Reader schema (v2) - adds new field with default
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}

Avro with Kafka (Serialization)

from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer, AvroConsumer
import json

# Avro schema definition
value_schema = avro.loads('''
{
    "type": "record",
    "name": "UserEvent",
    "namespace": "com.company.events",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "event_type", "type": {"type": "enum", "name": "EventType", "symbols": ["LOGIN", "LOGOUT", "PURCHASE"]}},
        {"name": "timestamp", "type": "long", "logicalType": "timestamp-millis"},
        {"name": "metadata", "type": ["null", "string"], "default": null}
    ]
}
''')

# Producer configuration
producer_config = {
    'bootstrap.servers': 'localhost:9092',
    'schema.registry.url': 'http://localhost:8081'
}

producer = AvroProducer(producer_config, default_value_schema=value_schema)

# Produce a message
event = {
    'user_id': 'user_123',
    'event_type': 'LOGIN',
    'timestamp': 1698765432000,
    'metadata': None
}

producer.produce(topic='user-events', value=event)
producer.flush()

Avro Binary Encoding Format

Avro uses a compact binary encoding:

\text{Avro Encoding} = \sum_{i=1}^{n} \text{encode}(\text{field}_i, \text{schema}_i)

For integers, Avro uses variable-length encoding (ZigZag + Varint):

def zigzag_encode(n):
    """ZigZag encode an integer"""
    return (n << 1) ^ (n >> 63)

def varint_encode(n):
    """Encode integer as variable-length"""
    result = bytearray()
    while n > 0x7F:
        result.append((n & 0x7F) | 0x80)
        n >>= 7
    result.append(n)
    return bytes(result)

# Example: encoding -1, 0, 1, 127, 128
for val in [-1, 0, 1, 127, 128]:
    encoded = varint_encode(zigzag_encode(val))
    print(f"{val:4d} -> {encoded.hex()}")

4. Delta Lake

Delta Lake Architecture

Delta Lake adds ACID transactions on top of Parquet:

\Delta \text{ Lake} = \text{Parquet} + \text{Transaction Log} + \text{Metadata}

Transaction Log Structure

Architecture Diagram

my_table/
├── _delta_log/
│   ├── 00000000000000000000.json
│   ├── 00000000000000000001.json
│   ├── 00000000000000000002.checkpoint.parquet
│   └── 00000000000000000003.json
├── part-00000-...-.parquet
├── part-00001-...-.parquet
└── _commits/

Time Travel Queries

\text{Version}_t = \arg\min_{v \leq t} \text{log}_v.\text{timestamp} \leq t

from delta import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DeltaLakeExample") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Read current version
df = spark.read.format("delta").load("/path/to/table")

# Read specific version (time travel)
df_v0 = spark.read.format("delta").option("versionAsOf", 0).load("/path/to/table")

# Read by timestamp
df_ts = spark.read.format("delta").option("timestampAsOf", "2024-01-01").load("/path/to/table")

# View history
delta_table = DeltaTable.forPath(spark, "/path/to/table")
history_df = delta_table.history()
history_df.show(truncate=False)

Z-Ordering for Multi-Dimensional Clustering

Z-Ordering optimizes multi-dimensional range queries:

\text{Z-Order}(x, y) = \text{interleave}(\text{bits}(x), \text{bits}(y))

# Z-Order optimization for multi-dimensional queries
df = spark.read.format("delta").load("/path/to/table")

# Optimize with Z-Order on multiple columns
spark.sql("""
    OPTIMIZE delta.`/path/to/table`
    ZORDER BY (event_date, user_id, event_type)
""")

# This creates optimal file layout for queries filtering on these columns
# Example efficient query:
df_filtered = spark.sql("""
    SELECT * FROM delta.`/path/to/table`
    WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31'
    AND user_id = 'user_123'
""")

5. Performance Benchmarks

Query Performance Comparison

Architecture Diagram

Scenario: Scan 1TB dataset, filter on 3 columns, aggregate 1 result

| Format      | Full Scan | Filtered Scan | Join Performance |
|-------------|-----------|---------------|------------------|
| Parquet     | 45s       | 12s           | 38s              |
| ORC         | 42s       | 10s           | 35s              |
| Avro        | 89s       | 85s           | 42s              |
| Delta Lake  | 43s       | 8s            | 36s              |

Storage Size Comparison (1TB Raw Data)

\text{Size}_{\text{Parquet}} \approx 0.15 \times \text{Size}_{\text{CSV}}

\text{Size}_{\text{ORC}} \approx 0.12 \times \text{Size}_{\text{CSV}}

\text{Size}_{\text{Avro}} \approx 0.30 \times \text{Size}_{\text{CSV}}

\text{Size}_{\text{Delta}} \approx 0.16 \times \text{Size}_{\text{CSV}} \text{ (includes log overhead)}

6. Decision Framework

When to Use Each Format

ℹ️

Parquet - Best for: OLAP workloads, Spark/Presto/Trino ecosystems, columnar access patterns, nested data (via GROUPED encoding)

⚠️

ORC - Best for: Hive-centric ecosystems, ACID transactions without Delta, fine-grained indexing (Bloom filters built-in), streaming + batch unification in Hive 3+

💡

Avro - Best for: Kafka event streaming, schema evolution, row-based access patterns, data serialization between systems

ℹ️

Delta Lake - Best for: Lakehouse architecture, time travel requirements, upsert/merge operations, data quality enforcement, ML feature stores

7. Advanced: Hybrid Approaches

Parquet + Delta Lake (Recommended Modern Stack)

# Write data as Delta (Parquet + transaction log)
df.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .partitionBy("year", "month", "day") \
    .save("s3://data-lake/events/")

# Schema evolution
spark.sql("""
    ALTER TABLE delta.`s3://data-lake/events/`
    ADD COLUMNS (new_field STRING COMMENT 'New field added in v2')
""")

# Data compaction (optimize small files)
spark.sql("""
    OPTIMIZE delta.`s3://data-lake/events/`
    WHERE year = 2024 AND month = 1
""")

Format Conversion Pipeline

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp

spark = SparkSession.builder \
    .appName("FormatConversion") \
    .getOrCreate()

# Convert CSV to Parquet (batch)
df_csv = spark.read.csv("s3://raw/data.csv", header=True, inferSchema=True)
df_csv.write \
    .mode("overwrite") \
    .parquet("s3://curated/data.parquet")

# Convert Avro to Delta (streaming)
df_avro_stream = spark.readStream \
    .format("avro") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "events") \
    .load()

df_avro_stream.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "s3://checkpoints/avro-to-delta") \
    .start("s3://lake/events/")

Follow-up Questions

How does predicate pushdown differ between Parquet and ORC at the page/stripe level?
Explain the trade-offs between dictionary encoding and RLE in Parquet for different data distributions.
How would you implement schema evolution in a production Avro-based Kafka pipeline?
What are the implications of small file problems in Delta Lake and how does OPTIMIZE compaction work?
Compare the ACID transaction guarantees between ORC (Hive) and Delta Lake in terms of write amplification.
How does column pruning interact with nested data structures (arrays, maps, structs) in Parquet?
Explain the merge-on-read vs copy-on-write strategies in Delta Lake for update-heavy workloads.
How would you design a format migration strategy for a petabyte-scale data lake?