Parquet vs ORC vs Avro vs Delta Lake: Deep Dive
Difficulty: Senior/Staff Level | Companies: Google, Meta, Netflix, Uber, Airbnb, Databricks
This question tests your understanding of columnar and row-based storage formats, their internal structures, and when to use each in production data pipelines.
Core Concepts: Storage Format Taxonomy
Row-Based vs Columnar Storage
The fundamental distinction in data encoding formats is between row-oriented and column-oriented storage:
Row-based storage (e.g., Avro, CSV):
Each row is stored contiguously:
| Name | Age | City |
|---------|-----|----------|
| Alice | 30 | NYC |
| Bob | 25 | SF |
Storage layout:
[Alice|30|NYC][Bob|25|SF]...
Columnar storage (e.g., Parquet, ORC):
Each column is stored contiguously:
Names: [Alice][Bob]...
Ages: [30][25]...
Cities: [NYC][SF]...
Compression Ratio Analysis
Columnar storage achieves better compression because adjacent values in a column often share similar data patterns:
For columnar formats:
For row-based formats:
Where typically due to better locality of similar values.
1. Apache Parquet
Internal Structure
Parquet uses a hybrid approach with row groups and column chunks:
Parquet File
├── File Metadata
│ ├── Schema
│ ├── Row Group 1
│ │ ├── Column Chunk 1 (compressed)
│ │ ├── Column Chunk 2 (compressed)
│ │ └── ...
│ ├── Row Group 2
│ │ └── ...
│ └── ...
└── Page Indexes
Encoding Schemes in Parquet
import pyarrow as pa
import pyarrow.parquet as pq
# Create sample data with different data types
data = pa.table({
'id': pa.array([1, 2, 3, 4, 5] * 1000000),
'name': pa.array(['Alice', 'Bob', 'Charlie', 'David', 'Eve'] * 1000000),
'score': pa.array([85.5, 92.3, 78.9, 95.1, 88.7] * 1000000),
'category': pa.array(['A', 'B', 'A', 'C', 'B'] * 1000000)
})
# Write with different encoding strategies
pq.write_table(
data,
'data_plain.parquet',
compression='none',
write_statistics=True,
encoding='PLAIN'
)
pq.write_table(
data,
'data_dict.parquet',
compression='snappy',
write_statistics=True,
encoding='DELTA_BYTE_ARRAY' # Dictionary encoding for low-cardinality
)
pq.write_table(
data,
'data_delta.parquet',
compression='zstd',
write_statistics=True,
encoding='DELTA_BINARY_PACKED' # Delta encoding for sequential integers
)
# Compare file sizes
import os
for f in ['data_plain.parquet', 'data_dict.parquet', 'data_delta.parquet']:
print(f"{f}: {os.path.getsize(f) / 1024 / 1024:.2f} MB")
Parquet Page Types and Statistics
Each Parquet page includes statistics that enable predicate pushdown:
import pyarrow.parquet as pq
# Read metadata to inspect page statistics
parquet_file = pq.ParquetFile('data_dict.parquet')
for i, row_group in enumerate(parquet_file.metadata.row_groups):
print(f"\nRow Group {i}:")
for j, column in enumerate(row_group.columns):
print(f" Column {j} ({column.path_in_schema}):")
print(f" Min: {column.statistics.min}")
print(f" Max: {column.statistics.max}")
print(f" Null count: {column.statistics.null_count}")
print(f" Distinct values: {column.statistics.distinct_count}")
Bloom Filters in Parquet
Parquet supports optional Bloom filters for efficient equality checks:
Where:
- = number of bits in the filter
- = number of elements
- = number of hash functions
# Writing Parquet with Bloom filter configuration
import pyarrow.parquet as pq
from pyarrow.parquet import BloomFilter
# Custom Bloom filter settings via ParquetWriter
pq.write_table(
data,
'data_bloom.parquet',
write_page_index=True,
write_statistics=True
)
# Note: Native Bloom filter support requires Arrow C++ >= 8.0
# or using parquet-mr writer
2. Apache ORC
ORC File Structure
ORC (Optimized Row Columnar) provides built-in indexes and statistics:
ORC File
├── File Footer
│ ├── File Stats (min/max per column)
│ ├── Stripe 1
│ │ ├── Index (bloom filter + min/max per 10K rows)
│ │ ├── Data (column groups)
│ │ └── Statistics
│ ├── Stripe 2
│ │ └── ...
│ └── ...
└── PostScript (compression, version)
ORC vs Parquet Performance
# Example: ORC writing with Hive (via PyHive)
from pyhive import hive
import pandas as pd
# Simulated ORC write using pandas + pyorc
import pyorc
# Create ORC file with specific compression
data_orc = {
'id': list(range(1000000)),
'name': ['user_' + str(i % 1000) for i in range(1000000)],
'value': [float(i * 0.1) for i in range(1000000)]
}
schema = "struct<id:int,name:string,value:double>"
with open('data.orc', 'wb') as f:
writer = pyorc.Writer(f, schema, stripe_size=67108864) # 64MB stripes
for i in range(0, 1000000, 1000):
batch = {
'id': data_orc['id'][i:i+1000],
'name': data_orc['name'][i:i+1000],
'value': data_orc['value'][i:i+1000]
}
writer.write(batch)
writer.close()
ORC's Built-in ACID Support
ORC provides native ACID transactions (Hive 3+):
-- ORC supports native ACID in Hive
CREATE TABLE transactions (
id INT,
amount DECIMAL(10,2),
timestamp TIMESTAMP
) STORED AS ORC
TBLPROPERTIES (
'orc.bloom.filter.columns'='id',
'orc.create.index'='true'
);
-- ACID operations
INSERT INTO transactions VALUES (1, 100.00, current_timestamp());
UPDATE transactions SET amount = 150.00 WHERE id = 1;
DELETE FROM transactions WHERE id = 1;
3. Apache Avro
Avro Schema Evolution
Avro excels at schema evolution with reader/writer schema resolution:
// Writer schema (v1)
{
"type": "record",
"name": "User",
"fields": [
{"name": "id", "type": "int"},
{"name": "name", "type": "string"}
]
}
// Reader schema (v2) - adds new field with default
{
"type": "record",
"name": "User",
"fields": [
{"name": "id", "type": "int"},
{"name": "name", "type": "string"},
{"name": "email", "type": ["null", "string"], "default": null}
]
}
Avro with Kafka (Serialization)
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer, AvroConsumer
import json
# Avro schema definition
value_schema = avro.loads('''
{
"type": "record",
"name": "UserEvent",
"namespace": "com.company.events",
"fields": [
{"name": "user_id", "type": "string"},
{"name": "event_type", "type": {"type": "enum", "name": "EventType", "symbols": ["LOGIN", "LOGOUT", "PURCHASE"]}},
{"name": "timestamp", "type": "long", "logicalType": "timestamp-millis"},
{"name": "metadata", "type": ["null", "string"], "default": null}
]
}
''')
# Producer configuration
producer_config = {
'bootstrap.servers': 'localhost:9092',
'schema.registry.url': 'http://localhost:8081'
}
producer = AvroProducer(producer_config, default_value_schema=value_schema)
# Produce a message
event = {
'user_id': 'user_123',
'event_type': 'LOGIN',
'timestamp': 1698765432000,
'metadata': None
}
producer.produce(topic='user-events', value=event)
producer.flush()
Avro Binary Encoding Format
Avro uses a compact binary encoding:
For integers, Avro uses variable-length encoding (ZigZag + Varint):
def zigzag_encode(n):
"""ZigZag encode an integer"""
return (n << 1) ^ (n >> 63)
def varint_encode(n):
"""Encode integer as variable-length"""
result = bytearray()
while n > 0x7F:
result.append((n & 0x7F) | 0x80)
n >>= 7
result.append(n)
return bytes(result)
# Example: encoding -1, 0, 1, 127, 128
for val in [-1, 0, 1, 127, 128]:
encoded = varint_encode(zigzag_encode(val))
print(f"{val:4d} -> {encoded.hex()}")
4. Delta Lake
Delta Lake Architecture
Delta Lake adds ACID transactions on top of Parquet:
Transaction Log Structure
my_table/
├── _delta_log/
│ ├── 00000000000000000000.json
│ ├── 00000000000000000001.json
│ ├── 00000000000000000002.checkpoint.parquet
│ └── 00000000000000000003.json
├── part-00000-...-.parquet
├── part-00001-...-.parquet
└── _commits/
Time Travel Queries
from delta import DeltaTable
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("DeltaLakeExample") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.getOrCreate()
# Read current version
df = spark.read.format("delta").load("/path/to/table")
# Read specific version (time travel)
df_v0 = spark.read.format("delta").option("versionAsOf", 0).load("/path/to/table")
# Read by timestamp
df_ts = spark.read.format("delta").option("timestampAsOf", "2024-01-01").load("/path/to/table")
# View history
delta_table = DeltaTable.forPath(spark, "/path/to/table")
history_df = delta_table.history()
history_df.show(truncate=False)
Z-Ordering for Multi-Dimensional Clustering
Z-Ordering optimizes multi-dimensional range queries:
# Z-Order optimization for multi-dimensional queries
df = spark.read.format("delta").load("/path/to/table")
# Optimize with Z-Order on multiple columns
spark.sql("""
OPTIMIZE delta.`/path/to/table`
ZORDER BY (event_date, user_id, event_type)
""")
# This creates optimal file layout for queries filtering on these columns
# Example efficient query:
df_filtered = spark.sql("""
SELECT * FROM delta.`/path/to/table`
WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31'
AND user_id = 'user_123'
""")
5. Performance Benchmarks
Query Performance Comparison
Scenario: Scan 1TB dataset, filter on 3 columns, aggregate 1 result
| Format | Full Scan | Filtered Scan | Join Performance |
|-------------|-----------|---------------|------------------|
| Parquet | 45s | 12s | 38s |
| ORC | 42s | 10s | 35s |
| Avro | 89s | 85s | 42s |
| Delta Lake | 43s | 8s | 36s |
Storage Size Comparison (1TB Raw Data)
6. Decision Framework
When to Use Each Format
ℹ️
Parquet - Best for: OLAP workloads, Spark/Presto/Trino ecosystems, columnar access patterns, nested data (via GROUPED encoding)
⚠️
ORC - Best for: Hive-centric ecosystems, ACID transactions without Delta, fine-grained indexing (Bloom filters built-in), streaming + batch unification in Hive 3+
💡
Avro - Best for: Kafka event streaming, schema evolution, row-based access patterns, data serialization between systems
ℹ️
Delta Lake - Best for: Lakehouse architecture, time travel requirements, upsert/merge operations, data quality enforcement, ML feature stores
7. Advanced: Hybrid Approaches
Parquet + Delta Lake (Recommended Modern Stack)
# Write data as Delta (Parquet + transaction log)
df.write \
.format("delta") \
.mode("overwrite") \
.option("overwriteSchema", "true") \
.partitionBy("year", "month", "day") \
.save("s3://data-lake/events/")
# Schema evolution
spark.sql("""
ALTER TABLE delta.`s3://data-lake/events/`
ADD COLUMNS (new_field STRING COMMENT 'New field added in v2')
""")
# Data compaction (optimize small files)
spark.sql("""
OPTIMIZE delta.`s3://data-lake/events/`
WHERE year = 2024 AND month = 1
""")
Format Conversion Pipeline
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp
spark = SparkSession.builder \
.appName("FormatConversion") \
.getOrCreate()
# Convert CSV to Parquet (batch)
df_csv = spark.read.csv("s3://raw/data.csv", header=True, inferSchema=True)
df_csv.write \
.mode("overwrite") \
.parquet("s3://curated/data.parquet")
# Convert Avro to Delta (streaming)
df_avro_stream = spark.readStream \
.format("avro") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "events") \
.load()
df_avro_stream.writeStream \
.format("delta") \
.outputMode("append") \
.option("checkpointLocation", "s3://checkpoints/avro-to-delta") \
.start("s3://lake/events/")
Follow-up Questions
- How does predicate pushdown differ between Parquet and ORC at the page/stripe level?
- Explain the trade-offs between dictionary encoding and RLE in Parquet for different data distributions.
- How would you implement schema evolution in a production Avro-based Kafka pipeline?
- What are the implications of small file problems in Delta Lake and how does OPTIMIZE compaction work?
- Compare the ACID transaction guarantees between ORC (Hive) and Delta Lake in terms of write amplification.
- How does column pruning interact with nested data structures (arrays, maps, structs) in Parquet?
- Explain the merge-on-read vs copy-on-write strategies in Delta Lake for update-heavy workloads.
- How would you design a format migration strategy for a petabyte-scale data lake?