Distribution, Sort Keys
** Distribution, Sort Keys** is a foundational topic for data engineers building scalable analytics platforms.
Why It Matters
Modern data warehouses handle petabytes of data and serve thousands of analysts. Designing them well is both an art and a science.
Architecture Overview
-- Example: Star schema fact table
CREATE TABLE fact_orders (
order_id BIGINT PRIMARY KEY,
customer_key INT REFERENCES dim_customer(customer_key),
product_key INT REFERENCES dim_product(product_key),
date_key INT REFERENCES dim_date(date_key),
quantity INT,
unit_price DECIMAL(10,2),
total_amount DECIMAL(10,2),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)
PARTITION BY RANGE (date_key);
-- Dimension table
CREATE TABLE dim_customer (
customer_key INT PRIMARY KEY,
customer_id VARCHAR(50) UNIQUE,
full_name VARCHAR(200),
email VARCHAR(200),
country VARCHAR(100),
segment VARCHAR(50),
valid_from DATE,
valid_to DATE,
is_current BOOLEAN DEFAULT TRUE
);
Key Design Principles
- Choose the right grain for fact tables
- Use surrogate keys for all dimensions
- Implement Slowly Changing Dimensions (SCD) appropriately
- Partition data for query performance
Summary
Good warehouse design directly impacts query performance, cost, and developer productivity.