π Data Warehouse on AWS
Master Redshift schema design, distribution strategies, and data warehouse best practices.
Module: AWS Data Engineering β’ Topic 21 of 65 β’ Premium Content
Star Schema Design
Architecture Diagram
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β STAR SCHEMA DESIGN β
β β
β βββββββββββββββββββ β
β β dim_date β β
β βββββββββββββββββββ β
β β date_key (PK) β β
β β full_date β β
β β year β β
β β quarter β β
β β month β β
β ββββββββββ¬βββββββββ β
β β β
β βββββββββββββββββββ β βββββββββββββββββββ β
β β dim_customer β β β dim_product β β
β βββββββββββββββββββ β βββββββββββββββββββ β
β β customer_key(PK)β β β product_key(PK) β β
β β customer_id β β β product_id β β
β β name βββββββββΌββββββββ€ name β β
β β segment β β β category β β
β βββββββββββββββββββ β ββββββββββ¬βββββββββ β
β β β β
β βββββββββ΄βββββββββββββββββ΄ββββββββ β
β β fact_sales β β
β βββββββββββββββββββββββββββββββββββ β
β β sale_key (PK) β β
β β date_key (FK) β β
β β customer_key (FK) β β
β β product_key (FK) β β
β β quantity β β
β β amount β β
β β discount β β
β βββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Distribution Strategies
| Strategy | Best For | Performance | Example |
|---|---|---|---|
| KEY | Large fact tables joined on column | Best join performance | fact_sales DISTKEY(customer_id) |
| EVEN | Even distribution without joins | Good for staging | staging_data DISTSTYLE EVEN |
| ALL | Small dimension tables (<2GB) | Fastest scans | dim_products DISTSTYLE ALL |
| AUTO | Unknown patterns | Adaptive | Most tables |
Sort Keys
-- Compound sort for time-series queries
CREATE TABLE fact_sales (
sale_id BIGINT IDENTITY(1,1),
date_key INT DISTKEY,
customer_key INT,
product_key INT,
quantity INT,
amount DECIMAL(10,2)
)
DISTSTYLE KEY
COMPOUND SORTKEY(date_key, customer_key);
-- Interleaved for multi-column queries
CREATE TABLE dim_customer (
customer_key INT IDENTITY(1,1),
customer_id VARCHAR(50),
name VARCHAR(100),
segment VARCHAR(50),
region VARCHAR(50)
)
DISTSTYLE ALL
INTERLEAVED SORTKEY(customer_id, segment, region);
Interview Q&A
Q1: Star vs Snowflake schema?
Answer: Star schema has denormalized dimensions (faster queries). Snowflake normalizes dimensions (less storage). Use star for data warehouses.
Q2: How to choose a distribution key?
Answer: Choose the column most frequently used in JOINs. Large tables benefit most from KEY distribution on join columns.
Q3: What is data skew and how to handle it?
Answer: Uneven data distribution across nodes. Handle by choosing better distribution keys, using EVEN for staging, or salting keys.
Summary
- Star Schema: Denormalized dimensions, faster queries
- Distribution: KEY for facts, ALL for small dimensions, AUTO for most
- Sort Keys: Compound for range queries, Interleaved for multi-column
- Performance: VACUUM, ANALYZE, and proper indexing