📊 Data Warehouse on AWS

Master Redshift schema design, distribution strategies, and data warehouse best practices.

Module: AWS Data Engineering • Topic 21 of 65 • Premium Content

Star Schema Design

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                    STAR SCHEMA DESIGN                                         │
│                                                                             │
│                    ┌─────────────────┐                                      │
│                    │  dim_date       │                                      │
│                    │─────────────────│                                      │
│                    │ date_key (PK)   │                                      │
│                    │ full_date       │                                      │
│                    │ year            │                                      │
│                    │ quarter         │                                      │
│                    │ month           │                                      │
│                    └────────┬────────┘                                      │
│                             │                                               │
│  ┌─────────────────┐       │       ┌─────────────────┐                     │
│  │  dim_customer   │       │       │  dim_product    │                     │
│  │─────────────────│       │       │─────────────────│                     │
│  │ customer_key(PK)│       │       │ product_key(PK) │                     │
│  │ customer_id     │       │       │ product_id      │                     │
│  │ name            ├───────┼───────┤ name            │                     │
│  │ segment         │       │       │ category        │                     │
│  └─────────────────┘       │       └────────┬────────┘                     │
│                            │                │                               │
│                    ┌───────┴────────────────┴───────┐                      │
│                    │        fact_sales               │                      │
│                    │─────────────────────────────────│                      │
│                    │ sale_key (PK)                   │                      │
│                    │ date_key (FK)                   │                      │
│                    │ customer_key (FK)               │                      │
│                    │ product_key (FK)                │                      │
│                    │ quantity                        │                      │
│                    │ amount                          │                      │
│                    │ discount                        │                      │
│                    └─────────────────────────────────┘                      │
└─────────────────────────────────────────────────────────────────────────────┘

Distribution Strategies

Strategy	Best For	Performance	Example
KEY	Large fact tables joined on column	Best join performance	fact_sales DISTKEY(customer_id)
EVEN	Even distribution without joins	Good for staging	staging_data DISTSTYLE EVEN
ALL	Small dimension tables (<2GB)	Fastest scans	dim_products DISTSTYLE ALL
AUTO	Unknown patterns	Adaptive	Most tables

Sort Keys

-- Compound sort for time-series queries
CREATE TABLE fact_sales (
    sale_id BIGINT IDENTITY(1,1),
    date_key INT DISTKEY,
    customer_key INT,
    product_key INT,
    quantity INT,
    amount DECIMAL(10,2)
)
DISTSTYLE KEY
COMPOUND SORTKEY(date_key, customer_key);

-- Interleaved for multi-column queries
CREATE TABLE dim_customer (
    customer_key INT IDENTITY(1,1),
    customer_id VARCHAR(50),
    name VARCHAR(100),
    segment VARCHAR(50),
    region VARCHAR(50)
)
DISTSTYLE ALL
INTERLEAVED SORTKEY(customer_id, segment, region);

Interview Q&A

Q1: Star vs Snowflake schema?

Answer: Star schema has denormalized dimensions (faster queries). Snowflake normalizes dimensions (less storage). Use star for data warehouses.

Q2: How to choose a distribution key?

Answer: Choose the column most frequently used in JOINs. Large tables benefit most from KEY distribution on join columns.

Q3: What is data skew and how to handle it?

Answer: Uneven data distribution across nodes. Handle by choosing better distribution keys, using EVEN for staging, or salting keys.

Summary

Star Schema: Denormalized dimensions, faster queries
Distribution: KEY for facts, ALL for small dimensions, AUTO for most
Sort Keys: Compound for range queries, Interleaved for multi-column
Performance: VACUUM, ANALYZE, and proper indexing