Amazon & Netflix Interview

Star Schema vs Snowflake Schema: When to Use What?

Understanding dimensional modeling for analytical workloads

Interview Question

"You're designing a data warehouse for an e-commerce company. The business needs to analyze sales by product category, region, and time period. The product hierarchy has 5 levels (category → subcategory → brand → model → SKU). Should you use a star schema or snowflake schema? Justify your answer and discuss the tradeoffs."

Difficulty: Medium | Frequently asked at Amazon, Netflix, Airbnb

Theoretical Foundation

What is Dimensional Modeling?

Dimensional modeling is a design technique optimized for querying and analyzing data. It organizes data into fact tables (numeric measurements) and dimension tables (descriptive attributes).

The two primary schemas are:

Star Schema

In a star schema, dimension tables are denormalized—each dimension is a single, flat table directly connected to the fact table.

Key characteristics:

Denormalized dimension tables (all attributes in one table)
Simple JOINs (fact joins directly to each dimension)
Fewer tables overall
Faster queries due to fewer JOINs

Snowflake Schema

In a snowflake schema, dimension tables are normalized—hierarchical attributes are split into separate tables.

Key characteristics:

Normalized dimension tables (3NF)
More JOINs required for queries
More tables overall
Less data redundancy

Mathematical Comparison

The storage efficiency can be quantified. Consider a dimension with:

n rows
d attributes
Average attribute length l bytes
Attribute cardinality c (average distinct values per attribute)

Star Schema storage:

S_{star} = n \times d \times l

Snowflake Schema storage:

S_{snow} = \sum_{i=1}^{d} (c_i \times l_i + n \times k_i)

where $k_i$ is the foreign key length (typically 4-8 bytes).

For a product dimension with 1M products, 5 attributes (avg 20 chars, avg 1000 cardinality):

Star: $1M \times 5 \times 20 = 100MB$

Snowflake: $5 \times (1000 \times 20) + 1M \times 5 \times 4 = 100KB + 20MB = 20.1MB$

The snowflake saves ~80% storage but requires 5 JOINs instead of 1.

ℹ️

Key Insight: The storage savings of snowflake schema are often negligible compared to the query complexity overhead. Modern columnar databases (Snowflake, BigQuery, Redshift) compress data so effectively that denormalization rarely causes storage issues.

Query Performance Analysis

Consider a query: "Total sales by category and quarter for 2024"

Star Schema query:

SELECT 
    p.category,
    d.quarter,
    SUM(f.amount) as total_sales
FROM fact_sales f
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_date d ON f.date_key = d.date_key
WHERE d.year = 2024
GROUP BY p.category, d.quarter;

Snowflake Schema query:

SELECT 
    cat.category_name,
    q.quarter_name,
    SUM(f.amount) as total_sales
FROM fact_sales f
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_subcategory sc ON p.subcategory_key = sc.subcategory_key
JOIN dim_category cat ON sc.category_key = cat.category_key
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_month m ON d.month_key = m.month_key
JOIN dim_quarter q ON m.quarter_key = q.quarter_key
WHERE d.year = 2024
GROUP BY cat.category_name, q.quarter_name;

The star schema uses 2 JOINs; the snowflake uses 6 JOINs. In columnar databases, JOINs are expensive because they require:

Building hash tables
Shuffle operations across nodes
Memory allocation for hash tables

Performance ratio: For complex queries with many dimensions, star schema can be 2-10x faster.

⚠️

Common Interview Trap: Candidates often argue that snowflake is "better" because it's normalized (following database design best practices for OLTP). But data warehouses are OLAP systems where denormalization is intentionally used to optimize read performance. Normalization rules from OLTP don't apply to OLAP.

When to Use Each Schema

Use Star Schema when:

Query performance is critical (BI dashboards, ad-hoc analysis)
Dimension hierarchies are shallow (3-4 levels)
Business users need simple, intuitive data models
You're using columnar storage (Snowflake, BigQuery, Redshift)
ETL/ELT simplicity is important

Use Snowflake Schema when:

Dimension hierarchies are deep (5+ levels) and complex
Storage costs are a primary concern (rare)
You need to support multiple conformed dimensions across facts
Data governance requires strict normalization
Dimension tables are extremely large (billions of rows)

Use a Hybrid (Constellation Schema) when:

You have multiple fact tables sharing dimensions
Some dimensions benefit from normalization, others don't
You need to balance performance and storage

The Role of Slowly Changing Dimensions (SCDs)

The choice of schema affects how you handle SCDs:

Star Schema with SCD Type 2:

-- Dimension table with versioning
CREATE TABLE dim_product (
    product_key        BIGINT PRIMARY KEY,
    product_id         VARCHAR(50),
    product_name       VARCHAR(200),
    category           VARCHAR(100),
    effective_date     DATE,
    expiry_date        DATE,
    is_current         BOOLEAN
);

-- Query for current version only
SELECT * FROM dim_product WHERE is_current = TRUE;

Snowflake Schema with SCD Type 2:

-- Each level gets its own versioning
CREATE TABLE dim_category (
    category_key       BIGINT PRIMARY KEY,
    category_name      VARCHAR(100),
    effective_date     DATE,
    expiry_date        DATE,
    is_current         BOOLEAN
);

CREATE TABLE dim_subcategory (
    subcategory_key    BIGINT PRIMARY KEY,
    subcategory_name   VARCHAR(100),
    category_key       BIGINT REFERENCES dim_category(category_key),
    effective_date     DATE,
    expiry_date        DATE,
    is_current         BOOLEAN
);

💡

Production Tip: When using SCD Type 2 in a star schema, consider adding a version_number column instead of effective_date/expiry_date. This simplifies queries and improves performance for point-in-time analysis.

Modern Alternatives: Data Vault and Anchor Modeling

Data Vault is gaining popularity for enterprise data warehouses:

Data Vault advantages:

Audit trail built-in
Parallel loading possible
Business key (BK) separate from surrogate key
Satellite tables store historical attributes

Data Vault disadvantages:

Complex queries (many JOINs)
Requires specialized tools (dbt, DataVault Loader)
Overkill for small-medium warehouses

Code Implementation

Building a Star Schema ETL Pipeline

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder \
    .appName("StarSchemaETL") \
    .config("spark.sql.shuffle.partitions", "200") \
    .getOrCreate()

# ============================================================
# 1. DIMENSION TABLES
# ============================================================

# Read raw product data
raw_products = spark.read.parquet("s3://data-lake/raw/products/")

# Build Dim_Product (denormalized - star schema)
dim_product = raw_products.select(
    F.monotonically_increasing_id().alias("product_key"),
    F.col("product_id").alias("business_product_id"),
    "product_name",
    "category",
    "subcategory",
    "brand",
    "model",
    "sku",
    F.current_timestamp().alias("effective_date"),
    F.lit(None).cast("timestamp").alias("expiry_date"),
    F.lit(True).alias("is_current")
)

# Build Dim_Date
dim_date = spark.sql("""
    SELECT 
        CAST(date_format(date, 'yyyyMMdd') AS INT) as date_key,
        date,
        day(date) as day_of_month,
        dayofweek(date) as day_of_week,
        dayname(date) as day_name,
        weekofyear(date) as week_of_year,
        month(date) as month,
        monthname(date) as month_name,
        quarter(date) as quarter,
        year(date) as year,
        concat(year(date), '-Q', quarter(date)) as year_quarter,
        concat(year(date), '-', lpad(month(date), 2, '0')) as year_month
    FROM (
        SELECT explode(sequence(
            to_date('2020-01-01'), 
            to_date('2030-12-31'), 
            interval 1 day
        )) as date
    )
""")

# Build Dim_Customer
dim_customer = spark.read.parquet("s3://data-lake/raw/customers/") \
    .select(
        F.monotonically_increasing_id().alias("customer_key"),
        "customer_id",
        "customer_name",
        "email",
        "segment",
        "region",
        "country",
        "registration_date"
    )

# ============================================================
# 2. FACT TABLE
# ============================================================

raw_orders = spark.read.parquet("s3://data-lake/raw/orders/")

# Join to get surrogate keys
fact_sales = raw_orders \
    .join(dim_product, raw_orders.product_id == dim_product.business_product_id, "left") \
    .join(dim_date, F.date_format(F.col("order_date"), "yyyyMMdd").cast("INT") == dim_date.date_key, "left") \
    .join(dim_customer, raw_orders.customer_id == dim_customer.customer_id, "left") \
    .select(
        dim_date.date_key,
        dim_product.product_key,
        dim_customer.customer_key,
        "quantity",
        "unit_price",
        F.col("quantity") * F.col("unit_price").alias("amount"),
        F.col("quantity") * F.col("cost").alias("cost"),
        F.col("quantity") * F.col("unit_price") - F.col("quantity") * F.col("cost").alias("profit")
    )

# Write to Delta Lake
fact_sales.write.format("delta") \
    .mode("overwrite") \
    .partitionBy("date_key") \
    .save("s3://data-warehouse/fact_sales/")

dim_product.write.format("delta") \
    .mode("overwrite") \
    .save("s3://data-warehouse/dim_product/")

dim_date.write.format("delta") \
    .mode("overwrite") \
    .save("s3://data-warehouse/dim_date/")

dim_customer.write.format("delta") \
    .mode("overwrite") \
    .save("s3://data-warehouse/dim_customer/")

Building a Snowflake Schema ETL Pipeline

# ============================================================
# SNOWFLAKE SCHEMA - Normalized Dimensions
# ============================================================

# Dim_Category (Level 1)
dim_category = raw_products.select(
    F.monotonically_increasing_id().alias("category_key"),
    F.col("category").alias("category_name")
).distinct()

# Dim_Subcategory (Level 2)
dim_subcategory = raw_products.select(
    F.monotonically_increasing_id().alias("subcategory_key"),
    F.col("subcategory").alias("subcategory_name"),
    F.col("category")
).distinct() \
.join(dim_category, "category", "left") \
.select("subcategory_key", "subcategory_name", "category_key")

# Dim_Brand (Level 3)
dim_brand = raw_products.select(
    F.monotonically_increasing_id().alias("brand_key"),
    F.col("brand").alias("brand_name"),
    F.col("subcategory")
).distinct() \
.join(dim_subcategory, "subcategory", "left") \
.select("brand_key", "brand_name", "subcategory_key")

# Dim_Product (Level 4) - now normalized
dim_product_snow = raw_products.select(
    F.monotonically_increasing_id().alias("product_key"),
    F.col("product_id").alias("business_product_id"),
    "product_name",
    "sku",
    F.col("brand")
).distinct() \
.join(dim_brand, raw_products.brand == dim_brand.brand_name, "left") \
.select("product_key", "business_product_id", "product_name", "sku", "brand_key")

# Fact table references normalized keys
fact_sales_snow = raw_orders \
    .join(dim_product_snow, raw_orders.product_id == dim_product_snow.business_product_id, "left") \
    .select(
        "date_key",
        "product_key",
        "customer_key",
        "quantity",
        "unit_price",
        F.col("quantity") * F.col("unit_price").alias("amount")
    )

Query Comparison: Star vs Snowflake

-- STAR SCHEMA QUERY (2 JOINs)
-- "Total sales by category and quarter for 2024"
SELECT 
    p.category,
    d.quarter,
    SUM(f.amount) as total_sales,
    COUNT(DISTINCT f.customer_key) as unique_customers
FROM fact_sales f
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_date d ON f.date_key = d.date_key
WHERE d.year = 2024
GROUP BY p.category, d.quarter
ORDER BY p.category, d.quarter;

-- SNOWFLAKE SCHEMA QUERY (6 JOINs)
SELECT 
    cat.category_name,
    q.quarter_name,
    SUM(f.amount) as total_sales,
    COUNT(DISTINCT f.customer_key) as unique_customers
FROM fact_sales f
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_brand b ON p.brand_key = b.brand_key
JOIN dim_subcategory sc ON b.subcategory_key = sc.subcategory_key
JOIN dim_category cat ON sc.category_key = cat.category_key
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_month m ON d.month_key = m.month_key
JOIN dim_quarter q ON m.quarter_key = q.quarter_key
WHERE d.year = 2024
GROUP BY cat.category_name, q.quarter_name
ORDER BY cat.category_name, q.quarter_name;

ℹ️

Best Practice: When using a snowflake schema, create materialized views for common query patterns. This pre-computes the JOINs and gives you star-schema-like performance with snowflake-schema storage efficiency.

Performance Benchmark

import time

# Benchmark: Star Schema vs Snowflake Schema
queries = {
    "simple_query": """
        -- Both schemas: Same logical query
        SELECT category, SUM(amount) 
        FROM fact_sales JOIN dim_product ON ...
        GROUP BY category
    """,
    "complex_query": """
        -- Star: 3 JOINs | Snowflake: 8 JOINs
        SELECT cat.category, r.region, d.year_quarter,
               SUM(f.amount), COUNT(DISTINCT f.customer_key)
        FROM fact_sales f
        JOIN dim_product p ON ...
        JOIN dim_date d ON ...
        JOIN dim_customer c ON ...
        -- Snowflake adds 5 more JOINs for normalization
        GROUP BY cat.category, r.region, d.year_quarter
    """
}

# Expected results (typical):
# Simple query: Star ~1.2s, Snowflake ~1.8s (1.5x slower)
# Complex query: Star ~3.5s, Snowflake ~12.0s (3.4x slower)

⚠️

Critical Consideration: The performance gap widens as queries become more complex. For dashboards with multiple filter dimensions, snowflake schema can be 5-10x slower without materialized views.

Common Follow-Up Questions

Q1: What about a "flat" table (pre-joined) vs star schema?

A flat table (all dimensions in the fact table) is the most denormalized approach:

Flat table: Fastest queries, highest storage, hardest to maintain
Star schema: Good balance, easy to understand
Snowflake schema: Normalized, more JOINs, lower storage

For most use cases, star schema provides the best balance.

Q2: How do you handle many-to-many relationships?

Use a bridge table between the fact and dimension:

-- Product can belong to multiple categories
CREATE TABLE bridge_product_category (
    product_key INT,
    category_key INT,
    probability DECIMAL(5,4)  -- Weighting factor
);

-- Query with weighted aggregation
SELECT 
    c.category_name,
    SUM(f.amount * b.probability) as weighted_sales
FROM fact_sales f
JOIN bridge_product_category b ON f.product_key = b.product_key
JOIN dim_category c ON b.category_key = c.category_key
GROUP BY c.category_name;

Q3: How does schema choice affect data governance?

Snowflake schema is better for governance because:

Each attribute has a single source of truth
Access control can be applied at the table level
Data lineage is clearer
Changes propagate automatically

Star schema requires updating denormalized attributes in multiple places.

Q4: What about materialized views in modern warehouses?

Materialized views bridge the gap:

-- Snowflake: Create materialized view for common queries
CREATE MATERIALIZED VIEW mv_sales_by_category_quarter AS
SELECT cat.category_name, q.quarter_name, SUM(f.amount)
FROM fact_sales f
JOIN dim_product p ON ...
JOIN dim_category cat ON ...
JOIN dim_date d ON ...
JOIN dim_quarter q ON ...
GROUP BY cat.category_name, q.quarter_name;

💡

Interview Tip: Always ask clarifying questions before answering: What's the query pattern? What's the data volume? What's the latency requirement? The "right" answer depends on these factors.

Company-Specific Tips

Amazon Interview Tips

Emphasize query performance for BI dashboards
Discuss partitioning strategy alongside schema design
Mention conformed dimensions across business units
Be ready to draw the schema on a whiteboard

Netflix Interview Tips

Focus on content recommendation use cases
Discuss how viewing history dimensions change over time
Mention data vault for audit requirements
Talk about real-time vs batch schema considerations

Airbnb Interview Tips

Discuss geographic hierarchies (country → city → neighborhood)
Explain how search dimensions affect schema design
Mention A/B testing fact tables
Talk about cost optimization in cloud warehouses

ℹ️

Final Takeaway: There's no universally "better" schema. Star schema is the default choice for most analytical workloads. Use snowflake schema only when you have strong normalization requirements or extremely large dimension tables. Modern cloud warehouses make denormalization nearly free, so optimize for query simplicity and performance.