Star Schema vs Snowflake Schema: When to Use What?
Understanding dimensional modeling for analytical workloads
Interview Question
"You're designing a data warehouse for an e-commerce company. The business needs to analyze sales by product category, region, and time period. The product hierarchy has 5 levels (category → subcategory → brand → model → SKU). Should you use a star schema or snowflake schema? Justify your answer and discuss the tradeoffs."
Difficulty: Medium | Frequently asked at Amazon, Netflix, Airbnb
Theoretical Foundation
What is Dimensional Modeling?
Dimensional modeling is a design technique optimized for querying and analyzing data. It organizes data into fact tables (numeric measurements) and dimension tables (descriptive attributes).
The two primary schemas are:
Star Schema
In a star schema, dimension tables are denormalized—each dimension is a single, flat table directly connected to the fact table.
Key characteristics:
- Denormalized dimension tables (all attributes in one table)
- Simple JOINs (fact joins directly to each dimension)
- Fewer tables overall
- Faster queries due to fewer JOINs
Snowflake Schema
In a snowflake schema, dimension tables are normalized—hierarchical attributes are split into separate tables.
Key characteristics:
- Normalized dimension tables (3NF)
- More JOINs required for queries
- More tables overall
- Less data redundancy
Mathematical Comparison
The storage efficiency can be quantified. Consider a dimension with:
nrowsdattributes- Average attribute length
lbytes - Attribute cardinality
c(average distinct values per attribute)
Star Schema storage:
Snowflake Schema storage:
where is the foreign key length (typically 4-8 bytes).
For a product dimension with 1M products, 5 attributes (avg 20 chars, avg 1000 cardinality):
Star:
Snowflake:
The snowflake saves ~80% storage but requires 5 JOINs instead of 1.
ℹ️
Key Insight: The storage savings of snowflake schema are often negligible compared to the query complexity overhead. Modern columnar databases (Snowflake, BigQuery, Redshift) compress data so effectively that denormalization rarely causes storage issues.
Query Performance Analysis
Consider a query: "Total sales by category and quarter for 2024"
Star Schema query:
SELECT
p.category,
d.quarter,
SUM(f.amount) as total_sales
FROM fact_sales f
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_date d ON f.date_key = d.date_key
WHERE d.year = 2024
GROUP BY p.category, d.quarter;
Snowflake Schema query:
SELECT
cat.category_name,
q.quarter_name,
SUM(f.amount) as total_sales
FROM fact_sales f
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_subcategory sc ON p.subcategory_key = sc.subcategory_key
JOIN dim_category cat ON sc.category_key = cat.category_key
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_month m ON d.month_key = m.month_key
JOIN dim_quarter q ON m.quarter_key = q.quarter_key
WHERE d.year = 2024
GROUP BY cat.category_name, q.quarter_name;
The star schema uses 2 JOINs; the snowflake uses 6 JOINs. In columnar databases, JOINs are expensive because they require:
- Building hash tables
- Shuffle operations across nodes
- Memory allocation for hash tables
Performance ratio: For complex queries with many dimensions, star schema can be 2-10x faster.
⚠️
Common Interview Trap: Candidates often argue that snowflake is "better" because it's normalized (following database design best practices for OLTP). But data warehouses are OLAP systems where denormalization is intentionally used to optimize read performance. Normalization rules from OLTP don't apply to OLAP.
When to Use Each Schema
Use Star Schema when:
- Query performance is critical (BI dashboards, ad-hoc analysis)
- Dimension hierarchies are shallow (3-4 levels)
- Business users need simple, intuitive data models
- You're using columnar storage (Snowflake, BigQuery, Redshift)
- ETL/ELT simplicity is important
Use Snowflake Schema when:
- Dimension hierarchies are deep (5+ levels) and complex
- Storage costs are a primary concern (rare)
- You need to support multiple conformed dimensions across facts
- Data governance requires strict normalization
- Dimension tables are extremely large (billions of rows)
Use a Hybrid (Constellation Schema) when:
- You have multiple fact tables sharing dimensions
- Some dimensions benefit from normalization, others don't
- You need to balance performance and storage
The Role of Slowly Changing Dimensions (SCDs)
The choice of schema affects how you handle SCDs:
Star Schema with SCD Type 2:
-- Dimension table with versioning
CREATE TABLE dim_product (
product_key BIGINT PRIMARY KEY,
product_id VARCHAR(50),
product_name VARCHAR(200),
category VARCHAR(100),
effective_date DATE,
expiry_date DATE,
is_current BOOLEAN
);
-- Query for current version only
SELECT * FROM dim_product WHERE is_current = TRUE;
Snowflake Schema with SCD Type 2:
-- Each level gets its own versioning
CREATE TABLE dim_category (
category_key BIGINT PRIMARY KEY,
category_name VARCHAR(100),
effective_date DATE,
expiry_date DATE,
is_current BOOLEAN
);
CREATE TABLE dim_subcategory (
subcategory_key BIGINT PRIMARY KEY,
subcategory_name VARCHAR(100),
category_key BIGINT REFERENCES dim_category(category_key),
effective_date DATE,
expiry_date DATE,
is_current BOOLEAN
);
💡
Production Tip: When using SCD Type 2 in a star schema, consider adding a version_number column instead of effective_date/expiry_date. This simplifies queries and improves performance for point-in-time analysis.
Modern Alternatives: Data Vault and Anchor Modeling
Data Vault is gaining popularity for enterprise data warehouses:
Data Vault advantages:
- Audit trail built-in
- Parallel loading possible
- Business key (BK) separate from surrogate key
- Satellite tables store historical attributes
Data Vault disadvantages:
- Complex queries (many JOINs)
- Requires specialized tools (dbt, DataVault Loader)
- Overkill for small-medium warehouses
Code Implementation
Building a Star Schema ETL Pipeline
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window
spark = SparkSession.builder \
.appName("StarSchemaETL") \
.config("spark.sql.shuffle.partitions", "200") \
.getOrCreate()
# ============================================================
# 1. DIMENSION TABLES
# ============================================================
# Read raw product data
raw_products = spark.read.parquet("s3://data-lake/raw/products/")
# Build Dim_Product (denormalized - star schema)
dim_product = raw_products.select(
F.monotonically_increasing_id().alias("product_key"),
F.col("product_id").alias("business_product_id"),
"product_name",
"category",
"subcategory",
"brand",
"model",
"sku",
F.current_timestamp().alias("effective_date"),
F.lit(None).cast("timestamp").alias("expiry_date"),
F.lit(True).alias("is_current")
)
# Build Dim_Date
dim_date = spark.sql("""
SELECT
CAST(date_format(date, 'yyyyMMdd') AS INT) as date_key,
date,
day(date) as day_of_month,
dayofweek(date) as day_of_week,
dayname(date) as day_name,
weekofyear(date) as week_of_year,
month(date) as month,
monthname(date) as month_name,
quarter(date) as quarter,
year(date) as year,
concat(year(date), '-Q', quarter(date)) as year_quarter,
concat(year(date), '-', lpad(month(date), 2, '0')) as year_month
FROM (
SELECT explode(sequence(
to_date('2020-01-01'),
to_date('2030-12-31'),
interval 1 day
)) as date
)
""")
# Build Dim_Customer
dim_customer = spark.read.parquet("s3://data-lake/raw/customers/") \
.select(
F.monotonically_increasing_id().alias("customer_key"),
"customer_id",
"customer_name",
"email",
"segment",
"region",
"country",
"registration_date"
)
# ============================================================
# 2. FACT TABLE
# ============================================================
raw_orders = spark.read.parquet("s3://data-lake/raw/orders/")
# Join to get surrogate keys
fact_sales = raw_orders \
.join(dim_product, raw_orders.product_id == dim_product.business_product_id, "left") \
.join(dim_date, F.date_format(F.col("order_date"), "yyyyMMdd").cast("INT") == dim_date.date_key, "left") \
.join(dim_customer, raw_orders.customer_id == dim_customer.customer_id, "left") \
.select(
dim_date.date_key,
dim_product.product_key,
dim_customer.customer_key,
"quantity",
"unit_price",
F.col("quantity") * F.col("unit_price").alias("amount"),
F.col("quantity") * F.col("cost").alias("cost"),
F.col("quantity") * F.col("unit_price") - F.col("quantity") * F.col("cost").alias("profit")
)
# Write to Delta Lake
fact_sales.write.format("delta") \
.mode("overwrite") \
.partitionBy("date_key") \
.save("s3://data-warehouse/fact_sales/")
dim_product.write.format("delta") \
.mode("overwrite") \
.save("s3://data-warehouse/dim_product/")
dim_date.write.format("delta") \
.mode("overwrite") \
.save("s3://data-warehouse/dim_date/")
dim_customer.write.format("delta") \
.mode("overwrite") \
.save("s3://data-warehouse/dim_customer/")
Building a Snowflake Schema ETL Pipeline
# ============================================================
# SNOWFLAKE SCHEMA - Normalized Dimensions
# ============================================================
# Dim_Category (Level 1)
dim_category = raw_products.select(
F.monotonically_increasing_id().alias("category_key"),
F.col("category").alias("category_name")
).distinct()
# Dim_Subcategory (Level 2)
dim_subcategory = raw_products.select(
F.monotonically_increasing_id().alias("subcategory_key"),
F.col("subcategory").alias("subcategory_name"),
F.col("category")
).distinct() \
.join(dim_category, "category", "left") \
.select("subcategory_key", "subcategory_name", "category_key")
# Dim_Brand (Level 3)
dim_brand = raw_products.select(
F.monotonically_increasing_id().alias("brand_key"),
F.col("brand").alias("brand_name"),
F.col("subcategory")
).distinct() \
.join(dim_subcategory, "subcategory", "left") \
.select("brand_key", "brand_name", "subcategory_key")
# Dim_Product (Level 4) - now normalized
dim_product_snow = raw_products.select(
F.monotonically_increasing_id().alias("product_key"),
F.col("product_id").alias("business_product_id"),
"product_name",
"sku",
F.col("brand")
).distinct() \
.join(dim_brand, raw_products.brand == dim_brand.brand_name, "left") \
.select("product_key", "business_product_id", "product_name", "sku", "brand_key")
# Fact table references normalized keys
fact_sales_snow = raw_orders \
.join(dim_product_snow, raw_orders.product_id == dim_product_snow.business_product_id, "left") \
.select(
"date_key",
"product_key",
"customer_key",
"quantity",
"unit_price",
F.col("quantity") * F.col("unit_price").alias("amount")
)
Query Comparison: Star vs Snowflake
-- STAR SCHEMA QUERY (2 JOINs)
-- "Total sales by category and quarter for 2024"
SELECT
p.category,
d.quarter,
SUM(f.amount) as total_sales,
COUNT(DISTINCT f.customer_key) as unique_customers
FROM fact_sales f
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_date d ON f.date_key = d.date_key
WHERE d.year = 2024
GROUP BY p.category, d.quarter
ORDER BY p.category, d.quarter;
-- SNOWFLAKE SCHEMA QUERY (6 JOINs)
SELECT
cat.category_name,
q.quarter_name,
SUM(f.amount) as total_sales,
COUNT(DISTINCT f.customer_key) as unique_customers
FROM fact_sales f
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_brand b ON p.brand_key = b.brand_key
JOIN dim_subcategory sc ON b.subcategory_key = sc.subcategory_key
JOIN dim_category cat ON sc.category_key = cat.category_key
JOIN dim_date d ON f.date_key = d.date_key
JOIN dim_month m ON d.month_key = m.month_key
JOIN dim_quarter q ON m.quarter_key = q.quarter_key
WHERE d.year = 2024
GROUP BY cat.category_name, q.quarter_name
ORDER BY cat.category_name, q.quarter_name;
ℹ️
Best Practice: When using a snowflake schema, create materialized views for common query patterns. This pre-computes the JOINs and gives you star-schema-like performance with snowflake-schema storage efficiency.
Performance Benchmark
import time
# Benchmark: Star Schema vs Snowflake Schema
queries = {
"simple_query": """
-- Both schemas: Same logical query
SELECT category, SUM(amount)
FROM fact_sales JOIN dim_product ON ...
GROUP BY category
""",
"complex_query": """
-- Star: 3 JOINs | Snowflake: 8 JOINs
SELECT cat.category, r.region, d.year_quarter,
SUM(f.amount), COUNT(DISTINCT f.customer_key)
FROM fact_sales f
JOIN dim_product p ON ...
JOIN dim_date d ON ...
JOIN dim_customer c ON ...
-- Snowflake adds 5 more JOINs for normalization
GROUP BY cat.category, r.region, d.year_quarter
"""
}
# Expected results (typical):
# Simple query: Star ~1.2s, Snowflake ~1.8s (1.5x slower)
# Complex query: Star ~3.5s, Snowflake ~12.0s (3.4x slower)
⚠️
Critical Consideration: The performance gap widens as queries become more complex. For dashboards with multiple filter dimensions, snowflake schema can be 5-10x slower without materialized views.
Common Follow-Up Questions
Q1: What about a "flat" table (pre-joined) vs star schema?
A flat table (all dimensions in the fact table) is the most denormalized approach:
- Flat table: Fastest queries, highest storage, hardest to maintain
- Star schema: Good balance, easy to understand
- Snowflake schema: Normalized, more JOINs, lower storage
For most use cases, star schema provides the best balance.
Q2: How do you handle many-to-many relationships?
Use a bridge table between the fact and dimension:
-- Product can belong to multiple categories
CREATE TABLE bridge_product_category (
product_key INT,
category_key INT,
probability DECIMAL(5,4) -- Weighting factor
);
-- Query with weighted aggregation
SELECT
c.category_name,
SUM(f.amount * b.probability) as weighted_sales
FROM fact_sales f
JOIN bridge_product_category b ON f.product_key = b.product_key
JOIN dim_category c ON b.category_key = c.category_key
GROUP BY c.category_name;
Q3: How does schema choice affect data governance?
Snowflake schema is better for governance because:
- Each attribute has a single source of truth
- Access control can be applied at the table level
- Data lineage is clearer
- Changes propagate automatically
Star schema requires updating denormalized attributes in multiple places.
Q4: What about materialized views in modern warehouses?
Materialized views bridge the gap:
-- Snowflake: Create materialized view for common queries
CREATE MATERIALIZED VIEW mv_sales_by_category_quarter AS
SELECT cat.category_name, q.quarter_name, SUM(f.amount)
FROM fact_sales f
JOIN dim_product p ON ...
JOIN dim_category cat ON ...
JOIN dim_date d ON ...
JOIN dim_quarter q ON ...
GROUP BY cat.category_name, q.quarter_name;
💡
Interview Tip: Always ask clarifying questions before answering: What's the query pattern? What's the data volume? What's the latency requirement? The "right" answer depends on these factors.
Company-Specific Tips
Amazon Interview Tips
- Emphasize query performance for BI dashboards
- Discuss partitioning strategy alongside schema design
- Mention conformed dimensions across business units
- Be ready to draw the schema on a whiteboard
Netflix Interview Tips
- Focus on content recommendation use cases
- Discuss how viewing history dimensions change over time
- Mention data vault for audit requirements
- Talk about real-time vs batch schema considerations
Airbnb Interview Tips
- Discuss geographic hierarchies (country → city → neighborhood)
- Explain how search dimensions affect schema design
- Mention A/B testing fact tables
- Talk about cost optimization in cloud warehouses
ℹ️
Final Takeaway: There's no universally "better" schema. Star schema is the default choice for most analytical workloads. Use snowflake schema only when you have strong normalization requirements or extremely large dimension tables. Modern cloud warehouses make denormalization nearly free, so optimize for query simplicity and performance.