πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

Topic: SQL Data Modeling for FAANG Interviews

SQL AdvancedData Modeling⭐ Premium

Advertisement

πŸ—οΈ SQL Data Modeling

Amazon & Microsoft Interview Deep Dive

🏒 Amazon🏒 Microsoft⚑ Difficulty: Hard⏱️ 50 min

πŸ“‹ Interview Question

β„ΉοΈπŸ”΄ Amazon/Microsoft Interview Question

"Design a data warehouse schema for an e-commerce platform. The system needs to track: products, customers, orders, order items, and inventory. Consider: 1) Star vs Snowflake schema trade-offs, 2) Slowly Changing Dimensions (SCD), 3) Fact table granularity, 4) Partitioning strategy for 1B+ rows."

Companies: Amazon, Microsoft | Difficulty: Hard | Time: 50 minutes

πŸ“Š Part 1: Star Schema Design

β„ΉοΈπŸ” Star Schema

Star schema is the foundation of dimensional modeling:

  • Fact table: Contains measurements/metrics (center)
  • Dimension tables: Contain descriptive attributes (points of star)
  • Benefits: Simple queries, fewer JOINs, optimal for OLAP

Dimension Tables

-- Dimension: Products
CREATE TABLE dim_product (
    product_key SERIAL PRIMARY KEY,  -- Surrogate key
    product_id VARCHAR(50),          -- Natural/Business key
    product_name VARCHAR(200),
    category VARCHAR(100),
    subcategory VARCHAR(100),
    brand VARCHAR(100),
    unit_cost DECIMAL(12, 2),
    unit_price DECIMAL(12, 2),
    supplier_id VARCHAR(50),
    supplier_name VARCHAR(200),
    is_active BOOLEAN DEFAULT true,
    effective_date TIMESTAMP,
    expiration_date TIMESTAMP,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

-- Dimension: Customers
CREATE TABLE dim_customer (
    customer_key SERIAL PRIMARY KEY,
    customer_id VARCHAR(50),
    customer_name VARCHAR(200),
    email VARCHAR(200),
    phone VARCHAR(50),
    address_line1 VARCHAR(200),
    address_line2 VARCHAR(200),
    city VARCHAR(100),
    state VARCHAR(50),
    country VARCHAR(100),
    postal_code VARCHAR(20),
    segment VARCHAR(50),  -- Premium, Regular, New
    loyalty_tier VARCHAR(20),  -- Gold, Silver, Bronze
    registration_date DATE,
    is_active BOOLEAN DEFAULT true,
    effective_date TIMESTAMP,
    expiration_date TIMESTAMP
);

-- Dimension: Time (Date dimension)
CREATE TABLE dim_date (
    date_key INT PRIMARY KEY,  -- YYYYMMDD format
    full_date DATE UNIQUE,
    year INT,
    quarter INT,
    month INT,
    month_name VARCHAR(20),
    week INT,
    day_of_month INT,
    day_of_week INT,
    day_name VARCHAR(20),
    is_weekend BOOLEAN,
    is_holiday BOOLEAN,
    holiday_name VARCHAR(100),
    fiscal_year INT,
    fiscal_quarter INT
);

-- Dimension: Stores/Channels
CREATE TABLE dim_store (
    store_key SERIAL PRIMARY KEY,
    store_id VARCHAR(50),
    store_name VARCHAR(200),
    store_type VARCHAR(50),  -- Online, Retail, Marketplace
    region VARCHAR(100),
    country VARCHAR(100),
    city VARCHAR(100),
    manager VARCHAR(200),
    square_footage INT,
    is_active BOOLEAN DEFAULT true,
    effective_date TIMESTAMP,
    expiration_date TIMESTAMP
);

-- Dimension: Promotions
CREATE TABLE dim_promotion (
    promotion_key SERIAL PRIMARY KEY,
    promotion_id VARCHAR(50),
    promotion_name VARCHAR(200),
    promotion_type VARCHAR(50),
    discount_percent DECIMAL(5, 2),
    start_date DATE,
    end_date DATE,
    is_active BOOLEAN DEFAULT true
);

Fact Table

-- Fact Table: Orders (Transaction grain)
CREATE TABLE fact_orders (
    order_key BIGSERIAL PRIMARY KEY,
    order_id VARCHAR(50),           -- Natural key
    product_key INT REFERENCES dim_product(product_key),
    customer_key INT REFERENCES dim_customer(customer_key),
    date_key INT REFERENCES dim_date(date_key),
    store_key INT REFERENCES dim_store(store_key),
    promotion_key INT REFERENCES dim_promotion(promotion_key),
    -- Measures
    quantity INT,
    unit_price DECIMAL(12, 2),
    unit_cost DECIMAL(12, 2),
    discount_amount DECIMAL(12, 2),
    tax_amount DECIMAL(12, 2),
    shipping_amount DECIMAL(12, 2),
    total_amount DECIMAL(12, 2),    -- quantity * unit_price - discount + tax + shipping
    profit DECIMAL(12, 2),          -- total_amount - (quantity * unit_cost)
    -- Status
    order_status VARCHAR(20),       -- Pending, Shipped, Delivered, Cancelled
    payment_method VARCHAR(30),
    -- Timestamps
    order_timestamp TIMESTAMP,
    ship_date TIMESTAMP,
    delivery_date TIMESTAMP,
    -- Metadata
    etl_batch_id INT,
    etl_load_timestamp TIMESTAMP DEFAULT NOW()
) PARTITION BY RANGE (date_key);

-- Create partitions by year
CREATE TABLE fact_orders_2023 PARTITION OF fact_orders
FOR VALUES FROM (20230101) TO (20240101);

CREATE TABLE fact_orders_2024 PARTITION OF fact_orders
FOR VALUES FROM (20240101) TO (20250101);

❄️ Part 2: Snowflake Schema

β„ΉοΈπŸ” Snowflake Schema

Snowflake schema normalizes dimension tables into sub-dimensions:

  • Benefits: Less data redundancy, smaller dimension tables
  • Drawbacks: More JOINs, more complex queries
  • Use when: Dimension tables are very large with many attributes
-- Snowflake: Normalize Product dimension
CREATE TABLE dim_category (
    category_key SERIAL PRIMARY KEY,
    category_name VARCHAR(100),
    category_description TEXT
);

CREATE TABLE dim_subcategory (
    subcategory_key SERIAL PRIMARY KEY,
    category_key INT REFERENCES dim_category(category_key),
    subcategory_name VARCHAR(100)
);

CREATE TABLE dim_brand (
    brand_key SERIAL PRIMARY KEY,
    brand_name VARCHAR(100),
    brand_description TEXT
);

CREATE TABLE dim_supplier (
    supplier_key SERIAL PRIMARY KEY,
    supplier_name VARCHAR(200),
    contact_name VARCHAR(200),
    phone VARCHAR(50),
    email VARCHAR(200)
);

-- Modified Product dimension (snowflaked)
CREATE TABLE dim_product_snowflake (
    product_key SERIAL PRIMARY KEY,
    product_id VARCHAR(50),
    product_name VARCHAR(200),
    subcategory_key INT REFERENCES dim_subcategory(subcategory_key),
    brand_key INT REFERENCES dim_brand(brand_key),
    supplier_key INT REFERENCES dim_supplier(supplier_key),
    unit_cost DECIMAL(12, 2),
    unit_price DECIMAL(12, 2),
    effective_date TIMESTAMP,
    expiration_date TIMESTAMP
);

-- Query comparison
-- Star schema: 1 JOIN for product info
SELECT p.product_name, p.category, f.total_amount
FROM fact_orders f
JOIN dim_product p ON f.product_key = p.product_key;

-- Snowflake schema: 3 JOINs for same info
SELECT
    p.product_name,
    sc.subcategory_name,
    c.category_name,
    f.total_amount
FROM fact_orders f
JOIN dim_product_snowflake p ON f.product_key = p.product_key
JOIN dim_subcategory sc ON p.subcategory_key = sc.subcategory_key
JOIN dim_category c ON sc.category_key = c.category_key;

🏦 Part 3: Data Vault Modeling

β„ΉοΈπŸ” Data Vault

Data Vault is designed for auditability and flexibility:

  • Hubs: Business keys (entities)
  • Links: Relationships between hubs
  • Satellites: Descriptive attributes and history
  • Benefits: Full historical tracking, parallel loading, audit trail
-- Data Vault: Hubs (Business Keys)
CREATE TABLE hub_customer (
    hub_customer_hk CHAR(32) PRIMARY KEY,  -- Hash key
    customer_id VARCHAR(50),
    load_date TIMESTAMP,
    record_source VARCHAR(100)
);

CREATE TABLE hub_product (
    hub_product_hk CHAR(32) PRIMARY KEY,
    product_id VARCHAR(50),
    load_date TIMESTAMP,
    record_source VARCHAR(100)
);

CREATE TABLE hub_order (
    hub_order_hk CHAR(32) PRIMARY KEY,
    order_id VARCHAR(50),
    load_date TIMESTAMP,
    record_source VARCHAR(100)
);

-- Data Vault: Links (Relationships)
CREATE TABLE link_order_customer (
    link_order_customer_hk CHAR(32) PRIMARY KEY,
    hub_order_hk CHAR(32) REFERENCES hub_order(hub_order_hk),
    hub_customer_hk CHAR(32) REFERENCES hub_customer(hub_customer_hk),
    load_date TIMESTAMP,
    record_source VARCHAR(100)
);

CREATE TABLE link_order_product (
    link_order_product_hk CHAR(32) PRIMARY KEY,
    hub_order_hk CHAR(32) REFERENCES hub_order(hub_order_hk),
    hub_product_hk CHAR(32) REFERENCES hub_product(hub_product_hk),
    load_date TIMESTAMP,
    record_source VARCHAR(100)
);

-- Data Vault: Satellites (Attributes & History)
CREATE TABLE sat_customer (
    sat_customer_hk CHAR(32) PRIMARY KEY,
    hub_customer_hk CHAR(32) REFERENCES hub_customer(hub_customer_hk),
    load_date TIMESTAMP,
    load_end_date TIMESTAMP,
    hash_diff CHAR(32),  -- For detecting changes
    customer_name VARCHAR(200),
    email VARCHAR(200),
    phone VARCHAR(50),
    address VARCHAR(500),
    segment VARCHAR(50),
    record_source VARCHAR(100)
);

CREATE TABLE sat_product (
    sat_product_hk CHAR(32) PRIMARY KEY,
    hub_product_hk CHAR(32) REFERENCES hub_product(hub_product_hk),
    load_date TIMESTAMP,
    load_end_date TIMESTAMP,
    hash_diff CHAR(32),
    product_name VARCHAR(200),
    category VARCHAR(100),
    brand VARCHAR(100),
    unit_cost DECIMAL(12, 2),
    unit_price DECIMAL(12, 2),
    record_source VARCHAR(100)
);

CREATE TABLE sat_order (
    sat_order_hk CHAR(32) PRIMARY KEY,
    hub_order_hk CHAR(32) REFERENCES hub_order(hub_order_hk),
    load_date TIMESTAMP,
    load_end_date TIMESTAMP,
    hash_diff CHAR(32),
    order_date TIMESTAMP,
    status VARCHAR(20),
    total_amount DECIMAL(12, 2),
    record_source VARCHAR(100)
);

Hash Function for Data Vault

-- Generate hash keys using MD5
CREATE OR REPLACE FUNCTION generate_hash_key(VARIADIC args TEXT[])
RETURNS CHAR(32) AS $$
BEGIN
    RETURN MD5(ARRAY_TO_STRING(args, '||'));
END;
$$ LANGUAGE plpgsql;

-- Example usage
INSERT INTO hub_customer (hub_customer_hk, customer_id, load_date, record_source)
VALUES (
    generate_hash_key('CUST-001'),
    'CUST-001',
    NOW(),
    'source_system_a'
);

πŸ”„ Part 4: Slowly Changing Dimensions (SCD)

SCD Type 1: Overwrite

-- Type 1: No history, just update current value
CREATE TABLE dim_customer_scd1 (
    customer_key SERIAL PRIMARY KEY,
    customer_id VARCHAR(50),
    customer_name VARCHAR(200),
    email VARCHAR(200),
    city VARCHAR(100),
    segment VARCHAR(50),
    last_updated TIMESTAMP DEFAULT NOW()
);

-- Update procedure
CREATE OR REPLACE PROCEDURE update_customer_scd1(
    p_customer_id VARCHAR,
    p_name VARCHAR,
    p_email VARCHAR,
    p_city VARCHAR,
    p_segment VARCHAR
)
LANGUAGE plpgsql
AS $$
BEGIN
    UPDATE dim_customer_scd1
    SET customer_name = p_name,
        email = p_email,
        city = p_city,
        segment = p_segment,
        last_updated = NOW()
    WHERE customer_id = p_customer_id;

    IF NOT FOUND THEN
        INSERT INTO dim_customer_scd1 (customer_id, customer_name, email, city, segment)
        VALUES (p_customer_id, p_name, p_email, p_city, p_segment);
    END IF;
END;
$$;

SCD Type 2: Full History

-- Type 2: Keep full history with effective dates
CREATE TABLE dim_customer_scd2 (
    customer_key SERIAL PRIMARY KEY,
    customer_id VARCHAR(50),
    customer_name VARCHAR(200),
    email VARCHAR(200),
    city VARCHAR(100),
    segment VARCHAR(50),
    effective_date TIMESTAMP,
    expiration_date TIMESTAMP,
    is_current BOOLEAN DEFAULT true,
    version INT DEFAULT 1
);

-- Insert/Update procedure for SCD Type 2
CREATE OR REPLACE PROCEDURE update_customer_scd2(
    p_customer_id VARCHAR,
    p_name VARCHAR,
    p_email VARCHAR,
    p_city VARCHAR,
    p_segment VARCHAR
)
LANGUAGE plpgsql
AS $$
DECLARE
    v_current_key INT;
    v_hash_diff CHAR(32);
BEGIN
    -- Calculate hash of new values
    v_hash_diff := MD5(p_name || p_email || p_city || p_segment);

    -- Check if record exists and has changed
    SELECT customer_key INTO v_current_key
    FROM dim_customer_scd2
    WHERE customer_id = p_customer_id
    AND is_current = true;

    IF v_current_key IS NOT NULL THEN
        -- Check if values changed
        IF (SELECT hash_diff FROM dim_customer_scd2 WHERE customer_key = v_current_key) != v_hash_diff THEN
            -- Expire current record
            UPDATE dim_customer_scd2
            SET is_current = false,
                expiration_date = NOW()
            WHERE customer_key = v_current_key;

            -- Insert new version
            INSERT INTO dim_customer_scd2 (
                customer_id, customer_name, email, city, segment,
                effective_date, expiration_date, is_current, version
            )
            SELECT
                customer_id, p_name, p_email, p_city, p_segment,
                NOW(), NULL, true, version + 1
            FROM dim_customer_scd2
            WHERE customer_key = v_current_key;
        END IF;
    ELSE
        -- Insert new record
        INSERT INTO dim_customer_scd2 (
            customer_id, customer_name, email, city, segment,
            effective_date, expiration_date, is_current, version
        )
        VALUES (
            p_customer_id, p_name, p_email, p_city, p_segment,
            NOW(), NULL, true, 1
        );
    END IF;
END;
$$;

SCD Type 3: Limited History

-- Type 3: Keep previous and current values
CREATE TABLE dim_customer_scd3 (
    customer_key SERIAL PRIMARY KEY,
    customer_id VARCHAR(50),
    customer_name VARCHAR(200),
    email VARCHAR(200),
    city_current VARCHAR(100),
    city_previous VARCHAR(100),
    segment_current VARCHAR(50),
    segment_previous VARCHAR(50),
    effective_date TIMESTAMP,
    previous_effective_date TIMESTAMP
);

πŸ“Š Part 5: Aggregate Fact Tables

-- Daily aggregate fact table
CREATE TABLE fact_orders_daily (
    date_key INT REFERENCES dim_date(date_key),
    product_key INT REFERENCES dim_product(product_key),
    store_key INT REFERENCES dim_store(store_key),
    -- Aggregated measures
    order_count INT,
    quantity_sum INT,
    total_amount_sum DECIMAL(15, 2),
    profit_sum DECIMAL(15, 2),
    avg_order_value DECIMAL(12, 2),
    unique_customers INT,
    PRIMARY KEY (date_key, product_key, store_key)
);

-- Populate aggregate from transaction fact
INSERT INTO fact_orders_daily (
    date_key, product_key, store_key,
    order_count, quantity_sum, total_amount_sum,
    profit_sum, avg_order_value, unique_customers
)
SELECT
    date_key,
    product_key,
    store_key,
    COUNT(DISTINCT order_id) AS order_count,
    SUM(quantity) AS quantity_sum,
    SUM(total_amount) AS total_amount_sum,
    SUM(profit) AS profit_sum,
    AVG(total_amount) AS avg_order_value,
    COUNT(DISTINCT customer_key) AS unique_customers
FROM fact_orders
GROUP BY date_key, product_key, store_key;

πŸ“ˆ Part 6: Comparison Table

πŸ’‘βœ… Schema Selection Guide

FeatureStar SchemaSnowflakeData Vault
SimplicityHighMediumLow
Query PerformanceFastestSlower (more JOINs)Slowest (many JOINs)
Storage EfficiencyLowerHigherHighest
Historical TrackingLimitedLimitedFull
Audit TrailNoNoYes
Load PerformanceGoodGoodBest (parallel)
FlexibilityMediumMediumHigh
Best ForOLAP/BINormalized dimensionsEnterprise DW

🎯 Quiz Section

πŸ† Best Practices for Interviews

πŸ’‘βœ… Data Modeling Best Practices

1. Define Grain First: Always clarify what each fact table row represents before designing the schema.

2. Use Surrogate Keys:

-- BAD: Using natural keys in fact table
order_id VARCHAR(50)  -- May change, not unique across sources

-- GOOD: Surrogate key from dimension
product_key SERIAL  -- Always unique, never changes

3. Conformed Dimensions: Ensure dimensions are consistent across fact tables so you can JOIN different facts.

4. Handle NULLs Appropriately:

-- Create Unknown member for dimensions
INSERT INTO dim_product (product_key, product_name, ...)
VALUES (-1, 'Unknown', ...);

-- Use -1 as foreign key for NULL dimension references

5. Consider Partitioning Early: For large fact tables, design partitioning strategy from the start (by date, region, etc.).

πŸ”— Common Follow-Up Questions

  1. "How do you handle late-arriving facts?" β€” Buffer period, ETL procedures, SCD handling
  2. "When would you use a factless fact table?" β€” Coverage tables, event tracking without measures
  3. "How do you handle many-to-many in dimensional modeling?" β€” Bridge tables, junk dimensions
  4. "What is a junk dimension?" β€” Combines low-cardinality flags/statuses into one dimension

⚠️⚠️ Interview Tip

When discussing data modeling, always mention:

  1. Business requirements driving the design
  2. Query patterns you're optimizing for
  3. Data volume and growth expectations
  4. ETL/ELT complexity considerations

Advertisement