Data Warehouse: Synapse Dedicated Pools & CTAS
Enterprise data warehousing with Synapse dedicated pools, CTAS patterns, and star schema design
Data Warehouse Architecture
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β SYNAPSE DEDICATED POOL ARCHITECTURE β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β SYNAPSE WORKSPACE β β
β β β β
β β DEDICATED SQL POOL (DW1000c) β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β β β β
β β β CONTROL NODE COMPUTE NODES (6) β β β
β β β βββββββββββββββ βββββββββββββββββββββββββββββββ β β β
β β β β Query β β Node 1 β Node 2 β ... β Node 6β β β β
β β β β Processing ββββ>β β β β β β β β
β β β β & Merging β β 2TB β 2TB β β 2TB β β β β
β β β βββββββββββββββ βββββββββββββββββββββββββββββββ β β β
β β β β β β
β β β DISTRIBUTIONS (per compute node): β β β
β β β βββββββββββββββββββββββββββββββββββββββββββββββ β β β
β β β β Dist 1 β Dist 2 β Dist 3 β ... β Dist 60 β β β β
β β β βββββββββββββββββββββββββββββββββββββββββββββββ β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β β
β β STAR SCHEMA DESIGN: β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β dim_dates β β β
β β β βββββββββββββββββββββββββββββββββββββββββββββββ β β β
β β β β date_key β full_date β year β month β day β β β β
β β β βββββββββββββββββββββββββββββββββββββββββββββββ β β β
β β β β β β β β
β β β β β β β β
β β β ββββββββΌββββββββ ββββββββΌββββββββ β β β
β β β β dim_customersβ β dim_products β β β β
β β β β β β β β β β
β β β β customer_key β β product_key β β β β
β β β β name β β name β β β β
β β β β segment β β category β β β β
β β β ββββββββ¬ββββββββ ββββββββ¬ββββββββ β β β
β β β β β β β β
β β β ββββββββββ¬ββββββββββββ β β β
β β β β β β β
β β β ββββββββββΌβββββββββ β β β
β β β β fact_sales β β β β
β β β β β β β β
β β β β sale_key β β β β
β β β β date_key (FK) β β β β
β β β β customer_key(FK)β β β β
β β β β product_key(FK) β β β β
β β β β quantity β β β β
β β β β total_amount β β β β
β β β βββββββββββββββββββ β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
CTAS Pattern Examples
-- Create fact table with CTAS
CREATE TABLE [dbo].[FactSales]
WITH
(
DISTRIBUTION = HASH(date_key),
CLUSTERED COLUMNSTORE INDEX,
PARTITION = (date_key RANGE RIGHT FOR VALUES
(20240101, 20240201, 20240301, 20240401,
20240501, 20240601, 20240701, 20240801,
20240901, 20241001, 20241101, 20241201))
)
AS
SELECT
s.sale_id,
d.date_key,
c.customer_key,
p.product_key,
s.quantity,
s.unit_price,
s.quantity * s.unit_price AS total_amount,
s.region,
GETDATE() AS load_date
FROM [staging].[Sales] s
INNER JOIN [dbo].[DimDates] d
ON s.sale_date = d.full_date
INNER JOIN [dbo].[DimCustomers] c
ON s.customer_id = c.customer_id
INNER JOIN [dbo].[DimProducts] p
ON s.product_id = p.product_id;
-- Create dimension table with replicated distribution
CREATE TABLE [dbo].[DimCustomers]
WITH
(
DISTRIBUTION = REPLICATE,
CLUSTERED INDEX (customer_key)
)
AS
SELECT
customer_id AS customer_key,
customer_name,
email,
segment,
region,
GETDATE() AS load_date
FROM [staging].[Customers];
-- Create aggregate table
CREATE TABLE [dbo].[AggDailySalesByRegion]
WITH
(
DISTRIBUTION = HASH(region),
CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT
date_key,
region,
SUM(total_amount) AS daily_revenue,
COUNT(*) AS transaction_count,
COUNT(DISTINCT customer_key) AS unique_customers
FROM [dbo].[FactSales]
GROUP BY date_key, region;
Distribution Strategy Guide
| Table Type | Recommended Distribution | Reason |
|---|---|---|
| Large fact tables (>1GB) | HASH (join column) | Even distribution, efficient joins |
| Small dimension tables (<2GB) | REPLICATE | Avoid shuffling for joins |
| Staging tables | ROUND ROBIN | Fast load, no skew |
| Aggregate tables | HASH (group by column) | Even distribution |
βΉοΈ
Pro Tip: Use DBCC PDW_SHOWSPACEUSED('table_name') to check distribution skew. If any distribution is significantly larger than others, consider changing the distribution key.
Interview Questions
Q1: How do you choose between Hash and Round Robin distribution? A: Hash for tables that are frequently joined (even distribution on join key). Round Robin for staging tables or when no clear join key exists (fastest load, even distribution).
Q2: What is the benefit of Clustered Columnstore Indexes in Synapse? A: CCI provides optimal compression (up to 10x), columnar storage for analytics, and batch-mode execution. It's the default and recommended index type for fact tables.
Q3: How do you handle slowly changing dimensions in Synapse? A: Use MERGE statement for SCD Type 1 (overwrite). For SCD Type 2, create history tables with effective dates and status flags. Use Synapse Serverless for staging changes before merging.