π Redshift Spectrum
Master Redshift Spectrum for querying S3 data lakes directly from Redshift.
Module: AWS Data Engineering β’ Topic 37 of 65 β’ Premium Content
Spectrum Architecture
Architecture Diagram
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β REDSHIFT SPECTRUM ARCHITECTURE β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β REDSHIFT CLUSTER β β
β β Leader Node: Query planning β Spectrum requests β β
β βββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β SPECTRUM LAYER (Serverless, auto-scaling to 100 nodes) β β
β β β’ Columnar processing β β
β β β’ Predicate pushdown to S3 β β
β β β’ Partition pruning β β
β β β’ $5 per TB scanned β β
β βββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β S3 DATA LAKE β β
β β Parquet/ORC (Recommended) | JSON/CSV (Supported) β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
External Table Creation
-- Create external schema
CREATE EXTERNAL SCHEMA spectrum_data
FROM DATA CATALOG
DATABASE 'data_lake_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole';
-- Create partitioned external table
CREATE EXTERNAL TABLE spectrum_data.sales (
sale_id BIGINT,
customer_id BIGINT,
amount DECIMAL(10,2),
product_category VARCHAR(50)
)
PARTITIONED BY (year INT, month INT, day INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
LOCATION 's3://data-lake-processed/silver/sales/'
TABLE PROPERTIES ('parquet.compression'='SNAPPY');
-- Add partitions
ALTER TABLE spectrum_data.sales ADD PARTITION (year=2024, month=1, day=15)
LOCATION 's3://data-lake-processed/silver/sales/year=2024/month=1/day=15/';
-- Query with partition pruning
SELECT customer_id, SUM(amount) as total
FROM spectrum_data.sales
WHERE year = 2024 AND month = 1 AND day = 15
GROUP BY customer_id;
-- Join internal and external tables
SELECT c.name, SUM(s.amount) as total_sales
FROM dev.customers c
JOIN spectrum_data.sales s ON c.id = s.customer_id
WHERE s.year = 2024
GROUP BY c.name;
Interview Q&A
Q1: How much does Spectrum cost?
Answer: $5 per TB scanned. Minimize cost by using columnar formats (Parquet), partitioning, and predicate pushdown.
Q2: Can Spectrum join internal Redshift tables with external tables?
Answer: Yes. Spectrum federates queries between Redshift-internal tables and S3-external tables seamlessly.
Q3: What file formats does Spectrum support?
Answer: Parquet, ORC (best for columnar), Avro, JSON, CSV, Sequence, and Text.
Summary
- Architecture: Redshift β Spectrum Layer β S3 Data Lake
- Cost: $5 per TB scanned, optimize with partitioning
- Formats: Parquet/ORC recommended for columnar access
- Federation: Query internal Redshift + external S3 tables
- Performance: Partition pruning, predicate pushdown, auto-scaling