🔍 Redshift Spectrum

Master Redshift Spectrum for querying S3 data lakes directly from Redshift.

Module: AWS Data Engineering • Topic 37 of 65 • Premium Content

Spectrum Architecture

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                    REDSHIFT SPECTRUM ARCHITECTURE                             │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  REDSHIFT CLUSTER                                                    │    │
│  │  Leader Node: Query planning → Spectrum requests                   │    │
│  └─────────────────────────────┬───────────────────────────────────────┘    │
│                                │                                           │
│                                ▼                                           │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  SPECTRUM LAYER (Serverless, auto-scaling to 100 nodes)             │    │
│  │  • Columnar processing                                               │    │
│  │  • Predicate pushdown to S3                                          │    │
│  │  • Partition pruning                                                 │    │
│  │  • $5 per TB scanned                                                │    │
│  └─────────────────────────────┬───────────────────────────────────────┘    │
│                                │                                           │
│                                ▼                                           │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  S3 DATA LAKE                                                        │    │
│  │  Parquet/ORC (Recommended) | JSON/CSV (Supported)                   │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────┘

External Table Creation

-- Create external schema
CREATE EXTERNAL SCHEMA spectrum_data
FROM DATA CATALOG
DATABASE 'data_lake_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole';

-- Create partitioned external table
CREATE EXTERNAL TABLE spectrum_data.sales (
    sale_id BIGINT,
    customer_id BIGINT,
    amount DECIMAL(10,2),
    product_category VARCHAR(50)
)
PARTITIONED BY (year INT, month INT, day INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
LOCATION 's3://data-lake-processed/silver/sales/'
TABLE PROPERTIES ('parquet.compression'='SNAPPY');

-- Add partitions
ALTER TABLE spectrum_data.sales ADD PARTITION (year=2024, month=1, day=15)
LOCATION 's3://data-lake-processed/silver/sales/year=2024/month=1/day=15/';

-- Query with partition pruning
SELECT customer_id, SUM(amount) as total
FROM spectrum_data.sales
WHERE year = 2024 AND month = 1 AND day = 15
GROUP BY customer_id;

-- Join internal and external tables
SELECT c.name, SUM(s.amount) as total_sales
FROM dev.customers c
JOIN spectrum_data.sales s ON c.id = s.customer_id
WHERE s.year = 2024
GROUP BY c.name;

Interview Q&A

Q1: How much does Spectrum cost?

Answer: $5 per TB scanned. Minimize cost by using columnar formats (Parquet), partitioning, and predicate pushdown.

Q2: Can Spectrum join internal Redshift tables with external tables?

Answer: Yes. Spectrum federates queries between Redshift-internal tables and S3-external tables seamlessly.

Q3: What file formats does Spectrum support?

Answer: Parquet, ORC (best for columnar), Avro, JSON, CSV, Sequence, and Text.

Summary

Architecture: Redshift → Spectrum Layer → S3 Data Lake
Cost: $5 per TB scanned, optimize with partitioning
Formats: Parquet/ORC recommended for columnar access
Federation: Query internal Redshift + external S3 tables
Performance: Partition pruning, predicate pushdown, auto-scaling

Redshift Spectrum Deep Dive