🏗️ AWS Data Warehouse Interview

Comprehensive interview preparation for data warehouse architecture, Redshift optimization, and analytics on AWS.

Module: AWS Data Engineering • Topic 54 of 65 • Premium Content

Data Warehouse Architecture Overview

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────┐
│                    AWS Data Warehouse Architecture                  │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  Data Sources          ETL/ELT             Storage                 │
│  ┌──────────┐         ┌──────────┐       ┌──────────┐             │
│  │ Databases│────────▶│   Glue   │──────▶│ Redshift │             │
│  │ Files    │         │   EMR    │       │ Cluster  │             │
│  │ Streams  │         │ Lambda   │       │          │             │
│  └──────────┘         └──────────┘       └────┬─────┘             │
│                                                │                   │
│                     Query Layer                │                   │
│                  ┌─────────────┐               │                   │
│                  │   Athena    │◀──────────────┘                   │
│                  │   Spectrum  │                                   │
│                  └──────┬──────┘                                   │
│                         │                                          │
│              ┌──────────┼──────────┐                              │
│              ▼          ▼          ▼                              │
│        ┌─────────┐ ┌─────────┐ ┌─────────┐                      │
│        │ BI/SQL  │ │Python  │ │  API    │                        │
│        │ Tools   │ │  Apps   │ │ Clients │                        │
│        └─────────┘ └─────────┘ └─────────┘                      │
│                                                                   │
└─────────────────────────────────────────────────────────────────────┘

Q1: Compare Redshift Provisioned vs Serverless. When would you choose each?

Answer:

Feature	Provisioned	Serverless
Cost Model	Instance-hours	RPU-hours
Scaling	Manual/Auto	Automatic
Min Cost	~ $0.25/hr \| ~$ 0.375/RPU-hr
Use Case	Predictable workloads	Variable workloads

Choose Provisioned when:

Predictable, steady-state workloads
Need specific instance types (RA3, DC2)
Reserved capacity for cost savings
Large, consistent data volumes

Choose Serverless when:

Variable/unpredictable workloads
Dev/test environments
Ad-hoc analytics
Quick setup without cluster management

Architecture Diagram

Decision Tree:
┌─────────────────────────────────────────┐
│ Is workload predictable (>80% utilized)?│
└────────────────┬────────────────────────┘
                 │
        ┌────────┴────────┐
        ▼                 ▼
       Yes                No
        │                  │
        ▼                  ▼
  ┌───────────┐    ┌───────────────────┐
  │Provisioned│    │ Need quick setup? │
  └───────────┘    └─────────┬─────────┘
                             │
                  ┌──────────┴──────────┐
                  ▼                     ▼
                 Yes                    No
                  │                     │
                  ▼                     ▼
           ┌───────────┐        ┌───────────┐
           │ Serverless│        │Provisioned│
           │ (Auto)    │        │ (Flexible)│
           └───────────┘        └───────────┘

Q2: How do you design a star schema for data warehousing on Redshift?

Answer:

Star Schema Design:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│                    Star Schema Design                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│                      ┌─────────────────┐                       │
│                      │ dim_date        │                       │
│                      │─────────────────│                       │
│                      │ date_key (PK)   │                       │
│                      │ full_date       │                       │
│                      │ year            │                       │
│                      │ quarter         │                       │
│                      │ month           │                       │
│                      │ day_of_week     │                       │
│                      └────────┬────────┘                       │
│                               │                                 │
│  ┌─────────────────┐         │         ┌─────────────────┐    │
│  │ dim_customer    │         │         │ dim_product     │    │
│  │─────────────────│         │         │─────────────────│    │
│  │ customer_key(PK)│         │         │ product_key (PK)│    │
│  │ customer_id     │         │         │ product_id      │    │
│  │ name            │         │         │ product_name    │    │
│  │ segment         │         │         │ category        │    │
│  │ region          │         │         │ brand           │    │
│  └────────┬────────┘         │         └────────┬────────┘    │
│           │                  │                   │             │
│           └──────────────────┼───────────────────┘             │
│                              │                                 │
│                    ┌─────────┴─────────┐                       │
│                    │   fact_sales      │                       │
│                    │───────────────────│                       │
│                    │ sale_key (PK)     │                       │
│                    │ date_key (FK)     │                       │
│                    │ customer_key (FK) │                       │
│                    │ product_key (FK)  │                       │
│                    │ quantity          │                       │
│                    │ amount            │                       │
│                    │ discount          │                       │
│                    └───────────────────┘                       │
└─────────────────────────────────────────────────────────────────┘

Redshift Table Creation:

-- Dimension tables
CREATE TABLE dim_date (
    date_key INT PRIMARY KEY,
    full_date DATE NOT NULL,
    year INT NOT NULL,
    quarter INT NOT NULL,
    month INT NOT NULL,
    month_name VARCHAR(10),
    day INT NOT NULL,
    day_of_week VARCHAR(10),
    is_weekend BOOLEAN
)
DISTSTYLE ALL
SORTKEY(full_date);

CREATE TABLE dim_customer (
    customer_key INT PRIMARY KEY IDENTITY(1,1),
    customer_id VARCHAR(50) NOT NULL,
    name VARCHAR(100),
    email VARCHAR(200),
    segment VARCHAR(50),
    region VARCHAR(50)
)
DISTSTYLE KEY
DISTKEY(customer_key)
SORTKEY(customer_id);

-- Fact table
CREATE TABLE fact_sales (
    sale_key BIGINT PRIMARY KEY IDENTITY(1,1),
    date_key INT NOT NULL REFERENCES dim_date(date_key),
    customer_key INT NOT NULL REFERENCES dim_customer(customer_key),
    product_key INT NOT NULL REFERENCES dim_product(product_key),
    quantity INT,
    amount DECIMAL(10,2),
    discount DECIMAL(10,2),
    profit DECIMAL(10,2)
)
DISTSTYLE KEY
DISTKEY(customer_key)
COMPOUND SORTKEY(date_key, customer_key);

ℹ️

Interview Tip: Always discuss distribution keys (DISTKEY) and sort keys (SORTKEY) when discussing Redshift schema design. These are critical for performance.

Q3: How do you optimize Redshift query performance?

Answer:

Performance Optimization Techniques:

1. Distribution Styles:

-- KEY distribution for large fact tables
CREATE TABLE fact_orders (
    order_id BIGINT,
    customer_id BIGINT,
    amount DECIMAL(10,2)
)
DISTSTYLE KEY
DISTKEY(customer_id);

-- ALL distribution for small dimension tables
CREATE TABLE dim_region (
    region_id INT,
    region_name VARCHAR(50)
)
DISTSTYLE ALL;

-- EVEN distribution for even spread
CREATE TABLE staging_data (
    id BIGINT,
    data VARCHAR(1000)
)
DISTSTYLE EVEN;

2. Sort Keys:

-- Compound sort key for multi-column filters
CREATE TABLE events (
    event_id BIGINT,
    event_date TIMESTAMP,
    user_id BIGINT,
    event_type VARCHAR(50)
)
COMPOUND SORTKEY(event_date, user_id);

-- Interleaved sort key for independent column filters
CREATE TABLE logs (
    log_id BIGINT,
    timestamp TIMESTAMP,
    user_id BIGINT,
    action VARCHAR(100)
)
INTERLEVED SORTKEY(timestamp, user_id, action);

3. Materialized Views:

-- Create materialized view
CREATE MATERIALIZED VIEW mv_daily_sales AS
SELECT 
    date_trunc('day', sale_date) as sale_date,
    region,
    product_category,
    SUM(amount) as total_sales,
    COUNT(*) as transaction_count
FROM fact_sales
JOIN dim_date ON fact_sales.date_key = dim_date.date_key
JOIN dim_product ON fact_sales.product_key = dim_product.product_key
GROUP BY 1, 2, 3;

-- Refresh
REFRESH MATERIALIZED VIEW mv_daily_sales;

Performance Comparison:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              Query Performance Optimization Results             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Before Optimization:                                          │
│  • Query time: 45 seconds                                      │
│  • Data scanned: 100GB                                         │
│  • Cost: $0.50 per query                                       │
│                                                                 │
│  After Optimization:                                           │
│  • Query time: 2.5 seconds (94% faster)                       │
│  • Data scanned: 5GB (95% reduction)                          │
│  • Cost: $0.025 per query (95% cheaper)                       │
│                                                                 │
│  Techniques Applied:                                           │
│  ✓ Optimized DISTKEY on join columns                          │
│  ✓ COMPOUND SORTKEY aligned with WHERE clause                 │
│  ✓ Materialized view for aggregation                          │
│  ✓ Columnar encoding (AZ64, LZO)                             │
└─────────────────────────────────────────────────────────────────┘

Q4: How do you implement ETL/ELT pipelines with Redshift?

Answer:

ETL Pattern (Transform before Load):

# Glue ETL to Redshift
import boto3
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read from source
source = glueContext.create_dynamic_frame.from_catalog(
    database="source_db",
    table_name="raw_sales"
)

# Transform
transformed = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("transaction_id", "string", "sale_id", "bigint"),
        ("amount", "double", "sale_amount", "decimal(10,2)"),
        ("date", "string", "sale_date", "date")
    ]
)

# Write to Redshift
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=transformed,
    catalog_connection="redshift-connection",
    connection_options={
        "dbtable": "fact_sales",
        "database": "analytics"
    }
)

ELT Pattern (Load then Transform):

-- Load raw data to staging
COPY staging_sales
FROM 's3://data-lake/raw/sales/'
IAM_ROLE 'arn:aws:iam::role/RedshiftRole'
FORMAT AS PARQUET;

-- Transform in Redshift
INSERT INTO fact_sales (date_key, customer_key, product_key, amount)
SELECT 
    d.date_key,
    c.customer_key,
    p.product_key,
    s.amount
FROM staging_sales s
JOIN dim_date d ON s.sale_date = d.full_date
JOIN dim_customer c ON s.customer_id = c.customer_id
JOIN dim_product p ON s.product_id = p.product_id;

-- Vacuum and analyze
VACUUM fact_sales;
ANALYZE fact_sales;

ELT Architecture:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              ELT Pattern with Redshift                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Sources            Load (ELT-L)       Transform (ELT-T)       │
│  ┌─────────────┐   ┌─────────────┐    ┌─────────────┐         │
│  │ S3          │──▶│ COPY        │───▶│ SQL/Merge   │         │
│  │ (Raw)       │   │ (Staging)   │   │ (Transform) │         │
│  └─────────────┘   └─────────────┘   └──────┬──────┘         │
│                                              │                  │
│  ┌─────────────┐   ┌─────────────┐          │                  │
│  │ Glue        │──▶│ UNLOAD      │──────────┤                  │
│  │ (Process)   │   │ (To S3)     │          │                  │
│  └─────────────┘   └─────────────┘          │                  │
│                                              ▼                  │
│                                       ┌─────────────┐          │
│                                       │ Final Tables│          │
│                                       │ (Optimized) │          │
│                                       └─────────────┘          │
└─────────────────────────────────────────────────────────────────┘

Q5: How do you implement data security in Redshift?

Answer:

Security Architecture:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              Redshift Security Architecture                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Network Security                                              │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ • VPC deployment                                        │   │
│  │ • Security groups                                       │   │
│  │ • Network ACLs                                          │   │
│  │ • VPC endpoints                                         │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Access Control                                                │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ • IAM roles                                             │   │
│  │ • Database users/groups                                 │   │
│  │ • Row-level security                                    │   │
│  │ • Column-level security                                 │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Data Protection                                               │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ • Encryption at rest (KMS)                              │   │
│  │ • Encryption in transit (SSL)                           │   │
│  │ • Column encryption                                      │   │
│  │ • Dynamic data masking                                  │   │
│  └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

Column-Level Security:

-- Create group
CREATE GROUP analysts;

-- Grant permissions
GRANT USAGE ON SCHEMA analytics TO GROUP analysts;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics TO GROUP analysts;

-- Revoke sensitive columns
REVOKE SELECT (ssn, salary) ON TABLE employees FROM GROUP analysts;

-- Row-level security
CREATE POLICY region_policy ON sales_data
    FOR ALL
    TO analysts
    USING (region = current_user_region());

Dynamic Data Masking:

-- Create masking policy
CREATE MASKING POLICY email_mask ON customers
    FOR COLUMN email
    USING (
        CASE 
            WHEN is_member_of('admin') THEN email
            ELSE CONCAT('****', RIGHT(email, LENGTH(email) - INSTR(email, '@') + 1))
        END
    );

-- Apply masking
ALTER TABLE customers ALTER COLUMN email SET MASKING POLICY email_mask;

Q6: How do you implement data lakehouse architecture with Redshift Spectrum?

Answer:

Lakehouse Architecture:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              Data Lakehouse with Redshift Spectrum              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                   Redshift Cluster                       │   │
│  │  ┌─────────────┐    ┌─────────────┐                     │   │
│  │  │ Local Tables│    │  External   │                     │   │
│  │  │ (Hot Data)  │    │  Tables     │                     │   │
│  │  └─────────────┘    │  (Cold)     │                     │   │
│  │                      └──────┬──────┘                     │   │
│  └─────────────────────────────┼───────────────────────────┘   │
│                                │                                │
│                                ▼                                │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                    S3 Data Lake                          │   │
│  │  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐ │   │
│  │  │   Hot       │    │   Warm      │    │   Cold      │ │   │
│  │  │   (SSD)     │    │   (S3-IA)   │    │  (Glacier)  │ │   │
│  │  └─────────────┘    └─────────────┘    └─────────────┘ │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

External Table Creation:

-- Create external schema
CREATE EXTERNAL SCHEMA data_lake
FROM DATA CATALOG
DATABASE 'lake_db'
IAM_ROLE 'arn:aws:iam::role/SpectrumRole';

-- Create external table
CREATE EXTERNAL TABLE data_lake.sales_ext (
    transaction_id VARCHAR(50),
    amount DECIMAL(10,2),
    transaction_date DATE
)
PARTITIONED BY (year INT, month INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
LOCATION 's3://data-lake/silver/sales/';

-- Query across local and external
SELECT 
    l.customer_name,
    SUM(e.amount) as total_sales
FROM local.customers l
JOIN data_lake.sales_ext e ON l.customer_id = e.customer_id
WHERE e.year = 2024 AND e.month = 1
GROUP BY l.customer_name;

Q7: How do you implement data quality in Redshift?

Answer:

Data Quality Framework:

-- Quality checks table
CREATE TABLE data_quality_checks (
    check_id INT IDENTITY(1,1),
    table_name VARCHAR(100),
    check_type VARCHAR(50),
    check_sql VARCHAR(1000),
    expected_value VARCHAR(100),
    actual_value VARCHAR(100),
    passed BOOLEAN,
    executed_at TIMESTAMP DEFAULT GETDATE()
);

-- Completeness check
INSERT INTO data_quality_checks (table_name, check_type, check_sql, passed)
SELECT 
    'fact_sales',
    'completeness',
    'SELECT COUNT(*) FROM fact_sales WHERE amount IS NULL',
    CASE WHEN COUNT(*) = 0 THEN TRUE ELSE FALSE END
FROM fact_sales
WHERE amount IS NULL;

-- Referential integrity check
INSERT INTO data_quality_checks (table_name, check_type, check_sql, passed)
SELECT 
    'fact_sales',
    'referential_integrity',
    'SELECT COUNT(*) FROM fact_sales f LEFT JOIN dim_customer c ON f.customer_key = c.customer_key WHERE c.customer_key IS NULL',
    CASE WHEN COUNT(*) = 0 THEN TRUE ELSE FALSE END
FROM fact_sales f
LEFT JOIN dim_customer c ON f.customer_key = c.customer_key
WHERE c.customer_key IS NULL;

-- Business rule validation
INSERT INTO data_quality_checks (table_name, check_type, check_sql, passed)
SELECT 
    'fact_sales',
    'business_rule',
    'amount >= 0',
    CASE WHEN COUNT(*) = 0 THEN TRUE ELSE FALSE END
FROM fact_sales
WHERE amount < 0;

Quality Dashboard:

-- Quality summary view
CREATE VIEW vw_data_quality_summary AS
SELECT 
    table_name,
    check_type,
    COUNT(*) as total_checks,
    SUM(CASE WHEN passed THEN 1 ELSE 0 END) as passed_checks,
    SUM(CASE WHEN NOT passed THEN 1 ELSE 0 END) as failed_checks,
    ROUND(SUM(CASE WHEN passed THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) as pass_rate
FROM data_quality_checks
GROUP BY table_name, check_type;

Q8: How do you implement real-time analytics with Redshift?

Answer:

Real-Time Architecture:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              Real-Time Analytics with Redshift                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Data Sources          Ingestion          Processing           │
│  ┌─────────────┐      ┌─────────────┐   ┌─────────────┐       │
│  │ Applications│─────▶│ Kinesis     │──▶│ Lambda      │       │
│  │ IoT         │      │ Data Firehose│   │ Transform   │       │
│  └─────────────┘      └─────────────┘   └──────┬──────┘       │
│                                                  │              │
│                                                  ▼              │
│                                           ┌─────────────┐       │
│                                           │  Redshift   │       │
│                                           │  (Streaming │       │
│                                           │   Ingest)   │       │
│                                           └──────┬──────┘       │
│                                                  │              │
│                             ┌────────────────────┼──────────┐  │
│                             ▼                    ▼          ▼  │
│                      ┌──────────┐        ┌──────────┐ ┌────────┐│
│                      │Real-time │        │  Batch   │ │  API   ││
│                      │Dashboard │        │ Reports  │ │  Apps  ││
│                      └──────────┘        └──────────┘ └────────┘│
└─────────────────────────────────────────────────────────────────┘

Streaming Ingestion:

# Kinesis Data Firehose to Redshift
firehose = boto3.client('firehose')

firehose.create_delivery_stream(
    DeliveryStreamName='redshift-streaming',
    DeliveryStreamType='DirectPut',
    RedshiftDestinationConfiguration={
        'RoleARN': 'arn:aws:iam::role/firehose-redshift-role',
        'ClusterJDBCURL': 'jdbc:redshift://cluster.xxx.us-east-1.redshift.amazonaws.com:5439/analytics',
        'CopyCommand': {
            'DataTableName': 'streaming_data',
            'DataTableColumns': 'event_time, event_type, user_id, data'
        },
        'Username': 'admin',
        'Password': 'password',
        'S3Configuration': {
            'RoleARN': 'arn:aws:iam::role/firehose-s3-role',
            'BucketARN': 'arn:aws:s3:::firehose-backup-bucket',
            'Prefix': 'redshift-staging/'
        }
    }
)

Q9: How do you implement data virtualization with Redshift Spectrum?

Answer:

Data Virtualization Architecture:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              Data Virtualization with Spectrum                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Query Layer                                                   │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                      Redshift                            │   │
│  │  ┌─────────────┐    ┌─────────────┐                     │   │
│  │  │  Local      │    │  External   │                     │   │
│  │  │  Tables     │    │  Tables     │                     │   │
│  │  │  (Hot)      │    │  (Cold)     │                     │   │
│  │  └─────────────┘    └──────┬──────┘                     │   │
│  └────────────────────────────┼───────────────────────────┘   │
│                                │                                │
│  Storage Layer                ▼                                │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                    S3 Data Lake                          │   │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐     │   │
│  │  │   Current   │  │  Historical │  │  External   │     │   │
│  │  │   Data      │  │  Data       │  │  Sources    │     │   │
│  │  │  (Parquet)  │  │  (ORC)      │  │  (CSV/JSON) │     │   │
│  │  └─────────────┘  └─────────────┘  └─────────────┘     │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Cross-Source Queries                                         │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ SELECT * FROM local.customers c                         │   │
│  │ JOIN external.sales_ext s ON c.id = s.customer_id       │   │
│  │ JOIN external.clickstream_ext cl ON c.id = cl.user_id   │   │
│  └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

External Schema Configuration:

-- Multiple data sources
CREATE EXTERNAL SCHEMA sales_lake
FROM DATA CATALOG
DATABASE 'sales_db'
IAM_ROLE 'arn:aws:iam::role/SpectrumRole';

CREATE EXTERNAL SCHEMA clickstream_lake
FROM DATA CATALOG
DATABASE 'clickstream_db'
IAM_ROLE 'arn:aws:iam::role/SpectrumRole';

-- Cross-source query
SELECT 
    c.customer_name,
    SUM(s.amount) as total_sales,
    COUNT(cl.event_id) as page_views
FROM local.customers c
JOIN sales_lake.sales s ON c.customer_id = s.customer_id
LEFT JOIN clickstream_lake.events cl ON c.customer_id = cl.user_id
GROUP BY c.customer_name;

Q10: How do you implement data warehouse automation?

Answer:

Automation Framework:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              Data Warehouse Automation                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  CI/CD Pipeline                                                │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐        │
│  │  Schema     │───▶│  Build      │───▶│  Deploy     │        │
│  │  Changes    │    │  (Test)     │    │  (Prod)     │        │
│  └─────────────┘    └─────────────┘    └─────────────┘        │
│                                                                 │
│  Orchestration                                                 │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐        │
│  │  Step       │───▶│  Lambda     │───▶│  Redshift   │        │
│  │  Functions  │    │  Tasks      │    │  Queries    │        │
│  └─────────────┘    └─────────────┘    └─────────────┘        │
│                                                                 │
│  Monitoring                                                    │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐        │
│  │  CloudWatch │───▶│  Alarms     │───▶│  SNS        │        │
│  │  Metrics    │    │             │    │  Alerts     │        │
│  └─────────────┘    └─────────────┘    └─────────────┘        │
└─────────────────────────────────────────────────────────────────┘

Step Functions Orchestration:

{
  "StartAt": "ExtractData",
  "States": {
    "ExtractData": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "extract-sales-data"
      },
      "Next": "TransformData"
    },
    "TransformData": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "transform-sales-data"
      },
      "Next": "LoadToRedshift"
    },
    "LoadToRedshift": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:load-redshift",
      "Next": "RunQualityChecks"
    },
    "RunQualityChecks": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:quality-checks",
      "Next": "PublishMetrics"
    },
    "PublishMetrics": {
      "Type": "Task",
      "Resource": "arn:aws:states:::cloudwatch:putMetricData.sync",
      "End": true
    }
  }
}

Q11: How do you implement multi-tenant data warehousing?

Answer:

Multi-Tenant Architecture:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              Multi-Tenant Data Warehouse                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Tenant Isolation Strategies:                                  │
│                                                                 │
│  1. Separate Databases                                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐           │
│  │ Tenant A DB │  │ Tenant B DB │  │ Tenant C DB │           │
│  │ ─────────── │  │ ─────────── │  │ ─────────── │           │
│  │ Tables      │  │ Tables      │  │ Tables      │           │
│  │ Users       │  │ Users       │  │ Users       │           │
│  └─────────────┘  └─────────────┘  └─────────────┘           │
│                                                                 │
│  2. Schema-Based Isolation                                     │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  Schema A          Schema B          Schema C           │   │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐     │   │
│  │  │ Tables      │  │ Tables      │  │ Tables      │     │   │
│  │  └─────────────┘  └─────────────┘  └─────────────┘     │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  3. Row-Level Security                                         │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │  Table with tenant_id column                            │   │
│  │  ┌─────────────────────────────────────────────────┐    │   │
│  │  │ tenant_id │ data...                             │    │   │
│  │  │ A         │ row1, row2                          │    │   │
│  │  │ B         │ row3, row4                          │    │   │
│  │  │ A         │ row5                                │    │   │
│  │  └─────────────────────────────────────────────────┘    │   │
│  └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

Row-Level Security Implementation:

-- Create tenant view
CREATE VIEW tenant_sales AS
SELECT * FROM sales_data
WHERE tenant_id = current_setting('app.tenant_id');

-- Set tenant context
SET app.tenant_id = 'tenant_a';

-- Query only returns tenant_a data
SELECT * FROM tenant_sales;

Q12: How do you implement disaster recovery for Redshift?

Answer:

DR Architecture:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              Redshift Disaster Recovery                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Primary Region (us-east-1)      DR Region (us-west-2)        │
│  ┌─────────────────────┐        ┌─────────────────────┐       │
│  │ Redshift Cluster    │───────▶│ Redshift Cluster    │       │
│  │ (Primary)           │ Snap  │ (Standby)           │       │
│  └─────────────────────┘ shot  └─────────────────────┘       │
│           │              Transfer                            │
│           │                  │                                │
│           ▼                  ▼                                │
│  ┌─────────────────────┐        ┌─────────────────────┐       │
│  │ S3 Data Lake        │───────▶│ S3 Replica          │       │
│  │ (Versioning)        │  CRR   │                     │       │
│  └─────────────────────┘        └─────────────────────┘       │
│                                                                 │
│  RPO: 1 hour (snapshot frequency)                             │
│  RTO: 30 minutes (automated restore)                          │
└─────────────────────────────────────────────────────────────────┘

Automated Snapshot and Restore:

# Automated snapshot
def lambda_handler(event, context):
    redshift = boto3.client('redshift')
    
    # Create manual snapshot
    redshift.create_cluster_snapshot(
        SnapshotIdentifier=f'dr-snapshot-{datetime.now().strftime("%Y%m%d")}',
        ClusterIdentifier='primary-cluster'
    )
    
    # Copy to DR region
    redshift.copy_cluster_snapshot(
        SourceSnapshotIdentifier='dr-snapshot-20240115',
        SourceSnapshotClusterIdentifier='primary-cluster',
        DestinationRegion='us-west-2',
        SnapshotIdentifier='dr-snapshot-us-west-2'
    )

# Automated restore in DR region
def restore_dr_cluster():
    redshift_dr = boto3.client('redshift', region_name='us-west-2')
    
    redshift_dr.restore_from_cluster_snapshot(
        ClusterIdentifier='dr-cluster',
        SnapshotIdentifier='dr-snapshot-us-west-2',
        SnapshotClusterIdentifier='primary-cluster'
    )

Q13: How do you optimize costs in Redshift?

Answer:

Cost Optimization Strategies:

1. Reserved Instances:

# Calculate savings
def calculate_ri_savings():
    on_demand_hourly = 0.25  # ra3.xlplus
    reserved_1yr = 0.16     # All upfront
    reserved_3yr = 0.10     # All upfront
    
    annual_on_demand = on_demand_hourly * 8760
    annual_1yr = reserved_1yr * 8760
    annual_3yr = reserved_3yr * 8760
    
    savings_1yr = (annual_on_demand - annual_1yr) / annual_on_demand * 100
    savings_3yr = (annual_on_demand - annual_3yr) / annual_on_demand * 100
    
    return {
        'on_demand_annual': annual_on_demand,
        '1yr_ri_annual': annual_1yr,
        '3yr_ri_annual': annual_3yr,
        'savings_1yr': savings_1yr,
        'savings_3yr': savings_3yr
    }

2. Pause/Resume for Dev/Test:

# Pause cluster during off-hours
def pause_cluster():
    redshift = boto3.client('redshift')
    redshift.pause_cluster(ClusterIdentifier='dev-cluster')

# Resume cluster
def resume_cluster():
    redshift = boto3.client('redshift')
    redshift.resume_cluster(ClusterIdentifier='dev-cluster')

3. Right-Sizing:

-- Monitor query patterns
SELECT 
    query,
    records,
    elapsed_time,
    bytes_scanned
FROM stl_query
WHERE starttime > DATEADD(day, -7, GETDATE())
ORDER BY bytes_scanned DESC;

-- Identify underutilized clusters
SELECT 
    cluster,
    AVG(cpu_utilization) as avg_cpu,
    AVG(memory_utilization) as avg_memory
FROM system_metrics
WHERE timestamp > DATEADD(day, -30, GETDATE())
GROUP BY cluster
HAVING AVG(cpu_utilization) < 20;

Cost Comparison:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              Redshift Cost Comparison                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Instance: ra3.xlplus (4 vCPU, 32 GB RAM)                     │
│                                                                 │
│  On-Demand:        $0.25/hr     = $180/month                  │
│  1yr Reserved:     $0.16/hr     = $115/month (36% savings)    │
│  3yr Reserved:     $0.10/hr     = $72/month (60% savings)     │
│                                                                 │
│  With Auto-Suspend (10hr/day):                                 │
│  On-Demand:        $0.25 * 10   = $75/month (58% savings)     │
│  1yr Reserved:     $0.16 * 10   = $48/month (73% savings)     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Q14: How do you implement data versioning in Redshift?

Answer:

Versioning Strategies:

1. Temporal Tables (SCD Type 2):

-- Create versioned table
CREATE TABLE dim_customer_versioned (
    customer_key BIGINT PRIMARY KEY IDENTITY(1,1),
    customer_id VARCHAR(50),
    name VARCHAR(100),
    email VARCHAR(200),
    effective_date TIMESTAMP,
    end_date TIMESTAMP,
    is_current BOOLEAN
);

-- Merge with versioning
MERGE INTO dim_customer_versioned t
USING staging_customer s
ON t.customer_id = s.customer_id AND t.is_current = TRUE
WHEN MATCHED AND (t.name != s.name OR t.email != s.email) THEN
    UPDATE SET 
        is_current = FALSE,
        end_date = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
    INSERT (customer_id, name, email, effective_date, end_date, is_current)
    VALUES (s.customer_id, s.name, s.email, CURRENT_TIMESTAMP, NULL, TRUE);

2. Snapshot-Based Versioning:

-- Create snapshot table
CREATE TABLE sales_snapshot AS
SELECT 
    *,
    GETDATE() as snapshot_date,
    'v1' as version
FROM fact_sales;

-- Query specific version
SELECT * FROM sales_snapshot WHERE version = 'v1';

3. Git-Based Schema Versioning:

# Version control for schema changes
import git

repo = git.Repo('/path/to/schema-repo')

# Save current schema
def save_schema_version():
    schema = get_current_schema()
    
    with open('schema.sql', 'w') as f:
        f.write(schema)
    
    repo.index.add(['schema.sql'])
    repo.index.commit(f'Schema update: {datetime.now().isoformat()}')
    
    tag = repo.create_tag(f'v{get_version()}')
    return tag

Q15: How do you implement monitoring for Redshift?

Answer:

Monitoring Architecture:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              Redshift Monitoring Architecture                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Metrics           Logs              Alarms                    │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐           │
│  │ CloudWatch  │  │ System      │  │ SNS         │           │
│  │ Metrics     │  │ Tables      │  │ Alerts      │           │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘           │
│         │                │                │                    │
│         └────────────────┼────────────────┘                    │
│                          ▼                                      │
│                   ┌─────────────┐                              │
│                   │  Dashboard  │                              │
│                   │ (QuickSight)│                              │
│                   └─────────────┘                              │
└─────────────────────────────────────────────────────────────────┘

CloudWatch Metrics:

cloudwatch = boto3.client('cloudwatch')

# Redshift metrics
metrics = [
    'CPUUtilization',
    'DatabaseConnections',
    'HealthStatus',
    'NetworkReceiveThroughput',
    'NetworkTransmitThroughput',
    'PercentageDiskSpaceUsed',
    'QueryDuration',
    'QueryRuntimeBreakdown'
]

# Create alarms
for metric in metrics:
    cloudwatch.put_metric_alarm(
        AlarmName=f'Redshift-{metric}',
        MetricName=metric,
        Namespace='AWS/Redshift',
        Statistic='Average',
        Period=300,
        EvaluationPeriods=2,
        Threshold=get_threshold(metric),
        ComparisonOperator='GreaterThanThreshold',
        AlarmActions=['arn:aws:sns:us-east-1:123456789:redshift-alerts'],
        Dimensions=[
            {'Name': 'ClusterIdentifier', 'Value': 'my-cluster'}
        ]
    )

System Table Queries:

-- Long-running queries
SELECT 
    query,
    username,
    starttime,
    elapsed,
    records,
    bytes
FROM stl_query
WHERE starttime > DATEADD(hour, -1, GETDATE())
AND elapsed > 60000  -- > 1 minute
ORDER BY elapsed DESC;

-- Disk usage
SELECT 
    owner,
    diskno,
    used,
    capacity,
    (used::float / capacity::float) * 100 as pct_used
FROM stv_partitions
WHERE host = (SELECT master节点 FROM stv_mv_state LIMIT 1);

-- Query performance
SELECT 
    query,
    service_class,
    num_partitions,
    elapsed_time,
    queue_time
FROM stl_wlm_query
WHERE starttime > DATEADD(day, -7, GETDATE())
ORDER BY elapsed_time DESC;

Q16: How do you implement data governance in Redshift?

Answer:

Governance Framework:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              Redshift Data Governance                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Metadata Management                                          │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ • Data catalog (Glue)                                   │   │
│  │ • Column descriptions                                   │   │
│  │ • Data lineage                                          │   │
│  │ • Business glossary                                     │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Access Control                                                │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ • Role-based access control                             │   │
│  │ • Column-level permissions                              │   │
│  │ • Row-level security                                    │   │
│  │ • Data masking                                          │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Audit & Compliance                                           │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ • CloudTrail integration                                │   │
│  │ • Query logging                                         │   │
│  │ • Data access logs                                      │   │
│  │ • Compliance reports                                    │   │
│  └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

Data Classification:

-- Add classification tags
COMMENT ON TABLE customers IS 'Classification: PII - Contains customer personal information';
COMMENT ON COLUMN customers.ssn IS 'Classification: Highly Sensitive - Social Security Number';
COMMENT ON COLUMN customers.email IS 'Classification: Sensitive - Contact Information';

-- Query classification
SELECT 
    c.table_name,
    c.column_name,
    pg_description.description as classification
FROM information_schema.columns c
JOIN pg_description ON c.table_name::regclass = pg_description.objoid
WHERE pg_description.description LIKE '%Sensitive%';

Q17: How do you implement data warehouse testing?

Answer:

Testing Framework:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              Data Warehouse Testing Framework                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Unit Tests                                                    │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ • Transformation logic                                  │   │
│  │ • Data type conversions                                 │   │
│  │ • Business rule validation                              │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Integration Tests                                             │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ • ETL pipeline execution                                │   │
│  │ • Data flow validation                                  │   │
│  │ • Cross-table consistency                               │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
│  Performance Tests                                             │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │ • Query execution time                                  │   │
│  │ • Load performance                                      │   │
│  │ • Concurrency testing                                   │   │
│  └─────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

SQL-Based Testing:

-- Test completeness
CREATE OR REPLACE PROCEDURE test_completeness(table_name VARCHAR)
AS $$
DECLARE
    null_count INT;
BEGIN
    EXECUTE 'SELECT COUNT(*) FROM ' || table_name || ' WHERE amount IS NULL'
    INTO null_count;
    
    IF null_count > 0 THEN
        RAISE EXCEPTION 'Completeness test failed: % null values', null_count;
    END IF;
END;
$$ LANGUAGE plpgsql;

-- Test referential integrity
CREATE OR REPLACE PROCEDURE test_referential_integrity()
AS $$
DECLARE
    orphan_count INT;
BEGIN
    SELECT COUNT(*) INTO orphan_count
    FROM fact_sales f
    LEFT JOIN dim_customer c ON f.customer_key = c.customer_key
    WHERE c.customer_key IS NULL;
    
    IF orphan_count > 0 THEN
        RAISE EXCEPTION 'Referential integrity test failed: % orphan records', orphan_count;
    END IF;
END;
$$ LANGUAGE plpgsql;

Q18: How do you implement Redshift Spectrum for federated queries?

Answer:

Federated Query Architecture:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              Federated Query Architecture                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                   Redshift Cluster                       │   │
│  │  ┌─────────────────────────────────────────────────┐   │   │
│  │  │                    Spectrum                       │   │   │
│  │  │  ┌─────────────┐  ┌─────────────┐  ┌──────────┐ │   │   │
│  │  │  │   S3        │  │   RDS       │  │ External │ │   │   │
│  │  │  │   (Data     │  │   (MySQL)   │  │ (API)    │ │   │   │
│  │  │  │    Lake)    │  │             │  │          │ │   │   │
│  │  │  └─────────────┘  └─────────────┘  └──────────┘ │   │   │
│  │  └─────────────────────────────────────────────────┘   │   │
│  └─────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

External Schema Configuration:

-- Connect to RDS MySQL
CREATE EXTERNAL SCHEMA mysql_source
FROM DATA CATALOG
DATABASE 'mysql_db'
IAM_ROLE 'arn:aws:iam::role/SpectrumRole'
URI 'jdbc:mysql://rds-instance.xxx.us-east-1.rds.amazonaws.com:3306/mydb';

-- Query across sources
SELECT 
    r.customer_name,
    s.total_spent,
    m.login_count
FROM mysql_source.customers r
JOIN local.sales_summary s ON r.id = s.customer_id
JOIN spectrum.clickstream m ON r.id = m.user_id;

Q19: How do you implement Redshift for IoT analytics?

Answer:

IoT Analytics Architecture:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              Redshift IoT Analytics Architecture                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  IoT Devices          Ingestion          Storage               │
│  ┌─────────────┐     ┌─────────────┐   ┌─────────────┐        │
│  │ Sensors     │────▶│ IoT Core     │──▶│ Kinesis     │        │
│  │ (MQTT)      │     │              │   │ Streams     │        │
│  └─────────────┘     └─────────────┘   └──────┬──────┘        │
│                                                │                │
│  ┌─────────────┐     ┌─────────────┐          │                │
│  │ Gateways    │────▶│ IoT Analytics│─────────┤                │
│  └─────────────┘     └─────────────┘          │                │
│                                                ▼                │
│                                         ┌─────────────┐        │
│                                         │   S3        │        │
│                                         │  (Raw)      │        │
│                                         └──────┬──────┘        │
│                                                │                │
│                                                ▼                │
│                                         ┌─────────────┐        │
│                                         │  Redshift   │        │
│                                         │  (Analytics)│        │
│                                         └──────┬──────┘        │
│                                                │                │
│                             ┌──────────────────┼──────────┐    │
│                             ▼                  ▼          ▼    │
│                      ┌──────────┐      ┌──────────┐ ┌────────┐│
│                      │ TimeSeries│     │ Predictive│ │ Alerts ││
│                      │ Analysis  │     │ Maintenance│ │        ││
│                      └──────────┘      └──────────┘ └────────┘│
└─────────────────────────────────────────────────────────────────┘

Time-Series Optimization:

-- Time-series table design
CREATE TABLE iot_readings (
    device_id VARCHAR(50),
    reading_time TIMESTAMP,
    temperature DECIMAL(5,2),
    humidity DECIMAL(5,2),
    pressure DECIMAL(7,2)
)
DISTKEY(device_id)
COMPOUND SORTKEY(reading_time, device_id);

-- Partition by time range
CREATE EXTERNAL TABLE iot_readings_ext (
    device_id VARCHAR(50),
    reading_time TIMESTAMP,
    temperature DECIMAL(5,2),
    humidity DECIMAL(5,2),
    pressure DECIMAL(7,2)
)
PARTITIONED BY (year INT, month INT, day INT, hour INT)
STORED AS PARQUET
LOCATION 's3://iot-data/readings/';

Q20: How do you implement Redshift for machine learning?

Answer:

ML Integration Architecture:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              Redshift ML Architecture                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Data Preparation        Model Training        Deployment      │
│  ┌─────────────┐        ┌─────────────┐      ┌─────────────┐  │
│  │ Redshift    │───────▶│ SageMaker   │─────▶│ Endpoint    │  │
│  │ (SQL)       │  UNLOAD│ AutoPilot   │      │ (Real-time) │  │
│  └─────────────┘        └─────────────┘      └─────────────┘  │
│                                                                 │
│  Prediction             Integration           Monitoring       │
│  ┌─────────────┐        ┌─────────────┐      ┌─────────────┐  │
│  │ CREATE MODEL│───────▶│ SQL Queries │─────▶│ CloudWatch  │  │
│  │ (Predict)   │        │             │      │ Metrics     │  │
│  └─────────────┘        └─────────────┘      └─────────────┘  │
└─────────────────────────────────────────────────────────────────┘

Redshift ML Implementation:

-- Create ML model
CREATE MODEL churn_prediction
FROM (
    SELECT 
        customer_id,
        tenure_months,
        monthly_charges,
        total_charges,
        churned
    FROM customer_data
)
TARGET churned
FUNCTION predict_churn
IAM_ROLE 'arn:aws:iam::role/RedshiftMLRole'
SETTINGS (
    S3_BUCKET 'ml-artifacts',
    MAX_RUNTIME 3600
);

-- Use model for predictions
SELECT 
    customer_id,
    predict_churn(tenure_months, monthly_charges, total_charges) as churn_probability
FROM new_customers
WHERE predict_churn(tenure_months, monthly_charges, total_charges) > 0.7;

Q21: How do you implement data quality monitoring in Redshift?

Answer:

Quality Monitoring Framework:

-- Quality metrics table
CREATE TABLE data_quality_metrics (
    metric_id INT IDENTITY(1,1),
    table_name VARCHAR(100),
    metric_name VARCHAR(100),
    metric_value DECIMAL(10,2),
    threshold DECIMAL(10,2),
    passed BOOLEAN,
    measured_at TIMESTAMP DEFAULT GETDATE()
);

-- Completeness metric
INSERT INTO data_quality_metrics (table_name, metric_name, metric_value, threshold, passed)
SELECT 
    'fact_sales',
    'completeness_amount',
    (COUNT(*) - COUNT(amount))::DECIMAL / COUNT(*) * 100,
    0,
    CASE WHEN (COUNT(*) - COUNT(amount)) = 0 THEN TRUE ELSE FALSE END
FROM fact_sales;

-- Uniqueness metric
INSERT INTO data_quality_metrics (table_name, metric_name, metric_value, threshold, passed)
SELECT 
    'fact_sales',
    'uniqueness_sale_key',
    COUNT(*) - COUNT(DISTINCT sale_key),
    0,
    CASE WHEN COUNT(*) = COUNT(DISTINCT sale_key) THEN TRUE ELSE FALSE END
FROM fact_sales;

-- Freshness metric
INSERT INTO data_quality_metrics (table_name, metric_name, metric_value, threshold, passed)
SELECT 
    'fact_sales',
    'freshness_hours',
    EXTRACT(EPOCH FROM (GETDATE() - MAX(sale_date))) / 3600,
    24,
    CASE WHEN EXTRACT(EPOCH FROM (GETDATE() - MAX(sale_date))) / 3600 <= 24 THEN TRUE ELSE FALSE END
FROM fact_sales;

Q22: How do you implement Redshift for real-time dashboards?

Answer:

Real-Time Dashboard Architecture:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              Real-Time Dashboard Architecture                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Data Sources          Ingestion          Dashboard            │
│  ┌─────────────┐      ┌─────────────┐   ┌─────────────┐       │
│  │ Applications│─────▶│ Kinesis     │──▶│ QuickSight  │       │
│  │ APIs        │      │ Firehose    │   │ Dashboard   │       │
│  └─────────────┘      └──────┬──────┘   └─────────────┘       │
│                              │                                  │
│                              ▼                                  │
│                       ┌─────────────┐                          │
│                       │  Redshift   │                          │
│                       │  (Materialized│                         │
│                       │   Views)    │                          │
│                       └─────────────┘                          │
└─────────────────────────────────────────────────────────────────┘

Materialized View for Real-Time:

-- Real-time aggregation
CREATE MATERIALIZED VIEW mv_realtime_sales AS
SELECT 
    date_trunc('minute', sale_date) as sale_minute,
    region,
    product_category,
    COUNT(*) as transaction_count,
    SUM(amount) as total_sales,
    AVG(amount) as avg_sale
FROM fact_sales
WHERE sale_date >= DATEADD(minute, -5, GETDATE())
GROUP BY 1, 2, 3;

-- Auto-refresh
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_realtime_sales;

Q23: How do you implement Redshift data sharing?

Answer:

Data Sharing Architecture:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              Redshift Data Sharing                              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Producer Cluster           Consumer Clusters                  │
│  ┌─────────────────────┐   ┌─────────────────────┐            │
│  │   Primary Cluster   │   │  Analytics Cluster  │            │
│  │   ┌─────────────┐   │   │  ┌─────────────┐   │            │
│  │   │  Shared     │───┼──▶│  │  External   │   │            │
│  │   │  Database   │   │   │  │  Tables     │   │            │
│  │   └─────────────┘   │   │  └─────────────┘   │            │
│  └─────────────────────┘   └─────────────────────┘            │
│                                                                 │
│  ┌─────────────────────┐   ┌─────────────────────┐            │
│  │  Reporting Cluster  │   │  ML Cluster         │            │
│  │  ┌─────────────┐   │   │  ┌─────────────┐   │            │
│  │  │  External   │   │   │  │  External   │   │            │
│  │  │  Tables     │   │   │  │  Tables     │   │            │
│  │  └─────────────┘   │   │  └─────────────┘   │            │
│  └─────────────────────┘   └─────────────────────┘            │
└─────────────────────────────────────────────────────────────────┘

Data Sharing Setup:

-- Producer: Create share
CREATE DATABASE SHARE sales_share;

-- Grant usage
GRANT USAGE ON DATABASE sales_share TO GROUP analytics_team;

-- Add tables to share
ALTER DATABASE sales_share ADD TABLE fact_sales;
ALTER DATABASE sales_share ADD TABLE dim_customer;

-- Consumer: Create database from share
CREATE DATABASE sales_analytics FROM SHARE producer-cluster:sales_share;

Q24: How do you implement Redshift for financial analytics?

Answer:

Financial Analytics Architecture:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              Financial Analytics Architecture                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Data Sources          Processing          Analytics           │
│  ┌─────────────┐      ┌─────────────┐   ┌─────────────┐       │
│  │ Transactions│─────▶│ ETL Pipeline│──▶│ Redshift    │       │
│  │ Market Data │      │ (Glue)      │   │ (DW)        │       │
│  │ References  │      │             │   │             │       │
│  └─────────────┘      └─────────────┘   └──────┬──────┘       │
│                                                  │              │
│                             ┌────────────────────┼──────────┐  │
│                             ▼                    ▼          ▼  │
│                      ┌──────────┐        ┌──────────┐ ┌────────┐│
│                      │ Risk     │        │ Compliance│ │Reports ││
│                      │ Analysis │        │ Reporting │ │        ││
│                      └──────────┘        └──────────┘ └────────┘│
└─────────────────────────────────────────────────────────────────┘

Financial Data Modeling:

-- Financial fact table
CREATE TABLE fact_transactions (
    transaction_key BIGINT PRIMARY KEY IDENTITY(1,1),
    date_key INT NOT NULL,
    account_key INT NOT NULL,
    counterparty_key INT NOT NULL,
    instrument_key INT NOT NULL,
    amount DECIMAL(18,4),
    currency VARCHAR(3),
    exchange_rate DECIMAL(10,6),
    usd_equivalent DECIMAL(18,4),
    transaction_type VARCHAR(20),
    settlement_status VARCHAR(20)
)
DISTSTYLE KEY
DISTKEY(account_key)
COMPOUND SORTKEY(date_key, account_key);

Q25: How do you implement Redshift for healthcare analytics?

Answer:

Healthcare Analytics Architecture:

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              Healthcare Analytics Architecture                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Data Sources          Processing          Analytics           │
│  ┌─────────────┐      ┌─────────────┐   ┌─────────────┐       │
│  │ EHR Systems │─────▶│ De-identify │──▶│ Redshift    │       │
│  │ Claims Data │      │ & Transform │   │ (HIPAA      │       │
│  │ Clinical    │      │             │   │  Compliant) │       │
│  └─────────────┘      └─────────────┘   └──────┬──────┘       │
│                                                  │              │
│                             ┌────────────────────┼──────────┐  │
│                             ▼                    ▼          ▼  │
│                      ┌──────────┐        ┌──────────┐ ┌────────┐│
│                      │ Clinical │        │ Population│ │Quality ││
│                      │ Research │        │ Health    │ │Metrics ││
│                      └──────────┘        └──────────┘ └────────┘│
└─────────────────────────────────────────────────────────────────┘

HIPAA-Compliant Implementation:

-- De-identification function
CREATE OR REPLACE FUNCTION de_identify_ssn(ssn VARCHAR)
RETURNS VARCHAR
AS $$
BEGIN
    RETURN CONCAT('XXX-XX-', RIGHT(ssn, 4));
END;
$$ LANGUAGE plpgsql;

-- Row-level security for patient data
CREATE POLICY patient_access ON patient_records
    FOR ALL
    TO healthcare_team
    USING (provider_id = current_user_provider_id());

-- Audit logging
CREATE TABLE access_audit (
    access_id INT IDENTITY(1,1),
    user_name VARCHAR(100),
    table_name VARCHAR(100),
    access_type VARCHAR(20),
    access_time TIMESTAMP DEFAULT GETDATE(),
    query_text VARCHAR(10000)
);

Summary

Mastering AWS data warehousing requires understanding:

Architecture Design: Star/snowflake schemas, distribution strategies, sort keys
Performance Optimization: Materialized views, compression, query tuning
Security & Compliance: Column/row-level security, encryption, HIPAA
Cost Management: Reserved instances, auto-suspend, right-sizing
Operations: Monitoring, testing, disaster recovery

These concepts form the foundation for building scalable, performant, and secure data warehouses on AWS.

AWS Data Warehouse Interview Questions

🏗️ AWS Data Warehouse Interview

Data Warehouse Architecture Overview

Q1: Compare Redshift Provisioned vs Serverless. When would you choose each?

Q2: How do you design a star schema for data warehousing on Redshift?

Q3: How do you optimize Redshift query performance?

Q4: How do you implement ETL/ELT pipelines with Redshift?

Q5: How do you implement data security in Redshift?

Q6: How do you implement data lakehouse architecture with Redshift Spectrum?

Q7: How do you implement data quality in Redshift?

Q8: How do you implement real-time analytics with Redshift?

Q9: How do you implement data virtualization with Redshift Spectrum?

Q10: How do you implement data warehouse automation?

Q11: How do you implement multi-tenant data warehousing?

Q12: How do you implement disaster recovery for Redshift?

Q13: How do you optimize costs in Redshift?

Q14: How do you implement data versioning in Redshift?

Q15: How do you implement monitoring for Redshift?

Q16: How do you implement data governance in Redshift?

Q17: How do you implement data warehouse testing?

Q18: How do you implement Redshift Spectrum for federated queries?

Q19: How do you implement Redshift for IoT analytics?

Q20: How do you implement Redshift for machine learning?

Q21: How do you implement data quality monitoring in Redshift?

Q22: How do you implement Redshift for real-time dashboards?

Q23: How do you implement Redshift data sharing?

Q24: How do you implement Redshift for financial analytics?

Q25: How do you implement Redshift for healthcare analytics?

Summary