Data Warehouse Architecture Overview
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β AWS Data Warehouse Architecture β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Data Sources ETL/ELT Storage β
β ββββββββββββ ββββββββββββ ββββββββββββ β
β β DatabasesββββββββββΆβ Glue ββββββββΆβ Redshift β β
β β Files β β EMR β β Cluster β β
β β Streams β β Lambda β β β β
β ββββββββββββ ββββββββββββ ββββββ¬ββββββ β
β β β
β Query Layer β β
β βββββββββββββββ β β
β β Athena βββββββββββββββββ β
β β Spectrum β β
β ββββββββ¬βββββββ β
β β β
β ββββββββββββΌβββββββββββ β
β βΌ βΌ βΌ β
β βββββββββββ βββββββββββ βββββββββββ β
β β BI/SQL β βPython β β API β β
β β Tools β β Apps β β Clients β β
β βββββββββββ βββββββββββ βββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Q1: Compare Redshift Provisioned vs Serverless. When would you choose each?
Answer:
| Feature | Provisioned | Serverless |
|---|---|---|
| Cost Model | Instance-hours | RPU-hours |
| Scaling | Manual/Auto | Automatic |
| Min Cost | ~0.375/RPU-hr | |
| Use Case | Predictable workloads | Variable workloads |
Choose Provisioned when:
- Predictable, steady-state workloads
- Need specific instance types (RA3, DC2)
- Reserved capacity for cost savings
- Large, consistent data volumes
Choose Serverless when:
- Variable/unpredictable workloads
- Dev/test environments
- Ad-hoc analytics
- Quick setup without cluster management
Decision Tree:
βββββββββββββββββββββββββββββββββββββββββββ
β Is workload predictable (>80% utilized)?β
ββββββββββββββββββ¬βββββββββββββββββββββββββ
β
ββββββββββ΄βββββββββ
βΌ βΌ
Yes No
β β
βΌ βΌ
βββββββββββββ βββββββββββββββββββββ
βProvisionedβ β Need quick setup? β
βββββββββββββ βββββββββββ¬ββββββββββ
β
ββββββββββββ΄βββββββββββ
βΌ βΌ
Yes No
β β
βΌ βΌ
βββββββββββββ βββββββββββββ
β Serverlessβ βProvisionedβ
β (Auto) β β (Flexible)β
βββββββββββββ βββββββββββββ
Q2: How do you design a star schema for data warehousing on Redshift?
Answer:
Star Schema Design:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Star Schema Design β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β βββββββββββββββββββ β
β β dim_date β β
β βββββββββββββββββββ β
β β date_key (PK) β β
β β full_date β β
β β year β β
β β quarter β β
β β month β β
β β day_of_week β β
β ββββββββββ¬βββββββββ β
β β β
β βββββββββββββββββββ β βββββββββββββββββββ β
β β dim_customer β β β dim_product β β
β βββββββββββββββββββ β βββββββββββββββββββ β
β β customer_key(PK)β β β product_key (PK)β β
β β customer_id β β β product_id β β
β β name β β β product_name β β
β β segment β β β category β β
β β region β β β brand β β
β ββββββββββ¬βββββββββ β ββββββββββ¬βββββββββ β
β β β β β
β ββββββββββββββββββββΌββββββββββββββββββββ β
β β β
β βββββββββββ΄ββββββββββ β
β β fact_sales β β
β βββββββββββββββββββββ β
β β sale_key (PK) β β
β β date_key (FK) β β
β β customer_key (FK) β β
β β product_key (FK) β β
β β quantity β β
β β amount β β
β β discount β β
β βββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Redshift Table Creation:
-- Dimension tables
CREATE TABLE dim_date (
date_key INT PRIMARY KEY,
full_date DATE NOT NULL,
year INT NOT NULL,
quarter INT NOT NULL,
month INT NOT NULL,
month_name VARCHAR(10),
day INT NOT NULL,
day_of_week VARCHAR(10),
is_weekend BOOLEAN
)
DISTSTYLE ALL
SORTKEY(full_date);
CREATE TABLE dim_customer (
customer_key INT PRIMARY KEY IDENTITY(1,1),
customer_id VARCHAR(50) NOT NULL,
name VARCHAR(100),
email VARCHAR(200),
segment VARCHAR(50),
region VARCHAR(50)
)
DISTSTYLE KEY
DISTKEY(customer_key)
SORTKEY(customer_id);
-- Fact table
CREATE TABLE fact_sales (
sale_key BIGINT PRIMARY KEY IDENTITY(1,1),
date_key INT NOT NULL REFERENCES dim_date(date_key),
customer_key INT NOT NULL REFERENCES dim_customer(customer_key),
product_key INT NOT NULL REFERENCES dim_product(product_key),
quantity INT,
amount DECIMAL(10,2),
discount DECIMAL(10,2),
profit DECIMAL(10,2)
)
DISTSTYLE KEY
DISTKEY(customer_key)
COMPOUND SORTKEY(date_key, customer_key);
βΉοΈ
Interview Tip: Always discuss distribution keys (DISTKEY) and sort keys (SORTKEY) when discussing Redshift schema design. These are critical for performance.
Q3: How do you optimize Redshift query performance?
Answer:
Performance Optimization Techniques:
1. Distribution Styles:
-- KEY distribution for large fact tables
CREATE TABLE fact_orders (
order_id BIGINT,
customer_id BIGINT,
amount DECIMAL(10,2)
)
DISTSTYLE KEY
DISTKEY(customer_id);
-- ALL distribution for small dimension tables
CREATE TABLE dim_region (
region_id INT,
region_name VARCHAR(50)
)
DISTSTYLE ALL;
-- EVEN distribution for even spread
CREATE TABLE staging_data (
id BIGINT,
data VARCHAR(1000)
)
DISTSTYLE EVEN;
2. Sort Keys:
-- Compound sort key for multi-column filters
CREATE TABLE events (
event_id BIGINT,
event_date TIMESTAMP,
user_id BIGINT,
event_type VARCHAR(50)
)
COMPOUND SORTKEY(event_date, user_id);
-- Interleaved sort key for independent column filters
CREATE TABLE logs (
log_id BIGINT,
timestamp TIMESTAMP,
user_id BIGINT,
action VARCHAR(100)
)
INTERLEVED SORTKEY(timestamp, user_id, action);
3. Materialized Views:
-- Create materialized view
CREATE MATERIALIZED VIEW mv_daily_sales AS
SELECT
date_trunc('day', sale_date) as sale_date,
region,
product_category,
SUM(amount) as total_sales,
COUNT(*) as transaction_count
FROM fact_sales
JOIN dim_date ON fact_sales.date_key = dim_date.date_key
JOIN dim_product ON fact_sales.product_key = dim_product.product_key
GROUP BY 1, 2, 3;
-- Refresh
REFRESH MATERIALIZED VIEW mv_daily_sales;
Performance Comparison:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Query Performance Optimization Results β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Before Optimization: β
β β’ Query time: 45 seconds β
β β’ Data scanned: 100GB β
β β’ Cost: $0.50 per query β
β β
β After Optimization: β
β β’ Query time: 2.5 seconds (94% faster) β
β β’ Data scanned: 5GB (95% reduction) β
β β’ Cost: $0.025 per query (95% cheaper) β
β β
β Techniques Applied: β
β β Optimized DISTKEY on join columns β
β β COMPOUND SORTKEY aligned with WHERE clause β
β β Materialized view for aggregation β
β β Columnar encoding (AZ64, LZO) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Q4: How do you implement ETL/ELT pipelines with Redshift?
Answer:
ETL Pattern (Transform before Load):
# Glue ETL to Redshift
import boto3
from awsglue.context import GlueContext
from pyspark.context import SparkContext
glueContext = GlueContext(SparkContext.getOrCreate())
# Read from source
source = glueContext.create_dynamic_frame.from_catalog(
database="source_db",
table_name="raw_sales"
)
# Transform
transformed = ApplyMapping.apply(
frame=source,
mappings=[
("transaction_id", "string", "sale_id", "bigint"),
("amount", "double", "sale_amount", "decimal(10,2)"),
("date", "string", "sale_date", "date")
]
)
# Write to Redshift
glueContext.write_dynamic_frame.from_jdbc_conf(
frame=transformed,
catalog_connection="redshift-connection",
connection_options={
"dbtable": "fact_sales",
"database": "analytics"
}
)
ELT Pattern (Load then Transform):
-- Load raw data to staging
COPY staging_sales
FROM 's3://data-lake/raw/sales/'
IAM_ROLE 'arn:aws:iam::role/RedshiftRole'
FORMAT AS PARQUET;
-- Transform in Redshift
INSERT INTO fact_sales (date_key, customer_key, product_key, amount)
SELECT
d.date_key,
c.customer_key,
p.product_key,
s.amount
FROM staging_sales s
JOIN dim_date d ON s.sale_date = d.full_date
JOIN dim_customer c ON s.customer_id = c.customer_id
JOIN dim_product p ON s.product_id = p.product_id;
-- Vacuum and analyze
VACUUM fact_sales;
ANALYZE fact_sales;
ELT Architecture:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β ELT Pattern with Redshift β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Sources Load (ELT-L) Transform (ELT-T) β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ β
β β S3 ββββΆβ COPY βββββΆβ SQL/Merge β β
β β (Raw) β β (Staging) β β (Transform) β β
β βββββββββββββββ βββββββββββββββ ββββββββ¬βββββββ β
β β β
β βββββββββββββββ βββββββββββββββ β β
β β Glue ββββΆβ UNLOAD ββββββββββββ€ β
β β (Process) β β (To S3) β β β
β βββββββββββββββ βββββββββββββββ β β
β βΌ β
β βββββββββββββββ β
β β Final Tablesβ β
β β (Optimized) β β
β βββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Q5: How do you implement data security in Redshift?
Answer:
Security Architecture:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Redshift Security Architecture β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Network Security β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β’ VPC deployment β β
β β β’ Security groups β β
β β β’ Network ACLs β β
β β β’ VPC endpoints β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Access Control β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β’ IAM roles β β
β β β’ Database users/groups β β
β β β’ Row-level security β β
β β β’ Column-level security β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Data Protection β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β’ Encryption at rest (KMS) β β
β β β’ Encryption in transit (SSL) β β
β β β’ Column encryption β β
β β β’ Dynamic data masking β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Column-Level Security:
-- Create group
CREATE GROUP analysts;
-- Grant permissions
GRANT USAGE ON SCHEMA analytics TO GROUP analysts;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics TO GROUP analysts;
-- Revoke sensitive columns
REVOKE SELECT (ssn, salary) ON TABLE employees FROM GROUP analysts;
-- Row-level security
CREATE POLICY region_policy ON sales_data
FOR ALL
TO analysts
USING (region = current_user_region());
Dynamic Data Masking:
-- Create masking policy
CREATE MASKING POLICY email_mask ON customers
FOR COLUMN email
USING (
CASE
WHEN is_member_of('admin') THEN email
ELSE CONCAT('****', RIGHT(email, LENGTH(email) - INSTR(email, '@') + 1))
END
);
-- Apply masking
ALTER TABLE customers ALTER COLUMN email SET MASKING POLICY email_mask;
Q6: How do you implement data lakehouse architecture with Redshift Spectrum?
Answer:
Lakehouse Architecture:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Data Lakehouse with Redshift Spectrum β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Redshift Cluster β β
β β βββββββββββββββ βββββββββββββββ β β
β β β Local Tablesβ β External β β β
β β β (Hot Data) β β Tables β β β
β β βββββββββββββββ β (Cold) β β β
β β ββββββββ¬βββββββ β β
β βββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββ β
β β β
β βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β S3 Data Lake β β
β β βββββββββββββββ βββββββββββββββ βββββββββββββββ β β
β β β Hot β β Warm β β Cold β β β
β β β (SSD) β β (S3-IA) β β (Glacier) β β β
β β βββββββββββββββ βββββββββββββββ βββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
External Table Creation:
-- Create external schema
CREATE EXTERNAL SCHEMA data_lake
FROM DATA CATALOG
DATABASE 'lake_db'
IAM_ROLE 'arn:aws:iam::role/SpectrumRole';
-- Create external table
CREATE EXTERNAL TABLE data_lake.sales_ext (
transaction_id VARCHAR(50),
amount DECIMAL(10,2),
transaction_date DATE
)
PARTITIONED BY (year INT, month INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
LOCATION 's3://data-lake/silver/sales/';
-- Query across local and external
SELECT
l.customer_name,
SUM(e.amount) as total_sales
FROM local.customers l
JOIN data_lake.sales_ext e ON l.customer_id = e.customer_id
WHERE e.year = 2024 AND e.month = 1
GROUP BY l.customer_name;
Q7: How do you implement data quality in Redshift?
Answer:
Data Quality Framework:
-- Quality checks table
CREATE TABLE data_quality_checks (
check_id INT IDENTITY(1,1),
table_name VARCHAR(100),
check_type VARCHAR(50),
check_sql VARCHAR(1000),
expected_value VARCHAR(100),
actual_value VARCHAR(100),
passed BOOLEAN,
executed_at TIMESTAMP DEFAULT GETDATE()
);
-- Completeness check
INSERT INTO data_quality_checks (table_name, check_type, check_sql, passed)
SELECT
'fact_sales',
'completeness',
'SELECT COUNT(*) FROM fact_sales WHERE amount IS NULL',
CASE WHEN COUNT(*) = 0 THEN TRUE ELSE FALSE END
FROM fact_sales
WHERE amount IS NULL;
-- Referential integrity check
INSERT INTO data_quality_checks (table_name, check_type, check_sql, passed)
SELECT
'fact_sales',
'referential_integrity',
'SELECT COUNT(*) FROM fact_sales f LEFT JOIN dim_customer c ON f.customer_key = c.customer_key WHERE c.customer_key IS NULL',
CASE WHEN COUNT(*) = 0 THEN TRUE ELSE FALSE END
FROM fact_sales f
LEFT JOIN dim_customer c ON f.customer_key = c.customer_key
WHERE c.customer_key IS NULL;
-- Business rule validation
INSERT INTO data_quality_checks (table_name, check_type, check_sql, passed)
SELECT
'fact_sales',
'business_rule',
'amount >= 0',
CASE WHEN COUNT(*) = 0 THEN TRUE ELSE FALSE END
FROM fact_sales
WHERE amount < 0;
Quality Dashboard:
-- Quality summary view
CREATE VIEW vw_data_quality_summary AS
SELECT
table_name,
check_type,
COUNT(*) as total_checks,
SUM(CASE WHEN passed THEN 1 ELSE 0 END) as passed_checks,
SUM(CASE WHEN NOT passed THEN 1 ELSE 0 END) as failed_checks,
ROUND(SUM(CASE WHEN passed THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) as pass_rate
FROM data_quality_checks
GROUP BY table_name, check_type;
Q8: How do you implement real-time analytics with Redshift?
Answer:
Real-Time Architecture:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Real-Time Analytics with Redshift β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Data Sources Ingestion Processing β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ β
β β ApplicationsβββββββΆβ Kinesis ββββΆβ Lambda β β
β β IoT β β Data Firehoseβ β Transform β β
β βββββββββββββββ βββββββββββββββ ββββββββ¬βββββββ β
β β β
β βΌ β
β βββββββββββββββ β
β β Redshift β β
β β (Streaming β β
β β Ingest) β β
β ββββββββ¬βββββββ β
β β β
β ββββββββββββββββββββββΌβββββββββββ β
β βΌ βΌ βΌ β
β ββββββββββββ ββββββββββββ βββββββββββ
β βReal-time β β Batch β β API ββ
β βDashboard β β Reports β β Apps ββ
β ββββββββββββ ββββββββββββ βββββββββββ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Streaming Ingestion:
# Kinesis Data Firehose to Redshift
firehose = boto3.client('firehose')
firehose.create_delivery_stream(
DeliveryStreamName='redshift-streaming',
DeliveryStreamType='DirectPut',
RedshiftDestinationConfiguration={
'RoleARN': 'arn:aws:iam::role/firehose-redshift-role',
'ClusterJDBCURL': 'jdbc:redshift://cluster.xxx.us-east-1.redshift.amazonaws.com:5439/analytics',
'CopyCommand': {
'DataTableName': 'streaming_data',
'DataTableColumns': 'event_time, event_type, user_id, data'
},
'Username': 'admin',
'Password': 'password',
'S3Configuration': {
'RoleARN': 'arn:aws:iam::role/firehose-s3-role',
'BucketARN': 'arn:aws:s3:::firehose-backup-bucket',
'Prefix': 'redshift-staging/'
}
}
)
Q9: How do you implement data virtualization with Redshift Spectrum?
Answer:
Data Virtualization Architecture:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Data Virtualization with Spectrum β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Query Layer β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Redshift β β
β β βββββββββββββββ βββββββββββββββ β β
β β β Local β β External β β β
β β β Tables β β Tables β β β
β β β (Hot) β β (Cold) β β β
β β βββββββββββββββ ββββββββ¬βββββββ β β
β ββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββ β
β β β
β Storage Layer βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β S3 Data Lake β β
β β βββββββββββββββ βββββββββββββββ βββββββββββββββ β β
β β β Current β β Historical β β External β β β
β β β Data β β Data β β Sources β β β
β β β (Parquet) β β (ORC) β β (CSV/JSON) β β β
β β βββββββββββββββ βββββββββββββββ βββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Cross-Source Queries β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β SELECT * FROM local.customers c β β
β β JOIN external.sales_ext s ON c.id = s.customer_id β β
β β JOIN external.clickstream_ext cl ON c.id = cl.user_id β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
External Schema Configuration:
-- Multiple data sources
CREATE EXTERNAL SCHEMA sales_lake
FROM DATA CATALOG
DATABASE 'sales_db'
IAM_ROLE 'arn:aws:iam::role/SpectrumRole';
CREATE EXTERNAL SCHEMA clickstream_lake
FROM DATA CATALOG
DATABASE 'clickstream_db'
IAM_ROLE 'arn:aws:iam::role/SpectrumRole';
-- Cross-source query
SELECT
c.customer_name,
SUM(s.amount) as total_sales,
COUNT(cl.event_id) as page_views
FROM local.customers c
JOIN sales_lake.sales s ON c.customer_id = s.customer_id
LEFT JOIN clickstream_lake.events cl ON c.customer_id = cl.user_id
GROUP BY c.customer_name;
Q10: How do you implement data warehouse automation?
Answer:
Automation Framework:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Data Warehouse Automation β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β CI/CD Pipeline β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ β
β β Schema βββββΆβ Build βββββΆβ Deploy β β
β β Changes β β (Test) β β (Prod) β β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ β
β β
β Orchestration β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ β
β β Step βββββΆβ Lambda βββββΆβ Redshift β β
β β Functions β β Tasks β β Queries β β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ β
β β
β Monitoring β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ β
β β CloudWatch βββββΆβ Alarms βββββΆβ SNS β β
β β Metrics β β β β Alerts β β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Step Functions Orchestration:
{
"StartAt": "ExtractData",
"States": {
"ExtractData": {
"Type": "Task",
"Resource": "arn:aws:states:::glue:startJobRun.sync",
"Parameters": {
"JobName": "extract-sales-data"
},
"Next": "TransformData"
},
"TransformData": {
"Type": "Task",
"Resource": "arn:aws:states:::glue:startJobRun.sync",
"Parameters": {
"JobName": "transform-sales-data"
},
"Next": "LoadToRedshift"
},
"LoadToRedshift": {
"Type": "Task",
"Resource": "arn:aws:lambda:load-redshift",
"Next": "RunQualityChecks"
},
"RunQualityChecks": {
"Type": "Task",
"Resource": "arn:aws:lambda:quality-checks",
"Next": "PublishMetrics"
},
"PublishMetrics": {
"Type": "Task",
"Resource": "arn:aws:states:::cloudwatch:putMetricData.sync",
"End": true
}
}
}
Q11: How do you implement multi-tenant data warehousing?
Answer:
Multi-Tenant Architecture:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Multi-Tenant Data Warehouse β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Tenant Isolation Strategies: β
β β
β 1. Separate Databases β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ β
β β Tenant A DB β β Tenant B DB β β Tenant C DB β β
β β βββββββββββ β β βββββββββββ β β βββββββββββ β β
β β Tables β β Tables β β Tables β β
β β Users β β Users β β Users β β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ β
β β
β 2. Schema-Based Isolation β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Schema A Schema B Schema C β β
β β βββββββββββββββ βββββββββββββββ βββββββββββββββ β β
β β β Tables β β Tables β β Tables β β β
β β βββββββββββββββ βββββββββββββββ βββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β 3. Row-Level Security β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Table with tenant_id column β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β tenant_id β data... β β β
β β β A β row1, row2 β β β
β β β B β row3, row4 β β β
β β β A β row5 β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Row-Level Security Implementation:
-- Create tenant view
CREATE VIEW tenant_sales AS
SELECT * FROM sales_data
WHERE tenant_id = current_setting('app.tenant_id');
-- Set tenant context
SET app.tenant_id = 'tenant_a';
-- Query only returns tenant_a data
SELECT * FROM tenant_sales;
Q12: How do you implement disaster recovery for Redshift?
Answer:
DR Architecture:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Redshift Disaster Recovery β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Primary Region (us-east-1) DR Region (us-west-2) β
β βββββββββββββββββββββββ βββββββββββββββββββββββ β
β β Redshift Cluster βββββββββΆβ Redshift Cluster β β
β β (Primary) β Snap β (Standby) β β
β βββββββββββββββββββββββ shot βββββββββββββββββββββββ β
β β Transfer β
β β β β
β βΌ βΌ β
β βββββββββββββββββββββββ βββββββββββββββββββββββ β
β β S3 Data Lake βββββββββΆβ S3 Replica β β
β β (Versioning) β CRR β β β
β βββββββββββββββββββββββ βββββββββββββββββββββββ β
β β
β RPO: 1 hour (snapshot frequency) β
β RTO: 30 minutes (automated restore) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Automated Snapshot and Restore:
# Automated snapshot
def lambda_handler(event, context):
redshift = boto3.client('redshift')
# Create manual snapshot
redshift.create_cluster_snapshot(
SnapshotIdentifier=f'dr-snapshot-{datetime.now().strftime("%Y%m%d")}',
ClusterIdentifier='primary-cluster'
)
# Copy to DR region
redshift.copy_cluster_snapshot(
SourceSnapshotIdentifier='dr-snapshot-20240115',
SourceSnapshotClusterIdentifier='primary-cluster',
DestinationRegion='us-west-2',
SnapshotIdentifier='dr-snapshot-us-west-2'
)
# Automated restore in DR region
def restore_dr_cluster():
redshift_dr = boto3.client('redshift', region_name='us-west-2')
redshift_dr.restore_from_cluster_snapshot(
ClusterIdentifier='dr-cluster',
SnapshotIdentifier='dr-snapshot-us-west-2',
SnapshotClusterIdentifier='primary-cluster'
)
Q13: How do you optimize costs in Redshift?
Answer:
Cost Optimization Strategies:
1. Reserved Instances:
# Calculate savings
def calculate_ri_savings():
on_demand_hourly = 0.25 # ra3.xlplus
reserved_1yr = 0.16 # All upfront
reserved_3yr = 0.10 # All upfront
annual_on_demand = on_demand_hourly * 8760
annual_1yr = reserved_1yr * 8760
annual_3yr = reserved_3yr * 8760
savings_1yr = (annual_on_demand - annual_1yr) / annual_on_demand * 100
savings_3yr = (annual_on_demand - annual_3yr) / annual_on_demand * 100
return {
'on_demand_annual': annual_on_demand,
'1yr_ri_annual': annual_1yr,
'3yr_ri_annual': annual_3yr,
'savings_1yr': savings_1yr,
'savings_3yr': savings_3yr
}
2. Pause/Resume for Dev/Test:
# Pause cluster during off-hours
def pause_cluster():
redshift = boto3.client('redshift')
redshift.pause_cluster(ClusterIdentifier='dev-cluster')
# Resume cluster
def resume_cluster():
redshift = boto3.client('redshift')
redshift.resume_cluster(ClusterIdentifier='dev-cluster')
3. Right-Sizing:
-- Monitor query patterns
SELECT
query,
records,
elapsed_time,
bytes_scanned
FROM stl_query
WHERE starttime > DATEADD(day, -7, GETDATE())
ORDER BY bytes_scanned DESC;
-- Identify underutilized clusters
SELECT
cluster,
AVG(cpu_utilization) as avg_cpu,
AVG(memory_utilization) as avg_memory
FROM system_metrics
WHERE timestamp > DATEADD(day, -30, GETDATE())
GROUP BY cluster
HAVING AVG(cpu_utilization) < 20;
Cost Comparison:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Redshift Cost Comparison β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Instance: ra3.xlplus (4 vCPU, 32 GB RAM) β
β β
β On-Demand: $0.25/hr = $180/month β
β 1yr Reserved: $0.16/hr = $115/month (36% savings) β
β 3yr Reserved: $0.10/hr = $72/month (60% savings) β
β β
β With Auto-Suspend (10hr/day): β
β On-Demand: $0.25 * 10 = $75/month (58% savings) β
β 1yr Reserved: $0.16 * 10 = $48/month (73% savings) β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Q14: How do you implement data versioning in Redshift?
Answer:
Versioning Strategies:
1. Temporal Tables (SCD Type 2):
-- Create versioned table
CREATE TABLE dim_customer_versioned (
customer_key BIGINT PRIMARY KEY IDENTITY(1,1),
customer_id VARCHAR(50),
name VARCHAR(100),
email VARCHAR(200),
effective_date TIMESTAMP,
end_date TIMESTAMP,
is_current BOOLEAN
);
-- Merge with versioning
MERGE INTO dim_customer_versioned t
USING staging_customer s
ON t.customer_id = s.customer_id AND t.is_current = TRUE
WHEN MATCHED AND (t.name != s.name OR t.email != s.email) THEN
UPDATE SET
is_current = FALSE,
end_date = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
INSERT (customer_id, name, email, effective_date, end_date, is_current)
VALUES (s.customer_id, s.name, s.email, CURRENT_TIMESTAMP, NULL, TRUE);
2. Snapshot-Based Versioning:
-- Create snapshot table
CREATE TABLE sales_snapshot AS
SELECT
*,
GETDATE() as snapshot_date,
'v1' as version
FROM fact_sales;
-- Query specific version
SELECT * FROM sales_snapshot WHERE version = 'v1';
3. Git-Based Schema Versioning:
# Version control for schema changes
import git
repo = git.Repo('/path/to/schema-repo')
# Save current schema
def save_schema_version():
schema = get_current_schema()
with open('schema.sql', 'w') as f:
f.write(schema)
repo.index.add(['schema.sql'])
repo.index.commit(f'Schema update: {datetime.now().isoformat()}')
tag = repo.create_tag(f'v{get_version()}')
return tag
Q15: How do you implement monitoring for Redshift?
Answer:
Monitoring Architecture:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Redshift Monitoring Architecture β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Metrics Logs Alarms β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ β
β β CloudWatch β β System β β SNS β β
β β Metrics β β Tables β β Alerts β β
β ββββββββ¬βββββββ ββββββββ¬βββββββ ββββββββ¬βββββββ β
β β β β β
β ββββββββββββββββββΌβββββββββββββββββ β
β βΌ β
β βββββββββββββββ β
β β Dashboard β β
β β (QuickSight)β β
β βββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
CloudWatch Metrics:
cloudwatch = boto3.client('cloudwatch')
# Redshift metrics
metrics = [
'CPUUtilization',
'DatabaseConnections',
'HealthStatus',
'NetworkReceiveThroughput',
'NetworkTransmitThroughput',
'PercentageDiskSpaceUsed',
'QueryDuration',
'QueryRuntimeBreakdown'
]
# Create alarms
for metric in metrics:
cloudwatch.put_metric_alarm(
AlarmName=f'Redshift-{metric}',
MetricName=metric,
Namespace='AWS/Redshift',
Statistic='Average',
Period=300,
EvaluationPeriods=2,
Threshold=get_threshold(metric),
ComparisonOperator='GreaterThanThreshold',
AlarmActions=['arn:aws:sns:us-east-1:123456789:redshift-alerts'],
Dimensions=[
{'Name': 'ClusterIdentifier', 'Value': 'my-cluster'}
]
)
System Table Queries:
-- Long-running queries
SELECT
query,
username,
starttime,
elapsed,
records,
bytes
FROM stl_query
WHERE starttime > DATEADD(hour, -1, GETDATE())
AND elapsed > 60000 -- > 1 minute
ORDER BY elapsed DESC;
-- Disk usage
SELECT
owner,
diskno,
used,
capacity,
(used::float / capacity::float) * 100 as pct_used
FROM stv_partitions
WHERE host = (SELECT masterθηΉ FROM stv_mv_state LIMIT 1);
-- Query performance
SELECT
query,
service_class,
num_partitions,
elapsed_time,
queue_time
FROM stl_wlm_query
WHERE starttime > DATEADD(day, -7, GETDATE())
ORDER BY elapsed_time DESC;
Q16: How do you implement data governance in Redshift?
Answer:
Governance Framework:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Redshift Data Governance β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Metadata Management β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β’ Data catalog (Glue) β β
β β β’ Column descriptions β β
β β β’ Data lineage β β
β β β’ Business glossary β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Access Control β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β’ Role-based access control β β
β β β’ Column-level permissions β β
β β β’ Row-level security β β
β β β’ Data masking β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Audit & Compliance β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β’ CloudTrail integration β β
β β β’ Query logging β β
β β β’ Data access logs β β
β β β’ Compliance reports β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Data Classification:
-- Add classification tags
COMMENT ON TABLE customers IS 'Classification: PII - Contains customer personal information';
COMMENT ON COLUMN customers.ssn IS 'Classification: Highly Sensitive - Social Security Number';
COMMENT ON COLUMN customers.email IS 'Classification: Sensitive - Contact Information';
-- Query classification
SELECT
c.table_name,
c.column_name,
pg_description.description as classification
FROM information_schema.columns c
JOIN pg_description ON c.table_name::regclass = pg_description.objoid
WHERE pg_description.description LIKE '%Sensitive%';
Q17: How do you implement data warehouse testing?
Answer:
Testing Framework:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Data Warehouse Testing Framework β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Unit Tests β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β’ Transformation logic β β
β β β’ Data type conversions β β
β β β’ Business rule validation β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Integration Tests β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β’ ETL pipeline execution β β
β β β’ Data flow validation β β
β β β’ Cross-table consistency β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β Performance Tests β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β’ Query execution time β β
β β β’ Load performance β β
β β β’ Concurrency testing β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
SQL-Based Testing:
-- Test completeness
CREATE OR REPLACE PROCEDURE test_completeness(table_name VARCHAR)
AS $$
DECLARE
null_count INT;
BEGIN
EXECUTE 'SELECT COUNT(*) FROM ' || table_name || ' WHERE amount IS NULL'
INTO null_count;
IF null_count > 0 THEN
RAISE EXCEPTION 'Completeness test failed: % null values', null_count;
END IF;
END;
$$ LANGUAGE plpgsql;
-- Test referential integrity
CREATE OR REPLACE PROCEDURE test_referential_integrity()
AS $$
DECLARE
orphan_count INT;
BEGIN
SELECT COUNT(*) INTO orphan_count
FROM fact_sales f
LEFT JOIN dim_customer c ON f.customer_key = c.customer_key
WHERE c.customer_key IS NULL;
IF orphan_count > 0 THEN
RAISE EXCEPTION 'Referential integrity test failed: % orphan records', orphan_count;
END IF;
END;
$$ LANGUAGE plpgsql;
Q18: How do you implement Redshift Spectrum for federated queries?
Answer:
Federated Query Architecture:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Federated Query Architecture β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Redshift Cluster β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β Spectrum β β β
β β β βββββββββββββββ βββββββββββββββ ββββββββββββ β β β
β β β β S3 β β RDS β β External β β β β
β β β β (Data β β (MySQL) β β (API) β β β β
β β β β Lake) β β β β β β β β
β β β βββββββββββββββ βββββββββββββββ ββββββββββββ β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
External Schema Configuration:
-- Connect to RDS MySQL
CREATE EXTERNAL SCHEMA mysql_source
FROM DATA CATALOG
DATABASE 'mysql_db'
IAM_ROLE 'arn:aws:iam::role/SpectrumRole'
URI 'jdbc:mysql://rds-instance.xxx.us-east-1.rds.amazonaws.com:3306/mydb';
-- Query across sources
SELECT
r.customer_name,
s.total_spent,
m.login_count
FROM mysql_source.customers r
JOIN local.sales_summary s ON r.id = s.customer_id
JOIN spectrum.clickstream m ON r.id = m.user_id;
Q19: How do you implement Redshift for IoT analytics?
Answer:
IoT Analytics Architecture:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Redshift IoT Analytics Architecture β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β IoT Devices Ingestion Storage β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ β
β β Sensors ββββββΆβ IoT Core ββββΆβ Kinesis β β
β β (MQTT) β β β β Streams β β
β βββββββββββββββ βββββββββββββββ ββββββββ¬βββββββ β
β β β
β βββββββββββββββ βββββββββββββββ β β
β β Gateways ββββββΆβ IoT Analyticsβββββββββββ€ β
β βββββββββββββββ βββββββββββββββ β β
β βΌ β
β βββββββββββββββ β
β β S3 β β
β β (Raw) β β
β ββββββββ¬βββββββ β
β β β
β βΌ β
β βββββββββββββββ β
β β Redshift β β
β β (Analytics)β β
β ββββββββ¬βββββββ β
β β β
β ββββββββββββββββββββΌβββββββββββ β
β βΌ βΌ βΌ β
β ββββββββββββ ββββββββββββ βββββββββββ
β β TimeSeriesβ β Predictiveβ β Alerts ββ
β β Analysis β β Maintenanceβ β ββ
β ββββββββββββ ββββββββββββ βββββββββββ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Time-Series Optimization:
-- Time-series table design
CREATE TABLE iot_readings (
device_id VARCHAR(50),
reading_time TIMESTAMP,
temperature DECIMAL(5,2),
humidity DECIMAL(5,2),
pressure DECIMAL(7,2)
)
DISTKEY(device_id)
COMPOUND SORTKEY(reading_time, device_id);
-- Partition by time range
CREATE EXTERNAL TABLE iot_readings_ext (
device_id VARCHAR(50),
reading_time TIMESTAMP,
temperature DECIMAL(5,2),
humidity DECIMAL(5,2),
pressure DECIMAL(7,2)
)
PARTITIONED BY (year INT, month INT, day INT, hour INT)
STORED AS PARQUET
LOCATION 's3://iot-data/readings/';
Q20: How do you implement Redshift for machine learning?
Answer:
ML Integration Architecture:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Redshift ML Architecture β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Data Preparation Model Training Deployment β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ β
β β Redshift βββββββββΆβ SageMaker βββββββΆβ Endpoint β β
β β (SQL) β UNLOADβ AutoPilot β β (Real-time) β β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ β
β β
β Prediction Integration Monitoring β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ β
β β CREATE MODELβββββββββΆβ SQL Queries βββββββΆβ CloudWatch β β
β β (Predict) β β β β Metrics β β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Redshift ML Implementation:
-- Create ML model
CREATE MODEL churn_prediction
FROM (
SELECT
customer_id,
tenure_months,
monthly_charges,
total_charges,
churned
FROM customer_data
)
TARGET churned
FUNCTION predict_churn
IAM_ROLE 'arn:aws:iam::role/RedshiftMLRole'
SETTINGS (
S3_BUCKET 'ml-artifacts',
MAX_RUNTIME 3600
);
-- Use model for predictions
SELECT
customer_id,
predict_churn(tenure_months, monthly_charges, total_charges) as churn_probability
FROM new_customers
WHERE predict_churn(tenure_months, monthly_charges, total_charges) > 0.7;
Q21: How do you implement data quality monitoring in Redshift?
Answer:
Quality Monitoring Framework:
-- Quality metrics table
CREATE TABLE data_quality_metrics (
metric_id INT IDENTITY(1,1),
table_name VARCHAR(100),
metric_name VARCHAR(100),
metric_value DECIMAL(10,2),
threshold DECIMAL(10,2),
passed BOOLEAN,
measured_at TIMESTAMP DEFAULT GETDATE()
);
-- Completeness metric
INSERT INTO data_quality_metrics (table_name, metric_name, metric_value, threshold, passed)
SELECT
'fact_sales',
'completeness_amount',
(COUNT(*) - COUNT(amount))::DECIMAL / COUNT(*) * 100,
0,
CASE WHEN (COUNT(*) - COUNT(amount)) = 0 THEN TRUE ELSE FALSE END
FROM fact_sales;
-- Uniqueness metric
INSERT INTO data_quality_metrics (table_name, metric_name, metric_value, threshold, passed)
SELECT
'fact_sales',
'uniqueness_sale_key',
COUNT(*) - COUNT(DISTINCT sale_key),
0,
CASE WHEN COUNT(*) = COUNT(DISTINCT sale_key) THEN TRUE ELSE FALSE END
FROM fact_sales;
-- Freshness metric
INSERT INTO data_quality_metrics (table_name, metric_name, metric_value, threshold, passed)
SELECT
'fact_sales',
'freshness_hours',
EXTRACT(EPOCH FROM (GETDATE() - MAX(sale_date))) / 3600,
24,
CASE WHEN EXTRACT(EPOCH FROM (GETDATE() - MAX(sale_date))) / 3600 <= 24 THEN TRUE ELSE FALSE END
FROM fact_sales;
Q22: How do you implement Redshift for real-time dashboards?
Answer:
Real-Time Dashboard Architecture:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Real-Time Dashboard Architecture β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Data Sources Ingestion Dashboard β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ β
β β ApplicationsβββββββΆβ Kinesis ββββΆβ QuickSight β β
β β APIs β β Firehose β β Dashboard β β
β βββββββββββββββ ββββββββ¬βββββββ βββββββββββββββ β
β β β
β βΌ β
β βββββββββββββββ β
β β Redshift β β
β β (Materializedβ β
β β Views) β β
β βββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Materialized View for Real-Time:
-- Real-time aggregation
CREATE MATERIALIZED VIEW mv_realtime_sales AS
SELECT
date_trunc('minute', sale_date) as sale_minute,
region,
product_category,
COUNT(*) as transaction_count,
SUM(amount) as total_sales,
AVG(amount) as avg_sale
FROM fact_sales
WHERE sale_date >= DATEADD(minute, -5, GETDATE())
GROUP BY 1, 2, 3;
-- Auto-refresh
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_realtime_sales;
Q23: How do you implement Redshift data sharing?
Answer:
Data Sharing Architecture:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Redshift Data Sharing β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Producer Cluster Consumer Clusters β
β βββββββββββββββββββββββ βββββββββββββββββββββββ β
β β Primary Cluster β β Analytics Cluster β β
β β βββββββββββββββ β β βββββββββββββββ β β
β β β Shared βββββΌβββΆβ β External β β β
β β β Database β β β β Tables β β β
β β βββββββββββββββ β β βββββββββββββββ β β
β βββββββββββββββββββββββ βββββββββββββββββββββββ β
β β
β βββββββββββββββββββββββ βββββββββββββββββββββββ β
β β Reporting Cluster β β ML Cluster β β
β β βββββββββββββββ β β βββββββββββββββ β β
β β β External β β β β External β β β
β β β Tables β β β β Tables β β β
β β βββββββββββββββ β β βββββββββββββββ β β
β βββββββββββββββββββββββ βββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Data Sharing Setup:
-- Producer: Create share
CREATE DATABASE SHARE sales_share;
-- Grant usage
GRANT USAGE ON DATABASE sales_share TO GROUP analytics_team;
-- Add tables to share
ALTER DATABASE sales_share ADD TABLE fact_sales;
ALTER DATABASE sales_share ADD TABLE dim_customer;
-- Consumer: Create database from share
CREATE DATABASE sales_analytics FROM SHARE producer-cluster:sales_share;
Q24: How do you implement Redshift for financial analytics?
Answer:
Financial Analytics Architecture:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Financial Analytics Architecture β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Data Sources Processing Analytics β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ β
β β TransactionsβββββββΆβ ETL PipelineββββΆβ Redshift β β
β β Market Data β β (Glue) β β (DW) β β
β β References β β β β β β
β βββββββββββββββ βββββββββββββββ ββββββββ¬βββββββ β
β β β
β ββββββββββββββββββββββΌβββββββββββ β
β βΌ βΌ βΌ β
β ββββββββββββ ββββββββββββ βββββββββββ
β β Risk β β Complianceβ βReports ββ
β β Analysis β β Reporting β β ββ
β ββββββββββββ ββββββββββββ βββββββββββ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Financial Data Modeling:
-- Financial fact table
CREATE TABLE fact_transactions (
transaction_key BIGINT PRIMARY KEY IDENTITY(1,1),
date_key INT NOT NULL,
account_key INT NOT NULL,
counterparty_key INT NOT NULL,
instrument_key INT NOT NULL,
amount DECIMAL(18,4),
currency VARCHAR(3),
exchange_rate DECIMAL(10,6),
usd_equivalent DECIMAL(18,4),
transaction_type VARCHAR(20),
settlement_status VARCHAR(20)
)
DISTSTYLE KEY
DISTKEY(account_key)
COMPOUND SORTKEY(date_key, account_key);
Q25: How do you implement Redshift for healthcare analytics?
Answer:
Healthcare Analytics Architecture:
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Healthcare Analytics Architecture β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β Data Sources Processing Analytics β
β βββββββββββββββ βββββββββββββββ βββββββββββββββ β
β β EHR Systems βββββββΆβ De-identify ββββΆβ Redshift β β
β β Claims Data β β & Transform β β (HIPAA β β
β β Clinical β β β β Compliant) β β
β βββββββββββββββ βββββββββββββββ ββββββββ¬βββββββ β
β β β
β ββββββββββββββββββββββΌβββββββββββ β
β βΌ βΌ βΌ β
β ββββββββββββ ββββββββββββ βββββββββββ
β β Clinical β β Populationβ βQuality ββ
β β Research β β Health β βMetrics ββ
β ββββββββββββ ββββββββββββ βββββββββββ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
HIPAA-Compliant Implementation:
-- De-identification function
CREATE OR REPLACE FUNCTION de_identify_ssn(ssn VARCHAR)
RETURNS VARCHAR
AS $$
BEGIN
RETURN CONCAT('XXX-XX-', RIGHT(ssn, 4));
END;
$$ LANGUAGE plpgsql;
-- Row-level security for patient data
CREATE POLICY patient_access ON patient_records
FOR ALL
TO healthcare_team
USING (provider_id = current_user_provider_id());
-- Audit logging
CREATE TABLE access_audit (
access_id INT IDENTITY(1,1),
user_name VARCHAR(100),
table_name VARCHAR(100),
access_type VARCHAR(20),
access_time TIMESTAMP DEFAULT GETDATE(),
query_text VARCHAR(10000)
);
Summary
Mastering AWS data warehousing requires understanding:
- Architecture Design: Star/snowflake schemas, distribution strategies, sort keys
- Performance Optimization: Materialized views, compression, query tuning
- Security & Compliance: Column/row-level security, encryption, HIPAA
- Cost Management: Reserved instances, auto-suspend, right-sizing
- Operations: Monitoring, testing, disaster recovery
These concepts form the foundation for building scalable, performant, and secure data warehouses on AWS.