Athena Architecture
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β AMAZON ATHENA ARCHITECTURE β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β QUERY CLIENTS β β
β β ββββββββββββ ββββββββββββ ββββββββββββ ββββββββββββ β β
β β β Athena β β QuickSightβ β Jupyter β β JDBC/ODBCβ β β
β β β Console β β β β Notebook β β Driver β β β
β β ββββββ¬ββββββ ββββββ¬ββββββ ββββββ¬ββββββ ββββββ¬ββββββ β β
β βββββββββΌβββββββββββββββΌβββββββββββββββΌβββββββββββββββΌββββββββββββββββ β
β βΌ βΌ βΌ βΌ β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β ATHENA ENGINE (Serverless, Auto-scaling) β β
β β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β Query Processing β β β
β β β β’ SQL parsing and optimization β β β
β β β β’ Distributed execution (Presto/Trino) β β β
β β β β’ Auto-scaling compute β β β
β β β β’ Result caching β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β DATA CATALOG (Glue Data Catalog) β β β
β β β β β β
β β β β’ Database definitions β β β
β β β β’ Table schemas β β β
β β β β’ Partition metadata β β β
β β β β’ SerDe information β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββ β
β β β
β βββββββββββββββββββΌββββββββββββββββββ β
β βΌ βΌ βΌ β
β βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ β
β β S3 Data Lake β β Federated β β Results β β
β β (Primary) β β Sources β β (S3) β β
β β β β β β β β
β β Parquet/ORC β β RDS/Redshift β β Query results β β
β β JSON/CSV β β DynamoDB β β cached 30 days β β
β β Avro β β ElastiCache β β β β
β βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Athena Query Examples
Basic Queries
-- Standard SQL query on S3 data
SELECT
customer_id,
product_category,
COUNT(*) as order_count,
SUM(amount) as total_amount
FROM orders
WHERE year = 2024 AND month = 1
GROUP BY customer_id, product_category
ORDER BY total_amount DESC
LIMIT 100;
-- Query with complex predicates
SELECT
DATE_TRUNC('hour', event_time) as hour_bucket,
event_type,
COUNT(*) as event_count,
COUNT(DISTINCT user_id) as unique_users
FROM clickstream_events
WHERE year = 2024
AND month = 1
AND day = 15
AND event_type IN ('page_view', 'click', 'purchase')
GROUP BY DATE_TRUNC('hour', event_time), event_type
ORDER BY hour_bucket;
CTAS (Create Table As Select)
-- Create optimized table from query results
CREATE TABLE analytics_db.daily_sales_summary
WITH (
format = 'PARQUET',
parquet_compression = 'SNAPPY',
partitioned_by = ARRAY['year', 'month', 'day'],
external_location = 's3://data-lake-curated/daily-sales-summary/'
) AS
SELECT
sale_date,
customer_id,
product_id,
SUM(amount) as total_amount,
COUNT(*) as transaction_count,
AVG(amount) as avg_amount
FROM raw_db.sales
WHERE year = 2024
GROUP BY sale_date, customer_id, product_id;
-- Create table with specific distribution
CREATE TABLE analytics_db.customer_metrics
WITH (
format = 'PARQUET',
parquet_compression = 'SNAPPY',
bucketed_by = ARRAY['customer_id'],
bucket_count = 16,
external_location = 's3://data-lake-curated/customer-metrics/'
) AS
SELECT
customer_id,
COUNT(DISTINCT order_id) as total_orders,
SUM(amount) as lifetime_value,
MIN(order_date) as first_order_date,
MAX(order_date) as last_order_date
FROM raw_db.orders
GROUP BY customer_id;
UNLOAD (Export Query Results)
-- Export query results to S3
UNLOAD (
'SELECT * FROM analytics_db.daily_sales_summary WHERE year = 2024 AND month = 1'
)
TO 's3://data-export/sales-summary/2024/01/'
FORMAT PARQUET
PARTITIONED BY (day)
IAM_ROLE 'arn:aws:iam::123456789012:role/AthenaUnloadRole'
MAX_FILE_SIZE = '128 MB';
Athena Workgroups
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β ATHENA WORKGROUPS β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β WORKGROUP: analytics-team β β
β β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β QUERY CONFIGURATION β β β
β β β β β β
β β β β’ Engine version: Athena v3 β β β
β β β β’ Results location: s3://athena-results/analytics/ β β β
β β β β’ Encryption: SSE-S3 or SSE-KMS β β β
β β β β’ Bytes scanned cutoff: 10 GB β β β
β β β β’ Query timeout: 30 minutes β β β
β β β β’ Enforcement: Required (prevent full scans) β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β QUERY METRICS β β β
β β β β β β
β β β β’ Query count β β β
β β β β’ Data scanned (GB) β β β
β β β β’ Execution time (ms) β β β
β β β β’ Cost estimate β β β
β β β β’ Query plan (in CloudWatch) β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β WORKGROUP: ad-hoc-analysis β β
β β β β
β β β’ Higher cutoff: 100 GB β β
β β β’ No enforcement (for exploration) β β
β β β’ Separate billing β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Workgroup Configuration
import boto3
athena = boto3.client('athena')
# Create workgroup
response = athena.create_work_group(
Name='analytics-team',
Description='Workgroup for analytics team',
Configuration={
'ResultConfiguration': {
'OutputLocation': 's3://athena-results/analytics/',
'EncryptionConfiguration': {
'EncryptionOption': 'SSE_S3'
}
},
'EnforceWorkGroupConfiguration': True,
'PublishCloudWatchMetricsEnabled': True,
'BytesScannedCutoffPerQuery': 10737418240, # 10 GB
'RequesterPaysEnabled': False,
'EngineVersion': {
'SelectedEngineVersion': 'AUTO',
'EffectiveEngineVersion': 'Athena engine version 3'
}
},
Tags={'Team': 'analytics', 'Environment': 'production'}
)
Athena Federated Query
# Federated query to RDS MySQL
federated_query = """
SELECT
r.customer_id,
r.customer_name,
r.email,
a.order_count,
a.total_spend
FROM mysql_catalog.production_db.customers r
LEFT JOIN (
SELECT
customer_id,
COUNT(*) as order_count,
SUM(amount) as total_spend
FROM athenaanalytics_db.orders
WHERE year = 2024
GROUP BY customer_id
) a ON r.customer_id = a.customer_id
WHERE r.created_at > DATE('2023-01-01')
"""
# Execute federated query
response = athena.start_query_execution(
QueryString=federated_query,
QueryExecutionContext={
'Database': 'default'
},
WorkGroup='analytics-team'
)
# Federated query to DynamoDB
dynamodb_query = """
SELECT
d.customer_id,
d.profile_data,
o.order_count
FROM dynamodb_catalog.customer_profiles d
JOIN athenaanalytics_db.orders_summary o
ON d.customer_id = o.customer_id
WHERE d.status = 'active'
"""
βΉοΈ
Pro Tip: Use federated queries sparingly. They scan external data sources directly, which can be slow and expensive. For frequent queries, use ETL to materialize results in S3.
Cost Optimization
| Strategy | Implementation | Savings |
|---|---|---|
| Columnar Formats | Use Parquet/ORC | 50-90% less data scanned |
| Partitioning | Partition by date keys | Reduce scanned partitions |
| Predicate Filtering | Always use WHERE clauses | Avoid full table scans |
| Bucketing | Bucket by high-cardinality keys | Reduce shuffle |
| Result Caching | Athena caches results automatically | Free |
| Workgroup Limits | Set bytes scanned cutoff | Prevent runaway costs |
-- Cost-optimized query example
-- Good: Uses partitions and filters
SELECT *
FROM orders
WHERE year = 2024
AND month = 1
AND day = 15
AND customer_id = 'CUST-001';
-- Bad: Full table scan (expensive!)
SELECT *
FROM orders
WHERE customer_id = 'CUST-001';
β οΈ
Cost Warning: Athena charges 5. Always partition data and filter queries to minimize cost.
Interview Questions & Answers
Q1: How does Athena pricing work?
Answer: Athena charges $5 per TB of data scanned. Key cost optimization strategies:
- Use columnar formats (Parquet/ORC)
- Partition data by query patterns
- Always use WHERE clauses to filter
- Set workgroup bytes scanned limits
Q2: What is CTAS and when should you use it?
Answer: CTAS (Create Table As Select) creates a new table from query results. Use it to:
- Materialize aggregated results
- Create optimized table formats
- Partition data for faster queries
- Convert formats (CSV to Parquet)
Q3: What is the difference between Athena v2 and v3?
Answer:
- Athena v2: Presto 0.217, standard features
- Athena v3: Trino 351, improved performance, new SQL functions
Athena v3 offers 2-10x better performance for most queries.
Q4: How do federated queries work in Athena?
Answer: Federated queries use Lambda connectors to query external data sources:
- Install connector (e.g., MySQL, DynamoDB)
- Register data source in Glue Catalog
- Query using catalog prefix (e.g., mysql_catalog.db.table)
Data is scanned at the source and filtered before returning.
Q5: What is the maximum query result size in Athena?
Answer:
- Console: 10 MB display limit (can download full results)
- API: 100 MB result set limit
- UNLOAD: No limit (writes directly to S3)
For large result sets, use UNLOAD to export to S3.
Summary
Amazon Athena is the go-to serverless query service for data lakes. Key takeaways:
- Serverless: No infrastructure management, auto-scaling
- Pricing: $5 per TB scanned - optimize with columnar formats
- Partitioning: Essential for cost and performance
- CTAS: Materialize query results as optimized tables
- Federated: Query external databases via Lambda connectors
- Workgroups: Manage access, costs, and configurations
- Formats: Prefer Parquet/ORC for 50-90% cost reduction