🔍 Amazon Athena

Master Athena serverless queries, federated access, workgroups, CTAS, and cost optimization.

Athena Architecture

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                    AMAZON ATHENA ARCHITECTURE                                │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  QUERY CLIENTS                                                      │    │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐          │    │
│  │  │ Athena   │  │ QuickSight│  │ Jupyter  │  │ JDBC/ODBC│          │    │
│  │  │ Console  │  │          │  │ Notebook │  │ Driver   │          │    │
│  │  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘          │    │
│  └───────┼──────────────┼──────────────┼──────────────┼───────────────┘    │
│          ▼              ▼              ▼              ▼                     │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  ATHENA ENGINE (Serverless, Auto-scaling)                           │    │
│  │                                                                     │    │
│  │  ┌───────────────────────────────────────────────────────────────┐  │    │
│  │  │  Query Processing                                              │  │    │
│  │  │  • SQL parsing and optimization                                │  │    │
│  │  │  • Distributed execution (Presto/Trino)                        │  │    │
│  │  │  • Auto-scaling compute                                        │  │    │
│  │  │  • Result caching                                              │  │    │
│  │  └───────────────────────────────────────────────────────────────┘  │    │
│  │                                                                     │    │
│  │  ┌───────────────────────────────────────────────────────────────┐  │    │
│  │  │  DATA CATALOG (Glue Data Catalog)                             │  │    │
│  │  │                                                               │  │    │
│  │  │  • Database definitions                                       │  │    │
│  │  │  • Table schemas                                              │  │    │
│  │  │  • Partition metadata                                         │  │    │
│  │  │  • SerDe information                                          │  │    │
│  │  └───────────────────────────────────────────────────────────────┘  │    │
│  └─────────────────────────────┬───────────────────────────────────────┘    │
│                                │                                           │
│              ┌─────────────────┼─────────────────┐                         │
│              ▼                 ▼                 ▼                         │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐            │
│  │  S3 Data Lake   │  │  Federated      │  │  Results        │            │
│  │  (Primary)      │  │  Sources        │  │  (S3)           │            │
│  │                 │  │                 │  │                 │            │
│  │  Parquet/ORC    │  │  RDS/Redshift   │  │  Query results  │            │
│  │  JSON/CSV       │  │  DynamoDB       │  │  cached 30 days │            │
│  │  Avro           │  │  ElastiCache    │  │                 │            │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘            │
└─────────────────────────────────────────────────────────────────────────────┘

Athena Query Examples

Basic Queries

-- Standard SQL query on S3 data
SELECT
    customer_id,
    product_category,
    COUNT(*) as order_count,
    SUM(amount) as total_amount
FROM orders
WHERE year = 2024 AND month = 1
GROUP BY customer_id, product_category
ORDER BY total_amount DESC
LIMIT 100;

-- Query with complex predicates
SELECT
    DATE_TRUNC('hour', event_time) as hour_bucket,
    event_type,
    COUNT(*) as event_count,
    COUNT(DISTINCT user_id) as unique_users
FROM clickstream_events
WHERE year = 2024
  AND month = 1
  AND day = 15
  AND event_type IN ('page_view', 'click', 'purchase')
GROUP BY DATE_TRUNC('hour', event_time), event_type
ORDER BY hour_bucket;

CTAS (Create Table As Select)

-- Create optimized table from query results
CREATE TABLE analytics_db.daily_sales_summary
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    partitioned_by = ARRAY['year', 'month', 'day'],
    external_location = 's3://data-lake-curated/daily-sales-summary/'
) AS
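SELECT
    sale_date,
    customer_id,
    product_id,
    SUM(amount) as total_amount,
    COUNT(*) as transaction_count,
    AVG(amount) as avg_amount,
    -- Partition columns must come last in the SELECT list
    -- (assumes raw_db.sales carries year, month, and day columns)
    year,
    month,
    day
FROM raw_db.sales
-- One CTAS can write at most 100 partitions, so load one month at a time
WHERE year = 2024 AND month = 1
GROUP BY sale_date, customer_id, product_id, year, month, day;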

-- Create a bucketed table (rows hashed on customer_id into 16 buckets)
CREATE TABLE analytics_db.customer_metrics
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    bucketed_by = ARRAY['customer_id'],
    bucket_count = 16,
    external_location = 's3://data-lake-curated/customer-metrics/'
) AS
SELECT
    customer_id,
    COUNT(DISTINCT order_id) as total_orders,
    SUM(amount) as lifetime_value,
    MIN(order_date) as first_order_date,
    MAX(order_date) as last_order_date
FROM raw_db.orders
GROUP BY customer_id;

UNLOAD (Export Query Results)

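-- Export query results to S3
-- Athena's UNLOAD takes an unquoted SELECT and a WITH (...) property list.
-- Access is governed by the caller's IAM permissions; there is no IAM_ROLE clause.
-- Partition columns must be the last columns in the SELECT list.
UNLOAD (
    SELECT * FROM analytics_db.daily_sales_summary WHERE year = 2024 AND month = 1
)
TO 's3://data-export/sales-summary/2024/01/'
WITH (
    format = 'PARQUET',
    compression = 'SNAPPY',
    partitioned_by = ARRAY['day']
);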

Athena Workgroups

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                    ATHENA WORKGROUPS                                          │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  WORKGROUP: analytics-team                                          │    │
│  │                                                                     │    │
│  │  ┌───────────────────────────────────────────────────────────────┐  │    │
│  │  │  QUERY CONFIGURATION                                          │  │    │
│  │  │                                                               │  │    │
│  │  │  • Engine version: Athena v3                                  │  │    │
│  │  │  • Results location: s3://athena-results/analytics/           │  │    │
│  │  │  • Encryption: SSE-S3 or SSE-KMS                              │  │    │
│  │  │  • Bytes scanned cutoff: 10 GB                                │  │    │
│  │  │  • Query timeout: 30 minutes                                  │  │    │
│  │  │  • Enforcement: Required (overrides client settings)          │  │    │
│  │  └───────────────────────────────────────────────────────────────┘  │    │
│  │                                                                     │    │
│  │  ┌───────────────────────────────────────────────────────────────┐  │    │
│  │  │  QUERY METRICS                                                │  │    │
│  │  │                                                               │  │    │
│  │  │  • Query count                                                │  │    │
│  │  │  • Data scanned (GB)                                          │  │    │
│  │  │  • Execution time (ms)                                        │  │    │
│  │  │  • Cost estimate                                              │  │    │
│  │  │  • Query plan (in CloudWatch)                                 │  │    │
│  │  └───────────────────────────────────────────────────────────────┘  │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  WORKGROUP: ad-hoc-analysis                                         │    │
│  │                                                                     │    │
│  │  • Higher cutoff: 100 GB                                           │    │
│  │  • No enforcement (for exploration)                                 │    │
│  │  • Separate billing                                                │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────┘

Workgroup Configuration

import boto3

athena = boto3.client('athena')

# Create workgroup
response = athena.create_work_group(
    Name='analytics-team',
    Description='Workgroup for analytics team',
    Configuration={
        'ResultConfiguration': {
            'OutputLocation': 's3://athena-results/analytics/',
            'EncryptionConfiguration': {
                'EncryptionOption': 'SSE_S3'
            }
        },
        'EnforceWorkGroupConfiguration': True,
        'PublishCloudWatchMetricsEnabled': True,
        'BytesScannedCutoffPerQuery': 10737418240,  # 10 GB
        'RequesterPaysEnabled': False,
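        'EngineVersion': {
            # EffectiveEngineVersion is read-only and set by the service
            'SelectedEngineVersion': 'Athena engine version 3'
        }
    },
    # Tags are passed as a list of Key/Value pairs, not a plain dict
    Tags=[
        {'Key': 'Team', 'Value': 'analytics'},
        {'Key': 'Environment', 'Value': 'production'}
    ]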
)
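
The two workgroups in the diagram differ in only a few settings, so their configuration can be generated from one small helper. The sketch below is one possible way to do that: `build_workgroup_config` and its parameters are hypothetical names, the ad-hoc results path is an assumed example, and the returned dictionary is meant to be passed as `Configuration` to `create_work_group`.

```python
GIB = 1024 ** 3  # bytes per GiB; the 10 GB cutoff above equals 10 * GIB


def build_workgroup_config(results_location, cutoff_gb, enforce=True):
    """Return a Configuration dict for create_work_group (hypothetical helper)."""
    return {
        "ResultConfiguration": {
            "OutputLocation": results_location,
            "EncryptionConfiguration": {"EncryptionOption": "SSE_S3"},
        },
        # When True, workgroup settings override client-side settings.
        "EnforceWorkGroupConfiguration": enforce,
        "PublishCloudWatchMetricsEnabled": True,
        "BytesScannedCutoffPerQuery": cutoff_gb * GIB,
    }


analytics = build_workgroup_config("s3://athena-results/analytics/", cutoff_gb=10)
# Assumed location for the ad-hoc workgroup; it is not given in the text above.
ad_hoc = build_workgroup_config(
    "s3://athena-results/ad-hoc/", cutoff_gb=100, enforce=False
)
```

Keeping the cutoff in whole gigabytes avoids hand-typing byte counts such as 10737418240.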

Athena Federated Query

# Federated query to RDS MySQL
federated_query = """
SELECT
    r.customer_id,
    r.customer_name,
    r.email,
    a.order_count,
    a.total_spend
FROM mysql_catalog.production_db.customers r
LEFT JOIN (
    SELECT
        customer_id,
        COUNT(*) as order_count,
        SUM(amount) as total_spend
    FROM awsdatacatalog.analytics_db.orders
    WHERE year = 2024
    GROUP BY customer_id
) a ON r.customer_id = a.customer_id
WHERE r.created_at > DATE('2023-01-01')
"""

# Execute federated query
response = athena.start_query_execution(
    QueryString=federated_query,
    QueryExecutionContext={
        'Database': 'default'
    },
    WorkGroup='analytics-team'
)

# Federated query to DynamoDB
dynamodb_query = """
SELECT
    d.customer_id,
    d.profile_data,
    o.order_count
FROM dynamodb_catalog."default".customer_profiles d
JOIN awsdatacatalog.analytics_db.orders_summary o
ON d.customer_id = o.customer_id
WHERE d.status = 'active'
"""
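
`start_query_execution` returns as soon as the query is queued, so callers usually poll for completion before reading results. The following is a minimal polling sketch, not a production client: `wait_for_query`, `poll_seconds`, and `max_polls` are hypothetical names, and the only Athena call it relies on is `get_query_execution`.

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_query(client, query_execution_id, poll_seconds=1.0, max_polls=600):
    """Poll get_query_execution until the query reaches a terminal state.

    Returns (state, bytes_scanned). Raises TimeoutError if the query is
    still queued or running after max_polls attempts.
    """
    for _ in range(max_polls):
        execution = client.get_query_execution(
            QueryExecutionId=query_execution_id
        )["QueryExecution"]
        state = execution["Status"]["State"]
        if state in TERMINAL_STATES:
            scanned = execution.get("Statistics", {}).get("DataScannedInBytes", 0)
            return state, scanned
        time.sleep(poll_seconds)
    raise TimeoutError(
        f"query {query_execution_id} still running after {max_polls} polls"
    )
```

With the client from the earlier snippet, a call would look like `state, scanned = wait_for_query(athena, response['QueryExecutionId'])`. The returned byte count is what the query is billed on.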

ℹ️

Pro Tip: Use federated queries sparingly. Each one invokes a Lambda connector that reads from the external source at query time, which can be slow, adds Lambda cost, and puts load on the source system. For frequent queries, use ETL to materialize results in S3.

Cost Optimization

Strategy	Implementation	Savings
Columnar Formats	Store data as Parquet/ORC and select only the columns needed	50-90% less data scanned
Partitioning	Partition by date keys and filter on them	Only matching partitions are scanned
Predicate Filtering	Filter on partition columns in WHERE clauses	Avoids full table scans
Bucketing	Bucket by high-cardinality keys used in filters	Fewer files read per lookup
Result Caching	Enable query result reuse (engine v3) for repeated queries	No scan charge when results are reused
Workgroup Limits	Set bytes scanned cutoff	Prevents runaway costs
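
On the Result Caching row: with boto3, reuse of earlier results is requested per query through `ResultReuseConfiguration` on `start_query_execution`. The sketch below shows one way to wire that in; `result_reuse_kwargs` and `start_with_reuse` are hypothetical helper names, and the 60-minute maximum age is an arbitrary example value.

```python
def result_reuse_kwargs(max_age_minutes=60):
    """Build the ResultReuseConfiguration argument for start_query_execution."""
    return {
        "ResultReuseConfiguration": {
            "ResultReuseByAgeConfiguration": {
                "Enabled": True,
                "MaxAgeInMinutes": max_age_minutes,
            }
        }
    }


def start_with_reuse(client, sql, workgroup, max_age_minutes=60):
    """Start a query that may be answered from a recent identical run."""
    return client.start_query_execution(
        QueryString=sql,
        WorkGroup=workgroup,
        **result_reuse_kwargs(max_age_minutes),
    )
```

A shorter maximum age suits tables that change often, since reused results reflect the data as of the earlier run.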

-- Cost-optimized query example
-- Good: Uses partitions and filters
SELECT *
FROM orders
WHERE year = 2024
  AND month = 1
  AND day = 15
  AND customer_id = 'CUST-001';

-- Bad: Full table scan (expensive!)
SELECT *
FROM orders
WHERE customer_id = 'CUST-001';
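
The difference between the two queries comes down to how many partition prefixes Athena has to read. The sketch below illustrates this under an assumed Hive-style layout (`year=.../month=.../day=...`); the bucket name and both helper functions are hypothetical.

```python
from datetime import date, timedelta


def partition_prefix(table_root, day):
    """Hive-style S3 prefix for one day of data."""
    root = table_root.rstrip("/")
    return f"{root}/year={day.year}/month={day.month}/day={day.day}/"


def prefixes_for_range(table_root, start, end):
    """Prefixes a query filtered to the inclusive range [start, end] must read."""
    days = (end - start).days + 1
    return [partition_prefix(table_root, start + timedelta(days=i)) for i in range(days)]


# Assumed table location; not taken from the examples above.
one_day = prefixes_for_range("s3://data-lake/orders/", date(2024, 1, 15), date(2024, 1, 15))
whole_year = prefixes_for_range("s3://data-lake/orders/", date(2024, 1, 1), date(2024, 12, 31))
print(len(one_day), "of", len(whole_year), "daily partitions read")
# prints: 1 of 366 daily partitions read
```

The filtered query touches one prefix, while a query with no partition filter has to read every prefix the table has.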

⚠️

Cost Warning: Athena charges $5 per TB scanned, so a full scan of a 1 TB table costs $5 every time it runs. Always partition data and filter queries to minimize cost.
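
To make the arithmetic concrete, here is a small estimator. It is a sketch under stated assumptions: the $5 per TB rate quoted above, a binary terabyte (1024^4 bytes), and a 10 MB per-query billing minimum. `estimate_query_cost` is a hypothetical helper, not an AWS API, and actual rates vary by region.

```python
PRICE_PER_TB_USD = 5.0              # rate quoted in the warning above
BYTES_PER_TB = 1024 ** 4            # assumption: binary terabyte
MIN_BILLED_BYTES = 10 * 1024 ** 2   # assumption: 10 MB minimum per query


def estimate_query_cost(bytes_scanned):
    """Estimate the USD cost of one query from the bytes it scanned."""
    billed = max(bytes_scanned, MIN_BILLED_BYTES)
    return billed / BYTES_PER_TB * PRICE_PER_TB_USD


full_scan = estimate_query_cost(1024 ** 4)         # the 1 TB table above
at_cutoff = estimate_query_cost(10 * 1024 ** 3)    # a query at the 10 GB cutoff
```

Under these assumptions a query stopped at a 10 GB workgroup cutoff costs about five cents, compared with $5 for the full 1 TB scan.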

Interview Questions & Answers

Q1: How does Athena pricing work?

Answer: Athena charges $5 per TB of data scanned. Key cost optimization strategies:

Use columnar formats (Parquet/ORC)
Partition data by query patterns
Always use WHERE clauses to filter
Set workgroup bytes scanned limits

Q2: What is CTAS and when should you use it?

Answer: CTAS (Create Table As Select) creates a new table from query results. Use it to:

Materialize aggregated results
Create optimized table formats
Partition data for faster queries
Convert formats (CSV to Parquet)
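
A format conversion of this kind can be scripted by rendering the CTAS statement from a few parameters. The sketch below is illustrative only: `build_ctas` and the table names are hypothetical, it interpolates identifiers directly into SQL (so it should only be fed trusted values), and it relies on the source table listing its partition columns last, which CTAS requires.

```python
def build_ctas(target_table, source_table, external_location, partition_columns=()):
    """Render a CTAS statement that rewrites source_table as Snappy Parquet."""
    properties = [
        "format = 'PARQUET'",
        "parquet_compression = 'SNAPPY'",
        f"external_location = '{external_location}'",
    ]
    if partition_columns:
        quoted = ", ".join(f"'{column}'" for column in partition_columns)
        properties.append(f"partitioned_by = ARRAY[{quoted}]")
    with_clause = ",\n    ".join(properties)
    return (
        f"CREATE TABLE {target_table}\n"
        f"WITH (\n    {with_clause}\n) AS\n"
        f"SELECT * FROM {source_table}"
    )


# Hypothetical table names and location, used only for illustration.
sql = build_ctas(
    "analytics_db.orders_parquet",
    "raw_db.orders_csv",
    "s3://data-lake-curated/orders-parquet/",
    partition_columns=["year", "month"],
)
print(sql)
```

The resulting string can be passed as `QueryString` to `start_query_execution`.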

Q3: What is the difference between Athena v2 and v3?

Answer:

Athena v2: based on Presto 0.217
Athena v3: based on Trino, with improved performance and new SQL functions

Athena v3 is generally faster than v2, though the gain depends on the workload.

Q4: How do federated queries work in Athena?

Answer: Federated queries use Lambda connectors to query external data sources:

Deploy a connector (e.g., MySQL, DynamoDB) as a Lambda function
Register the connector as a data source (catalog) in Athena
Query using the catalog prefix (e.g., mysql_catalog.db.table)

The connector reads data from the source, pushes filters down where it supports them, and returns the matching rows to Athena.
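
One way to script the registration step with boto3 is `create_data_catalog` with `Type='LAMBDA'`, which registers an already deployed connector function as an Athena data source. This is a sketch rather than a full setup: `register_lambda_catalog` is a hypothetical helper, and the catalog name and Lambda ARN are supplied by the caller.

```python
def register_lambda_catalog(client, catalog_name, lambda_arn):
    """Register a deployed connector Lambda as an Athena data source."""
    return client.create_data_catalog(
        Name=catalog_name,
        Type="LAMBDA",
        Description=f"Federated connector for {catalog_name}",
        # 'function' points both metadata and record requests at one Lambda.
        Parameters={"function": lambda_arn},
    )
```

After registration, tables are addressed as `catalog_name.schema.table`, matching the `mysql_catalog.db.table` form above.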

Q5: What is the maximum query result size in Athena?