⚡ Real-time Streaming Pipelines

Master Kinesis-Lambda-S3-Athena real-time streaming architecture and patterns.

Streaming Pipeline Architecture

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│         REAL-TIME STREAMING PIPELINE: Kinesis → Lambda → S3 → Athena        │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  DATA SOURCES (Producers)                                           │    │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐          │    │
│  │  │ IoT      │  │ Web Apps │  │ Mobile   │  │ Logs     │          │    │
│  │  │ Sensors  │  │ Events   │  │ Events   │  │ Streams  │          │    │
│  │  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘          │    │
│  └───────┼──────────────┼──────────────┼──────────────┼───────────────┘    │
│          ▼              ▼              ▼              ▼                     │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  KINESIS DATA STREAMS (Ingestion)                                   │    │
│  │                                                                     │    │
│  │  Shard 0        Shard 1        Shard 2        Shard 3              │    │
│  │  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐       │    │
│  │  │ 1MB/s in │   │ 1MB/s in │   │ 1MB/s in │   │ 1MB/s in │       │    │
│  │  │ 2MB/s out│   │ 2MB/s out│   │ 2MB/s out│   │ 2MB/s out│       │    │
│  │  └──────────┘   └──────────┘   └──────────┘   └──────────┘       │    │
│  │                                                                     │    │
│  │  Retention: 24 hours (extendable to 365 days)                       │    │
│  └─────────────────────────────┬───────────────────────────────────────┘    │
│                                │                                           │
│                                ▼                                           │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  LAMBDA (Processing)                                                 │    │
│  │                                                                     │    │
│  │  ┌───────────────────────────────────────────────────────────────┐  │    │
│  │  │  Event Source Mapping                                         │  │    │
│  │  │  • Batch Size: 100-1000 records                               │  │    │
│  │  │  • Batch Window: 60 seconds                                   │  │    │
│  │  │  • Parallelization Factor: 10                                 │  │    │
│  │  │  • Maximum Batching: Enabled                                  │  │    │
│  │  └───────────────────────────────────────────────────────────────┘  │    │
│  │                                                                     │    │
│  │  ┌───────────────────────────────────────────────────────────────┐  │    │
│  │  │  Processing Logic                                             │  │    │
│  │  │  • Parse and validate records                                 │  │    │
│  │  │  • Enrich with reference data                                 │  │    │
│  │  │  • Aggregate (windowed)                                       │  │    │
│  │  │  • Filter invalid records                                     │  │    │
│  │  └───────────────────────────────────────────────────────────────┘  │    │
│  └─────────────────────────────┬───────────────────────────────────────┘    │
│                                │                                           │
│                                ▼                                           │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  S3 (Storage)                                                        │    │
│  │                                                                     │    │
│  │  s3://realtime-data/                                                │    │
│  │  ├── raw/                (Ingested records)                         │    │
│  │  │   └── {date}/{hour}/  (Hourly partitions)                       │    │
│  │  ├── processed/          (Transformed)                              │    │
│  │  │   └── {date}/{hour}/  (Hourly partitions)                       │    │
│  │  └── aggregated/         (Windowed aggregates)                      │    │
│  │      └── {date}/{hour}/  (Hourly aggregates)                       │    │
│  │                                                                     │    │
│  │  Format: Parquet (compressed)                                       │    │
│  └─────────────────────────────┬───────────────────────────────────────┘    │
│                                │                                           │
│              ┌─────────────────┼─────────────────┐                         │
│              ▼                 ▼                 ▼                         │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐            │
│  │  Athena         │  │  QuickSight     │  │  Lambda         │            │
│  │  (Ad-hoc)       │  │  (Dashboards)   │  │  (Alerts)       │            │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘            │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  MONITORING                                                         │    │
│  │  CloudWatch Metrics: IteratorAge, GetRecords.IteratorAgeMilliseconds│    │
│  │  CloudWatch Alarms: Throttles, Errors, IteratorAge > threshold     │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────┘
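The per-shard limits in the diagram translate directly into a shard count. The sketch below is a rough estimate only: it uses the 1 MB/s write limit shown above together with the 1,000 records/s per-shard write limit, and the `headroom` parameter is a hypothetical safety margin for spikes and uneven partition keys, not an AWS setting.

```python
import math

# Per-shard write limits: 1 MB/s (as in the diagram) and 1,000 records/s.
SHARD_INGEST_MB_PER_S = 1.0
SHARD_INGEST_RECORDS_PER_S = 1000


def shards_needed(records_per_s: float, avg_record_kb: float, headroom: float = 0.25) -> int:
    """Estimate how many shards a given ingest rate requires.

    headroom is an assumed safety margin (0.25 = 25% spare capacity).
    """
    mb_per_s = records_per_s * avg_record_kb / 1024
    by_bytes = mb_per_s / SHARD_INGEST_MB_PER_S
    by_count = records_per_s / SHARD_INGEST_RECORDS_PER_S
    # Whichever limit is hit first determines the shard count.
    return max(1, math.ceil(max(by_bytes, by_count) * (1 + headroom)))
```

For example, `shards_needed(2000, 2)` is bound by bytes rather than record count, while many small records (say 3,000 per second at 0.5 KB each) are bound by the records-per-second limit instead.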

Lambda Processing Code

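import base64
import json
import uuid
from datetime import datetime, timezone
from io import BytesIO

import boto3
import pandas as pd  # to_parquet needs pyarrow (or fastparquet) packaged with the function

s3 = boto3.client('s3')

def lambda_handler(event, context):
    """Process Kinesis records and write to S3."""

    processed_records = []
    skipped = 0

    for record in event['Records']:
        try:
            # Decode Kinesis data (delivered base64-encoded)
            payload = base64.b64decode(record['kinesis']['data'])
            data = json.loads(payload)
            if not isinstance(data, dict):
                raise ValueError('payload is not a JSON object')
        except ValueError:
            # Filter invalid records so one bad payload does not make
            # Lambda retry the whole batch
            skipped += 1
            continue

        # Process and transform
        processed_records.append(process_record(data))

    # Write batch to S3
    if processed_records:
        write_to_s3(processed_records)

    return {'processed': len(processed_records), 'skipped': skipped}

def process_record(data):
    """Transform and enrich record."""
    return {
        'event_id': data.get('event_id'),
        'user_id': data.get('user_id'),
        'event_type': data.get('event_type'),
        # Keep the producer's event time rather than overwriting it with
        # processing time (assumes producers send a 'timestamp' field)
        'timestamp': data.get('timestamp'),
        'processed_at': datetime.now(timezone.utc).isoformat()
    }

def write_to_s3(records):
    """Write batch of records to S3 as Parquet."""
    df = pd.DataFrame(records)

    # Partition by date and hour (UTC). The UUID suffix keeps concurrent
    # invocations that finish in the same second from overwriting each other.
    now = datetime.now(timezone.utc)
    key = (
        f"processed/year={now.year}/month={now.month:02d}/"
        f"day={now.day:02d}/hour={now.hour:02d}/"
        f"{now.strftime('%Y%m%d%H%M%S')}-{uuid.uuid4().hex}.parquet"
    )

    buffer = BytesIO()
    df.to_parquet(buffer, index=False, compression='snappy')

    s3.put_object(
        Bucket='realtime-data',  # bucket name from the architecture diagram
        Key=key,
        Body=buffer.getvalue()
    )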
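
A handler like the one above succeeds or fails as a whole batch. Lambda also supports partial batch responses for Kinesis when the event source mapping sets `FunctionResponseTypes` to `ReportBatchItemFailures`. The following is a minimal sketch of that response shape; `transform` and `handler_with_partial_failures` are hypothetical names, and whether a failing record should be retried or skipped is a policy choice (this sketch retries).

```python
import base64
import json


def transform(data: dict) -> dict:
    """Hypothetical stand-in for the real per-record processing."""
    return {"event_id": data["event_id"], "event_type": data.get("event_type")}


def handler_with_partial_failures(event, context):
    """Process records in order and report the first failing record."""
    processed = []
    failures = []
    for record in event["Records"]:
        try:
            data = json.loads(base64.b64decode(record["kinesis"]["data"]))
            processed.append(transform(data))
        except (ValueError, KeyError, TypeError):
            # Lambda retries from the lowest reported sequence number, so
            # stop at the first failure instead of processing later records.
            failures.append({"itemIdentifier": record["kinesis"]["sequenceNumber"]})
            break
    # Persist `processed` here (for example, the S3 write) before returning.
    return {"batchItemFailures": failures}
```

Because a permanently bad record would be retried until it expires, this pattern is usually paired with a retry limit or an on-failure destination on the event source mapping.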

Interview Q&A

Q1: What is IteratorAge in Kinesis?

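Answer: IteratorAge is the age of the most recent record the consumer has read: the time between when that record was written to the stream and when it was handed to the consumer. A steadily rising iterator age means the consumer is falling behind, and if it approaches the stream's retention period, unprocessed records will expire before they are read.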
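
An alarm on this metric could be sketched as follows. The function builds keyword arguments for CloudWatch's `put_metric_alarm`; the alarm name, the 60-second threshold, and the five evaluation periods are assumptions chosen for illustration, and `IteratorAge` is reported in milliseconds under the `AWS/Lambda` namespace.

```python
def iterator_age_alarm_params(function_name: str, threshold_ms: int = 60_000) -> dict:
    """Build keyword arguments for CloudWatch put_metric_alarm (illustrative values)."""
    return {
        "AlarmName": f"{function_name}-iterator-age",  # hypothetical naming scheme
        "Namespace": "AWS/Lambda",
        "MetricName": "IteratorAge",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 5,
        "Threshold": float(threshold_ms),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


def create_alarm(function_name: str) -> None:
    """Create the alarm; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above has no AWS dependency

    boto3.client("cloudwatch").put_metric_alarm(**iterator_age_alarm_params(function_name))
```

A sensible threshold depends on how much lag the downstream consumers of the S3 data can tolerate.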

Q2: How do you handle backpressure in streaming?

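Answer: Backpressure shows up as a rising IteratorAge. On the consumer side, tune the batch size and batch window, raise the parallelization factor (up to 10 concurrent batches per shard), and shorten Lambda execution time. If several consumers compete for a shard's 2 MB/s read limit, use Kinesis Enhanced Fan-Out to give each consumer dedicated throughput. If producers exceed the 1 MB/s per-shard write limit, add shards.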
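
The batching and parallelization settings live on the Lambda event source mapping. A minimal sketch, assuming values in line with the architecture diagram: the batch size of 500 is one point in the 100-1000 range, and `BisectBatchOnFunctionError` and `MaximumRetryAttempts` are additions not mentioned in the text. The stream ARN and function name are supplied by the caller.

```python
def kinesis_mapping_params(stream_arn: str, function_name: str) -> dict:
    """Build keyword arguments for Lambda's create_event_source_mapping."""
    return {
        "EventSourceArn": stream_arn,
        "FunctionName": function_name,
        "StartingPosition": "LATEST",
        "BatchSize": 500,                      # records per invocation
        "MaximumBatchingWindowInSeconds": 60,  # the "batch window"
        "ParallelizationFactor": 10,           # concurrent batches per shard (max 10)
        "BisectBatchOnFunctionError": True,    # assumption: split failing batches
        "MaximumRetryAttempts": 3,             # assumption: bound retries
    }


def create_mapping(stream_arn: str, function_name: str) -> dict:
    """Create the mapping; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above has no AWS dependency

    client = boto3.client("lambda")
    return client.create_event_source_mapping(**kinesis_mapping_params(stream_arn, function_name))
```

A larger batch window trades latency for fewer, larger S3 objects, which also tends to help Athena query performance.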

Q3: What is the difference between at-least-once and exactly-once delivery?

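Answer: At-least-once delivery guarantees that every record is delivered but may deliver it more than once, so consumers can see duplicates; exactly-once means each record takes effect exactly one time. Kinesis with Lambda is at-least-once, because failed batches are retried. To get effectively exactly-once results, make processing idempotent or deduplicate on a unique event ID, for example with a conditional write to DynamoDB.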
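
The deduplication idea can be sketched without any AWS dependency. `InMemoryClaims` and `process_once` below are hypothetical names, and the in-memory set is only a stand-in: in a DynamoDB-backed version, `claim` would be a `put_item` with the condition expression `attribute_not_exists(event_id)`, treating a failed condition as "already seen".

```python
from typing import Callable, Iterable


class InMemoryClaims:
    """Stand-in for a deduplication table; not durable, for illustration only."""

    def __init__(self) -> None:
        self._seen = set()

    def claim(self, event_id: str) -> bool:
        """Return True the first time an event_id is seen, False afterwards."""
        if event_id in self._seen:
            return False
        self._seen.add(event_id)
        return True


def process_once(records: Iterable[dict], claims: InMemoryClaims,
                 sink: Callable[[dict], None]) -> int:
    """Apply sink to each record whose event_id has not been claimed before."""
    applied = 0
    for record in records:
        if claims.claim(record["event_id"]):
            sink(record)
            applied += 1
    return applied
```

Replaying the same batch through `process_once` applies nothing the second time, which is the property a retried Lambda invocation needs.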

Summary

Architecture: Kinesis → Lambda → S3 → Athena/QuickSight
Key Metrics: IteratorAge, Throttles, Errors
Best Practices: Idempotent processing, partition by time, use Parquet
Cost Optimization: Right-size batch size and batch window; reserve provisioned concurrency for latency-critical paths only, since it adds cost
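
Partitioning by time pays off when queries filter on the partition columns, because Athena then scans only the matching S3 prefixes. A small sketch of building such a query, assuming a hypothetical table whose partition columns `year`, `month`, `day`, and `hour` are declared as strings and match the `year=/month=/day=/hour=` key layout:

```python
from datetime import datetime, timezone


def hourly_partition_query(table: str, hour: datetime) -> str:
    """Build an Athena query restricted to a single hour partition."""
    return (
        f"SELECT event_type, COUNT(*) AS events FROM {table} "
        f"WHERE year = '{hour.year}' AND month = '{hour.month:02d}' "
        f"AND day = '{hour.day:02d}' AND hour = '{hour.hour:02d}' "
        "GROUP BY event_type"
    )


# 'processed_events' is a hypothetical table name.
query = hourly_partition_query("processed_events", datetime(2024, 3, 5, 9, tzinfo=timezone.utc))
```

The resulting string could be submitted through the Athena console or the `start_query_execution` API.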