🌊 Amazon Kinesis

Master Kinesis Data Streams, Firehose, Analytics, Video Streams, and real-time streaming patterns.

Module: AWS Data Engineering • Topic 10 of 65 • Premium Content

Kinesis Architecture Overview

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                    KINESIS STREAMING ARCHITECTURE                            │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  DATA SOURCES                                                        │    │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐          │    │
│  │  │   IoT    │  │  Mobile  │  │  Web     │  │  Logs    │          │    │
│  │  │ Devices  │  │  Apps    │  │  Apps    │  │  Files   │          │    │
│  │  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘          │    │
│  └───────┼──────────────┼──────────────┼──────────────┼───────────────┘    │
│          ▼              ▼              ▼              ▼                     │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  KINESIS DATA STREAMS (Shard-based, ordered per shard)             │    │
│  │  Shard 0 (1MB/s in, 2MB/s out)  Shard 1 (1MB/s in, 2MB/s out)    │    │
│  │  Retention: 24 hrs (up to 365 days)                                │    │
│  └─────────────────────────────┬───────────────────────────────────────┘    │
│              ┌─────────────────┼─────────────────┐                         │
│              ▼                 ▼                 ▼                         │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐            │
│  │  Firehose       │  │  Analytics      │  │  Data Streams   │            │
│  │  (Delivery)     │  │  (SQL/Flink)    │  │  (Custom Apps)  │            │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘            │
│           ▼                    ▼                    ▼                      │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐            │
│  │ S3/Redshift/ES  │  │ Lambda/S3       │  │ Custom Consumers│            │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘            │
└─────────────────────────────────────────────────────────────────────────────┘

Kinesis Data Streams

Stream Configuration

import boto3
kinesis = boto3.client('kinesis')

# Create with provisioned shards
response = kinesis.create_stream(
    StreamName='sensor-data-stream',
    ShardCount=4,
    StreamModeDetails={'StreamMode': 'PROVISIONED'}
)

# Or create on-demand (auto-scaling)
kinesis.create_stream(
    StreamName='sensor-data-stream',
    StreamModeDetails={'StreamMode': 'ON_DEMAND'}
)

Producer Example

import json
from datetime import datetime

def put_records(stream_name, records):
    response = kinesis.put_records(
        StreamName=stream_name,
        Records=[
            {'Data': json.dumps(r), 'PartitionKey': r['sensor_id']}
            for r in records
        ]
    )
    if response['FailedRecordCount'] > 0:
        print(f"Failed: {response['FailedRecordCount']} records")
    return response

KCL Consumer

from amazon_kclpy import processor, kcl
import json

class RecordProcessor(processor.RecordProcessorBase):
    def process_records(self, records):
        for record in records:
            data = json.loads(record.binary_data)
            print(f"Processing: {data['sensor_id']}")

if __name__ == '__main__':
    kcl.run(RecordProcessor())

ℹ️

Pro Tip: Use ON_DEMAND mode for unpredictable traffic. Use PROVISIONED for predictable workloads to control costs.

Kinesis Firehose

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                    KINESIS FIREHOSE ARCHITECTURE                              │
│                                                                             │
│  Stream/Direct ──► Firehose ──► Lambda Transform ──► Destination            │
│                                                                             │
│  DESTINATIONS:                                                              │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐                  │
│  │    S3    │  │ Redshift │  │ Elastic  │  │ HTTP     │                  │
│  │          │  │ (COPY)   │  │ search   │  │ Endpoint │                  │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘                  │
│                                                                             │
│  BUFFERING: Size 1-128MB, Interval 60-900s (default: 5MB/300s)           │
│  COMPRESSION: GZIP, Snappy, Zip                                            │
└─────────────────────────────────────────────────────────────────────────────┘

Firehose to S3 with Partitioning

firehose = boto3.client('firehose')

response = firehose.create_delivery_stream(
    DeliveryStreamName='sensor-data-firehose',
    DeliveryStreamType='KinesisStreamAsSource',
    KinesisStreamSourceConfiguration={
        'KinesisStreamARN': 'arn:aws:kinesis:us-east-1:123456789012:stream/sensor-data-stream',
        'RoleARN': 'arn:aws:iam::123456789012:role/FirehoseRole'
    },
    ExtendedS3DestinationConfiguration={
        'RoleARN': 'arn:aws:iam::123456789012:role/FirehoseRole',
        'BucketARN': 'arn:aws:s3:::data-lake-raw',
        'Prefix': 'sensor-data/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/',
        'ErrorOutputPrefix': 'errors/',
        'BufferingHints': {'SizeInMBs': 64, 'IntervalInSeconds': 300},
        'CompressionFormat': 'UNCOMPRESSED',
        'ProcessingConfiguration': {
            'Enabled': True,
            'Processors': [{
                'Type': 'Lambda',
                'Parameters': [{'ParameterName': 'LambdaArn', 'ParameterValue': 'arn:aws:lambda:...'}]
            }]
        }
    }
)

Kinesis Analytics (SQL & Flink)

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                    KINESIS ANALYTICS ARCHITECTURE                            │
│                                                                             │
│  Source Stream ──► Application ──► Output                                   │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  SQL APPLICATION                                                     │    │
│  │                                                                     │    │
│  │  CREATE STREAM "sensor_stream" (                                    │    │
│  │      sensor_id VARCHAR(16),                                         │    │
│  │      temperature DOUBLE,                                            │    │
│  │      event_time TIMESTAMP                                           │    │
│  │  );                                                                 │    │
│  │                                                                     │    │
│  │  CREATE OR REPLACE PUMP "stream_pump" AS                            │    │
│  │  INSERT INTO "output_stream"                                        │    │
│  │  SELECT * FROM "sensor_stream"                                      │    │
│  │  WHERE temperature > 50.0;                                          │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  FLINK APPLICATION                                                   │    │
│  │                                                                     │    │
│  │  • Apache Flink runtime                                             │    │
│  │  • Java/Scala/Python support                                        │    │
│  │  • Window operations (tumbling, sliding, session)                   │    │
│  │  • State management                                                  │    │
│  │  • Exactly-once processing                                          │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────┘

Kinesis Analytics SQL Example

-- Real-time aggregation
CREATE OR REPLACE STREAM "aggregated_stream" (
    sensor_id VARCHAR(16),
    avg_temperature DOUBLE,
    max_temperature DOUBLE,
    min_temperature DOUBLE,
    reading_count INTEGER,
    window_start TIMESTAMP
);

CREATE OR REPLACE PUMP "aggregate_pump" AS
INSERT INTO "aggregated_stream"
SELECT
    sensor_id,
    AVG(temperature) AS avg_temperature,
    MAX(temperature) AS max_temperature,
    MIN(temperature) AS min_temperature,
    COUNT(*) AS reading_count,
    STEP("sensor_stream".ROWTIME BY INTERVAL '5' MINUTE) AS window_start
FROM "sensor_stream"
GROUP BY sensor_id, STEP("sensor_stream".ROWTIME BY INTERVAL '5' MINUTE);

-- Anomaly detection
CREATE OR REPLACE STREAM "anomaly_stream" (
    sensor_id VARCHAR(16),
    temperature DOUBLE,
    anomaly_type VARCHAR(32)
);

CREATE OR REPLACE PUMP "anomaly_pump" AS
INSERT INTO "anomaly_stream"
SELECT
    sensor_id,
    temperature,
    'HIGH_TEMP' AS anomaly_type
FROM "sensor_stream"
WHERE temperature > (
    SELECT AVG(temperature) + 3 * STDDEV(temperature)
    FROM "sensor_stream"
);

Kinesis Best Practices

ℹ️

Pro Tip: Use Enhanced Fan-Out for dedicated throughput per consumer (2MB/s per consumer vs shared 2MB/s per shard). Ideal for multiple consumers reading the same stream.

Sharding Calculation

Metric	Formula	Example
Inbound	Records/s × Avg size	1000 × 1KB = 1MB/s
Outbound	Consumers × Records/s × Size	3 × 500 × 1KB = 1.5MB/s
Shards needed	MAX(Inbound/1MB, Outbound/2MB)	MAX(1, 0.75) = 1 shard

Cost Comparison

Component	Provisioned	On-Demand
Per shard/hour	$0.015	N/A
Per million PUT Payload Units	$0.01 \|$ 0.14
Storage (beyond 24hr)	$0.023/GB/month \|$ 0.023/GB/month
Enhanced Fan-Out	$0.015/shard/hour	N/A

⚠️

Cost Warning: On-demand mode costs ~9x more per record than provisioned. Use provisioned mode when you can predict throughput requirements.

Interview Questions & Answers

Q1: What is the difference between Kinesis Data Streams and Firehose?

Answer:

Data Streams: Real-time, custom consumers, shard-based, manual scaling
Firehose: Managed delivery, auto-scaling, built-in destinations (S3, Redshift, ES)

Use Data Streams for custom processing. Use Firehose for simple delivery to destinations.

Q2: How do you handle data ordering in Kinesis?

Answer: Data is ordered within a shard only. Use partition keys wisely:

Same partition key goes to same shard
Order is guaranteed per shard, not across shards
For global ordering: use single shard (limited to 1MB/s)

Q3: What is Enhanced Fan-Out and when should you use it?

Answer: Enhanced Fan-Out provides dedicated 2MB/s throughput per consumer instead of shared 2MB/s per shard. Use when:

Multiple consumers read same stream
Low-latency requirements (<70ms)
Each consumer needs consistent throughput

Q4: How does Kinesis handle data retention?

Answer:

Default: 24 hours
Maximum: 365 days
Extended retention costs $0.023/GB/month
Use S3 for long-term storage instead

Q5: What is the difference between Provisioned and On-Demand modes?

Answer:

Provisioned: You specify shard count, manual scaling, lower cost
On-Demand: Auto-scaling, pay-per-use, higher cost per record

Use Provisioned for predictable workloads, On-Demand for variable/unpredictable.

Summary

Kinesis is AWS's core streaming platform. Key takeaways:

Data Streams: Shard-based, custom consumers, ordered per shard
Firehose: Managed delivery, auto-scaling, S3/Redshift/ES destinations
Analytics: SQL and Flink for real-time processing
Sharding: 1MB/s in, 2MB/s out per shard; plan sharding carefully
Cost: Provisioned is cheaper for predictable workloads; On-Demand for variable
Fan-Out: Use for multiple consumers needing dedicated throughput

Amazon Kinesis for Data Engineers

🌊 Amazon Kinesis

Kinesis Architecture Overview

Kinesis Data Streams

Stream Configuration

Producer Example

KCL Consumer

Kinesis Firehose

Firehose to S3 with Partitioning

Kinesis Analytics (SQL & Flink)

Kinesis Analytics SQL Example

Kinesis Best Practices

Sharding Calculation

Cost Comparison

Interview Questions & Answers

Q1: What is the difference between Kinesis Data Streams and Firehose?

Q2: How do you handle data ordering in Kinesis?

Q3: What is Enhanced Fan-Out and when should you use it?

Q4: How does Kinesis handle data retention?

Q5: What is the difference between Provisioned and On-Demand modes?

Summary