📨 Amazon MSK

Master Amazon MSK Kafka clusters, MSK Connect, ksqlDB, Serverless, and managed streaming patterns.

Module: AWS Data Engineering • Topic 11 of 65 • Premium Content

MSK Architecture

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                    AMAZON MSK ARCHITECTURE                                    │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │                    MSK CLUSTER (VPC)                                 │    │
│  │                                                                     │    │
│  │  ┌───────────────────────────────────────────────────────────────┐  │    │
│  │  │  BROKER NODES (Multi-AZ)                                       │  │    │
│  │  │                                                               │  │    │
│  │  │  AZ-a              AZ-b              AZ-c                    │  │    │
│  │  │  ┌──────────┐     ┌──────────┐     ┌──────────┐             │  │    │
│  │  │  │ Broker 1 │     │ Broker 2 │     │ Broker 3 │             │  │    │
│  │  │  │ k.m5.large│    │ k.m5.large│    │ k.m5.large│            │  │    │
│  │  │  │ 8 vCPU   │     │ 8 vCPU   │     │ 8 vCPU   │             │  │    │
│  │  │  │ 16 GB RAM│     │ 16 GB RAM│     │ 16 GB RAM│             │  │    │
│  │  │  └──────────┘     └──────────┘     └──────────┘             │  │    │
│  │  │                                                               │  │    │
│  │  │  Topics replicated across all brokers                         │  │    │
│  │  │  Partitions distributed across brokers                        │  │    │
│  │  │  Replication Factor: 3 (default)                              │  │    │
│  │  └───────────────────────────────────────────────────────────────┘  │    │
│  │                                                                     │    │
│  │  ┌───────────────────────────────────────────────────────────────┐  │    │
│  │  │  ZOOKEEPER / KRAFT CLUSTER                                    │  │    │
│  │  │                                                               │  │    │
│  │  │  • ZooKeeper mode: 3 ZooKeeper nodes                          │  │    │
│  │  │  • KRaft mode: No ZooKeeper needed (newer)                    │  │    │
│  │  │  • Manages cluster metadata                                    │  │    │
│  │  │  • Leader election                                             │  │    │
│  │  └───────────────────────────────────────────────────────────────┘  │    │
│  │                                                                     │    │
│  │  ┌───────────────────────────────────────────────────────────────┐  │    │
│  │  │  STORAGE                                                      │  │    │
│  │  │                                                               │  │    │
│  │  │  • EBS volumes (GP3) per broker                               │  │    │
│  │  │  • 100 GB - 16 TB per broker                                  │  │    │
│  │  │  • Auto-scaling storage enabled                                │  │    │
│  │  │  • Tiered storage for old data (S3)                           │  │    │
│  │  └───────────────────────────────────────────────────────────────┘  │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  MSK CONNECT                                                         │    │
│  │                                                                     │    │
│  │  ┌──────────────┐         ┌──────────────┐                        │    │
│  │  │  Source       │         │  Sink        │                        │    │
│  │  │  Connector    │         │  Connector   │                        │    │
│  │  │              │         │              │                        │    │
│  │  │  • JDBC       │         │  • S3        │                        │    │
│  │  │  • Debezium   │         │  • Redshift  │                        │    │
│  │  │  • DynamoDB   │         │  • Elasticsearch│                    │    │
│  │  │  • CloudWatch │         │  • HTTP      │                        │    │
│  │  └──────────────┘         └──────────────┘                        │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────┘

MSK Cluster Configuration

import boto3

msk = boto3.client('kafka')

# Create MSK cluster
response = msk.create_cluster(
    ClusterName='data-streaming-cluster',
    KafkaVersion='3.5.1',
    NumberOfBrokerNodes=3,
    BrokerNodeGroupInfo={
        'InstanceType': 'k.m5.large',
        'ClientSubnets': ['subnet-12345678', 'subnet-87654321', 'subnet-abcdef12'],
        'SecurityGroups': ['sg-12345678'],
        'StorageInfo': {
            'EBSStorageInfo': {
                'VolumeSize': 500,
                'ProvisionedThroughput': {
                    'Enabled': True,
                    'VolumeThroughput': 250
                }
            }
        },
        'ConnectivityInfo': {
            'PublicAccess': {'Type': 'DISABLED'}
        }
    },
    EncryptionInfo={
        'EncryptionInTransit': {
            'ClientBroker': 'TLS',
            'InCluster': True
        },
        'EncryptionAtRestKmsKeyArn': 'arn:aws:kms:us-east-1:123456789012:key/12345678'
    },
    ConfigurationInfo={
        'Arn': 'arn:aws:kafka:us-east-1:123456789012:configuration/custom-config/12345678',
        'Revision': 1
    },
    EnhancedMonitoring='PER_BROKER',
    OpenMonitoring={
        'Prometheus': {
            'JmxExporter': {'EnabledInBroker': True},
            'NodeExporter': {'EnabledInBroker': True}
        }
    },
    LoggingInfo={
        'BrokerLogs': {
            'CloudWatchLogs': {'Enabled': True, 'LogGroup': '/aws/msk/cluster'},
            'S3Logs': {'Enabled': True, 'Bucket': 'msk-logs-bucket'},
            'Firehose': {'Enabled': False}
        }
    },
    Tags={'Environment': 'production', 'Team': 'data-engineering'}
)

cluster_arn = response['ClusterArn']
print(f"Cluster ARN: {cluster_arn}")

MSK Serverless

# Create MSK Serverless cluster
response = msk_serverless.create_cluster(
    clusterName='serverless-streaming',
    kafkaVersion='3.4.0',
    vpcConfigs=[
        {
            'subnetIds': ['subnet-12345678', 'subnet-87654321'],
            'securityGroups': ['sg-12345678']
        }
    ],
    clientAuthentication={
        'sasl': {
            'iam': {'enabled': True}
        }
    },
    tags={'Environment': 'production'}
)

ℹ️

Pro Tip: MSK Serverless is ideal for unpredictable workloads. It automatically scales based on throughput and charges per VPU-hour (Virtual Processing Unit).

MSK Connect

Connector Configuration

# Create MSK Connect connector (S3 Sink)
response = msk_connect.create_connector(
    connectorName='s3-sink-connector',
    kafkaConnectArn='arn:aws:kafkaconnect:us-east-1:123456789012:connector/custom-config/12345678',
    connectorConfiguration={
        'connector.class': 'io.confluent.connect.s3.S3SinkConnector',
        'tasks.max': '3',
        'topics.regex': 'raw-(.*)',
        's3.bucket.name': 'data-lake-raw',
        's3.region': 'us-east-1',
        'flush.size': '1000',
        'rotate.interval.ms': '300000',
        'storage.class': 'io.confluent.connect.s3.storage.S3Partitioner',
        'partitioner.class': 'io.confluent.connect.storage.partitioner.TimeBasedPartitioner',
        'path.format': "'year'=YYYY/'month'=MM/'day'=dd",
        'locale': 'en_US',
        'timezone': 'UTC',
        'format.class': 'io.confluent.connect.s3.format.parquet.ParquetFormat',
        'parquet.codec': 'snappy',
        'schema.compatibility': 'BACKWARD'
    },
    serviceExecutionRoleArn='arn:aws:iam::123456789012:role/MSKConnectRole',
    kafkaConnectEndpoint='https://connect.us-east-1.amazonaws.com',
    capacity={
        'connectorProvisionedThroughput': {
            'readCapacityPerSec': 100,
            'writeCapacityPerSec': 100
        }
    },
    tags={'Environment': 'production'}
)

Debezium CDC Connector

# Debezium source connector for CDC
debezium_config = {
    'connector.class': 'io.debezium.connector.mysql.MySqlConnector',
    'tasks.max': '1',
    'database.hostname': 'rds-instance.cluster-123456.us-east-1.rds.amazonaws.com',
    'database.port': '3306',
    'database.user': 'debezium',
    'database.password': 'secure-password',
    'database.server.id': '1',
    'database.server.name': 'production',
    'database.include.list': 'production_db',
    'table.include.list': 'production_db.customers,production_db.orders',
    'database.history.kafka.bootstrap.servers': 'b-1.12345678.us-east-1.kafka.amazonaws.com:9092',
    'database.history.kafka.topic': 'schema-changes.production',
    'transforms': 'route',
    'transforms.route.type': 'org.apache.kafka.connect.transforms.RegexRouter',
    'transforms.route.regex': '([^.]+)\\.([^.]+)\\.([^.]+)',
    'transforms.route.replacement': '$3'
}

ksqlDB

Architecture Diagram

┌─────────────────────────────────────────────────────────────────────────────┐
│                    ksqlDB ARCHITECTURE                                        │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐    │
│  │  KAFKA TOPICS ──► ksqlDB Server ──► Materialized Views             │    │
│  │                                                                     │    │
│  │  ┌───────────────────────────────────────────────────────────────┐  │    │
│  │  │  STREAMS (Unbounded, append-only)                             │  │    │
│  │  │                                                               │  │    │
│  │  │  CREATE STREAM sensor_stream (                                 │  │    │
│  │  │      sensor_id VARCHAR KEY,                                    │  │    │
│  │  │      temperature DOUBLE,                                       │  │    │
│  │  │      event_time TIMESTAMP                                      │  │    │
│  │  │  ) WITH (                                                      │  │    │
│  │  │      KAFKA_TOPIC = 'sensor-data',                              │  │    │
│  │  │      VALUE_FORMAT = 'JSON'                                     │  │    │
│  │  │  );                                                            │  │    │
│  │  └───────────────────────────────────────────────────────────────┘  │    │
│  │                                                                     │    │
│  │  ┌───────────────────────────────────────────────────────────────┐  │    │
│  │  │  TABLES (Materialized, updatable)                              │  │    │
│  │  │                                                               │  │    │
│  │  │  CREATE TABLE sensor_agg AS                                    │  │    │
│  │  │      SELECT sensor_id,                                         │  │    │
│  │  │             AVG(temperature) AS avg_temp,                      │  │    │
│  │  │             COUNT(*) AS reading_count                          │  │    │
│  │  │      FROM sensor_stream                                        │  │    │
│  │  │      WINDOW TUMBLING (5 MINUTES)                               │  │    │
│  │  │      GROUP BY sensor_id;                                       │  │    │
│  │  └───────────────────────────────────────────────────────────────┘  │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────┘

ℹ️

ksqlDB Use Cases: Real-time aggregation, filtering, enrichment, join streams with tables, sessionization, anomaly detection.

Interview Questions & Answers

Q1: What is the difference between MSK and self-managed Kafka?

Answer:

MSK: AWS-managed, automatic patching, multi-AZ, monitoring built-in
Self-managed: Full control, custom configurations, more operational overhead

MSK reduces operational burden but costs more. Self-managed gives more flexibility.

Q2: How does MSK handle data replication?

Answer:

Default replication factor: 3 (across AZs)
ISR (In-Sync Replicas) ensures durability
Acknowledgment settings: acks=0/1/all
Min In-Sync Replicas setting affects availability vs durability

Q3: When should you use MSK Serverless?

Answer:

Variable/unpredictable workloads
Development and testing
Small to medium throughput requirements
Want to avoid cluster management

Q4: What is the purpose of MSK Connect?

Answer: MSK Connect runs Kafka Connect connectors as a managed service:

Source connectors: Capture changes from databases, files
Sink connectors: Deliver data to S3, Redshift, Elasticsearch
No infrastructure management
Auto-scaling based on throughput

Q5: How do you monitor MSK cluster health?

Answer:

CloudWatch metrics (CPU, memory, disk, network)
Broker logs (CloudWatch, S3, Firehose)
Open monitoring (Prometheus)
MSK Console health dashboard

Summary

Amazon MSK is the managed Kafka service on AWS. Key takeaways:

Architecture: Multi-AZ brokers with EBS storage
Serverless: Auto-scaling, pay-per-use for variable workloads
MSK Connect: Managed Kafka Connect for source/sink connectors
ksqlDB: SQL for stream processing on Kafka
Security: TLS encryption, IAM authentication, VPC connectivity
Monitoring: CloudWatch, Open Monitoring, Broker Logs

Amazon MSK for Data Engineers

📨 Amazon MSK

MSK Architecture

MSK Cluster Configuration

MSK Serverless

MSK Connect

Connector Configuration

Debezium CDC Connector

ksqlDB

Interview Questions & Answers

Q1: What is the difference between MSK and self-managed Kafka?

Q2: How does MSK handle data replication?

Q3: When should you use MSK Serverless?

Q4: What is the purpose of MSK Connect?

Q5: How do you monitor MSK cluster health?

Summary