MSK Architecture
Architecture Diagram
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β AMAZON MSK ARCHITECTURE β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β MSK CLUSTER (VPC) β β
β β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β BROKER NODES (Multi-AZ) β β β
β β β β β β
β β β AZ-a AZ-b AZ-c β β β
β β β ββββββββββββ ββββββββββββ ββββββββββββ β β β
β β β β Broker 1 β β Broker 2 β β Broker 3 β β β β
β β β β k.m5.largeβ β k.m5.largeβ β k.m5.largeβ β β β
β β β β 8 vCPU β β 8 vCPU β β 8 vCPU β β β β
β β β β 16 GB RAMβ β 16 GB RAMβ β 16 GB RAMβ β β β
β β β ββββββββββββ ββββββββββββ ββββββββββββ β β β
β β β β β β
β β β Topics replicated across all brokers β β β
β β β Partitions distributed across brokers β β β
β β β Replication Factor: 3 (default) β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β ZOOKEEPER / KRAFT CLUSTER β β β
β β β β β β
β β β β’ ZooKeeper mode: 3 ZooKeeper nodes β β β
β β β β’ KRaft mode: No ZooKeeper needed (newer) β β β
β β β β’ Manages cluster metadata β β β
β β β β’ Leader election β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β STORAGE β β β
β β β β β β
β β β β’ EBS volumes (GP3) per broker β β β
β β β β’ 100 GB - 16 TB per broker β β β
β β β β’ Auto-scaling storage enabled β β β
β β β β’ Tiered storage for old data (S3) β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β MSK CONNECT β β
β β β β
β β ββββββββββββββββ ββββββββββββββββ β β
β β β Source β β Sink β β β
β β β Connector β β Connector β β β
β β β β β β β β
β β β β’ JDBC β β β’ S3 β β β
β β β β’ Debezium β β β’ Redshift β β β
β β β β’ DynamoDB β β β’ Elasticsearchβ β β
β β β β’ CloudWatch β β β’ HTTP β β β
β β ββββββββββββββββ ββββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
MSK Cluster Configuration
import boto3
msk = boto3.client('kafka')
# Create MSK cluster
response = msk.create_cluster(
ClusterName='data-streaming-cluster',
KafkaVersion='3.5.1',
NumberOfBrokerNodes=3,
BrokerNodeGroupInfo={
'InstanceType': 'k.m5.large',
'ClientSubnets': ['subnet-12345678', 'subnet-87654321', 'subnet-abcdef12'],
'SecurityGroups': ['sg-12345678'],
'StorageInfo': {
'EBSStorageInfo': {
'VolumeSize': 500,
'ProvisionedThroughput': {
'Enabled': True,
'VolumeThroughput': 250
}
}
},
'ConnectivityInfo': {
'PublicAccess': {'Type': 'DISABLED'}
}
},
EncryptionInfo={
'EncryptionInTransit': {
'ClientBroker': 'TLS',
'InCluster': True
},
'EncryptionAtRestKmsKeyArn': 'arn:aws:kms:us-east-1:123456789012:key/12345678'
},
ConfigurationInfo={
'Arn': 'arn:aws:kafka:us-east-1:123456789012:configuration/custom-config/12345678',
'Revision': 1
},
EnhancedMonitoring='PER_BROKER',
OpenMonitoring={
'Prometheus': {
'JmxExporter': {'EnabledInBroker': True},
'NodeExporter': {'EnabledInBroker': True}
}
},
LoggingInfo={
'BrokerLogs': {
'CloudWatchLogs': {'Enabled': True, 'LogGroup': '/aws/msk/cluster'},
'S3Logs': {'Enabled': True, 'Bucket': 'msk-logs-bucket'},
'Firehose': {'Enabled': False}
}
},
Tags={'Environment': 'production', 'Team': 'data-engineering'}
)
cluster_arn = response['ClusterArn']
print(f"Cluster ARN: {cluster_arn}")
MSK Serverless
# Create MSK Serverless cluster
response = msk_serverless.create_cluster(
clusterName='serverless-streaming',
kafkaVersion='3.4.0',
vpcConfigs=[
{
'subnetIds': ['subnet-12345678', 'subnet-87654321'],
'securityGroups': ['sg-12345678']
}
],
clientAuthentication={
'sasl': {
'iam': {'enabled': True}
}
},
tags={'Environment': 'production'}
)
βΉοΈ
Pro Tip: MSK Serverless is ideal for unpredictable workloads. It automatically scales based on throughput and charges per VPU-hour (Virtual Processing Unit).
MSK Connect
Connector Configuration
# Create MSK Connect connector (S3 Sink)
response = msk_connect.create_connector(
connectorName='s3-sink-connector',
kafkaConnectArn='arn:aws:kafkaconnect:us-east-1:123456789012:connector/custom-config/12345678',
connectorConfiguration={
'connector.class': 'io.confluent.connect.s3.S3SinkConnector',
'tasks.max': '3',
'topics.regex': 'raw-(.*)',
's3.bucket.name': 'data-lake-raw',
's3.region': 'us-east-1',
'flush.size': '1000',
'rotate.interval.ms': '300000',
'storage.class': 'io.confluent.connect.s3.storage.S3Partitioner',
'partitioner.class': 'io.confluent.connect.storage.partitioner.TimeBasedPartitioner',
'path.format': "'year'=YYYY/'month'=MM/'day'=dd",
'locale': 'en_US',
'timezone': 'UTC',
'format.class': 'io.confluent.connect.s3.format.parquet.ParquetFormat',
'parquet.codec': 'snappy',
'schema.compatibility': 'BACKWARD'
},
serviceExecutionRoleArn='arn:aws:iam::123456789012:role/MSKConnectRole',
kafkaConnectEndpoint='https://connect.us-east-1.amazonaws.com',
capacity={
'connectorProvisionedThroughput': {
'readCapacityPerSec': 100,
'writeCapacityPerSec': 100
}
},
tags={'Environment': 'production'}
)
Debezium CDC Connector
# Debezium source connector for CDC
debezium_config = {
'connector.class': 'io.debezium.connector.mysql.MySqlConnector',
'tasks.max': '1',
'database.hostname': 'rds-instance.cluster-123456.us-east-1.rds.amazonaws.com',
'database.port': '3306',
'database.user': 'debezium',
'database.password': 'secure-password',
'database.server.id': '1',
'database.server.name': 'production',
'database.include.list': 'production_db',
'table.include.list': 'production_db.customers,production_db.orders',
'database.history.kafka.bootstrap.servers': 'b-1.12345678.us-east-1.kafka.amazonaws.com:9092',
'database.history.kafka.topic': 'schema-changes.production',
'transforms': 'route',
'transforms.route.type': 'org.apache.kafka.connect.transforms.RegexRouter',
'transforms.route.regex': '([^.]+)\\.([^.]+)\\.([^.]+)',
'transforms.route.replacement': '$3'
}
ksqlDB
Architecture Diagram
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β ksqlDB ARCHITECTURE β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β KAFKA TOPICS βββΊ ksqlDB Server βββΊ Materialized Views β β
β β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β STREAMS (Unbounded, append-only) β β β
β β β β β β
β β β CREATE STREAM sensor_stream ( β β β
β β β sensor_id VARCHAR KEY, β β β
β β β temperature DOUBLE, β β β
β β β event_time TIMESTAMP β β β
β β β ) WITH ( β β β
β β β KAFKA_TOPIC = 'sensor-data', β β β
β β β VALUE_FORMAT = 'JSON' β β β
β β β ); β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β TABLES (Materialized, updatable) β β β
β β β β β β
β β β CREATE TABLE sensor_agg AS β β β
β β β SELECT sensor_id, β β β
β β β AVG(temperature) AS avg_temp, β β β
β β β COUNT(*) AS reading_count β β β
β β β FROM sensor_stream β β β
β β β WINDOW TUMBLING (5 MINUTES) β β β
β β β GROUP BY sensor_id; β β β
β β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
βΉοΈ
ksqlDB Use Cases: Real-time aggregation, filtering, enrichment, join streams with tables, sessionization, anomaly detection.
Interview Questions & Answers
Q1: What is the difference between MSK and self-managed Kafka?
Answer:
- MSK: AWS-managed, automatic patching, multi-AZ, monitoring built-in
- Self-managed: Full control, custom configurations, more operational overhead
MSK reduces operational burden but costs more. Self-managed gives more flexibility.
Q2: How does MSK handle data replication?
Answer:
- Default replication factor: 3 (across AZs)
- ISR (In-Sync Replicas) ensures durability
- Acknowledgment settings: acks=0/1/all
- Min In-Sync Replicas setting affects availability vs durability
Q3: When should you use MSK Serverless?
Answer:
- Variable/unpredictable workloads
- Development and testing
- Small to medium throughput requirements
- Want to avoid cluster management
Q4: What is the purpose of MSK Connect?
Answer: MSK Connect runs Kafka Connect connectors as a managed service:
- Source connectors: Capture changes from databases, files
- Sink connectors: Deliver data to S3, Redshift, Elasticsearch
- No infrastructure management
- Auto-scaling based on throughput
Q5: How do you monitor MSK cluster health?
Answer:
- CloudWatch metrics (CPU, memory, disk, network)
- Broker logs (CloudWatch, S3, Firehose)
- Open monitoring (Prometheus)
- MSK Console health dashboard
Summary
Amazon MSK is the managed Kafka service on AWS. Key takeaways:
- Architecture: Multi-AZ brokers with EBS storage
- Serverless: Auto-scaling, pay-per-use for variable workloads
- MSK Connect: Managed Kafka Connect for source/sink connectors
- ksqlDB: SQL for stream processing on Kafka
- Security: TLS encryption, IAM authentication, VPC connectivity
- Monitoring: CloudWatch, Open Monitoring, Broker Logs