πŸŽ‰ 75% of content is free forever β€” Unlock Premium from $10/mo β†’
CW
Search courses…
πŸ’Ό Servicesℹ️ Aboutβœ‰οΈ ContactView Pricing Plansfrom $10

CDC Patterns on AWS

AWS Data EngineeringDMS, Debezium & DynamoDB Streams⭐ Premium

Advertisement

πŸ”„ CDC Patterns on AWS

Master Change Data Capture with DMS, Debezium, and DynamoDB Streams.

Module: AWS Data Engineering β€’ Topic 23 of 65 β€’ Premium Content

CDC Architecture

Architecture Diagram
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    CDC PATTERNS ON AWS                                        β”‚
β”‚                                                                             β”‚
β”‚  OPTION 1: AWS DMS (Managed)                                                β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”            β”‚
β”‚  β”‚  Source  │───►│   DMS    │───►│   S3     │───►│ Redshift β”‚            β”‚
β”‚  β”‚  (RDS)   β”‚    β”‚ Replicatonβ”‚   β”‚  (Lake)  β”‚    β”‚ (WH)     β”‚            β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜            β”‚
β”‚                                                                             β”‚
β”‚  OPTION 2: Debezium on MSK (Self-managed)                                   β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”            β”‚
β”‚  β”‚  Source  │───►│ Debezium │───►│   MSK    │───►│ Lambda/  β”‚            β”‚
β”‚  β”‚  (MySQL) β”‚    β”‚ Connectorβ”‚    β”‚ (Kafka)  β”‚    β”‚ Glue     β”‚            β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜            β”‚
β”‚                                                                             β”‚
β”‚  OPTION 3: DynamoDB Streams                                                 β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”            β”‚
β”‚  β”‚ DynamoDB │───►│ Streams  │───►│ Lambda   │───►│ S3/other β”‚            β”‚
β”‚  β”‚  Table   β”‚    β”‚          β”‚    β”‚ Process  β”‚    β”‚          β”‚            β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

AWS DMS Configuration

import boto3

dms = boto3.client('dms')

# Create replication instance
response = dms.create_replication_instance(
    ReplicationInstanceIdentifier='cdc-replication-instance',
    ReplicationInstanceClass='dms.r5.large',
    AllocatedStorage=100,
    MultiAZ=True,
    VpcSecurityGroupIds=['sg-12345678'],
    ReplicationSubnetGroupIdentifier='my-subnet-group',
    PubliclyAccessible=False,
    EngineVersion='3.5.1',
    AutoMinorVersionUpgrade=True
)

# Create source endpoint (RDS MySQL)
source_endpoint = dms.create_endpoint(
    EndpointIdentifier='source-mysql',
    EndpointType='source',
    EngineName='mysql',
    ServerName='source-db.cluster-123456.us-east-1.rds.amazonaws.com',
    Port=3306,
    DatabaseName='production',
    Username='dms_user',
    Password='SecurePassword123!',
    SslMode='require'
)

# Create target endpoint (S3)
target_endpoint = dms.create_endpoint(
    EndpointIdentifier='target-s3',
    EndpointType='target',
    EngineName='s3',
    S3Settings={
        'BucketName': 'cdc-data-lake',
        'BucketFolder': 'raw/',
        'ServiceAccessRoleArn': 'arn:aws:iam::123456789012:role/DMSS3Role',
        'CompressionType': 'GZIP',
        'ExternalTableDefinition': '{"Version":1,"DataFormat":"parquet"}'
    }
)

# Create replication task with CDC
replication_task = dms.create_replication_task(
    ReplicationTaskIdentifier='cdc-task',
    SourceEndpointArn=source_endpoint['Endpoint']['EndpointArn'],
    TargetEndpointArn=target_endpoint['Endpoint']['EndpointArn'],
    ReplicationInstanceArn=response['ReplicationInstance']['ReplicationInstanceArn'],
    MigrationType='cdc',
    TableMappings='''{
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all-tables",
            "object-locator": {
                "schema-name": "production",
                "table-name": "%"
            },
            "rule-action": "include"
        }]
    }''',
    ReplicationTaskSettings='''{
        "TargetMetadata": {
            "TargetSchema": "",
            "SupportLobs": true,
            "FullLobMode": false,
            "LobChunkSize": 64,
            "LimitedSizeLobMode": true,
            "LobMaxSize": 32
        },
        "FullLoadSettings": {
            "TargetTablePrepMode": "DROP_AND_CREATE",
            "CreatePkAfterFullLoad": true
        },
        "Logging": {
            "EnableLogging": true,
            "LogComponents": [
                {"Id": "SOURCE_UNLOAD", "Severity": "LOGGER_SEVERITY_DEFAULT"},
                {"Id": "TARGET_LOAD", "Severity": "LOGGER_SEVERITY_DEFAULT"},
                {"Id": "SOURCE_CAPTURE", "Severity": "LOGGER_SEVERITY_DEFAULT"},
                {"Id": "TARGET_APPLY", "Severity": "LOGGER_SEVERITY_DEFAULT"}
            ]
        }
    }'''
)

ℹ️

Pro Tip: DMS CDC uses the source database's transaction log (MySQL binlog, PostgreSQL WAL) to capture changes with minimal impact on source performance.

Debezium on MSK

{
  "name": "mysql-cdc-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "tasks.max": "1",
    "database.hostname": "source-db.cluster-123456.us-east-1.rds.amazonaws.com",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "secure-password",
    "database.server.id": "184054",
    "database.server.name": "production",
    "database.include.list": "production_db",
    "database.history.kafka.bootstrap.servers": "b-1.msk-cluster:9092",
    "database.history.kafka.topic": "schema-changes.production",
    "decimal.handling.mode": "double",
    "time.precision.mode": "connect",
    "transforms": "route,unwrap",
    "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.route.regex": "([^.]+)\\.([^.]+)\\.([^.]+)",
    "transforms.route.replacement": "$3",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState"
  }
}

Interview Q&A

Q1: DMS vs Debezium?

Answer: DMS is AWS-managed, simpler setup, limited transformations. Debezium is open-source, more flexible, runs on MSK. DMS for quick migrations; Debezium for complex streaming architectures.

Q2: What are the three CDC modes in DMS?

Answer: 1) Full load only, 2) Full load + CDC, 3) CDC only. Use full load + CDC for initial migration with ongoing sync.

Q3: How do DynamoDB Streams work?

Answer: Streams capture item-level modifications (INSERT, MODIFY, REMOVE) with 24-hour retention. Lambda or Kinesis can consume stream records.

Summary

  • DMS: AWS-managed CDC, easy setup, supports most databases
  • Debezium: Open-source, runs on MSK, more flexible
  • DynamoDB Streams: Built-in CDC for DynamoDB tables
  • Best Practice: Use transaction log-based CDC for minimal source impact

Advertisement