Change Data Capture on GCP

Master Change Data Capture on GCP including Datastream, Debezium, Firestore exports, and real-time replication patterns.

16 min readAdvanced

CDC Architecture on GCP

📊 BigQuery Architecture for Data Engineering

Interview Tip: BigQuery separates storage and compute. Queries are charged by slots (compute) + bytes scanned. Always partition and cluster tables to reduce costs.

Datastream Implementation

from google.cloud import datastream_v1

client = datastream_v1.DatastreamClient()

# Create connection profile for source database
connection_profile = client.create_connection_profile(
    request={
        "parent": "projects/my-project/locations/us-central1",
        "connection_profile_id": "mysql-source",
        "connection_profile": {
            "display_name": "MySQL Source",
            "mysql_profile": {
                "hostname": "10.0.0.1",
                "port": 3306,
                "username": "cdc_user",
                "password": "secure-password"
            }
        }
    }
)

# Create stream
stream = client.create_stream(
    request={
        "parent": "projects/my-project/locations/us-central1",
        "stream_id": "mysql-to-bigquery",
        "stream": {
            "display_name": "MySQL to BigQuery CDC",
            "source_config": {
                "source_connection_profile": connection_profile.name,
                "mysql_source_config": {
                    "include_databases": ["production"],
                    "include_objects": {
                        "production": {
                            "tables": ["users", "orders", "products"]
                        }
                    },
                    "max_concurrent_cdc_tasks": 4,
                    "max_concurrent_full_snapshot_tasks": 2
                }
            },
            "destination_config": {
                "destination_connection_profile": "projects/my-project/locations/us-central1/connectionProfiles/bigquery-destination",
                "bigquery_destination_config": {
                    "single_target_dataset": {
                        "dataset_id": "cdc_replicated"
                    }
                }
            }
        }
    }
)

Debezium on Dataproc

# Debezium connector configuration for Kafka
debezium_config = {
    "name": "mysql-cdc-connector",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "10.0.0.1",
        "database.port": "3306",
        "database.user": "cdc_user",
        "database.password": "secure-password",
        "database.server.id": "1",
        "database.include.list": "production",
        "table.include.list": "production.users,production.orders",
        "database.history.kafka.bootstrap.servers": "kafka:9092",
        "database.history.kafka.topic": "schema-changes",
        "transforms": "route",
        "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
        "transforms.route.regex": "([^.]+)\\.([^.]+)\\.([^.]+)",
        "transforms.route.replacement": "cdc-$3"
    }
}

✨

Best Practice: Use Datastream for managed CDC to BigQuery. It handles log reading, schema mapping, and error recovery automatically. For custom requirements, use Debezium with Kafka on Dataproc. Always monitor CDC lag and implement alerts for replication delays.

💬

Common Interview Questions

Q1: What is CDC and why is it important?

Answer: CDC (Change Data Capture) captures and streams database changes (INSERT, UPDATE, DELETE) in real-time. It's important for: 1) Real-time analytics, 2) Data synchronization across systems, 3) Event-driven architectures, 4) Minimal impact on source databases.

Q2: What is the difference between Datastream and Debezium?

Answer: Datastream is a managed GCP service for CDC to BigQuery/GCS with minimal setup. Debezium is open-source, runs on Kafka, and supports more databases/customizations. Use Datastream for simplicity, Debezium for complex requirements or multi-cloud.

Q3: How do you handle schema changes in CDC?

Answer: Datastream handles additive schema changes automatically. For non-additive changes, pause the stream, update the schema, and resume. Debezium uses schema history topics for tracking. Always test schema changes in non-production first.

Q4: What are the limitations of CDC?

Answer: 1) Log-based CDC requires database with transaction logs, 2) Schema changes need careful handling, 3) Large backfills may impact source, 4) Network latency affects replication lag, 5) Some operations (TRUNCATE) may not be captured.

Q5: How do you monitor CDC health?

Answer: 1) Monitor replication lag (lag < 1 minute is ideal), 2) Track error rates, 3) Monitor source database performance, 4) Alert on schema change events, 5) Track throughput (rows/second), 6) Monitor destination write latency.

CDC Patterns: Datastream, Debezium & Firestore Exports