Apache Kafka Architecture: A Deep Dive into Distributed Event Streaming

Architecture Diagram: Partition Distribution & Replication

Architecture Diagram: Consumer Group Coordination

Architecture Diagram: Log Segment Internals

Why Kafka? The Industry Context

Apache Kafka was developed at LinkedIn and open-sourced in 2011 to solve a critical problem: the company was generating 1 billion+ events per day from user activity logs, and existing message queues (RabbitMQ, ActiveMQ) couldn't handle the throughput. Its creators, Jay Kreps, Neha Narkhede, and Jun Rao, designed it as a distributed commit log: a data structure that combines the throughput of a log file with the reliability of a message queue.

The Problem Kafka Solves

Legacy Pain Point	Traditional Approach	Kafka's Solution
Throughput at scale	Message queues (RabbitMQ: 50K msg/s)	100K+ msg/s per broker, horizontally scalable
Data retention	Queue deletes after consumption	Retain data for days/weeks/months
Replay capability	No replay (fire-and-forget)	Read from any offset, replay entire history
Multiple consumers	Point-to-point (1 consumer per message)	Publish-subscribe (N consumers per message)
Ordering guarantees	Limited (per-queue)	Per-partition ordering with keys
Decoupling producers/consumers	Tightly coupled APIs	Schema registry + topic abstraction

Who Uses Kafka in Production?

LinkedIn: 8 trillion messages/day, 7PB of data daily — Kafka's birthplace
Netflix: 700 billion events/day for personalization, recommendations, and monitoring
Uber: 3 million events/second for real-time ride matching and pricing
Airbnb: 2.5 billion events/day for search, pricing, and fraud detection
Goldman Sachs: Processes trillions in daily trading volume through Kafka
Apple: Uses Kafka for Siri voice processing, Apple Music, and iCloud sync

Kafka vs. The Competition

Feature	Apache Kafka	RabbitMQ	Amazon Kinesis	Apache Pulsar
Throughput	100K+ msg/s/broker	50K msg/s	1MB/s/shard	100K+ msg/s/broker
Retention	Unlimited (disk-based)	Until consumed	24 hours (default)	Unlimited
Replay	Yes (offset-based)	No	Yes (24h window)	Yes
Ordering	Per-partition	Per-queue	Per-shard	Per-partition
Multi-tenancy	Topic-level isolation	Vhost-based	Account-level	Tenant/namespace
Operational complexity	Medium (ZK/KRaft)	Low	Low (managed)	High

Key Insight: Kafka isn't just a message queue — it's a distributed event streaming platform. The difference: message queues are for sending messages between services, while Kafka is for building real-time data pipelines that can be replayed, reprocessed, and consumed by multiple independent systems simultaneously.

Real-World Case Study: E-commerce Event Streaming

An e-commerce platform processes 50 million events per day across web, mobile, and backend services. Before Kafka:

Data latency: Batch ETL ran every 4 hours — dashboards showed 4-hour-old data
Event loss: RabbitMQ dropped 0.1% of events during traffic spikes
Replay impossible: When a bug corrupted the user_actions queue, there was no way to replay historical data
Consumer coupling: Adding a new analytics consumer required modifying the producer

After implementing Kafka:

Architecture Diagram

Architecture:
  Producers: Web (React), Mobile (iOS/Android), Backend (Node.js, Java)
  Topics: user-actions, orders, payments, inventory, notifications
  Consumers: Real-time analytics, Recommendation engine, Fraud detection, Data warehouse

Key design decisions:

Decision	Rationale	Impact
12 partitions per topic	3 brokers × 4 partitions = even distribution	100K msg/s sustained throughput
Replication factor 3	Tolerate 1 broker failure with zero data loss	99.999% availability
`acks=all`, `min.insync.replicas=2`	Strongest durability guarantee	Zero data loss in 2 years
`retention.ms=604800000` (7 days)	Replay window for bug fixes and reprocessing	Saved 3 incidents from data corruption
Schema Registry (Avro)	Schema evolution without breaking consumers	Added 5 new fields without downtime

Results:

Data freshness: 4 hours → 200 milliseconds (72,000x improvement)
Event delivery: 99.9% → 99.999% (100x reduction in lost events)
New consumer onboarding: 2 weeks → 2 hours (topic discovery vs. API modification)
Incident recovery: 8 hours (restore from backup) → 10 minutes (replay from Kafka)

Common Kafka Failures and How to Handle Them

Failure	Symptoms	Root Cause	Solution
Consumer lag growing	Lag metric increases monotonically	Consumer can't keep up with production rate	Add consumers (up to partition count), optimize processing
Under-replicated partitions	`UnderReplicatedPartitions` > 0	Broker failure or network partition	Check broker health, verify ISR size ≥ `min.insync.replicas`
Producer `NOT_ENOUGH_REPLICAS`	Producer throws exception	ISR size dropped below `min.insync.replicas`	Wait for replica to rejoin, or temporarily reduce `min.insync.replicas`
Consumer `OffsetOutOfRangeException`	Consumer can't find offset	Data was deleted due to retention policy	Reset offset to `earliest` or `latest`, investigate retention needs
Broker OOM	Broker crashes repeatedly	JVM heap exhausted (for example, too many partitions or oversized requests per broker)	Increase the JVM heap via `KAFKA_HEAP_OPTS`, reduce the number of partitions per broker, or add brokers
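
The consumer lag row in the table above comes down to a simple subtraction per partition. The sketch below is a minimal illustration rather than a Kafka API: the function name `consumer_lag` and the sample offsets are hypothetical, and lag is taken as log end offset minus committed offset.

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Return (per-partition lag, total lag) for one consumer group.

    Both arguments map partition number -> offset. A partition with no
    committed offset is treated as fully unread (offset 0), which is an
    assumption of this sketch.
    """
    per_partition = {
        partition: max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in log_end_offsets.items()
    }
    return per_partition, sum(per_partition.values())


# Hypothetical sample values for a 3-partition topic
lag, total = consumer_lag(
    log_end_offsets={0: 1500, 1: 1480, 2: 1510},
    committed_offsets={0: 1500, 1: 1200, 2: 900},
)
print(lag, total)  # {0: 0, 1: 280, 2: 610} 890
```

A total that keeps growing between samples, rather than a large value at one instant, is the signal that consumers are falling behind.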

Formal Definitions

Detailed Explanation

Apache Kafka is a distributed event streaming platform capable of handling trillions of events per day with end-to-end latencies in the low milliseconds. At its core, Kafka operates as a distributed commit log, where data is persisted in an append-only fashion across a cluster of brokers. The fundamental architectural principle is decoupling of producers and consumers through a publish-subscribe model that enables horizontal scalability while maintaining strict ordering guarantees within partitions.

Broker Layer Deep Dive

The Broker Layer consists of a cluster of servers (brokers) that each maintain partition data and serve client requests. Each broker has a unique identifier (broker.id) and listens on a configured port (default: 9092). The brokers communicate with each other for replication and with ZooKeeper (or KRaft controller) for cluster metadata management. When a producer sends a message, the broker acknowledges receipt based on the configured acknowledgment policy (acks=0, 1, or all). The broker stores the message in its local commit log, which is a sequence of files on disk optimized for sequential I/O operations. Each partition is a separate directory containing multiple log segment files, with only the active segment accepting writes.

The broker's network layer uses Java NIO with a configurable thread pool (num.network.threads, default 3) for handling socket I/O, and a separate I/O thread pool (num.io.threads, default 8) for disk operations. This two-tier threading model allows Kafka to handle high concurrent connections while maintaining efficient disk utilization. The broker also maintains an in-memory cache of recently accessed index entries to accelerate random offset lookups.

Topic Partitioning and Distribution

Topics serve as logical categories or feed names for messages. Each topic is divided into multiple partitions, which are the unit of parallelism in Kafka. The number of partitions determines the maximum number of consumer instances that can read from the topic in parallel within a consumer group. When a topic is created, the controller spreads partition replicas across brokers in round-robin order (Kafka does not use consistent hashing for placement), which distributes load evenly. Each partition maintains its own commit log, and messages within a partition are assigned monotonically increasing offsets. This design enables Kafka to achieve extremely high throughput by parallelizing both writes (across partitions) and reads (via consumer groups).
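
The way a message key selects a partition, and why records with the same key stay in order, can be sketched in a few lines. This is an illustration under stated assumptions: `partition_for` is a hypothetical helper, and CRC32 stands in for the murmur2 hash that Kafka's Java client uses, so the partition numbers here will not match a real cluster.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition as hash(key) % num_partitions.

    CRC32 is a stand-in hash for this sketch; it is not what Kafka uses.
    """
    return zlib.crc32(key) % num_partitions


keys = [f"order-{i}".encode("utf-8") for i in range(6)]
first_pass = [partition_for(k, 12) for k in keys]
second_pass = [partition_for(k, 12) for k in keys]

# The mapping is deterministic, so a given key always lands on one partition
print(first_pass == second_pass)  # True
```

Because the result depends on `num_partitions`, adding partitions to an existing topic can route an existing key to a different partition, which is worth considering before resizing a keyed topic.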

Partition assignment to brokers can be rack-aware: when broker.rack is configured, replicas of a partition are spread across racks or availability zones for fault tolerance. The controller broker maintains the partition assignment map in memory and propagates changes via ZooKeeper or the KRaft metadata log. When a new broker joins the cluster, existing partitions are not moved automatically; an operator must reassign them (for example with kafka-reassign-partitions.sh) before the new broker carries load.
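
A simplified picture of replica placement may help here. The sketch below is an assumption-laden illustration, not Kafka's actual algorithm: `assign_replicas` is a hypothetical function that places replicas in plain round-robin order and ignores rack awareness and the randomized starting broker.

```python
from collections import Counter


def assign_replicas(broker_ids, num_partitions, replication_factor):
    """Return {partition: [broker, ...]}; the first broker is the preferred leader."""
    if replication_factor > len(broker_ids):
        raise ValueError("replication factor cannot exceed the number of brokers")
    n = len(broker_ids)
    return {
        p: [broker_ids[(p + r) % n] for r in range(replication_factor)]
        for p in range(num_partitions)
    }


assignment = assign_replicas([0, 1, 2], num_partitions=12, replication_factor=3)
leaders = Counter(replicas[0] for replicas in assignment.values())
print(assignment[0], assignment[1])  # [0, 1, 2] [1, 2, 0]
print(dict(leaders))                 # {0: 4, 1: 4, 2: 4}
```

With 3 brokers and 12 partitions, each broker ends up as preferred leader for 4 partitions, which matches the even distribution described in the case study.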

Replication and Fault Tolerance

Replication provides fault tolerance and high availability. Each partition has a configurable replication factor (typically 3 for production), meaning data is replicated across multiple brokers. One broker is elected as the partition leader, handling all reads and writes for that partition. Follower replicas continuously pull data from the leader to stay synchronized. The set of in-sync replicas (ISR) includes all replicas that are fully caught up with the leader. If a follower falls too far behind (controlled by replica.lag.time.max.ms), it is removed from the ISR. When the leader fails, a new leader is elected from the ISR, ensuring data consistency. The ISR mechanism balances durability guarantees with availability, as the leader can continue serving requests as long as the minimum ISR size is maintained.
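
The interaction between the ISR, leader election, and min.insync.replicas can be modeled with a small state object. This is a minimal sketch under assumptions: the `Partition` class and its method names are hypothetical, unclean leader election is treated as disabled, and real brokers track far more state.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Partition:
    replicas: List[int]        # assigned brokers, in preference order
    isr: Set[int]              # brokers currently in sync
    leader: Optional[int]      # None means the partition is offline
    min_insync_replicas: int = 2

    def accepts_acks_all(self) -> bool:
        """A write with acks=all succeeds only if the ISR is large enough."""
        return self.leader is not None and len(self.isr) >= self.min_insync_replicas

    def fail_broker(self, broker_id: int) -> None:
        """Remove a failed broker and, if it led, elect a new leader from the ISR."""
        self.isr.discard(broker_id)
        if broker_id == self.leader:
            survivors = [b for b in self.replicas if b in self.isr]
            self.leader = survivors[0] if survivors else None


p = Partition(replicas=[1, 2, 3], isr={1, 2, 3}, leader=1)
p.fail_broker(1)
after_one_failure = (p.leader, p.accepts_acks_all())    # (2, True)
p.fail_broker(2)
after_two_failures = (p.leader, p.accepts_acks_all())   # (3, False)
print(after_one_failure, after_two_failures)
```

With replication factor 3 and min.insync.replicas=2, one failure leaves the partition fully writable, while a second failure keeps it readable but rejects acks=all writes.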

The replication protocol uses a pull model where followers fetch from leaders via the Fetch API. The leader tracks each follower's fetch position and adjusts the ISR accordingly. Replication latency depends on how long a follower's fetch request may wait for new data (replica.fetch.wait.max.ms), how much data each fetch can return (replica.fetch.max.bytes), and network conditions between brokers.
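
How the leader might decide which followers stay in the ISR can be sketched as a time-window check. The function `shrink_isr` and its arguments are hypothetical; the 30-second default mirrors replica.lag.time.max.ms in recent Kafka versions, and the real broker logic is more involved.

```python
def shrink_isr(leader, isr, last_caught_up_ms, now_ms, replica_lag_time_max_ms=30_000):
    """Return the ISR after dropping followers that have lagged for too long.

    last_caught_up_ms maps follower id -> last time it was fully caught up.
    The leader is always considered in sync with itself.
    """
    return {
        broker
        for broker in isr
        if broker == leader
        or now_ms - last_caught_up_ms[broker] <= replica_lag_time_max_ms
    }


new_isr = shrink_isr(
    leader=1,
    isr={1, 2, 3},
    last_caught_up_ms={2: 95_000, 3: 60_000},  # hypothetical timestamps
    now_ms=100_000,
)
print(sorted(new_isr))  # [1, 2]
```

Broker 3 was last caught up 40 seconds ago, which exceeds the 30-second window, so it drops out until it catches up again.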

Consumer Group Coordination

Consumer Groups enable load-balanced consumption across multiple consumers. Each consumer in a group is assigned a subset of partitions, ensuring that each partition is consumed by exactly one consumer within the group. The Group Coordinator (a broker) manages the group membership and triggers rebalancing when consumers join or leave. The rebalancing process uses protocols like RangeAssignor, RoundRobinAssignor, or CooperativeStickyAssignor to distribute partitions. Consumer offsets are tracked in the __consumer_offsets topic, allowing consumers to resume from their last committed position. This design enables horizontal scaling of consumption while maintaining ordering guarantees within partitions.
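
The difference between range and round-robin assignment is easiest to see on a small example. The functions below are simplified sketches written for this article, not the client's assignor classes; they assume every consumer subscribes to every topic.

```python
def range_assign(partitions_per_topic, consumers):
    """Per topic, give each consumer a contiguous range of partitions."""
    consumers = sorted(consumers)
    result = {c: [] for c in consumers}
    for topic in sorted(partitions_per_topic):
        base, extra = divmod(partitions_per_topic[topic], len(consumers))
        start = 0
        for i, consumer in enumerate(consumers):
            size = base + (1 if i < extra else 0)
            result[consumer] += [(topic, p) for p in range(start, start + size)]
            start += size
    return result


def round_robin_assign(partitions_per_topic, consumers):
    """Deal all topic-partitions out to consumers one at a time."""
    consumers = sorted(consumers)
    result = {c: [] for c in consumers}
    all_partitions = [
        (topic, p)
        for topic in sorted(partitions_per_topic)
        for p in range(partitions_per_topic[topic])
    ]
    for i, tp in enumerate(all_partitions):
        result[consumers[i % len(consumers)]].append(tp)
    return result


topics = {"order-events": 3, "payment-events": 3}
by_range = range_assign(topics, ["c1", "c2"])
by_round_robin = round_robin_assign(topics, ["c1", "c2"])
print({c: len(tps) for c, tps in by_range.items()})        # {'c1': 4, 'c2': 2}
print({c: len(tps) for c, tps in by_round_robin.items()})  # {'c1': 3, 'c2': 3}
```

Range assignment works topic by topic, so the first consumer collects the leftover partition of every topic; round-robin spreads the same six partitions evenly.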

The heartbeat protocol (heartbeat.interval.ms, default 3s) ensures liveness detection. If a consumer fails to send a heartbeat within session.timeout.ms (default 45s), it is considered dead and its partitions are reassigned. The consumer poll loop must complete within max.poll.interval.ms (default 300s) to avoid being evicted from the group.
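
The two timeouts guard against different failures, which the following sketch tries to make explicit. `consumer_status` is a hypothetical helper; in a real client the heartbeat check happens on the coordinator and the poll-interval check on the consumer itself, a distinction this single function glosses over.

```python
def consumer_status(now_ms, last_heartbeat_ms, last_poll_ms,
                    session_timeout_ms=45_000, max_poll_interval_ms=300_000):
    """Classify a consumer using the defaults quoted in the text above."""
    if now_ms - last_heartbeat_ms > session_timeout_ms:
        return "dead: no heartbeat within session.timeout.ms"
    if now_ms - last_poll_ms > max_poll_interval_ms:
        return "evicted: poll loop exceeded max.poll.interval.ms"
    return "alive"


# Process crashed: heartbeats stopped 50 s ago
print(consumer_status(now_ms=400_000, last_heartbeat_ms=350_000, last_poll_ms=350_000))
# Process is up and heartbeating, but one batch has taken 6 minutes to handle
print(consumer_status(now_ms=400_000, last_heartbeat_ms=399_000, last_poll_ms=40_000))
# Healthy consumer
print(consumer_status(now_ms=400_000, last_heartbeat_ms=399_000, last_poll_ms=399_900))
```

The second case is the common one in practice: the heartbeat thread keeps running while record processing stalls, so only max.poll.interval.ms catches it.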

Storage Layer Internals

The Storage Layer uses a highly optimized append-only log structure. Each log segment consists of three files: .log (message data), .index (offset index), and .timeindex (timestamp index). The index files enable efficient random access to specific offsets without scanning the entire log. Kafka uses a page cache aggressively, leveraging the operating system's memory management for read performance. The log compaction feature allows Kafka to retain only the latest value for each key, enabling topics to serve as a materialized view of the latest state. Retention policies can be time-based, size-based, or compacted, providing flexibility in data lifecycle management.
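
The role of the sparse offset index can be illustrated with a binary search. This is a sketch under assumptions: `find_start_position` and the sample entries are hypothetical, and a real .index file stores offsets relative to the segment's base offset in a compact binary format.

```python
import bisect


def find_start_position(index, target_offset):
    """Return the file position from which to scan for target_offset.

    index is a sorted list of (offset, file_position) pairs. It is sparse:
    only some offsets have entries, so the lookup returns the position of
    the nearest entry at or below the target.
    """
    offsets = [offset for offset, _ in index]
    i = bisect.bisect_right(offsets, target_offset) - 1
    if i < 0:
        return 0  # target precedes the first entry: scan from segment start
    return index[i][1]


# Hypothetical index with one entry roughly every 4096 bytes
sparse_index = [(0, 0), (50, 4096), (100, 8192)]
print(find_start_position(sparse_index, 75))   # 4096
print(find_start_position(sparse_index, 100))  # 8192
```

The broker then reads forward from the returned position in the .log file until it reaches the requested offset, so a lookup costs one binary search plus a short sequential scan.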

Kafka's storage design achieves near-zero-copy transfer from disk to network using sendfile() on Linux. Record batches from producers are appended sequentially to the active segment, and when consumers fetch, the broker hands the bytes to the socket with FileChannel.transferTo(), which maps to sendfile(). The OS page cache handles read-ahead for sequential access patterns. The offset and time index files are memory-mapped for efficient random access.
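
The zero-copy idea can be tried out with Python's standard library, which exposes the same system call. This is only an analogy to what the broker does in Java: the helper name, the temporary file, and the local socket pair are all assumptions made for the demonstration.

```python
import os
import socket
import tempfile


def send_segment_zero_copy(path, sock):
    """Send a file over a socket; socket.sendfile() uses os.sendfile() where available."""
    with open(path, "rb") as segment:
        return sock.sendfile(segment)


# Write a small stand-in for a log segment
payload = b"record-batch-bytes" * 100
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    segment_path = tmp.name

sender, receiver = socket.socketpair()
try:
    sent = send_segment_zero_copy(segment_path, sender)
    sender.close()  # signal end of stream to the receiver
    received = b""
    while True:
        chunk = receiver.recv(4096)
        if not chunk:
            break
        received += chunk
finally:
    receiver.close()
    os.unlink(segment_path)

print(sent == len(payload), received == payload)
```

The application never reads the file contents into its own buffers; the kernel moves the bytes from the page cache to the socket, which is the saving Kafka relies on for consumer fetches.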

Key Formulas

Key Concepts Table

Component	Description	Default Value	Production Recommendation
Broker	Individual Kafka server node	Port 9092	3+ nodes per cluster
Topic	Logical message category	N/A	Use kebab-case naming
Partition	Unit of parallelism within topic	1	6-12 per topic (based on throughput)
Replica	Copy of partition for fault tolerance	1	3 (min.insync.replicas=2)
ISR	In-Sync Replicas	All replicas	All replicas minus lagging ones
Consumer Group	Set of consumers sharing workload	N/A	One per microservice
Offset	Sequential message position	0	Auto-committed or manual
ZooKeeper	Cluster coordination service	Port 2181	3 or 5 node ensemble
Controller	Broker managing cluster metadata	Dynamic election	Dedicated controller broker
Segment	Physical log file	1GB	Default (tunable)
Log Compaction	Retain latest value per key	cleanup.policy=delete	Use for state topics
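
The log compaction row above can be made concrete with a toy log. The `compact` function is a hypothetical sketch: it keeps the newest record per key and drops keys whose newest value is a tombstone (None), whereas a real broker retains tombstones for a configurable period before removing them.

```python
def compact(records):
    """Compact a list of (offset, key, value) records.

    Surviving records keep their original offsets, so offsets in a
    compacted log are increasing but no longer contiguous.
    """
    latest = {}
    for offset, key, value in records:
        latest[key] = (offset, key, value)
    return sorted(r for r in latest.values() if r[2] is not None)


log = [
    (0, "a", "1"),
    (1, "b", "2"),
    (2, "a", "3"),   # supersedes offset 0
    (3, "b", None),  # tombstone: deletes key "b"
    (4, "c", "5"),
]
print(compact(log))  # [(2, 'a', '3'), (4, 'c', '5')]
```

A consumer that reads the compacted topic from the beginning ends up with the latest value for every live key, which is what makes such topics usable as a materialized view.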

Code Examples

Creating a Topic with Advanced Configuration

# Create topic with specific configurations
# --bootstrap-server: comma-separated list of broker addresses for initial connection
# --topic: name of the topic (kebab-case recommended)
# --partitions: number of partitions (determines max parallelism)
# --replication-factor: number of replicas per partition (3 for production)
# --config: topic-level overrides for server properties

kafka-topics.sh --create \
  --bootstrap-server kafka-broker-0:9092,kafka-broker-1:9092,kafka-broker-2:9092 \
  --topic order-events \
  --partitions 12 \
  --replication-factor 3 \
  --config min.insync.replicas=2 \
  --config max.message.bytes=10485760 \
  --config retention.ms=604800000 \
  --config cleanup.policy=delete \
  --config compression.type=lz4 \
  --config segment.bytes=1073741824 \
  --config flush.messages=10000 \
  --config index.interval.bytes=4096 \
  --config unclean.leader.election.enable=false

# min.insync.replicas=2: minimum replicas that must acknowledge a write before it's acknowledged to producer
# max.message.bytes=10485760: maximum message size (10MB) per batch
# retention.ms=604800000: retain data for 7 days (604800000ms)
# cleanup.policy=delete: delete segments after retention expires (alternative: compact)
# compression.type=lz4: compression algorithm for batch encoding (none, gzip, snappy, lz4, zstd)
# segment.bytes=1073741824: roll log segment after 1GB (1073741824 bytes)
# flush.messages=10000: flush to disk after every 10000 messages (durability vs performance)
# index.interval.bytes=4096: add index entry every 4096 bytes for offset lookup
# unclean.leader.election.enable=false: prevent out-of-sync replica from becoming leader
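
The numeric values in the command above are easier to audit when derived rather than typed by hand. The following is a small arithmetic sketch; the variable names simply mirror the topic settings.

```python
from datetime import timedelta

# retention.ms: 7 days expressed in milliseconds
retention_ms = int(timedelta(days=7).total_seconds() * 1000)

# segment.bytes: 1 GiB
segment_bytes = 1024 ** 3

# max.message.bytes: 10 MiB
max_message_bytes = 10 * 1024 ** 2

print(retention_ms, segment_bytes, max_message_bytes)
# 604800000 1073741824 10485760
```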

Cluster Monitoring Script

#!/bin/bash
# Monitor Kafka cluster health and partition distribution
# Uses kafka CLI tools for comprehensive cluster visibility

BOOTSTRAP_SERVER="kafka-broker-0:9092"

echo "=== KAFKA CLUSTER STATUS ==="
echo "Timestamp: $(date)"
echo ""

# List all brokers and their API versions
# Shows available APIs and supported versions for client compatibility
echo "--- Active Brokers ---"
kafka-broker-api-versions.sh --bootstrap-server $BOOTSTRAP_SERVER | head -20

echo ""
echo "--- Topic Partition Distribution ---"
# Describe shows partition leaders, ISRs, and replica placement
# Use --topic to filter to specific topic
kafka-topics.sh --describe --bootstrap-server $BOOTSTRAP_SERVER --topic order-events

echo ""
echo "--- Consumer Group Lag ---"
# Lag = log_end_offset - committed_offset per partition
# High lag indicates consumer cannot keep up with production rate
kafka-consumer-groups.sh --bootstrap-server $BOOTSTRAP_SERVER \
  --group order-processing-service \
  --describe

echo ""
echo "--- Cluster Config ---"
# Show dynamic cluster-wide default broker configs (static server.properties values are not listed)
kafka-configs.sh --bootstrap-server $BOOTSTRAP_SERVER \
  --entity-type brokers \
  --entity-default \
  --describe

echo ""
echo "--- Under-replicated Partitions ---"
# Under-replicated = ISR size < replication factor
# Indicates broker failure or network partition
# --under-replicated-partitions prints only affected partitions (no output = healthy)
kafka-topics.sh --describe --bootstrap-server $BOOTSTRAP_SERVER \
  --under-replicated-partitions

echo ""
echo "--- Active Controller ---"
# Exactly one controller should be active at any time
# KRaft clusters (Kafka 3.3+): describe the metadata quorum, including the current leader
kafka-metadata-quorum.sh --bootstrap-server "$BOOTSTRAP_SERVER" describe --status 2>/dev/null || \
  echo "Check controller via JMX: kafka.controller:type=KafkaController,name=ActiveControllerCount"

Java Producer Configuration with Detailed Comments

import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;
import java.util.concurrent.Future;

public class KafkaProducerExample {
    
    public static void main(String[] args) {
        Properties props = new Properties();
        
        // Core configurations
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, 
            "kafka-broker-0:9092,kafka-broker-1:9092,kafka-broker-2:9092");
        // Comma-separated broker list for initial metadata fetch;
        // client discovers all brokers from metadata response
        
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, 
            StringSerializer.class.getName());
        // Serializer for message keys; determines partition routing
        // Options: StringSerializer, LongSerializer, ByteArraySerializer
        
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, 
            StringSerializer.class.getName());
        // Serializer for message values; independent of key serializer
        
        // Reliability configurations
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // acks=0: no acknowledgment (fastest, no durability)
        // acks=1: leader acknowledgment only (moderate durability)
        // acks=all: all ISR replicas acknowledge (highest durability)
        
        props.put(ProducerConfig.RETRIES_CONFIG, 10);
        // Number of retries on transient failures (e.g., LEADER_NOT_AVAILABLE)
        // With idempotence enabled, retries are unlimited by default
        
        props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 100);
        // Delay between retries to avoid overwhelming broker
        // Kafka 3.7+ clients back off exponentially up to retry.backoff.max.ms;
        // earlier clients wait this fixed interval between attempts
        
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5);
        // Max unacknowledged requests per broker connection
        // >1 enables pipelining but may cause reordering without idempotence
        // With idempotence, ordering is preserved even with value > 1
        
        // Idempotent producer for exactly-once semantics
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        // Prevents duplicate messages during retries
        // Assigns Producer ID (PID) and per-partition sequence numbers
        // Requires acks=all, retries > 0, and
        // max.in.flight.requests.per.connection <= 5 (conflicting values fail at startup)
        
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "order-producer-1");
        // Unique identifier for transactional producer
        // Enables atomic writes across multiple partitions/topics
        // Must be unique per producer instance (fencing via epoch)
        
        // Performance configurations
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);
        // Maximum batch size in bytes (64KB)
        // Batch is sent when full OR linger.ms expires, whichever first
        // Larger batches improve throughput, increase latency
        
        props.put(ProducerConfig.LINGER_MS_CONFIG, 10);
        // Time to wait before sending batch even if not full
        // 0 = send immediately (lowest latency, lowest throughput)
        // 10-50ms = good balance for most workloads
        
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 33554432);
        // Total memory for buffering records (32MB)
        // Records waiting to be sent to broker
        // If buffer is full, send() blocks until space is available
        
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        // Compression algorithm applied to batches
        // none: no compression (fastest, largest payload)
        // gzip: best compression ratio, high CPU usage
        // snappy: moderate compression, low CPU
        // lz4: good compression, fast decompression (recommended)
        // zstd: best compression ratio, moderate CPU (Kafka 2.1+)
        
        // Monitoring
        props.put(ProducerConfig.METRICS_NUM_SAMPLES_CONFIG, 10);
        // Number of samples to keep for metric calculations
        
        props.put(ProducerConfig.METRICS_SAMPLE_WINDOW_MS_CONFIG, 30000);
        // Time window for metric aggregation (30 seconds)
        
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        
        // Initialize transactions (required if transactional.id is set)
        // Acquires Producer ID and epoch from Transaction Coordinator
        producer.initTransactions();
        
        try {
            // Begin transaction for atomic multi-partition writes
            // All sends until commitTransaction() are atomic
            producer.beginTransaction();
            
            for (int i = 0; i < 1000; i++) {
                String key = "order-" + (i % 100);
                // Key determines partition: hash(key) % numPartitions
                // Same key always routes to same partition (ordering guarantee)
                
                String value = "{\"orderId\": \"" + i + "\", \"amount\": " + 
                    (Math.random() * 1000) + "}";
                
                ProducerRecord<String, String> record = 
                    new ProducerRecord<>("order-events", key, value);
                // Topic, key, value; partition determined by key hash
                // Optional: specify partition directly as 3rd argument
                
                record.headers().add("correlation-id", 
                    java.util.UUID.randomUUID().toString().getBytes());
                // Headers carry metadata without affecting partition routing
                // Useful for tracing, audit, and cross-system correlation
                
                record.headers().add("source-system", "order-service".getBytes());
                // Identify the originating service for observability
                
                Future<RecordMetadata> future = producer.send(record, 
                    (metadata, exception) -> {
                        if (exception != null) {
                            System.err.println("Failed to send record: " + 
                                exception.getMessage());
                        } else {
                            System.out.printf("Sent record to partition %d, " +
                                "offset %d, timestamp %d%n",
                                metadata.partition(), metadata.offset(),
                                metadata.timestamp());
                        }
                    });
                // Async send with callback; callback runs on sender thread
                // For sync send, call future.get() (blocks, lower throughput)
            }
            
            // Commit transaction atomically
            // Writes TxnMarker to all affected partitions
            // Marks transaction as committed in __transaction_state
            producer.commitTransaction();
            
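        } catch (org.apache.kafka.common.errors.ProducerFencedException
                | org.apache.kafka.common.errors.OutOfOrderSequenceException
                | org.apache.kafka.common.errors.AuthorizationException e) {
            // Fatal errors: the producer cannot abort or recover; close and exit
            throw e;
        } catch (org.apache.kafka.common.KafkaException e) {
            // Abort rolls back all writes in this transaction
            // Writes become invisible to read_committed consumers
            producer.abortTransaction();
            throw e;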
        } finally {
            // Close releases resources, flushes pending sends
            producer.close();
        }
    }
}
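
The Producer ID and sequence-number mechanism mentioned in the idempotence comments can be modeled on the broker side in a few lines. This is a deliberately reduced sketch: `PartitionLog` is a hypothetical class, it tracks one sequence number per producer rather than the recent batches and epochs a real broker keeps.

```python
class PartitionLog:
    """Toy partition that rejects duplicate or out-of-order producer writes."""

    def __init__(self):
        self.records = []
        self.last_sequence = {}  # producer_id -> last appended sequence number

    def append(self, producer_id, sequence, value):
        last = self.last_sequence.get(producer_id, -1)
        if sequence <= last:
            return "duplicate"      # a retry of a write that already succeeded
        if sequence != last + 1:
            return "out_of_order"   # a gap means an earlier write is missing
        self.records.append(value)
        self.last_sequence[producer_id] = sequence
        return "appended"


log = PartitionLog()
results = [
    log.append(producer_id=7, sequence=0, value="a"),
    log.append(producer_id=7, sequence=0, value="a"),  # retried send
    log.append(producer_id=7, sequence=2, value="c"),  # arrives before sequence 1
    log.append(producer_id=7, sequence=1, value="b"),
]
print(results)      # ['appended', 'duplicate', 'out_of_order', 'appended']
print(log.records)  # ['a', 'b']
```

The retried send is acknowledged without being written twice, which is how retries stay safe even when the original acknowledgment was lost in transit.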

Java Consumer with Advanced Configurations

import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.*;
import java.util.concurrent.atomic.AtomicBoolean;

public class KafkaConsumerExample {
    
    private static final AtomicBoolean running = new AtomicBoolean(true);
    
    public static void main(String[] args) {
        Properties props = new Properties();
        
        // Core configurations
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, 
            "kafka-broker-0:9092,kafka-broker-1:9092,kafka-broker-2:9092");
        // Initial broker list; consumer discovers full cluster from metadata
        
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processing-service");
        // Consumer group identifier
        // Partitions are assigned among consumers with same group.id
        
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, 
            StringDeserializer.class.getName());
        // Deserializer for message keys; must match producer's key serializer
        
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, 
            StringDeserializer.class.getName());
        // Deserializer for message values; must match producer's value serializer
        
        // Consumer behavior
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
        // false = manual offset commit (recommended for at-least-once processing)
        // true = auto-commit every auto.commit.interval.ms (default 5000ms)
        // Auto-commit may lose messages if processing fails after commit
        
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        // earliest: read from beginning if no committed offset exists
        // latest: read from end (new messages only)
        // none: throw exception if no committed offset
        
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 500);
        // Maximum records returned per poll() call
        // Lower values reduce poll loop latency
        // Higher values improve throughput
        
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300000);
        // Maximum time between poll() calls before consumer is evicted
        // If processing takes longer, consumer is considered dead
        // Increase for slow processing; decrease for fast detection
        
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, 45000);
        // Maximum time without heartbeat before consumer is considered dead
        // Detected by group coordinator; triggers rebalance
        
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, 15000);
        // Heartbeat sent to coordinator periodically
        // Should be <= session.timeout.ms / 3 for reliable detection
        
        // Isolation level for transactional reads
        props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
        // read_committed: only see records from committed transactions
        // read_uncommitted: see all records including uncommitted
        // Required for exactly-once semantics end-to-end
        
        // Partition assignment strategy
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, 
            "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
        // CooperativeStickyAssignor: incremental rebalancing (recommended)
        //   Only revokes partitions that need to move
        // RangeAssignor: assigns contiguous partition ranges (default)
        // RoundRobinAssignor: distributes partitions evenly
        
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        
        // Subscribe to topics with rebalance listener
        consumer.subscribe(Arrays.asList("order-events", "payment-events"), 
            new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                    // Called before partitions are revoked (rebalance start)
                    // Commit pending offsets to avoid reprocessing
                    System.out.println("Partitions revoked: " + partitions);
                    consumer.commitSync(Duration.ofSeconds(30));
                }
                
                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    // Called after new partitions are assigned
                    // Seek to specific offsets if needed
                    System.out.println("Partitions assigned: " + partitions);
                }
            });
        
        try {
            while (running.get()) {
                // poll() fetches records and maintains group membership
                // Timeout controls how long to wait if no records available
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                
                for (ConsumerRecord<String, String> record : records) {
                    processRecord(record);
                }
                
                // Commit offsets after successful processing
                // commitSync: blocks until successful or timeout
                // commitAsync: non-blocking, may fail silently
                if (!records.isEmpty()) {
                    consumer.commitSync(Duration.ofSeconds(30));
                }
            }
        } finally {
            // Close releases consumer from group, triggers rebalance
            // May want to delay close for graceful shutdown
            consumer.close();
        }
    }
    
    private static void processRecord(ConsumerRecord<String, String> record) {
        System.out.printf("Topic: %s, Partition: %d, Offset: %d, Key: %s, Value: %s%n",
            record.topic(), record.partition(), record.offset(), 
            record.key(), record.value());
        // Access record.timestamp() for event-time processing
        // Access record.headers() for message metadata
    }
}
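
Committing only after processing gives at-least-once delivery, and a short simulation shows what that means after a crash. Everything here is hypothetical scaffolding: `run_consumer` replays a list as if it were a partition, and `fail_at` injects a crash at a chosen offset.

```python
def run_consumer(log, committed_offset, batch_size, handled, fail_at=None):
    """Process records from committed_offset and return the new committed offset.

    Offsets are committed once per batch, after every record in the batch
    has been handled. A crash loses any progress since the last commit.
    """
    position = committed_offset
    while position < len(log):
        batch = log[position:position + batch_size]
        for i, record in enumerate(batch):
            if fail_at is not None and position + i == fail_at:
                return committed_offset  # crash before this record is handled
            handled.append(record)
        position += len(batch)
        committed_offset = position  # commit the offset of the next record to read
    return committed_offset


log = ["r0", "r1", "r2", "r3", "r4", "r5"]
handled = []
after_crash = run_consumer(log, 0, batch_size=3, handled=handled, fail_at=4)
after_restart = run_consumer(log, after_crash, batch_size=3, handled=handled)
print(after_crash, after_restart)  # 3 6
print(handled)  # ['r0', 'r1', 'r2', 'r3', 'r3', 'r4', 'r5']
```

Record r3 is handled twice because the crash happened after it was processed but before its batch was committed, so downstream processing needs to be idempotent or transactional to hide such replays.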

Python Producer Example

from kafka import KafkaProducer
import json
import time

# Initialize producer with configuration
producer = KafkaProducer(
    bootstrap_servers=['kafka-broker-0:9092', 'kafka-broker-1:9092'],
    # List of broker addresses used for initial metadata discovery
    
    key_serializer=lambda k: k.encode('utf-8'),
    # Serialize key to bytes; determines partition via hash
    
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    # Serialize value to JSON bytes
    
    acks='all',
    # Wait for all ISR replicas to acknowledge
    # Accepted values: 0 (no ack), 1 (leader only), 'all' (all ISR)
    
    retries=5,
    # Retry transient failures up to 5 times
    
    retry_backoff_ms=100,
    # 100ms delay between retries
    
    batch_size=65536,
    # 64KB batch size for network efficiency
    
    linger_ms=10,
    # Wait up to 10ms to fill batch
    
    compression_type='lz4',
    # Compress batches with lz4 (good balance of speed/ratio)
    
    max_in_flight_requests_per_connection=5,
    # Allow 5 concurrent requests per broker (pipelining)
    
    enable_idempotence=True,
    # Prevent duplicate messages during retries
    # Older kafka-python releases (for example 2.0.x) reject this option; remove it there
    
    buffer_memory=33554432,
    # 32MB buffer for pending records
    
    request_timeout_ms=30000,
    # 30s timeout for broker responses
)

# Send messages with callback
for i in range(1000):
    key = f"order-{i % 100}"
    value = {
        "orderId": str(i),
        "amount": round(1000.0, 2),
        "timestamp": int(time.time() * 1000)
    }
    
    future = producer.send(
        topic='order-events',
        key=key,
        value=value,
        headers=[
            ("correlation-id", str(i).encode('utf-8')),
            ("source", b"python-producer")
        ]
    )
    
    # Add callback for async result handling
    # RecordMetadata is a namedtuple: partition, offset and timestamp are attributes, not methods
    future.add_callback(lambda metadata: print(
        f"Sent to partition {metadata.partition}, "
        f"offset {metadata.offset}, "
        f"timestamp {metadata.timestamp}"
    ))
    
    future.add_errback(lambda exc: print(f"Error: {exc}"))

# Flush ensures all pending records are sent
producer.flush()
producer.close()

Python Consumer Example

from kafka import KafkaConsumer
import json

# Initialize consumer with configuration
consumer = KafkaConsumer(
    'order-events',
    # Topic(s) to subscribe to; can be list for multiple topics
    
    bootstrap_servers=['kafka-broker-0:9092', 'kafka-broker-1:9092'],
    # Initial broker list
    
    group_id='order-processing-service',
    # Consumer group for partition assignment
    
    key_deserializer=lambda k: k.decode('utf-8') if k else None,
    # Deserialize key from bytes
    
    value_deserializer=lambda v: json.loads(v.decode('utf-8')) if v else None,
    # Deserialize value from JSON bytes
    
    auto_offset_reset='earliest',
    # 'earliest': read from beginning if no offset
    # 'latest': read from end if no offset
    # 'none': raise exception if no offset
    
    enable_auto_commit=False,
    # Disable auto-commit for manual offset management
    
    isolation_level='read_committed',
    # Only read committed transactional records
    
    max_poll_records=500,
    # Max records per poll() call
    
    max_poll_interval_ms=300000,
    # Max time between polls before eviction (300s)
    
    session_timeout_ms=45000,
    # Consumer failure detection timeout
    
    heartbeat_interval_ms=15000,
    # Heartbeat frequency to coordinator
    
    fetch_min_bytes=1,
    # Minimum bytes to fetch (batch optimization)
    
    fetch_max_wait_ms=500,
    # Max wait if fetch_min_bytes not reached
)

# Subscribe with a rebalance listener. kafka-python expects a subclass of
# ConsumerRebalanceListener; this call replaces the constructor's topic list.
class CommitOnRevokeListener(ConsumerRebalanceListener):
    def on_partitions_revoked(self, revoked):
        print(f"Partitions revoked: {revoked}")
        # Commit consumed positions before ownership moves to another consumer
        consumer.commit()

    def on_partitions_assigned(self, assigned):
        print(f"Partitions assigned: {assigned}")

consumer.subscribe(
    topics=['order-events', 'payment-events'],
    listener=CommitOnRevokeListener()
)

# Process records
try:
    while True:
        records = consumer.poll(timeout_ms=100)
        
        for tp, records_list in records.items():
            for record in records_list:
                print(f"Topic: {record.topic}, "
                      f"Partition: {record.partition}, "
                      f"Offset: {record.offset}, "
                      f"Key: {record.key}, "
                      f"Value: {record.value}")
                
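                # Process record here
                # After processing, commit the next offset to read.
                # kafka-python uses commit(); commitSync() is the Java client's name.
                # Some newer kafka-python releases add a third leader_epoch
                # field to OffsetAndMetadata; pass it if your version requires it.
                consumer.commit({
                    tp: OffsetAndMetadata(record.offset + 1, "processed")
                })

except KeyboardInterrupt:
    pass
finally:
    # Always leave the group cleanly, whatever ended the loop
    consumer.close()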
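
Committing after every record, as in the loop above, costs one broker round trip per message. A common alternative is to commit once per poll() batch. The sketch below shows only the offset arithmetic for that approach; `TopicPartition` and `Record` are minimal stand-ins defined here (not the kafka-python classes) so the snippet runs without a broker, and `next_offsets` is a hypothetical helper name.

```python
from collections import namedtuple

# Stand-ins for kafka-python's TopicPartition and ConsumerRecord
# (assumed minimal shapes, defined locally so no broker is needed).
TopicPartition = namedtuple("TopicPartition", ["topic", "partition"])
Record = namedtuple("Record", ["offset", "value"])


def next_offsets(batch):
    """Map each partition in a poll() batch to the offset to commit.

    The committed offset is the position of the next record to read,
    so it is the highest processed offset plus one.
    """
    return {
        tp: max(r.offset for r in records) + 1
        for tp, records in batch.items()
        if records  # skip partitions that returned no records
    }


batch = {
    TopicPartition("order-events", 0): [Record(10, "a"), Record(11, "b")],
    TopicPartition("order-events", 1): [Record(7, "c")],
}
offsets = next_offsets(batch)
print(offsets)
```

With a real consumer, each value would be wrapped in OffsetAndMetadata and the resulting dict passed to consumer.commit() after the whole batch has been processed. The trade-off is that a crash mid-batch replays the entire batch, so processing should be idempotent.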

Performance Metrics

Metric	Single Broker	3-Broker Cluster	6-Broker Cluster
Throughput (writes/sec)	100K	300K	600K
Throughput (reads/sec)	200K	600K	1.2M
Latency (p99)	5ms	10ms	15ms
Latency (p999)	15ms	25ms	40ms
Message Size (avg)	1KB	1KB	1KB
Replication Lag	0ms	<100ms	<200ms
Recovery Time	0	30s	60s
Disk Usage	1TB	3TB	6TB
Memory (JVM Heap)	6GB	6GB	6GB
Network Bandwidth	1Gbps	3Gbps	6Gbps

Broker Configuration	Default	Recommended	Impact
`num.network.threads`	3	8	Network I/O throughput
`num.io.threads`	8	16	Disk I/O throughput
`socket.send.buffer.bytes`	100KB	1MB	Network send throughput
`socket.receive.buffer.bytes`	100KB	1MB	Network receive throughput
`log.flush.interval.messages`	Long.MAX_VALUE	10000	Durability vs performance
`log.retention.hours`	168 (7 days)	Varies	Storage cost

Best Practices

Partition Count Planning: Start with partitions = (target_throughput / broker_throughput) * replication_factor. Monitor consumer lag to decide whether more partitions or consumers are needed. Avoid over-partitioning (roughly 200K partitions per cluster is the practical ceiling with ZooKeeper, about 2M with KRaft).
Replication Factor: Always use replication-factor ≥ 3 for production topics. Set min.insync.replicas=2 and produce with acks=all so that at least two in-sync replicas have the write before it is acknowledged to the producer; min.insync.replicas has no effect with acks=0 or acks=1. This prevents data loss during a single broker failure.
Consumer Group Sizing: Ensure number of consumers ≤ number of partitions. More consumers than partitions results in idle consumers. Use CooperativeStickyAssignor for incremental rebalancing to minimize pause time during scaling events.
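Offset Management: Disable auto-commit (enable.auto.commit=false) and commit offsets explicitly after successful processing so that a crash never skips unprocessed records. This gives at-least-once delivery, so records can still be redelivered after a failure; make processing idempotent to absorb duplicates. Use commitSync for critical offsets, commitAsync for bulk commits.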
Broker Configuration: Use dedicated disks for Kafka data (avoid shared with OS). Configure num.io.threads and num.network.threads based on CPU cores. Set log.flush.interval.messages and log.flush.interval.ms carefully to balance durability with performance.
Topic Configuration: Use cleanup.policy=compact for topics that only need to retain the latest value per key. Set retention.ms and retention.bytes based on data volume and replay requirements. Use compression.type=lz4 for a good balance of compression speed and ratio.
Monitoring: Monitor consumer lag, ISR shrink/expand events, under-replicated partitions, and disk usage. Use tools like Burrow, Cruise Control, or Confluent Control Center. Set up alerts for ISR shrinks and under-replicated partitions.
Security: Enable SSL/TLS for inter-broker and client-broker communication. Use SASL for authentication (SCRAM-SHA-256 or SASL/PLAIN). Implement ACLs for authorization. Use separate listeners for internal and external traffic.
ZooKeeper/KRaft: For clusters > 100K partitions, consider migrating to KRaft mode to eliminate ZooKeeper dependency. Use dedicated ZooKeeper nodes (3 or 5) with sufficient I/O capacity. KRaft provides faster controller failover and better metadata throughput.
Data Retention: Implement tiered storage for long-term retention. Use log compaction for topics that serve as materialized views. Archive to object storage (S3, GCS) for compliance requirements. Set retention.bytes per partition to cap storage usage.
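
The sizing rules in the list above reduce to simple arithmetic. The following is a minimal sketch under stated assumptions: the function names are hypothetical, throughput is taken in MB/s, and the partition formula is the starting-point heuristic quoted under Partition Count Planning, not a guarantee of capacity.

```python
import math


def partition_count(target_mb_s, broker_mb_s, replication_factor):
    """Starting partition count from the planning heuristic, rounded up."""
    return math.ceil(target_mb_s / broker_mb_s * replication_factor)


def tolerated_broker_failures(replication_factor, min_insync_replicas):
    """Replicas that can be lost while acks=all writes still succeed."""
    return max(0, replication_factor - min_insync_replicas)


def idle_consumers(consumers, partitions):
    """Consumers in one group that receive no partition assignment."""
    return max(0, consumers - partitions)


# Illustrative inputs (assumed values, not measurements)
print(partition_count(300, 100, 3))
print(tolerated_broker_failures(3, 2))
print(idle_consumers(12, 9))
```

Treat the results as a first estimate and adjust against observed consumer lag, since per-partition throughput depends on message size, compression, and consumer processing time.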

See also: Data Engineering Pipeline patterns (data-engineering/019) | PySpark Structured Streaming (pyspark/11-structured-streaming) | Kafka Producer API (kafka/02) | Exactly-Once Semantics (kafka/04)
