Kafka Monitoring & Alerting

Difficulty: Senior Level | Companies: LinkedIn, Uber, Netflix, Spotify, Confluent

Content

Effective monitoring is critical for maintaining Kafka cluster health. Knowing which broker, producer, and consumer metrics to track, and which thresholds to alert on, lets you catch replication problems and growing consumer lag before they turn into outages.

Monitoring Stack

Architecture Diagram

Monitoring Architecture:
┌─────────────────────────────────────────────────┐
│  Grafana Dashboards                              │
│  └── Visualization & Alerting                    │
└─────────────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────────────┐
│  Prometheus                                       │
│  └── Metrics Collection & Storage                │
└─────────────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────────────┐
│  JMX Exporter                                     │
│  └── Kafka JMX Metrics → Prometheus Format       │
└─────────────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────────────┐
│  Kafka Brokers                                    │
│  └── JMX MBeans                                  │
└─────────────────────────────────────────────────┘

Key Metrics to Monitor

Broker Metrics

# Critical broker metrics
kafka_server_BrokerTopicMetrics:
  - MessagesInPerSec          # Message ingestion rate
  - BytesInPerSec             # Bytes received
  - BytesOutPerSec            # Bytes sent
  - TotalFetchRequestsPerSec  # Consumer fetch requests
  - TotalProduceRequestsPerSec # Producer requests
  - FailedProduceRequestsPerSec # Failed producer requests
  - FailedFetchRequestsPerSec  # Failed fetch requests

kafka_server_ReplicaManager:
  - UnderReplicatedPartitions  # Partitions with fewer in-sync replicas than assigned replicas (should be 0)
  - IsrShrinksPerSec          # ISR shrink rate
  - IsrExpandsPerSec          # ISR expand rate

kafka_controller_KafkaController:
  - ActiveControllerCount     # Controller status (exactly one broker should report 1)
  - OfflinePartitionsCount    # Partitions without an active leader (should be 0)
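
As a rough illustration of how these gauges are usually read together, the sketch below classifies cluster health from values that have already been scraped from each broker. The `BrokerSample` type, the `classify_cluster_health` function, and the severity strings are hypothetical names chosen for this example; only the meaning of the three metrics comes from the list above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BrokerSample:
    """One scrape of the replication gauges from a single broker (hypothetical type)."""
    broker: str
    under_replicated_partitions: int
    offline_partitions_count: int
    active_controller_count: int


def classify_cluster_health(samples: List[BrokerSample]) -> str:
    """Return 'critical', 'warning', or 'ok' for one round of broker samples."""
    # Exactly one broker in the cluster should report itself as the active controller.
    controllers = sum(s.active_controller_count for s in samples)
    # Offline partitions have no leader, so clients cannot read or write them.
    offline = sum(s.offline_partitions_count for s in samples)
    if controllers != 1 or offline > 0:
        return "critical"
    # Under-replicated partitions are still served, but with reduced redundancy.
    if any(s.under_replicated_partitions > 0 for s in samples):
        return "warning"
    return "ok"


samples = [
    BrokerSample("kafka1", 0, 0, 1),
    BrokerSample("kafka2", 3, 0, 0),
]
print(classify_cluster_health(samples))
```

Treating offline partitions and a wrong controller count as more severe than under-replication reflects the fact that the former make data unavailable, while the latter only reduces fault tolerance.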

Consumer Metrics

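# Exposed by the consumer client JVM (consumer-fetch-manager-metrics), not by the broker
kafka_consumer_fetch_manager_metrics:
  - records_lag_max           # Maximum lag, in records, across the partitions assigned to this consumer
  - records_consumed_rate     # Records consumed per second
  - fetch_rate                # Fetch requests per second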
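
To make the lag figure concrete, here is a minimal sketch of the arithmetic, assuming the log end offsets and the group's committed offsets have already been fetched. The `compute_lag` helper and the `Partition` alias are hypothetical; treating a missing committed offset as "lag equals the end offset" is a simplifying assumption of this sketch.

```python
from typing import Dict, Optional, Tuple

Partition = Tuple[str, int]  # (topic, partition number)


def compute_lag(
    end_offsets: Dict[Partition, int],
    committed: Dict[Partition, Optional[int]],
) -> Dict[Partition, int]:
    """Per-partition lag = log end offset - committed offset, never negative."""
    lag = {}
    for tp, end in end_offsets.items():
        offset = committed.get(tp)
        # Assumption: with no committed offset yet, report the end offset as lag.
        lag[tp] = end if offset is None else max(end - offset, 0)
    return lag


end_offsets = {("orders", 0): 100, ("orders", 1): 50, ("orders", 2): 10}
committed = {("orders", 0): 90, ("orders", 1): None, ("orders", 2): 0}

lag = compute_lag(end_offsets, committed)
print(lag)                 # per-partition lag
print(max(lag.values()))   # the worst partition, analogous to records_lag_max
```

Note that a committed offset of 0 is a valid position and must not be confused with "no offset committed", which is why the sketch tests for `None` explicitly.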

Producer Metrics

kafka_producer_metrics:
  - record_send_rate          # Send rate
  - record_error_rate         # Error rate
  - request_latency_avg       # Average request latency (ms)
  - batch_size_avg            # Average batch size (bytes)

JMX Exporter Configuration

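# jmx-prometheus-exporter.yml
# hostPort applies when the exporter runs as a standalone HTTP server that connects
# to the broker's remote JMX port; omit it when the exporter runs as a Java agent.
hostPort: localhost:9999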
lowercaseOutputName: true
lowercaseOutputLabelNames: true

rules:
  # Broker metrics
  - pattern: kafka.server<type=BrokerTopicMetrics, name=(.+), topic=(.+)><>Count
    name: kafka_server_brokertopicmetrics_$1_total
    labels:
      topic: $2
    type: COUNTER
    
  - pattern: kafka.server<type=ReplicaManager, name=(.+)><>Value
    name: kafka_server_replicamanager_$1
    type: GAUGE
    
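  # Consumer client metrics. These MBeans live in the consumer application's JVM,
  # so this rule only matches when the exporter is attached to a consumer, not a broker.
  # The metric name (e.g. records-lag-max) is the MBean attribute, not a name= key.
  - pattern: kafka.consumer<type=consumer-fetch-manager-metrics, client-id=([^,]+)><>([a-z-]+)
    name: kafka_consumer_fetch_manager_$2
    labels:
      client_id: $1
    type: GAUGE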
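
To see what the first broker rule does to an MBean, the following sketch re-implements the match in Python. It is a simplified emulation, not the exporter's actual code: it assumes the exporter presents each attribute as `domain<key=value, ...><>attribute`, searches the pattern unanchored, and lowercases the output name because `lowercaseOutputName` is set. The `apply_rule` function is a hypothetical helper.

```python
import re

# Pattern from the BrokerTopicMetrics rule in the exporter config.
PATTERN = re.compile(
    r"kafka.server<type=BrokerTopicMetrics, name=(.+), topic=(.+)><>Count"
)


def apply_rule(bean_attribute: str):
    """Return (metric_name, labels) if the rule matches, otherwise None."""
    match = PATTERN.search(bean_attribute)  # unanchored, like the exporter's patterns
    if match is None:
        return None
    # Equivalent of name: kafka_server_brokertopicmetrics_$1_total
    name = f"kafka_server_brokertopicmetrics_{match.group(1)}_total"
    # lowercaseOutputName: true lowercases the generated metric name.
    return name.lower(), {"topic": match.group(2)}


per_topic = "kafka.server<type=BrokerTopicMetrics, name=MessagesInPerSec, topic=orders><>Count"
all_topics = "kafka.server<type=BrokerTopicMetrics, name=MessagesInPerSec><>Count"

print(apply_rule(per_topic))
print(apply_rule(all_topics))
```

The second call returns `None`: the broker-wide MBean has no `topic` key, so a rule written this way only exports per-topic series. A separate rule without the `topic` capture would be needed if the aggregate is wanted as its own series.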

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'kafka-brokers'
    static_configs:
      - targets:
          - kafka1:9404
          - kafka2:9404
          - kafka3:9404
    metrics_path: /metrics
    
  - job_name: 'kafka-exporter'
    static_configs:
      - targets:
          - kafka-exporter:9308
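
Once Prometheus is scraping these targets, the same data can be read back through its HTTP API (`/api/v1/query`). The sketch below only builds the request URL and parses an instant-vector response, so it runs without a live server. The `prometheus:9090` address, both function names, and the sample response body are assumptions made for illustration.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url: str, expr: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


def parse_instant_vector(body: str) -> dict:
    """Map each series' instance label to its sample value."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each sample is {"metric": {labels...}, "value": [timestamp, "value-as-string"]}.
    return {
        sample["metric"].get("instance", ""): float(sample["value"][1])
        for sample in payload["data"]["result"]
    }


# Hypothetical Prometheus address; the job name matches the scrape config.
url = build_query_url("http://prometheus:9090", 'up{job="kafka-brokers"}')
print(url)

# Hand-written example of the response shape, not captured from a real server.
sample_body = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"instance": "kafka1:9404"}, "value": [1700000000, "1"]},
            {"metric": {"instance": "kafka2:9404"}, "value": [1700000000, "0"]},
        ],
    },
})
print(parse_instant_vector(sample_body))
```

In a real check, the URL would be fetched with an HTTP client and the response body passed to `parse_instant_vector`; a value of 0 for `up` means the last scrape of that target failed.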

Grafana Dashboard Metrics

Broker Health Dashboard

{
  "panels": [
    {
      "title": "Under Replicated Partitions",
      "type": "graph",
      "targets": [
        {
          "expr": "kafka_server_replicamanager_underreplicatedpartitions",
          "legendFormat": "{{instance}}"
        }
      ],
      "thresholds": [
        {
          "value": 0,
          "color": "green"
        },
        {
          "value": 1,
          "color": "red"
        }
      ]
    },
    {
      "title": "Consumer Lag",
      "type": "graph",
      "targets": [
        {
          "expr": "kafka_consumergroup_lag",
          "legendFormat": "{{consumergroup}} - {{topic}} - {{partition}}"
        }
      ]
    },
    {
      "title": "Message Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "rate(kafka_server_brokertopicmetrics_messagesinpersec_total[5m])",
          "legendFormat": "{{topic}}"
        }
      ]
    }
  ]
}

Alerting Rules

Prometheus Alert Rules

# kafka-alerts.yml
groups:
  - name: kafka-alerts
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Kafka under replicated partitions"
          description: "{{ $value }} partitions are under replicated"
          
      - alert: KafkaConsumerLagHigh
        expr: sum by (consumergroup, topic) (kafka_consumergroup_lag) > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High consumer lag"
          description: "Consumer lag is {{ $value }} for {{ $labels.consumergroup }}"
          
      - alert: KafkaBrokerDown
        expr: up{job="kafka-brokers"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Kafka broker down"
          description: "Broker {{ $labels.instance }} is down"
          
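      # Assumes an additional JMX exporter rule that exposes the Mean attribute of
      # kafka.network:type=RequestMetrics,name=TotalTimeMs under this metric name.
      - alert: KafkaHighRequestLatency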
        expr: kafka_network_requestmetrics_totaltimems{request="Produce"} > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High produce latency"
          description: "Average produce latency is {{ $value }}ms"
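
The `for:` clause in these rules means the expression must stay true across consecutive evaluations for the whole duration before the alert fires; a single false evaluation resets the timer. A minimal sketch of that behavior, using a hypothetical `firing_times` helper and hand-made samples rather than Prometheus itself:

```python
from typing import Iterable, List, Tuple


def firing_times(
    samples: Iterable[Tuple[float, float]],
    threshold: float,
    for_seconds: float,
) -> List[float]:
    """Return the evaluation timestamps at which the alert would be firing.

    samples: (timestamp_seconds, value) pairs in evaluation order.
    """
    pending_since = None
    firing = []
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true: alert is pending
            if ts - pending_since >= for_seconds:
                firing.append(ts)   # condition held long enough: alert is firing
        else:
            pending_since = None    # any false evaluation resets the timer
    return firing


# Under-replicated partition counts sampled once a minute, checked with for: 5m.
steady = [(t, 2) for t in range(0, 420, 60)]
flapping = [(0, 2), (60, 2), (120, 0), (180, 2), (240, 2), (300, 2), (360, 2)]

print(firing_times(steady, threshold=0, for_seconds=300))
print(firing_times(flapping, threshold=0, for_seconds=300))
```

The flapping series never fires because the dip at 120 seconds restarts the five-minute window. This is why a longer `for:` suppresses noise from brief ISR changes but also delays the alert for a sustained problem.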

Python Monitoring Script

from kafka import KafkaConsumer, TopicPartition
from prometheus_client import Gauge, Counter, start_http_server
import time

# Define metrics
consumer_lag = Gauge(
    'kafka_consumer_lag',
    'Consumer lag per partition',
    ['topic', 'partition', 'consumer_group']
)

message_rate = Counter(
    'kafka_messages_total',
    'Total messages processed',
    ['topic', 'status']
)

processing_latency = Gauge(
    'kafka_processing_latency_seconds',
    'Message processing latency',
    ['topic']
)

def monitor_consumer_group(bootstrap_servers, topic, group_id):
    # Do not subscribe to the topic: a subscribed consumer would join the group
    # and take partitions away from the real consumers. The group_id is only
    # used to look up the group's committed offsets.
    consumer = KafkaConsumer(
        bootstrap_servers=bootstrap_servers,
        group_id=group_id,
        enable_auto_commit=False
    )
    
    while True:
        # partitions_for_topic returns None when topic metadata is unavailable
        partitions = consumer.partitions_for_topic(topic) or set()
        tps = [TopicPartition(topic, p) for p in partitions]
        
        # Get the log end offset of every partition in one call
        end_offsets = consumer.end_offsets(tps) if tps else {}
        
        for tp in tps:
            end_offset = end_offsets[tp]
            
            # Get committed offset (None if the group has not committed yet)
            committed = consumer.committed(tp)
            
            # Compare against None explicitly: a committed offset of 0 is valid
            lag = end_offset - committed if committed is not None else end_offset
            consumer_lag.labels(
                topic=topic,
                partition=tp.partition,
                consumer_group=group_id
            ).set(lag)
        
        time.sleep(10)

# Start metrics server
start_http_server(8000)

# Monitor consumer
monitor_consumer_group(
    bootstrap_servers='kafka1:9092',
    topic='orders',
    group_id='order-processor'
)

Health Check Endpoints

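// Spring Boot health check
import java.util.Collections;
import java.util.concurrent.TimeUnit;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.DescribeTopicsResult;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component
public class KafkaHealthIndicator implements HealthIndicator {
    
    private final AdminClient adminClient;
    
    // Constructor injection; assumes an AdminClient bean is defined elsewhere in the application
    public KafkaHealthIndicator(AdminClient adminClient) {
        this.adminClient = adminClient;
    }
    
    @Override
    public Health health() {
        try {
            // Describing an internal topic forces a metadata round trip to the cluster
            DescribeTopicsResult result = adminClient.describeTopics(
                Collections.singleton("__consumer_offsets")
            );
            
            // Bound the wait so an unreachable cluster cannot block the health endpoint.
            // In kafka-clients 3.1+, all() is deprecated in favor of allTopicNames().
            result.all().get(5, TimeUnit.SECONDS);
            
            return Health.up()
                .withDetail("kafka", "available")
                .build();
        } catch (Exception e) {
            return Health.down()
                .withDetail("kafka", e.getMessage())
                .build();
        }
    }
}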

Follow-Up Questions

What are the most critical Kafka metrics to monitor?
How would you set up alerting for consumer lag?
Explain the difference between UnderReplicatedPartitions and OfflinePartitionsCount.
How would you monitor Kafka performance across multiple datacenters?
What are best practices for Kafka monitoring in Kubernetes?