Kafka Monitoring & Alerting
Difficulty: Senior Level | Companies: LinkedIn, Uber, Netflix, Spotify, Confluent
Content
Effective monitoring is critical for maintaining Kafka cluster health. Understanding key metrics and alerting strategies prevents outages and ensures performance.
Monitoring Stack
Architecture Diagram
Monitoring Architecture:
βββββββββββββββββββββββββββββββββββββββββββββββββββ
β Grafana Dashboards β
β βββ Visualization & Alerting β
βββββββββββββββββββββββββββββββββββββββββββββββββββ
β
βΌ
βββββββββββββββββββββββββββββββββββββββββββββββββββ
β Prometheus β
β βββ Metrics Collection & Storage β
βββββββββββββββββββββββββββββββββββββββββββββββββββ
β
βΌ
βββββββββββββββββββββββββββββββββββββββββββββββββββ
β JMX Exporter β
β βββ Kafka JMX Metrics β Prometheus Format β
βββββββββββββββββββββββββββββββββββββββββββββββββββ
β
βΌ
βββββββββββββββββββββββββββββββββββββββββββββββββββ
β Kafka Brokers β
β βββ JMX MBeans β
βββββββββββββββββββββββββββββββββββββββββββββββββββ
Key Metrics to Monitor
Broker Metrics
# Critical broker metrics
kafka_server_BrokerTopicMetrics:
- MessagesInPerSec # Message ingestion rate
- BytesInPerSec # Bytes received
- BytesOutPerSec # Bytes sent
- TotalFetchRequestsPerSec # Consumer fetch requests
- TotalProduceRequestsPerSec # Producer requests
- FailedProduceRequestsPerSec # Failed producer requests
- FailedFetchRequestsPerSec # Failed fetch requests
kafka_server_ReplicaManager:
- UnderReplicatedPartitions # Partitions with replication issues
- IsrShrinkPerSec # ISR shrink rate
- IsrExpandsPerSec # ISR expand rate
- ActiveControllerCount # Controller status (should be 1)
- OfflinePartitionsCount # Partitions without leader
Consumer Metrics
kafka_consumer_group_metrics:
- records_lag_max # Consumer lag
- records_consumed_rate # Consumption rate
- fetch_rate # Fetch rate
Producer Metrics
kafka_producer_metrics:
- record_send_rate # Send rate
- record_error_rate # Error rate
- request_latency_avg # Latency
- batch_size_avg # Batch size
JMX Exporter Configuration
# jmx-prometheus-exporter.yml
hostPort: localhost:9999
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
# Broker metrics
- pattern: kafka.server<type=BrokerTopicMetrics, name=(.+), topic=(.+)><>Count
name: kafka_server_brokertopicmetrics_$1_total
labels:
topic: $2
type: COUNTER
- pattern: kafka.server<type=ReplicaManager, name=(.+)><>Value
name: kafka_server_replicamanager_$1
type: GAUGE
# Consumer metrics
- pattern: kafka.consumer<type=consumer-fetch-manager-metrics, name=(.+), client-id=(.+)><>Value
name: kafka_consumer_fetch_manager_$1
labels:
client_id: $2
type: GAUGE
Prometheus Configuration
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'kafka-brokers'
static_configs:
- targets:
- kafka1:9404
- kafka2:9404
- kafka3:9404
metrics_path: /metrics
- job_name: 'kafka-exporter'
static_configs:
- targets:
- kafka-exporter:9308
Grafana Dashboard Metrics
Broker Health Dashboard
{
"panels": [
{
"title": "Under Replicated Partitions",
"type": "graph",
"targets": [
{
"expr": "kafka_server_replicamanager_underreplicatedpartitions",
"legendFormat": "{{instance}}"
}
],
"thresholds": [
{
"value": 0,
"color": "green"
},
{
"value": 1,
"color": "red"
}
]
},
{
"title": "Consumer Lag",
"type": "graph",
"targets": [
{
"expr": "kafka_consumergroup_current_offset",
"legendFormat": "{{consumergroup}} - {{topic}} - {{partition}}"
}
]
},
{
"title": "Message Rate",
"type": "graph",
"targets": [
{
"expr": "rate(kafka_server_brokertopicmetrics_messagesinpersec_total[5m])",
"legendFormat": "{{topic}}"
}
]
}
]
}
Alerting Rules
Prometheus Alert Rules
# kafka-alerts.yml
groups:
- name: kafka-alerts
rules:
- alert: KafkaUnderReplicatedPartitions
expr: kafka_server_replicamanager_underreplicatedpartitions > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Kafka under replicated partitions"
description: "{{ $value }} partitions are under replicated"
- alert: KafkaConsumerLagHigh
expr: kafka_consumergroup_current_offset - kafka_consumergroup_end_offset > 10000
for: 10m
labels:
severity: warning
annotations:
summary: "High consumer lag"
description: "Consumer lag is {{ $value }} for {{ $labels.consumergroup }}"
- alert: KafkaBrokerDown
expr: up{job="kafka-brokers"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Kafka broker down"
description: "Broker {{ $labels.instance }} is down"
- alert: KafkaHighRequestLatency
expr: kafka_network_requestmetrics_totaltimems{request="Produce"} > 1000
for: 5m
labels:
severity: warning
annotations:
summary: "High produce latency"
description: "Average produce latency is {{ $value }}ms"
Python Monitoring Script
from kafka import KafkaConsumer, KafkaProducer
from prometheus_client import Gauge, Counter, start_http_server
import time
# Define metrics
consumer_lag = Gauge(
'kafka_consumer_lag',
'Consumer lag per partition',
['topic', 'partition', 'consumer_group']
)
message_rate = Counter(
'kafka_messages_total',
'Total messages processed',
['topic', 'status']
)
processing_latency = Gauge(
'kafka_processing_latency_seconds',
'Message processing latency',
['topic']
)
def monitor_consumer_group(bootstrap_servers, topic, group_id):
consumer = KafkaConsumer(
topic,
bootstrap_servers=bootstrap_servers,
group_id=group_id,
enable_auto_commit=False
)
while True:
partitions = consumer.partitions_for_topic(topic)
for partition in partitions:
tp = TopicPartition(topic, partition)
# Get end offset
consumer.seek_to_end(tp)
end_offset = consumer.position(tp)
# Get committed offset
committed = consumer.committed(tp)
lag = end_offset - committed if committed else end_offset
consumer_lag.labels(
topic=topic,
partition=partition,
consumer_group=group_id
).set(lag)
time.sleep(10)
# Start metrics server
start_http_server(8000)
# Monitor consumer
monitor_consumer_group(
bootstrap_servers='kafka1:9092',
topic='orders',
group_id='order-processor'
)
Health Check Endpoints
// Spring Boot health check
@Component
public class KafkaHealthIndicator implements HealthIndicator {
private final AdminClient adminClient;
@Override
public Health health() {
try {
DescribeTopicsResult result = adminClient.describeTopics(
Collections.singleton("__consumer_offsets")
);
result.all().get(5, TimeUnit.SECONDS);
return Health.up()
.withDetail("kafka", "available")
.build();
} catch (Exception e) {
return Health.down()
.withDetail("kafka", e.getMessage())
.build();
}
}
}
Follow-Up Questions
- What are the most critical Kafka metrics to monitor?
- How would you set up alerting for consumer lag?
- Explain the difference between
UnderReplicatedPartitionsandOfflinePartitionsCount. - How would you monitor Kafka performance across multiple datacenters?
- What are best practices for Kafka monitoring in Kubernetes?