Kafka Monitoring

[Architecture diagram: Kafka cluster (Broker 1, Broker 2, Broker 3; JMX port 9999) -> JMX Exporter (metrics endpoint on port 9404, Prometheus format) -> Prometheus (scrape interval 15s, retention 15 days, alert rules, service discovery, TSDB storage) -> Grafana (dashboards, visualization, panels) and Alertmanager (email, Slack, PagerDuty)]

Overview

Monitoring Apache Kafka is essential for maintaining reliable, scalable, and performant event streaming platforms. A comprehensive monitoring strategy covers broker health, consumer lag, throughput metrics, and resource utilization across the entire Kafka ecosystem.

Why Monitoring Matters

  • Early Detection: Identify issues before they impact production workloads
  • Capacity Planning: Understand growth patterns for infrastructure scaling
  • Performance Optimization: Tune configurations based on real metrics
  • SLA Compliance: Ensure message delivery guarantees are met

Key Kafka Metrics

Broker Metrics

| Metric | Description | Warning Threshold |
|---|---|---|
| kafka.server.BrokerTopicMetrics.MessagesInPerSec | Messages received per second | Varies by workload |
| kafka.server.BrokerTopicMetrics.BytesInPerSec | Bytes received per second | > 80% network capacity |
| kafka.server.BrokerTopicMetrics.BytesOutPerSec | Bytes sent per second | > 80% network capacity |
| kafka.network.RequestMetrics.RequestsPerSec | Requests per second, reported per request type | > 10,000 req/s |
| kafka.server.ReplicaManager.UnderReplicatedPartitions | Partitions with fewer in-sync replicas than their replication factor | > 0 |
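
To make the "80% network capacity" threshold concrete, the sketch below converts a BytesInPerSec or BytesOutPerSec reading into a fraction of link capacity. The function names, the 1 Gbit/s link speed, and the sample reading are illustrative assumptions, not values taken from a real cluster.

```python
def network_utilization(bytes_per_sec: float, link_capacity_bits_per_sec: float) -> float:
    """Return the fraction of link capacity used by the given byte rate."""
    if link_capacity_bits_per_sec <= 0:
        raise ValueError("link capacity must be positive")
    # JMX reports bytes per second; link speeds are quoted in bits per second.
    return (bytes_per_sec * 8) / link_capacity_bits_per_sec


def exceeds_warning_threshold(bytes_per_sec: float,
                              link_capacity_bits_per_sec: float,
                              threshold: float = 0.80) -> bool:
    """True when utilization is above the warning threshold (80% by default)."""
    return network_utilization(bytes_per_sec, link_capacity_bits_per_sec) > threshold


if __name__ == "__main__":
    one_gbit = 1_000_000_000  # assumed 1 Gbit/s network interface
    reading = 110_000_000     # hypothetical BytesInPerSec reading (110 MB/s)
    print(f"utilization: {network_utilization(reading, one_gbit):.0%}")
    print("warning:", exceeds_warning_threshold(reading, one_gbit))
```

Keep in mind that replication traffic also uses the broker's network interface, so inbound and outbound rates are best checked separately against the same capacity figure.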

Consumer Group Metrics

Example per-partition lag samples in Prometheus exposition format:
kafka_consumer_group_lag{topic="orders", partition="0", group="order-processor"} 15234
kafka_consumer_group_lag{topic="orders", partition="1", group="order-processor"} 8921
kafka_consumer_group_lag{topic="orders", partition="2", group="order-processor"} 23456
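
Per-partition samples like the ones above are usually summed before they are charted or alerted on. The following is a minimal sketch of that aggregation, assuming the simple `name{labels} value` line shape shown here; it is not a full parser for the Prometheus exposition format (for example, it skips lines that carry a trailing timestamp), and the function name is hypothetical.

```python
import re
from collections import defaultdict

# Matches one sample line such as: name{label="value", ...} 123
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def lag_by_group(exposition_text: str, metric: str = "kafka_consumer_group_lag") -> dict:
    """Sum the per-partition lag samples for each (group, topic) pair."""
    totals = defaultdict(float)
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        totals[(labels.get("group"), labels.get("topic"))] += float(match.group("value"))
    return dict(totals)


SAMPLE = '''
kafka_consumer_group_lag{topic="orders", partition="0", group="order-processor"} 15234
kafka_consumer_group_lag{topic="orders", partition="1", group="order-processor"} 8921
kafka_consumer_group_lag{topic="orders", partition="2", group="order-processor"} 23456
'''

if __name__ == "__main__":
    for (group, topic), lag in lag_by_group(SAMPLE).items():
        print(f"{group} / {topic}: {lag:.0f} messages behind")
```

In a running setup the same aggregation is normally done in PromQL with `sum by (group, topic) (kafka_consumer_group_lag)`; the Python version only illustrates what that query computes.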

Producer Metrics

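  • record-error-rate: Average number of record sends per second that resulted in errors
  • record-send-rate: Average number of records sent per second
  • batch-size-avg: Average number of bytes sent per partition per request
  • compression-rate-avg: Average compression rate of record batches (compressed size divided by uncompressed size)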

JMX Metrics Deep Dive

Enabling JMX on Kafka Brokers

# JMX is enabled through JVM flags in the broker's environment, not in server.properties.
# Disabling authentication and SSL is only acceptable on a trusted network.
export KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9999 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false"

Key JMX MBeans

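# Python example using the jmxquery package (needs a Java runtime on the PATH)
from jmxquery import JMXConnection, JMXQuery

jmx_url = "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi"

queries = [
    # Per-topic message count; Kafka topic names are case-sensitive
    JMXQuery(
        "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=orders",
        attribute="Count",
        metric_name="kafka_server_brokertopicmetrics_messagesinpersec_count",
        metric_labels={"topic": "orders"},
    ),
    # Number of under-replicated partitions on this broker
    JMXQuery(
        "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions",
        attribute="Value",
        metric_name="kafka_server_replicamanager_underreplicatedpartitions",
    ),
]

# Queries are executed through a JMXConnection, not through JMXQuery itself
results = JMXConnection(jmx_url).query(queries)
for metric in results:
    print(f"{metric.metric_name} {metric.metric_labels} = {metric.value}")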

Critical JMX MBeans

# jmx_exporter_config.yml
rules:
  - pattern: kafka.server<type=BrokerTopicMetrics, name=MessagesInPerSec><>Count
    name: kafka_server_brokertopicmetrics_messagesinpersec_count
    type: COUNTER

  - pattern: kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value
    name: kafka_server_replicamanager_underreplicatedpartitions
    type: GAUGE

  # Request rate is exposed as RequestsPerSec, tagged by request type and API version
  - pattern: kafka.network<type=RequestMetrics, name=RequestsPerSec, request=(.+), version=(.+)><>Count
    name: kafka_network_requestmetrics_requestspersec_count
    type: COUNTER
    labels:
      request: "$1"
      version: "$2"

Prometheus Setup

Prometheus Configuration

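# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Load the alert rules from kafka_alerts.yml (path relative to this file)
rule_files:
  - kafka_alerts.yml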

scrape_configs:
  - job_name: 'kafka-brokers'
    static_configs:
      - targets:
          - 'kafka-broker-1:9404'
          - 'kafka-broker-2:9404'
          - 'kafka-broker-3:9404'
    metrics_path: /metrics
    
  - job_name: 'kafka-connect'
    static_configs:
      - targets:
          - 'kafka-connect-1:9404'
          - 'kafka-connect-2:9404'
          
  - job_name: 'kafka-consumer-groups'
    static_configs:
      - targets:
          - 'kafka-consumer-metrics:9404'

Prometheus Alert Rules

# kafka_alerts.yml
groups:
  - name: kafka_alerts
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Under-replicated partitions detected"
          description: "{{ $value }} partitions are under-replicated"
          
      - alert: KafkaConsumerLagHigh
        expr: kafka_consumer_group_lag > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Consumer lag is high"
          description: "Consumer group {{ $labels.group }} has lag of {{ $value }}"
          
      - alert: KafkaBrokerDown
        expr: up{job="kafka-brokers"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Kafka broker is down"
          description: "Broker {{ $labels.instance }} is unreachable"
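
The `for:` clause in these rules means the expression must stay true at every evaluation for the whole duration before the alert fires; until then the alert is only pending, and a single evaluation where the condition is false resets the timer. The sketch below simulates that behaviour on a list of evaluation results. The function name and the sample values are hypothetical, and it models only this one aspect of rule evaluation.

```python
from typing import Iterable, Optional, Tuple


def first_firing_time(samples: Iterable[Tuple[float, float]],
                      threshold: float,
                      for_seconds: float) -> Optional[float]:
    """Return the timestamp at which the alert would start firing, or None.

    samples: (timestamp_seconds, value) pairs, one per rule evaluation, in time order.
    """
    active_since = None  # time the condition first became true (pending state)
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts
            if ts - active_since >= for_seconds:
                return ts  # condition held for the whole `for` window
        else:
            active_since = None  # condition cleared, so the pending timer resets
    return None


if __name__ == "__main__":
    # Hypothetical lag readings every 60s with one dip below the threshold at t=300
    readings = [(t, 12000 if t != 300 else 9000) for t in range(0, 1260, 60)]
    # Mirrors KafkaConsumerLagHigh: lag > 10000 for 10 minutes
    print("fires at t =", first_firing_time(readings, threshold=10000, for_seconds=600))
```

This is why a longer `for:` value suppresses alerts on short spikes at the cost of a later notification.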

Grafana Dashboard Setup

Essential Dashboard Panels

{
  "panels": [
    {
      "title": "Messages In Rate",
      "type": "timeseries",
      "targets": [
        {
          "expr": "sum by (instance) (rate(kafka_server_brokertopicmetrics_messagesinpersec_count[5m]))",
          "legendFormat": "{{instance}}"
        }
      ]
    },
    {
      "title": "Consumer Group Lag",
      "type": "timeseries",
      "targets": [
        {
          "expr": "sum by (group) (kafka_consumer_group_lag)",
          "legendFormat": "{{group}}"
        }
      ]
    },
    {
      "title": "Under Replicated Partitions",
      "type": "stat",
      "targets": [
        {
          "expr": "sum(kafka_server_replicamanager_underreplicatedpartitions)"
        }
      ],
      "thresholds": {
        "steps": [
          {"value": 0, "color": "green"},
          {"value": 1, "color": "red"}
        ]
      }
    }
  ]
}

Dashboard Best Practices

  • Broker Overview: CPU, memory, disk usage, network I/O
  • Topic Metrics: Messages/sec, bytes/sec, partition distribution
  • Consumer Metrics: Lag, throughput, rebalance frequency
  • Cluster Health: ISR size, controller status, log segments

Alerting Strategies

Alert Severity Levels

| Level | Response Time | Examples |
|---|---|---|
| Critical | Immediate | Broker down, under-replicated partitions |
| Warning | 1 hour | High consumer lag, disk space low |
| Info | Next business day | Config changes, rebalances |

Common Alert Patterns

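# Consumer lag alert with dynamic threshold.
# Lag is divided by the cluster-wide ingest rate, so the result approximates
# seconds of backlog; 300 corresponds to roughly 5 minutes of incoming traffic.
- alert: KafkaConsumerLagExceedsThreshold
  expr: |
    (
      sum by (group) (kafka_consumer_group_lag)
      / scalar(sum(rate(kafka_server_brokertopicmetrics_messagesinpersec_count[5m])))
    ) > 300
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Consumer {{ $labels.group }} is falling behind"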
Monitoring Best Practices

1. Establish Baselines

  • Monitor metrics during normal operations for 1-2 weeks
  • Set thresholds based on standard deviations from baseline
  • Account for daily/weekly patterns in traffic
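
One way to turn a baseline into a threshold is to take the mean of the observed values plus a few standard deviations, computed separately per hour of day so that daily traffic patterns are respected. The sketch below assumes that approach; the function names, the factor of three standard deviations, and the sample values are illustrative choices, not recommendations from a specific tool.

```python
import statistics
from typing import Dict, Mapping, Sequence


def baseline_threshold(samples: Sequence[float], num_stdevs: float = 3.0) -> float:
    """Return mean + num_stdevs * sample standard deviation of the baseline samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate a standard deviation")
    return statistics.mean(samples) + num_stdevs * statistics.stdev(samples)


def thresholds_by_hour(samples_by_hour: Mapping[int, Sequence[float]],
                       num_stdevs: float = 3.0) -> Dict[int, float]:
    """Compute one threshold per hour of day to account for daily traffic patterns."""
    return {
        hour: baseline_threshold(values, num_stdevs)
        for hour, values in samples_by_hour.items()
    }


if __name__ == "__main__":
    # Hypothetical messages-per-second readings collected during the baseline period
    baseline = {
        3: [100, 110, 90, 105, 95],        # quiet night hour
        14: [900, 1000, 1100, 950, 1050],  # busy afternoon hour
    }
    for hour, limit in thresholds_by_hour(baseline).items():
        print(f"hour {hour:02d}: alert above {limit:.1f} msg/s")
```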

2. Implement Health Checks

def kafka_health_check(brokers):
    """Check Kafka cluster health.

    The check_* helpers are placeholders to be implemented for your environment;
    each one should return True when its check passes.
    """
    checks = {
        'brokers_reachable': check_broker_connectivity(brokers),
        'all_partitions_isr': check_isr_status(),
        'consumer_lags_healthy': check_consumer_lags(),
        'disk_space_adequate': check_disk_usage(),
        'jmx_accessible': check_jmx_endpoint()
    }
    return all(checks.values())
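
The health-check outline above leaves its check_* helpers undefined. As one possible starting point, the sketch below implements only the connectivity check with a plain TCP connection and adds a small runner that reports each check separately. The broker address, the timeout, and the `run_health_checks` helper are assumptions for illustration; a TCP connect confirms that the port is open, not that the broker is fully functional.

```python
import socket
from typing import Callable, Dict, Iterable, Tuple


def check_broker_connectivity(brokers: Iterable[Tuple[str, int]],
                              timeout: float = 2.0) -> bool:
    """Return True if a TCP connection can be opened to every broker."""
    for host, port in brokers:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                pass  # connection succeeded; close it immediately
        except OSError:
            return False
    return True


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run each named check; a check that raises is reported as failed."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


if __name__ == "__main__":
    brokers = [("localhost", 9092)]  # assumed broker address
    results = run_health_checks({
        "brokers_reachable": lambda: check_broker_connectivity(brokers),
    })
    for name, ok in results.items():
        print(f"{name}: {'OK' if ok else 'FAILED'}")
    print("healthy:", all(results.values()))
```

Returning the per-check results rather than a single boolean makes it easier to see which check failed when the overall status is unhealthy.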

3. Log Aggregation

  • Collect Kafka logs with Filebeat or Fluentd
  • Index logs in Elasticsearch or Loki
  • Create log-based alerts for ERROR messages
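
A log-based alert on ERROR messages comes down to counting lines per log level over some window. The sketch below shows that counting step in isolation, assuming log lines of the form `[timestamp] LEVEL message`; if your logging pattern differs, the regular expression needs adjusting. The sample lines and function names are made up for the example.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed layout: "[timestamp] LEVEL message"; adjust if your log pattern differs.
LOG_LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z]+)\s+(?P<message>.*)$")


def count_levels(lines: Iterable[str]) -> Counter:
    """Count log lines per level, ignoring lines that do not match (e.g. stack traces)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE_RE.match(line)
        if match:
            counts[match.group("level")] += 1
    return counts


def should_alert(lines: Iterable[str], max_errors: int = 0) -> bool:
    """True when the number of ERROR lines exceeds max_errors."""
    return count_levels(lines)["ERROR"] > max_errors


# Hypothetical log excerpt used only to exercise the functions
SAMPLE_LOG = [
    "[2024-05-01 12:00:00,123] INFO Starting log cleanup (kafka.log.LogManager)",
    "[2024-05-01 12:00:01,456] ERROR Error while processing request (kafka.server.KafkaApis)",
    "java.io.IOException: Connection reset by peer",
    "[2024-05-01 12:00:02,789] ERROR Disk error while writing (kafka.server.LogDirFailureChannel)",
]

if __name__ == "__main__":
    print(dict(count_levels(SAMPLE_LOG)))
    print("alert:", should_alert(SAMPLE_LOG))
```

In practice the counting is done by the log platform itself (for example a query in Elasticsearch or Loki), and the script only illustrates the logic such a query encodes.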

4. Distributed Tracing

  • Use OpenTelemetry for end-to-end tracing
  • Track message flow from producer to consumer
  • Monitor end-to-end latency across services

Summary

Effective Kafka monitoring requires a multi-layered approach combining JMX metrics, Prometheus collection, Grafana visualization, and intelligent alerting. Focus on consumer lag, broker health, and throughput metrics to ensure your Kafka deployment remains reliable and performant.