How to Monitor Kafka


By David Mytton,
CEO & Founder of Server Density.

Published on the 19th January, 2016.

Distributed systems and microservices are all the rage these days, and Apache Kafka seems to be getting most of that attention.

Here at Server Density we use it as part of our payloads processing (see: Tech chat: processing billions of events a day with Kafka, Zookeeper and Storm).

For the uninitiated, Kafka is a Scala project—originally developed by LinkedIn—that provides a publish-subscribe messaging service across distributed nodes.

Kafka is fast. A single node can handle hundreds of read/writes from thousands of clients in real time. Kafka is also distributed and scalable. It creates and takes down nodes in an elastic manner, without incurring any downtime. Data streams are split into partitions and spread over different brokers for capability and redundancy.

Before we delve deeper, here is some useful terminology:

Topic: a feed of messages or packages

Partition: group of topics split for scalability and redundancy

Producer: process that introduces messages into the queue

Consumer: process that subscribes to various topics and processes from a feed of published messages

Broker: a node that is part of the Kafka cluster

Here is a diagram of a Kafka cluster alongside the required Zookeeper ensemble: 3 Kafka brokers plus 3 Zookeeper servers (2n+1 redundancy) with 6 producers writing in 2 partitions for redundancy.


With that in mind, here is our very own checklist of best practices, including key Kafka metrics and alerts we monitor with Server Density.

Monitor Kafka: Metrics and Alerts

Once again, our general rule of thumb is “collect all possible/reasonable metrics that can help when troubleshooting, alert only on those that require an action from you”.

Kafka process is running

Metric Comments Suggested Alert
Kafka process Is the right binary daemon process running? When a process list contains the regexp /usr/bin/java*kafka.Kafka*$.

You can also use:

$INSTALL_PREFIX/bin/ config/

Or if you run Zookeeper via supervisord (recommended) you can alert the supervisord resource instead.

System Metrics

Metric Comments Suggested Alert
Memory usage Kafka should run entirely on RAM. JVM heap size shouldn’t be bigger than your available RAM. That is to avoid swapping. None
Swap usage Watch for swap usage, as it will degrade performance on Kafka and lead to operations timing out (set vm.swappiness = 0). When used swap is > 128MB.
Network bandwidth Kafka servers can incur a high network usage. Keep an eye on this, especially if you notice any performance degradation. Also look out for dropped packet errors. None
Disk usage Make sure you always have free space for new data, temporary files, snapshot or backups. When disk is > 85% usage.
Disk IO Kafka partitions are stored asynchronously as a sequential write ahead log. Thus, disk reads and writes in Kafka are sequential, with very few random seeks. None

Here is how Server Density graphs some Kafka metrics:


Kafka Metrics

Metric Comments Suggested Alert
UnderReplicatedPartitions kafka.server: type=ReplicaManager, name=UnderReplicatedPartitions – Number of under-replicated partitions. When UnderReplicatedPartitions > 0.
OfflinePartitionsCount kafka.controller: type=KafkaController, name=OfflinePartitionsCount – Number of partitions without an active leader, therefore not readable nor writeable. When OfflinePartitionsCount > 0.
ActiveControllerCount kafka.controller: type=KafkaController, name=ActiveControllerCount – Number of active controller brokers. When ActiveControllerCount != 1.
MessagesInPerSec kafka.server: type=BrokerTopicMetrics, name=MessagesInPerSec – Incoming messages per second. None
BytesInPerSec / BytesOutPerSec kafka.server: type=BrokerTopicMetrics, name=BytesInPerSec – kafka.server: type=BrokerTopicMetrics, name=BytesOutPerSec – Incoming/outgoing bytes per second. None
RequestsPerSec type=RequestMetrics, name=RequestsPerSec, request={Produce|FetchConsumer|FetchFollower} – Number of requests per second. None
TotalTimeMs type=RequestMetrics, name=TotalTimeMs, request={Produce|FetchConsumer|FetchFollower} – Total time it takes to process a request. You can also monitor split times for QueueTimeMs, LocalTimeMs, RemoteTimeMs and RemoteTimeMs. None
UncleanLeaderElectionsPerSec kafka.controller: type=ControllerStats, name=LeaderElectionRateAndTimeMs – Number of disputed leader elections rate. When UncleanLeaderElectionsPerSec != 0.
LogFlushRateAndTimeMs kafka.log: type=LogFlushStats, name=LogFlushRateAndTimeMs – Asynchronous disk log flush and time in ms. None
UncleanLeaderElectionsPerSec kafka.controller: type=ControllerStats, name=UncleanLeaderElectionsPerSec – Unclean leader election rate. When UncleanLeaderElectionsPerSec != 0.
PartitionCount kafka.server: type=ReplicaManager, name=PartitionCount – Number of partitions on your system. When PartitionCount != your_num_partitions.
ISR shrink/expansion rate kafka.server: type=ReplicaManager, name=IsrShrinksPerSec – kafka.server: type=ReplicaManager, name=IsrExpandsPerSec – When a broker goes down, ISR will shrink for some of the partitions. When that broker is up again, ISR will be expanded once the replicas have fully caught up. IsrShrinksPerSec | IsrExpandsPerSec != 0.
NetworkProcessorAvgIdlePercent type=SocketServer, name=NetworkProcessorAvgIdlePercent – The average fraction of time the network processors are idle. When NetworkProcessorAvgIdlePercent < 0.3.
RequestHandlerAvgIdlePercent kafka.server: type=KafkaRequestHandlerPool, name=RequestHandlerAvgIdlePercent – The average fraction of time the request handler threads are idle. When RequestHandlerAvgIdlePercent < 0.3.
Heap Memory Usage Memory allocated dynamically by the Java process, Zookeeper in this case. None

Kafka Consumer Metrics

Metric Comments Suggested Alert
MaxLag kafka.consumer: type=ConsumerFetcherManager, name=MaxLag, clientId=([-.\w]+) Number of messages by which the consumer lags behind the producer. When MaxLag > 50.
MinFetchRate kafka.consumer: type=ConsumerFetcherManager, name=MinFetchRate, clientId=([-.\w]+) Minimum rate at which consumer sends requests to the Kafka broker. If stalled or dead, this drops to 0. When MinFetchRate < 0.5.
MessagesPerSec kafka.consumer: type=ConsumerTopicMetrics, name=MessagesPerSec, clientId=([-.\w]+) Messages consumed per second. None
BytesPerSec kafka.consumer: type=ConsumerTopicMetrics, name=BytesPerSec, clientId=([-.\w]+) Byes consumed per second. None
KafkaCommitsPerSec kafka.consumer: type=ZookeeperConsumerConnector, name=KafkaCommitsPerSec, clientId=([-.\w]+) Rate at which consumer commits offsets to Kafka. None
OwnedPartitionsCount kafka.consumer: type=ZookeeperConsumerConnector, name=OwnedPartitionsCount, clientId=([-.\w]+),groupId=([-.\w]+) Number of partitions owned by this consumer. When OwnedPartitionsCount != your_count. (might require a dynamic adjustment if you change your cluster)

Kafka Monitoring Tools

Any monitoring tools with JMX support should be able to monitor a Kafka cluster. Here are 3 monitoring tools we liked:

First one is from Hari Sekhon. It performs a complete end to end test, i.e. it inserts a message in Kafka as a producer and then extracts it as a consumer. This makes our life easier when measuring service times.

Another useful tool is KafkaOffsetMonitor for monitoring Kafka consumers and their position (offset) in the queue. It aids our understanding of how our queue grows and which consumers groups are lagging behind.

Last but not least, the LinkedIn folks have developed what we think is the smartest tool out there: Burrow. It analyzes consumer offsets and lags over a window of time and determines the consumer status. You can retrieve this status over an HTTP endpoint and then plug it into your favourite monitoring tool (Server Density for example).

Oh, and we would be amiss if we didn’t mention Yahoo’s Kafka-Manager. While it does include some basic monitoring, it is more of a management tool. If you are just looking for a Kafka management tool, check out AirBnb’s kafkat.

Further reading

Did this article pique your interest in Kafka? Nice, keep reading.

The LinkedIn engineering team have written a lot about how they use Kafka. Kafka might not be for everyone of course. The Auth0 Webtasks team has recently switched to a much simpler solution based on ZeroMQ and they wrote about it here.

So what about you? Do you have a checklist or any best practices for monitoring Kafka? What systems do you have in place and how do you monitor them? Any interesting reads to suggest?

Free eBook: 4 Steps to Successful DevOps

This eBook will show you how we i) hacked our on-call rotation to increase code resilience, ii) broke our infrastructure, on purpose, to debug quicker and increase uptime, and iii) borrowed practices from the healthcare and aviation industry, to reduce complexity, stress and fatigue. And speaking of stress and fatigue, we’ve devoted an entire chapter on how we placed humans at the centre of Ops, in order to increase their productivity and boost the uptime of the systems they manage. What are you waiting for, download your free copy now.

Help us speak your language. What is your primary tech stack?

What infrastructure do you currently work with?

Articles you care about. Delivered.

Help us speak your language. What is your primary tech stack?

Maybe another time