Kafka

This Kafka tab is shown on the Clusters page only when you have enabled it while connecting to the Kafka cluster. Refer to Connecting to the Kafka cluster.

From this tab, you can monitor various metrics of Kafka streaming. Before using Unravel to monitor Kafka, you must connect the Unravel server to a Kafka cluster.

You can also improve the Kafka cluster security by having Kafka authenticate connections to brokers from clients using either SSL or SASL. Refer to Kafka security.

The following KPIs are displayed at the top of the Kafka page. The first three (Under Replicated Partitions, Offline Partitions, Controller) are color-coded as follows:

Green - Indicates that the streaming process is healthy.
Red - Indicates that the streaming process is unhealthy and needs further investigation.

The remaining metrics are data input and output metrics and are always displayed in blue.

Metrics	Description
Under Replicated Partitions	The total number of under-replicated partitions. This metric indicates whether the partitions are replicating as configured. If under-replicated partitions are shown, you can drill down in the Broker tab for further investigation.
Offline Partitions	The total number of offline partitions. If this metric is greater than 0, then broker-level issues must be addressed.
Controller	The number of brokers in the cluster that are acting as the controller. A value of 0 indicates that there is no active controller.
Bytes in per Sec	The total number of incoming bytes received per second from all servers.
Bytes Out per Sec	The total number of outgoing bytes sent per second to all servers.
Messages in per Sec	The total rate of incoming messages. A unit of data is called a message. The messages are written into Kafka in batches.
Total Fetch Requests per Sec	The total rate of fetch requests.
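The color coding described above can be sketched as a small function. This is only an illustration: the metric keys and the exact thresholds (zero under-replicated partitions, zero offline partitions, exactly one active controller) are assumptions inferred from the descriptions on this page, not Unravel's documented rules.

```python
# Hypothetical metric keys; Unravel's internal names are not documented here.
HEALTH_KPIS = ("under_replicated_partitions", "offline_partitions", "controller")


def kpi_color(metric: str, value: float) -> str:
    """Return the display color for a KPI under the assumed thresholds."""
    if metric not in HEALTH_KPIS:
        # Data input and output metrics are always shown in blue.
        return "blue"
    if metric == "controller":
        # Assumption: a healthy cluster has exactly one active controller.
        return "green" if value == 1 else "red"
    # Assumption: any under-replicated or offline partition is unhealthy.
    return "green" if value == 0 else "red"


print(kpi_color("offline_partitions", 3))
print(kpi_color("bytes_in_per_sec", 1_000_000))
```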

See Kafka Metrics Reference and Analysis for a complete list of Kafka metrics that Unravel monitors.

Metrics

You can monitor all the metrics associated with your Kafka clusters from the Kafka > Metrics tab. The following graphs plot the various Kafka metrics for a selected cluster, in the specified time range:

Bytes In per Second: This graph plots the total number of bytes received per second from all topics and brokers, over a specified time range.
Bytes Out per Sec: This graph plots the total number of outgoing bytes sent per second to all Kafka consumers, over a specified time range.
Messages in per Sec: This graph plots the messages produced in the cluster across all topics and brokers, over a specified time range.
Total Fetch Requests per Sec: This graph plots the total rate of fetch requests within the cluster over a specified time.
Under-Replicated Partitions: This graph plots the number of under-replicated partitions per second, within a cluster, over a specified period.
Active Controller Trend: This graph plots the trend of the brokers in the cluster that act as the controller over a specified period. You can select or deselect the brokers to view the corresponding trend.
Request Handler Idle Ratio Average per Minute: It indicates the percentage of the time the request handlers are not in use. This graph plots the average ratio per minute that the request handler threads are idle in a specified time range.
Partition Count: The number of partitions. The count includes both leader and follower replicas. This graph plots the total number of partitions across all brokers in a cluster in a specified time range. The trend line indicates the metric's value across the cluster averaged over all brokers.
Leader Partition Count: This graph plots the total number of leader partitions across all brokers in a cluster in a specified time range.
Offline Partition Count: This graph plots the total number of partitions that do not have an active leader in the specified time range. Such partitions are neither writable nor readable. If the count exceeds 0, more investigation is required to resolve broker-level issues.
Fetch Total Time, 99 Percentile: This graph plots the time value for the entire cluster averaged across all brokers.
Produce Total Time, 99 Percentile
Fetch Requests per Sec: This graph plots the rate of fetch requests in a specified time range.
Produce Request per Sec: This graph shows the summation of the produce requests being processed across all brokers per second. This graph plots the rate of produce requests in a specified time range.
Produce Purgatory Size: This graph plots the produce purgatory size over a specified time range. Purgatory holds a request that has not yet succeeded or resulted in an error.
Fetch Purgatory Size: The number of Fetch requests sitting in the Fetch Request Purgatory, a holding pen for delayed requests waiting to be satisfied. Of all Kafka request types, it is used only for Fetch requests. The metric tracks the number of requests in purgatory (including both the watchers map and the expiration queue). This graph plots the fetch purgatory size over a specified time range.
Log Flush Latency, 99th Percentile: The 99th percentile value of the latency incurred by a log flush, that is, a write to disk, in milliseconds. 99% of all values in the group are less than the metric's value. This graph plots the time taken for the brokers to flush logs to disk over a specified time range.

The averages of the following metrics are displayed in the table below:

Items	Description
Topic	Name of the Kafka topic.
Consumer Group	The name of the consumer group that works together to consume a topic. Consumers read messages. The consumer subscribes to one or more topics and reads the messages in the order in which they were produced.
Brokers	The name of the Kafka broker. A single Kafka server is called a broker. The broker receives messages from producers, assigns offsets to them, and commits the messages to storage on disk.
Bytes in per Sec	The total number of incoming bytes received per second from all servers.
Bytes Out per Sec	The total number of outgoing bytes sent per second to all servers.
Messages in per Sec	The total rate of incoming messages.
Total Fetch Requests per Sec	The total rate of fetch requests.

You can click a row in the average metrics table to view the corresponding Topic summary page. Click the Download CSV icon to download the topic summary as a CSV file.

Broker

A single Kafka server is called a broker. Kafka brokers are designed to operate as part of a cluster. One broker can function as the cluster controller within a cluster of brokers. The broker receives messages from producers, assigns offsets to them, and commits the messages to storage on disk. It also services consumers, responding to fetch requests for partitions and responding with the messages committed to disk.

You can monitor all the metrics associated with each Kafka broker in your cluster from the Kafka > Broker tab.

The latest metrics of the Kafka brokers for the specified time range are displayed in a table as shown:

The following graphs plot the various Kafka metrics for a selected broker, in the selected cluster, for the specified time range:

Bytes In per Second: This graph plots the total number of bytes received per second for the selected broker, over a specified time range.
Bytes Out per Sec: This graph plots the total number of outgoing bytes sent per second by the selected broker, over a specified time range.
Messages in per Sec: This graph plots the total rate of the incoming messages received by the broker, over a specified time range.
Total Fetch Requests per Sec: This graph plots the total rate of fetch requests made by the broker, over a specified time.
Under Replicated Partitions: This graph plots the number of under-replicated partitions per second in the selected broker over a specified time range.
Active Controller Trend: This graph plots whether the selected broker acts as the controller over a specified period.
Request Handler Idle Ratio Average per Minute: This graph plots the average ratio per minute that the request handler threads are idle, in the selected broker, for the specified time range.
Partition Count: This graph plots the total number of partitions in the selected broker for the specified time range.
Leader Partition Count: This graph plots the number of leader partitions in the selected broker for the specified time range.
Offline Partition Count: This graph plots the total number of partitions that do not have an active leader in the selected broker for the specified time range. Such partitions are neither writable nor readable. If the count exceeds 0, more investigation is required to resolve the broker-level issues.
Fetch Total Time, 99 Percentile
Produce Total Time, 99 Percentile
Fetch Requests per Sec: This graph plots the rate of fetch requests made by the selected broker in a specified time range.
Produce Request per Sec: This graph plots the rate of produce requests in a specified time range.
Produce Purgatory Size: This graph plots the produce purgatory size, for the selected broker, over a specified time range. Purgatory holds a request that has not yet succeeded or resulted in an error.
Fetch Purgatory Size: This graph plots the fetch purgatory size for the selected broker over a specified time range.
Log Flush Latency, 99th Percentile: This graph plots the time taken for the selected broker to flush logs to disk over a specified time range.

The averages of the following metrics are displayed in the table below:

You can click a row in the average metrics table to view the corresponding Topic summary page.

Topic

A Topic is a category into which the Kafka records are organized. Topics are additionally divided into several partitions. From the Clusters > Topic tab, you can monitor all the metrics associated with a topic.

The latest metrics of topics are displayed in a table as shown.

Items	Description
Topic	Name of the Kafka topic.
Brokers	The name of the Kafka broker. A single Kafka server is called a broker. The broker receives messages from producers, assigns offsets to them, and commits the messages to storage on disk.
Bytes in per Sec	The total number of incoming bytes received per second for the topic.
Bytes Out per Sec	The total number of outgoing bytes sent per second for the selected topic over a specified time range.
Messages in per Sec	The total rate of the incoming messages the topic receives over a specified time range.
Total Fetch Requests per Sec	The total rate of fetch requests the topic makes over a specified time.

The corresponding graphs are plotted for the topic metrics:

Bytes In per Second: This graph plots the total number of bytes received per second for the selected topic over a specified time range.
Bytes Out per Second: This graph plots the total number of outgoing bytes sent per second for the selected topic over a specified time range.
Messages In per Second: This graph plots the total rate of the incoming messages received by the topic over a specified time range.
Total Fetch Request per Second: This graph plots the total rate of fetch requests the topic makes over a specified time.
Under-Replicated Partitions per Second: This graph plots the number of under-replicated partitions per second, within a cluster, over a specified period.

The average metrics of the topics in the specified period are displayed in a table. If you click a topic, the corresponding topic summary page is displayed.

Topic Summary page

The Topic summary page displays the latest and average values of the topic metrics along with the corresponding graphs.

Consumer Summary page

Consumers read messages. They are also called subscribers or readers. A consumer subscribes to one or more topics and reads the messages in the order in which they were produced. It keeps track of which messages it has already consumed by tracking the message offsets.

The following latest metrics of the topics are displayed on the Consumer Summary page.

Items	Description
Consumer Name	Name of the Consumer.
Max Lag	The number of messages by which the consumer lags behind the producer.
Total Lag	The sum of the lag across all the partitions that the consumer reads.
Mean Lag	The average lag per partition.
Partitions	The number of partitions that the consumer reads.
Lagging Partitions	Partitions for which the consumer lag is increasing consistently and the increase in lag from the start of the window to the last value is greater than the lag threshold.
Stalled Partitions	Partitions for which the consumer's committed offset is not increasing and the lag is greater than zero.
Status	The status of the consumer: OK, Warning, or Error.
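As a rough illustration of how the lag columns relate to one another, the sketch below derives per-partition lag from offsets and then summarizes it. It assumes that lag is the log end offset minus the committed offset, and that Max, Total, and Mean Lag are the maximum, sum, and average of the per-partition values; the function names and input shapes are hypothetical and are not part of Unravel.

```python
def partition_lags(log_end_offsets: dict, committed_offsets: dict) -> dict:
    """Lag per partition: messages written but not yet committed by the consumer."""
    return {
        partition: max(log_end_offsets[partition] - committed_offsets[partition], 0)
        for partition in log_end_offsets
    }


def summarize_lag(lags: dict) -> dict:
    """Summarize per-partition lag into the columns shown in the table."""
    values = list(lags.values())
    return {
        "max_lag": max(values),
        "total_lag": sum(values),
        "mean_lag": sum(values) / len(values),
        "partitions": len(values),
    }


# Example input: three partitions of one topic (made-up offsets).
log_end = {0: 100, 1: 250, 2: 80}
committed = {0: 90, 1: 200, 2: 80}
summary = summarize_lag(partition_lags(log_end, committed))
print(summary)
```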

The following metrics are tracked on this page:

Number of Topics
Number of Partitions

The Topic list displays the KPIs; when details are available, a more info icon is displayed. Click it to bring up the Kafka view for the topic. Below the list are two tabs that display graphs of the Topic and Partition details. By default, the window opens with the Topic Detail graph displayed.

You can choose both the Partition and the Metric for the display. By default, partition 0 is displayed using the offset metric. The Partition Details list is populated if the details are available.

Unravel insights for Kafka

Unravel provides auto-detection of lagging/stalled Consumer Groups. It lets you drill down into your cluster and determine which consumers, topics, and partitions are lagging or stalling.

Unravel determines Consumer status by evaluating the consumer's behavior over a sliding window. For example, we use an average lag trend for 10 intervals (of 5 minutes duration each), covering 50 minutes. Consumer Status is evaluated on several factors during the window for each partition it consumes.

For a topic partition, Consumer status is:

Stalled: If the consumer's committed offset for the topic partition is not increasing, and the lag is greater than zero.
Lagging: If the Consumer lag for the topic partition is increasing consistently, and an increase in lag from the start of the window to the last value is greater than the lag threshold.

The information is distilled into a status for each partition and then into a single status for the consumer. A consumer is in one of the following states:

OK: The consumer is working and is current.
Warning: The consumer is working but falling behind.
Error: The consumer has stopped or stalled.
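The evaluation described above can be sketched as follows. This is not Unravel's implementation. It assumes that each partition supplies one lag sample and one committed-offset sample per interval (oldest first), that "increasing consistently" means the lag never decreases between consecutive samples, and that a stalled partition maps to Error while a lagging partition maps to Warning. The function names and the `lag_threshold` parameter are hypothetical.

```python
def partition_status(lags, committed_offsets, lag_threshold):
    """Classify one topic partition from its samples over the sliding window."""
    offset_advanced = committed_offsets[-1] > committed_offsets[0]
    if not offset_advanced and lags[-1] > 0:
        return "STALLED"
    # Assumption: "increasing consistently" means no sample is lower than the previous one.
    never_decreases = all(later >= earlier for earlier, later in zip(lags, lags[1:]))
    if never_decreases and lags[-1] - lags[0] > lag_threshold:
        return "LAGGING"
    return "OK"


def consumer_status(partition_statuses):
    """Roll the per-partition statuses up into a single consumer status."""
    if "STALLED" in partition_statuses:
        return "Error"
    if "LAGGING" in partition_statuses:
        return "Warning"
    return "OK"


# Four samples per partition for brevity; the example above uses 10 intervals.
statuses = [
    partition_status([10, 20, 40, 80], [100, 110, 120, 130], lag_threshold=50),
    partition_status([10, 5, 8], [100, 150, 200], lag_threshold=50),
]
print(statuses, consumer_status(statuses))
```

Taking the worst partition status as the consumer status is a design choice made for this sketch; other roll-up rules are possible.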

Kafka Metrics Reference and Analysis

The Metrics and Broker tabs graph all the following metrics. The Topic tab displays a subset of the metrics. In addition, subsets of these metrics are used in the Topic and Consumer Group pages.

While the definition of the metric is the same throughout, the breadth of the collected data is different. Therefore, the implication and interpretation of the metric varies by context for Metrics, Brokers, and Topics.

The following table lists the Kafka metrics, the definition and a brief analysis in context with monitoring.

Metric	Definition	Analysis
Bytes In Per Second	The one minute rate of all the bytes flowing into the Kafka network, specifically the Kafka Producers publishing messages. It indicates how much traffic the brokers are receiving from the Producer clients.	The Brokers tab can help you determine if you need to expand your cluster. There might be situations where it could be possible to identify a broker that is receiving more traffic than others which indicates that there is a need for a partition rebalance.
Bytes Out Per Second	The one minute rate of all the bytes flowing out of the Kafka network, specifically Kafka Consumers consuming messages from the topics.	It is possible for the outgoing rate to be two to three times the incoming rate for topics being consumed. Before Kafka 0.11.0.0, this metric included the internal replica traffic. Therefore, it showed an outbound rate for a topic that had an actively producing client, even when no consumers were consuming from that topic.
Messages In Per Second	The number of messages produced per second. While Bytes In Per Second displays the broker traffic in absolute terms of bytes, this is the number (count) of individual messages that flow in, regardless of their size.	Useful for end-users who are more interested in the message count rather than the network throughput. When used in conjunction with Bytes in Per Second you can estimate the size of a single message. Like Bytes in Per Second it can help identify broker imbalance so you can rebalance your partitions.
Total Fetch Requests Per Second	The number of consumer fetch requests sent per second. Like Bytes Out Per Second this is an indicator of the rate consumers are uptaking/requesting messages from the Kafka cluster.	Care must be taken when interpreting this metric because the value also includes replica traffic, that is, the fetch requests that get raised by the brokers to keep the topic partitions in sync.
Under Replicated Partitions Per Second	The count of under replicated partitions existing in a cluster.	This metric can provide insight into a number of problems on a Kafka cluster ranging from a broker being down to resource exhaustion. For example, a constant number of under replicated partitions for many brokers might indicate that one of the brokers is down. The count across the cluster is equal to the number of partitions on the faulty broker.
Active Controller Trend	Indicates the presence of an active controller	On the cluster level the value for this metric should always be one, that is, at all times only one broker should be the active controller. Any other value could indicate the cluster might be susceptible to administrative issues such as partition reassignment, etc. On a broker level, except for the active controller, the brokers should return zero.
Request Handler Idle Ratio Average Per Minute	The request handler idle ratio trend over a time period. It indicates the percentage of the time the request handlers are not in use.	Lower numbers indicate increased load on the broker. Anything under 0.2 (20 percent) is a cause for concern. Values under 0.1 (10 percent) usually indicate an active performance problem. Typically, there are two reasons for high thread utilization: There aren't enough threads. The threads are devoting some of their resources to overhead tasks and not servicing the client requests itself.
Partition Count	The number of partitions. The count includes both leader and follower replicas.	Expect this value to remain roughly constant over time.
Leader Partition Count	The number of partition leaders.	Consider monitoring this metric because it distinguishes the leader replicas from the follower replicas on a broker. Even if the replicas are balanced in size or number across the cluster, a broker might have a disproportionately large or small number of leaders, which indicates an imbalance. You can use this metric in conjunction with Partition Count to calculate the percentage of a broker's partitions for which it is the leader. In well-balanced clusters, this percentage should be uniform. For example, with a replication factor of two, each broker should be the leader of approximately 50% of its partitions.
Offline Partition Count	The number of offline partitions in the cluster indicating the number of partitions in the cluster that have no leader.	Along with Under Replicated Partitions, this is a critical metric to monitor. Such partitions are inaccessible to clients because produce and fetch requests are sent only to leaders. This could result in issues like producer clients losing messages. One potential reason for the presence of offline partitions is the broker/brokers hosting the leader replicas are down.
Fetch Total Time, 99th Percentile	The 99th percentile of the total turnaround time (time from receiving the request to sending a response back) the broker spends processing the fetch request. 99% of all values in a group of fetch request timing are less than this metric's value.	You can gain insight into how an average request performs and what are the outliers by correlating this metric value with the average value of the metric.
Produce Total Time, 99th Percentile	The 99th percentile of the total turnaround time (time from receiving the request to sending a response back) the broker spends processing the produce request. 99% of all values in a group of produce request timing are less than this metric's value.	You can gain insight into how an average request performs and what are the outliers by correlating this metric value with the average value of the metric. You can use this to set a baseline of sorts. Much like Under Replicated Partitions, a spike in the 99 percentile for Produce requests can alert you to a wide range of performance issues.
Fetch Requests Per Sec	Displays the one minute rate for all the fetch requests being received and processed. Like Bytes Out Per Sec, it is an indicator of the rate at which consumers are uptaking/requesting messages from the Kafka cluster.	You must take care when interpreting this metric. In addition to the consumer client fetch request, the value also includes replica traffic, that is, the fetch requests that get raised by the brokers to keep the topic partitions in sync.
Produce Requests Per Sec	Displays the one minute rate of all the Produce Requests being received and processed. Like Bytes In Per Sec, it indicates how much traffic the brokers are receiving from the Producer clients.	This metric can be useful in deciding whether to expand your cluster. Additionally, it can be helpful when tracked at the broker level. On a normal Kafka cluster the value across all brokers should be basically the same. It helps you identify a broker that is receiving more traffic than others indicating there is a need for a partition rebalance.
Produce Purgatory Size	The number of Produce requests sitting in the Producer Request Purgatory. It is a holding pen for requests waiting to be satisfied (Delayed). Of all Kafka request types, it is used only for Produce requests. It tracks the number of requests sitting in purgatory (including both watchers map and expiration queue).	You can use this metric as a rough gauge of memory usage.
Fetch Purgatory Size	The number of Fetch requests sitting in the Fetch Request Purgatory. It is a holding pen for requests waiting to be satisfied (Delayed). Of all Kafka request types, it is used only for Fetch requests. It tracks the number of requests sitting in purgatory (including both watchers map and expiration queue).	You can use this metric as a rough gauge of memory usage.
Log Flush Latency, 99th Percentile	The 99th percentile value of the latency incurred by a log flush, that is, write to disk in milliseconds. 99% of all values in the group are less than the value of the metric.	Log flushes are generally one of the most expensive operations and can affect the durability, latency, and throughput of the Kafka cluster. This trend line depicts what's the log latency like for the cluster and can help in configuring the log flush interval.
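Several of the analyses in the table are simple arithmetic on the metric values. The sketch below shows three of them: estimating the average message size from Bytes In and Messages In, computing the share of a broker's partitions for which it is the leader, and grading the request handler idle ratio against the 0.2 and 0.1 thresholds mentioned above. The function names and severity labels are hypothetical.

```python
def average_message_size(bytes_in_per_sec, messages_in_per_sec):
    """Estimate bytes per message; returns None when no messages arrive."""
    if messages_in_per_sec == 0:
        return None
    return bytes_in_per_sec / messages_in_per_sec


def leader_share(leader_partition_count, partition_count):
    """Fraction of a broker's partitions for which it is the leader."""
    if partition_count == 0:
        return None
    return leader_partition_count / partition_count


def idle_ratio_severity(idle_ratio):
    """Grade the request handler idle ratio (0.0 to 1.0)."""
    if idle_ratio < 0.1:
        return "performance problem"
    if idle_ratio < 0.2:
        return "concern"
    return "ok"


print(average_message_size(1_000_000, 2_000))  # bytes per message
print(leader_share(50, 100))                   # compare across brokers
print(idle_ratio_severity(0.15))
```

Comparing `leader_share` across brokers is the useful part: with a replication factor of two, values that sit far from 0.5 on some brokers suggest a leader imbalance.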
