Clusters

cluster.png

You can monitor all aspects of your clusters from this page, including generating various reports. The Clusters page includes the following tabs: Overview, Resources, Jobs, Impala, Kafka, HBase, Elasticsearch, Logstash, Kibana, Chargeback, Top X, and Tuning.

By default, the page opens showing the Overview tab.

The data displayed is aggregated across all clusters, unless you select a specific cluster. For example, the Overview Jobs charts display the jobs for all clusters.

Overview

This tab displays an overview of the selected cluster with various KPIs and graphs. Select a cluster and a time period to view the aggregated details for that cluster over the specified period.

The graphs plot the KPIs and other metrics for the selected cluster, over the selected time range. If multiple items are plotted in a graph, you can select the corresponding checkboxes to show or hide each item.

The Overview page displays the following tiles:

Cluster

The KPIs in the Cluster tile show the current values and change based on the selected time range. The following KPIs and graphs are shown in this tile:

cluster-overview-cluster.png
  • Nodes: Shows the current number of active nodes plus the number of bad nodes in the selected clusters. From the Nodes graph, you can plot the active and bad nodes in a cluster, over a specified period.

    cluster-overview-cluster-node.png
  • Available Vcores: Shows the current number of available VCores in the selected clusters. From the VCores graph, you can plot the available VCores and the allocated VCores in a cluster, over a specified period.

    cluster-overview-cluster-VCores.png
  • Available Memory: Shows the current amount of the available memory in the selected cluster. From the Memory graph, you can plot the available memory and allocated memory in a cluster, over a specified period.

    cluster-overview-cluster-memory.png
  • Alerts: Shows the count of alert notifications. The list of alert notifications is displayed in the right panel.

    Note

    The Alerts KPI is displayed only if the count is greater than zero.

Jobs

This displays the KPIs pertaining to the jobs run by various application types in the selected cluster, over a period of time.

cluster-overview-jobs.png

The following KPIs and graphs are shown in this section.

  • Running: The current number of running jobs (apps). The Running Jobs chart graphs the running and accepted (pending) jobs over the selected period.

    cluster-overview-jobs-running.png
  • Pending: The current number of pending jobs. The Running Jobs chart graphs this metric.

  • Success/Failed/Killed: Displays the number of successful, failed, or killed jobs for the selected period. The By Status graph plots all these statuses.

    cluster-overview-jobs-by-status.png
  • Inefficient Events: Shows the number of events that occurred in a cluster due to inefficiencies in the job runs, during the specified period. Check Jobs > Inefficient Apps for the list of jobs (apps) experiencing these events.

    Note

    Jobs can experience multiple events; therefore, the number of jobs listed in the Jobs > Inefficient Apps tab is typically less than the number of events.

    cluster-overview-jobs-inefficientevents.png
  • By Type: Displays the count of jobs run in the specified time period by various application types. This is shown in pie chart segments as well as plotted in a graph.

    cluster-overview-jobs-bytype.png
Alerts

Lists all the alerts and events triggered by jobs for the selected time range. Click CyanWhitePlus.png to create a new AutoAction.

cluster-overview-alerts.png

Resources

The Resources tab displays the usage of resources by the clusters, for a specific period. The resource usage can be grouped by any of the following filters:

  • Application Type

  • User

  • Queue

  • User-defined tag key (see What is tagging)

  • Input/output tables

cluster-resources-main.png

Unravel polls the Resource Manager every 90 seconds to get the resources (VCores, memory, and running containers) for all the running and pending apps. Applications that run for less than 90 seconds are not captured; unless an application is running at the exact moment the Resource Manager is polled, its data is not captured. Therefore, for any specific point in time, the total number of applications running may not match the number of applications shown in Jobs > All Applications when filtered for the same time.

Viewing resource usage

To view the resource usage, do the following:

  1. Go to the Clusters > Resources tab.

  2. From the Cluster drop-down, select a cluster.

  3. From the Group By drop-down, select an option. You can filter the Group By options further from the Filter options box. Click the Deselect-X.png next to an option to deselect it.

    For example, if you select Application Type as a Group By option, then all the application types are listed in the Filter options box. You can then drill down to a specific application type such as Spark.

  4. Select the period range from the date picker drop-down. You can also provide a custom period range. The graphs corresponding to the selected filters are displayed.

  5. Click any point in the graph. The details of the resources used for the running jobs are displayed in a table.

Graphs (Resources)

The following graphs plot the resource usage for a selected filter in the specified period range. You can hover the mouse pointer over the graph to view the corresponding values on the trend line. You can select a section of the graph and drag the pointer to zoom in. Click Reset Zoom to zoom out.

  • VCores

    The VCores graph plots the total available and total allocated VCores in your cluster for the specified period. Select or deselect the checkboxes to hide or show the available and allocated VCores.

  • Memory

    The Memory graph plots the total available and total allocated memory in your clusters for the specified period. Select or deselect the checkboxes to hide or show the total available and total allocated memory.

  • VCores (Group By, Filter) and Memory (Group By, Filter)

    This graph plots the VCores and memory resources usage by any of the following options in a cluster for the specified period.

    • Application type

    • User

    • Queue

    • User-defined tag key (see What is tagging)

    • Input/output tables

    The available Filter options depend on the selected Group By option.

You can click export-format.png in the graph to export the graph and the table data to any of the following formats:

  • PDF

  • PNG

  • JPEG

  • SVG

  • CSV

  • XLS

When you click any point in the graph, the following details of the resources used for a job run are displayed. Click a row in this table to display the APM page of the corresponding application. Click column-setting.png in the table to select the columns to display.

  • Type: The type of application where the job is running.

  • User: Name of the user running the job.

  • State: Status of the job.

  • App Name/ID: Name or ID of the application where the job is running.

  • Start Time: The time when the job was started.

  • Duration: How long the job has been running.

  • Queue: The queue to which the job was submitted.

  • Progress: Percentage of progress of the running job.

  • Memory MB: Memory in MB used by the job.

  • VCores: Number of VCores used by the job.

  • Running Containers: Number of running containers.

Clusters > Jobs

The Clusters > Jobs tab shows all the jobs running in a cluster, for a specific period, which can be grouped by any of the following options:

  • State

  • Application type

  • User

  • Queue

You can further filter the Group by options. The Clusters > Jobs tab has the following two sections:

  • The graph plots the trend of the job count over a specified period, which can be grouped by the Group By options and further filtered. Select or deselect the filter checkboxes to show or hide the corresponding trend line in the graph.

    cluster-jobs-job-count-graphs.png
  • The list of the running applications with the following details is displayed for the specified period.

    cluster-jobs-app-list.png

    • Type: The type of application where the job is running.

    • State: Status of the job.

    • User: Name of the user running the job.

    • App Name/ID: Name or ID of the application where the job is running.

    • Start Time: The time when the job was started.

    • Duration: How long the job has been running.

    • Queue: The queue to which the job was submitted.

    • Progress: Percentage of progress of the running job.

    • Memory MB: Memory in MB used by the job.

    • VCores: Number of VCores used by the job.

    • Running Containers: Number of running containers.

Viewing Jobs

To view the jobs, do the following:

  1. Go to the Clusters > Jobs tab.

  2. From the Cluster drop-down, select a cluster.

  3. From the Group By drop-down, select an option. You can filter the Group By options further from the Filter options box. Click the Deselect-X.png next to an option to deselect it.

    For example, if you select Application Type as a Group By option, then all the application types are listed in the Filter options box. You can then drill down to a specific application type such as Spark.

  4. Select the period range from the date picker drop-down. You can also provide a custom period range. The graphs corresponding to the selected filters are displayed.

  5. Click any point in the graph. The details of the resources used for the running jobs are displayed in a table.

Impala

From this tab, you can monitor the details of the running Impala queries. These details are plotted into graphs based on the selected cluster, period, group by options, and filters.

cluster-imapala-main.png
Monitoring Impala queries
  1. Go to the Clusters > Impala tab.

  2. From the Cluster drop-down, select a specific cluster. All the nodes in the cluster are displayed in the Memory Consumption graph.

  3. Select the period range from the date picker drop-down. You can also provide a custom period range. The graphs corresponding to the selected filters are displayed.

  4. From the Group By drop-down, select an option. All the items corresponding to the selected option are listed in the Filter box.

  5. You can further edit the Group By options from the Filter box. Click the Deselect-X.png next to any option to deselect it. The following graphs are displayed:

    • Memory Consumption graph

      The Memory Consumption graph plots the memory consumed by the Impala queries, by each node in the selected cluster, for the specified period range. You can select or deselect the nodes displayed below the graph to view the corresponding details on the graph. Hover over any point in the Memory Consumption graph to view the details of the memory consumption for each of the selected nodes in the selected cluster, at a specific point in time.

    • Queries graph

      The Queries graph displays the number of queries that are run, based on the specified Group By options. You can select or deselect the filters corresponding to the selected Group By option to view the corresponding details on the graph. Hover over any point in the Queries graph to view the number of queries run by each of the specified Group By filters, in the selected cluster, at a specific point in time.

  6. Click any point in the graph. The details of the running Impala queries are displayed in a table below. Click column-setting.png to set the columns for display in the table.

    • Type: Type of query.

    • State: Status of the query.

    • User: Name of the user running the query.

    • App Name / ID: Query string.

    • Start Time: Time when the query was run.

    • Duration: Duration of the query.

    • Queue/pool: Queue or pool.

    • Peak Mem Limit per Host: The limit of peak memory for each host.

    • Estimated Peak Mem Limit per Host: The estimated threshold of peak memory for each host.

    • Actual Peak Mem per Host: Actual peak memory attained for each host while running the Impala queries.

Kafka

From this tab, you can monitor various metrics of Kafka streaming, in multiple clusters, from a single Unravel installation. Before using Unravel to monitor Kafka, you must connect the Unravel server to a Kafka cluster. Refer to Connecting to the Kafka cluster. You can also improve the Kafka cluster security by having Kafka authenticate connections to brokers from clients using either SSL or SASL. Refer to Kafka security.

The following KPIs are displayed on the top of the Kafka page. The first three (Under Replicated Partitions, Offline Partitions, Controller) are color-coded as follows:

  • Green - Indicates that the streaming process is healthy.

  • Red - Indicates that the streaming process is unhealthy. This is an alert for further investigations.

The remaining metrics are data input and output metrics and are always displayed in blue.

  • Under Replicated Partitions: Total number of under-replicated partitions. This metric indicates whether partitions are replicating as configured. If under-replicated partitions are shown, you can drill down in the Broker tab for further investigation.

  • Offline Partitions: Total number of offline partitions. If this metric is greater than 0, there are broker-level issues to be addressed.

  • Controller: Shows the number of brokers in the cluster acting as the controller. If this metric is 0, there are no active controllers.

  • Bytes in per Sec: The total number of incoming bytes received per second from all servers.

  • Bytes Out per Sec: The total number of outgoing bytes sent per second to all servers.

  • Messages in per Sec: Total rate of the incoming messages. A unit of data is called a message. Messages are written into Kafka in batches.

  • Total Fetch Requests per Sec: Rate of the fetch requests.

See Kafka Metrics Reference and Analysis for a complete list of Kafka metrics that are monitored by Unravel.

Metrics

You can monitor all the metrics associated with your Kafka clusters from the Kafka > Metrics tab. The following graphs plot the various Kafka metrics for a selected cluster, in the specified time range:

  • Bytes In per Second: This graph plots the total number of bytes received per second from all topics and brokers, over a specified time range.

    metrics-bytes-in-per-second.png
  • Bytes Out per Sec: This graph plots the total number of outgoing bytes sent per second to all Kafka consumers, over a specified time range.

    metrics-bytes-out-per-second.png
  • Messages in per Sec: This graph plots the messages produced in the cluster across all topics and brokers, over a specified time range.

    metrics-messages-in-per-second.png
  • Total Fetch Requests per Sec: This graph plots the total rate of fetch requests within the cluster, over a specified time.

    metrics-total-fetch-requests-per-second.png
  • Under Replicated Partitions: This graph plots the number of under replicated partitions per second, within a cluster, over a specified period.

    metrics-under-replicated-par-per-sec.png
  • Active Controller Trend: This graph plots the trend of which brokers in the cluster act as the controller, over a specified period. You can select or deselect the brokers to view the corresponding trend.

    metrics-active-controller-trend.png
  • Request Handler Idle Ratio Average per Minute: It indicates the percentage of the time the request handlers are not in use. This graph plots the average ratio per minute that the request handler threads are idle in a specified time range.

    metrics-request-handler-idle.png
  • Partition Count: The number of partitions. The count includes both leader and follower replicas. This graph plots the total number of partitions across all brokers in a cluster, in a specified time range. The trend line indicates the metric's value across the entire cluster averaged over all brokers.

    metrics-partition-count.png
  • Leader Partition Count: This graph plots the total number of leader partitions across all brokers in a cluster, in a specified time range.

    metrics-leader-partition-count.png
  • Offline Partition Count: This graph plots the total number of partitions that do not have an active leader, in the specified time range. Such partitions are neither writable nor readable. If the count is more than 0, more investigation is required to resolve broker-level issues.

    metrics-offline-partition-count.png
  • Fetch Total Time, 99 Percentile: This graph plots the time value for the entire cluster averaged across all brokers.

    metrics-fetch-tt-99-percentile.png
  • Produce Total Time, 99 Percentile: This graph plots the time value for the entire cluster averaged across all brokers.

    metrics-produce-tt-99-percentile.png
  • Fetch Requests per Sec: This graph plots the rate of fetch requests in a specified time range.

    metrics-fetch-req-per-sec.png
  • Produce Requests per Sec: Shows the summation of the produce requests being processed across all brokers per second. This graph plots the rate of produce requests in a specified time range.

    metrics-produce-req-per-sec.png
  • Produce Purgatory Size: This graph plots the produce purgatory size over a specified time range. Purgatory holds requests that have not yet succeeded and have not yet resulted in an error.

    metrics-produce-purgatory-size.png
  • Fetch Purgatory Size: The number of fetch requests sitting in the Fetch Request Purgatory, a holding pen for requests waiting to be satisfied (delayed). Of all Kafka request types, it is used only for Fetch requests. It tracks the number of requests sitting in purgatory (including both the watchers map and expiration). This graph plots the fetch purgatory size over a specified time range.

    metrics-fetch-purgatory-size.png
  • Log Flush Latency, 99th Percentile: The 99th percentile value of the latency incurred by a log flush, that is, a write to disk, in milliseconds. 99% of all values in the group are less than the value of the metric. This graph plots the time taken for the brokers to flush logs to disk over a specified time range.

    metrics-log-flush-99-percentile.png

The average values of the following metrics are displayed in a table below:

  • Topic: Name of the Kafka topic.

  • Consumer Group: The name of the group of consumers that work together to consume a topic. Consumers read messages; a consumer subscribes to one or more topics and reads the messages in the order in which they were produced.

  • Brokers: The name of the Kafka broker. A single Kafka server is called a broker. The broker receives messages from producers, assigns offsets to them, and commits the messages to storage on disk.

  • Bytes in per Sec: The total number of incoming bytes received per second from all servers.

  • Bytes Out per Sec: The total number of outgoing bytes sent per second to all servers.

  • Messages in per Sec: Total rate of the incoming messages.

  • Total Fetch Requests per Sec: Rate of the fetch requests.

You can click a row in the average metrics table to view the corresponding Topic summary page.

broker-average-metrics.png
Broker

A single Kafka server is called a broker. Kafka brokers are designed to operate as part of a cluster. Within a cluster of brokers, one broker can function as the cluster controller. The broker receives messages from producers, assigns offsets to them, and commits the messages to storage on disk. It also services consumers, responding to fetch requests for partitions with the messages that have been committed to disk.

You can monitor all the metrics associated with each Kafka broker in your cluster, from the Kafka > Broker tab.

The latest metrics of the Kafka brokers for the specified time range are displayed in a table as shown:

broker-latest-metrics-table.png

The following graphs plot the various Kafka metrics for a selected broker, in the selected cluster, for the specified time range:

  • Bytes In per Second: This graph plots the total number of bytes received per second for the selected broker, over a specified time range.

    broker-bytes-in-per-sec.png
  • Bytes Out per Sec: This graph plots the total number of outgoing bytes sent per second by the selected broker, over a specified time range.

    broker-bytes-out-per-sec.png
  • Messages in per Sec: This graph plots the total rate of the incoming messages received by the broker, over a specified time range.

    broker-messages-in-per-sec.png
  • Total Fetch Requests per Sec: This graph plots the total rate of fetch requests made by the broker, over a specified time.

    broker-total-fetch-request-per-sec.png
  • Under Replicated Partitions: This graph plots the number of under replicated partitions per second in the selected broker, over a specified time range.

    broker-under-replicated-partitions-per-sec.png
  • Active Controller Trend: This graph plots whether the selected broker acted as the controller, over a specified period.

    broker-active-controller-trend.png
  • Request Handler Idle Ratio Average per Minute: This graph plots the average ratio per minute that the request handler threads are idle, in the selected broker, for the specified time range.

    broker-request-handle-idle-ratio-avg.png
  • Partition Count: This graph plots the total number of partitions in the selected broker, for the specified time range.

    broker-partition-count.png
  • Leader Partition Count: This graph plots the total number of leader partitions in the selected broker, for the specified time range.

    broker-leader-partition-count.png
  • Offline Partition Count: This graph plots the total number of partitions that do not have an active leader in the selected broker, for the specified time range. Such partitions are neither writable nor readable. If the count is more than 0, more investigation is required to resolve the broker-level issues.

    broker-offline-partition-count.png
  • Fetch Total Time, 99 Percentile: This graph plots the fetch request time value for the selected broker, in a specified time range.

    broker-fetch-tot-time-99-per.png
  • Produce Total Time, 99 Percentile: This graph plots the produce request time value for the selected broker, in a specified time range.

    broker-produce-tot-time-99-per.png
  • Fetch Requests per Sec: This graph plots the rate of fetch requests made by the selected broker, in a specified time range.

    broker-fetch-req-per-sec.png
  • Produce Requests per Sec: This graph plots the rate of produce requests in a specified time range.

    broker-produce-req-per-sec.png
  • Produce Purgatory Size: This graph plots the produce purgatory size, for the selected broker, over a specified time range. Purgatory holds requests that have not yet succeeded and have not yet resulted in an error.

    broker-produce-purgatory-size.png
  • Fetch Purgatory Size: This graph plots the fetch purgatory size for the selected broker, over a specified time range.

    broker-fetch-purgatory-size.png
  • Log Flush Latency, 99th Percentile: This graph plots the time taken for the selected broker to flush logs to disk over a specified time range.

    broker-log-flush-latency-99-percentile.png

The average values of the following metrics are displayed in a table below:

broker-average-metrics.png

You can click a row in the average metrics table to view the corresponding Topic summary page.

Topic

A Topic is a category into which the Kafka records are organized. Topics are further divided into several partitions. From the Clusters > Topic tab, you can monitor all the metrics associated with a topic.

topic-main.png

The latest metrics of topics are displayed in a table as shown.

  • Topic: Name of the Kafka topic.

  • Brokers: The name of the Kafka broker. A single Kafka server is called a broker. The broker receives messages from producers, assigns offsets to them, and commits the messages to storage on disk.

  • Bytes in per Sec: The total number of incoming bytes received per second for the topic.

  • Bytes Out per Sec: The total number of outgoing bytes sent per second for the selected topic, over a specified time range.

  • Messages in per Sec: The total rate of the incoming messages received by the topic, over a specified time range.

  • Total Fetch Requests per Sec: The total rate of fetch requests made for the topic, over a specified time.

The corresponding graphs are plotted for the topic metrics:

  • Bytes In per Second: This graph plots the total number of bytes received per second for the selected topic, over a specified time range.

  • Bytes Out per Second: This graph plots the total number of outgoing bytes sent per second for the selected topic, over a specified time range.

  • Messages In per Second: This graph plots the total rate of the incoming messages received by the topic, over a specified time range.

  • Total Fetch Request per Second: This graph plots the total rate of fetch requests made by the topic, over a specified time.

  • Under Replicated Partitions per Second: This graph plots the number of under replicated partitions per second, within a cluster, over a specified period.

The average metrics of the topics, in the specified period, are displayed in a table. If you click a topic, the corresponding topic summary page is displayed.

Topic Summary page

The Topic summary page displays the latest and average values of the topic metrics along with the corresponding graphs.

topic-summary.png
Consumer Summary page

Consumers read messages. They are also called subscribers or readers. A consumer subscribes to one or more topics and reads the messages in the order in which they were produced. The consumer keeps track of which messages it has already consumed by tracking the offset of the messages.

The following metrics of the consumers are displayed on the Consumer Summary page.

  • Consumer Name: Name of the consumer.

  • Max Lag: Number of messages the consumer lags behind the producer by.

  • Total Lag: Total number of messages the consumer lags behind the producer, summed across all of its partitions.

  • Mean Lag: Average lag across the partitions consumed by the consumer.

  • Partitions: Total number of partitions consumed by the consumer.

  • Lagging Partitions: Number of partitions for which the consumer lag is increasing consistently, and the increase in lag from the start of the window to the last value is greater than the lag threshold.

  • Stalled Partitions: Number of partitions for which the consumer commit offset is not increasing and the lag is greater than zero.

  • Status: Status of the consumer (OK, Warning, or Error).

The following metrics are tracked in this page:

  • Number of Topics

  • Number of Partitions

The Topic list displays the KPIs; when details are available a more info icon is displayed. Click it to bring up the Kafka view for the topic. Below the list are two tabs that display graphs of the Topic and Partition details. By default, the window opens with the Topic Detail graph displayed.

You can choose both the Partition and the Metric for the display. By default, the 0th partition is displayed using the offset metric. The Partition Details list is populated if the details are available.

Unravel insights for Kafka

Unravel provides auto-detection of lagging or stalled Consumer Groups. It lets you drill down into your cluster and determine which consumers, topics, and partitions are lagging or stalled. See Kafka Insights for a use case example of drilling down into lagging or stalled Consumer Groups.

Unravel determines Consumer status by evaluating the consumer's behavior over a sliding window. For example, we use an average lag trend for 10 intervals (of 5 minutes duration each), covering a 50-minute period. Consumer Status is evaluated on several factors during the window for each partition it is consuming.

For a topic partition, Consumer status is:

  • Stalled: The consumer commit offset for the topic partition is not increasing and the lag is greater than zero.

  • Lagging: The consumer lag for the topic partition is increasing consistently, and the increase in lag from the start of the window to the last value is greater than the lag threshold.

The information is distilled down into a status for each partition, and then into a single status for the consumer. A consumer is in one of the following states:

  • OK: The consumer is working and is current.

  • Warning: The consumer is working, but falling behind.

  • Error: The consumer has stopped or stalled.

Kafka Metrics Reference and Analysis

HBase

Running real-time data ingestion and multiple concurrent workloads on HBase clusters in production is always challenging. Multiple factors affect a cluster's performance or health, and dealing with them isn't easy. Timely, up-to-date, and detailed data is crucial to locating and fixing issues to maintain a cluster's health and performance.

Most cluster managers provide high-level metrics, which, while helpful, aren't enough for understanding cluster and performance issues. Unravel provides detailed data and metrics to help you identify the root causes of cluster and performance issues, specifically hot-spotting. This tutorial examines how you can use Unravel's HBase APM to debug issues in your HBase cluster and improve its performance.

Cluster health

Unravel provides per-cluster metrics that give an overview of your HBase clusters in the Clusters > HBase tab, where all the HBase clusters are listed. Click a cluster name to bring up the cluster's detailed information.

The metrics are color-coded so you can quickly ascertain your cluster's health and, in case of issues, drill down further into them.

  • Green = Healthy

  • Red = Unhealthy, alert for metrics and investigation required

hbase-kpis.png
  • Live Region Servers: the number of running region servers.

  • Dead Region Servers: the number of dead region servers.

    This metric gives you insight into the health of the region servers. In a healthy cluster this should be zero. When the number of dead region servers is one or greater you know something is wrong in the cluster.

    Dead region servers can cause an imbalance in the cluster. When a server has stopped, its regions are distributed across the running region servers. This consolidation means some region servers handle a larger number of regions and consequently have a correspondingly higher number of read, write, and store operations. This can result in some servers processing a huge number of operations while others are idle, causing hot-spotting.

  • Cluster Requests: the number of read and write requests aggregated across the entire cluster.

  • Average Load: the average number of regions per region server across all servers.

    This metric is the average number of regions on each region server. Like Dead Region Servers, this metric helps you to triage imbalances in the cluster and optimize the cluster's performance.

  • Rit Count: the number of regions in transition.

  • RitOver Threshold: the number of regions that have been in transition longer than a threshold time (default: 60 seconds).

  • RitOldestAge: the age, in milliseconds, of the longest region in transition.

Note

Region Server refers to the servers (hosts) while region refers to the specific regions on the servers.

Metrics

The following graphs are plotted in the Metrics tab:

hbase-metrics-main.png
Region server

Unravel provides a list of region servers, their metrics, and Unravel's insight into each server's health, for all region servers across the cluster at a specific point in time.

hbase-region-server.png

For each region server, the table lists the Region Server Name and the server metrics Read Request Count, Write Request Count, Store File Size, Percent Files Local, Compaction Queue Length, Region Count, and Insights for each server. These metrics and insights are helpful in monitoring activity in your HBase cluster.

An icon is displayed in the Insights column so you can quickly see whether the server is in trouble. Hover over the icon to see a tooltip listing the hot-spotting notifications with their associated threshold (average value * 1.2). If any value is above the threshold, the region server is hot-spotting.

Additional Server Health, Context, and Operational Count metrics are provided on the right.

Region server metric graphs

Beneath the table are four graphs: Total Read Request Count, Total Write Request Count, Total Store File Size, and Total Percent Files Local. These graphs are for all metrics across the entire cluster. The time period over which the metrics are graphed is noted above the graphs.

region-server-graphs.png
Tables

The last item in the Cluster tab is a list of tables. This list covers all the tables across all the region servers in the entire cluster. Unravel uses these metrics to attempt to detect an imbalance. Averages are calculated within each category and alerts are raised accordingly. Just like with the region servers, you can tell at a glance the health of each table.

hbase-tables-main.png

The list is searchable and displays the Table Name, Table Size, Region Count, Average Region Size, Store File Size, Read Request Count, Write Request Count, and finally Health. Hover over the health status to see a tooltip listing the hot-spotting notifications. Bad health indicates that a large number of store operations from different sources are redirected to this table. In this example, the Store File Size exceeds the threshold (more than 20% above the average).

You can use this list to drill down into the tables and get their details, which can be useful for monitoring your cluster.

Click a table to view its details, which include graphs of metrics over time for the region, a list of the regions using the table, and the apps accessing the table. The following is an example of the graphed metrics: Region Count, Total Read Request Count, and Total Write Request Count.

Region

Once in the Table view, click the Region tab to see the list of regions accessing the table.

The table list shows the Region Name, Region Server Name, and the region metrics Store File Size, Read Request Count, Write Request Count, and Health. These metrics are useful in gauging activity and load in the region. The region health is important in deciding whether the region is functioning properly. If any metric value crosses the threshold, the status is listed as bad. A bad status indicates you should start investigating to locate and fix hot-spotting.

hbase-region.png

Elasticsearch

This tab lets you monitor various metrics in the Elasticsearch clusters. You can select a cluster and the period range to monitor the metrics for each cluster, each node in the cluster, as well as for the corresponding indices. The monitoring of the Elasticsearch clusters can be enabled by configuring the Elasticsearch properties in the unravel.properties file.

es-mainpage.png
Enable monitoring for Elasticsearch clusters

To enable monitoring for Elasticsearch clusters in Unravel:

  1. Go to /usr/local/unravel/etc/

    cd /usr/local/unravel/etc/

  2. Edit unravel.properties file.

    vi unravel.properties

  3. Set the com.unraveldata.elk.es.monitor.enabled property to true.

  4. Configure the clusters for Elasticsearch monitoring in the com.unraveldata.elk.es.clusters property.

  5. Set other properties for Elasticsearch. Refer to ELK properties for more details. A sample configuration is shown after these steps.

    es-properties.png
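For reference, a minimal sketch of the resulting unravel.properties entries might look like the following. The cluster name shown is hypothetical, and the exact value format of com.unraveldata.elk.es.clusters depends on your environment; see ELK properties for the authoritative list of settings.

    # Enable Elasticsearch monitoring (sample values; adjust for your environment)
    com.unraveldata.elk.es.monitor.enabled=true
    # Clusters to monitor (hypothetical cluster name shown)
    com.unraveldata.elk.es.clusters=es_cluster1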
View Elasticsearch cluster metrics

By default, the cluster metrics are summarized and shown for all the nodes in the cluster for the past hour. To view the Elasticsearch cluster KPIs for a specific time period:

  1. Go to Clusters > Elasticsearch.

  2. Select a cluster from the Cluster drop-down.

  3. Optionally select a period range.

    The following Elasticsearch KPIs of a cluster are displayed on top of the page:

    Cluster Metrics

    • Status: Status of the cluster.

    • Nodes: Total number of nodes in the cluster.

    • Indices: Total count of indices in the cluster.

    • Total Shards: Total number of shards in the cluster.

    • Unassigned Shards: Total number of unassigned shards in the cluster.

    • Documents: Total count of indexed documents in the cluster.

    • Successful Shards: Total number of successful shards in the cluster.

    • Active Shards: Total number of active shards in the cluster.

    • Initializing Shards: Number of initializing shards in the cluster.

    • Relocating Shards: Number of relocating shards in the cluster.

    Cluster Graphs

    The following graphs plot various Elasticsearch KPIs for a selected cluster in a specified time range:

    • Search Rate: This graph plots the rate at which documents are queried in Elasticsearch for the selected cluster.

      es-cluster-search-rate.png
    • Search Latency: This graph plots the time taken to execute the search request for the selected cluster.

      es-cluster-search-latency.png
    • Indexing Rate: This graph plots the rate at which documents are indexed in the selected cluster.

      es-cluster-indexing-rate.png
    • Indexing Latency: This graph plots the time taken to index the documents in the selected cluster.

      es-cluster-indexing-latency.png
View Elasticsearch node metrics

You can monitor the Elasticsearch metrics at the node level. To view the metric details of a specific node in an Elasticsearch cluster:

  1. Go to Clusters > Elasticsearch.

  2. Select a cluster from the Cluster drop-down.

  3. Optionally select a period range.

  4. Click the Nodes tab. All the nodes in the cluster and the corresponding details are listed in the Nodes tab.

  5. Select a node. The node details are displayed in a table. The node metrics are displayed on the upper right side.

    You can use the toggle button sort-notused.png to sort the columns of the node details table in ascending or descending order. You can also click the column-setting.png icon to set the columns to be displayed in the Nodes tab.

    Node Details

    The following details are displayed for each node:

    • Name: Name of the node.

    • CPU usage: Total CPU usage for the node.

    • Disk Total: Total disk space available for the Elasticsearch node.

    • Disk Used: Total disk space used by the Elasticsearch node.

    • Load Average: The operating system load average for the node.

    • Jvm Memory: JVM heap memory used.

    Node Metrics

    The following node metrics are displayed for the selected node:

    • Documents: The total number of indexed documents in the node.

    • Free Disk Space: The available disk space in the node.

    • Heap Percent: Percentage of heap memory used.

    • Flush Latency: Time taken to flush the indices in a node.

    • Flush Rate: The rate at which the indices are flushed.

    • CPU: The CPU utilization by the operating system, in percentage.

    • RAM: RAM utilization in percentage.

    • Total Read Operations: Total number of read operations processed.

    • Total Write Operations: Total number of write operations processed.

    • Uptime: Time for which the node has been up and active.

    Node Graphs

    The following graphs plot various Elasticsearch KPIs for a selected node in a specified time range:

    • JVM Heap: This graph plots the total heap used by Elasticsearch in the JVM for a selected node.

      es-node-jvmheap.png
    • Index Memory: This graph plots the heap memory used for indexing in a selected node.

      es-node-indexmem.png
    • CPU Utilization: This graph plots the percentage of CPU usage for the Elasticsearch process in the selected node.

      es-node-cpuutil.png
    • System Load: The graph plots the load average per minute for the selected node.

      es-node-systemload.png
    • Segment Count: This graph plots the maximum segment count for the selected node.

      es-node-segmentcount.png
    • Search Latency: This graph plots the average search latency for the selected node.

      es-node-searchlatency.png
    • Indexing Latency: This graph plots the average latency for indexing documents in the selected node.

      es-node-indexinglatency.png
    • Documents: This graph plots the count of indexed documents for the selected node.

      es-node-documents.png
View Elasticsearch indices metrics

To view the metric details of an index in an Elasticsearch cluster:

  1. Go to Clusters > Elasticsearch.

  2. Select a cluster from the Cluster drop-down.

  3. Select a period range.

  4. Click the Indices tab. All the indices in the cluster are listed in the Indices tab.

  5. Select an index. The index details are displayed in a table. The index metrics are displayed on the upper right side.

    You can use the toggle button sort-notused.png to sort the columns of the index details table in ascending or descending order. You can also click the column-setting.png icon to set the columns to be displayed in the Indices tab.

    Indices Details

    The following details are displayed for each of the indices:

    • Name: Name of the index.

    • Docs: Number of indexed documents.

    • Store Size: Storage size of the indexed documents.

    • Search Rate: Number of search requests being executed.

    • Active Shards: Total number of active shards in an index.

    • Unassigned Shards: Total number of unassigned shards in an index.

    Index Metrics

    The following metrics are displayed for the selected index:

    • Flush Total: Total number of flushed indices.

    • Flush Latency: Time taken to flush an index.

    • Flush Rate: The rate at which indices are flushed.

    • Segments Count: Count of segments.

    By default, all the indices are selected in the graphs. You can select one or more indices to view in the graphs.

    Indices Graphs

    The following graphs plot various Elasticsearch KPIs for the selected indices in a specified time range:

    • Search Rate: This graph plots the number of search requests that are being executed in Elasticsearch.

      es-index-searchrate.png
    • Indexing Latency: This graph plots the time taken to index the documents in Elasticsearch.

      es-index-indexlatency.png
    • Indexing Rate: This graph plots the rate at which documents are indexed in Elasticsearch.

      es-index-indexrate.png
    • Segments Count: This graph plots the count of index segments.

      es-index-seg-count.png
    • Fetch Latency: This graph plots the fetch latency for the indices.

      es-index-fetchlatency.png
    • Fetch Rate: This graph plots the rate at which query results are fetched.

      es-index-fetch-rate.png
    • Flush Latency: This graph plots the time taken to flush indices.

      es-index-flush-latency.png
    • Flush Rate: This graph plots the rate at which indices are flushed.

      es-index-flush-rate.png

Logstash

This tab lets you monitor various metrics for Logstash. You can select a period to monitor the metrics for each Logstash instance and pipeline. The Logstash monitoring can be enabled by configuring the Logstash properties in the unravel.properties file.

logstash-main.png
Enable Logstash monitoring

To enable monitoring for Logstash in Unravel:

  1. Go to /usr/local/unravel/etc/

    cd /usr/local/unravel/etc/

  2. Edit unravel.properties file.

    vi unravel.properties

  3. Set the com.unraveldata.elk.logstash.enable property to true.

  4. Set other properties for Logstash. Refer to ELK properties for more details. A sample configuration is shown after these steps.

    prop-example.png
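For reference, a minimal sketch of the resulting unravel.properties entry is shown below. Any additional Logstash settings (hosts, ports, and so on) are environment-specific; see ELK properties for the authoritative list.

    # Enable Logstash monitoring (sample value)
    com.unraveldata.elk.logstash.enable=true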
View Logstash metrics

By default, the cluster metrics are summarized and shown for all the nodes in the cluster for the past hour. To view the Logstash cluster KPIs for a specific time period:

  1. Go to Clusters > Logstash.

  2. Select a period range.

    The following Logstash KPIs are displayed on top of the page:

    Cluster metrics

    • Nodes: Total number of nodes in the cluster.

    • Events Received: Average of the number of events flowing into all the nodes, in a selected time period.

    • Events Emitted: Average of the number of events flowing out of all the nodes, in a selected time period.

    • Memory: The total heap memory used versus the total heap memory available (heap memory used/total heap memory available).

    • Pipelines: Total number of logical pipelines.

View Logstash node metrics

You can monitor the Logstash metrics at the node level. To view the metric details of a specific node in a Logstash cluster:

  1. Go to Clusters > Logstash.

  2. Select a period range.

  3. Click the Nodes tab. All the nodes in the cluster and the corresponding details are listed in the Nodes tab.

  4. Select a node. The node details are displayed in a table. The node metrics are displayed on the upper right side.

    You can use the toggle button sort-notused.png to sort the columns of the node details table in ascending or descending order. You can also click the column-setting.png icon to set the columns to be displayed in the Nodes tab.

    Node details

    The following details are displayed for each node:

    • Name: Name of the node.

    • Jvm Heap Used: JVM heap memory used.

    • Events Received: Current number of events flowing into the selected node.

    • Events Emitted: Current number of events flowing out of the selected node.

    • Events Filtered: Current number of events that are filtered for the selected node.

    Node metrics

    The following node metrics are displayed for the selected node:

    • Version: Logstash version.

    • CPU Usage: Average CPU usage for the selected period of time.

    • Uptime: Time since the node has been up and active.

    • Reloads: Successful reloads versus the total number of reloads.

    Node graphs

    The following graphs plot various Logstash KPIs for a selected node in a specified time range:

    • Events Received: This graph plots the average of the total events flowing into the selected node, for the specified time period.

      logstash-node-events-recd.png
    • Events Received Rate: This graph plots the average rate at which events flow into the selected node, for the specified time period.

      logstash-node-events-recd-rate.png
    • Events Emitted: This graph plots the average of the total events flowing out of the selected node, for the specified time period.

      logstash-node-events-emitted.png
    • Events Emitted Rate: This graph plots the average rate at which events flow out of the selected node, for the specified time period.

      logstash-node-events-emitted-rate.png
    • Cpu Percent: This graph plots the percentage of CPU usage for the Logstash process in the selected node.

      logstash-node-cpu-percent.png
    • Jvm Heap: This graph plots the total heap used by Logstash in the JVM for a selected node.

      logstash-node-JVM-heap.png
    • System Load: The graph plots the load average per minute for the selected node.

      logstash-node-system-load.png
View Logstash pipeline metrics

To view the metric details of a pipeline in a Logstash cluster:

  1. Go to Clusters > Logstash.

  2. Select a period range.

  3. Click the Pipelines tab. All the pipelines in the cluster are listed in the Pipelines tab.

  4. Select a pipeline. The pipeline details are displayed in a table. The pipeline metrics are displayed on the upper right side.

    You can use the toggle button sort-notused.png to sort the columns of the pipeline details table in ascending or descending order. You can also click the column-setting.png icon to set the columns to be displayed in the Pipelines tab.

    Pipelines details

    The following details are displayed for the corresponding pipelines:

    • Name: Name of the pipeline.

    • Events Received: Current number of events flowing into the selected pipeline.

    • Events Emitted: Current number of events flowing out of the selected pipeline.

    • Events Filtered: Current number of events that are filtered for the selected pipeline.

    • Number of Nodes: Total number of nodes in the pipeline.

    Pipelines metrics

    The following metrics are displayed for the selected pipelines:

    • Reload Interval: Time interval that specifies how often Logstash checks the config files for changes in the pipeline (in seconds).

    • Batch Size: Maximum number of events that an individual worker thread collects before executing filters and outputs.

    • Reloads: Successful reloads versus the total number of reloads.

    • Workers: Number of threads to run for filter and output processing.

    Pipelines graphs

    The following graphs plot various metrics for the selected pipelines in a specified time range:

    • Events Received: This graph plots the average of the total events flowing into the selected pipeline, for the specified time period.

      logstash-pipeline-events-recd.png
    • Events Received Rate: This graph plots the average rate at which events flow into the selected pipeline, for the specified time period.

      logstash-node-events-recd-rate.png
    • Events Emitted: This graph plots the average of the total events flowing out of the selected pipeline, for the specified time period.

      logstash-pipeline-events-emitted.png
    • Events Emitted Rate: This graph plots the average rate at which events flow out of the selected pipeline, for the specified time period.

      logstash-pipeline-events-emitted-rate.png

Kibana

This tab lets you monitor various metrics for Kibana. You can select a period to monitor the metrics for the Kibana instances in a cluster. You can enable Kibana monitoring by configuring the Kibana properties in the unravel.properties file.

Enable Kibana monitoring

To enable monitoring for Kibana in Unravel:

  1. Go to /usr/local/unravel/etc/

    cd /usr/local/unravel/etc/

  2. Edit unravel.properties file.

    vi unravel.properties

  3. Set the com.unraveldata.elk.kibana.monitor.enabled property to true.

  4. Set other properties for Kibana. Refer to ELK properties for more details. A sample configuration is shown after these steps.

    2020-07-01_00-10-40.png
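For reference, a minimal sketch of the resulting unravel.properties entry is shown below. Any additional Kibana settings (hosts, ports, and so on) are environment-specific; see ELK properties for the authoritative list.

    # Enable Kibana monitoring (sample value)
    com.unraveldata.elk.kibana.monitor.enabled=true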
View Kibana metrics

By default, the Kibana metrics are shown for all the clusters for the past hour. To view the Kibana cluster KPIs for a specific time period:

  1. Go to Clusters > Kibana.

  2. Select a cluster from the Cluster drop-down.

  3. Select a period range. The following Kibana KPIs are displayed on top of the page:

    Cluster KPIs

    • Status: Status of the cluster.

    • Heap Total: Total heap memory available.

    • Heap Used: Total heap memory used.

    • Requests Disconnects: Number of disconnected client requests.

    • Request Total: Total number of client requests received by the Kibana instance.

    • Response Time Max: Maximum time taken to respond to the client requests received by the Kibana instance.

    • Concurrent Connections: Total number of concurrent connections to the Kibana instance.

    • Collection Interval: Time period between data sampling for the metrics.

Kibana Graphs
kibana-main.png

The following graphs plot the Kibana KPIs for a selected cluster in a specified time range:

  • Heap Used: This graph plots the total heap memory used by Kibana, in a cluster, for the specified period.

  • Request Total: This graph plots the total number of client requests received by the Kibana instance in the specified period.

  • Request Disconnects: This graph plots the total number of disconnected requests in a specified period.

  • Concurrent Connections: This graph plots the concurrent connections to the Kibana instance in a specified period.

Chargeback

This tab allows you to generate chargeback reports for your clusters' usage costs for Yarn and Impala jobs. The multi-cluster feature is supported for chargeback reports; however, you can view the report for only a single cluster at a time.

If there are no Yarn or Impala jobs running on the selected cluster in the specified time period, chargeback reports are not shown.

Generate chargeback report
  1. From the Chargeback drop-down on the upper left corner, select either Yarn or Impala.

    chargeback_imp_yarn.png
  2. Use the Cluster and date picker pull-down menus to select a specific cluster or change the date range.

  3. Click in the Group By box and select an option. You can select a maximum of two Group By options at a time. If you have selected more than one option, click Deselect-X.png next to an option to deselect it.

    The chargeback report is generated. If you have selected two Group By options, the combined results are displayed in the donut charts (Jobs, CPU hours, Memory hours) and also in the table below the donut charts. Refer to Drilling down the Chargeback results for more details.

    Example:

    In the following image, the report is grouped by two tags, dept and project. (See What is tagging, if you are unfamiliar with the concept.)

    chargeback-main.png

    There are times when a job cannot be grouped by the selected option. In such cases, NULL is listed in the Group By option column. In this example, there are jobs that have neither a dept nor a project tag.

    Hover over a donut section to see the slice name, the value, and percentage of the whole. Click Download CSV to download the chargeback report.

    The list of all the Yarn jobs/Impala jobs is provided in the tables as shown. You can view 15 records at a time and download the list in a CSV format.

    chargeback-yarn-imp-jobs1.png
    chargeback-yarn-imp-jobs.png
Drilling down the Chargeback results

The chargeback results are displayed in the table as shown in the following image:

chargeback-results-table.png

Click any row to drill down to the next Group By option.

If you click the row again, you can drill down further until all the associated Group By options are exhausted, down to the Application Type option for Yarn jobs and the User option for Impala jobs.

chargeback-result-1.png
chargeback-result-2.png
chargeback-result-3.png
chargeback-result-4.png

Use the navigation path corresponding to the Group By options to move back to a previous option.

Estimating chargeback cost
  1. From the Chargeback drop-down on the upper right corner, select one of the following options:

    • Yarn

    • Impala

  2. Use the Cluster and date picker pull-down menus to select a specific cluster or change the date range.

  3. Click in the Group By box and select an option. You can select a maximum of two Group By options at a time. The chargeback report is generated.

  4. Enter the vCore/Hour or Memory MB/Hour and click Apply. A quick estimation is displayed in the results table, in the CPU hours or Memory hours column.

Top X report

This topic provides information about the Top X applications. The TopX reports are enabled by default.

Generating TopX report
  1. Click the run.png button to generate a new report. The parameters are:

    • History (Date Range): Use the date picker drop-down to specify the date range.

    • Top X: Enter a number for the top Hive, Spark, Impala apps that you want to view.

    • Cluster: Select a cluster from where you want to generate the report.

    • Users: Users who submitted the app.

    • Real User: Users who actually submitted the app. For instance, the user might be hive, but the real user is joan@mycompany.com.

    • Queue: Select queues.

    • Tags: Select tags.

  2. Click Run to generate the report.

    The progress of the report generation is shown on the top of the page and you are notified about the successful creation of the report. The latest successful report is shown in the Clusters > Top X page.

    All reports (successful or failed attempts) are in the Reports Archive.

Scheduling TopX report
  1. Click schedule.png to generate the report regularly and provide the following details:

    • History (Date Range): Use the date picker drop-down to specify the date range.

    • Top X: Enter a number for the top Hive, Spark, Impala apps that you want to view.

    • Cluster: Select a cluster from where you want to generate the report.

    • Users: Users who submitted the app.

    • Real User: Users who actually submitted the app. For instance, the user might be hive, but the real user is joan@mycompany.com.

    • Queue: Select queues.

    • Tags: Select tags.

    • Schedule to Run: Select any of the following schedule options from the drop-down and set the time from the hours and minutes drop-down:

      • Daily

      • Select a day in the week. (Sun, Mon, Tue, Wed, Thu, Fri, Sat)

      • Every two weeks

      • Every month

    • Notification: Provide either a single or multiple email IDs to receive the notification of the reports generated.

  2. Click Schedule.

Viewing TopX report

The following sections are included in the TopX reports which can be grouped by Hive, Spark, or Impala.

Input Parameters
topx-input-parameters.png

This section provides the basic information about the report, that is the date range selected, the number of top results included in the TopX report, and the cluster from where the report is generated.

Applications
topx-applications.png

The Applications section shows the following app metrics for the generated TopX report:

  • Total Apps: Number of apps on the cluster.

  • Successful Apps: Number of apps that are completed successfully.

  • Failed Apps: Number of apps that have failed to complete.

  • Others: Number of apps that were killed, or those that are in a pending, running, waiting, or unknown state.

Further, the detailed TopX report for applications are generated which are grouped based on the following categories:

  • Longest Duration: The Top N number of applications that ran for the longest duration.

  • Highest Disk I/O: The Top N number of applications that have the highest disk input/output operations.

  • Highest Cluster Usage: The Top N number of applications that have the highest cluster usage.

  • Highest CPU Usage: The Top N number of applications that have the highest CPU usage.

  • Highest Memory Usage: The Top N number of applications that have the highest memory usage.

Click any of the tabs, to view further TopX details of the selected app.

Resources

Resources breakdown the Map/reduce time for Hive apps. The information about total mappers, successful mappers, total reducers, and successful reducers is shown.

topx-resources.png
Queues

The Queues section shows the app count for the selected queue plus the other criteria, which were selected for generating the report.

topx-queues.png
Data

The Data tile displays the cumulative total by Read and Write I/O

topx-data.png

Tuning

This is an on-demand report that analyzes your cluster workload over a specified period. It provides insights and configuration recommendations to optimize throughput, resources, and performance. Currently, this feature only supports Hive on MapReduce.

You can use these reports to:

  • Fine-tune your cluster to maximize its performance and minimize your costs.

  • Compare your cluster's performance between two time periods.

Reports are generated on an ad hoc or scheduled basis. All reports are archived and can be accessed via the Reports Archive tab. The tab opens displaying the last generated report, if any.

Generating Tuning report
  1. Click the run.png button to generate a new report. The parameters are:

    • Date Range: Select a period from the date picker.

    • Cluster: In a multi-cluster setup, you can select the cluster from where you want to generate the report.

  2. Click Run to generate the report.

    The progress of the report generation is shown on the top of the page and you are notified about the successful creation of the report.

    clusters-tuning-report-in_progress.png
    cluster-tuning-report-success.png

    All reports (successful or failed attempts) are in the Reports Archive.

Scheduling Tuning report
  1. Click Schedule to generate the report regularly and provide the following details:

    clusters-tuning-schedule-report.png
    • Date Range: Select a period from the date picker.

    • Cluster: In a multi-cluster setup, you can select the cluster from where you want to generate the report.

    • Schedule Name: Name of the schedule.

    • Schedule to Run: Select any of the following schedule options from the drop-down and set the time from the hours and minutes drop-down:

      • Daily

      • A day of the week (Sun-Sat)

      • Every two weeks

      • Every month

    • Notification: Provide an email ID to receive the notification of the reports generated.

  2. Click Schedule.

Tuning report
clusters-tuning-tuning-report-mr.png
clusters-tuning-tuning-report-hive.png

The report has three sections.

  • Header: This section contains the following basic information about the report:

    • Author: Unravel user who generated the report.

    • Report Id: Unique identifier of the report.

    • Cluster Id: Unique identifier of the cluster from where the report is generated.

    • Cluster Name: Name of the cluster from where the report is generated.

    • Time stamp: The date and time when the report was generated.

    • Start Date: Start date of the selected time range for the report.

    • End Date: End date of the selected time range for the report.

    • Number of Days: Length of the selected time range for the report, in days.

  • KPIs: This section provides the following KPI information. The KPIs are calculated as a per-day average of the workload in the selected time range, as illustrated after this list.

    • Number of Jobs

    • Number of vCore Hours

    • Number of MapReduce Containers

    • % containers for Map

    • % containers for Reduce

    • Amount of memory (in MB) from MapReduce containers

    • % of memory from Map containers

    • % of memory from Reduce containers
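
    For example (hypothetical numbers), if 7,000 jobs ran during a 7-day report range, the Number of Jobs KPI shows 7,000 / 7 = 1,000, the per-day average.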

  • Insights/Recommendations

    This section provides tuning instructions, recommendations, and insights. Click cluster-tuning-instructions-icon.png to view the tuning instructions. Click cluster-tuning-properties-icon.png to view the related properties. Click Deselect-X.png to close any information box. Click expand.png to expand or collapse any of the sub-sections.

Top X

Top X is an on-demand report that presents the details of the Top N applications. From this page, you can either generate TopX reports instantly or schedule them. The same TopX reports can also be generated using customized configurations for multiple users. This is known as the User Report, a scheduled report that can be mailed to multiple recipients.

Top X is generated for Hive, Spark, and Impala apps in the following categories:

  • Longest Duration: Total duration of the app.

  • Highest Disk I/O: Total DFS bytes read and written.

  • Highest Cluster Usage: Map/reduce time for Hive apps or Slot duration for Spark apps or Total processing time for Impala apps.

  • Highest CPU Usage: vCore seconds (Hive on Tez not supported)

  • Highest Memory Usage: Memory seconds (Hive on Tez not supported)

Top X report

This topic provides information about the Top X applications. The TopX reports are enabled by default.

Generating TopX report
  1. Click the run.png button to generate a new report. The parameters are:

    • History (Date Range): Use the date picker drop-down to specify the date range.

    • Top X: Enter a number for the top Hive, Spark, Impala apps that you want to view.

    • Cluster: Select a cluster from where you want to generate the report.

    • Users: Users who submitted the app.

    • Real User: Users who actually submitted the app. For instance, the user might be hive, but the real user is joan@mycompany.com.

    • Queue: Select queues.

    • Tags: Select tags.

  2. Click Run to generate the report.

    The progress of the report generation is shown on the top of the page and you are notified about the successful creation of the report. The latest successful report is shown in the Clusters > Top X page.

    All reports (successful or failed attempts) are in the Reports Archive.

Scheduling TopX report
  1. Click schedule.png to generate the report regularly and provide the following details:

    • History (Date Range): Use the date picker drop-down to specify the date range.

    • Top X: Enter a number for the top Hive, Spark, Impala apps that you want to view.

    • Cluster: Select a cluster from where you want to generate the report.

    • Users: Users who submitted the app.

    • Real User: Users who actually submitted the app. For instance, the user might be hive, but the real user is joan@mycompany.com.

    • Queue: Select queues.

    • Tags: Select tags.

    • Schedule to Run: Select any of the following schedule options from the drop-down and set the time from the hours and minutes drop-down:

      • Daily

      • Select a day of the week (Sun, Mon, Tue, Wed, Thu, Fri, Sat)

      • Every two weeks

      • Every month

    • Notification: Provide either a single or multiple email IDs to receive the notification of the reports generated.

  2. Click Schedule.

Viewing TopX report

The following sections are included in the TopX reports, which can be grouped by Hive, Spark, or Impala.

Input Parameters
topx-input-parameters.png

This section provides the basic information about the report: the date range selected, the number of top results included in the TopX report, and the cluster from where the report is generated.

Applications
topx-applications.png

The Applications section shows the following app metrics for the generated TopX report:

  • Total Apps: Number of apps on the cluster.

  • Successful Apps: Number of apps that are completed successfully.

  • Failed Apps: Number of apps that have failed to complete.

  • Others: Number of apps that were killed, or those that are in a pending, running, waiting, or unknown state.

Further, detailed TopX reports are generated for applications, grouped into the following categories:

  • Longest Duration: The Top N number of applications that ran for the longest duration.

  • Highest Disk I/O: The Top N number of applications that have the highest disk input/output operations.

  • Highest Cluster Usage: The Top N number of applications that have the highest cluster usage.

  • Highest CPU Usage: The Top N number of applications that have the highest CPU usage.

  • Highest Memory Usage: The Top N number of applications that have the highest memory usage.

Click any of the tabs to view further TopX details of the selected app.

Resources

The Resources section breaks down the Map/Reduce time for Hive apps. The information about total mappers, successful mappers, total reducers, and successful reducers is shown.

topx-resources.png
Queues

The Queues section shows the app count for the selected queue, along with the other criteria that were selected for generating the report.

topx-queues.png
Data

The Data tile displays the cumulative total by Read and Write I/O.

topx-data.png
User report

User report is a scheduled Top X report that can be generated using customized configurations for multiple users. This report is scheduled on a regular basis, for example, every Thursday or every two weeks, and is designed for a variety of end users, such as:

  • App developers who can use it to determine what apps need attention to improve duration, resource usage, etc.

  • Team leaders or managers who can use it to track how the team uses resources and identify users who overuse resources (rogue users).

It provides a concise and clear view of:

  • How apps are performing on the platform.

  • The resources the apps are using.

  • The number of recommendations or insights Unravel has for the app.

You can schedule a report and either add the configurations manually or import the configurations via a CSV file. The report is sent to the specified email ID.

Scheduling User report
  1. Go to Clusters > TopX.

  2. Click the userreport-schedule.png button, on the upper right side of the page, to schedule a new user report. The following page is displayed:

    userreport-schedule1.png
  3. Enter the following details:

    • Schedule Name: Name of the schedule.

    • Schedule to Run: Select any of the following schedule options from the drop-down and set the time from the hours and minutes drop-down:

      • Daily

      • A day of the week (Sun, Mon, Tue, Wed, Thu, Fri, Sat)

      • Every two weeks

      • Every month

  4. Click userreport-add-config.png to add the following parameters to manually create a user report.

    • History (Date Range): Use the date picker drop-down to specify the date range.

    • Top X: Enter a number for the top Hive, Spark, Impala apps that you want to view.

    • Cluster: Select a cluster from where you want to generate the report.

    • Users: Users who submitted the app.

    • Real User: Users who actually submitted the app. For instance, the user might be hive, but the real user is joan@mycompany.com.

    • Queue: Select queues.

    • Tags: Select tags.

    • Email: Provide either a single or multiple email IDs to which the generated reports are sent.

  5. Click Save schedule.

Importing configurations to generate User report
  1. Go to Clusters > TopX.

  2. Click the userreport-schedule.png button to schedule a new user report. The following page is displayed:

    userreport-schedule1.png
  3. Enter the following details:

    • Schedule Name: Name of the schedule.

    • Schedule to Run: Select any of the following schedule options from the drop-down and set the time from the hours and minutes drop-down:

      • Daily

      • A day of the week (Sun-Sat)

      • Every two weeks

      • Every month

  4. Click userreport-import-conf.png to import the configurations from a CSV file to generate the User report.

    Import Configurations lets you upload a CSV file containing configurations for multiple users.

    Here is the format to create the CSV file for the Top X User report.

    Col header1,Col header2,Col header3, ... // up to the number of possible options
    entry1, entry2, entry3, ...              // up to the number of columns defined
    

    Following is a sample of the configuration created in CSV.

    "email","topx","users","realusers","queues"
    xyz@unraveldata.com,56,[hdfs],hive,default
    abc@unraveldata.com,8,[hdfs],hive,default

    You can create the CSV based on the headers shown here. You need not include (define) all the column headers or order them in the same sequence. However, whatever columns/tags you include must be defined for all entries, in the order you chose. You must define an email column. In this image, there are eleven possible columns that can be included in the CSV.

    4530-Rep-OpIn-TopXU-CSVHdr.png

    Any column/tag that is not added to the CSV, or is left undefined for a configuration, is unfiltered. For instance, if Users are undefined, the report is generated across all users.

    CSV Format:

    In the following example, the CSV file defines only five columns: Email (required), Top X, Users, RealUsers, and Dept. Therefore, the reports are not filtered by Queue, Project, DBS, Inputtables, or Outputtables.

    topx,email,Users,realUsers,dept
    
    6,a@unraveldata.com,hdfs,hdfs,eng  
    #all columns are defined, entry is accepted
    
    10,b@unraveldata.com,root,,eng 
    #realUsers is defined as empty, entry is accepted
    
    7,b@unraveldata.com,root,,eng          
    #duplicate email, the entry is rejected 
    
    7,z@unraveldata.com,,,test,root,fin    
    #two extra columns were added, the entry is rejected
    
    7,y@unraveldata.com,root,,eng         
    #realUsers is defined as empty, entry is accepted

    The UI lists the results of the CSV import. In this case, only lines 1, 2, and 5 were loaded; see the comments above for the explanation. Once you have loaded the configurations, you can click Edit.png to edit a configuration manually or Trash.png to delete it.

  5. Click Save schedule.

Workload

This report presents the workload of the YARN apps in the selected cluster, for the specified date range, in one of the following views:

  • Month - by date, for example, October 10.

  • Hour - by hour, regardless of date, for example, 10.00 - 11.00.

  • Day - by weekday, regardless of date, for example, Tuesday.

  • Hour/Day - by hour for a given weekday, for example, 10.00 - 11.00 on Tuesday.

You can filter each view by Jobs, vCores Hour, and Memory Hour.

See Drilling Down in a Workload view for information on how to retrieve the detailed information within each view.

Note

Measuring vCore or Memory Hour usage is straightforward; at any given point, the memory or vCore is either being used or not.

The App Count isn't a count of unique app instances, because apps can span boundaries, i.e., begin and end in different hours/days.

Jobs reflect the apps that were running within an interval, up to and including its boundary, i.e., date, hour, or day. Therefore, an app can be counted in multiple time slices:

  • On multiple dates, for example, October 11 and 12.

  • In multiple hours, for example, 10 PM, 11 PM, and 12 AM.

  • On multiple days, for example, Thursday and Friday.

  • In multiple hour/day slots.
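
For example, an app that starts at 10:50 PM on a Thursday and finishes at 12:10 AM on the following Friday is counted for both Thursday and Friday in the Day view, and in the 10 PM, 11 PM, and 12 AM slots in the Hour view, even though it is a single app.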

This results in anomalies where Sum(24 hours in Hour/Day App Count) > Sum(App Counts for the dates representing the day). For instance, in the example below:

  • App Count for Wednesdays (October 10, 17, and 24) = 2492, and

  • App Count across Hour/Day intervals for Wednesday = 2526.

This is pointed out only to inform you about the existence of such variations.

The tab opens in the Month view filtered on App Count for the past 24 hours.

  1. Go to the Clusters > Workload tab.

  2. From the Cluster drop-down, select a cluster.

  3. Select the period range from the date picker drop-down. You can also provide a custom period range. It is recommended to use a short range, because the longer the range, the more processing time is consumed.

    Note

    The maximum date range that you can select is 60 days. It can vary in the Day, Hour, and Hour/Day views based on the cluster load.

  4. From the Cluster Workload drop-down on the upper left corner, select one of Jobs, vCore Hours, or Memory Hours to change the display metric. The metric you select is used for all subsequent views until changed.

    cluster-workload-drop-down.png
  5. From the View By drop-down, select one of the following options:

    • Month

    • Day

    • Hour

    • Hour/Day

    You can toggle between these View By options. When the date range is greater than one day, the Hour, Day, and Hour/Day views allow you to display the data as either an Average or a Sum.

    clusters-workload-sum-average.png
Month view
workload-monthview.png

Displays the monthly view for the specified period range. The color indicates how the day's workload compares with the other days within the selected date range. The day with the least jobs/hours is Wrkld-1.png, while the days with the highest load are Wrkld-5.png. The color legend is provided on the right side of the view for reference. Use Previous and Next in the month's title bar to navigate between months.

Hour, Day, and Hour/Day views

These graphs do not link jobs to any specific date at the graph level. For instance, the Hour graph shows that 856 jobs ran at 2 AM (between 2 AM and 3 AM); the Day graph shows that 2,492 jobs ran on Wednesday; and the Hour/Day graph shows that 68 jobs ran at 2 AM on a Wednesday. But none of these graphs directly indicate the date these jobs ran on. Only the Month view visually links job counts to a specific date.

Each view opens using the metric selected for the prior view. For instance, if vCores Hour is used to display the Month view and you switch to Day, it is filtered using vCores Hour.

When the date range spans multiple days, you can display the data as either the:

  • Sum - aggregated sum of job count, vCore, or memory hour during the time range (default view).

  • Average - Sum / (# of Days in Date Range).
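
For example (hypothetical numbers), if a 7-day date range accumulates 700 jobs in the 2 AM - 3 AM slot, the Sum view shows 700 for that slot, while the Average view shows 700 / 7 = 100 jobs per day.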

Day view

Displays the jobs run on a specific weekday. Hover over an interval for its details. Click the interval to drill down into it.

workload-day.png
Hour view

Plots the information by hour. The interval label indicates the start, i.e., 2 AM is 2 AM - 3 AM. Hover over an interval for its details. Click the interval to drill down into it.

workload-hour.png
Hour/Day view

This view shows the intersection of the Hour and Day graphs. The Hour graph showed that 856 jobs ran between 2 AM and 3 AM, while the Day graph (immediately above) showed that 2,492 jobs ran on Wednesday. Below we see that 68 of Wednesday's jobs (2.7%) were running between 2 AM and 3 AM.

workload-hour-day.png
Drilling down in a workload view

In the View By charts, click on a date, day, or hour. The Jobs list for that specific date, day, or hour is displayed. The table provides the following details:

clusters-workload-job-list.png
  • Job Type: The type of job that was run on the selected date, day, or hour.

  • Job Count: The number of jobs that were run on the selected date, day, or hour.

  • Group By: Click the link to view the details of the jobs based on any one of the following group by options:

    • User

    • Queue

    clusters-workload-group-by.png
  • Get Jobs View.png: Click this icon to view the list of jobs based on the application types. Refer to Jobs > Applications. If you click this icon in the Group By list corresponding to a user or a queue, the list of jobs run is shown based on the application type and filtered for the selected user or queue.

    clusters-workload-group-by-jobs-by-apps.png

Queue Analysis

From this tab, you can generate a report of active queues for a specified cluster. The report analyzes queue activity by jobs, vCores, and memory. As with all reports, it can be generated instantly or scheduled. The tab opens displaying the last successfully generated report, if any. Reports are archived and can be accessed via the Reports Archive tab.

Configuring the Queue Analysis report

To enable and configure the Queue Analysis report, set the Queue Analysis report properties as follows:

  1. Stop Unravel

    <Unravel installation directory>/unravel/manager stop
    
  2. Set the Queue Analysis report properties as follows:

    <Unravel installation directory>/unravel/manager config properties set <KEY> <VALUES>
    
    ##For example:
    <Unravel installation directory>/unravel/manager config properties set com.unraveldata.report.queue.http.retries 2
    <Unravel installation directory>/unravel/manager config properties set com.unraveldata.report.queue.http.timeout.msec 15000
    

    Refer to Queue Analysis properties for the complete list of properties.

  3. Apply the changes.

    <Unravel installation directory>/unravel/manager config apply
    
  4. Start Unravel

    <Unravel installation directory>/unravel/manager start
Generating Queue Analysis report
  1. Click the run.png button to generate a new report. The parameters are:

    • Date Range: Select a period from the date picker.

    • Cluster: In a multi-cluster setup, you can select the cluster from where you want to generate the report.

  2. Click Run to generate the report.

    The progress of the report generation is shown on the top of the page and you are notified about the successful creation of the report.

    All reports (successful or failed attempts) are in the Reports Archive.

Scheduling Queue Analysis report
  1. Click Schedule to generate the report regularly and provide the following details:

    clusters-tuning-schedule-report.png
    • Date Range: Select a period from the date picker.

    • Cluster: In a multi-cluster setup, you can select the cluster from where you want to generate the report.

    • Schedule Name: Name of the schedule.

    • Schedule to Run: Select any of the following schedule options from the drop-down and set the time from the hours and minutes drop-down:

      • Daily

      • A day of the week (Sun, Mon, Tue, Wed, Thu, Fri, Sat)

      • Every two weeks

      • Every month

    • Notification: Provide either a single or multiple email IDs to receive the notification of the reports generated.

  2. Click Schedule.

Queue Analysis report
clusters-qa-report.png

The active queues are listed along with the following details:

  • Queue: Name of the queue.

  • Jobs running: Number of jobs running in the queue. These are average values.

  • vCore Allocated: Number of vCores allocated for the queue. These are average values.

  • Memory Allocated: Memory allocated for the queue. These are average values.

Click downloadcsv.png to download the complete report of the selected queue in CSV or JSON format.

Drilling down in a Queue Analysis report

Click a row in the active queue list; the details of the running jobs, vCore usage, and memory usage are displayed in the following graphs:

cluster-qa-graphs.png
clusters-workload-job-list.png

Click expand.png to expand a graph. Each of the graphs can be filtered further.

cluster-qa-filter.png
  • Jobs: This graph plots the number of jobs running, along with their status, in the specified period. The status is shown as color-coded lines. The statuses, shown below, can be used to filter and change the trends shown in the graph.

    cluster-qa-color-codes.png
  • vCore: This graph plots the usage of the allocated vCore along with the following metrics, for the selected queue, in the specified period. The metrics can be used to further filter the graph:

    • Reserved

    • Fair Share

    • Pending

    • Steady Fair Share

    • Available

    • Allocated

  • Memory Usage: This graph plots the memory usage along with the following metrics, for the selected queue, in the specified period. The metrics can be used to further filter the graph:

    • Reserved

    • Fair Share

    • Pending

    • Steady Fair Share

    • Available

    • Allocated

Click any point in the graph to display the Resources page for that queue, corresponding to the point in time that you clicked.