HBase alerts and metrics

Alerts

Alerts are generated and stored along with metrics, and the Unravel UI plots this information as appropriate.

Category	Alert	Suggested Action
Data availability	Table offline	Run `hbase hbck` to check your HBase cluster for corruption, and use the `-repair` flag if required. Check the master logs for more information.
	Region offline	Run `hbase hbck` to check your HBase cluster for corruption, and use the `-repair` flag if required. Check the master logs for more information.
	Region in transition beyond threshold period	This is common when a region server is dead. Otherwise, run `hbase hbck` to check your HBase cluster for corruption.
Server availability	Dead region servers	Check region server logs for more information.
Performance	Region servers with reads more than 20% above average	Region server hotspotting. Split regions or randomize the row keys.
	Region servers with writes more than 20% above average	Region server hotspotting. Split regions or randomize the row keys.
	Regions within a table with reads more than 20% above the average for that table	Table hotspotting. Split regions or randomize the row keys.
	Regions within a table with writes more than 20% above the average for that table	Table hotspotting. Split regions or randomize the row keys.
	Regions within a region server with reads more than 20% above the average for that table	Region server hotspotting. Split regions or randomize the row keys.
	Regions within a region server with writes more than 20% above the average for that table	Region server hotspotting. Split regions or randomize the row keys.
	Load or OS load more than 20% above average	Check for compactions and regions in transition, and review the server logs.
	Balancer not running	Enable the balancer, for example with `balance_switch true` in the HBase shell.
	Number and duration of compactions	Disable periodic automatic major compactions by setting `hbase.hregion.majorcompaction` to 0.
Storage	Region servers with storage (sum of store file sizes) more than 20% above average	Split regions or randomize the row keys.
	Regions within a table with storage (sum of store file sizes) more than 20% above the average for that table	Split regions or randomize the row keys.
Temporal	Example: requests in the last hour more than 20% higher than in the prior 3 hours	Check master and region server alerts, or environment issues that could be slowing down reads and writes.
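The "more than 20% above average" rules in the table reduce to a comparison against a group mean. The sketch below is illustrative only and rests on assumptions: each rule compares one server's or region's metric with the arithmetic mean of its peer group (the group includes the item being tested), and the temporal rule compares the latest hourly total with the mean of the three preceding hourly totals. The function names `find_hotspots` and `temporal_spike` are hypothetical and are not part of Unravel.

```python
from statistics import mean
from typing import Dict, List


def find_hotspots(values: Dict[str, float], threshold: float = 0.20) -> List[str]:
    """Return names whose value exceeds the group mean by more than `threshold`.

    `values` maps a region server (or region) name to a metric such as
    readRequestCount, writeRequestCount, or storeFileSize.
    Assumption: the mean is taken over the whole group, including each item.
    """
    if not values:
        return []
    limit = mean(values.values()) * (1 + threshold)
    # Sorted so the result is stable and easy to read.
    return sorted(name for name, value in values.items() if value > limit)


def temporal_spike(hourly_requests: List[float], threshold: float = 0.20) -> bool:
    """Compare the last hourly total with the mean of the prior three hours.

    `hourly_requests` is ordered oldest to newest and needs at least 4 entries.
    """
    if len(hourly_requests) < 4:
        raise ValueError("need at least 4 hourly samples")
    prior_mean = mean(hourly_requests[-4:-1])
    return hourly_requests[-1] > prior_mean * (1 + threshold)
```

For example, with read counts of 100, 100, and 200 on three region servers, the mean is about 133, the limit is 160, and only the third server is reported.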

Metrics

Master/Cluster & JMX metrics

Metric	Description	Unit
averageLoad	Average number of regions per region server.	number
clusterRequests	Number of read and write requests across the cluster.	count
masterActiveTime	Time at which the master became active.	epoch in milliseconds
masterStartTime	Time at which the master started.	epoch in milliseconds
numDeadRegionServers	Number of dead region servers.	count
numRegionServers	Number of live region servers.	count
ritCount	Number of regions in transition.	count
ritCountOverThreshold	Number of regions that have been in transition longer than a threshold time.	count
ritOldestAge	Age of the oldest region in transition.	milliseconds
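HBase daemons publish these values as JSON on the `/jmx` endpoint of their web UI. The following minimal sketch reads the master metrics above from a parsed `/jmx` response. It assumes the counters live in the `Hadoop:service=HBase,name=Master,sub=Server` and `Hadoop:service=HBase,name=Master,sub=AssignmentManager` beans; bean names can differ between HBase versions, so check them against your cluster's output. The helper names are hypothetical, and the alert conditions (any dead region server, any region over the transition threshold) are illustrative assumptions, not Unravel's actual logic.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional
from urllib.request import urlopen

# Assumed bean names; check them against the /jmx output of your HBase version.
SERVER_BEAN = "Hadoop:service=HBase,name=Master,sub=Server"
ASSIGNMENT_BEAN = "Hadoop:service=HBase,name=Master,sub=AssignmentManager"

MASTER_KEYS = {
    SERVER_BEAN: [
        "averageLoad", "clusterRequests", "masterActiveTime",
        "masterStartTime", "numDeadRegionServers", "numRegionServers",
    ],
    ASSIGNMENT_BEAN: ["ritCount", "ritCountOverThreshold", "ritOldestAge"],
}


def fetch_jmx(url: str) -> Dict[str, Any]:
    """Download and parse the JSON served by a daemon's /jmx endpoint."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def master_metrics(jmx: Dict[str, Any]) -> Dict[str, Optional[Any]]:
    """Pick the documented master metrics out of a parsed /jmx response."""
    beans = {b.get("name"): b for b in jmx.get("beans", [])}
    metrics: Dict[str, Optional[Any]] = {}
    for bean_name, keys in MASTER_KEYS.items():
        bean = beans.get(bean_name, {})
        for key in keys:
            metrics[key] = bean.get(key)  # None if the bean or key is absent
    return metrics


def epoch_ms_to_utc(ms: int) -> datetime:
    """Convert masterStartTime or masterActiveTime (epoch milliseconds) to UTC."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


def master_alerts(metrics: Dict[str, Optional[Any]]) -> List[str]:
    """Illustrative alert conditions; real thresholds are an assumption."""
    alerts: List[str] = []
    if (metrics.get("numDeadRegionServers") or 0) > 0:
        alerts.append("Dead region servers")
    if (metrics.get("ritCountOverThreshold") or 0) > 0:
        alerts.append("Region in transition beyond threshold period")
    return alerts
```

Against a live cluster you would call something like `master_metrics(fetch_jmx("http://master-host:16010/jmx"))`, where `master-host` is a placeholder and 16010 is the default master web UI port in HBase 1.0 and later.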

OS Metrics (Ambari Only)

OS Metrics	Description	Unit
jvm_*	JVM metrics	number
rpc_*	RPC metrics	number

Region server metrics

JMX metrics

JMX Metrics	Description	Unit
compactionQueueLength	Current depth of the compaction request queue. If it keeps increasing, storefile compaction is falling behind.	count
hlogFileSize	Size of all WAL files.	bytes
percentFilesLocal	Percent of store file data that can be read from the local DataNode, 0-100.	percentage
readRequestCount	Number of read requests received.	count
regionCount	Number of regions hosted by the region server.	count
slowOPCount	Number of operations (delete, get, put, increment, append) considered slow.	count
storeFileSize	Aggregate size of the store files on disk.	bytes
writeRequestCount	The number of write requests received.	count
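Region servers also expose these counters as JSON on the `/jmx` endpoint of their web UI (default port 16030 in HBase 1.0 and later). This sketch assumes the counters sit in the `Hadoop:service=HBase,name=RegionServer,sub=Server` bean and that `slowOPCount` corresponds to the sum of the per-operation counters `slowDeleteCount`, `slowGetCount`, `slowPutCount`, `slowIncrementCount`, and `slowAppendCount`; verify both assumptions for your HBase version. The hypothetical `queue_growing` helper illustrates the "if it keeps increasing" note for `compactionQueueLength` by checking whether the most recent samples strictly increase; the three-sample window is an arbitrary choice.

```python
from typing import Any, Dict, Optional, Sequence

# Assumed bean name; check it against the /jmx output of your HBase version.
RS_SERVER_BEAN = "Hadoop:service=HBase,name=RegionServer,sub=Server"

RS_KEYS = [
    "compactionQueueLength", "hlogFileSize", "percentFilesLocal",
    "readRequestCount", "regionCount", "storeFileSize", "writeRequestCount",
]

# Assumption: slowOPCount is the sum of these per-operation counters.
SLOW_KEYS = [
    "slowDeleteCount", "slowGetCount", "slowPutCount",
    "slowIncrementCount", "slowAppendCount",
]


def region_server_metrics(jmx: Dict[str, Any]) -> Dict[str, Optional[float]]:
    """Extract the documented region server metrics from a parsed /jmx response."""
    bean: Dict[str, Any] = next(
        (b for b in jmx.get("beans", []) if b.get("name") == RS_SERVER_BEAN), {}
    )
    metrics: Dict[str, Optional[float]] = {key: bean.get(key) for key in RS_KEYS}
    # Missing slow-operation counters are treated as zero.
    metrics["slowOPCount"] = sum(bean.get(key, 0) for key in SLOW_KEYS)
    return metrics


def queue_growing(samples: Sequence[int], window: int = 3) -> bool:
    """Return True if the last `window` compactionQueueLength samples strictly increase."""
    if len(samples) < window:
        return False
    recent = list(samples[-window:])
    return all(a < b for a, b in zip(recent, recent[1:]))
```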

OS Metrics (Ambari Only)

OS Metrics	Description	Unit
cpu_user	Percentage of CPU time spent in user space.	percentage
disk.disk_free	Amount of free disk space.	bytes
disk.write_bps	Number of bytes written per second to disk.	bytes per second
disk.read_bps	Number of bytes read per second from disk.	bytes per second
load.load_one	One-minute load average.	number
memory.mem_free	Percentage of free memory.	percentage
network.bytes_in	Total number of bytes received over the network.	bytes
network.bytes_out	Total number of bytes sent over the network.	bytes

Table/Region Metrics

Table and Region Metrics	Description	Unit
tableSize	Total table size in the region server.	bytes
regionCount	Number of regions.	count
averageRegionSize (Table only)	Average region size over the region server including memstore and storefile sizes.	bytes
storeFileSize	Size of storefiles being served.	bytes
readRequestCount	Number of read requests this region server has answered.	count
writeRequestCount	Number of mutation requests this region server has answered.	count
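The `averageRegionSize` description amounts to a simple mean. As an illustration, assuming the metric is the sum of memstore and store file sizes across a table's regions on one region server divided by the number of those regions, it can be computed as below. The `RegionSize` type and `average_region_size` function are hypothetical and only show the arithmetic.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RegionSize:
    """Hypothetical per-region sizes for one table on one region server."""
    memstore_bytes: int
    storefile_bytes: int


def average_region_size(regions: Iterable[RegionSize]) -> float:
    """Mean of (memstore + storefile) bytes per region; 0.0 if there are no regions."""
    regions = list(regions)
    if not regions:
        return 0.0
    total = sum(r.memstore_bytes + r.storefile_bytes for r in regions)
    return total / len(regions)
```

For two regions of 10 + 90 bytes and 0 + 300 bytes, the total is 400 bytes and the average is 200 bytes.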
