HBase alerts and metrics
Alerts
Alerts are generated and stored along with metrics. The Unravel UI plots this information as appropriate.
Category  | Alert  | Suggested Action  | 
|---|---|---|
Data availability  | Table offline  | Run   | 
Data availability  | Region offline  | Run   | 
Data availability  | Region in transition beyond threshold period  | If a region server is dead, this is common. If not, run   | 
Server availability  | Dead region servers  | Check region server logs for more information.  | 
Performance  | Region servers with reads > 20% of average  | Region server hotspotting. Split regions or randomize the keys.  | 
Performance  | Region servers with writes > 20% of average  | Region server hotspotting. Split regions or randomize the keys.  | 
Performance  | Regions within a table with reads > 20% of average for that table  | Table hotspotting. Split regions or randomize the keys.  | 
Performance  | Regions within a table with writes > 20% of average for that table  | Table hotspotting. Split regions or randomize the keys.  | 
Performance  | Regions within a region server with reads > 20% of average for that table  | Region server hotspotting. Split regions or randomize the keys.  | 
Performance  | Regions within a region server with writes > 20% of average for that table  | Region server hotspotting. Split regions or randomize the keys.  | 
Performance  | Load, osload > 20% of average  | Check for compactions, regions in transition, and server logs.  | 
Performance  | Balancer not running  | Enable the balancer.  | 
Performance  | Number of compactions and length of compaction  | Disable periodic automatic major compactions by setting hbase.hregion.majorcompaction to 0.  | 
Storage  | Region servers with storage (storeFileSize sum) > 20% of average  | Split regions or randomize the keys.  | 
Storage  | Regions within a table with storage (storeFileSize sum) > 20% of average for that table  | Split regions or randomize the keys.  | 
Temporal  | A change over time, for example requests more than 20% higher in the last hour than in the prior 3 hours  | Check master and region server alerts, or environment issues that could be slowing down reads and writes.  | 
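
The threshold alerts in the table above can be illustrated with a minimal sketch. It assumes "> 20% of average" means a value more than 20% above the group mean, and that the temporal example compares the last hour against the mean hourly count of the prior hours. Both readings are assumptions, and the function names `find_hotspots` and `is_temporal_spike` are hypothetical, not part of Unravel or HBase.

```python
from statistics import mean


def find_hotspots(values, threshold=0.20):
    """Return the names whose value exceeds the group mean by more than `threshold`.

    `values` maps a name (region server or region) to a metric such as
    readRequestCount, writeRequestCount, or storeFileSize.
    Assumption: "> 20% of average" is read as "more than 20% above the mean".
    """
    if not values:
        return []
    limit = mean(values.values()) * (1 + threshold)
    return sorted(name for name, value in values.items() if value > limit)


def is_temporal_spike(last_hour, prior_hours, threshold=0.20):
    """Return True if `last_hour` exceeds the mean of `prior_hours` by more than `threshold`.

    Assumption: the baseline is the mean hourly request count over the prior window.
    """
    if not prior_hours:
        return False
    return last_hour > mean(prior_hours) * (1 + threshold)


# Hypothetical read counts per region server.
reads = {"rs1": 1000, "rs2": 1100, "rs3": 2000}
hot = find_hotspots(reads)
spike = is_temporal_spike(1300, [1000, 1000, 1000])
```

With the sample values, only `rs3` lies more than 20% above the mean of the three servers, and 1300 requests in the last hour exceeds the 1000 per hour baseline by more than 20%.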
Metrics
Metric  | Description  | Unit  | 
|---|---|---|
averageLoad  | Average number of regions per region server.  | count  | 
clusterRequests  | Number of read and write requests across Cluster.  | count  | 
masterActiveTime  | Master Active Time  | epoch in milliseconds  | 
masterStartTime  | Master Start Time  | epoch in milliseconds  | 
numDeadRegionServers  | Number of dead Region Servers.  | count  | 
numRegionServers  | Number of live Region Servers.  | count  | 
ritCount  | The number of regions in transition.  | count  | 
ritCountOverThreshold  | The number of regions that have been in transition longer than a threshold time.  | count  | 
ritOldestAge  | The age of the oldest region in transition.  | milliseconds  | 
OS Metrics  | Description  | Unit  | 
|---|---|---|
jvm_*  | JVM metrics  | number  | 
rpc_*  | RPC metrics  | number  | 
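
As a rough illustration of where the master metrics above come from, the sketch below pulls them out of a JSON document shaped like the output of the HBase Master `/jmx` servlet. How Unravel itself collects these values is not described here. The bean names are assumptions that vary between HBase versions, so check them against your own cluster, and `extract_master_metrics` is a hypothetical helper.

```python
import json

# Assumed bean names; some HBase versions spell these differently.
MASTER_SERVER_BEAN = "Hadoop:service=HBase,name=Master,sub=Server"
ASSIGNMENT_BEAN = "Hadoop:service=HBase,name=Master,sub=AssignmentManager"

WANTED = {
    MASTER_SERVER_BEAN: [
        "averageLoad", "clusterRequests", "masterActiveTime",
        "masterStartTime", "numDeadRegionServers", "numRegionServers",
    ],
    ASSIGNMENT_BEAN: ["ritCount", "ritCountOverThreshold", "ritOldestAge"],
}


def extract_master_metrics(jmx_payload):
    """Collect the master metrics listed above from a parsed /jmx payload.

    Metrics missing from the payload are skipped rather than defaulted.
    """
    beans = {bean.get("name"): bean for bean in jmx_payload.get("beans", [])}
    metrics = {}
    for bean_name, keys in WANTED.items():
        bean = beans.get(bean_name, {})
        for key in keys:
            if key in bean:
                metrics[key] = bean[key]
    return metrics


# Hand-written sample payload, not captured from a real cluster.
sample = json.loads("""
{"beans": [
  {"name": "Hadoop:service=HBase,name=Master,sub=Server",
   "averageLoad": 42.5, "numRegionServers": 4, "numDeadRegionServers": 1},
  {"name": "Hadoop:service=HBase,name=Master,sub=AssignmentManager",
   "ritCount": 2, "ritCountOverThreshold": 0, "ritOldestAge": 1500}
]}
""")
master_metrics = extract_master_metrics(sample)
```

In practice the payload would be fetched from the master's info port and passed to the same function; the parsing is kept separate so it can be checked without a running cluster.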
Region server metrics
JMX Metrics  | Description  | Unit  | 
|---|---|---|
compactionQueueLength  | Current depth of the compaction request queue. If increasing, we are falling behind with storefile compaction.  | count  | 
hlogFileSize  | Size of all WAL Files.  | bytes  | 
percentFilesLocal  | Percent of store file data that can be read from the local DataNode, 0-100.  | percentage  | 
readRequestCount  | The number of read requests received.  | count  | 
regionCount  | The number of regions hosted by the regionserver.  | count  | 
slowOPCount  | The number of operations considered slow, where OP is one of delete, get, put, increment, or append.  | count  | 
storeFileSize  | Aggregate size of the store files on disk.  | bytes  | 
writeRequestCount  | The number of write requests received.  | count  | 
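
The note on compactionQueueLength ("if increasing, we are falling behind") can be turned into a simple check over successive samples. This is a minimal sketch: treating a strictly increasing run of recent samples as "falling behind" is an assumption, and `is_queue_growing` and its `window` parameter are hypothetical names.

```python
def is_queue_growing(samples, window=3):
    """Return True if the last `window` samples are strictly increasing.

    `samples` is a list of compactionQueueLength readings, oldest first.
    Fewer than `window` readings (or a window below 2) is not enough
    evidence, so the function returns False in those cases.
    """
    if window < 2 or len(samples) < window:
        return False
    recent = samples[-window:]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Hypothetical readings taken at a fixed polling interval.
growing = is_queue_growing([2, 5, 9, 14])
steady = is_queue_growing([7, 5, 6, 6])
```

A strict comparison keeps a flat queue from being flagged; a larger `window` trades faster detection for fewer false alarms during short compaction bursts.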
OS Metrics  | Description  | Unit  | 
|---|---|---|
cpu_user  | Percentage of CPU time spent in user mode.  | percentage  | 
disk.disk_free  | Amount of free disk space.  | bytes  | 
disk.write_bps  | Number of bytes written per second to disk.  | bytes per second  | 
disk.read_bps  | Number of bytes read per second from disk.  | bytes per second  | 
load.load_one  | One-minute load average.  | number  | 
memory.mem_free  | Percentage of free memory.  | percentage  | 
network.bytes_in  | Total number of incoming bytes on the network.  | bytes  | 
network.bytes_out  | Total number of outgoing bytes on the network.  | bytes  | 
Table/Region Metrics
Table and Region Metrics  | Description  | Unit  | 
|---|---|---|
tableSize  | Total table size in the region server.  | bytes  | 
regionCount  | Number of regions.  | count  | 
averageRegionSize (Table only)  | Average region size over the region server including memstore and storefile sizes.  | bytes  | 
storeFileSize  | Size of storefiles being served.  | bytes  | 
readRequestCount  | Number of read requests this region server has answered.  | count  | 
writeRequestCount  | Number of mutation requests this region server has answered.  | count  |