HBase alerts and metrics

Alerts

Alerts are generated and stored along with metrics. The Unravel UI plots this information as appropriate.

Category

Alert

Suggested Action

Data availability

Table offline

Run hbase hbck to check whether your HBase cluster has corruptions, and use the -repair flag if required. Check the master logs for more information.

Region offline

Run hbase hbck to check whether your HBase cluster has corruptions, and use the -repair flag if required. Check the master logs for more information.

Region in transition beyond threshold period.

This is common when a region server is dead. Otherwise, run hbase hbck to check whether your HBase cluster has corruptions.

Server availability

Dead region servers

Check region server logs for more information.

Performance

Region servers with reads > 20% of average

Region server hotspotting. Split regions or randomize the keys.

Region servers with writes > 20% of average

Region server hotspotting. Split regions or randomize the keys.

Regions within a table with reads > 20% of average for that table

Table hotspotting - Split regions or randomize the keys.

Regions within a table with writes > 20% of average for that table

Table hotspotting - Split regions or randomize the keys.

Regions within a regionserver with reads > 20% of average for that table

Region server hotspotting - Split regions or randomize the keys.

Regions within a regionserver with writes > 20% of average for that table

Region server hotspotting - Split regions or randomize the keys.

Load, osload > 20% of average

Check for compactions, regions in transition and server logs.

Balancer not running

Enable Balancer.

Number of compactions and length of compaction

Disable periodic automatic major compactions by setting hbase.hregion.majorcompaction to 0.

Storage

Region servers with storage (storeFileSize sum) > 20% of average

Split or randomize the keys.

Regions within a table with storage (storeFileSize sum) > 20% of average for that table

Split or randomize the keys.

Temporal

For example, requests > 20% higher in the last hour than in the prior 3 hours.

Check master and region server alerts, or environment issues that could be slowing down reads and writes.
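The threshold-style alerts above can be sketched in a few lines. This is only an illustration, not Unravel's implementation: it assumes "> 20% of average" means "more than 20% above the group average", and that the temporal check compares the last hour against the mean of the prior hourly totals. The function names and sample values are hypothetical.

```python
from statistics import mean


def find_hotspots(values, threshold=0.20):
    """Return the keys whose value exceeds the group average by more than `threshold`.

    `values` maps a region server (or region) name to a metric such as
    readRequestCount, writeRequestCount, or storeFileSize.
    """
    if not values:
        return []
    limit = mean(values.values()) * (1 + threshold)
    return sorted(name for name, value in values.items() if value > limit)


def is_temporal_spike(last_hour, prior_hours, threshold=0.20):
    """Return True when the last hour is more than `threshold` above the prior hourly mean."""
    if not prior_hours:
        return False
    return last_hour > mean(prior_hours) * (1 + threshold)


# Hypothetical read counts per region server.
reads = {"rs1": 100, "rs2": 110, "rs3": 250}
print(find_hotspots(reads))
print(is_temporal_spike(130, [100, 100, 100]))
```

The same `find_hotspots` helper applies to regions within a table or within a region server; only the grouping of the input changes.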

Metrics
Master/Cluster & JMX metrics

Metric

Description

Unit

averageLoad

Average number of Regions per Region Server.

count

clusterRequests

Number of read and write requests across the cluster.

count

masterActiveTime

Time at which the master became active.

epoch in milliseconds

masterStartTime

Time at which the master started.

epoch in milliseconds

numDeadRegionServers

Number of dead Region Servers.

count

numRegionServers

Number of live Region Servers.

count

ritCount

The number of regions in transition.

count

ritCountOverThreshold

The number of regions that have been in transition longer than a threshold time.

count

ritOldestAge

The age of the oldest region in transition.

milliseconds
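As a rough illustration of how the master metrics in this table might be read, the sketch below pulls them out of a JSON document shaped like the output of a Hadoop-style /jmx endpoint (a top-level "beans" list of attribute maps). The payload shape, the bean names in the sample, and the helper name are assumptions made for the example; this page does not describe how Unravel actually collects these values.

```python
import json

# Metric names taken from the table above.
MASTER_METRICS = (
    "averageLoad",
    "clusterRequests",
    "masterActiveTime",
    "masterStartTime",
    "numDeadRegionServers",
    "numRegionServers",
    "ritCount",
    "ritCountOverThreshold",
    "ritOldestAge",
)


def extract_master_metrics(jmx_json, wanted=MASTER_METRICS):
    """Collect the wanted attributes from every bean in a /jmx-style JSON string."""
    beans = json.loads(jmx_json).get("beans", [])
    found = {}
    for bean in beans:
        for name in wanted:
            if name in bean:
                found[name] = bean[name]
    return found


# Hypothetical sample payload; real bean names and values may differ.
sample = json.dumps({
    "beans": [
        {"name": "Hadoop:service=HBase,name=Master,sub=Server",
         "averageLoad": 12.5, "numRegionServers": 4, "numDeadRegionServers": 1},
        {"name": "Hadoop:service=HBase,name=Master,sub=AssignmentManager",
         "ritCount": 2, "ritCountOverThreshold": 0, "ritOldestAge": 1500},
    ]
})

metrics = extract_master_metrics(sample)
print(metrics)
```

Scanning every bean for the metric names, rather than hard-coding one bean per metric, keeps the sketch independent of how a given HBase version groups its attributes.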

OS Metrics (Ambari Only)

OS Metrics

Description

Unit

jvm_*

JVM metrics

number

rpc_*

RPC metrics

number

Region server metrics
JMX metrics

JMX Metrics

Description

Unit

compactionQueueLength

Current depth of the compaction request queue. If this value keeps increasing, the region server is falling behind with storefile compaction.

count

hlogFileSize

Size of all WAL Files.

bytes

percentFilesLocal

Percent of store file data that can be read from the local DataNode, 0-100.

percentage

readRequestCount

The number of read requests received.

count

regionCount

The number of regions hosted by the regionserver.

count

slowOPCount

The number of operations flagged as slow, where OP is one of delete, get, put, increment, or append.

count

storeFileSize

Aggregate size of the store files on disk.

bytes

writeRequestCount

The number of write requests received.

count
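The note on compactionQueueLength ("if increasing, we are falling behind") suggests a simple trend check over successive samples. The following is a minimal sketch under the assumption that the queue depth is sampled at regular intervals; the function name and the minimum of three samples are arbitrary choices for illustration.

```python
def is_queue_growing(samples, min_points=3):
    """Return True when the queue depth never drops and ends higher than it started.

    `samples` is a chronological list of compactionQueueLength readings.
    """
    if len(samples) < min_points:
        return False
    non_decreasing = all(later >= earlier for earlier, later in zip(samples, samples[1:]))
    return non_decreasing and samples[-1] > samples[0]


# Hypothetical readings from one region server.
print(is_queue_growing([2, 4, 4, 9]))
print(is_queue_growing([5, 3, 6]))
```

A stricter or smoother test (for example, a slope over a sliding window) may suit noisy clusters better; the strict non-decreasing rule here is just the simplest reading of "increasing".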

OS Metrics (Ambari Only)

OS Metrics

Description

Unit

cpu_user

Percentage of CPU time spent in user mode.

percentage

disk.disk_free

Amount of free disk space.

bytes

disk.write_bps

Number of bytes written per second to disk.

bytes per second

disk.read_bps

Number of bytes read per second from disk.

bytes per second

load.load_one

One-minute load average.

number

memory.mem_free

Percentage of free memory.

percentage

network.bytes_in

Total number of incoming bytes on the network.

bytes

network.bytes_out

Total number of outgoing bytes on the network.

bytes

Table/Region Metrics

Table and Region Metrics

Description

Unit

tableSize

Total table size in the region server.

bytes

regionCount

Number of regions.

count

averageRegionSize (Table only)

Average region size over the region server including memstore and storefile sizes.

bytes

storeFileSize

Size of storefiles being served.

bytes

readRequestCount

Number of read requests this region server has answered.

count

writeRequestCount

Number of mutation requests this region server has answered.

count