AutoActions examples

Note

Limitations on AutoActions can be found here. Information on demos can be found here.

Sample JSON rules

Unless otherwise noted, all JSON rules are entered into the Rule Box in the Expert Mode template.

Alert examples

Alert if Hive query duration > 10 minutes.

{
  "scope": "multi_app",
  "user_metric": "duration",
  "type": "HIVE",
  "state": "RUNNING",
  "compare": ">",
  "value": 600000
}
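The `value` for `duration` in these rules appears to be expressed in milliseconds (600000 for 10 minutes, 1200000 for 20 minutes). The following is a minimal Python sketch for deriving such values under that assumption. The helper names `minutes_to_ms` and `duration_rule` are hypothetical and are not part of AutoActions.

```python
import json

def minutes_to_ms(minutes):
    """Convert minutes to milliseconds (the assumed unit of the duration metric)."""
    return int(minutes * 60 * 1000)

def duration_rule(app_type, minutes):
    """Build a rule dict shaped like the duration examples in this section."""
    return {
        "scope": "multi_app",
        "user_metric": "duration",
        "type": app_type,
        "state": "RUNNING",
        "compare": ">",
        "value": minutes_to_ms(minutes),
    }

# Print a rule that can be pasted into the Rule Box.
print(json.dumps(duration_rule("HIVE", 10), indent=2))
```

Changing the first argument to "TEZ" or "WORKFLOW" produces rules of the same shape for the other application types.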

Alert if Tez query duration > 10 minutes.

{
  "scope": "multi_app",
  "user_metric": "duration",
  "type": "TEZ",
  "state": "RUNNING",
  "compare": ">",
  "value": 600000
}

Alert if any workflow's duration > 20 minutes.

{
  "scope": "multi_app",
  "type": "WORKFLOW",
  "state": "RUNNING",
  "user_metric": "duration",
  "compare": ">",
  "value": 1200000
}

Alert if workflow named “foo” and duration > 10 minutes.

{
  "scope":"by_name",
  "target":"foo",
  "type":"WORKFLOW",
  "state":"RUNNING",
  "user_metric":"duration",
  "compare":">",
  "value":600000
}

Alert if workflow named “foo” and totalDfsBytesRead > 100 MB and duration > 20 minutes.

{
  "AND":[
    {
      "scope":"by_name",
      "target":"foo",
      "type":"WORKFLOW",
      "user_metric":"duration",
      "compare":">",
      "value":1200000
    },
    {
      "scope":"by_name",
      "target":"foo",
      "type":"WORKFLOW",
      "user_metric":"totalDfsBytesRead",
      "compare":">",
      "value":104857600
    }
  ]
}
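Both conditions in the AND rule above repeat the same scope, target, and type. One way to keep those shared fields in sync is to generate the rule. The following Python sketch uses only keys that already appear in the examples above; `by_name_condition` and `all_of` are hypothetical helpers, not part of AutoActions.

```python
import json

MB = 1024 * 1024  # bytes per MB, consistent with 104857600 for 100 MB above

def by_name_condition(target, app_type, user_metric, compare, value):
    """Build one by_name condition with the fields used in this section."""
    return {
        "scope": "by_name",
        "target": target,
        "type": app_type,
        "user_metric": user_metric,
        "compare": compare,
        "value": value,
    }

def all_of(*conditions):
    """Wrap conditions in an AND block, as in the example above."""
    return {"AND": list(conditions)}

rule = all_of(
    by_name_condition("foo", "WORKFLOW", "duration", ">", 20 * 60 * 1000),
    by_name_condition("foo", "WORKFLOW", "totalDfsBytesRead", ">", 100 * MB),
)
print(json.dumps(rule, indent=2))
```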

Alert if Hive query in Queue “foo” and duration > 10 minutes.

{
  "scope": "multi_app",
  "type": "HIVE",
  "state": "RUNNING",
  "user_metric": "duration",
  "compare": ">",
  "value": 600000
}

Then set the global rule condition Queue to only “foo”.

Kill App Example

When the workflow name is “prod_ml_model” and duration > 2 hours, kill jobs with allocated_vcores >= 20 and queue != ‘sla_queue’.

In the Rule Box, enter:

{
  "scope": "by_name",
  "target": "prod_ml_model",
  "type": "WORKFLOW",
  "user_metric": "duration",
  "compare": ">",
  "value": 7200000
}

In the Action Box, enter:

{
  "action": "kill_app",
  "max_vcores": 20,
  "not_in_queues": ["sla_queue"],
  "if_triggered": false
}
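Because the Rule Box and Action Box take raw JSON, it can help to confirm that the text parses before pasting it in. The following is a minimal sketch using Python's standard `json` module. It checks syntax only and assumes nothing about which keys AutoActions accepts; `check_json` is a hypothetical helper.

```python
import json

def check_json(text):
    """Return (True, parsed) if text is valid JSON, else (False, error message)."""
    try:
        return True, json.loads(text)
    except json.JSONDecodeError as err:
        return False, f"line {err.lineno}, column {err.colno}: {err.msg}"

action_text = """
{
  "action": "kill_app",
  "max_vcores": 20,
  "not_in_queues": ["sla_queue"],
  "if_triggered": false
}
"""

ok, result = check_json(action_text)
print(ok, result)

# Typographic quotes, for example copied from a formatted document, are not valid JSON.
bad_ok, bad_result = check_json('{“action”: “kill_app”}')
print(bad_ok, bad_result)
```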

AutoActions rules, predefined templates versus expert mode

AutoActions demo package documentation is here.

Predefined templates cover a variety of jobs, yet they can lack the specificity or complexity you need for monitoring.

For instance, you can use the Rogue Application template to determine whether jobs are using too much memory or too many vCores, for example by alerting on jobs that use more than 1 TB of memory. However, if you want to know only whether MapReduce jobs are using > 1 TB, the template won't suffice. In such cases, you must write your AutoActions using the Expert Mode template, with the rules and some actions written in JSON.

The following are examples of AutoActions written in JSON.

MapReduce

Alert on MapReduce jobs using > 1 TB of memory.

{
  "scope": "multi_app",
  "type": "MAPREDUCE",
  "metric": "allocated_mb",
  "compare": ">",
  "value": 1073741824
}

Alert on MapReduce jobs using > 1000 vCores.

{
  "scope": "multi_app",
  "type": "MAPREDUCE",
  "metric": "allocated_vcores",
  "compare": ">",
  "value": 1000
}

Alert on MapReduce jobs running more than 1 hour.

{
  "scope": "multi_app",
  "type": "MAPREDUCE",
  "metric": "elapsed_time",
  "compare": ">",
  "value": 3600000
}

Alert on MapReduce jobs that may affect any production SLA jobs running on a cluster.

Check for MapReduce jobs not in the SLA queue, running between 12 am and 3 am, and using > 1 TB of memory.

Use the JSON rule specifying MapReduce jobs using > 1 TB and set the rule conditions as shown.

Alert on ad hoc MapReduce jobs that use a majority of cluster resources, which may impact cluster performance.

Check for MapReduce jobs in the “root.adhocd” queue, running between 1 am and 5 am, and using > 1 TB of memory.

Use the JSON rule specifying MapReduce jobs using > 1 TB and set the rule conditions as shown.

Spark

The JSON rules to alert if a Spark app is grabbing the majority of cluster resources are exactly like the MapReduce rules, except that SPARK is used for the "type".

Alert on only Spark jobs using > 1 TB of memory.

{
  "scope": "multi_app",
  "type": "SPARK",
  "metric": "allocated_mb",
  "compare": ">",
  "value": 1073741824
}

Alert on only Spark jobs using > 1000 vCores.

{
  "scope": "multi_app",
  "type": "SPARK",
  "metric": "allocated_vcores",
  "compare": ">",
  "value": 1000
}
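Since the MapReduce and Spark resource rules differ only in "type", one way to keep them consistent is to generate both from a single template. The following Python sketch illustrates the idea; `resource_rule` is a hypothetical helper, not part of AutoActions.

```python
import json

def resource_rule(app_type, metric, value):
    """Build a multi_app resource rule; only the type differs between frameworks."""
    return {
        "scope": "multi_app",
        "type": app_type,
        "metric": metric,
        "compare": ">",
        "value": value,
    }

# Emit the vCores rule for each framework from the same template.
for app_type in ("MAPREDUCE", "SPARK"):
    print(json.dumps(resource_rule(app_type, "allocated_vcores", 1000), indent=2))
```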

Alert if a Spark SQL query has unbalanced input versus output, which may indicate inefficient or “rogue” queries.

Check if any Spark app is generating a large number of output rows compared with its input. In this example, ‘outputToInputRowRatio’ > 1000.

{
  "scope": "multi_app",
  "type": "SPARK",
  "user_metric": "outputToInputRowRatio",
  "compare": ">",
  "value": 1000
}

Alert if a Spark SQL has lots of output partitions.

Check if any Spark app ‘outputPartitions’ > 10000.

{
  "scope": "multi_app",
  "type": "SPARK",
  "user_metric": "outputPartitions",
  "compare": ">",
  "value": 10000
}

Hive

Alert if a Hive query is running longer than expected.

Check if a Hive query duration > 5 hours.

{
  "scope": "multi_app",
  "type": "HIVE",
  "user_metric": "duration",
  "compare": ">",
  "value": 18000000
}

Alert if an SLA-bound query is taking longer than expected.

Check if a Hive query started between 1 am and 3 am in the queue ‘prod’ runs longer than 20 minutes.
```
{
  "scope": "multi_app",
  "type": "HIVE",
  "user_metric": "duration",
  "compare": ">",
  "value": 1200000
}
```
Set the rule conditions as shown.

Check if any Hive query is started between 1 am and 3 am in any queue except ‘prod’.
```
{
  "scope": "multi_app",
  "type": "HIVE",
  "metric": "app_count",
  "compare": ">",
  "value": 0
}
```
Set the rule conditions as shown.

Alert if a Hive query has extensive I/O, which may affect HDFS and other apps.

Check if a Hive query writes out more than 100 GB in total.

{
  "scope": "multi_app",
  "type": "HIVE",
  "user_metric": "totalDfsBytesWritten",
  "compare": ">",
  "value": 107374182400
}

Check if a Hive query reads in more than 100 GB in total.

{
  "scope": "multi_app",
  "type": "HIVE",
  "user_metric": "totalDfsBytesRead",
  "compare": ">",
  "value": 107374182400
}
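The byte thresholds used for totalDfsBytesWritten and totalDfsBytesRead follow binary units (1 GB = 1024^3 bytes), which is how 100 GB becomes 107374182400. The following is a small Python sketch of that arithmetic; `gb_to_bytes` is a hypothetical helper name.

```python
GB = 1024 ** 3  # bytes per GB in binary units

def gb_to_bytes(gigabytes):
    """Convert a size in GB to bytes for use as a rule value."""
    return int(gigabytes * GB)

for size in (10, 100):
    print(f"{size} GB = {gb_to_bytes(size)} bytes")
```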

Detect inefficient and “stuck” Hive queries; that is, alert if a Hive query has read little data but has been running for a long time.

Check if any Hive query has read less than 10 GB in total and its duration is longer than 1 hour.

{
  "SAME":[
    {
      "scope":"multi_app",
      "type":"HIVE",
      "user_metric":"duration",
      "compare":">",
      "value":3600000
    },
    {
      "scope":"multi_app",
      "type":"HIVE",
      "user_metric":"totalDfsBytesRead",
      "compare":"<",
      "value":10737418240
    }
  ]
}

Tez

Alert if a Tez query is running longer than expected.

Check if a Tez query duration > 5 hours.

{
  "scope": "multi_app",
  "type": "TEZ",
  "user_metric": "duration",
  "compare": ">",
  "value": 18000000
}

Alert if an SLA-bound query is taking longer than expected.

Check if a Tez query started between 1 am and 3 am in the queue ‘prod’ runs longer than 20 minutes.
```
{
  "scope": "multi_app",
  "type": "TEZ",
  "user_metric": "duration",
  "compare": ">",
  "value": 1200000
}
```
Set the rule conditions as shown.

Check if any Tez query is started between 1 am and 3 am in any queue except ‘prod’.
```
{
  "scope": "multi_app",
  "type": "TEZ",
  "metric": "app_count",
  "compare": ">",
  "value": 0
}
```
Set the rule conditions as shown.

Alert if a Tez query has extensive I/O, which may affect HDFS and other apps.

Check if a Tez query writes out more than 100 GB in total.

{
  "scope": "multi_app",
  "type": "TEZ",
  "user_metric": "totalDfsBytesWritten",
  "compare": ">",
  "value": 107374182400
}

Check if a Tez query reads in more than 100 GB in total.

{
  "scope": "multi_app",
  "type": "TEZ",
  "user_metric": "totalDfsBytesRead",
  "compare": ">",
  "value": 107374182400
}

Detect inefficient and “stuck” Tez queries; that is, alert if a Tez query has read little data but has been running for a long time.

Check if any Tez query has read less than 10 GB in total and its duration is longer than 1 hour.

{
  "SAME":[
    {
      "scope":"multi_app",
      "type":"TEZ",
      "user_metric":"duration",
      "compare":">",
      "value":3600000
    },
    {
      "scope":"multi_app",
      "type":"TEZ",
      "user_metric":"totalDfsBytesRead",
      "compare":"<",
      "value":10737418240
    }
  ]
}

Workflow

Alert if a workflow is taking longer than expected.

Check if any workflow is running for longer than 5 hours.

{
  "scope": "multi_app",
  "type": "WORKFLOW",
  "user_metric": "duration",
  "compare": ">",
  "value": 18000000
}

Check if an SLA-bound workflow named ‘market_report’ runs longer than 30 minutes.

{
  "scope": "by_name",
  "target": "market_report",
  "type": "WORKFLOW",
  "user_metric": "duration",
  "compare": ">",
  "value": 1800000
}

Alert if an SLA-bound workflow reads more data than expected.

Check if a workflow is named ‘market_report’ and ‘totalDfsBytesRead’ > 100 GB.

{
  "scope": "by_name",
  "target": "market_report",
  "type": "WORKFLOW",
  "user_metric": "totalDfsBytesRead",
  "compare": ">",
  "value": 107374182400
}

Alert if an SLA-bound workflow takes longer than expected, and kill bigger apps not run by the SLA user.

Check if a workflow is named ‘prod_ml_model’ and duration > 2 hours, then kill jobs with allocated_vcores >= 20 and user != ‘sla_user’.

{
  "scope": "by_name",
  "target": "prod_ml_model",
  "type": "WORKFLOW",
  "user_metric": "duration",
  "compare": ">",
  "value": 7200000
}

Enter the following code in the Expert Mode template's Action Box.

{
  "action": "kill_app",
  "max_vcores": 20,
  "not_in_queues": ["sla_queue"],
  "if_triggered": false
}

User

Alert for a rogue user: any user consuming a major portion of cluster resources.

Check for any user where the allocated vCores aggregated over all their apps is > 1000.
You can use the Rogue User template or the JSON rule.
```
{
  "scope": "multi_user",
  "metric": "allocated_vcores",
  "compare": ">",
  "value": 1000
}
```
Check for any user whose allocated memory aggregated over all their apps is > 1 TB.
You can use the Rogue User template or the JSON rule.
```
{
  "scope": "multi_user",
  "metric": "allocated_mb",
  "compare": ">",
  "value": 1073741824
}
```

Queue

Alert for a rogue queue: any queue consuming a major portion of cluster resources.

Check for any queue where the allocated vCores aggregated over all its apps is > 1000.
```
{
  "scope": "multi_queue",
  "metric": "allocated_vcores",
  "compare": ">",
  "value": 1000
}
```

Check for any queue where the allocated memory aggregated over all its apps is > 1 TB.

{
  "scope": "multi_queue",
  "metric": "allocated_mb",
  "compare": ">",
  "value": 1073741824
}

Applications

While apps in the quarantine queue continue to run, the queue is preemptable and has a low resource allocation. If any other queue needs resources, it can preempt apps in the quarantine queue. Moving rogue apps to the quarantine queue frees resources for other apps. The following examples alert on vCores; to alert on memory, substitute memory for vCores in the following rules.

Alert for a rogue app

If any app that is not SLA-bound consumes more than a certain number of vCores at midnight, move it to a quarantine queue.

You can use the Rogue Application template to specify vCores.

Or use the Expert Mode template and set the JSON rule for vCores as follows:

{
  "scope": "multi_app",
  "metric": "allocated_vcores",
  "compare": ">",
  "value": 1000
}
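To alert on memory instead, as noted above, the same rule can be reused with the metric swapped. The following Python sketch assumes that "allocated_mb" is the memory counterpart of "allocated_vcores", as in the user and queue rules earlier, and copies the 1 TB threshold value from those rules rather than deriving it.

```python
import json

vcores_rule = {
    "scope": "multi_app",
    "metric": "allocated_vcores",
    "compare": ">",
    "value": 1000,
}

# Copy the rule, replacing only the metric and its threshold.
# The value is the 1 TB threshold used by the memory rules earlier in this document.
memory_rule = dict(vcores_rule, metric="allocated_mb", value=1073741824)

print(json.dumps(memory_rule, indent=2))
```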

Set the Time rule condition as:

Set the Move app rule as:

Any app that needs greater than X amount of resources has to be approved; otherwise, the app is moved to the quarantine queue.