Overview
Unravel's AutoActions feature automates the monitoring of your compute cluster by letting you define complex, actionable rules on different cluster metrics. You can use an AutoAction to alert you to a situation that needs manual intervention, for example, resource contention or stuck jobs. An AutoAction can also be set to automatically kill an app or move it to a different queue.
The Unravel Server processes AutoActions by:
Collecting various metrics from the cluster.
Aggregating the collected metrics according to user-defined rules.
Detecting rule violations.
Applying defined actions for each rule violation.
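The four steps above can be pictured as a simple loop. The following is a minimal Python sketch, not Unravel's implementation: the `SketchRule` type, the `process` function, and the `memory_mb` metric name are all hypothetical and only illustrate the order collect, aggregate, detect, act.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical shape of one collected sample, e.g. {"memory_mb": 4096.0}.
Metrics = Dict[str, float]

@dataclass
class SketchRule:
    """Illustrative rule; not Unravel's data model."""
    name: str
    aggregate: Callable[[List[Metrics]], float]  # step 2: aggregate metrics
    threshold: float                             # step 3: violation limit
    action: Callable[[str, float], None]         # step 4: what to do

def process(samples: List[Metrics], rules: List[SketchRule]) -> List[str]:
    """Return the names of the rules violated by the collected samples."""
    violated = []
    for rule in rules:
        value = rule.aggregate(samples)    # aggregate the collected metrics
        if value > rule.threshold:         # detect a rule violation
            rule.action(rule.name, value)  # apply the defined action
            violated.append(rule.name)
    return violated
```

In this sketch an action is just a callable, so a rule with a no-op action would still be reported as violated without doing anything else.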
Each rule consists of:
A logical expression that is used to aggregate cluster metrics and to evaluate the rule. A rule has two kinds of conditions:
Prerequisite conditions: The conditions which cause a violation, for example, the number of jobs running or the memory used.
Defining conditions: Who/what/when can cause the violation, for example, user, apps.
Actions for Unravel Server to execute whenever it detects a rule violation.
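To make the parts of a rule concrete, here is a small Python sketch under stated assumptions. The class names, field names, and the single memory threshold are hypothetical and are not Unravel's rule schema. The sketch only separates the prerequisite condition (what causes a violation) from the defining conditions (who or what can cause it) and the list of actions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

@dataclass
class AppSample:
    """Hypothetical snapshot of one running app."""
    app_id: str
    user: str
    queue: str
    memory_mb: float

@dataclass
class ScopedRule:
    """Illustrative rule layout; not Unravel's format."""
    # Prerequisite condition: what causes a violation.
    max_memory_mb: float
    # Defining conditions: who/what can cause it (None means "all").
    users: Optional[Set[str]] = None
    queues: Optional[Set[str]] = None
    # Actions to run on a violation, e.g. ["email"]; may be empty.
    actions: List[str] = field(default_factory=list)

    def in_scope(self, app: AppSample) -> bool:
        user_ok = self.users is None or app.user in self.users
        queue_ok = self.queues is None or app.queue in self.queues
        return user_ok and queue_ok

    def violated_by(self, app: AppSample) -> bool:
        # A violation needs both: the app is in scope and over the limit.
        return self.in_scope(app) and app.memory_mb > self.max_memory_mb
```

For example, `ScopedRule(max_memory_mb=8192.0, queues={"etl"})` would only be violated by apps in the hypothetical `etl` queue that use more than 8192 MB.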
Manage > AutoActions
The AutoActions tab provides a quick way to view AutoActions and see their status, along with their defined actions and scope. The tab displays all defined AutoActions separated into an Active and an Inactive list. You enable or disable an AutoAction by clicking the check box on its left. You can edit or delete an AutoAction regardless of its status, and you can copy an AutoAction's JSON code. At the top are buttons which allow you to define new AutoActions.
Hovering over the AutoAction's name shows the description which was entered when the AutoAction was defined. Hovering over an action or scope glyph brings up its details. For example, for the active AutoAction above:
Rule description: .
Email action: an email is sent to only one person, .
Queue scope: is three queues, .
The Actions and Scope columns contain all available options. When an option has been set, i.e., it is no longer using the default setting, it is highlighted. It is possible to set an AutoAction which contains no actions, see quicktest below. Such an AutoAction simply logs when it was triggered and retains the data. Every AutoAction must have a scope. When the AutoAction has an action or scope defined via an Expert Rule, that action or scope isn't noted in the table. The AutoAction quicktest below doesn't note a scope; however, one was specified via an Expert Rule. The History of Runs column lists the number of times, if any, the AutoAction was triggered. Click the number to bring up its history.
By default, all actions are off. Possible actions are:
Send an email.
Kill the app.
Move the app to another queue.
Send an HTTP POST.
By default, each scope applies to everything, that is, to all apps at all times. The scopes are:
User.
Queue.
Cluster.
App.
Time.
Sustained Violation (This isn't shown in the AutoActions list).
If you haven't defined a particular action or scope, that is, it's using the default, the glyph is gray. When defined, the glyph is blue in an active AutoAction and darkened in a disabled AutoAction.
Click the History of Runs for detailed information on when the AutoAction was triggered. The history notes the time the action was triggered and contains a link to a Cluster View (see below) and a link to the offending app. Hover over the app's link to see the app's type, and click it to bring up the app's APM.
Click the run's Link button for the Cluster View (Operations > Usage Details > Infrastructure) for that particular run. The Cluster View shows a time slice, ±5 minutes from when the AutoAction was triggered, and lists all the apps running during that period. This app table is similar to the app table shown under Applications > Applications. Not all the running apps will have triggered an AutoAction during the time slice. Click in the graph to show the apps running at that point in time. The Notifications column notes if the app triggered the AutoAction, has tuning suggestions, or both. In the example below, two apps were running at the time; both triggered the action but neither has tuning suggestions. Hovering over the notification glyph brings up a pop-up listing the violations; in this example the first app violated two separate AutoActions.
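The ±5 minute time slice can be read as an interval-overlap filter. The Python sketch below is only an illustration of that idea: the `AppRun` type and `apps_in_slice` function are hypothetical, and treating an app as "running during that period" whenever its run overlaps the window is an assumption, not a description of how the Cluster View is computed.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple

class AppRun(NamedTuple):
    """Hypothetical record of one app's run interval."""
    app_id: str
    start: datetime
    end: datetime

def apps_in_slice(
    apps: List[AppRun],
    triggered_at: datetime,
    half_width: timedelta = timedelta(minutes=5),
) -> List[str]:
    """Return ids of apps whose run overlaps the window around the trigger."""
    lo = triggered_at - half_width
    hi = triggered_at + half_width
    # Two intervals overlap when each one starts before the other ends.
    return [a.app_id for a in apps if a.start <= hi and a.end >= lo]
```

An app that ran inside the window appears in the result even if it never violated the rule, which matches the note that not every listed app triggered the AutoAction.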
'Snoozing' AutoActions
AutoAction violations can become "noisy", i.e., an app continues to violate an AutoAction but the violation adds no new information. 'Snoozing' AutoActions helps you filter out noise by preventing automatic actions from repeating during a specified period, if and only if:
it is the same violation context, and
the action adds no further information to the violation.
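The two snooze conditions above can be sketched as a small suppression check. This is a hedged Python illustration, not Unravel's implementation: the `Snoozer` class, the use of a plain key for the violation context, and the equality test standing in for "adds no further information" are all assumptions.

```python
from typing import Dict, Hashable, Tuple

class Snoozer:
    """Suppress repeat actions for the same context with unchanged info."""

    def __init__(self, snooze_seconds: float) -> None:
        self.snooze_seconds = snooze_seconds
        # context -> (time the action last ran, info it carried)
        self._last: Dict[Hashable, Tuple[float, Hashable]] = {}

    def should_act(self, context: Hashable, info: Hashable, now: float) -> bool:
        previous = self._last.get(context)
        if previous is not None:
            fired_at, last_info = previous
            within_period = now - fired_at < self.snooze_seconds
            # Snooze only if BOTH hold: same context and nothing new to say.
            if within_period and info == last_info:
                return False
        self._last[context] = (now, info)
        return True
```

With a 60-second snooze, a repeat of the same context and info 30 seconds later is suppressed, while the same context with different info, or the same info after the period has passed, runs the action again.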