diff options
Diffstat (limited to 'docs/monitor/configure-alarms.md')
-rw-r--r-- | docs/monitor/configure-alarms.md | 152 |
1 files changed, 0 insertions, 152 deletions
diff --git a/docs/monitor/configure-alarms.md b/docs/monitor/configure-alarms.md deleted file mode 100644 index 4b5b8134..00000000 --- a/docs/monitor/configure-alarms.md +++ /dev/null @@ -1,152 +0,0 @@ -<!-- -title: "Configure health alarms" -description: "Netdata's health monitoring watchdog is incredibly adaptable to your infrastructure's unique needs, with configurable health alarms." -custom_edit_url: "https://github.com/netdata/netdata/edit/master/docs/monitor/configure-alarms.md" -sidebar_label: "Configure health alarms" -learn_status: "Published" -learn_topic_type: "Tasks" -learn_rel_path: "Setup" ---> - -# Configure health alarms - -Netdata's health watchdog is highly configurable, with support for dynamic thresholds, hysteresis, alarm templates, and -more. You can tweak any of the existing alarms based on your infrastructure's topology or specific monitoring needs, or -create new entities. - -You can use health alarms in conjunction with any of Netdata's [collectors](https://github.com/netdata/netdata/blob/master/docs/collect/how-collectors-work.md) (see -the [supported collector list](https://github.com/netdata/netdata/blob/master/collectors/COLLECTORS.md)) to monitor the health of your systems, containers, and -applications in real time. - -While you can see active alarms both on the local dashboard and Netdata Cloud, all health alarms are configured _per -node_ via individual Netdata Agents. If you want to deploy a new alarm across your -[infrastructure](https://github.com/netdata/netdata/blob/master/docs/quickstart/infrastructure.md), you must configure each node with the same health configuration -files. - -## Edit health configuration files - -All of Netdata's [health configuration files](https://github.com/netdata/netdata/blob/master/health/REFERENCE.md#health-configuration-files) are in Netdata's config -directory, inside the `health.d/` directory. Navigate to your [Netdata config directory](https://github.com/netdata/netdata/blob/master/docs/configure/nodes.md) and -use `edit-config` to make changes to any of these files. - -For example, to edit the `cpu.conf` health configuration file, run: - -```bash -sudo ./edit-config health.d/cpu.conf -``` - -Each health configuration file contains one or more health _entities_, which always begin with `alarm:` or `template:`. -For example, here is the first health entity in `health.d/cpu.conf`: - -```yaml -template: 10min_cpu_usage - on: system.cpu - os: linux - hosts: * - lookup: average -10m unaligned of user,system,softirq,irq,guest - units: % - every: 1m - warn: $this > (($status >= $WARNING) ? (75) : (85)) - crit: $this > (($status == $CRITICAL) ? (85) : (95)) - delay: down 15m multiplier 1.5 max 1h - info: average cpu utilization for the last 10 minutes (excluding iowait, nice and steal) - to: sysadmin -``` - -To tune this alarm to trigger warning and critical alarms at a lower CPU utilization, change the `warn` and `crit` lines -to the values of your choosing. For example: - -```yaml - warn: $this > (($status >= $WARNING) ? (60) : (75)) - crit: $this > (($status == $CRITICAL) ? (75) : (85)) -``` - -Save the file and [reload Netdata's health configuration](#reload-health-configuration) to make your changes live. - -### Silence an individual alarm - -Instead of disabling an alarm altogether, or even disabling _all_ alarms, you can silence individual alarms by changing -one line in a given health entity. To silence any single alarm, change the `to:` line in its entity to `silent`. - -```yaml - to: silent -``` - -## Write a new health entity - -While tuning existing alarms may work in some cases, you may need to write entirely new health entities based on how -your systems, containers, and applications work. - -Read Netdata's [health reference](https://github.com/netdata/netdata/blob/master/health/REFERENCE.md#health-entity-reference) for a full listing of the format, -syntax, and functionality of health entities. - -To write a new health entity into a new file, navigate to your [Netdata config directory](https://github.com/netdata/netdata/blob/master/docs/configure/nodes.md), -then use `touch` to create a new file in the `health.d/` directory. Use `edit-config` to start editing the file. - -As an example, let's create a `ram-usage.conf` file. - -```bash -sudo touch health.d/ram-usage.conf -sudo ./edit-config health.d/ram-usage.conf -``` - -For example, here is a health entity that triggers a warning alarm when a node's RAM usage rises above 80%, and a -critical alarm above 90%: - -```yaml - alarm: ram_usage - on: system.ram -lookup: average -1m percentage of used - units: % - every: 1m - warn: $this > 80 - crit: $this > 90 - info: The percentage of RAM being used by the system. -``` - -Let's look into each of the lines to see how they create a working health entity. - -- `alarm`: The name for your new entity. The name needs to follow these requirements: - - Any alphabet letter or number. - - The symbols `.` and `_`. - - Cannot be `chart name`, `dimension name`, `family name`, or `chart variable names`. -- `on`: Which chart the entity listens to. -- `lookup`: Which metrics the alarm monitors, the duration of time to monitor, and how to process the metrics into a - usable format. - - `average`: Calculate the average of all the metrics collected. - - `-1m`: Use metrics from 1 minute ago until now to calculate that average. - - `percentage`: Clarify that we're calculating a percentage of RAM usage. - - `of used`: Specify which dimension (`used`) on the `system.ram` chart you want to monitor with this entity. -- `units`: Use percentages rather than absolute units. -- `every`: How often to perform the `lookup` calculation to decide whether or not to trigger this alarm. -- `warn`/`crit`: The value at which Netdata should trigger a warning or critical alarm. This example uses simple - syntax, but most pre-configured health entities use - [hysteresis](https://github.com/netdata/netdata/blob/master/health/REFERENCE.md#special-use-of-the-conditional-operator) to avoid superfluous notifications. -- `info`: A description of the alarm, which will appear in the dashboard and notifications. - -In human-readable format: - -> This health entity, named **ram_usage**, watches the **system.ram** chart. It looks up the last **1 minute** of -> metrics from the **used** dimension and calculates the **average** of all those metrics in a **percentage** format, -> using a **% unit**. The entity performs this lookup **every minute**. -> -> If the average RAM usage percentage over the last 1 minute is **more than 80%**, the entity triggers a warning alarm. -> If the usage is **more than 90%**, the entity triggers a critical alarm. - -When you finish writing this new health entity, [reload Netdata's health configuration](#reload-health-configuration) to -see it live on the local dashboard or Netdata Cloud. - -## Reload health configuration - -To make any changes to your health configuration live, you must reload Netdata's health monitoring system. To do that -without restarting all of Netdata, run `netdatacli reload-health` or `killall -USR2 netdata`. - -## What's next? - -With your health entities configured properly, it's time to [enable -notifications](https://github.com/netdata/netdata/blob/master/docs/monitor/enable-notifications.md) to get notified whenever a node reaches a warning or critical -state. - -To build complex, dynamic alarms, read our guide on [dimension templates](https://github.com/netdata/netdata/blob/master/docs/guides/monitor/dimension-templates.md). - - |