summaryrefslogtreecommitdiffstats
path: root/docs/monitor/configure-alarms.md
blob: 4b5b8134e4bc9aa514b834f1722f801841cd5e73 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
<!--
title: "Configure health alarms"
description: "Netdata's health monitoring watchdog is incredibly adaptable to your infrastructure's unique needs, with configurable health alarms."
custom_edit_url: "https://github.com/netdata/netdata/edit/master/docs/monitor/configure-alarms.md"
sidebar_label: "Configure health alarms"
learn_status: "Published"
learn_topic_type: "Tasks"
learn_rel_path: "Setup"
-->

# Configure health alarms

Netdata's health watchdog is highly configurable, with support for dynamic thresholds, hysteresis, alarm templates, and
more. You can tweak any of the existing alarms based on your infrastructure's topology or specific monitoring needs, or
create new entities.

You can use health alarms in conjunction with any of Netdata's [collectors](https://github.com/netdata/netdata/blob/master/docs/collect/how-collectors-work.md) (see
the [supported collector list](https://github.com/netdata/netdata/blob/master/collectors/COLLECTORS.md)) to monitor the health of your systems, containers, and
applications in real time.

While you can see active alarms both on the local dashboard and Netdata Cloud, all health alarms are configured _per
node_ via individual Netdata Agents. If you want to deploy a new alarm across your
[infrastructure](https://github.com/netdata/netdata/blob/master/docs/quickstart/infrastructure.md), you must configure each node with the same health configuration
files.

## Edit health configuration files

All of Netdata's [health configuration files](https://github.com/netdata/netdata/blob/master/health/REFERENCE.md#health-configuration-files) are in Netdata's config
directory, inside the `health.d/` directory. Navigate to your [Netdata config directory](https://github.com/netdata/netdata/blob/master/docs/configure/nodes.md) and
use `edit-config` to make changes to any of these files.

For example, to edit the `cpu.conf` health configuration file, run:

```bash
sudo ./edit-config health.d/cpu.conf
```

Each health configuration file contains one or more health _entities_, which always begin with `alarm:` or `template:`.
For example, here is the first health entity in `health.d/cpu.conf`:

```yaml
template: 10min_cpu_usage
      on: system.cpu
      os: linux
   hosts: *
  lookup: average -10m unaligned of user,system,softirq,irq,guest
   units: %
   every: 1m
    warn: $this > (($status >= $WARNING)  ? (75) : (85))
    crit: $this > (($status == $CRITICAL) ? (85) : (95))
   delay: down 15m multiplier 1.5 max 1h
    info: average cpu utilization for the last 10 minutes (excluding iowait, nice and steal)
      to: sysadmin
```

To tune this alarm to trigger warning and critical alarms at a lower CPU utilization, change the `warn` and `crit` lines
to the values of your choosing. For example:

```yaml
    warn: $this > (($status >= $WARNING)  ? (60) : (75))
    crit: $this > (($status == $CRITICAL) ? (75) : (85))
```

Save the file and [reload Netdata's health configuration](#reload-health-configuration) to make your changes live.

### Silence an individual alarm

Instead of disabling an alarm altogether, or even disabling _all_ alarms, you can silence individual alarms by changing
one line in a given health entity. To silence any single alarm, change the `to:` line in its entity to `silent`.

```yaml
      to: silent
```

## Write a new health entity

While tuning existing alarms may work in some cases, you may need to write entirely new health entities based on how
your systems, containers, and applications work.

Read Netdata's [health reference](https://github.com/netdata/netdata/blob/master/health/REFERENCE.md#health-entity-reference) for a full listing of the format,
syntax, and functionality of health entities.

To write a new health entity into a new file, navigate to your [Netdata config directory](https://github.com/netdata/netdata/blob/master/docs/configure/nodes.md),
then use `touch` to create a new file in the `health.d/` directory. Use `edit-config` to start editing the file.

As an example, let's create a `ram-usage.conf` file.

```bash
sudo touch health.d/ram-usage.conf
sudo ./edit-config health.d/ram-usage.conf
```

For example, here is a health entity that triggers a warning alarm when a node's RAM usage rises above 80%, and a
critical alarm above 90%:

```yaml
 alarm: ram_usage
    on: system.ram
lookup: average -1m percentage of used
 units: %
 every: 1m
  warn: $this > 80
  crit: $this > 90
  info: The percentage of RAM being used by the system.
```

Let's look into each of the lines to see how they create a working health entity.

-   `alarm`: The name for your new entity. The name needs to follow these requirements:
    -   Any alphabet letter or number.
    -   The symbols `.` and `_`.
    -   Cannot be `chart name`, `dimension name`, `family name`, or `chart variable names`.  
-   `on`: Which chart the entity listens to.
-   `lookup`: Which metrics the alarm monitors, the duration of time to monitor, and how to process the metrics into a
    usable format.
    -   `average`: Calculate the average of all the metrics collected.
    -   `-1m`: Use metrics from 1 minute ago until now to calculate that average.
    -   `percentage`: Clarify that we're calculating a percentage of RAM usage.
    -   `of used`: Specify which dimension (`used`) on the `system.ram` chart you want to monitor with this entity.
-   `units`: Use percentages rather than absolute units.
-   `every`: How often to perform the `lookup` calculation to decide whether or not to trigger this alarm.
-   `warn`/`crit`: The value at which Netdata should trigger a warning or critical alarm. This example uses simple
    syntax, but most pre-configured health entities use
    [hysteresis](https://github.com/netdata/netdata/blob/master/health/REFERENCE.md#special-use-of-the-conditional-operator) to avoid superfluous notifications.
-   `info`: A description of the alarm, which will appear in the dashboard and notifications.

In human-readable format: 

> This health entity, named **ram_usage**, watches the **system.ram** chart. It looks up the last **1 minute** of
> metrics from the **used** dimension and calculates the **average** of all those metrics in a **percentage** format,
> using a **% unit**. The entity performs this lookup **every minute**. 
> 
> If the average RAM usage percentage over the last 1 minute is **more than 80%**, the entity triggers a warning alarm.
> If the usage is **more than 90%**, the entity triggers a critical alarm.

When you finish writing this new health entity, [reload Netdata's health configuration](#reload-health-configuration) to
see it live on the local dashboard or Netdata Cloud.

## Reload health configuration

To make any changes to your health configuration live, you must reload Netdata's health monitoring system. To do that
without restarting all of Netdata, run `netdatacli reload-health` or `killall -USR2 netdata`.

## What's next?

With your health entities configured properly, it's time to [enable
notifications](https://github.com/netdata/netdata/blob/master/docs/monitor/enable-notifications.md) to get notified whenever a node reaches a warning or critical
state.

To build complex, dynamic alarms, read our guide on [dimension templates](https://github.com/netdata/netdata/blob/master/docs/guides/monitor/dimension-templates.md).