Diffstat (limited to '')
-rw-r--r-- | health/README.md | 657 |
1 files changed, 657 insertions, 0 deletions
diff --git a/health/README.md b/health/README.md
new file mode 100644
index 000000000..597bd3c32
--- /dev/null
+++ b/health/README.md
@@ -0,0 +1,657 @@
+
+# Health monitoring
+
+Each netdata node runs an independent thread evaluating health monitoring checks.
+This thread has lock free access to the database, so that it can operate as a watchdog.
+
+Health checks (alarms) are attached to netdata charts, allowing netdata to automatically
+activate an alarm as soon as a chart is created. This is very important for
+netdata, since many charts are dynamically created during runtime (for example, the
+chart tracking network interface packet drops is automatically created on the first
+packet dropped).
+
+Netdata also supports alarm **templates**, so that an alarm can be attached to all
+the charts of the same context (i.e. all network interfaces, or all disks, or all mysql servers, etc.).
+
+Each alarm can execute a single query to the database using statistical algorithms against past data,
+but alarms can be combined. So, if you need 2 queries in the database, you can combine
+2 alarms together (both will run a query to the database, and the results can be combined).
+
+Each alarm has unlimited access to all the metrics collected. So, a single alarm can
+use expressions combining the latest value of any number of metrics.
+
+## Health configuration reference
+
+Stock netdata health configuration is in `/usr/lib/netdata/conf.d/health.d`.
+These files can be overwritten by copying them and editing them in `/etc/netdata/health.d`
+(run `/etc/netdata/edit-config` to edit them).
+
+In `/etc/netdata/health.d` you can also put any number of files (in any number of sub-directories)
+with a suffix `.conf` to have them processed by netdata.
+
+Health configuration can be reloaded at any time, without restarting netdata.
+Just send netdata the SIGUSR2 signal, like this:
+
+```sh
+killall -USR2 netdata
+```
+
+### Entities in the health files
+
+There are 2 entities:
+
+1. **alarms**, which are attached to specific charts, and
+
+2. **templates**, which define rules that should be applied to all charts having a
+   specific `context`. You can use this feature to apply **alarms** to all disks,
+   all network interfaces, all mysql databases, all nginx web servers, etc.
+
+Both of these entities have exactly the same format and feature set.
+The only difference is the label `alarm` or `template`.
+
+netdata supports overriding **templates** with **alarms**.
+For example, when a template is defined for a set of charts, an alarm with exactly the
+same name, attached to a chart the template matches, will have higher precedence
+(i.e. netdata will use the alarm on this chart and prevent the template from being applied
+to it).
+
+### The format
+
+The following lines are parsed.
+
+#### alarm line `alarm` or `template`
+
+This line starts an alarm or alarm template.
+
+```
+alarm: NAME
+```
+
+or
+
+```
+template: NAME
+```
+
+This line has to be first on each alarm or template.
+`NAME` is anything you would like to name it (the only symbols allowed are `.` and `_`).
+
+---
+
+#### alarm line `on`
+
+This line defines the data the alarm should be attached to.
+
+For alarms:
+
+```
+on: CHART
+```
+
+For `CHART` you can use the `id` or the `name` of the chart, as shown on the dashboard.
+
+For alarm templates:
+
+```
+on: CONTEXT
+```
+
+`CONTEXT` is the template of a chart. For example the charts `mysql_local.net` and
+`mysql_server2.net` have the same context: `mysql.net`. So, you can use this to apply
+alarms to all `mysql.net` charts.
+
+To find the `CONTEXT` of a chart hover over its date, above the legend. A tooltip will
+appear with this format `plugin:module, context`. For example, the bandwidth chart of
+a network interface says:
+
+```
+proc:/proc/net/dev, net.net
+```
+
+So, `plugin = proc`, `module = /proc/net/dev` and `context = net.net`.
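As a rough illustration of how a template's `on: CONTEXT` line selects charts, the Python sketch below parses a tooltip of the form described above and picks the charts that share a context. The function names and the in-memory chart table are hypothetical and exist only for this example; they are not part of netdata.

```python
def parse_tooltip(tooltip):
    """Split a dashboard tooltip of the form 'plugin:module, context'."""
    source, context = tooltip.rsplit(",", 1)
    plugin, _, module = source.partition(":")
    return {
        "plugin": plugin.strip(),
        "module": module.strip(),
        "context": context.strip(),
    }


def charts_for_template(charts, context):
    """Return the ids of all charts whose context equals the template's `on:` value."""
    return sorted(chart_id for chart_id, ctx in charts.items() if ctx == context)


# Hypothetical chart table (chart id -> context), using the names from the text.
CHARTS = {
    "mysql_local.net": "mysql.net",
    "mysql_server2.net": "mysql.net",
    "net.eth0": "net.net",
}

print(parse_tooltip("proc:/proc/net/dev, net.net"))
print(charts_for_template(CHARTS, "mysql.net"))
```

A template with `on: mysql.net` would, under this model, be attached to both mysql charts, while an alarm with `on: mysql_local.net` targets exactly one of them.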
+
+---
+
+#### alarm line `os`
+
+This alarm or template will be used only if the O/S of the host loading it matches this
+pattern list. The value is a space separated list of simple patterns (use `*` as wildcard,
+prefix with `!` for a negative match, order is important).
+
+```
+os: linux freebsd macos
+```
+
+---
+
+#### alarm line `hosts`
+
+This alarm or template will be used only if the hostname of the host loading it matches
+this pattern list. The value is a space separated list of simple patterns (use `*` as wildcard,
+prefix with `!` for a negative match, order is important).
+
+```
+hosts: server1 server2 database* !redis3 redis*
+```
+
+The above says: use this alarm on all hosts named `server1`, `server2`, `database*`, and
+all `redis*` except `redis3`.
+
+This is useful when you centralize metrics from multiple hosts to one netdata.
+
+---
+
+#### alarm line `families`
+
+This line is only used in alarm templates. It filters the charts. So, if you need to create
+an alarm template for a few of a kind of chart (a few of your disks, or a few of your network
+interfaces, or a few of your mysql servers, etc), you can create an alarm template that would
+normally be applied to all of them, and filter them by family.
+
+The format is:
+
+```
+families: SIMPLE PATTERN LIST
+```
+
+A simple pattern list is a list of space separated patterns. Use `*` as wildcard and `!`
+for a negative match. Processing is left to right, and on the first hit (positive or negative),
+processing stops.
+
+So, `families: *` means match anything, while `families: !bad*pattern* *` means anything
+except `bad*pattern*` (where `*` is a wildcard to match any sequence of characters).
+
+The family of a chart is usually the submenu of the netdata dashboard it appears under.
+
+---
+
+#### alarm line `lookup`
+
+This line makes a database lookup to find a value. The result of this lookup is available as `$this`.
+
+The format is:
+
+```
+lookup: METHOD AFTER [at BEFORE] [every DURATION] [OPTIONS] [of DIMENSIONS]
+```
+
+Everything works the same as in [badges](../web/api/badges/). In short:
+
+- `METHOD` is one of `average`, `min`, `max`, `sum`, `incremental-sum`.
+  This is required.
+
+- `AFTER` is a relative number of seconds, but it also accepts a single letter for changing
+  the units, like `-1s` = 1 second in the past, `-1m` = 1 minute in the past, `-1h` = 1 hour
+  in the past, `-1d` = 1 day in the past. You need a negative number (i.e. how far in the past
+  to look for the value). **This is required**.
+
+- `at BEFORE` is by default 0 and is not required. Using this you can define the end of the
+  lookup. So data will be evaluated between `AFTER` and `BEFORE`.
+
+- `every DURATION` sets the update frequency of the lookup (supports single letter units as
+  above too).
+
+- `OPTIONS` is a space separated list of `percentage`, `absolute`, `min2max`, `unaligned`,
+  `match-ids`, `match-names`. Check the badges documentation for more info.
+
+- `of DIMENSIONS` is optional and has to be the last parameter. Dimensions have to be separated
+  by `,` or `|`. The space characters found in dimensions will be kept as-is (a few dimensions
+  have spaces in their names). This accepts netdata simple patterns and the `match-ids` and
+  `match-names` options affect the searches for dimensions.
+
+The result of the lookup will be available as `$this` and `$NAME` in expressions.
+The timestamps of the timeframe evaluated by the database lookup are available as the variables
+`$after` and `$before` (both are unix timestamps).
+
+---
+
+#### alarm line `calc`
+
+This expression is evaluated just after the `lookup` (if any). Its purpose is to apply some
+calculation before using the value looked up from the db.
+
+You can also have an expression without a lookup, using other variables that are available.
+
+The result of the calculation will be available as `$this` in warning and critical expressions
+(overwriting the `lookup` one).
+
+Format:
+
+```
+calc: EXPRESSION
+```
+
+Check [Expressions](#expressions) for more information.
+
+---
+
+#### alarm line `every`
+
+Sets the update frequency of this alarm. This is the same as the `every DURATION` given
+in the `lookup` lines.
+
+Format:
+
+```
+every: DURATION
+```
+
+`DURATION` accepts `s` for seconds, `m` for minutes, `h` for hours, `d` for days.
+
+---
+
+#### alarm lines `green` and `red`
+
+Set the green and red thresholds of a chart. Both are available as `$green` and `$red` in
+expressions. If multiple alarms define different thresholds, the ones defined by the first
+alarm will be used. These will eventually be visualized on the dashboard, so only one set of
+them is allowed. If you need multiple sets of them in different alarms, use absolute numbers
+instead of `$red` and `$green`.
+
+Format:
+
+```
+green: NUMBER
+red: NUMBER
+```
+
+---
+
+#### alarm lines `warn` and `crit`
+
+These expressions should evaluate to true or false (alternatively non-zero or zero).
+They trigger the alarm. Both are optional.
+
+Format:
+
+```
+warn: EXPRESSION
+crit: EXPRESSION
+```
+
+Check [Expressions](#expressions) for more information.
+
+---
+
+#### alarm line `to`
+
+This will be the first parameter of the script to be executed when the alarm switches status.
+Its meaning is left up to the `exec` script.
+
+The default `exec` script, `alarm-notify.sh`, uses this field as a space separated list of roles,
+which are then consulted to find the exact recipients per notification method.
+
+Format:
+
+```
+to: ROLE1 ROLE2 ROLE3 ...
+```
+
+---
+
+#### alarm line `exec`
+
+The script that will be executed when the alarm changes status.
+
+Format:
+
+```
+exec: SCRIPT
+```
+
+The default `SCRIPT` is netdata's `alarm-notify.sh`, which supports all the notification
+methods netdata supports, including custom hooks.
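The simple pattern lists accepted by the `os`, `hosts` and `families` lines earlier in this section are evaluated left to right, and the first hit decides the outcome. The Python sketch below models that rule as described in the text. It is an approximation written for illustration (the function name is hypothetical), not netdata's own matcher, and it assumes a value that hits no pattern is treated as not matching.

```python
import re


def simple_pattern_match(pattern_list, value):
    """Evaluate a space separated simple pattern list against a value.

    Patterns are tried left to right; the first one that hits decides the
    result (a leading '!' makes it a negative match).
    Assumption: when nothing hits, the value does not match.
    """
    for pattern in pattern_list.split():
        negative = pattern.startswith("!")
        if negative:
            pattern = pattern[1:]
        # '*' is the only wildcard; everything else is matched literally.
        regex = ".*".join(re.escape(part) for part in pattern.split("*"))
        if re.fullmatch(regex, value):
            return not negative
    return False


HOSTS = "server1 server2 database* !redis3 redis*"
for host in ("server1", "database7", "redis3", "redis5", "web1"):
    print(host, simple_pattern_match(HOSTS, host))
```

Order matters in this model: with `redis* !redis3` the host `redis3` would hit `redis*` first and match, which is why the negative pattern is listed before the broader positive one.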
+
+---
+
+#### alarm line `delay`
+
+This is used to provide optional hysteresis settings for the notifications, to defend
+against notification floods. These settings do not affect the actual alarm - only the time
+the `exec` script is executed.
+
+Format:
+
+```
+delay: [[[up U] [down D] multiplier M] max X]
+```
+
+- `up U` defines the delay to be applied to a notification for an alarm that raised its status
+  (i.e. CLEAR to WARNING, CLEAR to CRITICAL, WARNING to CRITICAL). For example, with `up 10s` the
+  notification for this event will be sent 10 seconds after the actual event. This is used in
+  the hope that the alarm will get back to its previous state within the duration given. The default `U`
+  is zero.
+
+- `down D` defines the delay to be applied to a notification for an alarm that moves to a lower
+  state (i.e. CRITICAL to WARNING, CRITICAL to CLEAR, WARNING to CLEAR). For example, `down 1m`
+  will delay the notification by 1 minute. This is used to prevent notifications for flapping
+  alarms. The default `D` is zero.
+
+- `multiplier M` multiplies `U` and `D` when an alarm changes state while a notification is
+  delayed. The default multiplier is `1.0`.
+
+- `max X` defines the maximum absolute notification delay an alarm may get. The default `X`
+  is `max(U * M, D * M)` (i.e. the max duration of `U` or `D` multiplied once with `M`).
+
+  Example:
+
+  `delay: up 10s down 15m multiplier 2 max 1h`
+
+  The time is `00:00:00` and the status of the alarm is CLEAR.
+
+  time of event|new status|delay|notification will be sent|why
+  -------------|----------|:---:|-------------------------|---
+  00:00:01|WARNING|`up 10s`|00:00:11|first state switch
+  00:00:05|CLEAR|`down 15m x2`|00:30:05|the alarm changes state while a notification is delayed, so it was multiplied
+  00:00:06|WARNING|`up 10s x2 x2`|00:00:26|multiplied twice
+  00:00:07|CLEAR|`down 15m x2 x2 x2`|00:45:07|multiplied 3 times.
+
+  So:
+
+  - `U` and `D` are multiplied by `M` every time the alarm changes state (any state, not just
+    their matching one) and a delay is in place.
+  - All are reset to their defaults when the alarm switches state without a delay in place.
+
+---
+
+### Expressions
+
+netdata has an internal [infix expression parser](../libnetdata/eval).
+This parses expressions and creates an internal structure that allows fast execution of them.
+
+These operators are supported: `+`, `-`, `*`, `/`, `<`, `<=`, `<>`, `!=`, `>`, `>=`, `&&`, `||`,
+`!`, `AND`, `OR`, `NOT`. Boolean operators result in either `1` (true) or `0` (false).
+
+The conditional evaluation operator `?` is supported too. Using this operator, IF-THEN-ELSE
+conditional statements can be specified. The format is: `(condition) ? (true expression) :
+(false expression)`. So, netdata will first evaluate the `condition` and based on the result
+will either evaluate `true expression` or `false expression`.
+Example: `($this > 0) ? ($avail * 2) : ($used / 2)`.
+Nested such expressions are also supported (i.e. `true expression` and `false expression` can
+contain conditional evaluations).
+
+Expressions also support the `abs()` function.
+
+Expressions can have variables. Variables start with `$`. Check below for more information.
+
+There are two special values you can use:
+
+ - `nan`, for example `$this != nan` will check if the variable `this` is available.
+   A variable can be `nan` if the database lookup failed. All calculations (i.e. addition,
+   multiplication, etc) with a `nan` result in a `nan`.
+
+ - `inf`, for example `$this != inf` will check if `this` is not infinite. A value or
+   variable can be infinite if divided by zero. All calculations (i.e. addition,
+   multiplication, etc) with an `inf` result in an `inf`.
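The way `nan` and `inf` spread through calculations resembles ordinary floating point arithmetic, so it can be tried out in Python. The sketch below is only an analogy under that assumption. The `divide` and `conditional` helpers are hypothetical, and `divide` encodes the assumption that dividing a non-zero value by zero gives an infinite result, as the text states, instead of raising an error as plain Python does.

```python
import math


def divide(a, b):
    """Division that returns inf (or nan for 0/0) instead of raising.

    This mirrors the statement that a value divided by zero becomes
    infinite; the exact netdata behavior is assumed, not verified.
    """
    if b == 0:
        if a == 0 or math.isnan(a):
            return math.nan
        return math.copysign(math.inf, a)
    return a / b


def conditional(this, avail, used):
    """Python rendering of `($this > 0) ? ($avail * 2) : ($used / 2)`."""
    return avail * 2 if this > 0 else divide(used, 2)


failed_lookup = math.nan
print(math.isnan(failed_lookup + 10))  # nan propagates through arithmetic
print(math.isinf(divide(5, 0) * 3))    # an infinite value stays infinite
print(conditional(1, 40, 60), conditional(0, 40, 60))
```

The sketch uses `math.isnan()` because in Python any `!=` comparison against `nan` is true, so the `$this != nan` idiom from the text does not carry over literally.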
+
+---
+
+### Special use of the conditional operator
+
+A common (but not necessarily obvious) use of the conditional evaluation operator is
+to provide [hysteresis](https://en.wikipedia.org/wiki/Hysteresis) around the critical
+or warning thresholds. This usage helps to avoid bogus messages resulting from small
+variations in the value when it is varying regularly but staying close to the threshold
+value, without needing to delay sending messages at all.
+
+An example of such usage from the default CPU usage alarms bundled with netdata is:
+
+```
+warn: $this > (($status >= $WARNING) ? (75) : (85))
+crit: $this > (($status == $CRITICAL) ? (85) : (95))
+```
+
+The above says:
+
+* If the alarm is currently a warning, then the threshold for being considered a warning
+  is 75, otherwise it's 85.
+
+* If the alarm is currently critical, then the threshold for being considered critical
+  is 85, otherwise it's 95.
+
+Which, in turn, results in the following behavior:
+
+* While the value is rising, it will trigger a warning when it exceeds 85, and a critical
+  alert when it exceeds 95.
+
+* While the value is falling, it will return to a warning state when it goes below 85,
+  and a normal state when it goes below 75.
+
+* If the value is constantly varying between 80 and 90, then it will trigger a warning the
+  first time it goes above 85, but will remain a warning until it goes below 75 (or goes above 95).
+
+* If the value is constantly varying between 90 and 100, then it will trigger a critical alert
+  the first time it goes above 95, but will remain a critical alert until it goes below 85 (at which
+  point it will return to being a warning).
+
+---
+
+### Variables
+
+netdata supports 3 new internal indexes for variables that will be used in health monitoring:
+
+ - **chart local variables**. All the dimensions of the chart are exposed as local variables.
+   All chart alarm names are exposed as variables too.
+
+   Charts also define a few special variables:
+
+   - `$last_collected_t` is the unix timestamp of the last data collection
+   - `$collected_total_raw` is the sum of all the dimensions (their last collected values)
+   - `$update_every` is the update frequency of the chart
+   - `$green` and `$red` are the thresholds defined in alarms (these are per chart - the chart
+     inherits them from the first alarm that defined them)
+
+   Chart dimensions define their last calculated (i.e. interpolated) value, exactly as
+   shown on the charts, but also a variable with their name and suffix `_raw` that resolves
+   to the last collected value - as collected - and another with suffix `_last_collected_t`
+   that resolves to the unix timestamp the dimension was last collected (there may be dimensions
+   that fail to be collected while others continue normally).
+
+ - **family variables**. Families are used to group charts together. For example all `eth0`
+   charts have `family = eth0`. This index includes all local variables, but if there are
+   overlapping variables, only the first are exposed.
+
+ - **host variables**. All the dimensions of all charts, including all alarms, in fullname.
+   Fullname is `CHART.VARIABLE`, where `CHART` is either the chart id or the chart name (both
+   are supported).
+
+ - **special variables** are:
+
+   - `this`, which is resolved to the value of the current alarm.
+
+   - `status`, which is resolved to the current status of the alarm (the current = the last
+     status, i.e. before the current database lookup and the evaluation of the `calc` line).
+     This value can be compared with `$REMOVED`, `$UNINITIALIZED`, `$UNDEFINED`, `$CLEAR`,
+     `$WARNING`, `$CRITICAL`. These values are incremental, i.e. `$status > $CLEAR` works as
+     expected.
+
+   - `now`, which is resolved to the current unix timestamp.
+
+You can find all the variables that can be used for a given chart, using
+`http://your.netdata.ip:19999/api/v1/alarm_variables?chart=NAME`.
+This will dump all the indexes from the chart's perspective.
+Example: [variables for the `system.cpu` chart of the registry](https://registry.my-netdata.io/api/v1/alarm_variables?chart=system.cpu).
+
+## Alarm Statuses
+
+Alarms can have the following statuses:
+
+ - `REMOVED` - the alarm has been deleted (this happens when a SIGUSR2 is sent to netdata
+   to reload health configuration)
+
+ - `UNINITIALIZED` - the alarm is not initialized yet
+
+ - `UNDEFINED` - the alarm failed to be calculated (i.e. the database lookup failed,
+   a division by zero occurred, etc)
+
+ - `CLEAR` - the alarm is not armed / raised (i.e. is OK)
+
+ - `WARNING` - the warning expression resulted in true or non-zero
+
+ - `CRITICAL` - the critical expression resulted in true or non-zero
+
+The external script will be called for all status changes.
+
+## Examples
+
+
+Check the **[health.d directory](health.d)** for all alarms shipped with netdata.
+
+Here are a few examples:
+
+### Example 1
+
+A simple check if an apache server is alive:
+
+```
+template: apache_last_collected_secs
+      on: apache.requests
+    calc: $now - $last_collected_t
+   every: 10s
+    warn: $this > ( 5 * $update_every)
+    crit: $this > (10 * $update_every)
+```
+
+The above checks that netdata is able to collect data from apache. In detail:
+
+```
+template: apache_last_collected_secs
+```
+
+The above defines a **template** named `apache_last_collected_secs`.
+The name is important since `$apache_last_collected_secs` resolves to the `calc` line.
+So, try to give something descriptive.
+
+```
+      on: apache.requests
+```
+
+The above applies the **template** to all charts that have `context = apache.requests`
+(i.e. all your apache servers).
+
+```
+    calc: $now - $last_collected_t
+```
+
+- `$now` is a standard variable that resolves to the current timestamp.
+
+- `$last_collected_t` is the last data collection timestamp of the chart.
+  So this calculation gives the number of seconds passed since the last data collection.
+
+```
+   every: 10s
+```
+
+The alarm will be evaluated every 10 seconds.
+
+```
+    warn: $this > ( 5 * $update_every)
+    crit: $this > (10 * $update_every)
+```
+
+If these result in non-zero or true, they trigger the alarm.
+
+- `$this` refers to the value of this alarm (i.e. the result of the `calc` line).
+  We could also use `$apache_last_collected_secs`.
+
+`$update_every` is the update frequency of the chart, in seconds.
+
+So, the warning condition checks if we have not collected data from apache for 5
+iterations and the critical condition checks for 10 iterations.
+
+### Example 2
+
+Check if any of the disks is critically low on disk space:
+
+```
+template: disk_full_percent
+      on: disk.space
+    calc: $used * 100 / ($avail + $used)
+   every: 1m
+    warn: $this > 80
+    crit: $this > 95
+```
+
+`$used` and `$avail` are the `used` and `avail` chart dimensions as shown on the dashboard.
+
+So, the `calc` line finds the percentage of used space. `$this` resolves to this percentage.
+
+### Example 3
+
+Predict if any disk will run out of space in the near future.
+
+We do this in 2 steps:
+
+Calculate the disk fill rate:
+
+```
+  template: disk_fill_rate
+        on: disk.space
+    lookup: max -1s at -30m unaligned of avail
+      calc: ($this - $avail) / (30 * 60)
+     every: 15s
+```
+
+In the `calc` line: `$this` is the result of the `lookup` line (i.e. the free space 30 minutes
+ago) and `$avail` is the current disk free space. So the `calc` line will either have a positive
+number of GB/second if the disk is filling up, or a negative number of GB/second if the disk is
+freeing up space.
+
+There are no `warn` or `crit` lines here. So, this template will just do the calculation and
+nothing more.
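To make the sign convention of the fill rate concrete, here is the same arithmetic as the `calc` line written as a small Python sketch. The function name and the sample numbers are made up for illustration; only the formula comes from the template above.

```python
def disk_fill_rate(avail_30m_ago, avail_now, window_seconds=30 * 60):
    """Mirror of `calc: ($this - $avail) / (30 * 60)`.

    Positive result: free space is shrinking (the disk is filling up).
    Negative result: free space is growing.
    """
    return (avail_30m_ago - avail_now) / window_seconds


# Made-up sample values, in GB of free space.
print(disk_fill_rate(100.0, 91.0))   # 9 GB consumed over 1800 seconds
print(disk_fill_rate(100.0, 109.0))  # 9 GB freed over 1800 seconds
```

With 9 GB consumed in 30 minutes the rate is 0.005 GB/second; the freed-space case gives the same magnitude with a negative sign.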
+
+Predict the hours after which the disk will run out of space:
+
+```
+  template: disk_full_after_hours
+        on: disk.space
+      calc: $avail / $disk_fill_rate / 3600
+     every: 10s
+      warn: $this > 0 and $this < 48
+      crit: $this > 0 and $this < 24
+```
+
+The `calc` line estimates the time, in hours, after which we will run out of disk space. Of course, only
+positive values are interesting for this check, so the warning and critical conditions check
+for positive values and whether we have less than 48 and 24 hours of free space left, respectively.
+
+Once this alarm triggers we will receive an email like this:
+
+![image](https://cloud.githubusercontent.com/assets/2662304/17839993/87872b32-6802-11e6-8e08-b2e4afef93bb.png)
+
+### Example 4
+
+Check if any network interface is dropping packets:
+
+```
+template: 30min_packet_drops
+      on: net.drops
+  lookup: sum -30m unaligned absolute
+   every: 10s
+    crit: $this > 0
+```
+
+The `lookup` line will calculate the sum of all the dropped packets in the last 30 minutes.
+
+The `crit` line will issue a critical alarm if even a single packet has been dropped.
+
+Note that the drops chart does not exist if a network interface has never dropped a single packet.
+When netdata detects a dropped packet, it will add the chart and it will automatically attach this
+alarm to it.
+
+## Troubleshooting
+
+You can compile netdata with [debugging](../daemon#debugging) and then set in `netdata.conf`:
+
+```
+[global]
+   debug flags = 0x0000000000800000
+```
+
+Then check your `/var/log/netdata/debug.log`. It will show you how it works.
+Important: this will generate a lot of output in debug.log.
+
+You can find the context of charts by looking up the chart in either
+`http://your.netdata:19999/netdata.conf` or `http://your.netdata:19999/api/v1/charts`.
+
+You can find how netdata interpreted the expressions by examining the alarm at
+`http://your.netdata:19999/api/v1/alarms?all`. For each expression, netdata will return the
+expression as given in its config file, and the same expression with additional parentheses
+added to indicate the evaluation flow of the expression.
+