author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-05-06 01:22:31 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-05-06 01:22:31 +0000 |
commit | 8d4f58e49b9dc7d3545651023a36729de773ad86 | |
tree | 7bc7be4a8e9e298daa1349348400aa2a653866f2 /health | |
parent | Initial commit. | |
download | netdata-8d4f58e49b9dc7d3545651023a36729de773ad86.tar.xz netdata-8d4f58e49b9dc7d3545651023a36729de773ad86.zip |
Adding upstream version 1.12.0. (refs: upstream/1.12.0, upstream)
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to '')
108 files changed, 9902 insertions, 0 deletions
diff --git a/health/.keep b/health/.keep new file mode 100644 index 0000000..e69de29 --- /dev/null +++ b/health/.keep diff --git a/health/Makefile.am b/health/Makefile.am new file mode 100644 index 0000000..40592a9 --- /dev/null +++ b/health/Makefile.am @@ -0,0 +1,81 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +AUTOMAKE_OPTIONS = subdir-objects +MAINTAINERCLEANFILES = $(srcdir)/Makefile.in + +SUBDIRS = \ + notifications \ + $(NULL) + +CLEANFILES = \ + $(NULL) + +dist_noinst_DATA = \ + README.md \ + $(NULL) + +userhealthconfigdir=$(configdir)/health.d +dist_userhealthconfig_DATA = \ + .keep \ + $(NULL) + +healthconfigdir=$(libconfigdir)/health.d +dist_healthconfig_DATA = \ + health.d/adaptec_raid.conf \ + health.d/apache.conf \ + health.d/apcupsd.conf \ + health.d/backend.conf \ + health.d/bcache.conf \ + health.d/beanstalkd.conf \ + health.d/bind_rndc.conf \ + health.d/boinc.conf \ + health.d/btrfs.conf \ + health.d/ceph.conf \ + health.d/cpu.conf \ + health.d/couchdb.conf \ + health.d/disks.conf \ + health.d/dockerd.conf \ + health.d/elasticsearch.conf \ + health.d/entropy.conf \ + health.d/fping.conf \ + health.d/fronius.conf \ + health.d/haproxy.conf \ + health.d/httpcheck.conf \ + health.d/ipc.conf \ + health.d/ipfs.conf \ + health.d/ipmi.conf \ + health.d/isc_dhcpd.conf \ + health.d/lighttpd.conf \ + health.d/linux_power_supply.conf \ + health.d/load.conf \ + health.d/mdstat.conf \ + health.d/megacli.conf \ + health.d/memcached.conf \ + health.d/memory.conf \ + health.d/mongodb.conf \ + health.d/mysql.conf \ + health.d/named.conf \ + health.d/net.conf \ + health.d/netfilter.conf \ + health.d/nginx.conf \ + health.d/nginx_plus.conf \ + health.d/portcheck.conf \ + health.d/postgres.conf \ + health.d/qos.conf \ + health.d/ram.conf \ + health.d/redis.conf \ + health.d/retroshare.conf \ + health.d/softnet.conf \ + health.d/squid.conf \ + health.d/stiebeleltron.conf \ + health.d/swap.conf \ + health.d/tcp_conn.conf \ + health.d/tcp_listen.conf \ + 
health.d/tcp_mem.conf \ + health.d/tcp_orphans.conf \ + health.d/tcp_resets.conf \ + health.d/udp_errors.conf \ + health.d/varnish.conf \ + health.d/web_log.conf \ + health.d/zfs.conf \ + $(NULL) diff --git a/health/README.md b/health/README.md new file mode 100644 index 0000000..54f6a3e --- /dev/null +++ b/health/README.md @@ -0,0 +1,663 @@ +# Health monitoring + +Each netdata node runs an independent thread evaluating health monitoring checks. +This thread has lock free access to the database, so that it can operate as a watchdog. + +Health checks (alarms) are attached to netdata charts, allowing netdata to automatically +activate an alarm as soon as a chart is created. This is very important for +netdata, since many charts are dynamically created during runtime (for example, the +chart tracking network interface packet drops, is automatically created on the first +packet dropped). + +Netdata also supports alarm **templates**, so that an alarm can be attached to all the charts of the same context (i.e. all network interfaces, or all disks, or all mysql servers, etc.). + + +Each alarm can execute a single query to the database using statistical algorithms against past data, +but alarms can be combined. So, if you need 2 queries in the database, you can combine +2 alarms together (both will run a query to the database, and the results can be combined). + +Each alarm has unlimited access to all the metrics collected. So, a single alarm can +use expressions combining the latest value of any number of metrics. + +## Health configuration reference + +Stock netdata health configuration is in `/usr/lib/netdata/conf.d/health.d`. +These files can be overwritten by copying them and editing them in `/etc/netdata/health.d` +(run `/etc/netdata/edit-config` to edit them). + +In `/etc/netdata/health.d` you can also put any number of files (in any number of sub-directories) +with a suffix `.conf` to have them processed by netdata. 
+ +Health configuration can be reloaded at any time, without restarting netdata. +Just send netdata the SIGUSR2 signal, like this: + +```sh +killall -USR2 netdata +``` + +### Entities in the health files + +There are 2 entities: + +1. **alarms**, which are attached to specific charts, and + +1. **templates**, which define rules that should be applied to all charts having a + specific `context`. You can use this feature to apply **alarms** to all disks, + all network interfaces, all mysql databases, all nginx web servers, etc. + +Both of these entities have exactly the same format and feature set. +The only difference is the label `alarm` or `template`. + +Netdata supports overriding **templates** with **alarms**. +For example, when a template is defined for a set of charts, an alarm with exactly the +same name attached to the same chart the template matches, will have higher precedence +(i.e. netdata will use the alarm on this chart and prevent the template from being applied +to it). + +### The format + +The following lines are parsed. + +#### Alarm line `alarm` or `template` + +This line starts an alarm or alarm template. + +``` +alarm: NAME +``` + +or + +``` +template: NAME +``` + +This line has to be first on each alarm or template. +`NAME` is anything you would like to name it (the only symbols allowed are `.` and `_`). + +--- + +#### Alarm line `on` + +This line defines the data the alarm should be attached to. + +For alarms: + +``` +on: CHART +``` + +For `CHART` you can use a chart `id` or `name` of the chart, as shown on the dashboard. + +For alarm templates: + +``` +on: CONTEXT +``` + +`CONTEXT` is the template of a chart. For example the charts `mysql_local.net` and +`mysql_server2.net` have the same context: `mysql.net`. So, you can use this to apply +alarms to all `mysql.net` charts. + +To find the `CONTEXT` of a chart hover over its date, above the legend. A tooltip will +appear with this format `plugin:nodule, context`. 
For example, the bandwidth chart of +a network interface says: + +``` +proc:/proc/dev/dev, net.net +``` + +So, `plugin = proc`, `module = /proc/net/dev` and `context = net.net`. + +--- + +#### Alarm line `os` + +This alarm or template will be used only if the O/S of the host loading it, matches this +pattern list. The value is a space separated list of simple patterns (use `*` as wildcard, +prefix with `!` for a negative match, order is important). + +``` +os: linux freebsd macos +``` + +--- + +#### Alarm line `hosts` + +This alarm or template will be used only if the hostname of the host loading it, matches +this pattern list. The value is a space separated list of simple patterns (use `*` as wildcard, +prefix with `!` for a negative match, order is important). + +``` +hosts: server1 server2 database* !redis3 redis* +``` + +The above says: use this alarm on all hosts named `server1`, `server2`, `database*`, and +all `redis*` except `redis3`. + +This is useful when you centralize metrics from multiple hosts, to one netdata. + +--- + +#### Alarm line `families` + +This line is only used in alarm templates. It filters the charts. So, if you need to create +an alarm template for a few of a kind of chart (a few of your disks, or a few of your network +interfaces, or a few your mysql servers, etc), you can create an alarm template that would +normally be applied to all of them, and filter them by [family](../docs/Charts.md#families). + +The format is: + +``` +families: SIMPLE PATTERN LIST +``` + +The simple pattern syntax and operation is explained in [simple patterns](../libnetdata/simple_pattern/). + +--- + +#### Alarm line `lookup` + +This line makes a database lookup to find a value. This result of this lookup is available as `$this`. + +The format is: + +``` +lookup: METHOD AFTER [at BEFORE] [every DURATION] [OPTIONS] [of DIMENSIONS] +``` + +Everything is the same with [badges](../web/api/badges/). 
In short: + +- `METHOD` is one of `average`, `min`, `max`, `sum`, `incremental-sum`. + This is required. + +- `AFTER` is a relative number of seconds, but it also accepts a single letter for changing + the units, like `-1s` = 1 second in the past, `-1m` = 1 minute in the past, `-1h` = 1 hour + in the past, `-1d` = 1 day in the past. You need a negative number (i.e. how far in the past + to look for the value). **This is required**. + +- `at BEFORE` is by default 0 and is not required. Using this you can define the end of the + lookup. So data will be evaluated between `AFTER` and `BEFORE`. + +- `every DURATION` sets the updated frequency of the lookup (supports single letter units as + above too). + +- `OPTIONS` is a space separated list of `percentage`, `absolute`, `min2max`, `unaligned`, + `match-ids`, `match-names`. Check the badges documentation for more info. + +- `of DIMENSIONS` is optional and has to be the last parameter. Dimensions have to be separated + by `,` or `|`. The space characters found in dimensions will be kept as-is (a few dimensions + have spaces in their names). This accepts netdata simple patterns and the `match-ids` and + `match-names` options affect the searches for dimensions. + +The result of the lookup will be available as `$this` and `$NAME` in expressions. +The timestamps of the timeframe evaluated by the database lookup is available as variables +`$after` and `$before` (both are unix timestamps). + +--- + +#### Alarm line `calc` + +This expression is evaluated just after the `lookup` (if any). Its purpose is to apply some +calculation before using the value looked up from the db. + +You can also have an expression without a lookup, using other variables that are available. + +The result of the calculation will be available as `$this` in warning and critical expressions +(overwriting the `lookup` one). + +Format: + +``` +calc: EXPRESSION +``` + +Check [Expressions](#expressions) for more information. 
+ +--- + +#### Alarm line `every` + +Sets the update frequency of this alarm. This is the same to the `every DURATION` given +in the `lookup` lines. + +Format: + +``` +every: DURATION +``` + +`DURATION` accepts `s` for seconds, `m` is minutes, `h` for hours, `d` for days. + +--- + +#### Alarm lines `green` and `red` + +Set the green and red thresholds of a chart. Both are available as `$green` and `$red` in +expressions. If multiple alarms define different thresholds, the ones defined by the first +alarm will be used. These will eventually visualized on the dashboard, so only one set of +them is allowed. If you need multiple sets of them in different alarms, use absolute numbers +instead of `$red` and `$green`. + +Format: + +``` +green: NUMBER +red: NUMBER +``` + +--- + +#### Alarm lines `warn` and `crit` + +These expressions should evaluate to true or false (alternatively non-zero or zero). +They trigger the alarm. Both are optional. + +Format: + +``` +warn: EXPRESSION +crit: EXPRESSION +``` +Check [Expressions](#expressions) for more information. + +--- + +#### Alarm line `to` + +This will be the first parameter of the script to be executed when the alarm switches status. +Its meaning is left up to the `exec` script. + +The default `exec` script, `alarm-notify.sh`, uses this field as a space separated list of roles, +which are then consulted to find the exact recipients per notification method. + +Format: + +``` +to: ROLE1 ROLE2 ROLE3 ... +``` + +--- + +#### Alarm line `exec` + +The script that will be executed when the alarm changes status. + +Format: + +``` +exec: SCRIPT +``` + +The default `SCRIPT` is netdata's `alarm-notify.sh`, which supports all the notifications +methods netdata supports, including custom hooks. + +--- + +#### Alarm line `delay` + +This is used to provide optional hysteresis settings for the notifications, to defend +against notification floods. These settings do not affect the actual alarm - only the time +the `exec` script is executed. 
+ +Format: + +``` +delay: [[[up U] [down D] multiplier M] max X] +``` + +- `up U` defines the delay to be applied to a notification for an alarm that raised its status + (i.e. CLEAR to WARNING, CLEAR to CRITICAL, WARNING to CRITICAL). For example, `up 10s`, the + notification for this event will be sent 10 seconds after the actual event. This is used in + hope the alarm will get back to its previous state within the duration given. The default `U` + is zero. + +- `down D` defines the delay to be applied to a notification for an alarm that moves to lower + state (i.e. CRITICAL to WARNING, CRITICAL to CLEAR, WARNING to CLEAR). For example, `down 1m` + will delay the notification by 1 minute. This is used to prevent notifications for flapping + alarms. The default `D` is zero. + +- `mutliplier M` multiplies `U` and `D` when an alarm changes state, while a notification is + delayed. The default multiplier is `1.0`. + +- `max X` defines the maximum absolute notification delay an alarm may get. The default `X` + is `max(U * M, D * M)` (i.e. the max duration of `U` or `D` multiplied once with `M`). + + Example: + + `delay: up 10s down 15m multiplier 2 max 1h` + + The time is `00:00:00` and the status of the alarm is CLEAR. + + time of event|new status|delay|notification will be sent|why + -------------|----------|:---:|-------------------------|--- + 00:00:01 | WARNING | `up 10s` | 00:00:11 |first state switch + 00:00:05 | CLEAR | `down 15m x2`| 00:30:05 |the alarm changes state while a notification is delayed, so it was multiplied + 00:00:06 | WARNING | `up 10s x2 x2` | 00:00:26 |multiplied twice + 00:00:07|CLEAR|`down 15m x2 x2 x2`|00:45:07|multiplied 3 times. + + So: + - `U` and `D` are multiplied by `M` every time the alarm changes state (any state, not just + their matching one) and a delay is in place. + - All are reset to their defaults when the alarm switches state without a delay in place. 
+ +#### Alarm line `option` + +The only possible value for the `option` line is + +``` +option: no-clear-notification +``` + +For some alarms we need compare two time-frames, to detect anomalies. For example, `health.d/httpcheck.conf` has an alarm template called `web_service_slow` that compares the average http call response time over the last 3 minutes, compared to the average over the last hour. It triggers a warning alarm when the average of the last 3 minutes is twice the average of the last hour. In such cases, it is easy to trigger the alarm, but difficult to tell when the alarm is cleared. As time passes, the newest window moves into the older, so the average response time of the last hour will keep increasing. Eventually, the comparison will find the averages in the two time-frames close enough to clear the alarm. However, the issue was not resolved, it's just a matter of the newer data "polluting" the old. For such alarms, it's a good idea to tell Netdata to not clear the notification, by using the `no-clear-notification` option. + +--- + +### Expressions + +netdata has an internal [infix expression parser](../libnetdata/eval). +This parses expressions and creates an internal structure that allows fast execution of them. + +These operators are supported `+`, `-`, `*`, `/`, `<`, `<=`, `<>`, `!=`, `>`, `>=`, `&&`, `||`, +`!`, `AND`, `OR`, `NOT`. Boolean operators result in either `1` (true) or `0` (false). + +The conditional evaluation operator `?` is supported too. Using this operator IF-THEN-ELSE +conditional statements can be specified. The format is: `(condition) ? (true expression) : +(false expression)`. So, netdata will first evaluate the `condition` and based on the result +will either evaluate `true expression` or `false expression`. +Example: `($this > 0) ? ($avail * 2) : ($used / 2)`. +Nested such expressions are also supported (i.e. `true expression` and `false expression` can +contain conditional evaluations). 
+ +Expressions also support the `abs()` function. + +Expressions can have variables. Variables start with `$`. Check below for more information. + +There are two special values you can use: + +- `nan`, for example `$this != nan` will check if the variable `this` is available. A variable can be `nan` if the database lookup failed. All calculations (i.e. addition, multiplication, etc) with a `nan` result in a `nan`. + +- `inf`, for example `$this != inf` will check if `this` is not infinite. A value or variable can be infinite if divided by zero. All calculations (i.e. addition, multiplication, etc) with a `inf` result in a `inf`. + +--- + +### Special use of the conditional operator + +A common (but not necessarily obvious) use of the conditional evaluation operator is +to provide [hysteresis](https://en.wikipedia.org/wiki/Hysteresis) around the critical +or warning thresholds. This usage helps to avoid bogus messages resulting from small +variations in the value when it is varying regularly but staying close to the threshold +value, without needing to delay sending messages at all. + +An example of such usage from the default CPU usage alarms bundled with netdata is: + +``` +warn: $this > (($status >= $WARNING) ? (75) : (85)) +crit: $this > (($status == $CRITICAL) ? (85) : (95)) +``` + +The above say: +* If the alarm is currently a warning, then the threshold for being considered a warning + is 75, otherwise it's 85. + +* If the alarm is currently critical, then the threshold for being considered critical + is 85, otherwise it's 95. + +Which in turn, results in the following behavior: +* While the value is rising, it will trigger a warning when it exceeds 85, and a critical + alert when it exceeds 95. + +* While the value is falling, it will return to a warning state when it goes below 85, + and a normal state when it goes below 75. 
+ +* If the value is constantly varying between 80 and 90, then it will trigger a warning the + first time it goes above 85, but will remain a warning until it goes below 75 (or goes above 85). + +* If the value is constantly varying between 90 and 100, then it will trigger a critical alert + the first time it goes above 95, but will remain a critical alert goes below 85 (at which + point it will return to being a warning). + +--- + +### Variables + +You can find all the variables that can be used for a given chart, using +`http://your.netdata.ip:19999/api/v1/alarm_variables?chart=CHART_NAME` +Example: [variables for the `system.cpu` chart of the registry](https://registry.my-netdata.io/api/v1/alarm_variables?chart=system.cpu). + +_Hint: If you don't know how to find the CHART_NAME, you can read about it [here](../docs/Charts.md#charts)._ + + +Netdata supports 3 internal indexes for variables that will be used in health monitoring. +<details markdown="1"><summary>The variables below can be used in both chart alarms and context templates.</summary> +Although the `alarm_variables` link shows you variables for a particular chart, the same variables can also be used in templates for charts belonging to the same [context](../docs/Charts.md#contexts). The reason is that all charts of a given contexts are essentially identical, with the only difference being the [family](../docs/Charts.md#families) that identifies a particular hardware or software instance. Charts and templates do not apply to specific families anyway, unless if you explicitly limit an alarm with the [alarm line `families`](#alarm-line-families). +</details> + + - **chart local variables**. All the dimensions of the chart are exposed as local variables. The value of $this for the other configured alarms of the chart also appears, under the name of each configured alarm. 
+ + Charts also define a few special variables: + + - `$last_collected_t` is the unix timestamp of the last data collection + - `$collected_total_raw` is the sum of all the dimensions (their last collected values) + - `$update_every` is the update frequency of the chart + - `$green` and `$red` the threshold defined in alarms (these are per chart - the charts + inherits them from the the first alarm that defined them) + + Chart dimensions define their last calculated (i.e. interpolated) value, exactly as + shown on the charts, but also a variable with their name and suffix `_raw` that resolves + to the last collected value - as collected and another with suffix `_last_collected_t` + that resolves to unix timestamp the dimension was last collected (there may be dimensions + that fail to be collected while others continue normally). + + - **family variables**. Families are used to group charts together. For example all `eth0` + charts, have `family = eth0`. This index includes all local variables, but if there are + overlapping variables, only the first are exposed. + + - **host variables**. All the dimensions of all charts, including all alarms, in fullname. + Fullname is `CHART.VARIABLE`, where `CHART` is either the chart id or the chart name (both + are supported). + + - **special variables*** are: + + - `$this`, which is resolved to the value of the current alarm. + + - `$status`, which is resolved to the current status of the alarm (the current = the last + status, i.e. before the current database lookup and the evaluation of the `calc` line). + This values can be compared with `$REMOVED`, `$UNINITIALIZED`, `$UNDEFINED`, `$CLEAR`, + `$WARNING`, `$CRITICAL`. These values are incremental, ie. `$status > $CLEAR` works as + expected. + + - `$now`, which is resolved to current unix timestamp. 
+ +## Alarm Statuses + +Alarms can have the following statuses: + + - `REMOVED` - the alarm has been deleted (this happens when a SIGUSR2 is sent to netdata + to reload health configuration) + + - `UNINITIALIZED` - the alarm is not initialized yet + + - `UNDEFINED` - the alarm failed to be calculated (i.e. the database lookup failed, + a division by zero occurred, etc) + + - `CLEAR` - the alarm is not armed / raised (i.e. is OK) + + - `WARNING` - the warning expression resulted in true or non-zero + + - `CRITICAL` - the critical expression resulted in true or non-zero + +The external script will be called for all status changes. + +## Examples + +Check the `health/health.d/` directory for all alarms shipped with netdata. + +Here are a few examples: + +### Example 1 + +A simple check if an apache server is alive: + +``` +template: apache_last_collected_secs + on: apache.requests + calc: $now - $last_collected_t + every: 10s + warn: $this > ( 5 * $update_every) + crit: $this > (10 * $update_every) +``` + +The above checks that netdata is able to collect data from apache. In detail: + +``` +template: apache_last_collected_secs +``` + +The above defines a **template** named `apache_last_collected_secs`. +The name is important since `$apache_last_collected_secs` resolves to the `calc` line. +So, try to give something descriptive. + +``` + on: apache.requests +``` + +The above applies the **template** to all charts that have `context = apache.requests` +(i.e. all your apache servers). + +``` + calc: $now - $last_collected_t +``` + +- `$now` is a standard variable that resolves to the current timestamp. + +- `$last_collected_t` is the last data collection timestamp of the chart. + So this calculation gives the number of seconds passed since the last data collection. + +``` + every: 10s +``` + +The alarm will be evaluated every 10 seconds. 
+ +``` + warn: $this > ( 5 * $update_every) + crit: $this > (10 * $update_every) +``` + +If these result in non-zero or true, they trigger the alarm. + +- `$this` refers to the value of this alarm (i.e. the result of the `calc` line. + We could also use `$apache_last_collected_secs`. + +`$update_every` is the update frequency of the chart, in seconds. + +So, the warning condition checks if we have not collected data from apache for 5 +iterations and the critical condition checks for 10 iterations. + +### Example 2 + +Check if any of the disks is critically low on disk space: + +``` +template: disk_full_percent + on: disk.space + calc: $used * 100 / ($avail + $used) + every: 1m + warn: $this > 80 + crit: $this > 95 +``` + +`$used` and `$avail` are the `used` and `avail` chart dimensions as shown on the dashboard. + +So, the `calc` line finds the percentage of used space. `$this` resolves to this percentage. + +### Example 3 + +Predict if any disk will run out of space in the near future. + +We do this in 2 steps: + +Calculate the disk fill rate: + +``` + template: disk_fill_rate + on: disk.space + lookup: max -1s at -30m unaligned of avail + calc: ($this - $avail) / (30 * 60) + every: 15s +``` + +In the `calc` line: `$this` is the result of the `lookup` line (i.e. the free space 30 minutes +ago) and `$avail` is the current disk free space. So the `calc` line will either have a positive +number of GB/second if the disk if filling up, or a negative number of GB/second if the disk is +freeing up space. + +There is no `warn` or `crit` lines here. So, this template will just do the calculation and +nothing more. + +Predict the hours after which the disk will run out of space: + +``` + template: disk_full_after_hours + on: disk.space + calc: $avail / $disk_fill_rate / 3600 + every: 10s + warn: $this > 0 and $this < 48 + crit: $this > 0 and $this < 24 +``` + +The `calc` line estimates the time in hours, we will run out of disk space. 
Of course, only +positive values are interesting for this check, so the warning and critical conditions check +for positive values and that we have enough free space for 48 and 24 hours respectively. + +Once this alarm triggers we will receive an email like this: + +![image](https://cloud.githubusercontent.com/assets/2662304/17839993/87872b32-6802-11e6-8e08-b2e4afef93bb.png) + +### Example 4 + +Check if any network interface is dropping packets: + +``` +template: 30min_packet_drops + on: net.drops + lookup: sum -30m unaligned absolute + every: 10s + crit: $this > 0 +``` + +The `lookup` line will calculate the sum of the all dropped packets in the last 30 minutes. + +The `crit` line will issue a critical alarm if even a single packet has been dropped. + +Note that the drops chart does not exist if a network interface has never dropped a single packet. +When netdata detects a dropped packet, it will add the chart and it will automatically attach this +alarm to it. + +## Troubleshooting + +You can compile netdata with [debugging](../daemon#debugging) and then set in `netdata.conf`: + +``` +[global] + debug flags = 0x0000000000800000 +``` + +Then check your `/var/log/netdata/debug.log`. It will show you how it works. +Important: this will generate a lot of output in debug.log. + +You can find the context of charts by looking up the chart in either +`http://your.netdata:19999/netdata.conf` or `http://your.netdata:19999/api/v1/charts`. + +You can find how netdata interpreted the expressions by examining the alarm at `http://your.netdata:19999/api/v1/alarms?all`. For each expression, netdata will return the expression as given in its config file, and the same expression with additional parentheses added to indicate the evaluation flow of the expression. + +## Disabling health checks or silencing notifications at runtime + +The health checks can be controlled at runtime via the [health management api](../web/api/health/#health-management-api). 
+ +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]() + + + diff --git a/health/health.c b/health/health.c new file mode 100644 index 0000000..f92a1ba --- /dev/null +++ b/health/health.c @@ -0,0 +1,816 @@ +// SPDX-License-Identifier: GPL-3.0-or-later + +#include "health.h" + +struct health_cmdapi_thread_status { + int status; + ; + struct rusage rusage; +}; + +unsigned int default_health_enabled = 1; + +// ---------------------------------------------------------------------------- +// health initialization + +inline char *health_user_config_dir(void) { + char buffer[FILENAME_MAX + 1]; + snprintfz(buffer, FILENAME_MAX, "%s/health.d", netdata_configured_user_config_dir); + return config_get(CONFIG_SECTION_HEALTH, "health configuration directory", buffer); +} + +inline char *health_stock_config_dir(void) { + char buffer[FILENAME_MAX + 1]; + snprintfz(buffer, FILENAME_MAX, "%s/health.d", netdata_configured_stock_config_dir); + return config_get(CONFIG_SECTION_HEALTH, "stock health configuration directory", buffer); +} + +void health_init(void) { + debug(D_HEALTH, "Health configuration initializing"); + + if(!(default_health_enabled = (unsigned int)config_get_boolean(CONFIG_SECTION_HEALTH, "enabled", default_health_enabled))) { + debug(D_HEALTH, "Health is disabled."); + return; + } +} + +// ---------------------------------------------------------------------------- +// re-load health configuration + +void health_reload_host(RRDHOST *host) { + if(unlikely(!host->health_enabled)) + return; + + char *user_path = health_user_config_dir(); + char *stock_path = health_stock_config_dir(); + + // free all running alarms + rrdhost_wrlock(host); + + while(host->templates) + rrdcalctemplate_unlink_and_free(host, host->templates); + + while(host->alarms) + 
rrdcalc_unlink_and_free(host, host->alarms); + + rrdhost_unlock(host); + + // invalidate all previous entries in the alarm log + ALARM_ENTRY *t; + for(t = host->health_log.alarms ; t ; t = t->next) { + if(t->new_status != RRDCALC_STATUS_REMOVED) + t->flags |= HEALTH_ENTRY_FLAG_UPDATED; + } + + rrdhost_rdlock(host); + // reset all thresholds to all charts + RRDSET *st; + rrdset_foreach_read(st, host) { + st->green = NAN; + st->red = NAN; + } + rrdhost_unlock(host); + + // load the new alarms + rrdhost_wrlock(host); + health_readdir(host, user_path, stock_path, NULL); + + // link the loaded alarms to their charts + rrdset_foreach_write(st, host) { + rrdsetcalc_link_matching(st); + rrdcalctemplate_link_matching(st); + } + + rrdhost_unlock(host); +} + +void health_reload(void) { + + rrd_rdlock(); + + RRDHOST *host; + rrdhost_foreach_read(host) + health_reload_host(host); + + rrd_unlock(); +} + +// ---------------------------------------------------------------------------- +// health main thread and friends + +static inline RRDCALC_STATUS rrdcalc_value2status(calculated_number n) { + if(isnan(n) || isinf(n)) return RRDCALC_STATUS_UNDEFINED; + if(n) return RRDCALC_STATUS_RAISED; + return RRDCALC_STATUS_CLEAR; +} + +#define ALARM_EXEC_COMMAND_LENGTH 8192 + +static inline void health_alarm_execute(RRDHOST *host, ALARM_ENTRY *ae) { + ae->flags |= HEALTH_ENTRY_FLAG_PROCESSED; + + if(unlikely(ae->new_status < RRDCALC_STATUS_CLEAR)) { + // do not send notifications for internal statuses + debug(D_HEALTH, "Health not sending notification for alarm '%s.%s' status %s (internal statuses)", ae->chart, ae->name, rrdcalc_status2string(ae->new_status)); + goto done; + } + + if(unlikely(ae->new_status <= RRDCALC_STATUS_CLEAR && (ae->flags & HEALTH_ENTRY_FLAG_NO_CLEAR_NOTIFICATION))) { + // do not send notifications for disabled statuses + debug(D_HEALTH, "Health not sending notification for alarm '%s.%s' status %s (it has no-clear-notification enabled)", ae->chart, ae->name, 
rrdcalc_status2string(ae->new_status)); + // mark it as run, so that we will send the same alarm if it happens again + goto done; + } + + // find the previous notification for the same alarm + // which we have run the exec script + // exception: alarms with HEALTH_ENTRY_FLAG_NO_CLEAR_NOTIFICATION set + if(likely(!(ae->flags & HEALTH_ENTRY_FLAG_NO_CLEAR_NOTIFICATION))) { + uint32_t id = ae->alarm_id; + ALARM_ENTRY *t; + for(t = ae->next; t ; t = t->next) { + if(t->alarm_id == id && t->flags & HEALTH_ENTRY_FLAG_EXEC_RUN) + break; + } + + if(likely(t)) { + // we have executed this alarm notification in the past + if(t && t->new_status == ae->new_status) { + // don't send the notification for the same status again + debug(D_HEALTH, "Health not sending again notification for alarm '%s.%s' status %s", ae->chart, ae->name + , rrdcalc_status2string(ae->new_status)); + goto done; + } + } + else { + // we have not executed this alarm notification in the past + // so, don't send CLEAR notifications + if(unlikely(ae->new_status == RRDCALC_STATUS_CLEAR)) { + debug(D_HEALTH, "Health not sending notification for first initialization of alarm '%s.%s' status %s" + , ae->chart, ae->name, rrdcalc_status2string(ae->new_status)); + goto done; + } + } + } + + // Check if alarm notifications are silenced + if (ae->flags & HEALTH_ENTRY_FLAG_SILENCED) { + info("Health not sending notification for alarm '%s.%s' status %s (command API has disabled notifications)", ae->chart, ae->name, rrdcalc_status2string(ae->new_status)); + goto done; + } + + static char command_to_run[ALARM_EXEC_COMMAND_LENGTH + 1]; + pid_t command_pid; + + const char *exec = (ae->exec) ? ae->exec : host->health_default_exec; + const char *recipient = (ae->recipient) ? 
ae->recipient : host->health_default_recipient; + + int n_warn=0, n_crit=0; + RRDCALC *rc; + EVAL_EXPRESSION *expr=NULL; + + for(rc = host->alarms; rc ; rc = rc->next) { + if(unlikely(!rc->rrdset || !rc->rrdset->last_collected_time.tv_sec)) + continue; + + if(unlikely(rc->status == RRDCALC_STATUS_WARNING)) { + n_warn++; + if (ae->alarm_id == rc->id) + expr=rc->warning; + } else if (unlikely(rc->status == RRDCALC_STATUS_CRITICAL)) { + n_crit++; + if (ae->alarm_id == rc->id) + expr=rc->critical; + } else if (unlikely(rc->status == RRDCALC_STATUS_CLEAR)) { + if (ae->alarm_id == rc->id) + expr=rc->warning; + } + } + + snprintfz(command_to_run, ALARM_EXEC_COMMAND_LENGTH, "exec %s '%s' '%s' '%u' '%u' '%u' '%lu' '%s' '%s' '%s' '%s' '%s' '" CALCULATED_NUMBER_FORMAT_ZERO "' '" CALCULATED_NUMBER_FORMAT_ZERO "' '%s' '%u' '%u' '%s' '%s' '%s' '%s' '%s' '%s' '%d' '%d'", + exec, + recipient, + host->registry_hostname, + ae->unique_id, + ae->alarm_id, + ae->alarm_event_id, + (unsigned long)ae->when, + ae->name, + ae->chart?ae->chart:"NOCHART", + ae->family?ae->family:"NOFAMILY", + rrdcalc_status2string(ae->new_status), + rrdcalc_status2string(ae->old_status), + ae->new_value, + ae->old_value, + ae->source?ae->source:"UNKNOWN", + (uint32_t)ae->duration, + (uint32_t)ae->non_clear_duration, + ae->units?ae->units:"", + ae->info?ae->info:"", + ae->new_value_string, + ae->old_value_string, + (expr && expr->source)?expr->source:"NOSOURCE", + (expr && expr->error_msg)?buffer_tostring(expr->error_msg):"NOERRMSG", + n_warn, + n_crit + ); + + ae->flags |= HEALTH_ENTRY_FLAG_EXEC_RUN; + ae->exec_run_timestamp = now_realtime_sec(); + + debug(D_HEALTH, "executing command '%s'", command_to_run); + FILE *fp = mypopen(command_to_run, &command_pid); + if(!fp) { + error("HEALTH: Cannot popen(\"%s\", \"r\").", command_to_run); + goto done; + } + debug(D_HEALTH, "HEALTH reading from command (discarding command's output)"); + char buffer[100 + 1]; + while(fgets(buffer, 100, fp) != NULL) ; + 
ae->exec_code = mypclose(fp, command_pid); + debug(D_HEALTH, "done executing command - returned with code %d", ae->exec_code); + + if(ae->exec_code != 0) + ae->flags |= HEALTH_ENTRY_FLAG_EXEC_FAILED; + +done: + health_alarm_log_save(host, ae); +} + +static inline void health_process_notifications(RRDHOST *host, ALARM_ENTRY *ae) { + debug(D_HEALTH, "Health alarm '%s.%s' = " CALCULATED_NUMBER_FORMAT_AUTO " - changed status from %s to %s", + ae->chart?ae->chart:"NOCHART", ae->name, + ae->new_value, + rrdcalc_status2string(ae->old_status), + rrdcalc_status2string(ae->new_status) + ); + + health_alarm_execute(host, ae); +} + +static inline void health_alarm_log_process(RRDHOST *host) { + uint32_t first_waiting = (host->health_log.alarms)?host->health_log.alarms->unique_id:0; + time_t now = now_realtime_sec(); + + netdata_rwlock_rdlock(&host->health_log.alarm_log_rwlock); + + ALARM_ENTRY *ae; + for(ae = host->health_log.alarms; ae && ae->unique_id >= host->health_last_processed_id ; ae = ae->next) { + if(unlikely( + !(ae->flags & HEALTH_ENTRY_FLAG_PROCESSED) && + !(ae->flags & HEALTH_ENTRY_FLAG_UPDATED) + )) { + + if(unlikely(ae->unique_id < first_waiting)) + first_waiting = ae->unique_id; + + if(likely(now >= ae->delay_up_to_timestamp)) + health_process_notifications(host, ae); + } + } + + // remember this for the next iteration + host->health_last_processed_id = first_waiting; + + netdata_rwlock_unlock(&host->health_log.alarm_log_rwlock); + + if(host->health_log.count <= host->health_log.max) + return; + + // cleanup excess entries in the log + netdata_rwlock_wrlock(&host->health_log.alarm_log_rwlock); + + ALARM_ENTRY *last = NULL; + unsigned int count = host->health_log.max * 2 / 3; + for(ae = host->health_log.alarms; ae && count ; count--, last = ae, ae = ae->next) ; + + if(ae && last && last->next == ae) + last->next = NULL; + else + ae = NULL; + + while(ae) { + debug(D_HEALTH, "Health removing alarm log entry with id: %u", ae->unique_id); + + ALARM_ENTRY *t = 
ae->next; + + health_alarm_log_free_one_nochecks_nounlink(ae); + + ae = t; + host->health_log.count--; + } + + netdata_rwlock_unlock(&host->health_log.alarm_log_rwlock); +} + +static inline int rrdcalc_isrunnable(RRDCALC *rc, time_t now, time_t *next_run) { + if(unlikely(!rc->rrdset)) { + debug(D_HEALTH, "Health not running alarm '%s.%s'. It is not linked to a chart.", rc->chart?rc->chart:"NOCHART", rc->name); + return 0; + } + + if(unlikely(rc->next_update > now)) { + if (unlikely(*next_run > rc->next_update)) { + // update the next_run time of the main loop + // to run this alarm precisely the time required + *next_run = rc->next_update; + } + + debug(D_HEALTH, "Health not examining alarm '%s.%s' yet (will do in %d secs).", rc->chart?rc->chart:"NOCHART", rc->name, (int) (rc->next_update - now)); + return 0; + } + + if(unlikely(!rc->update_every)) { + debug(D_HEALTH, "Health not running alarm '%s.%s'. It does not have an update frequency", rc->chart?rc->chart:"NOCHART", rc->name); + return 0; + } + + if(unlikely(rrdset_flag_check(rc->rrdset, RRDSET_FLAG_OBSOLETE))) { + debug(D_HEALTH, "Health not running alarm '%s.%s'. The chart has been marked as obsolete", rc->chart?rc->chart:"NOCHART", rc->name); + return 0; + } + + if(unlikely(!rrdset_flag_check(rc->rrdset, RRDSET_FLAG_ENABLED))) { + debug(D_HEALTH, "Health not running alarm '%s.%s'. The chart is not enabled", rc->chart?rc->chart:"NOCHART", rc->name); + return 0; + } + + if(unlikely(!rc->rrdset->last_collected_time.tv_sec || rc->rrdset->counter_done < 2)) { + debug(D_HEALTH, "Health not running alarm '%s.%s'. 
Chart is not fully collected yet.", rc->chart?rc->chart:"NOCHART", rc->name); + return 0; + } + + int update_every = rc->rrdset->update_every; + time_t first = rrdset_first_entry_t(rc->rrdset); + time_t last = rrdset_last_entry_t(rc->rrdset); + + if(unlikely(now + update_every < first /* || now - update_every > last */)) { + debug(D_HEALTH + , "Health not examining alarm '%s.%s' yet (wanted time is out of bounds - we need %lu but got %lu - %lu)." + , rc->chart ? rc->chart : "NOCHART", rc->name, (unsigned long) now, (unsigned long) first + , (unsigned long) last); + return 0; + } + + if(RRDCALC_HAS_DB_LOOKUP(rc)) { + time_t needed = now + rc->before + rc->after; + + if(needed + update_every < first || needed - update_every > last) { + debug(D_HEALTH + , "Health not examining alarm '%s.%s' yet (not enough data yet - we need %lu but got %lu - %lu)." + , rc->chart ? rc->chart : "NOCHART", rc->name, (unsigned long) needed, (unsigned long) first + , (unsigned long) last); + return 0; + } + } + + return 1; +} + +static inline int check_if_resumed_from_suspention(void) { + static usec_t last_realtime = 0, last_monotonic = 0; + usec_t realtime = now_realtime_usec(), monotonic = now_monotonic_usec(); + int ret = 0; + + // detect if monotonic and realtime have twice the difference + // in which case we assume the system was just waken from hibernation + + if(last_realtime && last_monotonic && realtime - last_realtime > 2 * (monotonic - last_monotonic)) + ret = 1; + + last_realtime = realtime; + last_monotonic = monotonic; + + return ret; +} + +static void health_main_cleanup(void *ptr) { + struct netdata_static_thread *static_thread = (struct netdata_static_thread *)ptr; + static_thread->enabled = NETDATA_MAIN_THREAD_EXITING; + + info("cleaning up..."); + + static_thread->enabled = NETDATA_MAIN_THREAD_EXITED; +} + +SILENCE_TYPE check_silenced(RRDCALC *rc, char* host, SILENCERS *silencers) { + SILENCER *s; + debug(D_HEALTH, "Checking if alarm was silenced via the command API. 
Alarm info name:%s context:%s chart:%s host:%s family:%s", + rc->name, (rc->rrdset)?rc->rrdset->context:"", rc->chart, host, (rc->rrdset)?rc->rrdset->family:""); + + for (s = silencers->silencers; s!=NULL; s=s->next){ + if ( + (!s->alarms_pattern || (rc->name && s->alarms_pattern && simple_pattern_matches(s->alarms_pattern,rc->name))) && + (!s->contexts_pattern || (rc->rrdset && rc->rrdset->context && s->contexts_pattern && simple_pattern_matches(s->contexts_pattern,rc->rrdset->context))) && + (!s->hosts_pattern || (host && s->hosts_pattern && simple_pattern_matches(s->hosts_pattern,host))) && + (!s->charts_pattern || (rc->chart && s->charts_pattern && simple_pattern_matches(s->charts_pattern,rc->chart))) && + (!s->families_pattern || (rc->rrdset && rc->rrdset->family && s->families_pattern && simple_pattern_matches(s->families_pattern,rc->rrdset->family))) + ) { + debug(D_HEALTH, "Alarm matches command API silence entry %s:%s:%s:%s:%s", s->alarms,s->charts, s->contexts, s->hosts, s->families); + if (unlikely(silencers->stype == STYPE_NONE)) { + debug(D_HEALTH, "Alarm %s matched a silence entry, but no SILENCE or DISABLE command was issued via the command API. 
The match has no effect.", rc->name); + } else { + debug(D_HEALTH, "Alarm %s via the command API - name:%s context:%s chart:%s host:%s family:%s" + , (silencers->stype==STYPE_DISABLE_ALARMS)?"Disabled":"Silenced" + , rc->name + , (rc->rrdset)?rc->rrdset->context:"" + , rc->chart + , host + , (rc->rrdset)?rc->rrdset->family:"" + ); + } + return silencers->stype; + } + } + return STYPE_NONE; +} + +int update_disabled_silenced(RRDHOST *host, RRDCALC *rc) { + uint32_t rrdcalc_flags_old = rc->rrdcalc_flags; + // Clear the flags + rc->rrdcalc_flags &= ~(RRDCALC_FLAG_DISABLED | RRDCALC_FLAG_SILENCED); + if (unlikely(silencers->all_alarms)) { + if (silencers->stype == STYPE_DISABLE_ALARMS) rc->rrdcalc_flags |= RRDCALC_FLAG_DISABLED; + else if (silencers->stype == STYPE_SILENCE_NOTIFICATIONS) rc->rrdcalc_flags |= RRDCALC_FLAG_SILENCED; + } else { + SILENCE_TYPE st = check_silenced(rc, host->hostname, silencers); + if (st == STYPE_DISABLE_ALARMS) rc->rrdcalc_flags |= RRDCALC_FLAG_DISABLED; + else if (st == STYPE_SILENCE_NOTIFICATIONS) rc->rrdcalc_flags |= RRDCALC_FLAG_SILENCED; + } + + if (rrdcalc_flags_old != rc->rrdcalc_flags) { + info("Alarm silencing changed for host '%s' alarm '%s': Disabled %s->%s Silenced %s->%s", + host->hostname, + rc->name, + (rrdcalc_flags_old & RRDCALC_FLAG_DISABLED)?"true":"false", + (rc->rrdcalc_flags & RRDCALC_FLAG_DISABLED)?"true":"false", + (rrdcalc_flags_old & RRDCALC_FLAG_SILENCED)?"true":"false", + (rc->rrdcalc_flags & RRDCALC_FLAG_SILENCED)?"true":"false" + ); + } + if (rc->rrdcalc_flags & RRDCALC_FLAG_DISABLED) + return 1; + else + return 0; +} + +void *health_main(void *ptr) { + netdata_thread_cleanup_push(health_main_cleanup, ptr); + + int min_run_every = (int)config_get_number(CONFIG_SECTION_HEALTH, "run at least every seconds", 10); + if(min_run_every < 1) min_run_every = 1; + + time_t now = now_realtime_sec(); + time_t hibernation_delay = config_get_number(CONFIG_SECTION_HEALTH, "postpone alarms during hibernation for seconds", 
60); + + unsigned int loop = 0; + + silencers = mallocz(sizeof(SILENCERS)); + silencers->all_alarms=0; + silencers->stype=STYPE_NONE; + silencers->silencers=NULL; + + while(!netdata_exit) { + loop++; + debug(D_HEALTH, "Health monitoring iteration no %u started", loop); + + int runnable = 0, apply_hibernation_delay = 0; + time_t next_run = now + min_run_every; + RRDCALC *rc; + + if (unlikely(check_if_resumed_from_suspention())) { + apply_hibernation_delay = 1; + + info("Postponing alarm checks for %ld seconds, because it seems that the system was just resumed from suspension.", + hibernation_delay + ); + } + + if (unlikely(silencers->all_alarms && silencers->stype == STYPE_DISABLE_ALARMS)) { + static int logged=0; + if (!logged) { + info("Skipping health checks, because all alarms are disabled via a %s command.", + HEALTH_CMDAPI_CMD_DISABLEALL); + logged = 1; + } + } + + rrd_rdlock(); + + RRDHOST *host; + rrdhost_foreach_read(host) { + if (unlikely(!host->health_enabled)) + continue; + + if (unlikely(apply_hibernation_delay)) { + + info("Postponing health checks for %ld seconds, on host '%s'.", hibernation_delay, host->hostname + ); + + host->health_delay_up_to = now + hibernation_delay; + } + + if (unlikely(host->health_delay_up_to)) { + if (unlikely(now < host->health_delay_up_to)) + continue; + + info("Resuming health checks on host '%s'.", host->hostname); + host->health_delay_up_to = 0; + } + + rrdhost_rdlock(host); + + // the first loop is to lookup values from the db + for (rc = host->alarms; rc; rc = rc->next) { + + if (update_disabled_silenced(host, rc)) + continue; + + if (unlikely(!rrdcalc_isrunnable(rc, now, &next_run))) { + if (unlikely(rc->rrdcalc_flags & RRDCALC_FLAG_RUNNABLE)) + rc->rrdcalc_flags &= ~RRDCALC_FLAG_RUNNABLE; + continue; + } + + runnable++; + rc->old_value = rc->value; + rc->rrdcalc_flags |= RRDCALC_FLAG_RUNNABLE; + + // ------------------------------------------------------------ + // if there is database lookup, do it + + if 
(unlikely(RRDCALC_HAS_DB_LOOKUP(rc))) { + /* time_t old_db_timestamp = rc->db_before; */ + int value_is_null = 0; + + int ret = rrdset2value_api_v1(rc->rrdset, NULL, &rc->value, rc->dimensions, 1, rc->after, + rc->before, rc->group, 0, rc->options, &rc->db_after, + &rc->db_before, &value_is_null + ); + + if (unlikely(ret != 200)) { + // database lookup failed + rc->value = NAN; + rc->rrdcalc_flags |= RRDCALC_FLAG_DB_ERROR; + + debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': database lookup returned error %d", + host->hostname, rc->chart ? rc->chart : "NOCHART", rc->name, ret + ); + } else + rc->rrdcalc_flags &= ~RRDCALC_FLAG_DB_ERROR; + + /* - RRDCALC_FLAG_DB_STALE not currently used + if (unlikely(old_db_timestamp == rc->db_before)) { + // database is stale + + debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': database is stale", host->hostname, rc->chart?rc->chart:"NOCHART", rc->name); + + if (unlikely(!(rc->rrdcalc_flags & RRDCALC_FLAG_DB_STALE))) { + rc->rrdcalc_flags |= RRDCALC_FLAG_DB_STALE; + error("Health on host '%s', alarm '%s.%s': database is stale", host->hostname, rc->chart?rc->chart:"NOCHART", rc->name); + } + } + else if (unlikely(rc->rrdcalc_flags & RRDCALC_FLAG_DB_STALE)) + rc->rrdcalc_flags &= ~RRDCALC_FLAG_DB_STALE; + */ + + if (unlikely(value_is_null)) { + // collected value is null + rc->value = NAN; + rc->rrdcalc_flags |= RRDCALC_FLAG_DB_NAN; + + debug(D_HEALTH, + "Health on host '%s', alarm '%s.%s': database lookup returned empty value (possibly value is not collected yet)", + host->hostname, rc->chart ? rc->chart : "NOCHART", rc->name + ); + } else + rc->rrdcalc_flags &= ~RRDCALC_FLAG_DB_NAN; + + debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': database lookup gave value " + CALCULATED_NUMBER_FORMAT, host->hostname, rc->chart ? 
rc->chart : "NOCHART", rc->name, + rc->value + ); + } + + // ------------------------------------------------------------ + // if there is calculation expression, run it + + if (unlikely(rc->calculation)) { + if (unlikely(!expression_evaluate(rc->calculation))) { + // calculation failed + rc->value = NAN; + rc->rrdcalc_flags |= RRDCALC_FLAG_CALC_ERROR; + + debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': expression '%s' failed: %s", + host->hostname, rc->chart ? rc->chart : "NOCHART", rc->name, + rc->calculation->parsed_as, buffer_tostring(rc->calculation->error_msg) + ); + } else { + rc->rrdcalc_flags &= ~RRDCALC_FLAG_CALC_ERROR; + + debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': expression '%s' gave value " + CALCULATED_NUMBER_FORMAT + ": %s (source: %s)", host->hostname, rc->chart ? rc->chart : "NOCHART", rc->name, + rc->calculation->parsed_as, rc->calculation->result, + buffer_tostring(rc->calculation->error_msg), rc->source + ); + + rc->value = rc->calculation->result; + + if (rc->local) rc->local->last_updated = now; + if (rc->family) rc->family->last_updated = now; + if (rc->hostid) rc->hostid->last_updated = now; + if (rc->hostname) rc->hostname->last_updated = now; + } + } + } + + rrdhost_unlock(host); + + if (unlikely(runnable && !netdata_exit)) { + rrdhost_rdlock(host); + + for (rc = host->alarms; rc; rc = rc->next) { + if (unlikely(!(rc->rrdcalc_flags & RRDCALC_FLAG_RUNNABLE))) + continue; + + if (rc->rrdcalc_flags & RRDCALC_FLAG_DISABLED) { + continue; + } + RRDCALC_STATUS warning_status = RRDCALC_STATUS_UNDEFINED; + RRDCALC_STATUS critical_status = RRDCALC_STATUS_UNDEFINED; + + // -------------------------------------------------------- + // check the warning expression + + if (likely(rc->warning)) { + if (unlikely(!expression_evaluate(rc->warning))) { + // calculation failed + rc->rrdcalc_flags |= RRDCALC_FLAG_WARN_ERROR; + + debug(D_HEALTH, + "Health on host '%s', alarm '%s.%s': warning expression failed with error: %s", + host->hostname, 
rc->chart ? rc->chart : "NOCHART", rc->name, + buffer_tostring(rc->warning->error_msg) + ); + } else { + rc->rrdcalc_flags &= ~RRDCALC_FLAG_WARN_ERROR; + debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': warning expression gave value " + CALCULATED_NUMBER_FORMAT + ": %s (source: %s)", host->hostname, rc->chart ? rc->chart : "NOCHART", + rc->name, rc->warning->result, buffer_tostring(rc->warning->error_msg), rc->source + ); + warning_status = rrdcalc_value2status(rc->warning->result); + } + } + + // -------------------------------------------------------- + // check the critical expression + + if (likely(rc->critical)) { + if (unlikely(!expression_evaluate(rc->critical))) { + // calculation failed + rc->rrdcalc_flags |= RRDCALC_FLAG_CRIT_ERROR; + + debug(D_HEALTH, + "Health on host '%s', alarm '%s.%s': critical expression failed with error: %s", + host->hostname, rc->chart ? rc->chart : "NOCHART", rc->name, + buffer_tostring(rc->critical->error_msg) + ); + } else { + rc->rrdcalc_flags &= ~RRDCALC_FLAG_CRIT_ERROR; + debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': critical expression gave value " + CALCULATED_NUMBER_FORMAT + ": %s (source: %s)", host->hostname, rc->chart ? 
rc->chart : "NOCHART", + rc->name, rc->critical->result, buffer_tostring(rc->critical->error_msg), + rc->source + ); + critical_status = rrdcalc_value2status(rc->critical->result); + } + } + + // -------------------------------------------------------- + // decide the final alarm status + + RRDCALC_STATUS status = RRDCALC_STATUS_UNDEFINED; + + switch (warning_status) { + case RRDCALC_STATUS_CLEAR: + status = RRDCALC_STATUS_CLEAR; + break; + + case RRDCALC_STATUS_RAISED: + status = RRDCALC_STATUS_WARNING; + break; + + default: + break; + } + + switch (critical_status) { + case RRDCALC_STATUS_CLEAR: + if (status == RRDCALC_STATUS_UNDEFINED) + status = RRDCALC_STATUS_CLEAR; + break; + + case RRDCALC_STATUS_RAISED: + status = RRDCALC_STATUS_CRITICAL; + break; + + default: + break; + } + + // -------------------------------------------------------- + // check if the new status and the old differ + + if (status != rc->status) { + int delay = 0; + + // apply trigger hysteresis + + if (now > rc->delay_up_to_timestamp) { + rc->delay_up_current = rc->delay_up_duration; + rc->delay_down_current = rc->delay_down_duration; + rc->delay_last = 0; + rc->delay_up_to_timestamp = 0; + } else { + rc->delay_up_current = (int) (rc->delay_up_current * rc->delay_multiplier); + if (rc->delay_up_current > rc->delay_max_duration) + rc->delay_up_current = rc->delay_max_duration; + + rc->delay_down_current = (int) (rc->delay_down_current * rc->delay_multiplier); + if (rc->delay_down_current > rc->delay_max_duration) + rc->delay_down_current = rc->delay_max_duration; + } + + if (status > rc->status) + delay = rc->delay_up_current; + else + delay = rc->delay_down_current; + + // COMMENTED: because we do need to send raising alarms + // if(now + delay < rc->delay_up_to_timestamp) + // delay = (int)(rc->delay_up_to_timestamp - now); + + rc->delay_last = delay; + rc->delay_up_to_timestamp = now + delay; + + health_alarm_log( + host, rc->id, rc->next_event_id++, now, rc->name, rc->rrdset->id, + 
rc->rrdset->family, rc->exec, rc->recipient, now - rc->last_status_change, + rc->old_value, rc->value, rc->status, status, rc->source, rc->units, rc->info, + rc->delay_last, + ( + ((rc->options & RRDCALC_FLAG_NO_CLEAR_NOTIFICATION)? HEALTH_ENTRY_FLAG_NO_CLEAR_NOTIFICATION : 0) | + ((rc->rrdcalc_flags & RRDCALC_FLAG_SILENCED)? HEALTH_ENTRY_FLAG_SILENCED : 0) + ) + + ); + + rc->last_status_change = now; + rc->status = status; + } + + rc->last_updated = now; + rc->next_update = now + rc->update_every; + + if (next_run > rc->next_update) + next_run = rc->next_update; + } + + rrdhost_unlock(host); + } + + if (unlikely(netdata_exit)) + break; + + // execute notifications + // and cleanup + health_alarm_log_process(host); + + if (unlikely(netdata_exit)) + break; + + } /* rrdhost_foreach */ + + rrd_unlock(); + + + if(unlikely(netdata_exit)) + break; + + now = now_realtime_sec(); + if(now < next_run) { + debug(D_HEALTH, "Health monitoring iteration no %u done. Next iteration in %d secs", loop, (int) (next_run - now)); + sleep_usec(USEC_PER_SEC * (usec_t) (next_run - now)); + now = now_realtime_sec(); + } + else + debug(D_HEALTH, "Health monitoring iteration no %u done. 
Next iteration now", loop); + + } // forever + + netdata_thread_cleanup_pop(1); + return NULL; +} diff --git a/health/health.d/adaptec_raid.conf b/health/health.d/adaptec_raid.conf new file mode 100644 index 0000000..a1301ce --- /dev/null +++ b/health/health.d/adaptec_raid.conf @@ -0,0 +1,24 @@ + +# logical device status check + +template: adapter_raid_ld_status + on: adapter_raid.ld_status + lookup: max -5s + units: bool + every: 10s + crit: $this > 0 + delay: down 5m multiplier 1.5 max 1h + info: at least 1 logical device is failed or degraded + to: sysadmin + +# physical device state check + +template: adapter_raid_pd_state + on: adapter_raid.pd_state + lookup: max -5s + units: bool + every: 10s + crit: $this > 0 + delay: down 5m multiplier 1.5 max 1h + info: at least 1 physical device is not in online state + to: sysadmin diff --git a/health/health.d/apache.conf b/health/health.d/apache.conf new file mode 100644 index 0000000..0c98b87 --- /dev/null +++ b/health/health.d/apache.conf @@ -0,0 +1,14 @@ + +# make sure apache is running + +template: apache_last_collected_secs + on: apache.requests + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: webmaster + diff --git a/health/health.d/apcupsd.conf b/health/health.d/apcupsd.conf new file mode 100644 index 0000000..4f86037 --- /dev/null +++ b/health/health.d/apcupsd.conf @@ -0,0 +1,40 @@ +# you can disable an alarm notification by setting the 'to' line to: silent + +template: 10min_ups_load + on: apcupsd.load + os: * + hosts: * + lookup: average -10m unaligned of percentage + units: % + every: 1m + warn: $this > (($status >= $WARNING) ? (70) : (80)) + crit: $this > (($status == $CRITICAL) ? 
(85) : (95)) + delay: down 10m multiplier 1.5 max 1h + info: average UPS load for the last 10 minutes + to: sitemgr + +# Discussion in https://github.com/netdata/netdata/pull/3928: +# Fire the alarm as soon as it's going on battery (99% charge) and clear only when full. +template: ups_charge + on: apcupsd.charge + os: * + hosts: * + lookup: average -60s unaligned of charge + units: % + every: 60s + warn: $this < 100 + crit: $this < (($status == $CRITICAL) ? (60) : (50)) + delay: down 10m multiplier 1.5 max 1h + info: current UPS charge, averaged over the last 60 seconds to reduce measurement errors + to: sitemgr + +template: apcupsd_last_collected_secs + on: apcupsd.load + calc: $now - $last_collected_t + every: 10s + units: seconds ago + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: sitemgr diff --git a/health/health.d/backend.conf b/health/health.d/backend.conf new file mode 100644 index 0000000..7af100d --- /dev/null +++ b/health/health.d/backend.conf @@ -0,0 +1,45 @@ + +# make sure we are sending data to backend + + alarm: backend_last_buffering + on: netdata.backend_metrics + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? 
($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful buffering of backend data + to: dba + + alarm: backend_metrics_sent + on: netdata.backend_metrics + units: % + calc: abs($sent) * 100 / abs($buffered) + every: 10s + warn: $this != 100 + delay: down 5m multiplier 1.5 max 1h + info: percentage of metrics sent to the backend server + to: dba + + alarm: backend_metrics_lost + on: netdata.backend_metrics + units: metrics + calc: abs($lost) + every: 10s + crit: ($this != 0) || ($status == $CRITICAL && abs($sent) == 0) + delay: down 5m multiplier 1.5 max 1h + info: number of metrics lost due to repeating failures to contact the backend server + to: dba + +# this chart has been removed from netdata +# alarm: backend_slow +# on: netdata.backend_latency +# units: % +# calc: $latency * 100 / ($update_every * 1000) +# every: 10s +# warn: $this > 50 +# crit: $this > 100 +# delay: down 5m multiplier 1.5 max 1h +# info: the percentage of time between iterations needed by the backend time to process the data sent by netdata +# to: dba diff --git a/health/health.d/bcache.conf b/health/health.d/bcache.conf new file mode 100644 index 0000000..f0da9ac --- /dev/null +++ b/health/health.d/bcache.conf @@ -0,0 +1,22 @@ + +template: bcache_cache_errors + on: disk.bcache_cache_read_races + lookup: sum -10m unaligned absolute + units: errors + every: 1m + warn: $this > 0 + crit: $this > ( ($status >= $CRITICAL) ? (0) : (10) ) + delay: down 1h multiplier 1.5 max 2h + info: the number of times bcache had issues using the cache, during the last 10 mins (this usually means your SSD cache is failing) + to: sysadmin + +template: bcache_cache_dirty + on: disk.bcache_cache_alloc + calc: $dirty + $metadata + $undefined + units: % + every: 1m + warn: $this > ( ($status >= $WARNING ) ? ( 70 ) : ( 90 ) ) + crit: $this > ( ($status >= $CRITICAL) ? 
( 90 ) : ( 95 ) ) + delay: up 1m down 1h multiplier 1.5 max 2h + info: the percentage of cache space used for dirty and metadata (this usually means your SSD cache is too small) + to: sysadmin diff --git a/health/health.d/beanstalkd.conf b/health/health.d/beanstalkd.conf new file mode 100644 index 0000000..30dc273 --- /dev/null +++ b/health/health.d/beanstalkd.conf @@ -0,0 +1,36 @@ +# get the number of buried jobs in all queues + +template: server_buried_jobs + on: beanstalk.current_jobs + calc: $buried + units: jobs + every: 10s + warn: $this > 0 + crit: $this > 10 + delay: up 0 down 5m multiplier 1.2 max 1h + info: the number of buried jobs aggregated across all tubes + to: sysadmin + +# get the number of buried jobs per queue + +#template: tube_buried_jobs +# on: beanstalk.jobs +# calc: $buried +# units: jobs +# every: 10s +# warn: $this > 0 +# crit: $this > 10 +# delay: up 0 down 5m multiplier 1.2 max 1h +# info: the number of jobs buried per tube +# to: sysadmin + +# get the current number of tubes + +#template: number_of_tubes +# on: beanstalk.current_tubes +# calc: $tubes +# every: 10s +# warn: $this < 5 +# delay: up 0 down 5m multiplier 1.2 max 1h +# info: the current number of tubes on the server +# to: sysadmin diff --git a/health/health.d/bind_rndc.conf b/health/health.d/bind_rndc.conf new file mode 100644 index 0000000..4145e77 --- /dev/null +++ b/health/health.d/bind_rndc.conf @@ -0,0 +1,9 @@ + template: bind_rndc_stats_file_size + on: bind_rndc.stats_size + units: megabytes + every: 60 + calc: $stats_size + warn: $this > 512 + crit: $this > 1024 + info: Bind stats file is very large! Consider to create logrotate conf file for it! + to: sysadmin diff --git a/health/health.d/boinc.conf b/health/health.d/boinc.conf new file mode 100644 index 0000000..43c588d --- /dev/null +++ b/health/health.d/boinc.conf @@ -0,0 +1,62 @@ +# Alarms for various BOINC issues. + +# Warn on any compute errors encountered. 
+template: boinc_compute_errors + on: boinc.states + os: * + hosts: * +families: * + lookup: average -10m unaligned of comperror + units: tasks + every: 1m + warn: $this > 0 + crit: $this > 1 + delay: up 1m down 5m multiplier 1.5 max 1h + info: the total number of compute errors over the past 10 minutes + to: sysadmin + +# Warn on lots of upload errors +template: boinc_upload_errors + on: boinc.states + os: * + hosts: * +families: * + lookup: average -10m unaligned of upload_failed + units: tasks + every: 1m + warn: $this > 0 + crit: $this > 1 + delay: up 1m down 5m multiplier 1.5 max 1h + info: the average number of failed uploads over the past 10 minutes + to: sysadmin + +# Warn on the task queue being empty +template: boinc_total_tasks + on: boinc.tasks + os: * + hosts: * +families: * + lookup: average -10m unaligned of total + units: tasks + every: 1m + warn: $this < 1 + crit: $this < 0.1 + delay: up 5m down 10m multiplier 1.5 max 1h + info: the total number of locally available tasks + to: sysadmin + +# Warn on no active tasks with a non-empty queue +template: boinc_active_tasks + on: boinc.tasks + os: * + hosts: * +families: * + lookup: average -10m unaligned of active + calc: ($boinc_total_tasks >= 1) ? ($this) : (inf) + units: tasks + every: 1m + warn: $this < 1 + crit: $this < 0.1 + delay: up 5m down 10m multiplier 1.5 max 1h + info: the total number of active tasks + to: sysadmin diff --git a/health/health.d/btrfs.conf b/health/health.d/btrfs.conf new file mode 100644 index 0000000..b27aa54 --- /dev/null +++ b/health/health.d/btrfs.conf @@ -0,0 +1,57 @@ + +template: btrfs_allocated + on: btrfs.disk + os: * + hosts: * +families: * + calc: 100 - ($unallocated * 100 / ($unallocated + $data_used + $data_free + $meta_used + $meta_free + $sys_used + $sys_free)) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (90) : (95)) + crit: $this > (($status == $CRITICAL) ? 
(95) : (98)) + delay: up 1m down 15m multiplier 1.5 max 1h + info: the percentage of allocated BTRFS physical disk space + to: sysadmin + +template: btrfs_data + on: btrfs.data + os: * + hosts: * +families: * + calc: $used * 100 / ($used + $free) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (90) : (95)) && $btrfs_allocated > 98 + crit: $this > (($status == $CRITICAL) ? (95) : (98)) && $btrfs_allocated > 98 + delay: up 1m down 15m multiplier 1.5 max 1h + info: the percentage of used BTRFS data space + to: sysadmin + +template: btrfs_metadata + on: btrfs.metadata + os: * + hosts: * +families: * + calc: ($used + $reserved) * 100 / ($used + $free + $reserved) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (90) : (95)) && $btrfs_allocated > 98 + crit: $this > (($status == $CRITICAL) ? (95) : (98)) && $btrfs_allocated > 98 + delay: up 1m down 15m multiplier 1.5 max 1h + info: the percentage of used BTRFS metadata space + to: sysadmin + +template: btrfs_system + on: btrfs.system + os: * + hosts: * +families: * + calc: $used * 100 / ($used + $free) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (90) : (95)) && $btrfs_allocated > 98 + crit: $this > (($status == $CRITICAL) ? 
(95) : (98)) && $btrfs_allocated > 98 + delay: up 1m down 15m multiplier 1.5 max 1h + info: the percentage of used BTRFS system space + to: sysadmin + diff --git a/health/health.d/ceph.conf b/health/health.d/ceph.conf new file mode 100644 index 0000000..de16f7b --- /dev/null +++ b/health/health.d/ceph.conf @@ -0,0 +1,13 @@ +# low ceph disk available + +template: cluster_space_usage + on: ceph.general_usage + calc: $avail * 100 / ($avail + $used) + units: % + every: 10s + warn: $this < 10 + crit: $this < 1 + delay: down 5m multiplier 1.2 max 1h + info: ceph disk usage is almost full + to: sysadmin + diff --git a/health/health.d/couchdb.conf b/health/health.d/couchdb.conf new file mode 100644 index 0000000..4a28952 --- /dev/null +++ b/health/health.d/couchdb.conf @@ -0,0 +1,13 @@ + +# make sure couchdb is running + +template: couchdb_last_collected_secs + on: couchdb.request_methods + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: dba diff --git a/health/health.d/cpu.conf b/health/health.d/cpu.conf new file mode 100644 index 0000000..fa81898 --- /dev/null +++ b/health/health.d/cpu.conf @@ -0,0 +1,55 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + +template: 10min_cpu_usage + on: system.cpu + os: linux + hosts: * + lookup: average -10m unaligned of user,system,softirq,irq,guest + units: % + every: 1m + warn: $this > (($status >= $WARNING) ? (75) : (85)) + crit: $this > (($status == $CRITICAL) ? 
(85) : (95)) + delay: down 15m multiplier 1.5 max 1h + info: average cpu utilization for the last 10 minutes (excluding iowait, nice and steal) + to: sysadmin + +template: 10min_cpu_iowait + on: system.cpu + os: linux + hosts: * + lookup: average -10m unaligned of iowait + units: % + every: 1m + warn: $this > (($status >= $WARNING) ? (20) : (40)) + crit: $this > (($status == $CRITICAL) ? (40) : (50)) + delay: down 15m multiplier 1.5 max 1h + info: average CPU wait I/O for the last 10 minutes + to: sysadmin + +template: 20min_steal_cpu + on: system.cpu + os: linux + hosts: * + lookup: average -20m unaligned of steal + units: % + every: 5m + warn: $this > (($status >= $WARNING) ? (5) : (10)) + crit: $this > (($status == $CRITICAL) ? (20) : (30)) + delay: down 1h multiplier 1.5 max 2h + info: average CPU steal time for the last 20 minutes + to: sysadmin + +## FreeBSD +template: 10min_cpu_usage + on: system.cpu + os: freebsd + hosts: * + lookup: average -10m unaligned of user,system,interrupt + units: % + every: 1m + warn: $this > (($status >= $WARNING) ? (75) : (85)) + crit: $this > (($status == $CRITICAL) ? (85) : (95)) + delay: down 15m multiplier 1.5 max 1h + info: average cpu utilization for the last 10 minutes (excluding nice) + to: sysadmin diff --git a/health/health.d/disks.conf b/health/health.d/disks.conf new file mode 100644 index 0000000..26f8584 --- /dev/null +++ b/health/health.d/disks.conf @@ -0,0 +1,167 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + + +# ----------------------------------------------------------------------------- +# low disk space + +# checking the latest collected values +# raise an alarm if the disk is low on +# available disk space + +template: disk_space_usage + on: disk.space + os: linux freebsd + hosts: * +families: * + calc: $used * 100 / ($avail + $used) + units: % + every: 1m + warn: $this > (($status >= $WARNING ) ? (80) : (90)) + crit: $this > (($status == $CRITICAL) ? 
(90) : (98)) + delay: up 1m down 15m multiplier 1.5 max 1h + info: current disk space usage + to: sysadmin + +template: disk_inode_usage + on: disk.inodes + os: linux freebsd + hosts: * +families: * + calc: $used * 100 / ($avail + $used) + units: % + every: 1m + warn: $this > (($status >= $WARNING) ? (80) : (90)) + crit: $this > (($status == $CRITICAL) ? (90) : (98)) + delay: up 1m down 15m multiplier 1.5 max 1h + info: current disk inode usage + to: sysadmin + + +# ----------------------------------------------------------------------------- +# disk fill rate + +# calculate the rate the disk fills +# use as base, the available space change +# during the last hour + +# this is just a calculation - it has no alarm +# we will use it in the next template to find +# the hours remaining + +template: disk_fill_rate + on: disk.space + os: linux freebsd + hosts: * +families: * + lookup: min -10m at -50m unaligned of avail + calc: ($this - $avail) / (($now - $after) / 3600) + every: 1m + units: GB/hour + info: average rate the disk fills up (positive), or frees up (negative) space, for the last hour + + +# calculate the hours remaining +# if the disk continues to fill +# in this rate + +template: out_of_disk_space_time + on: disk.space + os: linux freebsd + hosts: * +families: * + calc: ($disk_fill_rate > 0) ? ($avail / $disk_fill_rate) : (inf) + units: hours + every: 10s + warn: $this > 0 and $this < (($status >= $WARNING) ? (48) : (8)) + crit: $this > 0 and $this < (($status == $CRITICAL) ? 
(24) : (2)) + delay: down 15m multiplier 1.2 max 1h + info: estimated time the disk will run out of space, if the system continues to add data with the rate of the last hour + to: sysadmin + + +# ----------------------------------------------------------------------------- +# disk inode fill rate + +# calculate the rate the disk inodes are allocated +# use as base, the available inodes change +# during the last hour + +# this is just a calculation - it has no alarm +# we will use it in the next template to find +# the hours remaining + +template: disk_inode_rate + on: disk.inodes + os: linux freebsd + hosts: * +families: * + lookup: min -10m at -50m unaligned of avail + calc: ($this - $avail) / (($now - $after) / 3600) + every: 1m + units: inodes/hour + info: average rate at which disk inodes are allocated (positive), or freed (negative), for the last hour + +# calculate the hours remaining +# if the disk inodes are allocated +# in this rate + +template: out_of_disk_inodes_time + on: disk.inodes + os: linux freebsd + hosts: * +families: * + calc: ($disk_inode_rate > 0) ? ($avail / $disk_inode_rate) : (inf) + units: hours + every: 10s + warn: $this > 0 and $this < (($status >= $WARNING) ? (48) : (8)) + crit: $this > 0 and $this < (($status == $CRITICAL) ? (24) : (2)) + delay: down 15m multiplier 1.2 max 1h + info: estimated time the disk will run out of inodes, if the system continues to allocate inodes with the rate of the last hour + to: sysadmin + + +# ----------------------------------------------------------------------------- +# disk congestion + +# raise an alarm if the disk is congested +# by calculating the average disk utilization +# for the last 10 minutes + +template: 10min_disk_utilization + on: disk.util + os: linux freebsd + hosts: * +families: * + lookup: average -10m unaligned + units: % + every: 1m + green: 90 + red: 98 + warn: $this > $green * (($status >= $WARNING) ? (0.7) : (1)) + crit: $this > $red * (($status == $CRITICAL) ? 
(0.7) : (1)) + delay: down 15m multiplier 1.2 max 1h + info: the percentage of time the disk was busy, during the last 10 minutes + to: sysadmin + + +# raise an alarm if the disk backlog +# is above 1000ms (1s) per second +# for 10 minutes +# (i.e. the disk cannot catch up) + +template: 10min_disk_backlog + on: disk.backlog + os: linux + hosts: * +families: * + lookup: average -10m unaligned + units: ms + every: 1m + green: 2000 + red: 5000 + warn: $this > $green * (($status >= $WARNING) ? (0.7) : (1)) + crit: $this > $red * (($status == $CRITICAL) ? (0.7) : (1)) + delay: down 15m multiplier 1.2 max 1h + info: average of the kernel estimated disk backlog, for the last 10 minutes + to: sysadmin diff --git a/health/health.d/dockerd.conf b/health/health.d/dockerd.conf new file mode 100644 index 0000000..729906c --- /dev/null +++ b/health/health.d/dockerd.conf @@ -0,0 +1,8 @@ +template: docker_unhealthy_containers + on: docker.unhealthy_containers + units: unhealthy containers + every: 10s + lookup: average -10s + crit: $this > 0 + info: number of unhealthy containers + to: sysadmin diff --git a/health/health.d/elasticsearch.conf b/health/health.d/elasticsearch.conf new file mode 100644 index 0000000..dffd409 --- /dev/null +++ b/health/health.d/elasticsearch.conf @@ -0,0 +1,9 @@ + alarm: elasticsearch_last_collected + on: elasticsearch_local.cluster_health_status + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? 
($update_every) : (60 * $update_every)) + info: number of seconds since the last successful data collection + to: sysadmin diff --git a/health/health.d/entropy.conf b/health/health.d/entropy.conf new file mode 100644 index 0000000..66d44ec --- /dev/null +++ b/health/health.d/entropy.conf @@ -0,0 +1,16 @@ + +# check if entropy is too low +# the alarm is checked every 5 minutes +# and examines the last 10 minutes of data + + alarm: lowest_entropy + on: system.entropy + os: linux + hosts: * + lookup: min -10m unaligned + units: entries + every: 5m + warn: $this < (($status >= $WARNING) ? (200) : (100)) + delay: down 1h multiplier 1.5 max 2h + info: minimum entries in the random numbers pool in the last 10 minutes + to: silent diff --git a/health/health.d/fping.conf b/health/health.d/fping.conf new file mode 100644 index 0000000..43658fe --- /dev/null +++ b/health/health.d/fping.conf @@ -0,0 +1,53 @@ + +template: fping_last_collected_secs +families: * + on: fping.latency + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? 
($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: sysadmin + +template: host_reachable +families: * + on: fping.latency + calc: $average != nan + units: up/down + every: 10s + crit: $this == 0 + info: states if the remote host is reachable + delay: down 30m multiplier 1.5 max 2h + to: sysadmin + +template: host_latency +families: * + on: fping.latency + lookup: average -10s unaligned of average + units: ms + every: 10s + green: 500 + red: 1000 + warn: $this > $green OR $max > $red + crit: $this > $red + info: average round trip delay during the last 10 seconds + delay: down 30m multiplier 1.5 max 2h + to: sysadmin + +template: packet_loss +families: * + on: fping.quality + lookup: average -10m unaligned of returned + calc: 100 - $this + green: 1 + red: 10 + units: % + every: 10s + warn: $this > $green + crit: $this > $red + info: packet loss percentage + delay: down 30m multiplier 1.5 max 2h + to: sysadmin + diff --git a/health/health.d/fronius.conf b/health/health.d/fronius.conf new file mode 100644 index 0000000..cdf6c8f --- /dev/null +++ b/health/health.d/fronius.conf @@ -0,0 +1,11 @@ +template: fronius_last_collected_secs +families: * + on: fronius.power + calc: $now - $last_collected_t + every: 10s + units: seconds ago + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? 
($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: sitemgr diff --git a/health/health.d/haproxy.conf b/health/health.d/haproxy.conf new file mode 100644 index 0000000..e49c70d --- /dev/null +++ b/health/health.d/haproxy.conf @@ -0,0 +1,27 @@ +template: haproxy_backend_server_status + on: haproxy_hs.down + units: failed servers + every: 10s + lookup: average -10s + crit: $this > 0 + info: number of failed haproxy backend servers + to: sysadmin + +template: haproxy_backend_status + on: haproxy_hb.down + units: failed backend + every: 10s + lookup: average -10s + crit: $this > 0 + info: number of failed haproxy backends + to: sysadmin + +template: haproxy_last_collected + on: haproxy_hb.down + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + info: number of seconds since the last successful data collection + to: sysadmin diff --git a/health/health.d/httpcheck.conf b/health/health.d/httpcheck.conf new file mode 100644 index 0000000..0ddf35e --- /dev/null +++ b/health/health.d/httpcheck.conf @@ -0,0 +1,99 @@ +template: httpcheck_last_collected_secs +families: * + on: httpcheck.status + calc: $now - $last_collected_t + every: 10s + units: seconds ago + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: sysadmin + +# This is a fast-reacting no-notification alarm ideal for custom dashboards or badges +template: web_service_up +families: * + on: httpcheck.status + lookup: average -1m unaligned percentage of success + calc: ($this < 75) ? 
(0) : ($this) + every: 5s + units: up/down + info: at least 75% verified responses during last 60 seconds, ideal for badges + to: silent + +template: web_service_bad_content +families: * + on: httpcheck.status + lookup: average -5m unaligned percentage of bad_content + every: 10s + units: % + warn: $this >= 10 AND $this < 40 + crit: $this >= 40 + delay: down 5m multiplier 1.5 max 1h + info: average of unexpected http response content during the last 5 minutes + options: no-clear-notification + to: webmaster + +template: web_service_bad_status +families: * + on: httpcheck.status + lookup: average -5m unaligned percentage of bad_status + every: 10s + units: % + warn: $this >= 10 AND $this < 40 + crit: $this >= 40 + delay: down 5m multiplier 1.5 max 1h + info: average of unexpected http status during the last 5 minutes + options: no-clear-notification + to: webmaster + +template: web_service_timeouts +families: * + on: httpcheck.status + lookup: average -5m unaligned percentage of timeout + every: 10s + units: % + info: average of timeouts during the last 5 minutes + +template: no_web_service_connections +families: * + on: httpcheck.status + lookup: average -5m unaligned percentage of no_connection + every: 10s + units: % + info: average of failed requests during the last 5 minutes + +# combined timeout & no connection alarm +template: web_service_unreachable +families: * + on: httpcheck.status + calc: ($no_web_service_connections >= $web_service_timeouts) ? 
($no_web_service_connections) : ($web_service_timeouts) + units: % + every: 10s + warn: ($no_web_service_connections >= 10 OR $web_service_timeouts >= 10) AND ($no_web_service_connections < 40 OR $web_service_timeouts < 40) + crit: $no_web_service_connections >= 40 OR $web_service_timeouts >= 40 + delay: down 5m multiplier 1.5 max 1h + info: average of failed requests either due to timeouts or no connection during the last 5 minutes + options: no-clear-notification + to: webmaster + +template: 1h_web_service_response_time +families: * + on: httpcheck.responsetime + lookup: average -1h unaligned of time + every: 30s + units: ms + info: average response time over the last hour + +template: web_service_slow +families: * + on: httpcheck.responsetime + lookup: average -3m unaligned of time + units: ms + every: 10s + warn: ($this > ($1h_web_service_response_time * 2) ) + crit: ($this > ($1h_web_service_response_time * 3) ) + info: average response time over the last 3 minutes, compared to the average over the last hour + delay: down 5m multiplier 1.5 max 1h + options: no-clear-notification + to: webmaster diff --git a/health/health.d/ipc.conf b/health/health.d/ipc.conf new file mode 100644 index 0000000..989d6e9 --- /dev/null +++ b/health/health.d/ipc.conf @@ -0,0 +1,28 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + + alarm: semaphores_used + on: system.ipc_semaphores + os: linux + hosts: * + calc: $semaphores * 100 / $ipc_semaphores_max + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (70) : (80)) + crit: $this > (($status == $CRITICAL) ? (70) : (90)) + delay: down 5m multiplier 1.5 max 1h + info: the percentage of IPC semaphores used + to: sysadmin + + alarm: semaphore_arrays_used + on: system.ipc_semaphore_arrays + os: linux + hosts: * + calc: $arrays * 100 / $ipc_semaphores_arrays_max + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (70) : (80)) + crit: $this > (($status == $CRITICAL) ? 
(70) : (90)) + delay: down 5m multiplier 1.5 max 1h + info: the percentage of IPC semaphore arrays used + to: sysadmin diff --git a/health/health.d/ipfs.conf b/health/health.d/ipfs.conf new file mode 100644 index 0000000..3f77572 --- /dev/null +++ b/health/health.d/ipfs.conf @@ -0,0 +1,11 @@ + +template: ipfs_datastore_usage + on: ipfs.repo_size + calc: $size * 100 / $avail + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (80) : (90)) + crit: $this > (($status == $CRITICAL) ? (90) : (98)) + delay: down 15m multiplier 1.5 max 1h + info: ipfs Datastore close to running out of space + to: sysadmin diff --git a/health/health.d/ipmi.conf b/health/health.d/ipmi.conf new file mode 100644 index 0000000..c255819 --- /dev/null +++ b/health/health.d/ipmi.conf @@ -0,0 +1,20 @@ + alarm: ipmi_sensors_states + on: ipmi.sensors_states + calc: $warning + $critical + units: sensors + every: 10s + warn: $this > 0 + crit: $critical > 0 + delay: up 5m down 15m multiplier 1.5 max 1h + info: the number of IPMI sensors in non-nominal state + to: sysadmin + + alarm: ipmi_events + on: ipmi.events + calc: $events + units: events + every: 10s + warn: $this > 0 + delay: up 5m down 15m multiplier 1.5 max 1h + info: the number of events in the IPMI System Event Log (SEL) + to: sysadmin diff --git a/health/health.d/isc_dhcpd.conf b/health/health.d/isc_dhcpd.conf new file mode 100644 index 0000000..8054656 --- /dev/null +++ b/health/health.d/isc_dhcpd.conf @@ -0,0 +1,10 @@ + template: isc_dhcpd_leases_size + on: isc_dhcpd.leases_total + units: KB + every: 60 + calc: $leases_size + warn: $this > 3072 + crit: $this > 6144 + delay: up 2m down 5m + info: dhcpd.leases file too big! Module can slow down your server. 
+ to: sysadmin diff --git a/health/health.d/lighttpd.conf b/health/health.d/lighttpd.conf new file mode 100644 index 0000000..915907a --- /dev/null +++ b/health/health.d/lighttpd.conf @@ -0,0 +1,14 @@ + +# make sure lighttpd is running + +template: lighttpd_last_collected_secs + on: lighttpd.requests + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: webmaster + diff --git a/health/health.d/linux_power_supply.conf b/health/health.d/linux_power_supply.conf new file mode 100644 index 0000000..745d2c3 --- /dev/null +++ b/health/health.d/linux_power_supply.conf @@ -0,0 +1,12 @@ +# Alert on low battery capacity. + +template: linux_power_supply_capacity + on: powersupply.capacity + calc: $capacity + units: % + every: 10s + warn: $this < 10 + crit: $this < 5 + delay: up 0 down 5m multiplier 1.2 max 1h + info: the percentage remaining capacity of the power supply + to: sysadmin diff --git a/health/health.d/load.conf b/health/health.d/load.conf new file mode 100644 index 0000000..ee0c54b --- /dev/null +++ b/health/health.d/load.conf @@ -0,0 +1,56 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + +# Calculate the base trigger point for the load average alarms. +# This is the maximum number of CPU's in the system over the past 1 +# minute, with a special case for a single CPU of setting the trigger at 2. + alarm: load_trigger + on: system.load + os: linux + hosts: * + calc: ($active_processors == nan or $active_processors == inf or $active_processors < 2) ? ( 2 ) : ( $active_processors ) + units: cpus + every: 1m + info: trigger point for load average alarms + +# Send alarms if the load average is unusually high. 
+# These intentionally _do not_ calculate the average over the sampled +# time period because the values being checked already are averages. + alarm: load_average_15 + on: system.load + os: linux + hosts: * + lookup: max -1m unaligned of load15 + units: load + every: 1m + warn: $this > (($status >= $WARNING) ? (1.75 * $load_trigger) : (2 * $load_trigger)) + crit: $this > (($status == $CRITICAL) ? (3.5 * $load_trigger) : (4 * $load_trigger)) + delay: down 15m multiplier 1.5 max 1h + info: fifteen-minute load average + to: sysadmin + + alarm: load_average_5 + on: system.load + os: linux + hosts: * + lookup: max -1m unaligned of load5 + units: load + every: 1m + warn: $this > (($status >= $WARNING) ? (3.5 * $load_trigger) : (4 * $load_trigger)) + crit: $this > (($status == $CRITICAL) ? (7 * $load_trigger) : (8 * $load_trigger)) + delay: down 15m multiplier 1.5 max 1h + info: five-minute load average + to: sysadmin + + alarm: load_average_1 + on: system.load + os: linux + hosts: * + lookup: max -1m unaligned of load1 + units: load + every: 1m + warn: $this > (($status >= $WARNING) ? (7 * $load_trigger) : (8 * $load_trigger)) + crit: $this > (($status == $CRITICAL) ? (14 * $load_trigger) : (16 * $load_trigger)) + delay: down 15m multiplier 1.5 max 1h + info: one-minute load average + to: sysadmin diff --git a/health/health.d/mdstat.conf b/health/health.d/mdstat.conf new file mode 100644 index 0000000..a53ec7a --- /dev/null +++ b/health/health.d/mdstat.conf @@ -0,0 +1,37 @@ +template: mdstat_last_collected + on: md.disks + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? 
($update_every) : (60 * $update_every)) + info: number of seconds since the last successful data collection + to: sysadmin + +template: mdstat_disks + on: md.disks + units: failed devices + every: 10s + calc: $total - $inuse + crit: $this > 0 + info: Array is degraded! + to: sysadmin + +template: mdstat_mismatch_cnt + on: md.mismatch_cnt + units: unsynchronized blocks + calc: $count + every: 10s + crit: $this > 0 + info: Mismatch count! + to: sysadmin + +template: mdstat_nonredundant_last_collected + on: md.nonredundant + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + info: number of seconds since the last successful data collection + to: sysadmin
\ No newline at end of file diff --git a/health/health.d/megacli.conf b/health/health.d/megacli.conf new file mode 100644 index 0000000..1881a7b --- /dev/null +++ b/health/health.d/megacli.conf @@ -0,0 +1,48 @@ + alarm: adapter_state + on: megacli.adapter_degraded + units: is degraded + lookup: sum -10s + every: 10s + crit: $this > 0 + info: adapter state + to: sysadmin + + template: bbu_relative_charge + on: megacli.bbu_relative_charge + units: percent + lookup: average -10s + every: 10s + warn: $this <= (($status >= $WARNING) ? (85) : (80)) + crit: $this <= (($status == $CRITICAL) ? (50) : (40)) + info: BBU relative state of charge + to: sysadmin + + template: bbu_cycle_count + on: megacli.bbu_cycle_count + units: cycle count + lookup: average -10s + every: 10s + warn: $this >= 100 + crit: $this >= 500 + info: BBU cycle count + to: sysadmin + + alarm: pd_media_errors + on: megacli.pd_media_error + units: media errors + lookup: sum -10s + every: 10s + warn: $this > 0 + delay: down 1m multiplier 2 max 10m + info: physical drive media errors + to: sysadmin + + alarm: pd_predictive_failures + on: megacli.pd_predictive_failure + units: predictive failures + lookup: sum -10s + every: 10s + warn: $this > 0 + delay: down 1m multiplier 2 max 10m + info: physical drive predictive failures + to: sysadmin diff --git a/health/health.d/memcached.conf b/health/health.d/memcached.conf new file mode 100644 index 0000000..d248ef5 --- /dev/null +++ b/health/health.d/memcached.conf @@ -0,0 +1,52 @@ + +# make sure memcached is running + +template: memcached_last_collected_secs + on: memcached.cache + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? 
($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: dba + + +# detect if memcached cache is full + +template: memcached_cache_memory_usage + on: memcached.cache + calc: $used * 100 / ($used + $available) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (70) : (80)) + crit: $this > (($status == $CRITICAL) ? (80) : (90)) + delay: up 0 down 15m multiplier 1.5 max 1h + info: current cache memory usage + to: dba + + +# find the rate memcached cache is filling + +template: cache_fill_rate + on: memcached.cache + lookup: min -10m at -50m unaligned of available + calc: ($this - $available) / (($now - $after) / 3600) + units: KB/hour + every: 1m + info: average rate the cache fills up (positive), or frees up (negative) space, for the last hour + + +# find the hours remaining until memcached cache is full + +template: out_of_cache_space_time + on: memcached.cache + calc: ($cache_fill_rate > 0) ? ($available / $cache_fill_rate) : (inf) + units: hours + every: 10s + warn: $this > 0 and $this < (($status >= $WARNING) ? (48) : (8)) + crit: $this > 0 and $this < (($status == $CRITICAL) ? 
(24) : (2)) + delay: down 15m multiplier 1.5 max 1h + info: estimated time the cache will run out of space, if the system continues to add data with the rate of the last hour + to: dba diff --git a/health/health.d/memory.conf b/health/health.d/memory.conf new file mode 100644 index 0000000..4a0e6e5 --- /dev/null +++ b/health/health.d/memory.conf @@ -0,0 +1,38 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + + alarm: 1hour_ecc_memory_correctable + on: mem.ecc_ce + os: linux + hosts: * + lookup: sum -10m unaligned + units: errors + every: 1m + warn: $this > 0 + delay: down 1h multiplier 1.5 max 1h + info: number of ECC correctable errors during the last hour + to: sysadmin + + alarm: 1hour_ecc_memory_uncorrectable + on: mem.ecc_ue + os: linux + hosts: * + lookup: sum -10m unaligned + units: errors + every: 1m + crit: $this > 0 + delay: down 1h multiplier 1.5 max 1h + info: number of ECC uncorrectable errors during the last hour + to: sysadmin + + alarm: 1hour_memory_hw_corrupted + on: mem.hwcorrupt + os: linux + hosts: * + calc: $HardwareCorrupted + units: MB + every: 10s + warn: $this > 0 + delay: down 1h multiplier 1.5 max 1h + info: amount of memory corrupted due to a hardware failure + to: sysadmin diff --git a/health/health.d/mongodb.conf b/health/health.d/mongodb.conf new file mode 100644 index 0000000..a80cb31 --- /dev/null +++ b/health/health.d/mongodb.conf @@ -0,0 +1,13 @@ + +# make sure mongodb is running + +template: mongodb_last_collected_secs + on: mongodb.read_operations + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? 
($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: dba diff --git a/health/health.d/mysql.conf b/health/health.d/mysql.conf new file mode 100644 index 0000000..39c4019 --- /dev/null +++ b/health/health.d/mysql.conf @@ -0,0 +1,100 @@ + +# make sure mysql is running + +template: mysql_last_collected_secs + on: mysql.queries + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: dba + + +# ----------------------------------------------------------------------------- +# slow queries + +template: mysql_10s_slow_queries + on: mysql.queries + lookup: sum -10s of slow_queries + units: slow queries + every: 10s + warn: $this > (($status >= $WARNING) ? (5) : (10)) + crit: $this > (($status == $CRITICAL) ? (10) : (20)) + delay: down 5m multiplier 1.5 max 1h + info: number of mysql slow queries over the last 10 seconds + to: dba + + +# ----------------------------------------------------------------------------- +# lock waits + +template: mysql_10s_table_locks_immediate + on: mysql.table_locks + lookup: sum -10s absolute of immediate + units: immediate locks + every: 10s + info: number of table immediate locks over the last 10 seconds + to: dba + +template: mysql_10s_table_locks_waited + on: mysql.table_locks + lookup: sum -10s absolute of waited + units: waited locks + every: 10s + info: number of table waited locks over the last 10 seconds + to: dba + +template: mysql_10s_waited_locks_ratio + on: mysql.table_locks + calc: ( ($mysql_10s_table_locks_waited + $mysql_10s_table_locks_immediate) > 0 ) ? 
(($mysql_10s_table_locks_waited * 100) / ($mysql_10s_table_locks_waited + $mysql_10s_table_locks_immediate)) : 0 + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (10) : (25)) + crit: $this > (($status == $CRITICAL) ? (25) : (50)) + delay: down 30m multiplier 1.5 max 1h + info: the ratio of mysql waited table locks, for the last 10 seconds + to: dba + + +# ----------------------------------------------------------------------------- +# connections + +template: mysql_connections + on: mysql.connections_active + calc: $active * 100 / $limit + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (60) : (70)) + crit: $this > (($status == $CRITICAL) ? (80) : (90)) + delay: down 15m multiplier 1.5 max 1h + info: the ratio of current active connections vs the maximum possible number of connections + to: dba + + +# ----------------------------------------------------------------------------- +# replication + +template: mysql_replication + on: mysql.slave_status + calc: ($sql_running == -1 OR $io_running == -1)?0:1 + units: ok/failed + every: 10s + crit: $this == 0 + delay: down 5m multiplier 1.5 max 1h + info: checks if mysql replication has stopped + to: dba + +template: mysql_replication_lag + on: mysql.slave_behind + calc: $seconds + units: seconds + every: 10s + warn: $this > (($status >= $WARNING) ? (5) : (10)) + crit: $this > (($status == $CRITICAL) ? (10) : (30)) + delay: down 15m multiplier 1.5 max 1h + info: the number of seconds mysql replication is behind this master + to: dba + diff --git a/health/health.d/named.conf b/health/health.d/named.conf new file mode 100644 index 0000000..4fc65c8 --- /dev/null +++ b/health/health.d/named.conf @@ -0,0 +1,14 @@ + +# make sure named is running + +template: named_last_collected_secs + on: named.global_queries + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? 
($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: domainadmin + diff --git a/health/health.d/net.conf b/health/health.d/net.conf new file mode 100644 index 0000000..ae3c26e --- /dev/null +++ b/health/health.d/net.conf @@ -0,0 +1,165 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + +# ----------------------------------------------------------------------------- +# net traffic overflow + + template: interface_speed + on: net.net + os: * + hosts: * + families: * + calc: ( $nic_speed_max > 0 ) ? ( $nic_speed_max) : ( nan ) + units: Mbit + every: 10s + info: The current speed of the physical network interface + + template: 1m_received_traffic_overflow + on: net.net + os: linux + hosts: * + families: * + lookup: average -1m unaligned absolute of received + calc: ($interface_speed > 0) ? ($this * 100 / ($interface_speed * 1000)) : ( nan ) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (80) : (85)) + crit: $this > (($status == $CRITICAL) ? (85) : (90)) + delay: down 1m multiplier 1.5 max 1h + info: interface received bandwidth usage over net device speed max + to: sysadmin + + template: 1m_sent_traffic_overflow + on: net.net + os: linux + hosts: * + families: * + lookup: average -1m unaligned absolute of sent + calc: ($interface_speed > 0) ? ($this * 100 / ($interface_speed * 1000)) : ( nan ) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (80) : (85)) + crit: $this > (($status == $CRITICAL) ? 
(85) : (90)) + delay: down 1m multiplier 1.5 max 1h + info: interface sent bandwidth usage over net device speed max + to: sysadmin + +# ----------------------------------------------------------------------------- +# dropped packets + +# check if an interface is dropping packets +# the alarm is checked every 1 minute +# and examines the last 10 minutes of data + +template: inbound_packets_dropped + on: net.drops + os: linux + hosts: * +families: * + lookup: sum -10m unaligned absolute of inbound + units: packets + every: 1m + warn: $this >= 5 + delay: down 1h multiplier 1.5 max 2h + info: interface inbound dropped packets in the last 10 minutes + to: sysadmin + +template: outbound_packets_dropped + on: net.drops + os: linux + hosts: * +families: * + lookup: sum -10m unaligned absolute of outbound + units: packets + every: 1m + warn: $this >= 5 + delay: down 1h multiplier 1.5 max 2h + info: interface outbound dropped packets in the last 10 minutes + to: sysadmin + +template: inbound_packets_dropped_ratio + on: net.packets + os: linux + hosts: * +families: * + lookup: sum -10m unaligned absolute of received + calc: (($inbound_packets_dropped != nan AND $this > 0) ? ($inbound_packets_dropped * 100 / $this) : (0)) + units: % + every: 1m + warn: $this >= 0.1 + crit: $this >= 2 + delay: down 1h multiplier 1.5 max 2h + info: the ratio of inbound dropped packets vs the total number of received packets of the network interface, during the last 10 minutes + to: sysadmin + +template: outbound_packets_dropped_ratio + on: net.packets + os: linux + hosts: * +families: * + lookup: sum -10m unaligned absolute of sent + calc: (($outbound_packets_dropped != nan AND $this > 0) ? 
($outbound_packets_dropped * 100 / $this) : (0)) + units: % + every: 1m + warn: $this >= 0.1 + crit: $this >= 2 + delay: down 1h multiplier 1.5 max 2h + info: the ratio of outbound dropped packets vs the total number of sent packets of the network interface, during the last 10 minutes + to: sysadmin + + +# ----------------------------------------------------------------------------- +# FIFO errors + +# check if an interface is having FIFO +# buffer errors +# the alarm is checked every 1 minute +# and examines the last 10 minutes of data + +template: 10min_fifo_errors + on: net.fifo + os: linux + hosts: * +families: * + lookup: sum -10m unaligned absolute + units: errors + every: 1m + warn: $this > 0 + delay: down 1h multiplier 1.5 max 2h + info: interface fifo errors in the last 10 minutes + to: sysadmin + + +# ----------------------------------------------------------------------------- +# check for packet storms + +# 1. calculate the rate packets are received in 1m: 1m_received_packets_rate +# 2. do the same for the last 10s +# 3. 
raise an alarm if the latter is 10x or 20x the first +# we assume the minimum packet storm should at least have +# 10000 packets/s, average of the last 10 seconds + +template: 1m_received_packets_rate + on: net.packets + os: linux freebsd + hosts: * +families: * + lookup: average -1m unaligned of received + units: packets + every: 10s + info: the average number of packets received during the last minute + +template: 10s_received_packets_storm + on: net.packets + os: linux freebsd + hosts: * +families: * + lookup: average -10s unaligned of received + calc: $this * 100 / (($1m_received_packets_rate < 1000)?(1000):($1m_received_packets_rate)) + every: 10s + units: % + warn: $this > (($status >= $WARNING)?(200):(5000)) + crit: $this > (($status >= $WARNING)?(5000):(6000)) +options: no-clear-notification + info: the % of the rate of received packets in the last 10 seconds, compared to the rate of the last minute (clear notification for this alarm will not be sent) + to: sysadmin diff --git a/health/health.d/netfilter.conf b/health/health.d/netfilter.conf new file mode 100644 index 0000000..1d07752 --- /dev/null +++ b/health/health.d/netfilter.conf @@ -0,0 +1,29 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + + alarm: netfilter_last_collected_secs + on: netfilter.conntrack_sockets + os: linux + hosts: * + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: sysadmin + + alarm: netfilter_conntrack_full + on: netfilter.conntrack_sockets + os: linux + hosts: * + lookup: max -10s unaligned of connections + calc: $this * 100 / $netfilter_conntrack_max + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? 
(70) : (80)) + crit: $this > (($status == $CRITICAL) ? (80) : (90)) + delay: down 5m multiplier 1.5 max 1h + info: the number of connections tracked by the netfilter connection tracker, as a percentage of the connection tracker table size + to: sysadmin diff --git a/health/health.d/nginx.conf b/health/health.d/nginx.conf new file mode 100644 index 0000000..a686c3d --- /dev/null +++ b/health/health.d/nginx.conf @@ -0,0 +1,14 @@ + +# make sure nginx is running + +template: nginx_last_collected_secs + on: nginx.requests + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: webmaster + diff --git a/health/health.d/nginx_plus.conf b/health/health.d/nginx_plus.conf new file mode 100644 index 0000000..5a171a7 --- /dev/null +++ b/health/health.d/nginx_plus.conf @@ -0,0 +1,14 @@ + +# make sure nginx_plus is running + +template: nginx_plus_last_collected_secs + on: nginx_plus.requests_total + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: webmaster + diff --git a/health/health.d/portcheck.conf b/health/health.d/portcheck.conf new file mode 100644 index 0000000..f42b63d --- /dev/null +++ b/health/health.d/portcheck.conf @@ -0,0 +1,48 @@ +template: portcheck_last_collected_secs +families: * + on: portcheck.status + calc: $now - $last_collected_t + every: 10s + units: seconds ago + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? 
($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: sysadmin + +# This is a fast-reacting no-notification alarm ideal for custom dashboards or badges +template: service_reachable +families: * + on: portcheck.status + lookup: average -1m unaligned percentage of success + calc: ($this < 75) ? (0) : ($this) + every: 5s + units: up/down + info: at least 75% successful connections during last 60 seconds, ideal for badges + to: silent + +template: connection_timeouts +families: * + on: portcheck.status + lookup: average -5m unaligned percentage of timeout + every: 10s + units: % + warn: $this >= 10 AND $this < 40 + crit: $this >= 40 + delay: down 5m multiplier 1.5 max 1h + info: average of timeouts during the last 5 minutes + options: no-clear-notification + to: sysadmin + +template: connection_fails +families: * + on: portcheck.status + lookup: average -5m unaligned percentage of no_connection + every: 10s + units: % + warn: $this >= 10 AND $this < 40 + crit: $this >= 40 + delay: down 5m multiplier 1.5 max 1h + info: average of failed connections during the last 5 minutes + options: no-clear-notification + to: sysadmin diff --git a/health/health.d/postgres.conf b/health/health.d/postgres.conf new file mode 100644 index 0000000..4e0583b --- /dev/null +++ b/health/health.d/postgres.conf @@ -0,0 +1,13 @@ + +# make sure postgres is running + +template: postgres_last_collected_secs + on: postgres.db_stat_transactions + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? 
($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: dba diff --git a/health/health.d/qos.conf b/health/health.d/qos.conf new file mode 100644 index 0000000..7290d15 --- /dev/null +++ b/health/health.d/qos.conf @@ -0,0 +1,18 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + +# check if a QoS class is dropping packets +# the alarm is checked every 30 seconds +# and examines the last 10 minutes of data + +#template: 10min_qos_packet_drops +# on: tc.qos_dropped +# os: linux +# hosts: * +# lookup: sum -10m unaligned absolute +# every: 30s +# warn: $this > 0 +# delay: up 0 down 30m multiplier 1.5 max 1h +# units: packets +# info: dropped packets in the last 10 minutes +# to: sysadmin diff --git a/health/health.d/ram.conf b/health/health.d/ram.conf new file mode 100644 index 0000000..4e43732 --- /dev/null +++ b/health/health.d/ram.conf @@ -0,0 +1,64 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + + alarm: used_ram_to_ignore + on: system.ram + os: linux freebsd + hosts: * + calc: ($zfs.arc_size.arcsz = nan)?(0):($zfs.arc_size.arcsz) + every: 10s + info: the amount of memory that is reported as used, but is actually capable of resizing itself based on the system needs (e.g. ZFS ARC) + + alarm: ram_in_use + on: system.ram + os: linux + hosts: * +# calc: $used * 100 / ($used + $cached + $free) + calc: ($used - $used_ram_to_ignore) * 100 / ($used - $used_ram_to_ignore + $cached + $free) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (80) : (90)) + crit: $this > (($status == $CRITICAL) ? 
(90) : (98)) + delay: down 15m multiplier 1.5 max 1h + info: system RAM used + to: sysadmin + + alarm: ram_available + on: mem.available + os: linux + hosts: * + calc: ($avail + $used_ram_to_ignore) * 100 / ($system.ram.used + $system.ram.cached + $system.ram.free + $system.ram.buffers) + units: % + every: 10s + warn: $this < (($status >= $WARNING) ? ( 5) : (10)) + crit: $this < (($status == $CRITICAL) ? (10) : ( 5)) + delay: down 15m multiplier 1.5 max 1h + info: estimated amount of RAM available for userspace processes, without causing swapping + to: sysadmin + +## FreeBSD +alarm: ram_in_use + on: system.ram + os: freebsd +hosts: * + calc: ($active + $wired + $laundry + $buffers - $used_ram_to_ignore) * 100 / ($active + $wired + $laundry + $buffers - $used_ram_to_ignore + $cache + $free + $inactive) +units: % +every: 10s + warn: $this > (($status >= $WARNING) ? (80) : (90)) + crit: $this > (($status == $CRITICAL) ? (90) : (98)) +delay: down 15m multiplier 1.5 max 1h + info: system RAM usage + to: sysadmin + + alarm: ram_available + on: system.ram + os: freebsd + hosts: * + calc: ($free + $inactive + $used_ram_to_ignore) * 100 / ($free + $active + $inactive + $wired + $cache + $laundry + $buffers) + units: % + every: 10s + warn: $this < (($status >= $WARNING) ? ( 5) : (10)) + crit: $this < (($status == $CRITICAL) ? (10) : ( 5)) + delay: down 15m multiplier 1.5 max 1h + info: estimated amount of RAM available for userspace processes, without causing swapping + to: sysadmin diff --git a/health/health.d/redis.conf b/health/health.d/redis.conf new file mode 100644 index 0000000..c08a884 --- /dev/null +++ b/health/health.d/redis.conf @@ -0,0 +1,34 @@ + +# make sure redis is running + +template: redis_last_collected_secs + on: redis.operations + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? 
($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: dba + +template: redis_bgsave_broken +families: * + on: redis.bgsave_health + every: 10s + crit: $rdb_last_bgsave_status != 0 + units: ok/failed + info: states if redis bgsave is working + delay: down 5m multiplier 1.5 max 1h + to: dba + +template: redis_bgsave_slow +families: * + on: redis.bgsave_now + every: 10s + warn: $rdb_bgsave_in_progress > 600 + crit: $rdb_bgsave_in_progress > 1200 + units: seconds + info: the time redis needs to save its database + delay: down 5m multiplier 1.5 max 1h + to: dba diff --git a/health/health.d/retroshare.conf b/health/health.d/retroshare.conf new file mode 100644 index 0000000..2344b60 --- /dev/null +++ b/health/health.d/retroshare.conf @@ -0,0 +1,25 @@ +# make sure RetroShare is running + +template: retroshare_last_collected_secs + on: retroshare.peers + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: sysadmin + +# make sure the DHT is fine when active + +template: retroshare_dht_working + on: retroshare.dht + calc: $dht_size_all + units: peers + every: 1m + warn: $this < (($status >= $WARNING) ? (120) : (100)) + crit: $this < (($status == $CRITICAL) ? 
(10) : (1)) + delay: up 0 down 15m multiplier 1.5 max 1h + info: Checks if the DHT has enough peers to operate + to: sysadmin diff --git a/health/health.d/softnet.conf b/health/health.d/softnet.conf new file mode 100644 index 0000000..77c804b --- /dev/null +++ b/health/health.d/softnet.conf @@ -0,0 +1,40 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + +# check for common /proc/net/softnet_stat errors + + alarm: 10min_netdev_backlog_exceeded + on: system.softnet_stat + os: linux + hosts: * + lookup: sum -10m unaligned absolute of dropped + units: packets + every: 1m + warn: $this > 0 + delay: down 1h multiplier 1.5 max 2h + info: number of packets dropped in the last 10min, because sysctl net.core.netdev_max_backlog was exceeded (this can be a cause for dropped packets) + to: sysadmin + + alarm: 10min_netdev_budget_ran_outs + on: system.softnet_stat + os: linux + hosts: * + lookup: sum -10m unaligned absolute of squeezed + units: events + every: 1m + warn: $this > (($status >= $WARNING) ? 
(0) : (10)) + delay: down 1h multiplier 1.5 max 2h + info: number of times, during the last 10min, ksoftirqd ran out of sysctl net.core.netdev_budget or net.core.netdev_budget_usecs, with work remaining (this can be a cause for dropped packets) + to: silent + + alarm: 10min_netisr_backlog_exceeded + on: system.softnet_stat + os: freebsd + hosts: * + lookup: sum -10m unaligned absolute of qdrops + units: packets + every: 1m + warn: $this > 0 + delay: down 1h multiplier 1.5 max 2h + info: number of drops in the last 10min, because sysctl net.route.netisr_maxqlen was exceeded (this can be a cause for dropped packets) + to: sysadmin diff --git a/health/health.d/squid.conf b/health/health.d/squid.conf new file mode 100644 index 0000000..06cc967 --- /dev/null +++ b/health/health.d/squid.conf @@ -0,0 +1,14 @@ + +# make sure squid is running + +template: squid_last_collected_secs + on: squid.clients_requests + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: proxyadmin + diff --git a/health/health.d/stiebeleltron.conf b/health/health.d/stiebeleltron.conf new file mode 100644 index 0000000..e0361eb --- /dev/null +++ b/health/health.d/stiebeleltron.conf @@ -0,0 +1,11 @@ +template: stiebeleltron_last_collected_secs +families: * + on: stiebeleltron.heating.hc1 + calc: $now - $last_collected_t + every: 10s + units: seconds ago + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? 
($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: sitemgr diff --git a/health/health.d/swap.conf b/health/health.d/swap.conf new file mode 100644 index 0000000..f920b08 --- /dev/null +++ b/health/health.d/swap.conf @@ -0,0 +1,43 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + + alarm: 30min_ram_swapped_out + on: system.swapio + os: linux freebsd + hosts: * + lookup: sum -30m unaligned absolute of out + # we have to convert KB to MB by dividing $this (i.e. the result of the lookup) with 1024 + calc: $this / 1024 * 100 / ( $system.ram.used + $system.ram.cached + $system.ram.free ) + units: % of RAM + every: 1m + warn: $this > (($status >= $WARNING) ? (10) : (20)) + crit: $this > (($status == $CRITICAL) ? (20) : (30)) + delay: up 0 down 15m multiplier 1.5 max 1h + info: the amount of memory swapped in the last 30 minutes, as a percentage of the system RAM + to: sysadmin + + alarm: ram_in_swap + on: system.swap + os: linux + hosts: * + calc: $used * 100 / ( $system.ram.used + $system.ram.cached + $system.ram.free ) + units: % of RAM + every: 10s + warn: $this > (($status >= $WARNING) ? (15) : (20)) + crit: $this > (($status == $CRITICAL) ? (40) : (50)) + delay: up 30s down 15m multiplier 1.5 max 1h + info: the swap memory used, as a percentage of the system RAM + to: sysadmin + + alarm: used_swap + on: system.swap + os: linux freebsd + hosts: * + calc: $used * 100 / ( $used + $free ) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (80) : (90)) + crit: $this > (($status == $CRITICAL) ? 
(90) : (98)) + delay: up 30s down 15m multiplier 1.5 max 1h + info: the percentage of swap memory used + to: sysadmin diff --git a/health/health.d/tcp_conn.conf b/health/health.d/tcp_conn.conf new file mode 100644 index 0000000..7aa9a98 --- /dev/null +++ b/health/health.d/tcp_conn.conf @@ -0,0 +1,19 @@ + +# +# ${tcp_max_connections} may be nan or -1 if the system +# supports dynamic threshold for TCP connections. +# In this case, the alarm will always be zero. +# + + alarm: tcp_connections + on: ipv4.tcpsock + os: linux + hosts: * + calc: (${tcp_max_connections} > 0) ? ( ${connections} * 100 / ${tcp_max_connections} ) : 0 + units: % + every: 10s + warn: $this > (($status >= $WARNING ) ? ( 60 ) : ( 80 )) + crit: $this > (($status >= $CRITICAL) ? ( 80 ) : ( 90 )) + delay: up 0 down 5m multiplier 1.5 max 1h + info: the percentage of IPv4 TCP connections over the max allowed + to: sysadmin diff --git a/health/health.d/tcp_listen.conf b/health/health.d/tcp_listen.conf new file mode 100644 index 0000000..552930a --- /dev/null +++ b/health/health.d/tcp_listen.conf @@ -0,0 +1,82 @@ +# +# There are two queues involved when incoming TCP connections are handled +# (both at the kernel): +# +# SYN queue +# The SYN queue tracks TCP handshakes until connections are fully established. +# It overflows when too many incoming TCP connection requests hang in the +# half-open state and the server is not configured to fall back to SYN cookies. +# Overflows are usually caused by SYN flood DoS attacks (i.e. someone sends +# lots of SYN packets and never completes the handshakes). +# +# Accept queue +# The accept queue holds fully established TCP connections waiting to be handled +# by the listening application. It overflows when the server application fails +# to accept new connections at the rate they are coming in. 
+# +# +# ----------------------------------------------------------------------------- +# tcp accept queue (at the kernel) + + alarm: 1m_tcp_accept_queue_overflows + on: ip.tcp_accept_queue + os: linux + hosts: * + lookup: sum -60s unaligned absolute of ListenOverflows + units: overflows + every: 10s + crit: $this > 0 + delay: up 0 down 5m multiplier 1.5 max 1h + info: the number of times the TCP accept queue of the kernel overflowed, during the last minute + to: sysadmin + +# THIS IS TOO GENERIC +# CHECK: https://github.com/netdata/netdata/issues/3234#issuecomment-423935842 + alarm: 1m_tcp_accept_queue_drops + on: ip.tcp_accept_queue + os: linux + hosts: * + lookup: sum -60s unaligned absolute of ListenDrops + units: drops + every: 10s +# warn: $this > 0 + crit: $this > (($status == $CRITICAL) ? (0) : (150)) + delay: up 0 down 5m multiplier 1.5 max 1h + info: the number of times the TCP accept queue of the kernel dropped packets, during the last minute (includes bogus packets received) + to: sysadmin + + +# ----------------------------------------------------------------------------- +# tcp SYN queue (at the kernel) + +# When the SYN queue is full, either TcpExtTCPReqQFullDoCookies or +# TcpExtTCPReqQFullDrop is incremented, depending on whether SYN cookies are +# enabled or not. In both cases this probably indicates a SYN flood attack, +# so a notification should be sent. + + alarm: 1m_tcp_syn_queue_drops + on: ip.tcp_syn_queue + os: linux + hosts: * + lookup: sum -60s unaligned absolute of TCPReqQFullDrop + units: drops + every: 10s + warn: $this > 0 + crit: $this > (($status == $CRITICAL) ? 
(0) : (60)) + delay: up 0 down 5m multiplier 1.5 max 1h + info: the number of times the TCP SYN queue of the kernel was full and dropped packets, during the last minute + to: sysadmin + + alarm: 1m_tcp_syn_queue_cookies + on: ip.tcp_syn_queue + os: linux + hosts: * + lookup: sum -60s unaligned absolute of TCPReqQFullDoCookies + units: cookies + every: 10s + warn: $this > 0 + crit: $this > (($status == $CRITICAL) ? (0) : (60)) + delay: up 0 down 5m multiplier 1.5 max 1h + info: the number of times the TCP SYN queue of the kernel was full and sent SYN cookies, during the last minute + to: sysadmin + diff --git a/health/health.d/tcp_mem.conf b/health/health.d/tcp_mem.conf new file mode 100644 index 0000000..6927d57 --- /dev/null +++ b/health/health.d/tcp_mem.conf @@ -0,0 +1,20 @@ +# +# check +# http://blog.tsunanet.net/2011/03/out-of-socket-memory.html +# +# We give a warning when TCP is under memory pressure +# and a critical when TCP is 90% of its upper memory limit +# + + alarm: tcp_memory + on: ipv4.sockstat_tcp_mem + os: linux + hosts: * + calc: ${mem} * 100 / ${tcp_mem_high} + units: % + every: 10s + warn: ${mem} > (($status >= $WARNING ) ? ( ${tcp_mem_pressure} * 0.8 ) : ( ${tcp_mem_pressure} )) + crit: ${mem} > (($status >= $CRITICAL ) ? 
( ${tcp_mem_pressure} ) : ( ${tcp_mem_high} * 0.9 )) + delay: up 0 down 5m multiplier 1.5 max 1h + info: the amount of TCP memory as a percentage of its max memory limit + to: sysadmin diff --git a/health/health.d/tcp_orphans.conf b/health/health.d/tcp_orphans.conf new file mode 100644 index 0000000..280d659 --- /dev/null +++ b/health/health.d/tcp_orphans.conf @@ -0,0 +1,21 @@ + +# +# check +# http://blog.tsunanet.net/2011/03/out-of-socket-memory.html +# +# The kernel may penalize orphans by 2x or even 4x +# so we alarm warning at 25% and critical at 50% +# + + alarm: tcp_orphans + on: ipv4.sockstat_tcp_sockets + os: linux + hosts: * + calc: ${orphan} * 100 / ${tcp_max_orphans} + units: % + every: 10s + warn: $this > (($status >= $WARNING ) ? ( 20 ) : ( 25 )) + crit: $this > (($status >= $CRITICAL) ? ( 25 ) : ( 50 )) + delay: up 0 down 5m multiplier 1.5 max 1h + info: the percentage of orphan IPv4 TCP sockets over the max allowed (this may lead to too-many-orphans errors) + to: sysadmin diff --git a/health/health.d/tcp_resets.conf b/health/health.d/tcp_resets.conf new file mode 100644 index 0000000..91dad3c --- /dev/null +++ b/health/health.d/tcp_resets.conf @@ -0,0 +1,67 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + +# ----------------------------------------------------------------------------- + + alarm: ipv4_tcphandshake_last_collected_secs + on: ipv4.tcphandshake + os: linux freebsd + hosts: * + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? 
($update_every) : (60 * $update_every)) + delay: up 0 down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: sysadmin + +# ----------------------------------------------------------------------------- +# tcp resets this host sends + + alarm: 1m_ipv4_tcp_resets_sent + on: ipv4.tcphandshake + os: linux + hosts: * + lookup: average -1m at -10s unaligned absolute of OutRsts + units: tcp resets/s + every: 10s + info: average TCP RESETS this host is sending, over the last minute + + alarm: 10s_ipv4_tcp_resets_sent + on: ipv4.tcphandshake + os: linux + hosts: * + lookup: average -10s unaligned absolute of OutRsts + units: tcp resets/s + every: 10s + warn: $this > ((($1m_ipv4_tcp_resets_sent < 5)?(5):($1m_ipv4_tcp_resets_sent)) * (($status >= $WARNING) ? (1) : (20))) + delay: up 0 down 60m multiplier 1.2 max 2h + options: no-clear-notification + info: average TCP RESETS this host is sending, over the last 10 seconds (this can be an indication that a port scan is made, or that a service running on this host has crashed; clear notification for this alarm will not be sent) + to: sysadmin + +# ----------------------------------------------------------------------------- +# tcp resets this host receives + + alarm: 1m_ipv4_tcp_resets_received + on: ipv4.tcphandshake + os: linux freebsd + hosts: * + lookup: average -1m at -10s unaligned absolute of AttemptFails + units: tcp resets/s + every: 10s + info: average TCP RESETS this host is receiving, over the last minute + + alarm: 10s_ipv4_tcp_resets_received + on: ipv4.tcphandshake + os: linux freebsd + hosts: * + lookup: average -10s unaligned absolute of AttemptFails + units: tcp resets/s + every: 10s + warn: $this > ((($1m_ipv4_tcp_resets_received < 5)?(5):($1m_ipv4_tcp_resets_received)) * (($status >= $WARNING) ? 
(1) : (10))) + delay: up 0 down 60m multiplier 1.2 max 2h + options: no-clear-notification + info: average TCP RESETS this host is receiving, over the last 10 seconds (this can be an indication that a service this host needs, has crashed; clear notification for this alarm will not be sent) + to: sysadmin diff --git a/health/health.d/udp_errors.conf b/health/health.d/udp_errors.conf new file mode 100644 index 0000000..5140228 --- /dev/null +++ b/health/health.d/udp_errors.conf @@ -0,0 +1,49 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + +# ----------------------------------------------------------------------------- + + alarm: ipv4_udperrors_last_collected_secs + on: ipv4.udperrors + os: linux freebsd + hosts: * + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: up 0 down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: sysadmin + +# ----------------------------------------------------------------------------- +# UDP receive buffer errors + + alarm: 1m_ipv4_udp_receive_buffer_errors + on: ipv4.udperrors + os: linux freebsd + hosts: * + lookup: sum -1m unaligned absolute of RcvbufErrors + units: errors + every: 10s + warn: $this > 0 + crit: $this > (($status == $CRITICAL) ? (0) : (100)) + info: number of UDP receive buffer errors during the last minute + delay: up 0 down 60m multiplier 1.2 max 2h + to: sysadmin + +# ----------------------------------------------------------------------------- +# UDP send buffer errors + + alarm: 1m_ipv4_udp_send_buffer_errors + on: ipv4.udperrors + os: linux + hosts: * + lookup: sum -1m unaligned absolute of SndbufErrors + units: errors + every: 10s + warn: $this > 0 + crit: $this > (($status == $CRITICAL) ? 
(0) : (100)) + info: number of UDP send buffer errors during the last minute + delay: up 0 down 60m multiplier 1.2 max 2h + to: sysadmin diff --git a/health/health.d/varnish.conf b/health/health.d/varnish.conf new file mode 100644 index 0000000..cca7446 --- /dev/null +++ b/health/health.d/varnish.conf @@ -0,0 +1,9 @@ + alarm: varnish_last_collected + on: varnish.uptime + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + info: number of seconds since the last successful data collection + to: sysadmin diff --git a/health/health.d/web_log.conf b/health/health.d/web_log.conf new file mode 100644 index 0000000..031adc2 --- /dev/null +++ b/health/health.d/web_log.conf @@ -0,0 +1,193 @@ + +# make sure we can collect web log data + +template: last_collected_secs + on: web_log.response_codes +families: * + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: webmaster + + +# ----------------------------------------------------------------------------- +# high level response code alarms + +# the following alarms trigger only when there are enough data. +# we assume there are enough data when: +# +# $1m_requests > 120 +# +# i.e. 
when there are at least 120 requests during the last minute + +template: 1m_requests + on: web_log.response_statuses +families: * + lookup: sum -1m unaligned + calc: ($this == 0)?(1):($this) + units: requests + every: 10s + info: the sum of all HTTP requests over the last minute + +template: 1m_successful + on: web_log.response_statuses +families: * + lookup: sum -1m unaligned of successful_requests + calc: $this * 100 / $1m_requests + units: % + every: 10s + warn: ($1m_requests > 120) ? ($this < (($status >= $WARNING ) ? ( 95 ) : ( 85 )) ) : ( 0 ) + crit: ($1m_requests > 120) ? ($this < (($status == $CRITICAL) ? ( 85 ) : ( 75 )) ) : ( 0 ) + delay: up 2m down 15m multiplier 1.5 max 1h + info: the ratio of successful HTTP responses (1xx, 2xx, 304) over the last minute + to: webmaster + +template: 1m_redirects + on: web_log.response_statuses +families: * + lookup: sum -1m unaligned of redirects + calc: $this * 100 / $1m_requests + units: % + every: 10s + warn: ($1m_requests > 120) ? ($this > (($status >= $WARNING ) ? ( 1 ) : ( 20 )) ) : ( 0 ) + crit: ($1m_requests > 120) ? ($this > (($status == $CRITICAL) ? ( 20 ) : ( 30 )) ) : ( 0 ) + delay: up 2m down 15m multiplier 1.5 max 1h + info: the ratio of HTTP redirects (3xx except 304) over the last minute + to: webmaster + +template: 1m_bad_requests + on: web_log.response_statuses +families: * + lookup: sum -1m unaligned of bad_requests + calc: $this * 100 / $1m_requests + units: % + every: 10s + warn: ($1m_requests > 120) ? ($this > (($status >= $WARNING) ? ( 10 ) : ( 30 )) ) : ( 0 ) + crit: ($1m_requests > 120) ? ($this > (($status == $CRITICAL) ? ( 30 ) : ( 50 )) ) : ( 0 ) + delay: up 2m down 15m multiplier 1.5 max 1h + info: the ratio of HTTP bad requests (4xx) over the last minute + to: webmaster + +template: 1m_internal_errors + on: web_log.response_statuses +families: * + lookup: sum -1m unaligned of server_errors + calc: $this * 100 / $1m_requests + units: % + every: 10s + warn: ($1m_requests > 120) ? 
($this > (($status >= $WARNING) ? ( 1 ) : ( 2 )) ) : ( 0 ) + crit: ($1m_requests > 120) ? ($this > (($status == $CRITICAL) ? ( 2 ) : ( 5 )) ) : ( 0 ) + delay: up 2m down 15m multiplier 1.5 max 1h + info: the ratio of HTTP internal server errors (5xx), over the last minute + to: webmaster + +# unmatched lines + +# the following alarms trigger only when there are enough data. +# we assume there are enough data when: +# +# $1m_total_requests > 120 +# +# i.e. when there are at least 120 requests during the last minute + +template: 1m_total_requests + on: web_log.response_codes +families: * + lookup: sum -1m unaligned + calc: ($this == 0)?(1):($this) + units: requests + every: 10s + info: the sum of all HTTP requests over the last minute + +template: 1m_unmatched +on: web_log.response_codes +families: * + lookup: sum -1m unaligned of unmatched + calc: $this * 100 / $1m_total_requests + units: % + every: 10s + warn: ($1m_total_requests > 120) ? ($this > 1) : ( 0 ) + crit: ($1m_total_requests > 120) ? ($this > 5) : ( 0 ) + delay: up 1m down 5m multiplier 1.5 max 1h + info: the ratio of unmatched lines, over the last minute + to: webmaster + +# ----------------------------------------------------------------------------- +# web slow + +# the following alarms trigger only when there are enough data. +# we assume there are enough data when: +# +# $1m_requests > 120 +# +# i.e. when there are at least 120 requests during the last minute + +template: 10m_response_time + on: web_log.response_time +families: * + lookup: average -10m unaligned of avg + units: ms + every: 30s + info: the average time to respond to HTTP requests, over the last 10 minutes + +template: web_slow + on: web_log.response_time +families: * + lookup: average -1m unaligned of avg + units: ms + every: 10s + green: 500 + red: 1000 + warn: ($1m_requests > 120) ? ($this > $green && $this > ($10m_response_time * 2) ) : ( 0 ) + crit: ($1m_requests > 120) ? 
($this > $red && $this > ($10m_response_time * 4) ) : ( 0 ) + delay: down 15m multiplier 1.5 max 1h + info: the average time to respond to HTTP requests, over the last 1 minute + options: no-clear-notification + to: webmaster + +# ----------------------------------------------------------------------------- +# web too many or too few requests + +# the following alarms trigger only when there are enough data. +# we assume there are enough data when: +# +# $5m_successful_old > 120 +# +# i.e. when there were at least 120 requests during the 5 minutes starting +# at -10m and ending at -5m + +template: 5m_successful_old + on: web_log.response_statuses +families: * + lookup: average -5m at -5m unaligned of successful_requests + units: requests/s + every: 30s + info: average rate of successful HTTP requests over the last 5 minutes + +template: 5m_successful + on: web_log.response_statuses +families: * + lookup: average -5m unaligned of successful_requests + units: requests/s + every: 30s + info: average successful HTTP requests over the last 5 minutes + +template: 5m_requests_ratio + on: web_log.response_codes +families: * + calc: ($5m_successful_old > 0)?($5m_successful * 100 / $5m_successful_old):(100) + units: % + every: 30s + warn: ($5m_successful_old > 120) ? ($this > 200 OR $this < 50) : (0) + crit: ($5m_successful_old > 120) ? 
($this > 400 OR $this < 25) : (0) + delay: down 15m multiplier 1.5 max 1h +options: no-clear-notification + info: the percentage of successful web requests over the last 5 minutes, \ + compared with the previous 5 minutes \ + (clear notification for this alarm will not be sent) + to: webmaster + diff --git a/health/health.d/zfs.conf b/health/health.d/zfs.conf new file mode 100644 index 0000000..af73824 --- /dev/null +++ b/health/health.d/zfs.conf @@ -0,0 +1,10 @@ + + alarm: zfs_memory_throttle + on: zfs.memory_ops + lookup: sum -10m unaligned absolute of throttled + units: events + every: 1m + warn: $this > 0 + delay: down 1h multiplier 1.5 max 2h + info: the number of times ZFS had to limit the ARC growth in the last 10 minutes + to: sysadmin diff --git a/health/health.h b/health/health.h new file mode 100644 index 0000000..ff10fd6 --- /dev/null +++ b/health/health.h @@ -0,0 +1,142 @@ +// SPDX-License-Identifier: GPL-3.0-or-later + +#ifndef NETDATA_HEALTH_H +#define NETDATA_HEALTH_H 1 + +#include "../daemon/common.h" + +#define NETDATA_PLUGIN_HOOK_HEALTH \ + { \ + .name = "HEALTH", \ + .config_section = NULL, \ + .config_name = NULL, \ + .enabled = 1, \ + .thread = NULL, \ + .init_routine = NULL, \ + .start_routine = health_main \ + }, + +extern unsigned int default_health_enabled; + +#define HEALTH_ENTRY_FLAG_PROCESSED 0x00000001 +#define HEALTH_ENTRY_FLAG_UPDATED 0x00000002 +#define HEALTH_ENTRY_FLAG_EXEC_RUN 0x00000004 +#define HEALTH_ENTRY_FLAG_EXEC_FAILED 0x00000008 +#define HEALTH_ENTRY_FLAG_SILENCED 0x00000008 + +#define HEALTH_ENTRY_FLAG_SAVED 0x10000000 +#define HEALTH_ENTRY_FLAG_NO_CLEAR_NOTIFICATION 0x80000000 + +#ifndef HEALTH_LISTEN_PORT +#define HEALTH_LISTEN_PORT 19998 +#endif + +#ifndef HEALTH_LISTEN_BACKLOG +#define HEALTH_LISTEN_BACKLOG 4096 +#endif + +#define HEALTH_ALARM_KEY "alarm" +#define HEALTH_TEMPLATE_KEY "template" +#define HEALTH_ON_KEY "on" +#define HEALTH_CONTEXT_KEY "context" +#define HEALTH_CHART_KEY "chart" +#define HEALTH_HOST_KEY 
"hosts" +#define HEALTH_OS_KEY "os" +#define HEALTH_FAMILIES_KEY "families" +#define HEALTH_LOOKUP_KEY "lookup" +#define HEALTH_CALC_KEY "calc" +#define HEALTH_EVERY_KEY "every" +#define HEALTH_GREEN_KEY "green" +#define HEALTH_RED_KEY "red" +#define HEALTH_WARN_KEY "warn" +#define HEALTH_CRIT_KEY "crit" +#define HEALTH_EXEC_KEY "exec" +#define HEALTH_RECIPIENT_KEY "to" +#define HEALTH_UNITS_KEY "units" +#define HEALTH_INFO_KEY "info" +#define HEALTH_DELAY_KEY "delay" +#define HEALTH_OPTIONS_KEY "options" + +typedef struct silencer { + char *alarms; + SIMPLE_PATTERN *alarms_pattern; + + char *hosts; + SIMPLE_PATTERN *hosts_pattern; + + char *contexts; + SIMPLE_PATTERN *contexts_pattern; + + char *charts; + SIMPLE_PATTERN *charts_pattern; + + char *families; + SIMPLE_PATTERN *families_pattern; + + struct silencer *next; +} SILENCER; + +typedef enum silence_type { + STYPE_NONE, + STYPE_DISABLE_ALARMS, + STYPE_SILENCE_NOTIFICATIONS +} SILENCE_TYPE; + +typedef struct silencers { + int all_alarms; + SILENCE_TYPE stype; + SILENCER *silencers; +} SILENCERS; + +SILENCERS *silencers; + +extern void health_init(void); +extern void *health_main(void *ptr); + +extern void health_reload(void); + +extern int health_variable_lookup(const char *variable, uint32_t hash, RRDCALC *rc, calculated_number *result); +extern void health_alarms2json(RRDHOST *host, BUFFER *wb, int all); +extern void health_alarm_log2json(RRDHOST *host, BUFFER *wb, uint32_t after); + +void health_api_v1_chart_variables2json(RRDSET *st, BUFFER *buf); + +extern int health_alarm_log_open(RRDHOST *host); +extern void health_alarm_log_close(RRDHOST *host); +extern void health_log_rotate(RRDHOST *host); +extern void health_alarm_log_save(RRDHOST *host, ALARM_ENTRY *ae); +extern ssize_t health_alarm_log_read(RRDHOST *host, FILE *fp, const char *filename); +extern void health_alarm_log_load(RRDHOST *host); + +extern void health_alarm_log( + RRDHOST *host, + uint32_t alarm_id, + uint32_t alarm_event_id, + time_t 
when, + const char *name, + const char *chart, + const char *family, + const char *exec, + const char *recipient, + time_t duration, + calculated_number old_value, + calculated_number new_value, + RRDCALC_STATUS old_status, + RRDCALC_STATUS new_status, + const char *source, + const char *units, + const char *info, + int delay, + uint32_t flags); + +extern void health_readdir(RRDHOST *host, const char *user_path, const char *stock_path, const char *subpath); +extern char *health_user_config_dir(void); +extern char *health_stock_config_dir(void); +extern void health_reload_host(RRDHOST *host); +extern void health_alarm_log_free(RRDHOST *host); + +extern void health_alarm_log_free_one_nochecks_nounlink(ALARM_ENTRY *ae); + +extern void *health_cmdapi_thread(void *ptr); + +#endif //NETDATA_HEALTH_H diff --git a/health/health_config.c b/health/health_config.c new file mode 100644 index 0000000..35fde90 --- /dev/null +++ b/health/health_config.c @@ -0,0 +1,861 @@ +// SPDX-License-Identifier: GPL-3.0-or-later + +#include "health.h" + +#define HEALTH_CONF_MAX_LINE 4096 + +#define HEALTH_ALARM_KEY "alarm" +#define HEALTH_TEMPLATE_KEY "template" +#define HEALTH_ON_KEY "on" +#define HEALTH_HOST_KEY "hosts" +#define HEALTH_OS_KEY "os" +#define HEALTH_FAMILIES_KEY "families" +#define HEALTH_LOOKUP_KEY "lookup" +#define HEALTH_CALC_KEY "calc" +#define HEALTH_EVERY_KEY "every" +#define HEALTH_GREEN_KEY "green" +#define HEALTH_RED_KEY "red" +#define HEALTH_WARN_KEY "warn" +#define HEALTH_CRIT_KEY "crit" +#define HEALTH_EXEC_KEY "exec" +#define HEALTH_RECIPIENT_KEY "to" +#define HEALTH_UNITS_KEY "units" +#define HEALTH_INFO_KEY "info" +#define HEALTH_DELAY_KEY "delay" +#define HEALTH_OPTIONS_KEY "options" + +static inline int rrdcalc_add_alarm_from_config(RRDHOST *host, RRDCALC *rc) { + if(!rc->chart) { + error("Health configuration for alarm '%s' does not have a chart", rc->name); + return 0; + } + + if(!rc->update_every) { + error("Health configuration for alarm '%s.%s' has no 
frequency (parameter 'every'). Ignoring it.", rc->chart?rc->chart:"NOCHART", rc->name); + return 0; + } + + if(!RRDCALC_HAS_DB_LOOKUP(rc) && !rc->calculation && !rc->warning && !rc->critical) { + error("Health configuration for alarm '%s.%s' is useless (no db lookup, no calculation, no warning and no critical expressions)", rc->chart?rc->chart:"NOCHART", rc->name); + return 0; + } + + if (rrdcalc_exists(host, rc->chart, rc->name, rc->hash_chart, rc->hash)) + return 0; + + rc->id = rrdcalc_get_unique_id(host, rc->chart, rc->name, &rc->next_event_id); + + debug(D_HEALTH, "Health configuration adding alarm '%s.%s' (%u): exec '%s', recipient '%s', green " CALCULATED_NUMBER_FORMAT_AUTO ", red " CALCULATED_NUMBER_FORMAT_AUTO ", lookup: group %d, after %d, before %d, options %u, dimensions '%s', update every %d, calculation '%s', warning '%s', critical '%s', source '%s', delay up %d, delay down %d, delay max %d, delay_multiplier %f", + rc->chart?rc->chart:"NOCHART", + rc->name, + rc->id, + (rc->exec)?rc->exec:"DEFAULT", + (rc->recipient)?rc->recipient:"DEFAULT", + rc->green, + rc->red, + (int)rc->group, + rc->after, + rc->before, + rc->options, + (rc->dimensions)?rc->dimensions:"NONE", + rc->update_every, + (rc->calculation)?rc->calculation->parsed_as:"NONE", + (rc->warning)?rc->warning->parsed_as:"NONE", + (rc->critical)?rc->critical->parsed_as:"NONE", + rc->source, + rc->delay_up_duration, + rc->delay_down_duration, + rc->delay_max_duration, + rc->delay_multiplier + ); + + rrdcalc_create_part2(host, rc); + return 1; +} + +static inline int rrdcalctemplate_add_template_from_config(RRDHOST *host, RRDCALCTEMPLATE *rt) { + if(unlikely(!rt->context)) { + error("Health configuration for template '%s' does not have a context", rt->name); + return 0; + } + + if(unlikely(!rt->update_every)) { + error("Health configuration for template '%s' has no frequency (parameter 'every'). 
Ignoring it.", rt->name); + return 0; + } + + if(unlikely(!RRDCALCTEMPLATE_HAS_DB_LOOKUP(rt) && !rt->calculation && !rt->warning && !rt->critical)) { + error("Health configuration for template '%s' is useless (no calculation, no warning and no critical evaluation)", rt->name); + return 0; + } + + RRDCALCTEMPLATE *t, *last = NULL; + for (t = host->templates; t ; last = t, t = t->next) { + if(unlikely(t->hash_name == rt->hash_name + && !strcmp(t->name, rt->name) + && !strcmp(t->family_match?t->family_match:"*", rt->family_match?rt->family_match:"*") + )) { + error("Health configuration template '%s' already exists for host '%s'.", rt->name, host->hostname); + return 0; + } + } + + debug(D_HEALTH, "Health configuration adding template '%s': context '%s', exec '%s', recipient '%s', green " CALCULATED_NUMBER_FORMAT_AUTO ", red " CALCULATED_NUMBER_FORMAT_AUTO ", lookup: group %d, after %d, before %d, options %u, dimensions '%s', update every %d, calculation '%s', warning '%s', critical '%s', source '%s', delay up %d, delay down %d, delay max %d, delay_multiplier %f", + rt->name, + (rt->context)?rt->context:"NONE", + (rt->exec)?rt->exec:"DEFAULT", + (rt->recipient)?rt->recipient:"DEFAULT", + rt->green, + rt->red, + (int)rt->group, + rt->after, + rt->before, + rt->options, + (rt->dimensions)?rt->dimensions:"NONE", + rt->update_every, + (rt->calculation)?rt->calculation->parsed_as:"NONE", + (rt->warning)?rt->warning->parsed_as:"NONE", + (rt->critical)?rt->critical->parsed_as:"NONE", + rt->source, + rt->delay_up_duration, + rt->delay_down_duration, + rt->delay_max_duration, + rt->delay_multiplier + ); + + if(likely(last)) { + last->next = rt; + } + else { + rt->next = host->templates; + host->templates = rt; + } + + return 1; +} + +static inline int health_parse_duration(char *string, int *result) { + // make sure it is a number + if(!*string || !(isdigit(*string) || *string == '+' || *string == '-')) { + *result = 0; + return 0; + } + + char *e = NULL; + calculated_number n 
= str2ld(string, &e); + if(e && *e) { + switch (*e) { + case 'Y': + *result = (int) (n * 86400 * 365); + break; + case 'M': + *result = (int) (n * 86400 * 30); + break; + case 'w': + *result = (int) (n * 86400 * 7); + break; + case 'd': + *result = (int) (n * 86400); + break; + case 'h': + *result = (int) (n * 3600); + break; + case 'm': + *result = (int) (n * 60); + break; + + default: + case 's': + *result = (int) (n); + break; + } + } + else + *result = (int)(n); + + return 1; +} + +static inline int health_parse_delay( + size_t line, const char *filename, char *string, + int *delay_up_duration, + int *delay_down_duration, + int *delay_max_duration, + float *delay_multiplier) { + + char given_up = 0; + char given_down = 0; + char given_max = 0; + char given_multiplier = 0; + + char *s = string; + while(*s) { + char *key = s; + + while(*s && !isspace(*s)) s++; + while(*s && isspace(*s)) *s++ = '\0'; + + if(!*key) break; + + char *value = s; + while(*s && !isspace(*s)) s++; + while(*s && isspace(*s)) *s++ = '\0'; + + if(!strcasecmp(key, "up")) { + if (!health_parse_duration(value, delay_up_duration)) { + error("Health configuration at line %zu of file '%s': invalid value '%s' for '%s' keyword", + line, filename, value, key); + } + else given_up = 1; + } + else if(!strcasecmp(key, "down")) { + if (!health_parse_duration(value, delay_down_duration)) { + error("Health configuration at line %zu of file '%s': invalid value '%s' for '%s' keyword", + line, filename, value, key); + } + else given_down = 1; + } + else if(!strcasecmp(key, "multiplier")) { + *delay_multiplier = strtof(value, NULL); + if(isnan(*delay_multiplier) || isinf(*delay_multiplier) || islessequal(*delay_multiplier, 0)) { + error("Health configuration at line %zu of file '%s': invalid value '%s' for '%s' keyword", + line, filename, value, key); + } + else given_multiplier = 1; + } + else if(!strcasecmp(key, "max")) { + if (!health_parse_duration(value, delay_max_duration)) { + error("Health 
configuration at line %zu of file '%s': invalid value '%s' for '%s' keyword", + line, filename, value, key); + } + else given_max = 1; + } + else { + error("Health configuration at line %zu of file '%s': unknown keyword '%s'", + line, filename, key); + } + } + + if(!given_up) + *delay_up_duration = 0; + + if(!given_down) + *delay_down_duration = 0; + + if(!given_multiplier) + *delay_multiplier = 1.0; + + if(!given_max) { + if((*delay_max_duration) < (*delay_up_duration) * (*delay_multiplier)) + *delay_max_duration = (int)((*delay_up_duration) * (*delay_multiplier)); + + if((*delay_max_duration) < (*delay_down_duration) * (*delay_multiplier)) + *delay_max_duration = (int)((*delay_down_duration) * (*delay_multiplier)); + } + + return 1; +} + +static inline uint32_t health_parse_options(const char *s) { + uint32_t options = 0; + char buf[100+1] = ""; + + while(*s) { + buf[0] = '\0'; + + // skip spaces + while(*s && isspace(*s)) + s++; + + // find the next space + size_t count = 0; + while(*s && count < 100 && !isspace(*s)) + buf[count++] = *s++; + + if(buf[0]) { + buf[count] = '\0'; + + if(!strcasecmp(buf, "no-clear-notification") || !strcasecmp(buf, "no-clear")) + options |= RRDCALC_FLAG_NO_CLEAR_NOTIFICATION; + else + error("Ignoring unknown alarm option '%s'", buf); + } + } + + return options; +} + +static inline int health_parse_db_lookup( + size_t line, const char *filename, char *string, + RRDR_GROUPING *group_method, int *after, int *before, int *every, + uint32_t *options, char **dimensions +) { + debug(D_HEALTH, "Health configuration parsing database lookup %zu@%s: %s", line, filename, string); + + if(*dimensions) freez(*dimensions); + *dimensions = NULL; + *after = 0; + *before = 0; + *every = 0; + *options = 0; + + char *s = string, *key; + + // first is the group method + key = s; + while(*s && !isspace(*s)) s++; + while(*s && isspace(*s)) *s++ = '\0'; + if(!*s) { + error("Health configuration invalid chart calculation at line %zu of file '%s': expected 
group method followed by the 'after' time, but got '%s'", + line, filename, key); + return 0; + } + + if((*group_method = web_client_api_request_v1_data_group(key, RRDR_GROUPING_UNDEFINED)) == RRDR_GROUPING_UNDEFINED) { + error("Health configuration at line %zu of file '%s': invalid group method '%s'", + line, filename, key); + return 0; + } + + // then is the 'after' time + key = s; + while(*s && !isspace(*s)) s++; + while(*s && isspace(*s)) *s++ = '\0'; + + if(!health_parse_duration(key, after)) { + error("Health configuration at line %zu of file '%s': invalid duration '%s' after group method", + line, filename, key); + return 0; + } + + // sane defaults + *every = abs(*after); + + // now we may have optional parameters + while(*s) { + key = s; + while(*s && !isspace(*s)) s++; + while(*s && isspace(*s)) *s++ = '\0'; + if(!*key) break; + + if(!strcasecmp(key, "at")) { + char *value = s; + while(*s && !isspace(*s)) s++; + while(*s && isspace(*s)) *s++ = '\0'; + + if (!health_parse_duration(value, before)) { + error("Health configuration at line %zu of file '%s': invalid duration '%s' for '%s' keyword", + line, filename, value, key); + } + } + else if(!strcasecmp(key, HEALTH_EVERY_KEY)) { + char *value = s; + while(*s && !isspace(*s)) s++; + while(*s && isspace(*s)) *s++ = '\0'; + + if (!health_parse_duration(value, every)) { + error("Health configuration at line %zu of file '%s': invalid duration '%s' for '%s' keyword", + line, filename, value, key); + } + } + else if(!strcasecmp(key, "absolute") || !strcasecmp(key, "abs") || !strcasecmp(key, "absolute_sum")) { + *options |= RRDR_OPTION_ABSOLUTE; + } + else if(!strcasecmp(key, "min2max")) { + *options |= RRDR_OPTION_MIN2MAX; + } + else if(!strcasecmp(key, "null2zero")) { + *options |= RRDR_OPTION_NULL2ZERO; + } + else if(!strcasecmp(key, "percentage")) { + *options |= RRDR_OPTION_PERCENTAGE; + } + else if(!strcasecmp(key, "unaligned")) { + *options |= RRDR_OPTION_NOT_ALIGNED; + } + else if(!strcasecmp(key, 
"match-ids") || !strcasecmp(key, "match_ids")) { + *options |= RRDR_OPTION_MATCH_IDS; + } + else if(!strcasecmp(key, "match-names") || !strcasecmp(key, "match_names")) { + *options |= RRDR_OPTION_MATCH_NAMES; + } + else if(!strcasecmp(key, "of")) { + if(*s && strcasecmp(s, "all") != 0) + *dimensions = strdupz(s); + break; + } + else { + error("Health configuration at line %zu of file '%s': unknown keyword '%s'", + line, filename, key); + } + } + + return 1; +} + +static inline char *health_source_file(size_t line, const char *file) { + char buffer[FILENAME_MAX + 1]; + snprintfz(buffer, FILENAME_MAX, "%zu@%s", line, file); + return strdupz(buffer); +} + +static inline void strip_quotes(char *s) { + while(*s) { + if(*s == '\'' || *s == '"') *s = ' '; + s++; + } +} + +static int health_readfile(const char *filename, void *data) { + RRDHOST *host = (RRDHOST *)data; + + debug(D_HEALTH, "Health configuration reading file '%s'", filename); + + static uint32_t + hash_alarm = 0, + hash_template = 0, + hash_os = 0, + hash_on = 0, + hash_host = 0, + hash_families = 0, + hash_calc = 0, + hash_green = 0, + hash_red = 0, + hash_warn = 0, + hash_crit = 0, + hash_exec = 0, + hash_every = 0, + hash_lookup = 0, + hash_units = 0, + hash_info = 0, + hash_recipient = 0, + hash_delay = 0, + hash_options = 0; + + char buffer[HEALTH_CONF_MAX_LINE + 1]; + + if(unlikely(!hash_alarm)) { + hash_alarm = simple_uhash(HEALTH_ALARM_KEY); + hash_template = simple_uhash(HEALTH_TEMPLATE_KEY); + hash_on = simple_uhash(HEALTH_ON_KEY); + hash_os = simple_uhash(HEALTH_OS_KEY); + hash_host = simple_uhash(HEALTH_HOST_KEY); + hash_families = simple_uhash(HEALTH_FAMILIES_KEY); + hash_calc = simple_uhash(HEALTH_CALC_KEY); + hash_lookup = simple_uhash(HEALTH_LOOKUP_KEY); + hash_green = simple_uhash(HEALTH_GREEN_KEY); + hash_red = simple_uhash(HEALTH_RED_KEY); + hash_warn = simple_uhash(HEALTH_WARN_KEY); + hash_crit = simple_uhash(HEALTH_CRIT_KEY); + hash_exec = simple_uhash(HEALTH_EXEC_KEY); + hash_every = 
simple_uhash(HEALTH_EVERY_KEY); + hash_units = simple_hash(HEALTH_UNITS_KEY); + hash_info = simple_hash(HEALTH_INFO_KEY); + hash_recipient = simple_hash(HEALTH_RECIPIENT_KEY); + hash_delay = simple_uhash(HEALTH_DELAY_KEY); + hash_options = simple_uhash(HEALTH_OPTIONS_KEY); + } + + FILE *fp = fopen(filename, "r"); + if(!fp) { + error("Health configuration cannot read file '%s'.", filename); + return 0; + } + + RRDCALC *rc = NULL; + RRDCALCTEMPLATE *rt = NULL; + + int ignore_this = 0; + size_t line = 0, append = 0; + char *s; + while((s = fgets(&buffer[append], (int)(HEALTH_CONF_MAX_LINE - append), fp)) || append) { + int stop_appending = !s; + line++; + s = trim(buffer); + if(!s || *s == '#') continue; + + append = strlen(s); + if(!stop_appending && s[append - 1] == '\\') { + s[append - 1] = ' '; + append = &s[append] - buffer; + if(append < HEALTH_CONF_MAX_LINE) + continue; + else { + error("Health configuration has too long multi-line at line %zu of file '%s'.", line, filename); + } + } + append = 0; + + char *key = s; + while(*s && *s != ':') s++; + if(!*s) { + error("Health configuration has invalid line %zu of file '%s'. It does not contain a ':'. Ignoring it.", line, filename); + continue; + } + *s = '\0'; + s++; + + char *value = s; + key = trim_all(key); + value = trim_all(value); + + if(!key) { + error("Health configuration has invalid line %zu of file '%s'. Keyword is empty. Ignoring it.", line, filename); + continue; + } + + if(!value) { + error("Health configuration has invalid line %zu of file '%s'. value is empty. 
Ignoring it.", line, filename); + continue; + } + + uint32_t hash = simple_uhash(key); + + if(hash == hash_alarm && !strcasecmp(key, HEALTH_ALARM_KEY)) { + if (rc && (ignore_this || !rrdcalc_add_alarm_from_config(host, rc))) + rrdcalc_free(rc); + + if(rt) { + if (ignore_this || !rrdcalctemplate_add_template_from_config(host, rt)) + rrdcalctemplate_free(rt); + + rt = NULL; + } + + rc = callocz(1, sizeof(RRDCALC)); + rc->next_event_id = 1; + rc->name = strdupz(value); + rc->hash = simple_hash(rc->name); + rc->source = health_source_file(line, filename); + rc->green = NAN; + rc->red = NAN; + rc->value = NAN; + rc->old_value = NAN; + rc->delay_multiplier = 1.0; + + if(rrdvar_fix_name(rc->name)) + error("Health configuration renamed alarm '%s' to '%s'", value, rc->name); + + ignore_this = 0; + } + else if(hash == hash_template && !strcasecmp(key, HEALTH_TEMPLATE_KEY)) { + if(rc) { + if(ignore_this || !rrdcalc_add_alarm_from_config(host, rc)) + rrdcalc_free(rc); + + rc = NULL; + } + + if(rt && (ignore_this || !rrdcalctemplate_add_template_from_config(host, rt))) + rrdcalctemplate_free(rt); + + rt = callocz(1, sizeof(RRDCALCTEMPLATE)); + rt->name = strdupz(value); + rt->hash_name = simple_hash(rt->name); + rt->source = health_source_file(line, filename); + rt->green = NAN; + rt->red = NAN; + rt->delay_multiplier = 1.0; + + if(rrdvar_fix_name(rt->name)) + error("Health configuration renamed template '%s' to '%s'", value, rt->name); + + ignore_this = 0; + } + else if(hash == hash_os && !strcasecmp(key, HEALTH_OS_KEY)) { + char *os_match = value; + SIMPLE_PATTERN *os_pattern = simple_pattern_create(os_match, NULL, SIMPLE_PATTERN_EXACT); + + if(!simple_pattern_matches(os_pattern, host->os)) { + if(rc) + debug(D_HEALTH, "HEALTH on '%s' ignoring alarm '%s' defined at %zu@%s: host O/S does not match '%s'", host->hostname, rc->name, line, filename, os_match); + + if(rt) + debug(D_HEALTH, "HEALTH on '%s' ignoring template '%s' defined at %zu@%s: host O/S does not match '%s'", 
host->hostname, rt->name, line, filename, os_match); + + ignore_this = 1; + } + + simple_pattern_free(os_pattern); + } + else if(hash == hash_host && !strcasecmp(key, HEALTH_HOST_KEY)) { + char *host_match = value; + SIMPLE_PATTERN *host_pattern = simple_pattern_create(host_match, NULL, SIMPLE_PATTERN_EXACT); + + if(!simple_pattern_matches(host_pattern, host->hostname)) { + if(rc) + debug(D_HEALTH, "HEALTH on '%s' ignoring alarm '%s' defined at %zu@%s: hostname does not match '%s'", host->hostname, rc->name, line, filename, host_match); + + if(rt) + debug(D_HEALTH, "HEALTH on '%s' ignoring template '%s' defined at %zu@%s: hostname does not match '%s'", host->hostname, rt->name, line, filename, host_match); + + ignore_this = 1; + } + + simple_pattern_free(host_pattern); + } + else if(rc) { + if(hash == hash_on && !strcasecmp(key, HEALTH_ON_KEY)) { + if(rc->chart) { + if(strcmp(rc->chart, value) != 0) + error("Health configuration at line %zu of file '%s' for alarm '%s' has key '%s' twice, once with value '%s' and later with value '%s'. 
Using ('%s').", + line, filename, rc->name, key, rc->chart, value, value); + + freez(rc->chart); + } + rc->chart = strdupz(value); + rc->hash_chart = simple_hash(rc->chart); + } + else if(hash == hash_lookup && !strcasecmp(key, HEALTH_LOOKUP_KEY)) { + health_parse_db_lookup(line, filename, value, &rc->group, &rc->after, &rc->before, + &rc->update_every, + &rc->options, &rc->dimensions); + } + else if(hash == hash_every && !strcasecmp(key, HEALTH_EVERY_KEY)) { + if(!health_parse_duration(value, &rc->update_every)) + error("Health configuration at line %zu of file '%s' for alarm '%s' at key '%s' cannot parse duration: '%s'.", + line, filename, rc->name, key, value); + } + else if(hash == hash_green && !strcasecmp(key, HEALTH_GREEN_KEY)) { + char *e; + rc->green = str2ld(value, &e); + if(e && *e) { + error("Health configuration at line %zu of file '%s' for alarm '%s' at key '%s' leaves this string unmatched: '%s'.", + line, filename, rc->name, key, e); + } + } + else if(hash == hash_red && !strcasecmp(key, HEALTH_RED_KEY)) { + char *e; + rc->red = str2ld(value, &e); + if(e && *e) { + error("Health configuration at line %zu of file '%s' for alarm '%s' at key '%s' leaves this string unmatched: '%s'.", + line, filename, rc->name, key, e); + } + } + else if(hash == hash_calc && !strcasecmp(key, HEALTH_CALC_KEY)) { + const char *failed_at = NULL; + int error = 0; + rc->calculation = expression_parse(value, &failed_at, &error); + if(!rc->calculation) { + error("Health configuration at line %zu of file '%s' for alarm '%s' at key '%s' has unparse-able expression '%s': %s at '%s'", + line, filename, rc->name, key, value, expression_strerror(error), failed_at); + } + } + else if(hash == hash_warn && !strcasecmp(key, HEALTH_WARN_KEY)) { + const char *failed_at = NULL; + int error = 0; + rc->warning = expression_parse(value, &failed_at, &error); + if(!rc->warning) { + error("Health configuration at line %zu of file '%s' for alarm '%s' at key '%s' has unparse-able expression '%s': 
%s at '%s'", + line, filename, rc->name, key, value, expression_strerror(error), failed_at); + } + } + else if(hash == hash_crit && !strcasecmp(key, HEALTH_CRIT_KEY)) { + const char *failed_at = NULL; + int error = 0; + rc->critical = expression_parse(value, &failed_at, &error); + if(!rc->critical) { + error("Health configuration at line %zu of file '%s' for alarm '%s' at key '%s' has unparse-able expression '%s': %s at '%s'", + line, filename, rc->name, key, value, expression_strerror(error), failed_at); + } + } + else if(hash == hash_exec && !strcasecmp(key, HEALTH_EXEC_KEY)) { + if(rc->exec) { + if(strcmp(rc->exec, value) != 0) + error("Health configuration at line %zu of file '%s' for alarm '%s' has key '%s' twice, once with value '%s' and later with value '%s'. Using ('%s').", + line, filename, rc->name, key, rc->exec, value, value); + + freez(rc->exec); + } + rc->exec = strdupz(value); + } + else if(hash == hash_recipient && !strcasecmp(key, HEALTH_RECIPIENT_KEY)) { + if(rc->recipient) { + if(strcmp(rc->recipient, value) != 0) + error("Health configuration at line %zu of file '%s' for alarm '%s' has key '%s' twice, once with value '%s' and later with value '%s'. Using ('%s').", + line, filename, rc->name, key, rc->recipient, value, value); + + freez(rc->recipient); + } + rc->recipient = strdupz(value); + } + else if(hash == hash_units && !strcasecmp(key, HEALTH_UNITS_KEY)) { + if(rc->units) { + if(strcmp(rc->units, value) != 0) + error("Health configuration at line %zu of file '%s' for alarm '%s' has key '%s' twice, once with value '%s' and later with value '%s'. 
Using ('%s').", + line, filename, rc->name, key, rc->units, value, value); + + freez(rc->units); + } + rc->units = strdupz(value); + strip_quotes(rc->units); + } + else if(hash == hash_info && !strcasecmp(key, HEALTH_INFO_KEY)) { + if(rc->info) { + if(strcmp(rc->info, value) != 0) + error("Health configuration at line %zu of file '%s' for alarm '%s' has key '%s' twice, once with value '%s' and later with value '%s'. Using ('%s').", + line, filename, rc->name, key, rc->info, value, value); + + freez(rc->info); + } + rc->info = strdupz(value); + strip_quotes(rc->info); + } + else if(hash == hash_delay && !strcasecmp(key, HEALTH_DELAY_KEY)) { + health_parse_delay(line, filename, value, &rc->delay_up_duration, &rc->delay_down_duration, &rc->delay_max_duration, &rc->delay_multiplier); + } + else if(hash == hash_options && !strcasecmp(key, HEALTH_OPTIONS_KEY)) { + rc->options |= health_parse_options(value); + } + else { + error("Health configuration at line %zu of file '%s' for alarm '%s' has unknown key '%s'.", + line, filename, rc->name, key); + } + } + else if(rt) { + if(hash == hash_on && !strcasecmp(key, HEALTH_ON_KEY)) { + if(rt->context) { + if(strcmp(rt->context, value) != 0) + error("Health configuration at line %zu of file '%s' for template '%s' has key '%s' twice, once with value '%s' and later with value '%s'. 
Using ('%s').", + line, filename, rt->name, key, rt->context, value, value); + + freez(rt->context); + } + rt->context = strdupz(value); + rt->hash_context = simple_hash(rt->context); + } + else if(hash == hash_families && !strcasecmp(key, HEALTH_FAMILIES_KEY)) { + freez(rt->family_match); + simple_pattern_free(rt->family_pattern); + + rt->family_match = strdupz(value); + rt->family_pattern = simple_pattern_create(rt->family_match, NULL, SIMPLE_PATTERN_EXACT); + } + else if(hash == hash_lookup && !strcasecmp(key, HEALTH_LOOKUP_KEY)) { + health_parse_db_lookup(line, filename, value, &rt->group, &rt->after, &rt->before, + &rt->update_every, &rt->options, &rt->dimensions); + } + else if(hash == hash_every && !strcasecmp(key, HEALTH_EVERY_KEY)) { + if(!health_parse_duration(value, &rt->update_every)) + error("Health configuration at line %zu of file '%s' for template '%s' at key '%s' cannot parse duration: '%s'.", + line, filename, rt->name, key, value); + } + else if(hash == hash_green && !strcasecmp(key, HEALTH_GREEN_KEY)) { + char *e; + rt->green = str2ld(value, &e); + if(e && *e) { + error("Health configuration at line %zu of file '%s' for template '%s' at key '%s' leaves this string unmatched: '%s'.", + line, filename, rt->name, key, e); + } + } + else if(hash == hash_red && !strcasecmp(key, HEALTH_RED_KEY)) { + char *e; + rt->red = str2ld(value, &e); + if(e && *e) { + error("Health configuration at line %zu of file '%s' for template '%s' at key '%s' leaves this string unmatched: '%s'.", + line, filename, rt->name, key, e); + } + } + else if(hash == hash_calc && !strcasecmp(key, HEALTH_CALC_KEY)) { + const char *failed_at = NULL; + int error = 0; + rt->calculation = expression_parse(value, &failed_at, &error); + if(!rt->calculation) { + error("Health configuration at line %zu of file '%s' for template '%s' at key '%s' has unparse-able expression '%s': %s at '%s'", + line, filename, rt->name, key, value, expression_strerror(error), failed_at); + } + } + else 
if(hash == hash_warn && !strcasecmp(key, HEALTH_WARN_KEY)) { + const char *failed_at = NULL; + int error = 0; + rt->warning = expression_parse(value, &failed_at, &error); + if(!rt->warning) { + error("Health configuration at line %zu of file '%s' for template '%s' at key '%s' has unparse-able expression '%s': %s at '%s'", + line, filename, rt->name, key, value, expression_strerror(error), failed_at); + } + } + else if(hash == hash_crit && !strcasecmp(key, HEALTH_CRIT_KEY)) { + const char *failed_at = NULL; + int error = 0; + rt->critical = expression_parse(value, &failed_at, &error); + if(!rt->critical) { + error("Health configuration at line %zu of file '%s' for template '%s' at key '%s' has unparse-able expression '%s': %s at '%s'", + line, filename, rt->name, key, value, expression_strerror(error), failed_at); + } + } + else if(hash == hash_exec && !strcasecmp(key, HEALTH_EXEC_KEY)) { + if(rt->exec) { + if(strcmp(rt->exec, value) != 0) + error("Health configuration at line %zu of file '%s' for template '%s' has key '%s' twice, once with value '%s' and later with value '%s'. Using ('%s').", + line, filename, rt->name, key, rt->exec, value, value); + + freez(rt->exec); + } + rt->exec = strdupz(value); + } + else if(hash == hash_recipient && !strcasecmp(key, HEALTH_RECIPIENT_KEY)) { + if(rt->recipient) { + if(strcmp(rt->recipient, value) != 0) + error("Health configuration at line %zu of file '%s' for template '%s' has key '%s' twice, once with value '%s' and later with value '%s'. Using ('%s').", + line, filename, rt->name, key, rt->recipient, value, value); + + freez(rt->recipient); + } + rt->recipient = strdupz(value); + } + else if(hash == hash_units && !strcasecmp(key, HEALTH_UNITS_KEY)) { + if(rt->units) { + if(strcmp(rt->units, value) != 0) + error("Health configuration at line %zu of file '%s' for template '%s' has key '%s' twice, once with value '%s' and later with value '%s'. 
Using ('%s').", + line, filename, rt->name, key, rt->units, value, value); + + freez(rt->units); + } + rt->units = strdupz(value); + strip_quotes(rt->units); + } + else if(hash == hash_info && !strcasecmp(key, HEALTH_INFO_KEY)) { + if(rt->info) { + if(strcmp(rt->info, value) != 0) + error("Health configuration at line %zu of file '%s' for template '%s' has key '%s' twice, once with value '%s' and later with value '%s'. Using ('%s').", + line, filename, rt->name, key, rt->info, value, value); + + freez(rt->info); + } + rt->info = strdupz(value); + strip_quotes(rt->info); + } + else if(hash == hash_delay && !strcasecmp(key, HEALTH_DELAY_KEY)) { + health_parse_delay(line, filename, value, &rt->delay_up_duration, &rt->delay_down_duration, &rt->delay_max_duration, &rt->delay_multiplier); + } + else if(hash == hash_options && !strcasecmp(key, HEALTH_OPTIONS_KEY)) { + rt->options |= health_parse_options(value); + } + else { + error("Health configuration at line %zu of file '%s' for template '%s' has unknown key '%s'.", + line, filename, rt->name, key); + } + } + else { + error("Health configuration at line %zu of file '%s' has unknown key '%s'. 
Expected either '" HEALTH_ALARM_KEY "' or '" HEALTH_TEMPLATE_KEY "'.", + line, filename, key); + } + } + + if(rc && (ignore_this || !rrdcalc_add_alarm_from_config(host, rc))) + rrdcalc_free(rc); + + if(rt && (ignore_this || !rrdcalctemplate_add_template_from_config(host, rt))) + rrdcalctemplate_free(rt); + + fclose(fp); + return 1; +} + +void health_readdir(RRDHOST *host, const char *user_path, const char *stock_path, const char *subpath) { + if(unlikely(!host->health_enabled)) { + debug(D_HEALTH, "CONFIG health is not enabled for host '%s'", host->hostname); + return; + } + recursive_config_double_dir_load(user_path, stock_path, subpath, health_readfile, (void *) host, 0); +} diff --git a/health/health_json.c b/health/health_json.c new file mode 100644 index 0000000..7811324 --- /dev/null +++ b/health/health_json.c @@ -0,0 +1,267 @@ +// SPDX-License-Identifier: GPL-3.0-or-later + +#include "health.h" + +static inline void health_string2json(BUFFER *wb, const char *prefix, const char *label, const char *value, const char *suffix) { + if(value && *value) { + buffer_sprintf(wb, "%s\"%s\":\"", prefix, label); + buffer_strcat_htmlescape(wb, value); + buffer_strcat(wb, "\""); + buffer_strcat(wb, suffix); + } + else + buffer_sprintf(wb, "%s\"%s\":null%s", prefix, label, suffix); +} + +static inline void health_alarm_entry2json_nolock(BUFFER *wb, ALARM_ENTRY *ae, RRDHOST *host) { + buffer_sprintf(wb, + "\n\t{\n" + "\t\t\"hostname\": \"%s\",\n" + "\t\t\"unique_id\": %u,\n" + "\t\t\"alarm_id\": %u,\n" + "\t\t\"alarm_event_id\": %u,\n" + "\t\t\"name\": \"%s\",\n" + "\t\t\"chart\": \"%s\",\n" + "\t\t\"family\": \"%s\",\n" + "\t\t\"processed\": %s,\n" + "\t\t\"updated\": %s,\n" + "\t\t\"exec_run\": %lu,\n" + "\t\t\"exec_failed\": %s,\n" + "\t\t\"exec\": \"%s\",\n" + "\t\t\"recipient\": \"%s\",\n" + "\t\t\"exec_code\": %d,\n" + "\t\t\"source\": \"%s\",\n" + "\t\t\"units\": \"%s\",\n" + "\t\t\"when\": %lu,\n" + "\t\t\"duration\": %lu,\n" + "\t\t\"non_clear_duration\": %lu,\n" + 
"\t\t\"status\": \"%s\",\n" + "\t\t\"old_status\": \"%s\",\n" + "\t\t\"delay\": %d,\n" + "\t\t\"delay_up_to_timestamp\": %lu,\n" + "\t\t\"updated_by_id\": %u,\n" + "\t\t\"updates_id\": %u,\n" + "\t\t\"value_string\": \"%s\",\n" + "\t\t\"old_value_string\": \"%s\",\n" + "\t\t\"silenced\": \"%s\",\n" + , host->hostname + , ae->unique_id + , ae->alarm_id + , ae->alarm_event_id + , ae->name + , ae->chart + , ae->family + , (ae->flags & HEALTH_ENTRY_FLAG_PROCESSED)?"true":"false" + , (ae->flags & HEALTH_ENTRY_FLAG_UPDATED)?"true":"false" + , (unsigned long)ae->exec_run_timestamp + , (ae->flags & HEALTH_ENTRY_FLAG_EXEC_FAILED)?"true":"false" + , ae->exec?ae->exec:host->health_default_exec + , ae->recipient?ae->recipient:host->health_default_recipient + , ae->exec_code + , ae->source + , ae->units?ae->units:"" + , (unsigned long)ae->when + , (unsigned long)ae->duration + , (unsigned long)ae->non_clear_duration + , rrdcalc_status2string(ae->new_status) + , rrdcalc_status2string(ae->old_status) + , ae->delay + , (unsigned long)ae->delay_up_to_timestamp + , ae->updated_by_id + , ae->updates_id + , ae->new_value_string + , ae->old_value_string + , (ae->flags & HEALTH_ENTRY_FLAG_SILENCED)?"true":"false" + ); + + health_string2json(wb, "\t\t", "info", ae->info?ae->info:"", ",\n"); + + if(unlikely(ae->flags & HEALTH_ENTRY_FLAG_NO_CLEAR_NOTIFICATION)) { + buffer_strcat(wb, "\t\t\"no_clear_notification\": true,\n"); + } + + buffer_strcat(wb, "\t\t\"value\":"); + buffer_rrd_value(wb, ae->new_value); + buffer_strcat(wb, ",\n"); + + buffer_strcat(wb, "\t\t\"old_value\":"); + buffer_rrd_value(wb, ae->old_value); + buffer_strcat(wb, "\n"); + + buffer_strcat(wb, "\t}"); +} + +void health_alarm_log2json(RRDHOST *host, BUFFER *wb, uint32_t after) { + netdata_rwlock_rdlock(&host->health_log.alarm_log_rwlock); + + buffer_strcat(wb, "["); + + unsigned int max = host->health_log.max; + unsigned int count = 0; + ALARM_ENTRY *ae; + for(ae = host->health_log.alarms; ae && count < max ; count++, 
ae = ae->next) { + if(ae->unique_id > after) { + if(likely(count)) buffer_strcat(wb, ","); + health_alarm_entry2json_nolock(wb, ae, host); + } + } + + buffer_strcat(wb, "\n]\n"); + + netdata_rwlock_unlock(&host->health_log.alarm_log_rwlock); +} + +static inline void health_rrdcalc2json_nolock(RRDHOST *host, BUFFER *wb, RRDCALC *rc) { + char value_string[100 + 1]; + format_value_and_unit(value_string, 100, rc->value, rc->units, -1); + + buffer_sprintf(wb, + "\t\t\"%s.%s\": {\n" + "\t\t\t\"id\": %lu,\n" + "\t\t\t\"name\": \"%s\",\n" + "\t\t\t\"chart\": \"%s\",\n" + "\t\t\t\"family\": \"%s\",\n" + "\t\t\t\"active\": %s,\n" + "\t\t\t\"disabled\": %s,\n" + "\t\t\t\"silenced\": %s,\n" + "\t\t\t\"exec\": \"%s\",\n" + "\t\t\t\"recipient\": \"%s\",\n" + "\t\t\t\"source\": \"%s\",\n" + "\t\t\t\"units\": \"%s\",\n" + "\t\t\t\"info\": \"%s\",\n" + "\t\t\t\"status\": \"%s\",\n" + "\t\t\t\"last_status_change\": %lu,\n" + "\t\t\t\"last_updated\": %lu,\n" + "\t\t\t\"next_update\": %lu,\n" + "\t\t\t\"update_every\": %d,\n" + "\t\t\t\"delay_up_duration\": %d,\n" + "\t\t\t\"delay_down_duration\": %d,\n" + "\t\t\t\"delay_max_duration\": %d,\n" + "\t\t\t\"delay_multiplier\": %f,\n" + "\t\t\t\"delay\": %d,\n" + "\t\t\t\"delay_up_to_timestamp\": %lu,\n" + "\t\t\t\"value_string\": \"%s\",\n" + , rc->chart, rc->name + , (unsigned long)rc->id + , rc->name + , rc->chart + , (rc->rrdset && rc->rrdset->family)?rc->rrdset->family:"" + , (rc->rrdset)?"true":"false" + , (rc->rrdcalc_flags & RRDCALC_FLAG_DISABLED)?"true":"false" + , (rc->rrdcalc_flags & RRDCALC_FLAG_SILENCED)?"true":"false" + , rc->exec?rc->exec:host->health_default_exec + , rc->recipient?rc->recipient:host->health_default_recipient + , rc->source + , rc->units?rc->units:"" + , rc->info?rc->info:"" + , rrdcalc_status2string(rc->status) + , (unsigned long)rc->last_status_change + , (unsigned long)rc->last_updated + , (unsigned long)rc->next_update + , rc->update_every + , rc->delay_up_duration + , rc->delay_down_duration + , 
rc->delay_max_duration + , rc->delay_multiplier + , rc->delay_last + , (unsigned long)rc->delay_up_to_timestamp + , value_string + ); + + if(unlikely(rc->options & RRDCALC_FLAG_NO_CLEAR_NOTIFICATION)) { + buffer_strcat(wb, "\t\t\t\"no_clear_notification\": true,\n"); + } + + if(RRDCALC_HAS_DB_LOOKUP(rc)) { + if(rc->dimensions && *rc->dimensions) + health_string2json(wb, "\t\t\t", "lookup_dimensions", rc->dimensions, ",\n"); + + buffer_sprintf(wb, + "\t\t\t\"db_after\": %lu,\n" + "\t\t\t\"db_before\": %lu,\n" + "\t\t\t\"lookup_method\": \"%s\",\n" + "\t\t\t\"lookup_after\": %d,\n" + "\t\t\t\"lookup_before\": %d,\n" + "\t\t\t\"lookup_options\": \"", + (unsigned long) rc->db_after, + (unsigned long) rc->db_before, + group_method2string(rc->group), + rc->after, + rc->before + ); + buffer_data_options2string(wb, rc->options); + buffer_strcat(wb, "\",\n"); + } + + if(rc->calculation) { + health_string2json(wb, "\t\t\t", "calc", rc->calculation->source, ",\n"); + health_string2json(wb, "\t\t\t", "calc_parsed", rc->calculation->parsed_as, ",\n"); + } + + if(rc->warning) { + health_string2json(wb, "\t\t\t", "warn", rc->warning->source, ",\n"); + health_string2json(wb, "\t\t\t", "warn_parsed", rc->warning->parsed_as, ",\n"); + } + + if(rc->critical) { + health_string2json(wb, "\t\t\t", "crit", rc->critical->source, ",\n"); + health_string2json(wb, "\t\t\t", "crit_parsed", rc->critical->parsed_as, ",\n"); + } + + buffer_strcat(wb, "\t\t\t\"green\":"); + buffer_rrd_value(wb, rc->green); + buffer_strcat(wb, ",\n"); + + buffer_strcat(wb, "\t\t\t\"red\":"); + buffer_rrd_value(wb, rc->red); + buffer_strcat(wb, ",\n"); + + buffer_strcat(wb, "\t\t\t\"value\":"); + buffer_rrd_value(wb, rc->value); + buffer_strcat(wb, "\n"); + + buffer_strcat(wb, "\t\t}"); +} + +//void health_rrdcalctemplate2json_nolock(BUFFER *wb, RRDCALCTEMPLATE *rt) { +// +//} + +void health_alarms2json(RRDHOST *host, BUFFER *wb, int all) { + int i; + + rrdhost_rdlock(host); + buffer_sprintf(wb, "{\n\t\"hostname\": 
\"%s\"," + "\n\t\"latest_alarm_log_unique_id\": %u," + "\n\t\"status\": %s," + "\n\t\"now\": %lu," + "\n\t\"alarms\": {\n", + host->hostname, + (host->health_log.next_log_id > 0)?(host->health_log.next_log_id - 1):0, + host->health_enabled?"true":"false", + (unsigned long)now_realtime_sec()); + + RRDCALC *rc; + for(i = 0, rc = host->alarms; rc ; rc = rc->next) { + if(unlikely(!rc->rrdset || !rc->rrdset->last_collected_time.tv_sec)) + continue; + + if(likely(!all && !(rc->status == RRDCALC_STATUS_WARNING || rc->status == RRDCALC_STATUS_CRITICAL))) + continue; + + if(likely(i)) buffer_strcat(wb, ",\n"); + health_rrdcalc2json_nolock(host, wb, rc); + i++; + } + +// buffer_strcat(wb, "\n\t},\n\t\"templates\": {"); +// RRDCALCTEMPLATE *rt; +// for(rt = host->templates; rt ; rt = rt->next) +// health_rrdcalctemplate2json_nolock(wb, rt); + + buffer_strcat(wb, "\n\t}\n}\n"); + rrdhost_unlock(host); +} + + + diff --git a/health/health_log.c b/health/health_log.c new file mode 100644 index 0000000..009e426 --- /dev/null +++ b/health/health_log.c @@ -0,0 +1,463 @@ +// SPDX-License-Identifier: GPL-3.0-or-later + +#include "health.h" + +// ---------------------------------------------------------------------------- +// health alarm log load/save +// no need for locking - only one thread is reading / writing the alarms log + +inline int health_alarm_log_open(RRDHOST *host) { + if(host->health_log_fp) + fclose(host->health_log_fp); + + host->health_log_fp = fopen(host->health_log_filename, "a"); + + if(host->health_log_fp) { + if (setvbuf(host->health_log_fp, NULL, _IOLBF, 0) != 0) + error("HEALTH [%s]: cannot set line buffering on health log file '%s'.", host->hostname, host->health_log_filename); + return 0; + } + + error("HEALTH [%s]: cannot open health log file '%s'. 
Health data will be lost in case of netdata or server crash.", host->hostname, host->health_log_filename); + return -1; +} + +inline void health_alarm_log_close(RRDHOST *host) { + if(host->health_log_fp) { + fclose(host->health_log_fp); + host->health_log_fp = NULL; + } +} + +inline void health_log_rotate(RRDHOST *host) { + static size_t rotate_every = 0; + + if(unlikely(rotate_every == 0)) { + rotate_every = (size_t)config_get_number(CONFIG_SECTION_HEALTH, "rotate log every lines", 2000); + if(rotate_every < 100) rotate_every = 100; + } + + if(unlikely(host->health_log_entries_written > rotate_every)) { + health_alarm_log_close(host); + + char old_filename[FILENAME_MAX + 1]; + snprintfz(old_filename, FILENAME_MAX, "%s.old", host->health_log_filename); + + if(unlink(old_filename) == -1 && errno != ENOENT) + error("HEALTH [%s]: cannot remove old alarms log file '%s'", host->hostname, old_filename); + + if(link(host->health_log_filename, old_filename) == -1 && errno != ENOENT) + error("HEALTH [%s]: cannot move file '%s' to '%s'.", host->hostname, host->health_log_filename, old_filename); + + if(unlink(host->health_log_filename) == -1 && errno != ENOENT) + error("HEALTH [%s]: cannot remove old alarms log file '%s'", host->hostname, host->health_log_filename); + + // open it with truncate + host->health_log_fp = fopen(host->health_log_filename, "w"); + + if(host->health_log_fp) + fclose(host->health_log_fp); + else + error("HEALTH [%s]: cannot truncate health log '%s'", host->hostname, host->health_log_filename); + + host->health_log_fp = NULL; + + host->health_log_entries_written = 0; + health_alarm_log_open(host); + } +} + +inline void health_alarm_log_save(RRDHOST *host, ALARM_ENTRY *ae) { + health_log_rotate(host); + + if(likely(host->health_log_fp)) { + if(unlikely(fprintf(host->health_log_fp + , "%c\t%s" + "\t%08x\t%08x\t%08x\t%08x\t%08x" + "\t%08x\t%08x\t%08x" + "\t%08x\t%08x\t%08x" + "\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s" + "\t%d\t%d\t%d\t%d" + "\t" 
CALCULATED_NUMBER_FORMAT_AUTO "\t" CALCULATED_NUMBER_FORMAT_AUTO + "\n" + , (ae->flags & HEALTH_ENTRY_FLAG_SAVED)?'U':'A' + , host->hostname + + , ae->unique_id + , ae->alarm_id + , ae->alarm_event_id + , ae->updated_by_id + , ae->updates_id + + , (uint32_t)ae->when + , (uint32_t)ae->duration + , (uint32_t)ae->non_clear_duration + , (uint32_t)ae->flags + , (uint32_t)ae->exec_run_timestamp + , (uint32_t)ae->delay_up_to_timestamp + + , (ae->name)?ae->name:"" + , (ae->chart)?ae->chart:"" + , (ae->family)?ae->family:"" + , (ae->exec)?ae->exec:"" + , (ae->recipient)?ae->recipient:"" + , (ae->source)?ae->source:"" + , (ae->units)?ae->units:"" + , (ae->info)?ae->info:"" + + , ae->exec_code + , ae->new_status + , ae->old_status + , ae->delay + + , ae->new_value + , ae->old_value + ) < 0)) + error("HEALTH [%s]: failed to save alarm log entry to '%s'. Health data may be lost in case of abnormal restart.", host->hostname, host->health_log_filename); + else { + ae->flags |= HEALTH_ENTRY_FLAG_SAVED; + host->health_log_entries_written++; + } + } +} + +inline ssize_t health_alarm_log_read(RRDHOST *host, FILE *fp, const char *filename) { + errno = 0; + + char *s, *buf = mallocz(65536 + 1); + size_t line = 0, len = 0; + ssize_t loaded = 0, updated = 0, errored = 0, duplicate = 0; + + netdata_rwlock_rdlock(&host->health_log.alarm_log_rwlock); + + while((s = fgets_trim_len(buf, 65536, fp, &len))) { + host->health_log_entries_written++; + line++; + + int max_entries = 30, entries = 0; + char *pointers[max_entries]; + + pointers[entries++] = s++; + while(*s) { + if(unlikely(*s == '\t')) { + *s = '\0'; + pointers[entries++] = ++s; + if(entries >= max_entries) { + error("HEALTH [%s]: line %zu of file '%s' has more than %d entries. 
Ignoring excessive entries.", host->hostname, line, filename, max_entries); + break; + } + } + else s++; + } + + if(likely(*pointers[0] == 'U' || *pointers[0] == 'A')) { + ALARM_ENTRY *ae = NULL; + + if(entries < 27) { + error("HEALTH [%s]: line %zu of file '%s' should have at least 27 entries, but it has %d. Ignoring it.", host->hostname, line, filename, entries); + errored++; + continue; + } + + // check that we have valid ids + uint32_t unique_id = (uint32_t)strtoul(pointers[2], NULL, 16); + if(!unique_id) { + error("HEALTH [%s]: line %zu of file '%s' states alarm entry with invalid unique id %u (%s). Ignoring it.", host->hostname, line, filename, unique_id, pointers[2]); + errored++; + continue; + } + + uint32_t alarm_id = (uint32_t)strtoul(pointers[3], NULL, 16); + if(!alarm_id) { + error("HEALTH [%s]: line %zu of file '%s' states alarm entry for invalid alarm id %u (%s). Ignoring it.", host->hostname, line, filename, alarm_id, pointers[3]); + errored++; + continue; + } + + if(unlikely(*pointers[0] == 'A')) { + // make sure it is properly numbered + if(unlikely(host->health_log.alarms && unique_id < host->health_log.alarms->unique_id)) { + error("HEALTH [%s]: line %zu of file '%s' has alarm log entry %u in wrong order. Ignoring it.", host->hostname, line, filename, unique_id); + errored++; + continue; + } + + ae = callocz(1, sizeof(ALARM_ENTRY)); + } + else if(unlikely(*pointers[0] == 'U')) { + // find the original + for(ae = host->health_log.alarms; ae; ae = ae->next) { + if(unlikely(unique_id == ae->unique_id)) { + if(unlikely(*pointers[0] == 'A')) { + error("HEALTH [%s]: line %zu of file '%s' adds duplicate alarm log entry %u. Using the later."
+ , host->hostname, line, filename, unique_id); + *pointers[0] = 'U'; + duplicate++; + } + break; + } + else if(unlikely(unique_id > ae->unique_id)) { + // no need to continue + // the linked list is sorted + ae = NULL; + break; + } + } + } + + // if not found, skip this line + if(unlikely(!ae)) { + // error("HEALTH [%s]: line %zu of file '%s' updates alarm log entry with unique id %u, but it is not found.", host->hostname, line, filename, unique_id); + continue; + } + + // check for a possible host mismatch + //if(strcmp(pointers[1], host->hostname)) + // error("HEALTH [%s]: line %zu of file '%s' provides an alarm for host '%s' but this is named '%s'.", host->hostname, line, filename, pointers[1], host->hostname); + + ae->unique_id = unique_id; + ae->alarm_id = alarm_id; + ae->alarm_event_id = (uint32_t)strtoul(pointers[4], NULL, 16); + ae->updated_by_id = (uint32_t)strtoul(pointers[5], NULL, 16); + ae->updates_id = (uint32_t)strtoul(pointers[6], NULL, 16); + + ae->when = (uint32_t)strtoul(pointers[7], NULL, 16); + ae->duration = (uint32_t)strtoul(pointers[8], NULL, 16); + ae->non_clear_duration = (uint32_t)strtoul(pointers[9], NULL, 16); + + ae->flags = (uint32_t)strtoul(pointers[10], NULL, 16); + ae->flags |= HEALTH_ENTRY_FLAG_SAVED; + + ae->exec_run_timestamp = (uint32_t)strtoul(pointers[11], NULL, 16); + ae->delay_up_to_timestamp = (uint32_t)strtoul(pointers[12], NULL, 16); + + freez(ae->name); + ae->name = strdupz(pointers[13]); + ae->hash_name = simple_hash(ae->name); + + freez(ae->chart); + ae->chart = strdupz(pointers[14]); + ae->hash_chart = simple_hash(ae->chart); + + freez(ae->family); + ae->family = strdupz(pointers[15]); + + freez(ae->exec); + ae->exec = strdupz(pointers[16]); + if(!*ae->exec) { freez(ae->exec); ae->exec = NULL; } + + freez(ae->recipient); + ae->recipient = strdupz(pointers[17]); + if(!*ae->recipient) { freez(ae->recipient); ae->recipient = NULL; } + + freez(ae->source); + ae->source = strdupz(pointers[18]); + if(!*ae->source) { 
freez(ae->source); ae->source = NULL; } + + freez(ae->units); + ae->units = strdupz(pointers[19]); + if(!*ae->units) { freez(ae->units); ae->units = NULL; } + + freez(ae->info); + ae->info = strdupz(pointers[20]); + if(!*ae->info) { freez(ae->info); ae->info = NULL; } + + ae->exec_code = str2i(pointers[21]); + ae->new_status = str2i(pointers[22]); + ae->old_status = str2i(pointers[23]); + ae->delay = str2i(pointers[24]); + + ae->new_value = str2l(pointers[25]); + ae->old_value = str2l(pointers[26]); + + char value_string[100 + 1]; + freez(ae->old_value_string); + freez(ae->new_value_string); + ae->old_value_string = strdupz(format_value_and_unit(value_string, 100, ae->old_value, ae->units, -1)); + ae->new_value_string = strdupz(format_value_and_unit(value_string, 100, ae->new_value, ae->units, -1)); + + // add it to host if not already there + if(unlikely(*pointers[0] == 'A')) { + ae->next = host->health_log.alarms; + host->health_log.alarms = ae; + loaded++; + } + else updated++; + + if(unlikely(ae->unique_id > host->health_max_unique_id)) + host->health_max_unique_id = ae->unique_id; + + if(unlikely(ae->alarm_id >= host->health_max_alarm_id)) + host->health_max_alarm_id = ae->alarm_id; + } + else { + error("HEALTH [%s]: line %zu of file '%s' is invalid (unrecognized entry type '%s').", host->hostname, line, filename, pointers[0]); + errored++; + } + } + + netdata_rwlock_unlock(&host->health_log.alarm_log_rwlock); + + freez(buf); + + if(!host->health_max_unique_id) host->health_max_unique_id = (uint32_t)now_realtime_sec(); + if(!host->health_max_alarm_id) host->health_max_alarm_id = (uint32_t)now_realtime_sec(); + + host->health_log.next_log_id = host->health_max_unique_id + 1; + host->health_log.next_alarm_id = host->health_max_alarm_id + 1; + + debug(D_HEALTH, "HEALTH [%s]: loaded file '%s' with %zd new alarm entries, updated %zd alarms, errors %zd entries, duplicate %zd", host->hostname, filename, loaded, updated, errored, duplicate); + return loaded; +} + 
+inline void health_alarm_log_load(RRDHOST *host) { + health_alarm_log_close(host); + + char filename[FILENAME_MAX + 1]; + snprintfz(filename, FILENAME_MAX, "%s.old", host->health_log_filename); + FILE *fp = fopen(filename, "r"); + if(!fp) + error("HEALTH [%s]: cannot open health file: %s", host->hostname, filename); + else { + health_alarm_log_read(host, fp, filename); + fclose(fp); + } + + host->health_log_entries_written = 0; + fp = fopen(host->health_log_filename, "r"); + if(!fp) + error("HEALTH [%s]: cannot open health file: %s", host->hostname, host->health_log_filename); + else { + health_alarm_log_read(host, fp, host->health_log_filename); + fclose(fp); + } + + health_alarm_log_open(host); +} + + +// ---------------------------------------------------------------------------- +// health alarm log management + +inline void health_alarm_log( + RRDHOST *host, + uint32_t alarm_id, + uint32_t alarm_event_id, + time_t when, + const char *name, + const char *chart, + const char *family, + const char *exec, + const char *recipient, + time_t duration, + calculated_number old_value, + calculated_number new_value, + RRDCALC_STATUS old_status, + RRDCALC_STATUS new_status, + const char *source, + const char *units, + const char *info, + int delay, + uint32_t flags +) { + debug(D_HEALTH, "Health adding alarm log entry with id: %u", host->health_log.next_log_id); + + ALARM_ENTRY *ae = callocz(1, sizeof(ALARM_ENTRY)); + ae->name = strdupz(name); + ae->hash_name = simple_hash(ae->name); + + if(chart) { + ae->chart = strdupz(chart); + ae->hash_chart = simple_hash(ae->chart); + } + + if(family) + ae->family = strdupz(family); + + if(exec) ae->exec = strdupz(exec); + if(recipient) ae->recipient = strdupz(recipient); + if(source) ae->source = strdupz(source); + if(units) ae->units = strdupz(units); + if(info) ae->info = strdupz(info); + + ae->unique_id = host->health_log.next_log_id++; + ae->alarm_id = alarm_id; + ae->alarm_event_id = alarm_event_id; + ae->when = when; + 
ae->old_value = old_value; + ae->new_value = new_value; + + char value_string[100 + 1]; + ae->old_value_string = strdupz(format_value_and_unit(value_string, 100, ae->old_value, ae->units, -1)); + ae->new_value_string = strdupz(format_value_and_unit(value_string, 100, ae->new_value, ae->units, -1)); + + ae->old_status = old_status; + ae->new_status = new_status; + ae->duration = duration; + ae->delay = delay; + ae->delay_up_to_timestamp = when + delay; + ae->flags |= flags; + + if(ae->old_status == RRDCALC_STATUS_WARNING || ae->old_status == RRDCALC_STATUS_CRITICAL) + ae->non_clear_duration += ae->duration; + + // link it + netdata_rwlock_wrlock(&host->health_log.alarm_log_rwlock); + ae->next = host->health_log.alarms; + host->health_log.alarms = ae; + host->health_log.count++; + netdata_rwlock_unlock(&host->health_log.alarm_log_rwlock); + + // match previous alarms + netdata_rwlock_rdlock(&host->health_log.alarm_log_rwlock); + ALARM_ENTRY *t; + for(t = host->health_log.alarms ; t ; t = t->next) { + if(t != ae && t->alarm_id == ae->alarm_id) { + if(!(t->flags & HEALTH_ENTRY_FLAG_UPDATED) && !t->updated_by_id) { + t->flags |= HEALTH_ENTRY_FLAG_UPDATED; + t->updated_by_id = ae->unique_id; + ae->updates_id = t->unique_id; + + if((t->new_status == RRDCALC_STATUS_WARNING || t->new_status == RRDCALC_STATUS_CRITICAL) && + (t->old_status == RRDCALC_STATUS_WARNING || t->old_status == RRDCALC_STATUS_CRITICAL)) + ae->non_clear_duration += t->non_clear_duration; + + health_alarm_log_save(host, t); + } + + // no need to continue + break; + } + } + netdata_rwlock_unlock(&host->health_log.alarm_log_rwlock); + + health_alarm_log_save(host, ae); +} + +inline void health_alarm_log_free_one_nochecks_nounlink(ALARM_ENTRY *ae) { + freez(ae->name); + freez(ae->chart); + freez(ae->family); + freez(ae->exec); + freez(ae->recipient); + freez(ae->source); + freez(ae->units); + freez(ae->info); + freez(ae->old_value_string); + freez(ae->new_value_string); + freez(ae); +} + +inline void 
health_alarm_log_free(RRDHOST *host) { + rrdhost_check_wrlock(host); + + netdata_rwlock_wrlock(&host->health_log.alarm_log_rwlock); + + ALARM_ENTRY *ae; + while((ae = host->health_log.alarms)) { + host->health_log.alarms = ae->next; + health_alarm_log_free_one_nochecks_nounlink(ae); + } + + netdata_rwlock_unlock(&host->health_log.alarm_log_rwlock); +} diff --git a/health/notifications/Makefile.am b/health/notifications/Makefile.am new file mode 100644 index 0000000..a5b88f0 --- /dev/null +++ b/health/notifications/Makefile.am @@ -0,0 +1,45 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +AUTOMAKE_OPTIONS = subdir-objects +MAINTAINERCLEANFILES = $(srcdir)/Makefile.in + +CLEANFILES = \ + alarm-notify.sh \ + $(NULL) + +include $(top_srcdir)/build/subst.inc +SUFFIXES = .in + +dist_libconfig_DATA = \ + health_alarm_notify.conf \ + health_email_recipients.conf \ + $(NULL) + +dist_plugins_SCRIPTS = \ + alarm-notify.sh \ + alarm-email.sh \ + alarm-test.sh \ + $(NULL) + +dist_noinst_DATA = \ + alarm-notify.sh.in \ + README.md \ + $(NULL) + +include alerta/Makefile.inc +include awssns/Makefile.inc +include discord/Makefile.inc +include email/Makefile.inc +include flock/Makefile.inc +include irc/Makefile.inc +include kavenegar/Makefile.inc +include messagebird/Makefile.inc +include pagerduty/Makefile.inc +include pushbullet/Makefile.inc +include pushover/Makefile.inc +include rocketchat/Makefile.inc +include slack/Makefile.inc +include syslog/Makefile.inc +include telegram/Makefile.inc +include twilio/Makefile.inc +include web/Makefile.inc diff --git a/health/notifications/README.md b/health/notifications/README.md new file mode 100644 index 0000000..5b7b434 --- /dev/null +++ b/health/notifications/README.md @@ -0,0 +1,66 @@ +# Netdata alarm notifications + +The `exec` line in health configuration defines an external script that will be called once +the alarm is triggered. The default script is **[alarm-notify.sh](alarm-notify.sh.in)**. 
+ +You can change the default script globally by editing `/etc/netdata/netdata.conf`. + +`alarm-notify.sh` is capable of sending notifications: + +- to multiple recipients +- using multiple notification methods +- filtering severity per recipient + +It uses **roles**, for example `sysadmin`, `webmaster`, `dba`, etc. + +Each alarm is assigned to one or more roles, using the `to` line of the alarm configuration. +Then `alarm-notify.sh` uses its own configuration file, `/etc/netdata/health_alarm_notify.conf` +(the default is [here](health_alarm_notify.conf); +to edit it on your system, run `/etc/netdata/edit-config health_alarm_notify.conf`), +to find the destination address of the notification for each method. + +Each role may have one or more destinations. + +So, for example, the `sysadmin` role may send: + +1. emails to admin1@example.com and admin2@example.com +2. pushover.net notifications to USERTOKENS `A`, `B` and `C`. +3. pushbullet.com push notifications to admin1@example.com and admin2@example.com +4. messages to slack.com channels `#alarms` and `#systems`. +5. messages to Discord channels `#alarms` and `#systems`. + +## Configuration + +Edit [`/etc/netdata/health_alarm_notify.conf`](health_alarm_notify.conf) +by running `/etc/netdata/edit-config health_alarm_notify.conf`: + +- settings per notification method: + + All notification methods except email require some configuration + (e.g. API keys, tokens, destination rooms, channels). + +- 
**recipients** per **role** per **notification method** + +## Testing Notifications + +You can run the following commands by hand to test the alarm configuration: + +```sh +# become user netdata +su -s /bin/bash netdata + +# enable debugging info on the console +export NETDATA_ALARM_NOTIFY_DEBUG=1 + +# send test alarms to sysadmin +/usr/libexec/netdata/plugins.d/alarm-notify.sh test + +# send test alarms to any role +/usr/libexec/netdata/plugins.d/alarm-notify.sh test "ROLE" +``` +If you need to dig even deeper, you can trace the execution with `bash -x`. Note that in test mode, `alarm-notify.sh` calls itself with many more arguments. So first run: + ```sh + bash -x /usr/libexec/netdata/plugins.d/alarm-notify.sh test + ``` + Then look in the output for the `alarm-notify.sh` calls and run the one you want to trace with `bash -x`. +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]() diff --git a/health/notifications/alarm-email.sh b/health/notifications/alarm-email.sh new file mode 100755 index 0000000..69c4c3f --- /dev/null +++ b/health/notifications/alarm-email.sh @@ -0,0 +1,7 @@ +#!/usr/bin/env bash +# SPDX-License-Identifier: GPL-3.0-or-later + +# OBSOLETE - REPLACED WITH +# alarm-notify.sh + +${0/alarm-email.sh/alarm-notify.sh} "${@}" diff --git a/health/notifications/alarm-notify.sh.in b/health/notifications/alarm-notify.sh.in new file mode 100755 index 0000000..dd3cda9 --- /dev/null +++ b/health/notifications/alarm-notify.sh.in @@ -0,0 +1,2304 @@ +#!/usr/bin/env bash + +# netdata +# real-time performance and health monitoring, done right! 
+# (C) 2017 Costa Tsaousis <costa@tsaousis.gr> +# SPDX-License-Identifier: GPL-3.0-or-later +# +# Script to send alarm notifications for netdata +# +# Features: +# - multiple notification methods +# - multiple roles per alarm +# - multiple recipients per role +# - severity filtering per recipient +# +# Supported notification methods: +# - emails by @ktsaou +# - slack.com notifications by @ktsaou +# - alerta.io notifications by @kattunga +# - discordapp.com notifications by @lowfive +# - pushover.net notifications by @ktsaou +# - pushbullet.com push notifications by Tiago Peralta @tperalta82 #1070 +# - telegram.org notifications by @hashworks #1002 +# - twilio.com notifications by Levi Blaney @shadycuz #1211 +# - kafka notifications by @ktsaou #1342 +# - pagerduty.com notifications by Jim Cooley @jimcooley #1373 +# - messagebird.com notifications by @tech_no_logical #1453 +# - hipchat notifications by @ktsaou #1561 +# - fleep notifications by @Ferroin +# - prowlapp.com notifications by @Ferroin +# - custom notifications by @ktsaou +# - syslog messages by @Ferroin +# - Microsoft Team notification by @tioumen + +# ----------------------------------------------------------------------------- +# testing notifications + + +if [ \( "${1}" = "test" -o "${2}" = "test" \) -a "${#}" -le 2 ] +then + if [ "${2}" = "test" ] + then + recipient="${1}" + else + recipient="${2}" + fi + + [ -z "${recipient}" ] && recipient="sysadmin" + + id=1 + last="CLEAR" + test_res=0 + for x in "WARNING" "CRITICAL" "CLEAR" + do + echo >&2 + echo >&2 "# SENDING TEST ${x} ALARM TO ROLE: ${recipient}" + + "${0}" "${recipient}" "$(hostname)" 1 1 "${id}" "$(date +%s)" "test_alarm" "test.chart" "test.family" "${x}" "${last}" 100 90 "${0}" 1 $((0 + id)) "units" "this is a test alarm to verify notifications work" "new value" "old value" "evaluated expression" "expression variable values" 0 0 + if [ $? 
-ne 0 ] + then + echo >&2 "# FAILED" + test_res=1 + else + echo >&2 "# OK" + fi + + last="${x}" + id=$((id + 1)) + done + + exit $test_res +fi + +export PATH="${PATH}:/sbin:/usr/sbin:/usr/local/sbin" +export LC_ALL=C + +# ----------------------------------------------------------------------------- + +PROGRAM_NAME="$(basename "${0}")" + +logdate() { + date "+%Y-%m-%d %H:%M:%S" +} + +log() { + local status="${1}" + shift + + echo >&2 "$(logdate): ${PROGRAM_NAME}: ${status}: ${*}" + +} + +warning() { + log WARNING "${@}" +} + +error() { + log ERROR "${@}" +} + +info() { + log INFO "${@}" +} + +fatal() { + log FATAL "${@}" + exit 1 +} + +debug=${NETDATA_ALARM_NOTIFY_DEBUG-0} +debug() { + [ "${debug}" = "1" ] && log DEBUG "${@}" +} + +docurl() { + if [ -z "${curl}" ] + then + error "\${curl} is unset." + return 1 + fi + + if [ "${debug}" = "1" ] + then + echo >&2 "--- BEGIN curl command ---" + printf >&2 "%q " ${curl} "${@}" + echo >&2 + echo >&2 "--- END curl command ---" + + local out=$(mktemp /tmp/netdata-health-alarm-notify-XXXXXXXX) + local code=$(${curl} ${curl_options} --write-out %{http_code} --output "${out}" --silent --show-error "${@}") + local ret=$? + echo >&2 "--- BEGIN received response ---" + cat >&2 "${out}" + echo >&2 + echo >&2 "--- END received response ---" + echo >&2 "RECEIVED HTTP RESPONSE CODE: ${code}" + rm "${out}" + echo "${code}" + return ${ret} + fi + + ${curl} ${curl_options} --write-out %{http_code} --output /dev/null --silent --show-error "${@}" + return $? +} + +# ----------------------------------------------------------------------------- +# List of all the notification mechanisms we support. +# Used in a couple of places to write more compact code. 
+ +method_names=" +email +pushover +pushbullet +telegram +slack +alerta +flock +discord +hipchat +twilio +messagebird +pd +fleep +syslog +custom +msteam +kavenegar +prowl +" + +# ----------------------------------------------------------------------------- +# this is to be overwritten by the config file + +custom_sender() { + info "not sending custom notification for ${status} of '${host}.${chart}.${name}'" +} + + +# ----------------------------------------------------------------------------- + +# check for BASH v4+ (required for associative arrays) +[ $(( ${BASH_VERSINFO[0]} )) -lt 4 ] && \ + fatal "BASH version 4 or later is required (this is ${BASH_VERSION})." + +# ----------------------------------------------------------------------------- +# defaults to allow running this script by hand + +[ -z "${NETDATA_USER_CONFIG_DIR}" ] && NETDATA_USER_CONFIG_DIR="@configdir_POST@" +[ -z "${NETDATA_STOCK_CONFIG_DIR}" ] && NETDATA_STOCK_CONFIG_DIR="@libconfigdir_POST@" +[ -z "${NETDATA_CACHE_DIR}" ] && NETDATA_CACHE_DIR="@cachedir_POST@" +[ -z "${NETDATA_REGISTRY_URL}" ] && NETDATA_REGISTRY_URL="https://registry.my-netdata.io" + +# ----------------------------------------------------------------------------- +# parse command line parameters + +if [ "${1}" = "unittest" ] ; then + unittest=1 # enable unit testing mode + roles="${2}" # the role that should be used for unit testing + cfgfile="${3}" # the location of the config file to use for unit testing + status="${4}" # the current status : REMOVED, UNINITIALIZED, UNDEFINED, CLEAR, WARNING, CRITICAL + old_status="${5}" # the previous status: REMOVED, UNINITIALIZED, UNDEFINED, CLEAR, WARNING, CRITICAL +else + roles="${1}" # the roles that should be notified for this event + args_host="${2}" # the host that generated this event + unique_id="${3}" # the unique id of this event + alarm_id="${4}" # the unique id of the alarm that generated this event + event_id="${5}" # the incremental id of the event, for this alarm id + when="${6}" 
# the timestamp at which this event occurred + name="${7}" # the name of the alarm, as given in netdata health.d entries + chart="${8}" # the name of the chart (type.id) + family="${9}" # the family of the chart + status="${10}" # the current status : REMOVED, UNINITIALIZED, UNDEFINED, CLEAR, WARNING, CRITICAL + old_status="${11}" # the previous status: REMOVED, UNINITIALIZED, UNDEFINED, CLEAR, WARNING, CRITICAL + value="${12}" # the current value of the alarm + old_value="${13}" # the previous value of the alarm + src="${14}" # the line number and file where the alarm is configured + duration="${15}" # the duration in seconds of the previous alarm state + non_clear_duration="${16}" # the total duration in seconds this is/was non-clear + units="${17}" # the units of the value + info="${18}" # a short description of the alarm + value_string="${19}" # friendly value (with units) + old_value_string="${20}" # friendly old value (with units) + calc_expression="${21}" # contains the expression that was evaluated to trigger the alarm + calc_param_values="${22}" # the values of the parameters in the expression, at the time of the evaluation + total_warnings="${23}" # Total number of alarms in WARNING state + total_critical="${24}" # Total number of alarms in CRITICAL state +fi + + +# ----------------------------------------------------------------------------- +# find a suitable hostname to use, if netdata did not supply a hostname + +if [ -z "${args_host}" ] + then + this_host=$(hostname -s 2>/dev/null) + host="${this_host}" + args_host="${this_host}" +else + host="${args_host}" +fi + +# ----------------------------------------------------------------------------- +# screen out statuses that do not need a notification + +# don't do anything if this is not WARNING, CRITICAL or CLEAR +if [ "${status}" != "WARNING" -a "${status}" != "CRITICAL" -a "${status}" != "CLEAR" ] +then + info "not sending notification for ${status} of '${host}.${chart}.${name}'" + exit 1 +fi + +# don't do 
anything if this is CLEAR, but it was not WARNING or CRITICAL +if [ "${clear_alarm_always}" != "YES" -a "${old_status}" != "WARNING" -a "${old_status}" != "CRITICAL" -a "${status}" = "CLEAR" ] +then + info "not sending notification for ${status} of '${host}.${chart}.${name}' (last status was ${old_status})" + exit 1 +fi + +# ----------------------------------------------------------------------------- +# load configuration + +# By default fetch images from the global public registry. +# This is required by default, since all notification methods need to download +# images via the Internet, and private registries might not be reachable. +# This can be overwritten at the configuration file. +images_base_url="https://registry.my-netdata.io" + +# curl options to use +curl_options="" + +# hostname handling +use_fqdn="NO" + +# needed commands +# if empty they will be searched in the system path +curl= +sendmail= + +# enable / disable features +for method_name in ${method_names^^} ; do + declare SEND_${method_name}="YES" + declare DEFAULT_RECIPIENT_${method_name} +done + +for method_name in ${method_names} ; do + declare -A role_recipients_${method_name} +done + +# slack configs +SLACK_WEBHOOK_URL= + +# Microsoft Team configs +MSTEAM_WEBHOOK_URL= + +# rocketchat configs +ROCKETCHAT_WEBHOOK_URL= + +# alerta configs +ALERTA_WEBHOOK_URL= +ALERTA_API_KEY= + +# flock configs +FLOCK_WEBHOOK_URL= + +# discord configs +DISCORD_WEBHOOK_URL= + +# pushover configs +PUSHOVER_APP_TOKEN= + +# pushbullet configs +PUSHBULLET_ACCESS_TOKEN= +PUSHBULLET_SOURCE_DEVICE= + +# twilio configs +TWILIO_ACCOUNT_SID= +TWILIO_ACCOUNT_TOKEN= +TWILIO_NUMBER= + +# hipchat configs +HIPCHAT_SERVER= +HIPCHAT_AUTH_TOKEN= + +# messagebird configs +MESSAGEBIRD_ACCESS_KEY= +MESSAGEBIRD_NUMBER= + +# kavenegar configs +KAVENEGAR_API_KEY= +KAVENEGAR_SENDER= + +# telegram configs +TELEGRAM_BOT_TOKEN= + +# kafka configs +SEND_KAFKA="YES" +KAFKA_URL= +KAFKA_SENDER_IP= + +# pagerduty.com configs +PD_SERVICE_KEY= + +# 
fleep.io configs +FLEEP_SENDER="${host}" + +# Amazon SNS configs +DEFAULT_RECIPIENT_AWSSNS= +AWSSNS_MESSAGE_FORMAT= +declare -A role_recipients_awssns=() + +# syslog configs +SYSLOG_FACILITY= + +# email configs +EMAIL_SENDER= +EMAIL_CHARSET=$(locale charmap 2>/dev/null) +EMAIL_THREADING= +DEFAULT_RECIPIENT_EMAIL="root" + +# irc configs +IRC_NICKNAME= +IRC_REALNAME= +IRC_NETWORK= + +# load the stock and user configuration files +# these will overwrite the variables above + +if [ ${unittest} ] ; + then + source "${cfgfile}" + [ $? -ne 0 ] && error "Failed to load requested config file." && exit 1 +else + for CONFIG in "${NETDATA_STOCK_CONFIG_DIR}/health_alarm_notify.conf" "${NETDATA_USER_CONFIG_DIR}/health_alarm_notify.conf" + do + if [ -f "${CONFIG}" ] + then + debug "Loading config file '${CONFIG}'..." + source "${CONFIG}" + [ $? -ne 0 ] && error "Failed to load config file '${CONFIG}'." + else + warning "Cannot find file '${CONFIG}'." + fi + done +fi + +# If we didn't autodetect the character set for e-mail and it wasn't +# set by the user, we need to set it to a reasonable default. UTF-8 +# should be correct for almost all modern UNIX systems. +if [ -z ${EMAIL_CHARSET} ] + then + EMAIL_CHARSET="UTF-8" +fi + +# If we've been asked to use FQDN's for the URL's in the alarm, do so, +# unless we're sending an alarm for a slave system which we can't get the +# FQDN of easily. 
+if [ "${use_fqdn}" = "YES" -a "${host}" = "$(hostname -s 2>/dev/null)" ] + then + host="$(hostname -f 2>/dev/null)" +fi + +# ----------------------------------------------------------------------------- +# filter a recipient based on alarm event severity + +filter_recipient_by_criticality() { + local method="${1}" x="${2}" r s + shift + + r="${x/|*/}" # the recipient + s="${x/*|/}" # the severity required for notifying this recipient + + # no severity filtering for this person + [ "${r}" = "${s}" ] && return 0 + + # the severity is invalid + s="${s^^}" + if [ "${s}" != "CRITICAL" ] + then + error "SEVERITY FILTERING for ${x} VIA ${method}: invalid severity '${s,,}', only 'critical' is supported." + return 0 + fi + + # create the status tracking directory for this user + [ ! -d "${NETDATA_CACHE_DIR}/alarm-notify/${method}/${r}" ] && \ + mkdir -p "${NETDATA_CACHE_DIR}/alarm-notify/${method}/${r}" + + case "${status}" in + CRITICAL) + # make sure he will get future notifications for this alarm too + touch "${NETDATA_CACHE_DIR}/alarm-notify/${method}/${r}/${alarm_id}" + debug "SEVERITY FILTERING for ${x} VIA ${method}: ALLOW: the alarm is CRITICAL (will now receive next status change)" + return 0 + ;; + + WARNING) + if [ -f "${NETDATA_CACHE_DIR}/alarm-notify/${method}/${r}/${alarm_id}" ] + then + # we do not remove the file, so that he will get future notifications of this alarm + debug "SEVERITY FILTERING for ${x} VIA ${method}: ALLOW: recipient has been notified for this alarm in the past (will still receive next status change)" + return 0 + fi + ;; + + *) + if [ -f "${NETDATA_CACHE_DIR}/alarm-notify/${method}/${r}/${alarm_id}" ] + then + # remove the file, so that he will only receive notifications for CRITICAL states for this alarm + rm "${NETDATA_CACHE_DIR}/alarm-notify/${method}/${r}/${alarm_id}" + debug "SEVERITY FILTERING for ${x} VIA ${method}: ALLOW: recipient has been notified for this alarm (will only receive CRITICAL notifications from now on)" + return 0 
+ fi + ;; + esac + + debug "SEVERITY FILTERING for ${x} VIA ${method}: BLOCK: recipient should not receive this notification" + return 1 +} + +# ----------------------------------------------------------------------------- +# verify the delivery methods supported + +# check slack +[ -z "${SLACK_WEBHOOK_URL}" ] && SEND_SLACK="NO" + +# check rocketchat +[ -z "${ROCKETCHAT_WEBHOOK_URL}" ] && SEND_ROCKETCHAT="NO" + +# check alerta +[ -z "${ALERTA_WEBHOOK_URL}" ] && SEND_ALERTA="NO" + +# check flock +[ -z "${FLOCK_WEBHOOK_URL}" ] && SEND_FLOCK="NO" + +# check discord +[ -z "${DISCORD_WEBHOOK_URL}" ] && SEND_DISCORD="NO" + +# check pushover +[ -z "${PUSHOVER_APP_TOKEN}" ] && SEND_PUSHOVER="NO" + +# check pushbullet +[ -z "${PUSHBULLET_ACCESS_TOKEN}" ] && SEND_PUSHBULLET="NO" + +# check twilio +[ -z "${TWILIO_ACCOUNT_TOKEN}" -o -z "${TWILIO_ACCOUNT_SID}" -o -z "${TWILIO_NUMBER}" ] && SEND_TWILIO="NO" + +# check hipchat +[ -z "${HIPCHAT_AUTH_TOKEN}" ] && SEND_HIPCHAT="NO" + +# check messagebird +[ -z "${MESSAGEBIRD_ACCESS_KEY}" -o -z "${MESSAGEBIRD_NUMBER}" ] && SEND_MESSAGEBIRD="NO" + +# check kavenegar +[ -z "${KAVENEGAR_API_KEY}" -o -z "${KAVENEGAR_SENDER}" ] && SEND_KAVENEGAR="NO" + +# check telegram +[ -z "${TELEGRAM_BOT_TOKEN}" ] && SEND_TELEGRAM="NO" + +# check kafka +[ -z "${KAFKA_URL}" -o -z "${KAFKA_SENDER_IP}" ] && SEND_KAFKA="NO" + +# check irc +[ -z "${IRC_NETWORK}" ] && SEND_IRC="NO" + +# check fleep +[ -z "${FLEEP_SERVER}" -o -z "${FLEEP_SENDER}" ] && SEND_FLEEP="NO" + +# if we need curl, check for the curl command +if [ \( \ + "${SEND_PUSHOVER}" = "YES" \ + -o "${SEND_SLACK}" = "YES" \ + -o "${SEND_ROCKETCHAT}" = "YES" \ + -o "${SEND_ALERTA}" = "YES" \ + -o "${SEND_PD}" = "YES" \ + -o "${SEND_FLOCK}" = "YES" \ + -o "${SEND_DISCORD}" = "YES" \ + -o "${SEND_HIPCHAT}" = "YES" \ + -o "${SEND_TWILIO}" = "YES" \ + -o "${SEND_MESSAGEBIRD}" = "YES" \ + -o "${SEND_KAVENEGAR}" = "YES" \ + -o "${SEND_TELEGRAM}" = "YES" \ + -o "${SEND_PUSHBULLET}" = "YES" \ + -o 
"${SEND_KAFKA}" = "YES" \ + -o "${SEND_FLEEP}" = "YES" \ + -o "${SEND_PROWL}" = "YES" \ + -o "${SEND_CUSTOM}" = "YES" \ + -o "${SEND_MSTEAM}" = "YES" \ + \) -a -z "${curl}" ] + then + curl="$(which curl 2>/dev/null || command -v curl 2>/dev/null)" + if [ -z "${curl}" ] + then + error "Cannot find curl command in the system path. Disabling all curl based notifications." + SEND_PUSHOVER="NO" + SEND_PUSHBULLET="NO" + SEND_TELEGRAM="NO" + SEND_SLACK="NO" + SEND_MSTEAM="NO" + SEND_ROCKETCHAT="NO" + SEND_ALERTA="NO" + SEND_PD="NO" + SEND_FLOCK="NO" + SEND_DISCORD="NO" + SEND_TWILIO="NO" + SEND_HIPCHAT="NO" + SEND_MESSAGEBIRD="NO" + SEND_KAVENEGAR="NO" + SEND_KAFKA="NO" + SEND_FLEEP="NO" + SEND_PROWL="NO" + SEND_CUSTOM="NO" + fi +fi + +# if we need sendmail, check for the sendmail command +if [ "${SEND_EMAIL}" = "YES" -a -z "${sendmail}" ] + then + sendmail="$(which sendmail 2>/dev/null || command -v sendmail 2>/dev/null)" + if [ -z "${sendmail}" ] + then + debug "Cannot find sendmail command in the system path. Disabling email notifications." + SEND_EMAIL="NO" + fi +fi + +# if we need logger, check for the logger command +if [ "${SEND_SYSLOG}" = "YES" -a -z "${logger}" ] + then + logger="$(which logger 2>/dev/null || command -v logger 2>/dev/null)" + if [ -z "${logger}" ] + then + debug "Cannot find logger command in the system path. Disabling syslog notifications." + SEND_SYSLOG="NO" + fi +fi + +# if we need aws, check for the aws command +if [ "${SEND_AWSSNS}" = "YES" -a -z "${aws}" ] + then + aws="$(which aws 2>/dev/null || command -v aws 2>/dev/null)" + if [ -z "${aws}" ] + then + debug "Cannot find aws command in the system path. Disabling Amazon SNS notifications." + SEND_AWSSNS="NO" + fi +fi + +# ----------------------------------------------------------------------------- +# find the recipients' addresses per method + +# netdata may call us with multiple roles, and roles may have multiple but +# overlapping recipients - so, here we find the unique recipients. 
+for method_name in ${method_names} ; do + send_var="SEND_${method_name^^}" + if [ ${!send_var} = "NO" ] ; then + continue + fi + + declare -A arr_var=() + + for x in ${roles//,/ } ; do + # the roles 'silent' and 'disabled' mean: + # don't send a notification for this role + [ "${x}" = "silent" -o "${x}" = "disabled" ] && continue + + role_recipients="role_recipients_${method_name}[$x]" + default_recipient_var="DEFAULT_RECIPIENT_${method_name^^}" + + a="${!role_recipients}" + [ -z "${a}" ] && a="${!default_recipient_var}" + for r in ${a//,/ } ; do + [ "${r}" != "disabled" ] && filter_recipient_by_criticality ${method_name} "${r}" && arr_var[${r/|*/}]="1" + done + done + + # build the list of recipients + to_var="to_${method_name}" + declare to_${method_name}="${!arr_var[*]}" + + [ -z "${!to_var}" ] && declare ${send_var}="NO" +done + +# ----------------------------------------------------------------------------- +# handle fixup of the email recipient list. + +fix_to_email() { + to_email= + while [ ! -z "${1}" ] + do + [ ! 
-z "${to_email}" ] && to_email="${to_email}, " + to_email="${to_email}${1}" + shift 1 + done +} + +# ${to_email} without quotes here +fix_to_email ${to_email} + +# ----------------------------------------------------------------------------- +# handle output if we're running in unit test mode +if [ ${unittest} ] ; then + for method_name in ${method_names} ; do + to_var="to_${method_name}" + echo "results: ${method_name}: ${!to_var}" + done + exit 0 +fi + +# ----------------------------------------------------------------------------- +# check that we have at least a method enabled +if [ "${SEND_EMAIL}" != "YES" \ + -a "${SEND_PUSHOVER}" != "YES" \ + -a "${SEND_TELEGRAM}" != "YES" \ + -a "${SEND_SLACK}" != "YES" \ + -a "${SEND_ROCKETCHAT}" != "YES" \ + -a "${SEND_ALERTA}" != "YES" \ + -a "${SEND_FLOCK}" != "YES" \ + -a "${SEND_DISCORD}" != "YES" \ + -a "${SEND_TWILIO}" != "YES" \ + -a "${SEND_HIPCHAT}" != "YES" \ + -a "${SEND_MESSAGEBIRD}" != "YES" \ + -a "${SEND_KAVENEGAR}" != "YES" \ + -a "${SEND_PUSHBULLET}" != "YES" \ + -a "${SEND_KAFKA}" != "YES" \ + -a "${SEND_PD}" != "YES" \ + -a "${SEND_FLEEP}" != "YES" \ + -a "${SEND_CUSTOM}" != "YES" \ + -a "${SEND_IRC}" != "YES" \ + -a "${SEND_AWSSNS}" != "YES" \ + -a "${SEND_PROWL}" != "YES" \ + -a "${SEND_SYSLOG}" != "YES" \ + -a "${SEND_MSTEAM}" != "YES" \ + ] + then + fatal "All notification methods are disabled. Not sending notification for host '${host}', chart '${chart}' to '${roles}' for '${name}' = '${value}' for status '${status}'." 
+fi + +# ----------------------------------------------------------------------------- +# get the date the alarm happened + +date=$(date --date=@${when} "${date_format}" 2>/dev/null) +[ -z "${date}" ] && date=$(date "${date_format}" 2>/dev/null) +[ -z "${date}" ] && date=$(date --date=@${when} 2>/dev/null) +[ -z "${date}" ] && date=$(date 2>/dev/null) + +# ---------------------------------------------------------------------------- +# prepare some extra headers if we've been asked to thread e-mails +if [ "${SEND_EMAIL}" == "YES" -a "${EMAIL_THREADING}" != "NO" ] ; then + email_thread_headers="In-Reply-To: <${chart}-${name}@${host}>\nReferences: <${chart}-${name}@${host}>" +else + email_thread_headers= +fi + +# ----------------------------------------------------------------------------- +# function to URL encode a string + +urlencode() { + local string="${1}" strlen encoded pos c o + + strlen=${#string} + for (( pos=0 ; pos<strlen ; pos++ )) + do + c=${string:${pos}:1} + case "${c}" in + [-_.~a-zA-Z0-9]) + o="${c}" + ;; + + *) + printf -v o '%%%02x' "'${c}" + ;; + esac + encoded+="${o}" + done + + REPLY="${encoded}" + echo "${REPLY}" +} + +# ----------------------------------------------------------------------------- +# function to convert a duration in seconds, to a human readable duration +# using DAYS, MINUTES, SECONDS + +duration4human() { + local s="${1}" d=0 h=0 m=0 ds="day" hs="hour" ms="minute" ss="second" ret + d=$(( s / 86400 )) + s=$(( s - (d * 86400) )) + h=$(( s / 3600 )) + s=$(( s - (h * 3600) )) + m=$(( s / 60 )) + s=$(( s - (m * 60) )) + + if [ ${d} -gt 0 ] + then + [ ${m} -ge 30 ] && h=$(( h + 1 )) + [ ${d} -gt 1 ] && ds="days" + [ ${h} -gt 1 ] && hs="hours" + if [ ${h} -gt 0 ] + then + ret="${d} ${ds} and ${h} ${hs}" + else + ret="${d} ${ds}" + fi + elif [ ${h} -gt 0 ] + then + [ ${s} -ge 30 ] && m=$(( m + 1 )) + [ ${h} -gt 1 ] && hs="hours" + [ ${m} -gt 1 ] && ms="minutes" + if [ ${m} -gt 0 ] + then + ret="${h} ${hs} and ${m} ${ms}" + else + 
ret="${h} ${hs}" + fi + elif [ ${m} -gt 0 ] + then + [ ${m} -gt 1 ] && ms="minutes" + [ ${s} -gt 1 ] && ss="seconds" + if [ ${s} -gt 0 ] + then + ret="${m} ${ms} and ${s} ${ss}" + else + ret="${m} ${ms}" + fi + else + [ ${s} -gt 1 ] && ss="seconds" + ret="${s} ${ss}" + fi + + REPLY="${ret}" + echo "${REPLY}" +} + +# ----------------------------------------------------------------------------- +# email sender + +send_email() { + local ret= opts=() sender_email="${EMAIL_SENDER}" sender_name= + if [ "${SEND_EMAIL}" = "YES" ] + then + + if [ ! -z "${EMAIL_SENDER}" ] + then + if [[ "${EMAIL_SENDER}" =~ ^\".*\"\ \<.*\>$ ]] + then + # the name includes double quotes + sender_email="$(echo "${EMAIL_SENDER}" | cut -d '<' -f 2 | cut -d '>' -f 1)" + sender_name="$(echo "${EMAIL_SENDER}" | cut -d '"' -f 2)" + elif [[ "${EMAIL_SENDER}" =~ ^\'.*\'\ \<.*\>$ ]] + then + # the name includes single quotes + sender_email="$(echo "${EMAIL_SENDER}" | cut -d '<' -f 2 | cut -d '>' -f 1)" + sender_name="$(echo "${EMAIL_SENDER}" | cut -d "'" -f 2)" + elif [[ "${EMAIL_SENDER}" =~ ^.*\ \<.*\>$ ]] + then + # the name does not have any quotes + sender_email="$(echo "${EMAIL_SENDER}" | cut -d '<' -f 2 | cut -d '>' -f 1)" + sender_name="$(echo "${EMAIL_SENDER}" | cut -d '<' -f 1)" + fi + fi + + [ ! -z "${sender_email}" ] && opts+=(-f "${sender_email}") + [ ! -z "${sender_name}" ] && opts+=(-F "${sender_name}") + + if [ "${debug}" = "1" ] + then + echo >&2 "--- BEGIN sendmail command ---" + printf >&2 "%q " "${sendmail}" -t "${opts[@]}" + echo >&2 + echo >&2 "--- END sendmail command ---" + fi + + "${sendmail}" -t "${opts[@]}" + ret=$? + + if [ ${ret} -eq 0 ] + then + info "sent email notification for: ${host} ${chart}.${name} is ${status} to '${to_email}'" + return 0 + else + error "failed to send email notification for: ${host} ${chart}.${name} is ${status} to '${to_email}' with error code ${ret}." 
+ return 1 + fi + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# pushover sender + +send_pushover() { + local apptoken="${1}" usertokens="${2}" when="${3}" url="${4}" status="${5}" title="${6}" message="${7}" httpcode sent=0 user priority + + if [ "${SEND_PUSHOVER}" = "YES" -a ! -z "${apptoken}" -a ! -z "${usertokens}" -a ! -z "${title}" -a ! -z "${message}" ] + then + + # https://pushover.net/api + priority=-2 + case "${status}" in + CLEAR) priority=-1;; # low priority: no sound or vibration + WARNING) priority=0;; # normal priority: respect quiet hours + CRITICAL) priority=1;; # high priority: bypass quiet hours + *) priority=-2;; # lowest priority: no notification at all + esac + + for user in ${usertokens} + do + httpcode=$(docurl \ + --form-string "token=${apptoken}" \ + --form-string "user=${user}" \ + --form-string "html=1" \ + --form-string "title=${title}" \ + --form-string "message=${message}" \ + --form-string "timestamp=${when}" \ + --form-string "url=${url}" \ + --form-string "url_title=Open netdata dashboard to view the alarm" \ + --form-string "priority=${priority}" \ + https://api.pushover.net/1/messages.json) + + if [ "${httpcode}" = "200" ] + then + info "sent pushover notification for: ${host} ${chart}.${name} is ${status} to '${user}'" + sent=$((sent + 1)) + else + error "failed to send pushover notification for: ${host} ${chart}.${name} is ${status} to '${user}' with HTTP error code ${httpcode}." + fi + done + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# pushbullet sender + +send_pushbullet() { + local userapikey="${1}" source_device="${2}" recipients="${3}" url="${4}" title="${5}" message="${6}" httpcode sent=0 user + if [ "${SEND_PUSHBULLET}" = "YES" -a ! -z "${userapikey}" -a ! -z "${recipients}" -a ! -z "${message}" -a ! 
-z "${title}" ] + then + #https://docs.pushbullet.com/#create-push + for user in ${recipients} + do + httpcode=$(docurl \ + --header 'Access-Token: '${userapikey}'' \ + --header 'Content-Type: application/json' \ + --data-binary @<(cat <<EOF + {"title": "${title}", + "type": "link", + "email": "${user}", + "body": "$( echo -n ${message})", + "url": "${url}", + "source_device_iden": "${source_device}"} +EOF + ) "https://api.pushbullet.com/v2/pushes" -X POST) + + if [ "${httpcode}" = "200" ] + then + info "sent pushbullet notification for: ${host} ${chart}.${name} is ${status} to '${user}'" + sent=$((sent + 1)) + else + error "failed to send pushbullet notification for: ${host} ${chart}.${name} is ${status} to '${user}' with HTTP error code ${httpcode}." + fi + done + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# kafka sender + +send_kafka() { + local httpcode sent=0 + if [ "${SEND_KAFKA}" = "YES" ] + then + httpcode=$(docurl -X POST \ + --data "{host_ip:\"${KAFKA_SENDER_IP}\",when:${when},name:\"${name}\",chart:\"${chart}\",family:\"${family}\",status:\"${status}\",old_status:\"${old_status}\",value:${value},old_value:${old_value},duration:${duration},non_clear_duration:${non_clear_duration},units:\"${units}\",info:\"${info}\"}" \ + "${KAFKA_URL}") + + if [ "${httpcode}" = "204" ] + then + info "sent kafka data for: ${host} ${chart}.${name} is ${status} and ip '${KAFKA_SENDER_IP}'" + sent=$((sent + 1)) + else + error "failed to send kafka data for: ${host} ${chart}.${name} is ${status} and ip '${KAFKA_SENDER_IP}' with HTTP error code ${httpcode}." 
+ fi + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# pagerduty.com sender + +send_pd() { + local recipients="${1}" sent=0 + unset t + case "${status}" in + CLEAR) t='resolve';; + WARNING) t='trigger';; + CRITICAL) t='trigger';; + esac + + if [ "${SEND_PD}" = "YES" -a ! -z "${t}" ] + then + for PD_SERVICE_KEY in ${recipients} + do + d="${status} ${name} = ${value_string} - ${host}, ${family}" + payload="$(cat << EOF + { + "service_key": "${PD_SERVICE_KEY}", + "event_type": "${t}", + "incident_key" : "${alarm_id}", + "description": "${d}", + "details": { + "value_w_units": "${value_string}", + "when": "${when}", + "duration" : "${duration}", + "roles": "${roles}", + "alarm_id" : "${alarm_id}", + "name" : "${name}", + "chart" : "${chart}", + "family" : "${family}", + "status" : "${status}", + "old_status" : "${old_status}", + "value" : "${value}", + "old_value" : "${old_value}", + "src" : "${src}", + "non_clear_duration" : "${non_clear_duration}", + "units" : "${units}", + "info" : "${info}" + } + } +EOF + )" + httpcode=$(docurl -X POST --data "${payload}" "https://events.pagerduty.com/generic/2010-04-15/create_event.json") + if [ "${httpcode}" = "200" ] + then + info "sent pagerduty notification for: ${host} ${chart}.${name} is ${status}" + sent=$((sent + 1)) + else + error "failed to send pagerduty notification for: ${host} ${chart}.${name} is ${status}, with HTTP error code ${httpcode}." + fi + done + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# twilio sender + +send_twilio() { + local accountsid="${1}" accounttoken="${2}" twilionumber="${3}" recipients="${4}" title="${5}" message="${6}" httpcode sent=0 user + if [ "${SEND_TWILIO}" = "YES" -a ! -z "${accountsid}" -a ! -z "${accounttoken}" -a ! -z "${twilionumber}" -a ! -z "${recipients}" -a ! -z "${message}" -a ! 
-z "${title}" ] + then + #https://www.twilio.com/packages/labs/code/bash/twilio-sms + for user in ${recipients} + do + httpcode=$(docurl -X POST \ + --data-urlencode "From=${twilionumber}" \ + --data-urlencode "To=${user}" \ + --data-urlencode "Body=${title} ${message}" \ + -u "${accountsid}:${accounttoken}" \ + "https://api.twilio.com/2010-04-01/Accounts/${accountsid}/Messages.json") + + if [ "${httpcode}" = "201" ] + then + info "sent Twilio SMS for: ${host} ${chart}.${name} is ${status} to '${user}'" + sent=$((sent + 1)) + else + error "failed to send Twilio SMS for: ${host} ${chart}.${name} is ${status} to '${user}' with HTTP error code ${httpcode}." + fi + done + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + + +# ----------------------------------------------------------------------------- +# hipchat sender + +send_hipchat() { + local authtoken="${1}" recipients="${2}" message="${3}" httpcode sent=0 room color sender msg_format notify + + # remove <small></small> from the message + message="${message//<small>/}" + message="${message//<\/small>/}" + + if [ "${SEND_HIPCHAT}" = "YES" -a ! -z "${HIPCHAT_SERVER}" -a ! -z "${authtoken}" -a ! -z "${recipients}" -a ! -z "${message}" ] + then + # A label to be shown in addition to the sender's name + # Valid length range: 0 - 64. + sender="netdata" + + # Valid values: html, text. + # Defaults to 'html'. + msg_format="html" + + # Background color for message. Valid values: yellow, green, red, purple, gray, random. Defaults to 'yellow'. + case "${status}" in + WARNING) color="yellow" ;; + CRITICAL) color="red" ;; + CLEAR) color="green" ;; + *) color="gray" ;; + esac + + # Whether this message should trigger a user notification (change the tab color, play a sound, notify mobile phones, etc). + # Each recipient's notification preferences are taken into account. + # Defaults to false. 
+ notify="true" + + for room in ${recipients} + do + httpcode=$(docurl -X POST \ + -H "Content-type: application/json" \ + -H "Authorization: Bearer ${authtoken}" \ + -d "{\"color\": \"${color}\", \"from\": \"${host}\", \"message_format\": \"${msg_format}\", \"message\": \"${message}\", \"notify\": \"${notify}\"}" \ + "https://${HIPCHAT_SERVER}/v2/room/${room}/notification") + + if [ "${httpcode}" = "204" ] + then + info "sent HipChat notification for: ${host} ${chart}.${name} is ${status} to '${room}'" + sent=$((sent + 1)) + else + error "failed to send HipChat notification for: ${host} ${chart}.${name} is ${status} to '${room}' with HTTP error code ${httpcode}." + fi + done + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + + +# ----------------------------------------------------------------------------- +# messagebird sender + +send_messagebird() { + local accesskey="${1}" messagebirdnumber="${2}" recipients="${3}" title="${4}" message="${5}" httpcode sent=0 user + if [ "${SEND_MESSAGEBIRD}" = "YES" -a ! -z "${accesskey}" -a ! -z "${messagebirdnumber}" -a ! -z "${recipients}" -a ! -z "${message}" -a ! -z "${title}" ] + then + #https://developers.messagebird.com/docs/messaging + for user in ${recipients} + do + httpcode=$(docurl -X POST \ + --data-urlencode "originator=${messagebirdnumber}" \ + --data-urlencode "recipients=${user}" \ + --data-urlencode "body=${title} ${message}" \ + --data-urlencode "datacoding=auto" \ + -H "Authorization: AccessKey ${accesskey}" \ + "https://rest.messagebird.com/messages") + + if [ "${httpcode}" = "201" ] + then + info "sent Messagebird SMS for: ${host} ${chart}.${name} is ${status} to '${user}'" + sent=$((sent + 1)) + else + error "failed to send Messagebird SMS for: ${host} ${chart}.${name} is ${status} to '${user}' with HTTP error code ${httpcode}." 
+ fi + done + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# kavenegar sender + +send_kavenegar() { + local API_KEY="${1}" kavenegarsender="${2}" recipients="${3}" title="${4}" message="${5}" httpcode sent=0 user + if [ "${SEND_KAVENEGAR}" = "YES" -a ! -z "${API_KEY}" -a ! -z "${kavenegarsender}" -a ! -z "${recipients}" -a ! -z "${message}" -a ! -z "${title}" ] + then + # http://api.kavenegar.com/v1/{API-KEY}/sms/send.json + for user in ${recipients} + do + httpcode=$(docurl -X POST http://api.kavenegar.com/v1/${API_KEY}/sms/send.json \ + --data-urlencode "sender=${kavenegarsender}" \ + --data-urlencode "receptor=${user}" \ + --data-urlencode "message=${title} ${message}") + + if [ "${httpcode}" = "200" ] + then + info "sent Kavenegar SMS for: ${host} ${chart}.${name} is ${status} to '${user}'" + sent=$((sent + 1)) + else + error "failed to send Kavenegar SMS for: ${host} ${chart}.${name} is ${status} to '${user}' with HTTP error code ${httpcode}." + fi + done + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# telegram sender + +send_telegram() { + local bottoken="${1}" chatids="${2}" message="${3}" httpcode sent=0 chatid emoji disableNotification="" + + if [ "${status}" = "CLEAR" ]; then disableNotification="--data-urlencode disable_notification=true"; fi + + case "${status}" in + WARNING) emoji="⚠️" ;; + CRITICAL) emoji="🔴" ;; + CLEAR) emoji="✅" ;; + *) emoji="⚪️" ;; + esac + + if [ "${SEND_TELEGRAM}" = "YES" -a ! -z "${bottoken}" -a ! -z "${chatids}" -a ! 
-z "${message}" ]; + then + for chatid in ${chatids} + do + # https://core.telegram.org/bots/api#sendmessage + httpcode=$(docurl ${disableNotification} \ + --data-urlencode "parse_mode=HTML" \ + --data-urlencode "disable_web_page_preview=true" \ + --data-urlencode "text=${emoji} ${message}" \ + "https://api.telegram.org/bot${bottoken}/sendMessage?chat_id=${chatid}") + + if [ "${httpcode}" = "200" ] + then + info "sent telegram notification for: ${host} ${chart}.${name} is ${status} to '${chatid}'" + sent=$((sent + 1)) + elif [ "${httpcode}" = "401" ] + then + error "failed to send telegram notification for: ${host} ${chart}.${name} is ${status} to '${chatid}': Wrong bot token." + else + error "failed to send telegram notification for: ${host} ${chart}.${name} is ${status} to '${chatid}' with HTTP error code ${httpcode}." + fi + done + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# Microsoft Team sender + +send_msteam() { + + local webhook="${1}" channels="${2}" httpcode sent=0 channel color payload + + [ "${SEND_MSTEAM}" != "YES" ] && return 1 + + case "${status}" in + WARNING) icon="${MSTEAM_ICON_WARNING}" && color="${MSTEAM_COLOR_WARNING}";; + CRITICAL) icon="${MSTEAM_ICON_CRITICAL}" && color="${MSTEAM_COLOR_CRITICAL}";; + CLEAR) icon="${MSTEAM_ICON_CLEAR}" && color="${MSTEAM_COLOR_CLEAR}";; + *) icon="${MSTEAM_ICON_DEFAULT}" && color="${MSTEAM_COLOR_DEFAULT}";; + esac + + for channel in ${channels} + do + ## More details are available here regarding the payload syntax options : https://docs.microsoft.com/en-us/outlook/actionable-messages/message-card-reference + ## Online designer : https://acdesignerbeta.azurewebsites.net/ + payload="$(cat <<EOF + { + "@context": "http://schema.org/extensions", + "@type": "MessageCard", + "themeColor": "${color}", + "title": "$icon Alert ${status} from netdata for ${host}", + "text": "${host} ${status_message}, ${chart} (_${family}_), 
*${alarm}*", + "potentialAction": [ + { + "@type": "OpenUri", + "name": "Netdata", + "targets": [ + { "os": "default", "uri": "${goto_url}" } + ] + } + ] + } +EOF + )" + + # Replace the CHANNEL placeholder in the webhook with the MS Teams channel name from the conf file. + # Use a per-iteration copy so the placeholder is still present for the next channel. + cur_webhook="${webhook//CHANNEL/${channel}}" + + httpcode=$(docurl -H "Content-Type: application/json" -d "${payload}" "${cur_webhook}") + + if [ "${httpcode}" = "200" ] + then + info "sent Microsoft team notification for: ${host} ${chart}.${name} is ${status} to '${cur_webhook}'" + sent=$((sent + 1)) + else + error "failed to send Microsoft team notification for: ${host} ${chart}.${name} is ${status} to '${cur_webhook}', with HTTP error code ${httpcode}." + fi + done + + [ ${sent} -gt 0 ] && return 0 + + return 1 +} + + +# slack sender + +send_slack() { + local webhook="${1}" channels="${2}" httpcode sent=0 channel color payload + + [ "${SEND_SLACK}" != "YES" ] && return 1 + + case "${status}" in + WARNING) color="warning" ;; + CRITICAL) color="danger" ;; + CLEAR) color="good" ;; + *) color="#777777" ;; + esac + + for channel in ${channels} + do + # Default entry in the recipient is without a hash in front (backwards-compatible). Accept specification of channel or user. + if [ "${channel::1}" != "#" ] && [ "${channel::1}" != "@" ] ; then channel="#$channel"; fi + + # If channel is equal to "#" then do not send the channel attribute at all. Slack also defines channels and users in webhooks. 
+ if [ "${channel}" = "#" ] ; then + ch="" + chstr="without specifying a channel" + else + ch="\"channel\": \"${channel}\"," + chstr="to '${channel}'" + fi + + payload="$(cat <<EOF + { + $ch + "username": "netdata on ${host}", + "icon_url": "${images_base_url}/images/banner-icon-144x144.png", + "text": "${host} ${status_message}, \`${chart}\` (_${family}_), *${alarm}*", + "attachments": [ + { + "fallback": "${alarm} - ${chart} (${family}) - ${info}", + "color": "${color}", + "title": "${alarm}", + "title_link": "${goto_url}", + "text": "${info}", + "fields": [ + { + "title": "${chart}", + "short": true + }, + { + "title": "${family}", + "short": true + } + ], + "thumb_url": "${image}", + "footer": "by <${goto_url}|${host}>", + "ts": ${when} + } + ] + } +EOF + )" + + httpcode=$(docurl -X POST --data-urlencode "payload=${payload}" "${webhook}") + if [ "${httpcode}" = "200" ] + then + info "sent slack notification for: ${host} ${chart}.${name} is ${status} ${chstr}" + sent=$((sent + 1)) + else + error "failed to send slack notification for: ${host} ${chart}.${name} is ${status} ${chstr}, with HTTP error code ${httpcode}." 
+ fi + done + + [ ${sent} -gt 0 ] && return 0 + + return 1 +} + + +# ----------------------------------------------------------------------------- +# rocketchat sender + +send_rocketchat() { + local webhook="${1}" channels="${2}" httpcode sent=0 channel color payload + + [ "${SEND_ROCKETCHAT}" != "YES" ] && return 1 + + case "${status}" in + WARNING) color="warning" ;; + CRITICAL) color="danger" ;; + CLEAR) color="good" ;; + *) color="#777777" ;; + esac + + for channel in ${channels} + do + payload="$(cat <<EOF + { + "channel": "#${channel}", + "alias": "netdata on ${host}", + "avatar": "${images_base_url}/images/banner-icon-144x144.png", + "text": "${host} ${status_message}, \`${chart}\` (_${family}_), *${alarm}*", + "attachments": [ + { + "color": "${color}", + "title": "${alarm}", + "title_link": "${goto_url}", + "text": "${info}", + "fields": [ + { + "title": "${chart}", + "short": true, + "value": "chart" + }, + { + "title": "${family}", + "short": true, + "value": "family" + } + ], + "thumb_url": "${image}", + "ts": "${when}" + } + ] + } +EOF + )" + + httpcode=$(docurl -X POST --data-urlencode "payload=${payload}" "${webhook}") + if [ "${httpcode}" = "200" ] + then + info "sent rocketchat notification for: ${host} ${chart}.${name} is ${status} to '${channel}'" + sent=$((sent + 1)) + else + error "failed to send rocketchat notification for: ${host} ${chart}.${name} is ${status} to '${channel}', with HTTP error code ${httpcode}." 
+ fi + done + + [ ${sent} -gt 0 ] && return 0 + + return 1 +} + +# ----------------------------------------------------------------------------- +# alerta sender + +send_alerta() { + local webhook="${1}" channels="${2}" httpcode sent=0 channel severity resource event payload auth + + [ "${SEND_ALERTA}" != "YES" ] && return 1 + + case "${status}" in + CRITICAL) severity="critical" ;; + WARNING) severity="warning" ;; + CLEAR) severity="cleared" ;; + *) severity="indeterminate" ;; + esac + + if [[ "${chart}" == httpcheck* ]] + then + resource=$chart + event=$name + else + resource="${host}:${family}" + event="${chart}.${name}" + fi + + for channel in ${channels} + do + payload="$(cat <<EOF + { + "resource": "${resource}", + "event": "${event}", + "environment": "${channel}", + "severity": "${severity}", + "service": ["Netdata"], + "group": "Performance", + "value": "${value_string}", + "text": "${info}", + "tags": ["alarm_id:${alarm_id}"], + "attributes": { + "roles": "${roles}", + "name": "${name}", + "chart": "${chart}", + "family": "${family}", + "source": "${src}", + "moreInfo": "<a href=\"${goto_url}\">View Netdata</a>" + }, + "origin": "netdata/${host}", + "type": "netdataAlarm", + "rawData": "${BASH_ARGV[@]}" + } +EOF + )" + + if [[ -n "${ALERTA_API_KEY}" ]] + then + auth="Key ${ALERTA_API_KEY}" + fi + + httpcode=$(docurl -X POST "${webhook}/alert" -H "Content-Type: application/json" -H "Authorization: $auth" --data "${payload}") + + if [[ "${httpcode}" = "200" || "${httpcode}" = "201" ]] + then + info "sent alerta notification for: ${host} ${chart}.${name} is ${status} to '${channel}'" + sent=$((sent + 1)) + elif [[ "${httpcode}" = "202" ]] + then + info "suppressed alerta notification for: ${host} ${chart}.${name} is ${status} to '${channel}'" + else + error "failed to send alerta notification for: ${host} ${chart}.${name} is ${status} to '${channel}', with HTTP error code ${httpcode}." 
+ fi + done + + [ ${sent} -gt 0 ] && return 0 + + return 1 +} + +# ----------------------------------------------------------------------------- +# flock sender + +send_flock() { + local webhook="${1}" channels="${2}" httpcode sent=0 channel color payload + + [ "${SEND_FLOCK}" != "YES" ] && return 1 + + case "${status}" in + WARNING) color="warning" ;; + CRITICAL) color="danger" ;; + CLEAR) color="good" ;; + *) color="#777777" ;; + esac + + for channel in ${channels} + do + httpcode=$(docurl -X POST "${webhook}" -H "Content-Type: application/json" -d "{ + \"sendAs\": { + \"name\" : \"netdata on ${host}\", + \"profileImage\" : \"${images_base_url}/images/banner-icon-144x144.png\" + }, + \"text\": \"${host} *${status_message}*\", + \"timestamp\": \"${when}\", + \"attachments\": [ + { + \"description\": \"${chart} (${family}) - ${info}\", + \"color\": \"${color}\", + \"title\": \"${alarm}\", + \"url\": \"${goto_url}\", + \"text\": \"${info}\", + \"views\": { + \"image\": { + \"original\": { \"src\": \"${image}\", \"width\": 400, \"height\": 400 }, + \"thumbnail\": { \"src\": \"${image}\", \"width\": 50, \"height\": 50 }, + \"filename\": \"${image}\" + } + } + } + ] + }" ) + if [ "${httpcode}" = "200" ] + then + info "sent flock notification for: ${host} ${chart}.${name} is ${status} to '${channel}'" + sent=$((sent + 1)) + else + error "failed to send flock notification for: ${host} ${chart}.${name} is ${status} to '${channel}', with HTTP error code ${httpcode}." 
+ fi + done + + [ ${sent} -gt 0 ] && return 0 + + return 1 +} + +# ----------------------------------------------------------------------------- +# discord sender + +send_discord() { + local webhook="${1}/slack" channels="${2}" httpcode sent=0 channel color payload username + + [ "${SEND_DISCORD}" != "YES" ] && return 1 + + case "${status}" in + WARNING) color="warning" ;; + CRITICAL) color="danger" ;; + CLEAR) color="good" ;; + *) color="#777777" ;; + esac + + for channel in ${channels} + do + username="netdata on ${host}" + [ ${#username} -gt 32 ] && username="${username:0:29}..." + + payload="$(cat <<EOF + { + "channel": "#${channel}", + "username": "${username}", + "text": "${host} ${status_message}, \`${chart}\` (_${family}_), *${alarm}*", + "icon_url": "${images_base_url}/images/banner-icon-144x144.png", + "attachments": [ + { + "color": "${color}", + "title": "${alarm}", + "title_link": "${goto_url}", + "text": "${info}", + "fields": [ + { + "title": "${chart}", + "value": "${family}" + } + ], + "thumb_url": "${image}", + "footer_icon": "${images_base_url}/images/banner-icon-144x144.png", + "footer": "${host}", + "ts": ${when} + } + ] + } +EOF + )" + + httpcode=$(docurl -X POST --data-urlencode "payload=${payload}" "${webhook}") + if [ "${httpcode}" = "200" ] + then + info "sent discord notification for: ${host} ${chart}.${name} is ${status} to '${channel}'" + sent=$((sent + 1)) + else + error "failed to send discord notification for: ${host} ${chart}.${name} is ${status} to '${channel}', with HTTP error code ${httpcode}." 
+ fi + done + + [ ${sent} -gt 0 ] && return 0 + + return 1 +} + +# ----------------------------------------------------------------------------- +# fleep sender + +send_fleep() { + local httpcode sent=0 webhooks="${1}" data message + if [ "${SEND_FLEEP}" = "YES" ] ; then + message="${host} ${status_message}, \`${chart}\` (${family}), *${alarm}*\\n${info}" + + # unquoted on purpose: iterate over each space-separated webhook + for hook in ${webhooks} ; do + data="{ " + data="${data} 'message': '${message}', " + data="${data} 'user': '${FLEEP_SENDER}' " + data="${data} }" + + httpcode=$(docurl -X POST --data "${data}" "https://fleep.io/hook/${hook}") + + if [ "${httpcode}" = "200" ] ; then + info "sent fleep data for: ${host} ${chart}.${name} is ${status} and user '${FLEEP_SENDER}'" + sent=$((sent + 1)) + else + error "failed to send fleep data for: ${host} ${chart}.${name} is ${status} and user '${FLEEP_SENDER}' with HTTP error code ${httpcode}." + fi + done + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# Prowl sender + +send_prowl() { + local httpcode sent=0 data message keys prio=0 alarm_url event + if [ "${SEND_PROWL}" = "YES" ] ; then + message="$(urlencode "${host} ${status_message}, \`${chart}\` (${family}), *${alarm}*\\n${info}")" + message="description=${message}" + keys="$(urlencode "$(echo "${1}" | tr ' ' ,)")" + keys="apikey=${keys}" + app="application=Netdata" + + case "${status}" in + CRITICAL) + prio=2 + ;; + WARNING) + prio=1 + ;; + esac + pri="priority=${prio}" + + alarm_url="$(urlencode ${goto_url})" + alarm_url="url=${alarm_url}" + event="$(urlencode "${host} ${status_message}")" + event="event=${event}" + + data="${keys}&${pri}&${alarm_url}&${app}&${event}&${message}" + + httpcode=$(docurl -X POST --data "${data}" "https://api.prowlapp.com/publicapi/add") + + if [ "${httpcode}" = "200" ] ; then + info "sent prowl data for: ${host} ${chart}.${name} is ${status}" + sent=1 + else + error "failed to send prowl data for: 
${host} ${chart}.${name} is ${status} with error code ${httpcode}." + fi + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# irc sender + +send_irc() { + local NICKNAME="${1}" REALNAME="${2}" CHANNELS="${3}" NETWORK="${4}" SERVERNAME="${5}" MESSAGE="${6}" sent=0 channel color send_alarm reply_codes error + + if [ "${SEND_IRC}" = "YES" -a ! -z "${NICKNAME}" -a ! -z "${REALNAME}" -a ! -z "${CHANNELS}" -a ! -z "${NETWORK}" -a ! -z "${SERVERNAME}" ] + then + case "${status}" in + WARNING) color="warning" ;; + CRITICAL) color="danger" ;; + CLEAR) color="good" ;; + *) color="#777777" ;; + esac + + for CHANNEL in ${CHANNELS} + do + error=0 + send_alarm=$(echo -e "USER ${NICKNAME} guest ${REALNAME} ${SERVERNAME}\nNICK ${NICKNAME}\nJOIN ${CHANNEL}\nPRIVMSG ${CHANNEL} :${MESSAGE}\nQUIT\n" \ | nc ${NETWORK} 6667) + reply_codes=$(echo ${send_alarm} | cut -d ' ' -f 2 | grep -o '[0-9]*') + for code in ${reply_codes} + do + [ "${code}" -ge 400 -a "${code}" -le 599 ] && error=1 && break + done + + if [ "${error}" -eq 0 ] + then + info "sent irc notification for: ${host} ${chart}.${name} is ${status} to '${CHANNEL}'" + sent=$((sent + 1)) + else + error "failed to send irc notification for: ${host} ${chart}.${name} is ${status} to '${CHANNEL}', with error code ${code}." + fi + done + fi + + [ ${sent} -gt 0 ] && return 0 + + return 1 +} + +# ----------------------------------------------------------------------------- +# Amazon SNS sender + +send_awssns() { + local targets="${1}" message='' sent=0 region='' + local default_format="${status} on ${host} at ${date}: ${chart} ${value_string}" + + [ "${SEND_AWSSNS}" = "YES" ] || return 1 + + message=${AWSSNS_MESSAGE_FORMAT:-${default_format}} + + for target in ${targets} ; do + # Extract the region from the target ARN. We need to explicitly specify the region so that it matches up correctly. 
+ region="$(echo ${target} | cut -f 4 -d ':')" + ${aws} sns publish --region "${region}" --subject "${host} ${status_message} - ${name//_/ } - ${chart}" --message "${message}" --target-arn ${target} &>/dev/null + if [ $? = 0 ]; then + info "sent Amazon SNS notification for: ${host} ${chart}.${name} is ${status} to '${target}'" + sent=$((sent + 1)) + else + error "failed to send Amazon SNS notification for: ${host} ${chart}.${name} is ${status} to '${target}'" + fi + done + + [ ${sent} -gt 0 ] && return 0 + + return 1 +} + +# ----------------------------------------------------------------------------- +# syslog sender + +send_syslog() { + local facility=${SYSLOG_FACILITY:-"local6"} level='info' targets="${1}" + local priority='' message='' host='' port='' prefix='' + local temp1='' temp2='' + + [ "${SEND_SYSLOG}" = "YES" ] || return 1 + + if [ "${status}" = "CRITICAL" ] ; then + level='crit' + elif [ "${status}" = "WARNING" ] ; then + level='warning' + fi + + for target in ${targets} ; do + priority="${facility}.${level}" + message='' + host='' + port='' + prefix='' + temp1='' + temp2='' + + prefix=$(echo ${target} | cut -d '/' -f 2) + temp1=$(echo ${target} | cut -d '/' -f 1) + + if [ ${prefix} != ${temp1} ] ; then + if (echo ${temp1} | grep -q '@' ) ; then + temp2=$(echo ${temp1} | cut -d '@' -f 1) + host=$(echo ${temp1} | cut -d '@' -f 2) + + if [ ${temp2} != ${host} ] ; then + priority=${temp2} + fi + + port=$(echo ${host} | rev | cut -d ':' -f 1 | rev) + + if ( echo ${host} | grep -E -q '\[.*\]' ) ; then + if ( echo ${port} | grep -q ']' ) ; then + port='' + else + host=$(echo ${host} | rev | cut -d ':' -f 2- | rev) + fi + else + if [ ${port} = ${host} ] ; then + port='' + else + host=$(echo ${host} | cut -d ':' -f 1) + fi + fi + else + priority=${temp1} + fi + fi + + message="${prefix} ${status} on ${host} at ${date}: ${chart} ${value_string}" + + if [ ${host} ] ; then + logger_options="${logger_options} -n ${host}" + if [ ${port} ] ; then + 
logger_options="${logger_options} -P ${port}" + fi + fi + + ${logger} -p ${priority} ${logger_options} "${message}" + done + + return $? +} + + +# ----------------------------------------------------------------------------- +# prepare the content of the notification + +# the url to send the user on click +urlencode "${args_host}" >/dev/null; url_host="${REPLY}" +urlencode "${chart}" >/dev/null; url_chart="${REPLY}" +urlencode "${family}" >/dev/null; url_family="${REPLY}" +urlencode "${name}" >/dev/null; url_name="${REPLY}" + +redirect_params="host=${url_host}&chart=${url_chart}&family=${url_family}&alarm=${url_name}&alarm_unique_id=${unique_id}&alarm_id=${alarm_id}&alarm_event_id=${event_id}" +GOTOCLOUD=0 + +if [ "${NETDATA_REGISTRY_URL}" == "https://registry.my-netdata.io" ] ; then + if [ -z "${NETDATA_REGISTRY_UNIQUE_ID}" ] ; then + if [ -f "@registrydir_POST@/netdata.public.unique.id" ]; then + NETDATA_REGISTRY_UNIQUE_ID="$(cat "@registrydir_POST@/netdata.public.unique.id")" + fi + fi + if [ ! 
-z "${NETDATA_REGISTRY_UNIQUE_ID}" ] ; then + GOTOCLOUD=1 + fi +fi + +if [ ${GOTOCLOUD} -eq 0 ] ; then + goto_url="${NETDATA_REGISTRY_URL}/goto-host-from-alarm.html?${redirect_params}" +else + goto_url="https://netdata.cloud/alarms/redirect?agentID=${NETDATA_REGISTRY_UNIQUE_ID}&${redirect_params}" +fi + +# the severity of the alarm +severity="${status}" + +# the time the alarm was raised +duration4human ${duration} >/dev/null; duration_txt="${REPLY}" +duration4human ${non_clear_duration} >/dev/null; non_clear_duration_txt="${REPLY}" +raised_for="(was ${old_status,,} for ${duration_txt})" + +# the key status message +status_message="status unknown" + +# the color of the alarm +color="grey" + +# the alarm value +alarm="${name//_/ } = ${value_string}" + +# the image of the alarm +image="${images_base_url}/images/banner-icon-144x144.png" + +# prepare the title based on status +case "${status}" in + CRITICAL) + image="${images_base_url}/images/alert-128-red.png" + status_message="is critical" + color="#ca414b" + ;; + + WARNING) + image="${images_base_url}/images/alert-128-orange.png" + status_message="needs attention" + color="#ffc107" + ;; + + CLEAR) + image="${images_base_url}/images/check-mark-2-128-green.png" + status_message="recovered" + color="#77ca6d" + ;; +esac + +if [ "${status}" = "CLEAR" ] +then + severity="Recovered from ${old_status}" + if [ ${non_clear_duration} -gt ${duration} ] + then + raised_for="(alarm was raised for ${non_clear_duration_txt})" + fi + + # don't show the value when the status is CLEAR + # for certain alarms, this value might not have any meaning + alarm="${name//_/ } ${raised_for}" + +elif [ "${old_status}" = "WARNING" -a "${status}" = "CRITICAL" ] +then + severity="Escalated to ${status}" + if [ ${non_clear_duration} -gt ${duration} ] + then + raised_for="(alarm is raised for ${non_clear_duration_txt})" + fi + +elif [ "${old_status}" = "CRITICAL" -a "${status}" = "WARNING" ] +then + severity="Demoted to ${status}" + if [ 
${non_clear_duration} -gt ${duration} ] + then + raised_for="(alarm is raised for ${non_clear_duration_txt})" + fi + +else + raised_for= +fi + +# prepare HTML versions of elements +info_html= +[ ! -z "${info}" ] && info_html=" <small><br/>${info}</small>" + +raised_for_html= +[ ! -z "${raised_for}" ] && raised_for_html="<br/><small>${raised_for}</small>" + +# ----------------------------------------------------------------------------- +# send the slack notification + +# slack aggregates posts from the same username +# so we use "${host} ${status}" as the bot username, to make them diff + +send_slack "${SLACK_WEBHOOK_URL}" "${to_slack}" +SENT_SLACK=$? + +# ----------------------------------------------------------------------------- +# send the Microsoft notification + +# Microsoft team aggregates posts from the same username +# so we use "${host} ${status}" as the bot username, to make them diff + +send_msteam "${MSTEAM_WEBHOOK_URL}" "${to_msteam}" +SENT_MSTEAM=$? + +# ----------------------------------------------------------------------------- +# send the rocketchat notification + +# rocketchat aggregates posts from the same username +# so we use "${host} ${status}" as the bot username, to make them diff + +send_rocketchat "${ROCKETCHAT_WEBHOOK_URL}" "${to_rocketchat}" +SENT_ROCKETCHAT=$? + +# ----------------------------------------------------------------------------- +# send the alerta notification + +# alerta aggregates posts from the same username +# so we use "${host} ${status}" as the bot username, to make them diff + +send_alerta "${ALERTA_WEBHOOK_URL}" "${to_alerta}" +SENT_ALERTA=$? + +# ----------------------------------------------------------------------------- +# send the flock notification + +# flock aggregates posts from the same username +# so we use "${host} ${status}" as the bot username, to make them diff + +send_flock "${FLOCK_WEBHOOK_URL}" "${to_flock}" +SENT_FLOCK=$? 
+ +# ----------------------------------------------------------------------------- +# send the discord notification + +# discord aggregates posts from the same username +# so we use "${host} ${status}" as the bot username, to make them diff + +send_discord "${DISCORD_WEBHOOK_URL}" "${to_discord}" +SENT_DISCORD=$? + +# ----------------------------------------------------------------------------- +# send the pushover notification + +send_pushover "${PUSHOVER_APP_TOKEN}" "${to_pushover}" "${when}" "${goto_url}" "${status}" "${host} ${status_message} - ${name//_/ } - ${chart}" " +<font color=\"${color}\"><b>${alarm}</b></font>${info_html}<br/> +<small><b>${chart}</b><br/>Chart<br/> </small> +<small><b>${family}</b><br/>Family<br/> </small> +<small><b>${severity}</b><br/>Severity<br/> </small> +<small><b>${date}${raised_for_html}</b><br/>Time<br/> </small> +<a href=\"${goto_url}\">View Netdata</a><br/> +<small><small>The source of this alarm is line ${src}</small></small> +" + +SENT_PUSHOVER=$? + +# ----------------------------------------------------------------------------- +# send the pushbullet notification + +send_pushbullet "${PUSHBULLET_ACCESS_TOKEN}" "${PUSHBULLET_SOURCE_DEVICE}" "${to_pushbullet}" "${goto_url}" "${host} ${status_message} - ${name//_/ } - ${chart}" "${alarm}\n +Severity: ${severity}\n +Chart: ${chart}\n +Family: ${family}\n +$(date -d @${when})\n +The source of this alarm is line ${src}" + +SENT_PUSHBULLET=$? + +# ----------------------------------------------------------------------------- +# send the twilio SMS + +send_twilio "${TWILIO_ACCOUNT_SID}" "${TWILIO_ACCOUNT_TOKEN}" "${TWILIO_NUMBER}" "${to_twilio}" "${host} ${status_message} - ${name//_/ } - ${chart}" "${alarm} +Severity: ${severity} +Chart: ${chart} +Family: ${family} +${info}" + +SENT_TWILIO=$? 
+ +# ----------------------------------------------------------------------------- +# send the messagebird SMS + +send_messagebird "${MESSAGEBIRD_ACCESS_KEY}" "${MESSAGEBIRD_NUMBER}" "${to_messagebird}" "${host} ${status_message} - ${name//_/ } - ${chart}" "${alarm} +Severity: ${severity} +Chart: ${chart} +Family: ${family} +${info}" + +SENT_MESSAGEBIRD=$? + + +# ----------------------------------------------------------------------------- +# send the kavenegar SMS + +send_kavenegar "${KAVENEGAR_API_KEY}" "${KAVENEGAR_SENDER}" "${to_kavenegar}" "${host} ${status_message} - ${name//_/ } - ${chart}" "${alarm} +Severity: ${severity} +Chart: ${chart} +Family: ${family} +${info}" + +SENT_KAVENEGAR=$? + + +# ----------------------------------------------------------------------------- +# send the telegram.org message + +# https://core.telegram.org/bots/api#formatting-options +send_telegram "${TELEGRAM_BOT_TOKEN}" "${to_telegram}" "${host} ${status_message} - <b>${name//_/ }</b> +${chart} (${family}) +<a href=\"${goto_url}\">${alarm}</a> +<i>${info}</i>" + +SENT_TELEGRAM=$? + + +# ----------------------------------------------------------------------------- +# send the kafka message + +send_kafka +SENT_KAFKA=$? + + +# ----------------------------------------------------------------------------- +# send the pagerduty.com message + +send_pd "${to_pd}" +SENT_PD=$? + +# ----------------------------------------------------------------------------- +# send the fleep message + +send_fleep "${to_fleep}" +SENT_FLEEP=$? + +# ----------------------------------------------------------------------------- +# send the Prowl message + +send_prowl "${to_prowl}" +SENT_PROWL=$? 
+ +# ----------------------------------------------------------------------------- +# send the irc message + +send_irc "${IRC_NICKNAME}" "${IRC_REALNAME}" "${to_irc}" "${IRC_NETWORK}" "${host}" "${host} ${status_message} - ${name//_/ } - ${chart} ----- ${alarm} +Severity: ${severity} +Chart: ${chart} +Family: ${family} +${info}" + +SENT_IRC=$? + +# ----------------------------------------------------------------------------- +# send the custom message + +send_custom() { + # is it enabled? + [ "${SEND_CUSTOM}" != "YES" ] && return 1 + + # do we have any sender? + [ -z "${1}" ] && return 1 + + # call the custom_sender function + custom_sender "${@}" +} + +send_custom "${to_custom}" +SENT_CUSTOM=$? + + +# ----------------------------------------------------------------------------- +# send hipchat message + +send_hipchat "${HIPCHAT_AUTH_TOKEN}" "${to_hipchat}" " \ +${host} ${status_message}<br/> \ +<b>${alarm}</b> ${info_html}<br/> \ +<b>${chart}</b> (family <b>${family}</b>)<br/> \ +<b>${date}${raised_for_html}</b><br/> \ +<a href=\\\"${goto_url}\\\">View netdata dashboard</a> \ +(source of alarm ${src}) \ +" + +SENT_HIPCHAT=$? + + +# ----------------------------------------------------------------------------- +# send the Amazon SNS message + +send_awssns ${to_awssns} + +SENT_AWSSNS=$? + + +# ----------------------------------------------------------------------------- +# send the syslog message + +send_syslog ${to_syslog} + +SENT_SYSLOG=$? 
+ + +# ----------------------------------------------------------------------------- +# send the email + +send_email <<EOF +To: ${to_email} +Subject: ${host} ${status_message} - ${name//_/ } - ${chart} +MIME-Version: 1.0 +Content-Type: multipart/alternative; boundary="multipart-boundary" +${email_thread_headers} + +This is a MIME-encoded multipart message + +--multipart-boundary +Content-Type: text/plain; encoding=${EMAIL_CHARSET} +Content-Disposition: inline +Content-Transfer-Encoding: 8bit + +${host} ${status_message} + +${alarm} ${info} +${raised_for} + +Chart : ${chart} +Family : ${family} +Severity: ${severity} +URL : ${goto_url} +Source : ${src} +Date : ${date} +Notification generated on ${host} + +Evaluated Expression : ${calc_expression} +Expression Variables : ${calc_param_values} + +The host has ${total_warnings} WARNING and ${total_critical} CRITICAL alarm(s) raised. + +--multipart-boundary +Content-Type: text/html; encoding=${EMAIL_CHARSET} +Content-Disposition: inline +Content-Transfer-Encoding: 8bit + +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> +<html xmlns="http://www.w3.org/1999/xhtml" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0; padding: 0;"> +<body style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; width: 100% !important; min-height: 100%; line-height: 1.6; background: #f6f6f6; margin:0; padding: 0;"> +<table> + <tbody> + <tr> + <td style="vertical-align: top;" valign="top"></td> + <td width="700" style="vertical-align: top; display: block !important; max-width: 700px !important; clear: both !important; margin: 0 auto; padding: 0;" valign="top"> + <div style="max-width: 700px; display: block; margin: 0 auto; padding: 20px;"> + <table width="100%" cellpadding="0" cellspacing="0" style="background: #fff; border: 1px solid #e9e9e9;"> + <tbody> + <tr> + <td 
bgcolor="#eee" style="padding: 5px 20px 5px 20px; background-color: #eee;"> + <div style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 20px; color: #777; font-weight: bold;">netdata notification</div> + </td> + </tr> + <tr> + <td bgcolor="${color}" style="font-size: 16px; vertical-align: top; font-weight: 400; text-align: center; margin: 0; padding: 10px; color: #ffffff; background: ${color} !important; border: 1px solid ${color}; border-top-color: ${color};" align="center" valign="top"> + <h1 style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-weight: 400; margin: 0;">${host} ${status_message}</h1> + </td> + </tr> + <tr> + <td style="vertical-align: top;" valign="top"> + <div style="margin: 0; padding: 20px; max-width: 700px;"> + <table width="100%" cellpadding="0" cellspacing="0" style="max-width:700px"> + <tbody> + <tr> + <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 18px; vertical-align: top; margin: 0; padding:0 0 20px;" align="left" valign="top"> + <span>${chart}</span> + <span style="display: block; color: #666666; font-size: 12px; font-weight: 300; line-height: 1; text-transform: uppercase;">Chart</span> + </td> + </tr> + <tr style="margin: 0; padding: 0;"> + <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 18px; vertical-align: top; margin: 0; padding: 0 0 20px;" align="left" valign="top"> + <span><b>${alarm}</b>${info_html}</span> + <span style="display: block; color: #666666; font-size: 12px; font-weight: 300; line-height: 1; text-transform: uppercase;">Alarm</span> + </td> + </tr> + <tr> + <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 18px; vertical-align: top; margin: 0; padding: 0 0 20px;" align="left" valign="top"> + <span>${family}</span> + <span style="display: block; color: #666666; font-size: 12px; font-weight: 300; line-height: 1; text-transform: uppercase;">Family</span> + </td> + </tr> + 
<tr style="margin: 0; padding: 0;"> + <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 18px; vertical-align: top; margin: 0; padding: 0 0 20px;" align="left" valign="top"> + <span>${severity}</span> + <span style="display: block; color: #666666; font-size: 12px; font-weight: 300; line-height: 1; text-transform: uppercase;">Severity</span> + </td> + </tr> + <tr style="margin: 0; padding: 0;"> + <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 18px; vertical-align: top; margin: 0; padding: 0 0 20px;" align="left" valign="top"><span>${date}</span> + <span>${raised_for_html}</span> <span style="display: block; color: #666666; font-size: 12px; font-weight: 300; line-height: 1; text-transform: uppercase;">Time</span> + </td> + </tr> + <tr style="margin: 0; padding: 0;"> + <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 18px; vertical-align: top; margin: 0; padding: 0 0 20px;" align="left" valign="top"> + <span>${calc_expression}</span> + <span style="display: block; color: #666666; font-size: 12px; font-weight: 300; line-height: 1; text-transform: uppercase;">Evaluated Expression</span> + </td> + </tr> + <tr style="margin: 0; padding: 0;"> + <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 18px; vertical-align: top; margin: 0; padding: 0 0 20px;" align="left" valign="top"> + <span>${calc_param_values}</span> + <span style="display: block; color: #666666; font-size: 12px; font-weight: 300; line-height: 1; text-transform: uppercase;">Expression Variables</span> + </td> + </tr> + <tr style="margin: 0; padding: 0;"> + <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 18px; vertical-align: top; margin: 0; padding: 0 0 20px;" align="left" valign="top"> + The host has ${total_warnings} WARNING and ${total_critical} CRITICAL alarm(s) raised. 
+ </td> + </tr> + + <tr style="margin: 0; padding: 0;"> + <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 18px; vertical-align: top; margin: 0; padding: 0 0 20px;"> + <a href="${goto_url}" style="font-size: 14px; color: #ffffff; text-decoration: none; line-height: 1.5; font-weight: bold; text-align: center; display: inline-block; text-transform: capitalize; background: #35568d; border-width: 1px; border-style: solid; border-color: #2b4c86; margin: 0; padding: 10px 15px;" target="_blank">View Netdata</a> + </td> + </tr> + <tr style="text-align: center; margin: 0; padding: 0;"> + <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 11px; vertical-align: top; margin: 0; padding: 10px 0 0 0; color: #666666;" align="center" valign="bottom">The source of this alarm is line <code>${src}</code><br/>(alarms are configurable, edit this file to adapt the alarm to your needs) + </td> + </tr> + <tr style="text-align: center; margin: 0; padding: 0;"> + <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 12px; vertical-align: top; margin:0; padding: 20px 0 0 0; color: #666666; border-top: 1px solid #f0f0f0;" align="center" valign="bottom">Sent by + <a href="https://mynetdata.io/" target="_blank">netdata</a>, the real-time performance and health monitoring, on <code>${host}</code>. + </td> + </tr> + </tbody> + </table> + </div> + </td> + </tr> + </tbody> + </table> + </div> + </td> + </tr> + </tbody> +</table> +</body> +</html> +--multipart-boundary-- +EOF + +SENT_EMAIL=$? 
+ +# ----------------------------------------------------------------------------- +# let netdata know + +if [ ${SENT_EMAIL} -eq 0 \ + -o ${SENT_PUSHOVER} -eq 0 \ + -o ${SENT_TELEGRAM} -eq 0 \ + -o ${SENT_SLACK} -eq 0 \ + -o ${SENT_MSTEAM} -eq 0 \ + -o ${SENT_ROCKETCHAT} -eq 0 \ + -o ${SENT_ALERTA} -eq 0 \ + -o ${SENT_FLOCK} -eq 0 \ + -o ${SENT_DISCORD} -eq 0 \ + -o ${SENT_TWILIO} -eq 0 \ + -o ${SENT_HIPCHAT} -eq 0 \ + -o ${SENT_MESSAGEBIRD} -eq 0 \ + -o ${SENT_KAVENEGAR} -eq 0 \ + -o ${SENT_PUSHBULLET} -eq 0 \ + -o ${SENT_KAFKA} -eq 0 \ + -o ${SENT_PD} -eq 0 \ + -o ${SENT_FLEEP} -eq 0 \ + -o ${SENT_PROWL} -eq 0 \ + -o ${SENT_IRC} -eq 0 \ + -o ${SENT_AWSSNS} -eq 0 \ + -o ${SENT_CUSTOM} -eq 0 \ + -o ${SENT_SYSLOG} -eq 0 \ + ] + then + # we did send something + exit 0 +fi + +# we did not send anything +exit 1 diff --git a/health/notifications/alarm-test.sh b/health/notifications/alarm-test.sh new file mode 100755 index 0000000..828aa75 --- /dev/null +++ b/health/notifications/alarm-test.sh @@ -0,0 +1,12 @@ +#!/usr/bin/env bash + +# netdata +# real-time performance and health monitoring, done right! +# (C) 2017 Costa Tsaousis <costa@tsaousis.gr> +# SPDX-License-Identifier: GPL-3.0-or-later +# +# Script to test alarm notifications for netdata + +dir="$(dirname "${0}")" +"${dir}/alarm-notify.sh" test "${1}" +exit $? 
diff --git a/health/notifications/alerta/Makefile.inc b/health/notifications/alerta/Makefile.inc new file mode 100644 index 0000000..32fa089 --- /dev/null +++ b/health/notifications/alerta/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + alerta/README.md \ + alerta/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/alerta/README.md b/health/notifications/alerta/README.md new file mode 100644 index 0000000..2826fe7 --- /dev/null +++ b/health/notifications/alerta/README.md @@ -0,0 +1,82 @@ +# alerta.io + +The [Alerta](https://alerta.io) monitoring system is a tool used to +consolidate and de-duplicate alerts from multiple sources for quick +‘at-a-glance’ visualisation. With just one system you can monitor +alerts from many other monitoring tools on a single screen. + +![](https://docs.alerta.io/en/latest/_images/alerta-screen-shot-3.png) + +Netdata alarms can be sent to Alerta so you can see in one place +alerts coming from many Netdata hosts or also from a multi-host +Netdata configuration. The big advantage over other notification +systems is that there is a main view of all active alarms with +the most recent state, and it is also possible to view alarm history. + +## Deploying Alerta + +It is recommended to set up the server on a separate server, VM or +container. If you have another Nginx or Apache server in your organization, +it is recommended to proxy to this new server. + +The easiest way to install Alerta is to use the Docker image available +on [Docker hub][1]. Alternatively, follow the ["getting started"][2] +tutorial to deploy Alerta to an Ubuntu server. More advanced +configurations are out of scope of this tutorial but information +about different deployment scenarios can be found in the [docs][3]. 
+ +[1]: https://hub.docker.com/r/alerta/alerta-web/ +[2]: http://alerta.readthedocs.io/en/latest/gettingstarted/tutorial-1-deploy-alerta.html +[3]: http://docs.alerta.io/en/latest/deployment.html + +## Send alarms to Alerta + +Step 1. Create an API key (if authentication is enabled) + +You will need an API key to send messages from any source, if +Alerta is configured to use authentication (recommended). To +create an API key go to "Configuration -> API Keys" and create +a new API key called "netdata" with `write:alerts` permission. + +Step 2. Configure Netdata to send alarms to Alerta + +On your system run: + + $ /etc/netdata/edit-config health_alarm_notify.conf + +and modify the file as below: + +``` +# enable/disable sending alerta notifications +SEND_ALERTA="YES" + +# here set your alerta server API url +# this is the API url you defined when you installed the Alerta server, +# it is the same for all users. Do not include a trailing slash. +ALERTA_WEBHOOK_URL="http://yourserver/alerta/api" + +# Login with an administrative user to your Alerta server and create an API KEY +# with write permissions. +ALERTA_API_KEY="INSERT_YOUR_API_KEY_HERE" + +# you can define environments in /etc/alertad.conf option ALLOWED_ENVIRONMENTS +# standard environments are Production and Development +# if a role's recipients are not configured, a notification will be sent to +# this Environment (empty = do not send a notification for unconfigured roles): +DEFAULT_RECIPIENT_ALERTA="Production" +``` + +## Test alarms + +We can test alarms using the standard approach: + + $ /opt/netdata/netdata-plugins/plugins.d/alarm-notify.sh test + +Note: Netdata will send 3 alarms, and because the last alarm is "CLEAR" +you will not see them on the main Alerta page; to see them, select +"closed" alarms in the top-right lookup. A small change to `alarm-notify.sh` +that lets us test each state one by one would be useful. 
+ +For more information see [https://docs.alerta.io](https://docs.alerta.io) + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Falerta%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]() diff --git a/health/notifications/awssns/Makefile.inc b/health/notifications/awssns/Makefile.inc new file mode 100644 index 0000000..3d8e58f --- /dev/null +++ b/health/notifications/awssns/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + awssns/README.md \ + awssns/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/awssns/README.md b/health/notifications/awssns/README.md new file mode 100644 index 0000000..5205d4c --- /dev/null +++ b/health/notifications/awssns/README.md @@ -0,0 +1,33 @@ +# Amazon SNS + +As part of its AWS suite, Amazon provides a notification broker service called 'Simple Notification Service' or SNS. Amazon SNS works similarly to Netdata's own notification system, allowing dispatch of a single notification to multiple subscribers of different types. Among other things, SNS supports sending notifications to: + +* Email addresses. +* Mobile Phones via SMS. +* HTTP or HTTPS web hooks. +* AWS Lambda functions. +* AWS SQS queues. +* Mobile applications via push notifications. + +To get this working, you will need: + +* The Amazon Web Services CLI tools. Most distributions provide these with the package name `awscli`. +* An actual home directory for the user you run Netdata as, instead of just using `/` as a home directory. Setup of this is distribution specific. 
`/var/lib/netdata` is the recommended directory (because the permissions will already be correct) if you are using a dedicated user (which is how most distributions work). +* An Amazon SNS topic to send notifications to with one or more subscribers. The [Getting Started](https://docs.aws.amazon.com/sns/latest/dg/GettingStarted.html) section of the Amazon SNS documentation covers the basics of how to set this up. Make note of the Topic ARN when you create the topic. +* While not mandatory, it is highly recommended to create a dedicated IAM user on your account for netdata to send notifications. This user needs to have programmatic access, and should only have access to SNS. If you're really paranoid, you can create one for each system or group of systems. + +Once you have all the above, run the following command as the user netdata runs under: + + aws configure + +This will prompt you for the access key and secret key for accessing Amazon SNS (as well as the default region and output format, but you can leave those blank because we don't use them). + +Once that's done, you're ready to go and can specify the desired topic ARN as a recipient. + +Notes: + + * Netdata's native email notification support is far better in almost all respects than its support through Amazon SNS. If you want email notifications, use the native support, not SNS. + * If you need to change the notification format for SNS notifications, you can do so by specifying the format in `AWSSNS_MESSAGE_FORMAT` in the configuration. This variable supports all the same variables you can use in custom notifications. + * While Amazon SNS supports sending differently formatted messages for different delivery methods, netdata does not currently support this functionality. 
+ +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Fawssns%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]() diff --git a/health/notifications/discord/Makefile.inc b/health/notifications/discord/Makefile.inc new file mode 100644 index 0000000..03d0339 --- /dev/null +++ b/health/notifications/discord/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + discord/README.md \ + discord/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/discord/README.md b/health/notifications/discord/README.md new file mode 100644 index 0000000..7694fef --- /dev/null +++ b/health/notifications/discord/README.md @@ -0,0 +1,46 @@ +# Discordapp.com + +This is what you will get: + +![image](https://cloud.githubusercontent.com/assets/7321975/22215935/b49ede7e-e162-11e6-98d0-ae8541e6b92e.png) + +You need: + +1. The **incoming webhook URL** as given by Discord. Create a webhook by following the official [Discord documentation](https://support.discordapp.com/hc/en-us/articles/228383668-Intro-to-Webhooks). You can use the same on all your netdata servers (or you can have multiple if you like - your decision). +2. One or more Discord channels to post the messages to. + +Set them in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: + +``` +############################################################################### +# sending discord notifications + +# note: multiple recipients can be given like this: +# "CHANNEL1 CHANNEL2 ..." 
+ +# enable/disable sending discord notifications +SEND_DISCORD="YES" + +# Create a webhook by following the official documentation - +# https://support.discordapp.com/hc/en-us/articles/228383668-Intro-to-Webhooks +DISCORD_WEBHOOK_URL="https://discordapp.com/api/webhooks/XXXXXXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" + +# if a role's recipients are not configured, a notification will be sent to +# this discord channel (empty = do not send a notification for unconfigured +# roles): +DEFAULT_RECIPIENT_DISCORD="alarms" + +``` + +You can define multiple channels like this: `alarms systems`. +You can give different channels per **role** using these (in the same file): + +``` +role_recipients_discord[sysadmin]="systems" +role_recipients_discord[dba]="databases systems" +role_recipients_discord[webmaster]="marketing development" +``` + +The keywords `systems`, `databases`, `marketing`, `development` are discordapp.com channels (they should already exist within your discord server). + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Fdiscord%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]() diff --git a/health/notifications/email/Makefile.inc b/health/notifications/email/Makefile.inc new file mode 100644 index 0000000..62dd18a --- /dev/null +++ b/health/notifications/email/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + email/README.md \ + email/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/email/README.md b/health/notifications/email/README.md new file mode 100644 index 0000000..163839b --- /dev/null +++ b/health/notifications/email/README.md @@ -0,0 
+1,33 @@ +# email + +You need a working `sendmail` command for email alerts to work. Almost all MTAs provide a `sendmail` interface. + +netdata sends all emails as user `netdata`, so make sure your `sendmail` works for local users. + +email notifications look like this: + +![image](https://cloud.githubusercontent.com/assets/2662304/18407294/e9218c68-7714-11e6-8739-e4dd8a498252.png) + +## configuration + +To edit `health_alarm_notify.conf` on your system run `/etc/netdata/edit-config health_alarm_notify.conf`. + +You can configure recipients in [`/etc/netdata/health_alarm_notify.conf`](https://github.com/netdata/netdata/blob/99d44b7d0c4e006b11318a28ba4a7e7d3f9b3bae/conf.d/health_alarm_notify.conf#L101). + +You can also configure per role recipients [in the same file, a few lines below](https://github.com/netdata/netdata/blob/99d44b7d0c4e006b11318a28ba4a7e7d3f9b3bae/conf.d/health_alarm_notify.conf#L313). + +Changes to this file do not require netdata restart. + +You can test your configuration by issuing the commands: + +```sh +# become user netdata +sudo su -s /bin/bash netdata + +# send a test alarm +/usr/libexec/netdata/plugins.d/alarm-notify.sh test [ROLE] +``` + +Where `[ROLE]` is the role you want to test. The default (if you don't give a `[ROLE]`) is `sysadmin`. 
+ +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Femail%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]() diff --git a/health/notifications/flock/Makefile.inc b/health/notifications/flock/Makefile.inc new file mode 100644 index 0000000..fbff309 --- /dev/null +++ b/health/notifications/flock/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + flock/README.md \ + flock/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/flock/README.md b/health/notifications/flock/README.md new file mode 100644 index 0000000..0d679ce --- /dev/null +++ b/health/notifications/flock/README.md @@ -0,0 +1,33 @@ +# flock.com + +This is what you will get: + + +![Flock](https://i.imgur.com/ok9bRzw.png) + +You need: + +The **incoming webhook URL** as given by flock.com. You can use the same on all your netdata servers (or you can have multiple if you like - your decision). + +Get them here: https://admin.flock.com/webhooks + +Set them in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: + +``` +############################################################################### +# sending flock notifications + +# enable/disable sending flock notifications +SEND_FLOCK="YES" + +# Login to flock.com and create an incoming webhook. +# You need only one for all your netdata servers. +# Without it, netdata cannot send flock notifications. 
+FLOCK_WEBHOOK_URL="https://api.flock.com/hooks/sendMessage/XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" + +# if a role recipient is not configured, no notification will be sent +DEFAULT_RECIPIENT_FLOCK="alarms" + +``` + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Fflock%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]() diff --git a/health/notifications/health_alarm_notify.conf b/health/notifications/health_alarm_notify.conf new file mode 100755 index 0000000..b96cf57 --- /dev/null +++ b/health/notifications/health_alarm_notify.conf @@ -0,0 +1,1019 @@ +# Configuration for alarm notifications +# +# This configuration is used by: alarm-notify.sh +# changes take effect immediately (the next alarm will use them). +# +# alarm-notify.sh can send: +# - e-mails (using the sendmail command), +# - push notifications to your mobile phone (pushover.net), +# - messages to your slack team (slack.com), +# - messages to your alerta server (alerta.io), +# - messages to your flock team (flock.com), +# - messages to your discord guild (discordapp.com), +# - messages to your telegram chat / group chat (telegram.org) +# - sms messages to your cell phone or any sms enabled device (twilio.com) +# - sms messages to your cell phone or any sms enabled device (messagebird.com) +# - notifications to users on pagerduty.com +# - push notifications to iOS devices (via prowlapp.com) +# - notifications to Amazon SNS topics (aws.amazon.com) +# - messages to your irc channel on your selected network +# - messages to a local or remote syslog daemon +# - message to Microsoft Team (thru webhook) +# +# The 'to' line given at netdata alarms defines a *role*, so that many +# people can be notified for each role. +# +# This file is a BASH script itself. 
+# +# +#------------------------------------------------------------------------------ +# proxy configuration +# +# If you need to send curl based notifications (pushover, pushbullet, slack, alerta, +# flock, discord, telegram) via a proxy, set these to your proxy address: +#export http_proxy="http://10.0.0.1:3128/" +#export https_proxy="http://10.0.0.1:3128/" + + +#------------------------------------------------------------------------------ +# notifications images +# +# Images in notifications need to be downloaded from an Internet facing site. +# To allow notification providers fetch the icons/images, by default we set +# the URL of the global public netdata registry. +# If you have an Internet facing netdata (or you have copied the images/ folder +# of netdata to your web server), set its URL here, to fetch the notification +# images from it. +#images_base_url="http://my.public.netdata.server:19999" + + +#------------------------------------------------------------------------------ +# date handling +# +# You can configure netdata alerts to send dates in any format you want. +# This uses standard `date` command format strings. See `man date` for +# more info on what you can put in here. Note that this has to start with a '+', otherwise it won't work. +# +# For ISO 8601 dates, use '+%FT%T%z' +# For RFC 5322 dates, use '+%a, %d %b %Y %H:%M:%S %z' +# For RFC 3339 dates, use '+%F %T%:z' +# For RFC 1123 dates, use '+%a, %d %b %Y %H:%M:%S %Z' +# For RFC 1036 dates, use '+%A, %d-%b-%y %H:%M:%S %Z' +# For a reasonably local date and time (in that order), use '+%x %X' +# For the old default behavior (compatible with ANSI C's asctime() function), leave this empty. +date_format='' + + +#------------------------------------------------------------------------------ +# hostname handling +# +# By default, Netdata will use the simple hostname for the system (the +# hostname with everything after the first `.` removed) when displaying +# the hostname in alert notifications. 
If you prefer, you can uncomment +# the line below to have Netdata instead use the host's fully qualified +# domain name. +# +# This does not report correct FQDNs for slave systems for which this +# system is a master. +# +# Additionally, if the system host name is overridden in /etc/netdata.conf +# with the `hostname` option, that name will be used unconditionally +# instead of this. +#use_fqdn='YES' + + +#------------------------------------------------------------------------------ +# external commands + +# The full path to the sendmail command. +# If empty, the system $PATH will be searched for it. +# If not found, email notifications will be disabled (silently). +sendmail="" + +# The full path of the curl command. +# If empty, the system $PATH will be searched for it. +# If not found, most notifications will be silently disabled. +curl="" + +# The full path of the nc command. +# If empty, the system $PATH will be searched for it. +# If not found, irc notifications will be silently disabled. +nc="" + +# The full path of the logger command. +# If empty, the system $PATH will be searched for it. +# If not found, syslog notifications will be silently disabled. +logger="" + +# The full path of the aws command. +# If empty, the system $PATH will be searched for it. +# If not found, Amazon SNS notifications will be silently disabled. +aws="" + +#------------------------------------------------------------------------------ +# extra options for external commands +# +# In some cases, you may need to change what options get passed to an +# external command. Such cases are covered here. + +# Extra options to pass to curl. In most cases, you shouldn't need to add anything +# to this. If you're having issues with HTTPS connections, you might try adding +# '--insecure' here, but be warned that it will make it much easier for +# third-parties to block notification delivery, and may allow disclosure +# of potentially sensitive information. 
+#curl_options="--insecure" + +# Extra options to pass to logger. You shouldn't have to specify anything +# here in most cases. +#logger_options="" + +#------------------------------------------------------------------------------ +# extra options + +# By default, a CLEAR notification is not sent unless the alarm was previously WARNING or CRITICAL. +# You can always send it if your system deduplicates alarms. +#clear_alarm_always='YES' + +# +#------------------------------------------------------------------------------ +# NOTE ABOUT RECIPIENTS +# +# When you define recipients (all types): +# +# - email addresses +# - pushover user tokens +# - telegram chat ids +# - slack channels +# - alerta environment +# - flock rooms +# - discord channels +# - hipchat rooms +# - sms phone numbers +# - pagerduty.com (pd) services +# - irc channels +# +# You can append |critical to limit the notifications to be sent. +# +# In these examples, the first recipient receives all the alarms +# while the second one receives only the critical ones: +# +# email : "user1@example.com user2@example.com|critical" +# pushover : "2987343...9437837 8756278...2362736|critical" +# telegram : "111827421 112746832|critical" +# slack : "alarms disasters|critical" +# alerta : "alarms disasters|critical" +# flock : "alarms disasters|critical" +# discord : "alarms disasters|critical" +# twilio : "+15555555555 +17777777777|critical" +# messagebird: "+15555555555 +17777777777|critical" +# kavenegar : "09155555555 09177777777|critical" +# pd : "<pd_service_key_1> <pd_service_key_2>|critical" +# irc : "<irc_channel_1> <irc_channel_2>|critical" +# +# If a recipient is set to an empty string, the default recipient of the given +# notification method (email, pushover, telegram, slack, alerta, etc) will be used. +# To disable a notification, use the recipient called: disabled +# This works for all notification methods (including the default recipients). 
+ + +#------------------------------------------------------------------------------ +# email global notification options + +# multiple recipients can be given like this: +# "admin1@example.com admin2@example.com ..." + +# the email address sending email notifications +# the default is the system user netdata runs as (usually: netdata) +# The following formats are supported: +# EMAIL_SENDER="user@domain" +# EMAIL_SENDER="User Name <user@domain>" +# EMAIL_SENDER="'User Name' <user@domain>" +# EMAIL_SENDER="\"User Name\" <user@domain>" +EMAIL_SENDER="" + +# enable/disable sending emails +SEND_EMAIL="YES" + +# if a role recipient is not configured, an email will be sent to: +DEFAULT_RECIPIENT_EMAIL="root" +# to receive only critical alarms, set it to "root|critical" + +# Optionally specify the encoding to list in the Content-Type header. +# This doesn't change what encoding the e-mail is sent with, just what +# the headers say it was encoded as. +# This shouldn't need to be changed as it will almost always be +# autodetected from the environment. +#EMAIL_CHARSET="UTF-8" + +# You can also have netdata add headers to the message that will +# cause most e-mail clients to treat all notifications for a given +# chart+alarm+host combination as a single thread. This can help +# simplify tracking of alarms, as it provides an easy way for scripts +# to correlate messages and also will cause most clients to group all the +# messages together. This is enabled by default, uncomment the line +# below if you want to disable it. +#EMAIL_THREADING="NO" + + +#------------------------------------------------------------------------------ +# pushover (pushover.net) global notification options + +# multiple recipients can be given like this: +# "USERTOKEN1 USERTOKEN2 ..." + +# enable/disable sending pushover notifications +SEND_PUSHOVER="YES" + +# Login to pushover.net to get your pushover app token. 
+# You need only one for all your netdata servers (or you can have one for +# each of your netdata - your call). +# Without an app token, netdata cannot send pushover notifications. +PUSHOVER_APP_TOKEN="" + +# if a role's recipients are not configured, a notification will be sent to +# this pushover user token (empty = do not send a notification for unconfigured +# roles): +DEFAULT_RECIPIENT_PUSHOVER="" + + +#------------------------------------------------------------------------------ +# pushbullet (pushbullet.com) push notification options + +# multiple recipients can be given like this: +# "user1@email.com user2@mail.com" + +# enable/disable sending pushbullet notifications +SEND_PUSHBULLET="YES" + +# Sign up and log in to pushbullet.com +# To get your Access Token, go to https://www.pushbullet.com/#settings/account +# Create a new access token and paste it below. +# Then just set the recipients' emails. +# Please note that if the email in the DEFAULT_RECIPIENT_PUSHBULLET does +# not have a pushbullet account, the pushbullet service will send an email +# to that address instead. + +# Without an access token, netdata cannot send pushbullet notifications. +PUSHBULLET_ACCESS_TOKEN="" +DEFAULT_RECIPIENT_PUSHBULLET="" + +# Device iden of the sending device. Optional. +PUSHBULLET_SOURCE_DEVICE="" + + +#------------------------------------------------------------------------------ +# Twilio (twilio.com) SMS options + +# multiple recipients can be given like this: +# "+15555555555 +17777777777" + +# enable/disable sending twilio SMS +SEND_TWILIO="YES" + +# Sign up for a free trial and select an SMS capable Twilio Number +# To get your Account SID and Token, go to https://www.twilio.com/console +# Place your sid, token and number below. +# Then just set the recipients' phone numbers. +# The trial account is only allowed to use the number specified when set up. + +# Without an account sid and token, netdata cannot send Twilio text messages. 
+TWILIO_ACCOUNT_SID="" +TWILIO_ACCOUNT_TOKEN="" +TWILIO_NUMBER="" +DEFAULT_RECIPIENT_TWILIO="" + + +#------------------------------------------------------------------------------ +# Messagebird (messagebird.com) SMS options + +# multiple recipients can be given like this: +# "+15555555555 +17777777777" + +# enable/disable sending messagebird SMS +SEND_MESSAGEBIRD="YES" + +# to get an access key, create a free account at https://www.messagebird.com +# verify and activate the account (no CC info needed) +# login to your account and enter your phone number to get some free credits +# to get the API key, click on 'API' in the sidebar, then 'API Access (REST)' +# click 'Add access key' and fill in data (you want a live key to send SMS) + +# Without an access key, netdata cannot send Messagebird text messages. +MESSAGEBIRD_ACCESS_KEY="" +MESSAGEBIRD_NUMBER="" +DEFAULT_RECIPIENT_MESSAGEBIRD="" + + +#------------------------------------------------------------------------------ +# Kavenegar (Kavenegar.com) SMS options + +# multiple recipients can be given like this: +# "09155555555 09177777777" + +# enable/disable sending kavenegar SMS +SEND_KAVENEGAR="YES" + +# to get an access key, select and purchase your desired service +# at http://kavenegar.com/pricing.html +# then login to your account and go to the account settings page at +# https://panel.kavenegar.com/Client/setting/account +# copy your API key from there. You can also generate a new API key. +# You can find and select the kavenegar sender number on the same page. + +# Without an API key, netdata cannot send KAVENEGAR text messages. +KAVENEGAR_API_KEY="" +KAVENEGAR_SENDER="" +DEFAULT_RECIPIENT_KAVENEGAR="" + + +#------------------------------------------------------------------------------ +# telegram (telegram.org) global notification options + +# To get your chat ID send the command /my_id to telegram bot @get_id. +# Users also need to open a query with the bot (see below). 
+ +# note: multiple recipients can be given like this: +# "CHAT_ID_1 CHAT_ID_2 ..." + +# enable/disable sending telegram messages +SEND_TELEGRAM="YES" + +# Contact the bot @BotFather to create a new bot and receive a bot token. +# Without it, netdata cannot send telegram messages. +TELEGRAM_BOT_TOKEN="" + +# If a role's recipients are not configured, a message will be sent to +# this chat id (empty = do not send a notification for unconfigured roles): +DEFAULT_RECIPIENT_TELEGRAM="" + + +#------------------------------------------------------------------------------ +# slack (slack.com) global notification options + +# multiple recipients can be given like this: +# "RECIPIENT1 RECIPIENT2 ..." + +# enable/disable sending slack notifications +SEND_SLACK="YES" + +# Login to slack.com and create an incoming webhook. You need only one for all +# your netdata servers (or you can have one for each of your netdata). +# Without it, netdata cannot send slack notifications. +# Get yours from: https://api.slack.com/incoming-webhooks +SLACK_WEBHOOK_URL="" + +# if a role's recipients are not configured, a notification will be sent to: +# - A slack channel (syntax: '#channel' or 'channel') +# - A slack user (syntax: '@user') +# - The channel or user defined in slack for the webhook (syntax: '#') +# empty = do not send a notification for unconfigured roles +DEFAULT_RECIPIENT_SLACK="" + +#------------------------------------------------------------------------------ +# Microsoft Teams (office.com) global notification options +# More details are available here regarding the payload syntax options : https://docs.microsoft.com/en-us/outlook/actionable-messages/message-card-reference +# Online designer : https://acdesignerbeta.azurewebsites.net/ +# multiple recipients can be given like this: +# "CHANNEL1 CHANNEL2 ..." 
+ +# enable/disable sending MS Teams notifications +SEND_MSTEAM="YES" + +# if a role's recipients are not configured, a notification will be sent to +# this MS Teams channel (empty = do not send a notification for unconfigured +# roles): +# For MS Teams the channel name is encoded in the URI after ....IncomingWebhook/___/..... +# This value will be replaced in the webhook value to publish to several channels in the same Team. +# In order to get it working properly, you have to replace the value between [] ....IncomingWebhook/[___]/..... with the "CHANNEL" string. +DEFAULT_RECIPIENT_MSTEAM="" +# Based on the way MS Teams works, put the different channels here like : "CHANNEL1 CHANNEL2 ..." +# AT LEAST ONE CHANNEL IS MANDATORY +MSTEAM_WEBHOOK_URL="" + +# Define the default color scheme for alerts to MS Teams - icon and color +# Icons - go to https://emojipedia.org/bomb/ +MSTEAM_ICON_DEFAULT="♡" +MSTEAM_ICON_CLEAR="💚" +MSTEAM_ICON_WARNING="⚠️" +MSTEAM_ICON_CRITICAL="🔥" + +# Colors +MSTEAM_COLOR_DEFAULT="0076D7" +MSTEAM_COLOR_CLEAR="65A677" +MSTEAM_COLOR_WARNING="FFA500" +MSTEAM_COLOR_CRITICAL="D93F3C" + + +#------------------------------------------------------------------------------ +# rocketchat (rocket.chat) global notification options + +# multiple recipients can be given like this: +# "CHANNEL1 CHANNEL2 ..." + +# enable/disable sending rocketchat notifications +SEND_ROCKETCHAT="YES" + +# Login to rocket.chat and create an incoming webhook. You need only one for all +# your netdata servers (or you can have one for each of your netdata). +# Without it, netdata cannot send rocketchat notifications. 
+ROCKETCHAT_WEBHOOK_URL="" + +# if a role's recipients are not configured, a notification will be sent to +# this rocketchat channel (empty = do not send a notification for unconfigured +# roles): +DEFAULT_RECIPIENT_ROCKETCHAT="" + + +#------------------------------------------------------------------------------ +# alerta (alerta.io) global notification options + +# multiple recipients (Environments) can be given like this: +# "Production Development ..." + +# enable/disable sending alerta notifications +SEND_ALERTA="YES" + +# here set your alerta server API url +# this is the API url you defined when you installed the Alerta server, +# it is the same for all users. Do not include a trailing slash. +# ALERTA_WEBHOOK_URL="https://<server>/alerta/api" +ALERTA_WEBHOOK_URL="" + +# Login with an administrative user to your Alerta server and create an API KEY +# with write permissions. +ALERTA_API_KEY="" + +# you can define environments in /etc/alertad.conf option ALLOWED_ENVIRONMENTS +# standard environments are Production and Development +# if a role's recipients are not configured, a notification will be sent to +# this Environment (empty = do not send a notification for unconfigured roles): +DEFAULT_RECIPIENT_ALERTA="" + + +#------------------------------------------------------------------------------ +# flock (flock.com) global notification options + +# enable/disable sending flock notifications +SEND_FLOCK="YES" + +# Login to flock.com and create an incoming webhook. You need only one for all +# your netdata servers (or you can have one for each of your netdata). +# Without it, netdata cannot send flock notifications. +FLOCK_WEBHOOK_URL="" + +# if a role recipient is not configured, no notification will be sent +DEFAULT_RECIPIENT_FLOCK="" + + +#------------------------------------------------------------------------------ +# discord (discordapp.com) global notification options + +# multiple recipients can be given like this: +# "CHANNEL1 CHANNEL2 ..." 
+ +# enable/disable sending discord notifications +SEND_DISCORD="YES" + +# Create a webhook by following the official documentation - +# https://support.discordapp.com/hc/en-us/articles/228383668-Intro-to-Webhooks +DISCORD_WEBHOOK_URL="" + +# if a role's recipients are not configured, a notification will be sent to +# this discord channel (empty = do not send a notification for unconfigured +# roles): +DEFAULT_RECIPIENT_DISCORD="" + + +#------------------------------------------------------------------------------ +# hipchat global notification options + +# multiple recipients can be given like this: +# "ROOM1 ROOM2 ..." + +# enable/disable sending hipchat notifications +SEND_HIPCHAT="YES" + +# define hipchat server +HIPCHAT_SERVER="api.hipchat.com" + +# api.hipchat.com authorization token +# Without this, netdata cannot send hipchat notifications. +HIPCHAT_AUTH_TOKEN="" + +# if a role's recipients are not configured, a notification will be sent to +# this hipchat room (empty = do not send a notification for unconfigured +# roles): +DEFAULT_RECIPIENT_HIPCHAT="" + + +#------------------------------------------------------------------------------ +# kafka notification options + +# enable/disable sending kafka notifications +SEND_KAFKA="YES" + +# The URL to POST kafka alarm data to. It should be the full URL. +KAFKA_URL="" + +# The IP to be used in the kafka message as the sender. +KAFKA_SENDER_IP="" + + +#------------------------------------------------------------------------------ +# pagerduty.com notification options +# +# pagerduty.com notifications require a "Generic API" (Events v1) +# pagerduty service. +# https://support.pagerduty.com/docs/services-and-integrations + +# multiple recipients can be given like this: +# "<pd_service_key_1> <pd_service_key_2> ..." 
+ +# enable/disable sending pagerduty notifications +SEND_PD="YES" + +# if a role's recipients are not configured, a notification will be sent to +# the "Generic API" pagerduty.com service that uses this service key. +# (empty = do not send a notification for unconfigured roles): +DEFAULT_RECIPIENT_PD="" + + +#------------------------------------------------------------------------------ +# fleep notification options +# +# To send fleep.io notifications, you will need a webhook for the +# conversation you want to send to. + +# Fleep recipients are specified as the last part of the webhook URL. +# So, for a webhook URL of: https://fleep.io/hook/IJONmBuuSlWlkb_ttqyXJg, the +# recipient name would be: 'IJONmBuuSlWlkb_ttqyXJg'. + +# enable/disable sending fleep notifications +SEND_FLEEP="YES" + +# if a role's recipients are not configured, a notification will not be sent. +# (empty = do not send a notification for unconfigured roles): +DEFAULT_RECIPIENT_FLEEP="" + +# The user name to label the messages with. If this is unset, +# the hostname of the system the notification is for will be used. +FLEEP_SENDER="" + + +#------------------------------------------------------------------------------ +# irc notification options +# +# irc notifications require only the nc utility to be installed. + +# multiple recipients can be given like this: +# "<irc_channel_1> <irc_channel_2> ..." + +# enable/disable sending irc notifications +SEND_IRC="YES" + +# if a role's recipients are not configured, a notification will not be sent. +# (empty = do not send a notification for unconfigured roles): +DEFAULT_RECIPIENT_IRC="" + +# The irc network to which the recipients belong. It must be the full network. +# e.g. "irc.freenode.net" +IRC_NETWORK="" + +# The irc nickname which is required to send the notification. It must not be +# an already registered name as the connection's MODE is defined as a 'guest'. 
+IRC_NICKNAME="" + +# The irc realname which is required in order to make the connection and is an +# extra identifier. +IRC_REALNAME="" + + +#------------------------------------------------------------------------------ +# syslog notifications +# +# syslog notifications only need you to have a working logger command, which +# should be the case on pretty much any Linux system. + +# enable/disable sending syslog notifications +# NOTE: make sure you have everything else configured the way you want +# it _before_ turning this on. +SEND_SYSLOG="NO" + +# A note on log levels and facilities: +# +# The traditional UNIX syslog mechanism has the concept of both log +# levels and facilities. A log level indicates the relative severity of +# the message, while a facility specifies a generic source for the message +# (for example, the `mail` facility is where sendmail and postfix log +# their messages). All major syslog daemons have the ability to filter +# messages based on both log level and facility, and can often also make +# routing decisions for messages based on both factors. +# +# On Linux, the eight log levels in decreasing order of severity are: +# emerg, alert, crit, err, warning, notice, info, debug +# +# By default, warnings will be logged at the warning level, critical +# alerts at the crit level, and clear notifications at the info level. +# +# And the 19 facilities you can log to are: +# auth, authpriv, cron, daemon, ftp, lpr, mail, news, syslog, user, +# uucp, local0, local1, local2, local3, local4, local5, local6, and local7 +# +# By default, netdata alerts will be logged to the local6 facility. +# +# Depending on your distribution, this means that either all your +# netdata alerts will by default end up in the main system log (usually +# /var/log/messages), or they won't be logged to a file at all. 
+# Neither of these is likely to be what you actually want, but any +# configuration to change that needs to happen in the syslog daemon +# configuration, not here. + +# This controls which facility is used by default for logging. Defaults +# to local6. +SYSLOG_FACILITY='' + +# If a role's recipients are not configured, use the following. +# (empty = do not send a notification for unconfigured roles) +# +# The recipient format for syslog uses the following format: +# [[facility.level][@host[:port]]/]prefix +# +# `prefix` gets prepended to all log messages generated for +# that recipient. The prefix is mandatory. +# 'host' and 'port' can be used to specify a remote syslog server to +# send messages to. Leave these out if you want messages to be delivered +# locally. 'host' can be either a hostname or an IP address. +# IPv6 addresses must have square brackets around them. +# 'facility' and 'level' are used to override the default logging facility +# set above and the log level. If one is specified, both must be present. +# +# For example, to send messages with a 'netdata' prefix to a syslog +# daemon listening on port 514 on 'loghost' using the daemon facility and +# notice log level: +# DEFAULT_RECIPIENT_SYSLOG='daemon.notice@loghost:514/netdata' +# +DEFAULT_RECIPIENT_SYSLOG="netdata" + +#------------------------------------------------------------------------------ +# iOS Push Notifications + +# enable/disable sending iOS push notifications +SEND_PROWL="YES" + +# If a role's recipients are not configured, use the following, +# (empty = do not send a notification for unconfigured roles) +# +# Recipients for iOS push notifications are Prowl API keys. +# +# A recipient may also consist of multiple Prowl API keys separated by +# commas, in which case notifications will be simultaneously sent for all +# of those API keys. 
+DEFAULT_RECIPIENT_PROWL="" + +#------------------------------------------------------------------------------ +# Amazon SNS notifications +# +# This method requires potentially complex manual configuration. See the +# netdata wiki for information on what is needed. + +# enable/disable sending Amazon SNS notifications +SEND_AWSSNS="YES" + +# Specify a template for the Amazon SNS notifications. This supports +# the same set of variables that are usable in the `custom_sender()` +# function in the custom notification configuration below. +# +AWSSNS_MESSAGE_FORMAT="${status} on ${host} at ${date}: ${chart} ${value_string}" + +# If a role's recipients are not configured, use the following. +# (empty = do not send a notification for unconfigured roles) +# +# Recipients for AWS SNS notifications are specified as topic ARN's. +# +DEFAULT_RECIPIENT_AWSSNS="" + +#------------------------------------------------------------------------------ +# custom notifications +# + +# enable/disable sending custom notifications +SEND_CUSTOM="YES" + +# if a role's recipients are not configured, use the following. 
+# (empty = do not send a notification for unconfigured roles) +DEFAULT_RECIPIENT_CUSTOM="" + +# The custom_sender() is a custom function to do whatever you need to do +custom_sender() { + # variables you can use: + # ${host} the host generated this event + # ${url_host} same as ${host} but URL encoded + # ${unique_id} the unique id of this event + # ${alarm_id} the unique id of the alarm that generated this event + # ${event_id} the incremental id of the event, for this alarm id + # ${when} the timestamp this event occurred + # ${name} the name of the alarm, as given in netdata health.d entries + # ${url_name} same as ${name} but URL encoded + # ${chart} the name of the chart (type.id) + # ${url_chart} same as ${chart} but URL encoded + # ${family} the family of the chart + # ${url_family} same as ${family} but URL encoded + # ${status} the current status : REMOVED, UNINITIALIZED, UNDEFINED, CLEAR, WARNING, CRITICAL + # ${old_status} the previous status: REMOVED, UNINITIALIZED, UNDEFINED, CLEAR, WARNING, CRITICAL + # ${value} the current value of the alarm + # ${old_value} the previous value of the alarm + # ${src} the line number and file the alarm has been configured + # ${duration} the duration in seconds of the previous alarm state + # ${duration_txt} same as ${duration} for humans + # ${non_clear_duration} the total duration in seconds this is/was non-clear + # ${non_clear_duration_txt} same as ${non_clear_duration} for humans + # ${units} the units of the value + # ${info} a short description of the alarm + # ${value_string} friendly value (with units) + # ${old_value_string} friendly old value (with units) + # ${image} the URL of an image to represent the status of the alarm + # ${color} a color in #AABBCC format for the alarm + # ${goto_url} the URL the user can click to see the netdata dashboard + + # these are more human friendly: + # ${alarm} like "name = value units" + # ${status_message} like "needs attention", "recovered", "is critical" + # 
${severity} like "Escalated to CRITICAL", "Recovered from WARNING" + # ${raised_for} like "(alarm was raised for 10 minutes)" + + # example human readable SMS + local msg="${host} ${status_message}: ${alarm} ${raised_for}" + + # limit it to 160 characters and encode it for use in a URL + urlencode "${msg:0:160}" >/dev/null; msg="${REPLY}" + + # a space separated list of the recipients to send alarms to + to="${1}" + + info "not sending custom notification to ${to}, for ${status} of '${host}.${chart}.${name}' - custom_sender() is not configured." +} + + +############################################################################### +# RECIPIENTS PER ROLE + +# ----------------------------------------------------------------------------- +# generic system alarms +# CPU, disks, network interfaces, entropy, etc + +role_recipients_email[sysadmin]="${DEFAULT_RECIPIENT_EMAIL}" + +role_recipients_pushover[sysadmin]="${DEFAULT_RECIPIENT_PUSHOVER}" + +role_recipients_pushbullet[sysadmin]="${DEFAULT_RECIPIENT_PUSHBULLET}" + +role_recipients_telegram[sysadmin]="${DEFAULT_RECIPIENT_TELEGRAM}" + +role_recipients_slack[sysadmin]="${DEFAULT_RECIPIENT_SLACK}" + +role_recipients_alerta[sysadmin]="${DEFAULT_RECIPIENT_ALERTA}" + +role_recipients_flock[sysadmin]="${DEFAULT_RECIPIENT_FLOCK}" + +role_recipients_discord[sysadmin]="${DEFAULT_RECIPIENT_DISCORD}" + +role_recipients_hipchat[sysadmin]="${DEFAULT_RECIPIENT_HIPCHAT}" + +role_recipients_twilio[sysadmin]="${DEFAULT_RECIPIENT_TWILIO}" + +role_recipients_messagebird[sysadmin]="${DEFAULT_RECIPIENT_MESSAGEBIRD}" + +role_recipients_kavenegar[sysadmin]="${DEFAULT_RECIPIENT_KAVENEGAR}" + +role_recipients_pd[sysadmin]="${DEFAULT_RECIPIENT_PD}" + +role_recipients_fleep[sysadmin]="${DEFAULT_RECIPIENT_FLEEP}" + +role_recipients_irc[sysadmin]="${DEFAULT_RECIPIENT_IRC}" + +role_recipients_syslog[sysadmin]="${DEFAULT_RECIPIENT_SYSLOG}" + +role_recipients_prowl[sysadmin]="${DEFAULT_RECIPIENT_PROWL}" + 
+role_recipients_awssns[sysadmin]="${DEFAULT_RECIPIENT_AWSSNS}" + +role_recipients_custom[sysadmin]="${DEFAULT_RECIPIENT_CUSTOM}" + +role_recipients_msteam[sysadmin]="${DEFAULT_RECIPIENT_MSTEAM}" + +# ----------------------------------------------------------------------------- +# DNS related alarms + +role_recipients_email[domainadmin]="${DEFAULT_RECIPIENT_EMAIL}" + +role_recipients_pushover[domainadmin]="${DEFAULT_RECIPIENT_PUSHOVER}" + +role_recipients_pushbullet[domainadmin]="${DEFAULT_RECIPIENT_PUSHBULLET}" + +role_recipients_telegram[domainadmin]="${DEFAULT_RECIPIENT_TELEGRAM}" + +role_recipients_slack[domainadmin]="${DEFAULT_RECIPIENT_SLACK}" + +role_recipients_alerta[domainadmin]="${DEFAULT_RECIPIENT_ALERTA}" + +role_recipients_flock[domainadmin]="${DEFAULT_RECIPIENT_FLOCK}" + +role_recipients_discord[domainadmin]="${DEFAULT_RECIPIENT_DISCORD}" + +role_recipients_hipchat[domainadmin]="${DEFAULT_RECIPIENT_HIPCHAT}" + +role_recipients_twilio[domainadmin]="${DEFAULT_RECIPIENT_TWILIO}" + +role_recipients_messagebird[domainadmin]="${DEFAULT_RECIPIENT_MESSAGEBIRD}" + +role_recipients_kavenegar[domainadmin]="${DEFAULT_RECIPIENT_KAVENEGAR}" + +role_recipients_pd[domainadmin]="${DEFAULT_RECIPIENT_PD}" + +role_recipients_fleep[domainadmin]="${DEFAULT_RECIPIENT_FLEEP}" + +role_recipients_irc[domainadmin]="${DEFAULT_RECIPIENT_IRC}" + +role_recipients_syslog[domainadmin]="${DEFAULT_RECIPIENT_SYSLOG}" + +role_recipients_prowl[domainadmin]="${DEFAULT_RECIPIENT_PROWL}" + +role_recipients_awssns[domainadmin]="${DEFAULT_RECIPIENT_AWSSNS}" + +role_recipients_custom[domainadmin]="${DEFAULT_RECIPIENT_CUSTOM}" + +role_recipients_msteam[domainadmin]="${DEFAULT_RECIPIENT_MSTEAM}" + +# ----------------------------------------------------------------------------- +# database servers alarms +# mysql, redis, memcached, postgres, etc + +role_recipients_email[dba]="${DEFAULT_RECIPIENT_EMAIL}" + +role_recipients_pushover[dba]="${DEFAULT_RECIPIENT_PUSHOVER}" + 
+role_recipients_pushbullet[dba]="${DEFAULT_RECIPIENT_PUSHBULLET}" + +role_recipients_telegram[dba]="${DEFAULT_RECIPIENT_TELEGRAM}" + +role_recipients_slack[dba]="${DEFAULT_RECIPIENT_SLACK}" + +role_recipients_alerta[dba]="${DEFAULT_RECIPIENT_ALERTA}" + +role_recipients_flock[dba]="${DEFAULT_RECIPIENT_FLOCK}" + +role_recipients_discord[dba]="${DEFAULT_RECIPIENT_DISCORD}" + +role_recipients_hipchat[dba]="${DEFAULT_RECIPIENT_HIPCHAT}" + +role_recipients_twilio[dba]="${DEFAULT_RECIPIENT_TWILIO}" + +role_recipients_messagebird[dba]="${DEFAULT_RECIPIENT_MESSAGEBIRD}" + +role_recipients_kavenegar[dba]="${DEFAULT_RECIPIENT_KAVENEGAR}" + +role_recipients_pd[dba]="${DEFAULT_RECIPIENT_PD}" + +role_recipients_fleep[dba]="${DEFAULT_RECIPIENT_FLEEP}" + +role_recipients_irc[dba]="${DEFAULT_RECIPIENT_IRC}" + +role_recipients_syslog[dba]="${DEFAULT_RECIPIENT_SYSLOG}" + +role_recipients_prowl[dba]="${DEFAULT_RECIPIENT_PROWL}" + +role_recipients_awssns[dba]="${DEFAULT_RECIPIENT_AWSSNS}" + +role_recipients_custom[dba]="${DEFAULT_RECIPIENT_CUSTOM}" + +role_recipients_msteam[dba]="${DEFAULT_RECIPIENT_MSTEAM}" + +# ----------------------------------------------------------------------------- +# web servers alarms +# apache, nginx, lighttpd, etc + +role_recipients_email[webmaster]="${DEFAULT_RECIPIENT_EMAIL}" + +role_recipients_pushover[webmaster]="${DEFAULT_RECIPIENT_PUSHOVER}" + +role_recipients_pushbullet[webmaster]="${DEFAULT_RECIPIENT_PUSHBULLET}" + +role_recipients_telegram[webmaster]="${DEFAULT_RECIPIENT_TELEGRAM}" + +role_recipients_slack[webmaster]="${DEFAULT_RECIPIENT_SLACK}" + +role_recipients_alerta[webmaster]="${DEFAULT_RECIPIENT_ALERTA}" + +role_recipients_flock[webmaster]="${DEFAULT_RECIPIENT_FLOCK}" + +role_recipients_discord[webmaster]="${DEFAULT_RECIPIENT_DISCORD}" + +role_recipients_hipchat[webmaster]="${DEFAULT_RECIPIENT_HIPCHAT}" + +role_recipients_twilio[webmaster]="${DEFAULT_RECIPIENT_TWILIO}" + 
+role_recipients_messagebird[webmaster]="${DEFAULT_RECIPIENT_MESSAGEBIRD}" + +role_recipients_kavenegar[webmaster]="${DEFAULT_RECIPIENT_KAVENEGAR}" + +role_recipients_pd[webmaster]="${DEFAULT_RECIPIENT_PD}" + +role_recipients_fleep[webmaster]="${DEFAULT_RECIPIENT_FLEEP}" + +role_recipients_irc[webmaster]="${DEFAULT_RECIPIENT_IRC}" + +role_recipients_syslog[webmaster]="${DEFAULT_RECIPIENT_SYSLOG}" + +role_recipients_prowl[webmaster]="${DEFAULT_RECIPIENT_PROWL}" + +role_recipients_awssns[webmaster]="${DEFAULT_RECIPIENT_AWSSNS}" + +role_recipients_custom[webmaster]="${DEFAULT_RECIPIENT_CUSTOM}" + +role_recipients_msteam[webmaster]="${DEFAULT_RECIPIENT_MSTEAM}" + +# ----------------------------------------------------------------------------- +# proxy servers alarms +# squid, etc + +role_recipients_email[proxyadmin]="${DEFAULT_RECIPIENT_EMAIL}" + +role_recipients_pushover[proxyadmin]="${DEFAULT_RECIPIENT_PUSHOVER}" + +role_recipients_pushbullet[proxyadmin]="${DEFAULT_RECIPIENT_PUSHBULLET}" + +role_recipients_telegram[proxyadmin]="${DEFAULT_RECIPIENT_TELEGRAM}" + +role_recipients_slack[proxyadmin]="${DEFAULT_RECIPIENT_SLACK}" + +role_recipients_alerta[proxyadmin]="${DEFAULT_RECIPIENT_ALERTA}" + +role_recipients_flock[proxyadmin]="${DEFAULT_RECIPIENT_FLOCK}" + +role_recipients_discord[proxyadmin]="${DEFAULT_RECIPIENT_DISCORD}" + +role_recipients_hipchat[proxyadmin]="${DEFAULT_RECIPIENT_HIPCHAT}" + +role_recipients_twilio[proxyadmin]="${DEFAULT_RECIPIENT_TWILIO}" + +role_recipients_messagebird[proxyadmin]="${DEFAULT_RECIPIENT_MESSAGEBIRD}" + +role_recipients_kavenegar[proxyadmin]="${DEFAULT_RECIPIENT_KAVENEGAR}" + +role_recipients_pd[proxyadmin]="${DEFAULT_RECIPIENT_PD}" + +role_recipients_fleep[proxyadmin]="${DEFAULT_RECIPIENT_FLEEP}" + +role_recipients_irc[proxyadmin]="${DEFAULT_RECIPIENT_IRC}" + +role_recipients_syslog[proxyadmin]="${DEFAULT_RECIPIENT_SYSLOG}" + +role_recipients_prowl[proxyadmin]="${DEFAULT_RECIPIENT_PROWL}" + 
+role_recipients_awssns[proxyadmin]="${DEFAULT_RECIPIENT_AWSSNS}" + +role_recipients_custom[proxyadmin]="${DEFAULT_RECIPIENT_CUSTOM}" + +role_recipients_msteam[proxyadmin]="${DEFAULT_RECIPIENT_MSTEAM}" + +# ----------------------------------------------------------------------------- +# peripheral devices +# UPS, photovoltaics, etc + +role_recipients_email[sitemgr]="${DEFAULT_RECIPIENT_EMAIL}" + +role_recipients_pushover[sitemgr]="${DEFAULT_RECIPIENT_PUSHOVER}" + +role_recipients_pushbullet[sitemgr]="${DEFAULT_RECIPIENT_PUSHBULLET}" + +role_recipients_telegram[sitemgr]="${DEFAULT_RECIPIENT_TELEGRAM}" + +role_recipients_slack[sitemgr]="${DEFAULT_RECIPIENT_SLACK}" + +role_recipients_alerta[sitemgr]="${DEFAULT_RECIPIENT_ALERTA}" + +role_recipients_flock[sitemgr]="${DEFAULT_RECIPIENT_FLOCK}" + +role_recipients_discord[sitemgr]="${DEFAULT_RECIPIENT_DISCORD}" + +role_recipients_hipchat[sitemgr]="${DEFAULT_RECIPIENT_HIPCHAT}" + +role_recipients_twilio[sitemgr]="${DEFAULT_RECIPIENT_TWILIO}" + +role_recipients_messagebird[sitemgr]="${DEFAULT_RECIPIENT_MESSAGEBIRD}" + +role_recipients_kavenegar[sitemgr]="${DEFAULT_RECIPIENT_KAVENEGAR}" + +role_recipients_pd[sitemgr]="${DEFAULT_RECIPIENT_PD}" + +role_recipients_fleep[sitemgr]="${DEFAULT_RECIPIENT_FLEEP}" + +role_recipients_syslog[sitemgr]="${DEFAULT_RECIPIENT_SYSLOG}" + +role_recipients_prowl[sitemgr]="${DEFAULT_RECIPIENT_PROWL}" + +role_recipients_awssns[sitemgr]="${DEFAULT_RECIPIENT_AWSSNS}" + +role_recipients_custom[sitemgr]="${DEFAULT_RECIPIENT_CUSTOM}" + +role_recipients_msteam[sitemgr]="${DEFAULT_RECIPIENT_MSTEAM}" diff --git a/health/notifications/health_email_recipients.conf b/health/notifications/health_email_recipients.conf new file mode 100644 index 0000000..f56c6c6 --- /dev/null +++ b/health/notifications/health_email_recipients.conf @@ -0,0 +1,2 @@ +# OBSOLETE FILE +# REPLACED WITH health_alarm_notify.conf diff --git a/health/notifications/irc/Makefile.inc b/health/notifications/irc/Makefile.inc new file mode 
100644 index 0000000..23be721 --- /dev/null +++ b/health/notifications/irc/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + irc/README.md \ + irc/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/irc/README.md b/health/notifications/irc/README.md new file mode 100644 index 0000000..9ea86e9 --- /dev/null +++ b/health/notifications/irc/README.md @@ -0,0 +1,75 @@ +# IRC + +This is what you will get: + +IRCCloud web client: +![image](https://user-images.githubusercontent.com/31221999/36793487-3735673e-1ca6-11e8-8880-d1d8b6cd3bc0.png) + +Irssi terminal client: +![image](https://user-images.githubusercontent.com/31221999/36793486-3713ada6-1ca6-11e8-8c12-70d956ad801e.png) + + +You need: +1. The `nc` utility. If you do not set the path, netdata will search for it in your system `$PATH`. + +Set the path for `nc` in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: + +``` +#------------------------------------------------------------------------------ +# external commands +# +# The full path of the nc command. +# If empty, the system $PATH will be searched for it. +# If not found, irc notifications will be silently disabled. +nc="/usr/bin/nc" + +``` + +2. An `IRC_NETWORK` to which your preferred channels belong. +3. One or more channels ( `DEFAULT_RECIPIENT_IRC` ) to post the messages to. +4. An `IRC_NICKNAME` and an `IRC_REALNAME` to identify in IRC. 
+ +Set them in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: + +``` +#------------------------------------------------------------------------------ +# irc notification options +# +# irc notifications require only the nc utility to be installed. + +# multiple recipients can be given like this: +# "<irc_channel_1> <irc_channel_2> ..." + +# enable/disable sending irc notifications +SEND_IRC="YES" + +# if a role's recipients are not configured, a notification will not be sent. +# (empty = do not send a notification for unconfigured roles): +DEFAULT_RECIPIENT_IRC="#system-alarms" + +# The irc network to which the recipients belong. It must be the full network. +IRC_NETWORK="irc.freenode.net" + +# The irc nickname which is required to send the notification. It must not be +# an already registered name as the connection's MODE is defined as a 'guest'. +IRC_NICKNAME="netdata-alarm-user" + +# The irc realname which is required in order to make the connection and is an +# extra identifier. +IRC_REALNAME="netdata-user" + +``` + +You can define multiple channels like this: `#system-alarms #networking-alarms`. +You can also filter the notifications like this: `#system-alarms|critical`. +You can give different channels per **role** using these (at the same file): + +``` +role_recipients_irc[sysadmin]="#user-alarms #networking-alarms #system-alarms" +role_recipients_irc[dba]="#databases-alarms" +role_recipients_irc[webmaster]="#networking-alarms" +``` + +The keywords `#user-alarms`, `#networking-alarms`, `#system-alarms`, `#databases-alarms` are irc channels which belong to the specified IRC network. 
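As a rough illustration of the recipient syntax described above, the following POSIX shell sketch splits a recipient such as `#system-alarms|critical` into its channel and its filter. The `split_recipient` helper and the `all` placeholder are hypothetical names used only for this example; this is not netdata's own parsing code, and the exact set of accepted filter keywords is not shown here.

```shell
# Hypothetical helper (not part of netdata): prints "<channel> <filter>".
# "all" is a placeholder label used when no "|filter" suffix is present.
split_recipient() {
    channel="${1%%|*}"                # text before the first '|'
    case "${1}" in
        *"|"*) filter="${1#*|}" ;;    # text after the first '|'
        *)     filter="all" ;;
    esac
    printf '%s %s\n' "${channel}" "${filter}"
}

# Recipients are space separated, as in role_recipients_irc[...] above.
recipients="#user-alarms #system-alarms|critical"
for r in ${recipients}; do
    split_recipient "${r}"
done
```

With the values above, the loop prints one line per recipient: the channel name followed by either the filter keyword or the `all` placeholder.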
+ +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Firc%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]() diff --git a/health/notifications/kavenegar/Makefile.inc b/health/notifications/kavenegar/Makefile.inc new file mode 100644 index 0000000..6a17c34 --- /dev/null +++ b/health/notifications/kavenegar/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + kavenegar/README.md \ + kavenegar/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/kavenegar/README.md b/health/notifications/kavenegar/README.md new file mode 100644 index 0000000..d833eef --- /dev/null +++ b/health/notifications/kavenegar/README.md @@ -0,0 +1,41 @@ +# Kavenegar + +[Kavenegar](https://www.kavenegar.com/) is a service for software developers, based in Iran, that provides APIs for sending and receiving SMS and for making voice calls. + +The notifications will look like this on your Android device: + +![image](https://cloud.githubusercontent.com/assets/17090999/20034652/620b6100-a39b-11e6-96af-4f83b8e830e2.png) + +You will need: + +1. Sign up and log in to kavenegar.com +2. Get your APIKEY and Sender from http://panel.kavenegar.com/client/setting/account +3. Fill in KAVENEGAR_API_KEY="" KAVENEGAR_SENDER="" +4. 
Add the recipient phone numbers to DEFAULT_RECIPIENT_KAVENEGAR="" + +Set them in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: + +``` +############################################################################### +# Kavenegar (kavenegar.com) SMS options + +# multiple recipients can be given like this: +# "09155555555 09177777777" + +# enable/disable sending kavenegar SMS +SEND_KAVENEGAR="YES" + +# to get an access key, after selecting and purchasing your desired service +# at http://kavenegar.com/pricing.html +# log in to your account, go to the "my account" area of your dashboard +# (https://panel.kavenegar.com/Client/setting/account) and copy your API key +# from there. You can generate a new API key too. +# You can find and select your kavenegar sender number in the same place. + +# Without an API key, netdata cannot send KAVENEGAR text messages. +KAVENEGAR_API_KEY="" +KAVENEGAR_SENDER="" +DEFAULT_RECIPIENT_KAVENEGAR="" +``` + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Fkavenegar%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]() diff --git a/health/notifications/messagebird/Makefile.inc b/health/notifications/messagebird/Makefile.inc new file mode 100644 index 0000000..8132fec --- /dev/null +++ b/health/notifications/messagebird/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + messagebird/README.md \ + messagebird/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/messagebird/README.md b/health/notifications/messagebird/README.md new file mode 100644 index 0000000..cdb3e8d 
--- /dev/null +++ b/health/notifications/messagebird/README.md @@ -0,0 +1,41 @@ +# Messagebird + +The messagebird notifications will look like this on your Android device: + +![image](https://cloud.githubusercontent.com/assets/17090999/20034652/620b6100-a39b-11e6-96af-4f83b8e830e2.png) + +You will need: + +1. Sign up and log in to messagebird.com +2. Pick an SMS capable number after sign up to get some free credits +3. Go to <https://www.messagebird.com/app/settings/developers/access> +4. Create a new access key under 'API ACCESS (REST)' (you will want a live key) +5. Fill in MESSAGEBIRD_ACCESS_KEY="XXXXXXXX" MESSAGEBIRD_NUMBER="+XXXXXXXXXXX" +6. Add the recipient phone numbers to DEFAULT_RECIPIENT_MESSAGEBIRD="+XXXXXXXXXXX" + +Set them in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: + +``` +#------------------------------------------------------------------------------ +# Messagebird (messagebird.com) SMS options + +# multiple recipients can be given like this: +# "+15555555555 +17777777777" + +# enable/disable sending messagebird SMS +SEND_MESSAGEBIRD="YES" + +# to get an access key, create a free account at https://www.messagebird.com +# verify and activate the account (no CC info needed) +# login to your account and enter your phone number to get some free credits +# to get the API key, click on 'API' in the sidebar, then 'API Access (REST)' +# click 'Add access key' and fill in data (you want a live key to send SMS) + +# Without an access key, netdata cannot send Messagebird text messages. 
+MESSAGEBIRD_ACCESS_KEY="XXXXXXXX" +MESSAGEBIRD_NUMBER="XXXXXXX" +DEFAULT_RECIPIENT_MESSAGEBIRD="XXXXXXX" + +``` + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Fmessagebird%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]() diff --git a/health/notifications/pagerduty/Makefile.inc b/health/notifications/pagerduty/Makefile.inc new file mode 100644 index 0000000..6012d20 --- /dev/null +++ b/health/notifications/pagerduty/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + pagerduty/README.md \ + pagerduty/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/pagerduty/README.md b/health/notifications/pagerduty/README.md new file mode 100644 index 0000000..884b979 --- /dev/null +++ b/health/notifications/pagerduty/README.md @@ -0,0 +1,37 @@ +# PagerDuty + +[PagerDuty](https://www.pagerduty.com/company/) is the enterprise incident resolution service that integrates with ITOps and DevOps monitoring stacks to improve operational reliability and agility. From enriching and aggregating events to correlating them into incidents, PagerDuty streamlines the incident management process by reducing alert noise and resolution times. + +Here is an example of a PagerDuty dashboard with netdata notifications: + +![PagerDuty dashboard with netdata notifications](https://cloud.githubusercontent.com/assets/19278582/21233877/b466a08a-c2a5-11e6-8d66-ee6eed43818f.png) + +To have netdata send notifications to PagerDuty, you'll first need to set up a PagerDuty `Generic API` service and install the PagerDuty agent on the host running netdata. 
See the following guide for details: + +https://www.pagerduty.com/docs/guides/agent-install-guide/ + +During the setup of the `Generic API` PagerDuty service, you'll obtain a `pagerduty service key`. Keep this **service key** handy. + +Once the PagerDuty agent is installed on your host and can send notifications from your host to your `Generic API` service on PagerDuty, add the **service key** to `DEFAULT_RECIPIENT_PD` in `health_alarm_notify.conf`: + +``` +#------------------------------------------------------------------------------ +# pagerduty.com notification options +# +# pagerduty.com notifications require the pagerduty agent to be installed and +# a "Generic API" pagerduty service. +# https://www.pagerduty.com/docs/guides/agent-install-guide/ + +# multiple recipients can be given like this: +# "<pd_service_key_1> <pd_service_key_2> ..." + +# enable/disable sending pagerduty notifications +SEND_PD="YES" + +# if a role's recipients are not configured, a notification will be sent to +# the "Generic API" pagerduty.com service that uses this service key. 
+# (empty = do not send a notification for unconfigured roles): +DEFAULT_RECIPIENT_PD="<service key>" +``` + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Fpagerduty%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]() diff --git a/health/notifications/prowl/Makefile.inc b/health/notifications/prowl/Makefile.inc new file mode 100644 index 0000000..08e4c2e --- /dev/null +++ b/health/notifications/prowl/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + prowl/README.md \ + prowl/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/prowl/README.md b/health/notifications/prowl/README.md new file mode 100644 index 0000000..1f060ed --- /dev/null +++ b/health/notifications/prowl/README.md @@ -0,0 +1,22 @@ +# prowl + +Prowl is a push notification service for iOS devices. Netdata +supports delivering notifications to iOS devices through Prowl. + +Because of how Netdata integrates with Prowl, there is a hard limit of +at most 1000 notifications per hour (starting from the first notification +sent). Any alerts beyond the first thousand in an hour will be dropped. + +Warning messages will be sent with the 'High' priority, critical messages +will be sent with the 'Emergency' priority, and all other messages will +be sent with the normal priority. Opening the notification's associated +URL will take you to the Netdata dashboard of the system that issued +the alert, directly to the chart that it triggered on. + +## configuration + +To use this, you will need a Prowl API key, which can be requested through +the Prowl website after registering. 
+ +Once you have an API key, simply specify that as a recipient for Prowl +notifications. diff --git a/health/notifications/pushbullet/Makefile.inc b/health/notifications/pushbullet/Makefile.inc new file mode 100644 index 0000000..693a0ff --- /dev/null +++ b/health/notifications/pushbullet/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + pushbullet/README.md \ + pushbullet/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/pushbullet/README.md b/health/notifications/pushbullet/README.md new file mode 100644 index 0000000..42b343e --- /dev/null +++ b/health/notifications/pushbullet/README.md @@ -0,0 +1,44 @@ +# PushBullet + +Will look like this on your browser: +![image](https://cloud.githubusercontent.com/assets/4300670/19109636/278b1c0c-8aee-11e6-8a09-7fc94fdbfec8.png) + +And like this on your Android device: + + +![image](https://cloud.githubusercontent.com/assets/4300670/19109635/278a1dde-8aee-11e6-9984-0bc87a13312d.png) + +You will need: + +1. Signup and Login to pushbullet.com +2. Get your Access Token, go to https://www.pushbullet.com/#settings/account and create a new one +3. Fill in the PUSHBULLET_ACCESS_TOKEN with that value +4. Add the recipient emails to DEFAULT_RECIPIENT_PUSHBULLET +!!PLEASE NOTE THAT IF THE RECIPIENT DOES NOT HAVE A PUSHBULLET ACCOUNT, PUSHBULLET SERVICE WILL SEND AN EMAIL!! 
+ +Set them in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: + +``` +############################################################################### +# pushbullet (pushbullet.com) push notification options + +# multiple recipients can be given like this: +# "user1@email.com user2@mail.com" + +# enable/disable sending pushbullet notifications +SEND_PUSHBULLET="YES" + +# Signup and Login to pushbullet.com +# To get your Access Token, go to https://www.pushbullet.com/#settings/account +# And create a new access token +# Then just set the recipients' emails +# Please note that if the email in DEFAULT_RECIPIENT_PUSHBULLET does +# not have a pushbullet account, the pushbullet service will send an email +# to that address instead + +# Without an access token, netdata cannot send pushbullet notifications. +PUSHBULLET_ACCESS_TOKEN="o.Sometokenhere" +DEFAULT_RECIPIENT_PUSHBULLET="admin1@example.com admin3@somemail.com" +``` + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Fpushbullet%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]() diff --git a/health/notifications/pushover/Makefile.inc b/health/notifications/pushover/Makefile.inc new file mode 100644 index 0000000..926ac7c --- /dev/null +++ b/health/notifications/pushover/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + pushover/README.md \ + pushover/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/pushover/README.md b/health/notifications/pushover/README.md new file mode 100644 index 0000000..1debf5d --- 
/dev/null +++ b/health/notifications/pushover/README.md @@ -0,0 +1,18 @@ +# PushOver + +pushover.net allows you to receive push notifications on your mobile phone. The service seems free for up to 7,500 messages per month. + +netdata will send warning messages with priority `0` and critical messages with priority `1`. pushover.net allows you to select do-not-disturb hours. The way this is configured, critical notifications will ring and vibrate your phone, even during the do-not-disturb hours. All other notifications will be delivered silently. + +You need: + +1. APP TOKEN. You can use the same on all your netdata servers. +2. USER TOKEN for each user you are going to send notifications to. This is the actual recipient of the notification. + +The configuration is similar to the one for Slack notifications. + +pushover.net notifications look like this: + +![image](https://cloud.githubusercontent.com/assets/2662304/18407319/839c10c4-7715-11e6-92c0-12f8215128d3.png) + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Fpushover%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]() diff --git a/health/notifications/rocketchat/Makefile.inc b/health/notifications/rocketchat/Makefile.inc new file mode 100644 index 0000000..a6fc5d5 --- /dev/null +++ b/health/notifications/rocketchat/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + rocketchat/README.md \ + rocketchat/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/rocketchat/README.md b/health/notifications/rocketchat/README.md new file mode 100644 index 0000000..f05e73f --- /dev/null +++ 
b/health/notifications/rocketchat/README.md @@ -0,0 +1,48 @@ +# Rocket.Chat + +This is what you will get: +![Netdata on RocketChat](https://i.imgur.com/Zu4t3j3.png) +You need: + +1. The **incoming webhook URL** as given by RocketChat. You can use the same on all your netdata servers (or you can have multiple if you like - your decision). +2. One or more channels to post the messages to. + +Get them here: https://rocket.chat/docs/administrator-guides/integrations/index.html#how-to-create-a-new-incoming-webhook + +Set them in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: + +``` +#------------------------------------------------------------------------------ +# rocketchat (rocket.chat) global notification options + +# multiple recipients can be given like this: +# "CHANNEL1 CHANNEL2 ..." + +# enable/disable sending rocketchat notifications +SEND_ROCKETCHAT="YES" + +# Login to rocket.chat and create an incoming webhook. You need only one for all +# your netdata servers (or you can have one for each of your netdata servers). +# Without it, netdata cannot send rocketchat notifications. +ROCKETCHAT_WEBHOOK_URL="<your_incoming_webhook_url>" + +# if a role's recipients are not configured, a notification will be sent to +# this rocketchat channel (empty = do not send a notification for unconfigured +# roles). +DEFAULT_RECIPIENT_ROCKETCHAT="monitoring_alarms" + +``` + +You can define multiple channels like this: `alarms systems`. +You can give different channels per **role** using these (in the same file): + +``` +role_recipients_rocketchat[sysadmin]="systems" +role_recipients_rocketchat[dba]="databases systems" +role_recipients_rocketchat[webmaster]="marketing development" +``` + +The keywords `systems`, `databases`, `marketing`, `development` are RocketChat channels (they should already exist). 
+Both public and private channels can be used, even if they differ from the channel configured in your RocketChat incoming webhook. + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Frocketchat%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]() diff --git a/health/notifications/slack/Makefile.inc b/health/notifications/slack/Makefile.inc new file mode 100644 index 0000000..955a8c7 --- /dev/null +++ b/health/notifications/slack/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + slack/README.md \ + slack/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/slack/README.md b/health/notifications/slack/README.md new file mode 100644 index 0000000..6e57828 --- /dev/null +++ b/health/notifications/slack/README.md @@ -0,0 +1,54 @@ +# Slack + +This is what you will get: +![image](https://cloud.githubusercontent.com/assets/2662304/18407116/bbd0fee6-7710-11e6-81cf-58c0defaee2b.png) + +You need: + +1. The **incoming webhook URL** as given by slack.com. You can use the same on all your netdata servers (or you can have multiple if you like - your decision). +2. One or more channels to post the messages to. 
+ +Get them here: https://api.slack.com/incoming-webhooks + +Set them in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: + +``` +############################################################################### +# sending slack notifications + +# note: multiple recipients can be given like this: +# "RECIPIENT1 RECIPIENT2 ..." 
+ +# enable/disable sending slack notifications +SEND_SLACK="YES" + +# Login to slack.com and create an incoming webhook. +# You need only one for all your netdata servers. +# Without it, netdata cannot send slack notifications. +SLACK_WEBHOOK_URL="https://hooks.slack.com/services/XXXXXXXX/XXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" + +# if a role's recipients are not configured, a notification will be sent to: +# - A slack channel (syntax: '#channel' or 'channel') +# - A slack user (syntax: '@user') +# - The channel or user defined in slack for the webhook (syntax: '#') +# empty = do not send a notification for unconfigured roles +DEFAULT_RECIPIENT_SLACK="alarms" + +``` + +You can define multiple recipients like this: `# #alarms systems @myuser`. +This example will send the alarm to: +- The recipient defined in slack for the webhook (not known to netdata) +- The channel 'alarms' +- The channel 'systems' +- The user @myuser + +You can give different recipients per **role** using these (in the same file): + +``` +role_recipients_slack[sysadmin]="systems" +role_recipients_slack[dba]="databases systems" +role_recipients_slack[webmaster]="marketing development" +``` + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Fslack%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]() diff --git a/health/notifications/syslog/Makefile.inc b/health/notifications/syslog/Makefile.inc new file mode 100644 index 0000000..1792b9d --- /dev/null +++ b/health/notifications/syslog/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + syslog/README.md \ + syslog/Makefile.inc \ + $(NULL) + diff 
--git a/health/notifications/syslog/README.md b/health/notifications/syslog/README.md new file mode 100644 index 0000000..597db0c --- /dev/null +++ b/health/notifications/syslog/README.md @@ -0,0 +1,25 @@ +# Syslog + +You need a working `logger` command for this to work. This is the case on pretty much every Linux system in existence, and most BSD systems. + +Logged messages will look like this: + + netdata WARNING on hostname at Tue Apr 3 09:00:00 EDT 2018: disk_space._ out of disk space time = 5h + +## configuration + +System log targets are configured as recipients in [`/etc/netdata/health_alarm_notify.conf`](https://github.com/netdata/netdata/blob/36bedc044584dea791fd29455bdcd287c3306cb2/conf.d/health_alarm_notify.conf#L534) (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`). + +You can also configure per-role targets in the same file a bit further down. + +Targets are defined as follows: + + [[facility.level][@host[:port]]/]prefix + +`prefix` defines what the log messages are prefixed with. By default, all lines are prefixed with 'netdata'. + +The `facility` and `level` are the standard syslog facility and level options, for more info on them see your local `logger` and `syslog` documentation. By default, netdata will log to the `local6` facility, with a log level dependent on the type of message (`crit` for CRITICAL, `warning` for WARNING, and `info` for everything else). + +You can configure sending directly to remote log servers by specifying a host (and optionally a port). However, this has a somewhat high overhead, so it is much preferred to use your local syslog daemon to handle the forwarding of messages to remote systems (pretty much all of them allow at least simple forwarding, and most of the really popular ones support complex queueing and routing of messages to remote log servers). 
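To make the target syntax `[[facility.level][@host[:port]]/]prefix` more concrete, here is a minimal Python sketch that splits a target string into its parts. It is only an illustration of the format described above, not the code netdata uses; the names `SyslogTarget` and `parse_syslog_target` are hypothetical, and the example host name is made up. Fields left as `None` stand for "fall back to netdata's default" (for instance the `local6` facility).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SyslogTarget:
    """Parts of a syslog target; None means the part was not given."""
    prefix: str
    facility: Optional[str] = None
    level: Optional[str] = None
    host: Optional[str] = None
    port: Optional[int] = None


def parse_syslog_target(target: str) -> SyslogTarget:
    """Split '[[facility.level][@host[:port]]/]prefix' into its parts.

    Hypothetical helper for illustration only. It does not handle
    bracketed IPv6 hosts and does not validate facility or level names.
    """
    # Everything after the last '/' is the prefix; the head is optional.
    head, slash, prefix = target.rpartition("/")
    parsed = SyslogTarget(prefix=prefix)
    if not slash:
        return parsed

    # The head is 'facility.level', '@host[:port]', or both in that order.
    selector, at, remote = head.partition("@")
    if selector:
        facility, dot, level = selector.partition(".")
        parsed.facility = facility
        parsed.level = level if dot else None
    if at:
        host, colon, port = remote.partition(":")
        parsed.host = host
        parsed.port = int(port) if colon else None
    return parsed


if __name__ == "__main__":
    for example in (
        "netdata",
        "daemon.info/web",
        "@loghost/netdata",
        "local6.warning@logs.example.com:514/netdata",
    ):
        print(example, "->", parse_syslog_target(example))
```

A bare prefix such as `netdata` leaves every optional field unset, which matches the defaults described above: the local syslog daemon, the `local6` facility, and a level chosen from the alarm status.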
+ +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Fsyslog%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]() diff --git a/health/notifications/telegram/Makefile.inc b/health/notifications/telegram/Makefile.inc new file mode 100644 index 0000000..003996b --- /dev/null +++ b/health/notifications/telegram/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + telegram/README.md \ + telegram/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/telegram/README.md b/health/notifications/telegram/README.md new file mode 100644 index 0000000..9d65254 --- /dev/null +++ b/health/notifications/telegram/README.md @@ -0,0 +1,21 @@ +# Telegram + +[Telegram](https://telegram.org/) is a messaging app with a focus on speed and security, it’s super-fast, simple and free. You can use Telegram on all your devices at the same time — your messages sync seamlessly across any number of your phones, tablets or computers. + +With Telegram, you can send messages, photos, videos and files of any type (doc, zip, mp3, etc), as well as create groups for up to 30,000 people or channels for broadcasting to unlimited audiences. You can write to your phone contacts and find people by their usernames. As a result, Telegram is like SMS and email combined — and can take care of all your personal or business messaging needs. + +netdata will send warning messages without vibration. + +You need: + +1. A bot token. To get one, contact the [@BotFather](https://t.me/BotFather) bot and send the command `/newbot`. Follow the instructions. +2. A chat id for every chat you want to send messages to. 
Contact the [@myidbot](https://t.me/myidbot) bot and send the command `/getid` to get your personal chat id or invite it into a group and issue the same command to get the group chat id. +3. Start a conversation with your bot or invite it into a group you want to send messages to. + +See the Slack README for configuration. + +Telegram messages look like this: + +![image](https://fb.hash.works/ytl/preview.jpg) + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Ftelegram%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]() diff --git a/health/notifications/twilio/Makefile.inc b/health/notifications/twilio/Makefile.inc new file mode 100644 index 0000000..5bd00a2 --- /dev/null +++ b/health/notifications/twilio/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + twilio/README.md \ + twilio/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/twilio/README.md b/health/notifications/twilio/README.md new file mode 100644 index 0000000..743f54e --- /dev/null +++ b/health/notifications/twilio/README.md @@ -0,0 +1,42 @@ +# Twilio + +Will look like this on your Android device: + +![image](https://cloud.githubusercontent.com/assets/17090999/20034652/620b6100-a39b-11e6-96af-4f83b8e830e2.png) + +You will need: + +1. Sign up and log in to twilio.com +2. Pick an SMS-capable number during sign up. +3. Get your SID and Token from <https://www.twilio.com/console> +4. Fill in TWILIO_ACCOUNT_SID="XXXXXXXX" TWILIO_ACCOUNT_TOKEN="XXXXXXXXX" TWILIO_NUMBER="+XXXXXXXXXXX" +5. 
Add the recipient phone numbers to DEFAULT_RECIPIENT_TWILIO="+XXXXXXXXXXX" + +!!PLEASE NOTE THAT IF YOUR ACCOUNT IS A TRIAL ACCOUNT YOU WILL ONLY BE ABLE TO SEND NOTIFICATIONS TO THE NUMBER YOU SIGNED UP WITH + +Set them in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: + +``` +############################################################################### +# Twilio (twilio.com) SMS options + +# multiple recipients can be given like this: +# "+15555555555 +17777777777" + +# enable/disable sending twilio SMS +SEND_TWILIO="YES" + +# Signup for free trial and select a SMS capable Twilio Number +# To get your Account SID and Token, go to https://www.twilio.com/console +# Place your sid, token and number below. +# Then just set the recipients' phone numbers. +# The trial account is only allowed to use the number specified when set up. + +# Without an account sid and token, netdata cannot send Twilio text messages. 
+TWILIO_ACCOUNT_SID="xxxxxxxxx" +TWILIO_ACCOUNT_TOKEN="xxxxxxxxxx" +TWILIO_NUMBER="xxxxxxxxxxx" +DEFAULT_RECIPIENT_TWILIO="+15555555555" +``` + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Ftwilio%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]() diff --git a/health/notifications/web/Makefile.inc b/health/notifications/web/Makefile.inc new file mode 100644 index 0000000..8908243 --- /dev/null +++ b/health/notifications/web/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + web/README.md \ + web/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/web/README.md b/health/notifications/web/README.md new file mode 100644 index 0000000..0aac941 --- /dev/null +++ b/health/notifications/web/README.md @@ -0,0 +1,8 @@ +# Dashboard + +The netdata dashboard shows HTML notifications, when it is open. + +Such web notifications look like this: +![image](https://cloud.githubusercontent.com/assets/2662304/18407279/82bac6a6-7714-11e6-847e-c2e84eeacbfb.png) + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Fweb%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]() |