author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-27 11:08:07 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-27 11:08:07 +0000 |
commit | c69cb8cc094cc916adbc516b09e944cd3d137c01 (patch) | |
tree | f2878ec41fb6d0e3613906c6722fc02b934eeb80 /health | |
parent | Initial commit. (diff) | |
Adding upstream version 1.29.3. (upstream/1.29.3, upstream)
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'health')
150 files changed, 14120 insertions, 0 deletions
diff --git a/health/Makefile.am b/health/Makefile.am new file mode 100644 index 0000000..399d6df --- /dev/null +++ b/health/Makefile.am @@ -0,0 +1,111 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +AUTOMAKE_OPTIONS = subdir-objects +MAINTAINERCLEANFILES = $(srcdir)/Makefile.in + +SUBDIRS = \ + notifications \ + $(NULL) + +CLEANFILES = \ + $(NULL) + +dist_noinst_DATA = \ + README.md \ + $(NULL) + +userhealthconfigdir=$(configdir)/health.d +dist_userhealthconfig_DATA = \ + $(NULL) + +# Explicitly install directories to avoid permission issues due to umask +install-exec-local: + $(INSTALL) -d $(DESTDIR)$(userhealthconfigdir) + +healthconfigdir=$(libconfigdir)/health.d +dist_healthconfig_DATA = \ + health.d/adaptec_raid.conf \ + health.d/am2320.conf \ + health.d/anomalies.conf \ + health.d/apache.conf \ + health.d/apcupsd.conf \ + health.d/apps_plugin.conf \ + health.d/backend.conf \ + health.d/bcache.conf \ + health.d/beanstalkd.conf \ + health.d/bind_rndc.conf \ + health.d/boinc.conf \ + health.d/btrfs.conf \ + health.d/ceph.conf \ + health.d/cgroups.conf \ + health.d/cpu.conf \ + health.d/cockroachdb.conf \ + health.d/couchdb.conf \ + health.d/disks.conf \ + health.d/dnsmasq_dhcp.conf \ + health.d/dns_query.conf \ + health.d/dockerd.conf \ + health.d/elasticsearch.conf \ + health.d/entropy.conf \ + health.d/exporting.conf \ + health.d/fping.conf \ + health.d/ioping.conf \ + health.d/fronius.conf \ + health.d/gearman.conf \ + health.d/haproxy.conf \ + health.d/hdfs.conf \ + health.d/httpcheck.conf \ + health.d/ipc.conf \ + health.d/ipfs.conf \ + health.d/ipmi.conf \ + health.d/isc_dhcpd.conf \ + health.d/kubelet.conf \ + health.d/lighttpd.conf \ + health.d/linux_power_supply.conf \ + health.d/load.conf \ + health.d/mdstat.conf \ + health.d/megacli.conf \ + health.d/memcached.conf \ + health.d/memory.conf \ + health.d/mongodb.conf \ + health.d/mysql.conf \ + health.d/named.conf \ + health.d/net.conf \ + health.d/netfilter.conf \ + health.d/nginx.conf \ + 
health.d/nginx_plus.conf \ + health.d/pihole.conf \ + health.d/phpfpm.conf \ + health.d/portcheck.conf \ + health.d/postgres.conf \ + health.d/processes.conf \ + health.d/pulsar.conf \ + health.d/qos.conf \ + health.d/ram.conf \ + health.d/redis.conf \ + health.d/retroshare.conf \ + health.d/riakkv.conf \ + health.d/scaleio.conf \ + health.d/softnet.conf \ + health.d/squid.conf \ + health.d/stiebeleltron.conf \ + health.d/swap.conf \ + health.d/tcp_conn.conf \ + health.d/tcp_listen.conf \ + health.d/tcp_mem.conf \ + health.d/tcp_orphans.conf \ + health.d/tcp_resets.conf \ + health.d/udp_errors.conf \ + health.d/unbound.conf \ + health.d/varnish.conf \ + health.d/vcsa.conf \ + health.d/vernemq.conf \ + health.d/vsphere.conf \ + health.d/web_log.conf \ + health.d/whoisquery.conf \ + health.d/wmi.conf \ + health.d/x509check.conf \ + health.d/zfs.conf \ + health.d/zookeeper.conf \ + health.d/dbengine.conf \ + $(NULL) diff --git a/health/QUICKSTART.md b/health/QUICKSTART.md new file mode 100644 index 0000000..bc91caf --- /dev/null +++ b/health/QUICKSTART.md @@ -0,0 +1,143 @@ +<!-- +title: "Health quickstart" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/QUICKSTART.md +--> + +# Health quickstart + +In this quickstart guide, you'll learn the basics of editing health configuration files. With this knowledge, you +will be able to customize how and when Netdata triggers alarms based on the health and performance of your system or +infrastructure. + +To learn about more advanced health configurations, visit the [health reference guide](/health/REFERENCE.md). + +## Edit health configuration files + +You should [use `edit-config`](/docs/configure/nodes.md) to edit Netdata's health configuration files. `edit-config` +will open your system's default terminal editor for you to make your changes. 
Once you've saved and closed the editor, +`edit-config` will copy your edited file into `/etc/netdata/health.d/`, which will override the stock file in +`/usr/lib/netdata/conf.d/health.d/` and ensure your customizations are persistent between updates. + +For example, to edit the `cpu.conf` health configuration file, you would run: + +```bash +cd /etc/netdata/ # Replace with your Netdata configuration directory, if not /etc/netdata/ +./edit-config health.d/cpu.conf +``` + +Each health configuration file contains one or more health entities, which always begin with an `alarm:` or `template:` +line. You can edit these entities based on your needs. To make any changes live, be sure to [reload your health +configuration](#reload-health-configuration). + +## Reference Netdata's stock health configuration files + +While you should always [use `edit-config`](#edit-health-configuration-files), you might also want to view the stock +health configuration files Netdata ships with. Stock files can be useful as reference material, or to determine which +file you should edit with `edit-config`. + +By default, Netdata will put health configuration files in `/usr/lib/netdata/conf.d/health.d`. However, you can +double-check the location of these files by navigating to `http://NODE:19999/netdata.conf`, replacing `NODE` with the IP +address or hostname for your Agent dashboard, looking for the `stock health configuration directory` option. The value +here will show the correct path for your installation. + +```conf +[health] + ... + # stock health configuration directory = /usr/lib/netdata/conf.d/health.d +``` + +Navigate to the health configuration directory to see all the available files and open them for reading. + +```bash +cd /usr/lib/netdata/conf.d/health.d/ +ls +adaptec_raid.conf entropy.conf memory.conf squid.conf +am2320.conf fping.conf mongodb.conf stiebeleltron.conf +apache.conf fronius.conf mysql.conf swap.conf +... 
+``` + +> ⚠️ If you edit configuration files in your stock health configuration directory, Netdata will overwrite them during +> any updates. Please use `edit-config` as described in the [section above](#edit-health-configuration-files). + +## Write a new health entity + +While tuning existing alarms may work in some cases, you may need to write entirely new health entities based on how +your systems and applications work. + +To write a new health entity, let's create a new file inside of the `health.d/` directory. We'll name our file +`example.conf` for now. + +```bash +./edit-config health.d/example.conf +``` + +As an example, let's build a health entity that triggers an alarm when your system's RAM usage goes above 80%. Copy and paste +the following into the editor: + +```yaml + alarm: ram_usage + on: system.ram +lookup: average -1m percentage of used + units: % + every: 1m + warn: $this > 80 + crit: $this > 90 + info: The percentage of RAM used by the system. +``` + +Let's look into each of the lines to see how they create a working health entity. + +- `alarm`: The name for your new entity. The name needs to follow these requirements: + - Any alphabet letter or number. + - The symbols `.` and `_`. + - Cannot be `chart name`, `dimension name`, `family name`, or `chart variable names`. +- `on`: Which chart the entity listens to. +- `lookup`: Which metrics the alarm monitors, the duration of time to monitor, and how to process the metrics into a + usable format. + - `average`: Calculate the average of all the metrics collected. + - `-1m`: Use metrics from 1 minute ago until now to calculate that average. + - `percentage`: Clarify that we're calculating a percentage of RAM usage. + - `of used`: Specify which dimension (`used`) on the `system.ram` chart you want to monitor with this entity. +- `units`: Use percentages rather than absolute units. +- `every`: How often to perform the `lookup` calculation to decide whether or not to trigger this alarm. 
+- `warn`/`crit`: The value at which Netdata should trigger a warning or critical alarm. +- `info`: A description of the alarm, which will appear in the dashboard and notifications. + +Let's put all these lines into a human-readable format. + +This health entity, named **ram_usage**, watches the **system.ram** chart. It looks up the last **1 minute** of +metrics from the **used** dimension and calculates the **average** of all those metrics in a **percentage** format, +using a **% unit**. The entity performs this lookup **every minute**. If the average RAM usage percentage over the last +1 minute is **more than 80%**, the entity triggers a warning alarm. If the usage is **more than 90%**, the entity +triggers a critical alarm. + +Now that you've written a new health entity, you need to reload it to see it live on the dashboard. + +## Reload health configuration + +To make any changes to your health configuration live, you must reload Netdata's health monitoring system. To do that +without restarting all of Netdata, run the following: + +```bash +netdatacli reload-health +``` + +If you receive an error like `command not found`, this means that `netdatacli` is not installed in your `$PATH`. In that + case, you can reload only the health component by sending a `SIGUSR2` to Netdata: + +```bash +killall -USR2 netdata +``` +## What's next? + +To learn about all of Netdata's health configuration options, view the [reference guide](/health/REFERENCE.md) and +[daemon configuration](/daemon/config/README.md#health-section-options) for additional options available in the +`[health]` section of `netdata.conf`. + +Or, get guided insights into specific health configurations with our [health guides](/health/README.md#guides). + +Finally, move on to Netdata's [notification system](/health/notifications/README.md) to learn more about how Netdata can +let you know when the health of your systems or apps goes awry. 
+ +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fquickstart%2F&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/health/README.md b/health/README.md new file mode 100644 index 0000000..37f09e8 --- /dev/null +++ b/health/README.md @@ -0,0 +1,38 @@ +<!-- +title: "Health monitoring" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/README.md +--> + +# Health monitoring + +The Netdata Agent is a health watchdog for the health and performance of your systems, services, and applications. We've +worked closely with our community of DevOps engineers, SREs, and developers to define hundreds of production-ready +alarms that work without any configuration. + +The Agent's health monitoring system is also dynamic and fully customizable. You can write entirely new alarms, tune the +community-configured alarms for every app/service [the Agent collects metrics from](/collectors/COLLECTORS.md), or +silence anything you're not interested in. You can even power complex lookups by running statistical algorithms against +your metrics. + +Ready to take the next steps with health monitoring? + +[Quickstart](/health/QUICKSTART.md) + +[Configuration reference](/health/REFERENCE.md) + +## Guides + +Every infrastructure is different, so we're not interested in mandating how you should configure Netdata's health +monitoring features. Instead, these guides should give you the details you need to tweak alarms to your heart's +content. 
+ +[Stopping notifications for individual alarms](/docs/guides/monitor/stop-notifications-alarms.md) + +[Use dimension templates to create dynamic alarms](/docs/guides/monitor/dimension-templates.md) + +## Related features + +**[Notifications](/health/notifications/README.md)**: Get notified about ongoing alarms from your Agents via your +favorite platform(s), such as Slack, Discord, PagerDuty, email, and much more. + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/health/REFERENCE.md b/health/REFERENCE.md new file mode 100644 index 0000000..bc5f40c --- /dev/null +++ b/health/REFERENCE.md @@ -0,0 +1,797 @@ +<!-- +title: "Health configuration reference" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/REFERENCE.md +--> + +# Health configuration reference + +Welcome to the health configuration reference. + +This guide contains information about editing health configuration files to tweak existing alarms or create new health +entities that are customized to the needs of your infrastructure. + +To learn the basics of locating and editing health configuration files, see the [health +quickstart](/health/QUICKSTART.md). + +## Health configuration files + +You can configure the Agent's health watchdog service by editing files in two locations: + +- The `[health]` section in `netdata.conf`. By editing the daemon's behavior, you can disable health monitoring + altogether, run health checks more or less often, and more. See [daemon + configuration](/daemon/config/README.md#health-section-options) for a table of all the available settings, their + default values, and what they control. +- The individual `.conf` files in `health.d/`. 
These health entity files are organized by the type of metric they are + performing calculations on or their associated collector. You should edit these files using the `edit-config` + script. For example: `sudo ./edit-config health.d/cpu.conf`. + +## Health entity reference + +The following reference contains information about the syntax and options of _health entities_, which Netdata attaches +to charts in order to trigger alarms. + +### Entity types + +There are two entity types: **alarms** and **templates**. They have the same format and feature set—the only difference +is their label. + +**Alarms** are attached to specific charts and use the `alarm` label. + +**Templates** define rules that apply to all charts of a specific context, and use the `template` label. Templates help +you apply one entity to all disks, all network interfaces, all MySQL databases, and so on. + +Alarms have higher precedence and will override templates. If an alarm and template entity have the same name and attach +to the same chart, Netdata will use the alarm. + +### Entity format + +Netdata parses the following lines. Beneath the table is an in-depth explanation of each line's purpose and syntax. + +- The `on` and `lookup` lines are **always required**. +- Each entity **must** have one of the following lines: `calc`, `warn`, or `crit`. +- The `alarm` or `template` line must be the first line of any entity. +- A few lines use space-separated lists to define how the entity behaves. You can use `*` as a wildcard or prefix with + `!` for a negative match. Order is important, too! See our [simple patterns docs](../libnetdata/simple_pattern/) for + more examples. + +| line | required | functionality | +| --------------------------------------------------- | --------------- | ------------------------------------------------------------------------------------- | +| [`alarm`/`template`](#alarm-line-alarm-or-template) | yes | Name of the alarm/template. 
| +| [`on`](#alarm-line-on) | yes | The chart this alarm should attach to. | +| [`os`](#alarm-line-os) | no | Which operating systems this alarm should run on. | +| [`hosts`](#alarm-line-hosts) | no | Which hostnames will run this alarm. | +| [`plugin`](#alarm-line-plugin) | no | Restrict an alarm or template to only a certain plugin. | +| [`module`](#alarm-line-module) | no | Restrict an alarm or template to only a certain module. | +| [`families`](#alarm-line-families) | no | Restrict a template to only certain families. | +| [`lookup`](#alarm-line-lookup) | yes | The database lookup to find and process metrics for the chart specified through `on`. | +| [`calc`](#alarm-line-calc) | yes (see above) | A calculation to apply to the value found via `lookup` or another variable. | +| [`every`](#alarm-line-every) | no | The frequency of the alarm. | +| [`green`/`red`](#alarm-lines-green-and-red) | no | Set the green and red thresholds of a chart. | +| [`warn`/`crit`](#alarm-lines-warn-and-crit) | yes (see above) | Expressions that evaluate to true or false; when true, they trigger the alarm. | +| [`to`](#alarm-line-to) | no | A list of roles to send notifications to. | +| [`exec`](#alarm-line-exec) | no | The script to execute when the alarm changes status. | +| [`delay`](#alarm-line-delay) | no | Optional hysteresis settings to prevent floods of notifications. | +| [`repeat`](#alarm-line-repeat) | no | The interval for sending notifications when an alarm is in WARNING or CRITICAL mode. | +| [`option`](#alarm-line-option) | no | Add an option to not clear alarms. | +| [`host labels`](#alarm-line-host-labels) | no | List of labels present on a host. | + +The `alarm` or `template` line must be the first line of any entity. + +#### Alarm line `alarm` or `template` + +This line starts an alarm or template based on the [entity type](#entity-types) you're interested in creating. 
+ +**Alarm:** + +```yaml +alarm: NAME +``` + +**Template:** + +```yaml +template: NAME +``` + +`NAME` can be any alpha character, with `.` (period) and `_` (underscore) as the only allowed symbols, but the names +cannot be `chart name`, `dimension name`, `family name`, or `chart variables names`. + +#### Alarm line `on` + +This line defines the chart this alarm should attach to. + +**Alarms:** + +```yaml +on: CHART +``` + +The value `CHART` should be the unique ID or name of the chart you're interested in, as shown on the dashboard. In the +image below, the unique ID is `system.cpu`. + +![Finding the unique ID of a +chart](https://user-images.githubusercontent.com/1153921/67443082-43b16e80-f5b8-11e9-8d33-d6ee052c6678.png) + +**Template:** + +```yaml +on: CONTEXT +``` + +The value `CONTEXT` should be the context you want this template to attach to. + +Need to find the context? Hover over the date on any given chart and look at the tooltip. In the image below, which +shows a disk I/O chart, the tooltip reads: `proc:/proc/diskstats, disk.io`. + +![Finding the context of a chart via the tooltip](https://user-images.githubusercontent.com/1153921/68882856-2b230880-06cd-11ea-923b-b28c4632d479.png) + +You're interested in what comes after the comma: `disk.io`. That's the name of the chart's context. + +If you create a template using the `disk.io` context, it will apply an alarm to every disk available on your system. + +#### Alarm line `os` + +The alarm or template will be used only if the operating system of the host matches this list specified in `os`. The +value is a space-separated list. + +The following example enables the entity on Linux, FreeBSD, and macOS, but no other operating systems. + +```yaml +os: linux freebsd macos +``` + +#### Alarm line `hosts` + +The alarm or template will be used only if the hostname of the host matches this space-separated list. 
+ +The following example will load on systems with the hostnames `server1` and `server2`, and any system with hostnames that +begin with `database`. It _will not load_ on the host `redis3`, but will load on any _other_ systems with hostnames that +begin with `redis`. + +```yaml +hosts: server1 server2 database* !redis3 redis* +``` + +#### Alarm line `plugin` + +The `plugin` line filters which plugin within the context this alarm should apply to. The value is a space-separated +list of [simple patterns](/libnetdata/simple_pattern/README.md). For example, +you can create a filter for an alarm that applies specifically to `python.d.plugin`: + +```yaml +plugin: python.d.plugin +``` + +The `plugin` line is best used with other options like `module`. When used alone, the `plugin` line creates a very +inclusive filter that is unlikely to be of much use in production. See [`module`](#alarm-line-module) for a +comprehensive example using both. + +#### Alarm line `module` + +The `module` line filters which module within the context this alarm should apply to. The value is a space-separated +list of [simple patterns](/libnetdata/simple_pattern/README.md). For +example, you can create an alarm that applies only on the `isc_dhcpd` module started by `python.d.plugin`: + +```yaml +plugin: python.d.plugin +module: isc_dhcpd +``` + +#### Alarm line `families` + +The `families` line, used only alongside templates, filters which families within the context this alarm should apply +to. + +The value is a space-separated list of simple patterns. See our [simple patterns docs](../libnetdata/simple_pattern/) for +some examples. + +For example, you can create a template on the `disk.io` context, but filter it to only the `sda` and `sdb` families: + +```yaml +families: sda sdb +``` + +#### Alarm line `lookup` + +This line makes a database lookup to find a value. The result of this lookup is available as `$this`. 
+ +The format is: + +```yaml +lookup: METHOD AFTER [at BEFORE] [every DURATION] [OPTIONS] [of DIMENSIONS] [foreach DIMENSIONS] +``` + +The syntax is the same as for [badges](../web/api/badges/). In short: + +- `METHOD` is one of `average`, `min`, `max`, `sum`, `incremental-sum`. + **This is required**. + +- `AFTER` is a relative number of seconds, but it also accepts a single letter for changing + the units, like `-1s` = 1 second in the past, `-1m` = 1 minute in the past, `-1h` = 1 hour + in the past, `-1d` = 1 day in the past. You need a negative number (i.e. how far in the past + to look for the value). **This is required**. + +- `at BEFORE` is by default 0 and is not required. Using this you can define the end of the + lookup. So data will be evaluated between `AFTER` and `BEFORE`. + +- `every DURATION` sets the update frequency of the lookup (supports single letter units as + above too). + +- `OPTIONS` is a space-separated list of `percentage`, `absolute`, `min2max`, `unaligned`, + `match-ids`, `match-names`. Check the badges documentation for more info. + +- `of DIMENSIONS` is optional and has to be the last parameter. Dimensions have to be separated + by `,` or `|`. The space characters found in dimensions will be kept as-is (a few dimensions + have spaces in their names). This accepts Netdata simple patterns _(with `words` separated by + `,` or `|` instead of spaces)_ and the `match-ids` and `match-names` options affect the searches + for dimensions. + +- `foreach DIMENSIONS` is optional, will always be the last parameter, and uses the same `,`/`|` + rules as the `of` parameter. Each dimension you specify in `foreach` will use the same rule + to trigger an alarm. If you set both `of` and `foreach`, Netdata will ignore the `of` parameter + and replace it with one of the dimensions you gave to `foreach`. + +The result of the lookup will be available as `$this` and `$NAME` in expressions. 
+The timestamps of the timeframe evaluated by the database lookup are available as the variables +`$after` and `$before` (both are unix timestamps). + +#### Alarm line `calc` + +A `calc` is designed to apply some calculation to the values or variables available to the entity. The result of the +calculation will be made available at the `$this` variable, overwriting the value from your `lookup`, to use in warning +and critical expressions. + +When paired with `lookup`, `calc` will perform the calculation just after `lookup` has retrieved a value from Netdata's +database. + +You can use `calc` without `lookup` if you are using [other available variables](#variables). + +The `calc` line uses [expressions](#expressions) for its syntax. + +```yaml +calc: EXPRESSION +``` + +#### Alarm line `every` + +Sets the update frequency of this alarm. This is the same as the `every DURATION` given +in the `lookup` lines. + +Format: + +```yaml +every: DURATION +``` + +`DURATION` accepts `s` for seconds, `m` for minutes, `h` for hours, `d` for days. + +#### Alarm lines `green` and `red` + +Set the green and red thresholds of a chart. Both are available as `$green` and `$red` in expressions. If multiple +alarms define different thresholds, the ones defined by the first alarm will be used. These will eventually be visualized +on the dashboard, so only one set of them is allowed. If you need multiple sets of them in different alarms, use +absolute numbers instead of `$red` and `$green`. + +Format: + +```yaml +green: NUMBER +red: NUMBER +``` + +#### Alarm lines `warn` and `crit` + +Define the expression that triggers either a warning or critical alarm. These are optional, and should evaluate to +either true or false (or zero/non-zero). + +The format uses Netdata's [expressions syntax](#expressions). + +```yaml +warn: EXPRESSION +crit: EXPRESSION +``` + +#### Alarm line `to` + +This will be the first parameter of the script to be executed when the alarm switches status. 
Its meaning is left up to +the `exec` script. + +The default `exec` script, `alarm-notify.sh`, uses this field as a space separated list of roles, which are then +consulted to find the exact recipients per notification method. + +Format: + +```yaml +to: ROLE1 ROLE2 ROLE3 ... +``` + +#### Alarm line `exec` + +The script that will be executed when the alarm changes status. + +Format: + +```yaml +exec: SCRIPT +``` + +The default `SCRIPT` is Netdata's `alarm-notify.sh`, which supports all the notifications methods Netdata supports, +including custom hooks. + +#### Alarm line `delay` + +This is used to provide optional hysteresis settings for the notifications, to defend against notification floods. These +settings do not affect the actual alarm - only the time the `exec` script is executed. + +Format: + +```yaml +delay: [[[up U] [down D] multiplier M] max X] +``` + +- `up U` defines the delay to be applied to a notification for an alarm that raised its status + (i.e. CLEAR to WARNING, CLEAR to CRITICAL, WARNING to CRITICAL). For example, `up 10s`, the + notification for this event will be sent 10 seconds after the actual event. This is used in + hope the alarm will get back to its previous state within the duration given. The default `U` + is zero. + +- `down D` defines the delay to be applied to a notification for an alarm that moves to lower + state (i.e. CRITICAL to WARNING, CRITICAL to CLEAR, WARNING to CLEAR). For example, `down 1m` + will delay the notification by 1 minute. This is used to prevent notifications for flapping + alarms. The default `D` is zero. + +- `multiplier M` multiplies `U` and `D` when an alarm changes state, while a notification is + delayed. The default multiplier is `1.0`. + +- `max X` defines the maximum absolute notification delay an alarm may get. The default `X` + is `max(U * M, D * M)` (i.e. the max duration of `U` or `D` multiplied once with `M`). 
+ + Example: + + `delay: up 10s down 15m multiplier 2 max 1h` + + The time is `00:00:00` and the status of the alarm is CLEAR. + + | time of event | new status | delay | notification will be sent | why | + | ------------- | ---------- | --- | ------------------------- | --- | + | 00:00:01 | WARNING | `up 10s` | 00:00:11 | first state switch | + | 00:00:05 | CLEAR | `down 15m x2` | 00:30:05 | the alarm changes state while a notification is delayed, so it was multiplied | + | 00:00:06 | WARNING | `up 10s x2 x2` | 00:00:26 | multiplied twice | + | 00:00:07 | CLEAR | `down 15m x2 x2 x2` | 00:45:07 | multiplied 3 times. | + + So: + + - `U` and `D` are multiplied by `M` every time the alarm changes state (any state, not just + their matching one) and a delay is in place. + - All are reset to their defaults when the alarm switches state without a delay in place. + +#### Alarm line `repeat` + +Defines the interval between repeating notifications for the alarms in CRITICAL or WARNING mode. This will override the +default interval settings inherited from health settings in `netdata.conf`. The default settings for repeating +notifications are `default repeat warning = DURATION` and `default repeat critical = DURATION`, which can be found in +health stock configuration. When one of these intervals is greater than 0, Netdata will activate the repeat notification +for `CRITICAL`, `CLEAR` and `WARNING` messages. + +Format: + +```yaml +repeat: [off] [warning DURATION] [critical DURATION] +``` + +- `off`: Turns off the repeating feature for the current alarm. This is effective when the default repeat settings have + been enabled in health configuration. +- `warning DURATION`: Defines the interval when the alarm is in WARNING state. Use `0s` to turn off the repeating + notification for WARNING mode. +- `critical DURATION`: Defines the interval when the alarm is in CRITICAL state. Use `0s` to turn off the repeating + notification for CRITICAL mode. 
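As a sketch of how a `repeat` line might sit inside a complete entity, the fragment below re-sends notifications every 10 minutes while the alarm is in WARNING and every minute while it is in CRITICAL. The entity name `ram_usage_repeat`, the thresholds, and the intervals are illustrative assumptions, not values from the stock configuration.

```yaml
# Hypothetical entity: the name, thresholds, and intervals are examples only.
 alarm: ram_usage_repeat
    on: system.ram
lookup: average -1m percentage of used
 every: 1m
  warn: $this > 80
  crit: $this > 90
# Re-notify every 10 minutes in WARNING and every minute in CRITICAL.
repeat: warning 10m critical 1m
```

Replacing the last line with `repeat: off` would, per the format above, disable repeating notifications for this entity only.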
+ +#### Alarm line `option` + +The only possible value for the `option` line is + +```yaml +option: no-clear-notification +``` + +For some alarms we need to compare two time-frames to detect anomalies. For example, `health.d/httpcheck.conf` has an +alarm template called `web_service_slow` that compares the average HTTP call response time over the last 3 minutes +with the average over the last hour. It triggers a warning alarm when the average of the last 3 minutes is twice +the average of the last hour. In such cases, it is easy to trigger the alarm, but difficult to tell when the alarm is +cleared. As time passes, the newest window moves into the older, so the average response time of the last hour will keep +increasing. Eventually, the comparison will find the averages in the two time-frames close enough to clear the alarm. +However, the issue was not resolved; it's just a matter of the newer data "polluting" the old. For such alarms, it's a +good idea to tell Netdata to not clear the notification, by using the `no-clear-notification` option. + +#### Alarm line `host labels` + +Defines the list of labels present on a host. See our [host labels guide](/docs/guides/using-host-labels.md) for +an explanation of host labels and how to implement them. + +For example, let's suppose that `netdata.conf` is configured with the following labels: + +```yaml +[host labels] + installed = 20191211 + room = server +``` + +And more labels in `netdata.conf` for workstations: + +```yaml +[host labels] + installed = 201705 + room = workstation +``` + +By defining labels inside of `netdata.conf`, you can now apply labels to alarms. For example, you can add the following +line to any alarms you'd like to apply to hosts that have the label `room = server`. + +```yaml +host labels: room = server +``` + +The `host labels` line is a space-separated list that accepts simple patterns. 
For example, you can create an alarm +that will be applied to all hosts installed in the last decade with the following line: + +```yaml +host labels: installed = 201* +``` + +See our [simple patterns docs](../libnetdata/simple_pattern/) for more examples. + +## Expressions + +Netdata has an internal [infix expression parser](../libnetdata/eval). This parses expressions and creates an internal +structure that allows fast execution of them. + +These operators are supported `+`, `-`, `*`, `/`, `<`, `<=`, `<>`, `!=`, `>`, `>=`, `&&`, `||`, `!`, `AND`, `OR`, `NOT`. +Boolean operators result in either `1` (true) or `0` (false). + +The conditional evaluation operator `?` is supported too. Using this operator IF-THEN-ELSE conditional statements can be +specified. The format is: `(condition) ? (true expression) : (false expression)`. So, Netdata will first evaluate the +`condition` and based on the result will either evaluate `true expression` or `false expression`. + +Example: `($this > 0) ? ($avail * 2) : ($used / 2)`. + +Nested such expressions are also supported (i.e. `true expression` and `false expression` can contain conditional +evaluations). + +Expressions also support the `abs()` function. + +Expressions can have variables. Variables start with `$`. Check below for more information. + +There are two special values you can use: + +- `nan`, for example `$this != nan` will check if the variable `this` is available. A variable can be `nan` if the + database lookup failed. All calculations (i.e. addition, multiplication, etc) with a `nan` result in a `nan`. + +- `inf`, for example `$this != inf` will check if `this` is not infinite. A value or variable can be set to infinite + if divided by zero. All calculations (i.e. addition, multiplication, etc) with a `inf` result in a `inf`. 
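A minimal sketch of how the conditional operator, `abs()`, and the `nan` special value might be combined in one entity. The fallback value `0` and the threshold are illustrative assumptions rather than anything taken from the stock alarms.

```yaml
# Hypothetical fragment: fallback value and threshold are examples only.
# If the lookup failed ($this is nan), fall back to 0; otherwise use the absolute value.
calc: ($this != nan) ? (abs($this)) : (0)
# $this now holds the result of calc, so the comparison never sees nan.
warn: $this > 80
```

Because any calculation involving `nan` yields `nan`, checking for it in `calc` keeps later `warn` and `crit` expressions from operating on an unavailable value.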
+ +### Special use of the conditional operator + +A common (but not necessarily obvious) use of the conditional evaluation operator is to provide +[hysteresis](https://en.wikipedia.org/wiki/Hysteresis) around the critical or warning thresholds. This usage helps to +avoid bogus messages resulting from small variations in the value when it is varying regularly but staying close to the +threshold value, without needing to delay sending messages at all. + +An example of such usage from the default CPU usage alarms bundled with Netdata is: + +```yaml +warn: $this > (($status >= $WARNING) ? (75) : (85)) +crit: $this > (($status == $CRITICAL) ? (85) : (95)) +``` + +The above says: + +- If the alarm is currently a warning or critical, then the threshold for being considered a warning is 75, otherwise it's 85. + +- If the alarm is currently critical, then the threshold for being considered critical is 85, otherwise it's 95. + +This, in turn, results in the following behavior: + +- While the value is rising, it will trigger a warning when it exceeds 85, and a critical alert when it exceeds 95. + +- While the value is falling, it will return to a warning state when it goes below 85, and a normal state when it goes + below 75. + +- If the value is constantly varying between 80 and 90, then it will trigger a warning the first time it goes above + 85, but will remain a warning until it goes below 75 (or goes above 95, at which point it becomes critical). + +- If the value is constantly varying between 90 and 100, then it will trigger a critical alert the first time it goes + above 95, but will remain a critical alert until it goes below 85 (at which point it will return to being a warning). + +## Variables + +You can find all the variables that can be used for a given chart, using +`http://NODE:19999/api/v1/alarm_variables?chart=CHART_NAME`, replacing `NODE` with the IP address or hostname for your +Agent dashboard. 
For example, [variables for the `system.cpu` chart of the +registry](https://registry.my-netdata.io/api/v1/alarm_variables?chart=system.cpu). + +> If you don't know how to find the CHART_NAME, you can read about it [here](../web/README.md#charts). + +Netdata supports 3 internal indexes for variables that will be used in health monitoring. + +<details markdown="1"><summary>The variables below can be used in both chart alarms and context templates.</summary> + +Although the `alarm_variables` link shows you variables for a particular chart, the same variables can also be used in +templates for charts belonging to a given [context](../web/README.md#contexts). The reason is that all charts of a given +context are essentially identical, with the only difference being the [family](../web/README.md#families) that +identifies a particular hardware or software instance. Charts and templates do not apply to specific families anyway, +unless you explicitly limit an alarm with the [alarm line `families`](#alarm-line-families). + +</details> + +- **chart local variables**. All the dimensions of the chart are exposed as local variables. The value of `$this` for + the other configured alarms of the chart also appears, under the name of each configured alarm. + + Charts also define a few special variables: + + - `$last_collected_t` is the unix timestamp of the last data collection + - `$collected_total_raw` is the sum of all the dimensions (their last collected values) + - `$update_every` is the update frequency of the chart + - `$green` and `$red` are the thresholds defined in alarms (these are per chart; the chart + inherits them from the first alarm that defined them) + + Chart dimensions define their last calculated (i.e. 
interpolated) value, exactly as + shown on the charts, but also a variable with their name and suffix `_raw` that resolves + to the last collected value, as collected, and another with suffix `_last_collected_t` + that resolves to the unix timestamp at which the dimension was last collected (there may be dimensions + that fail to be collected while others continue normally). + +- **family variables**. Families are used to group charts together. For example, all `eth0` + charts have `family = eth0`. This index includes all local variables, but if there are + overlapping variables, only the first are exposed. + +- **host variables**. All the dimensions of all charts, including all alarms, in fullname. + Fullname is `CHART.VARIABLE`, where `CHART` is either the chart id or the chart name (both + are supported). + +- **special variables** are: + + - `$this`, which is resolved to the value of the current alarm. + + - `$status`, which is resolved to the current status of the alarm (the current = the last + status, i.e. before the current database lookup and the evaluation of the `calc` line). + These values can be compared with `$REMOVED`, `$UNINITIALIZED`, `$UNDEFINED`, `$CLEAR`, + `$WARNING`, `$CRITICAL`. These values are incremental, i.e. `$status > $CLEAR` works as + expected. + + - `$now`, which is resolved to the current unix timestamp. + +## Alarm statuses + +Alarms can have the following statuses: + +- `REMOVED` - the alarm has been deleted (this happens when a SIGUSR2 is sent to Netdata + to reload health configuration) + +- `UNINITIALIZED` - the alarm is not initialized yet + +- `UNDEFINED` - the alarm failed to be calculated (i.e. the database lookup failed, + a division by zero occurred, etc.) + +- `CLEAR` - the alarm is not armed / raised (i.e. is OK) + +- `WARNING` - the warning expression resulted in true or non-zero + +- `CRITICAL` - the critical expression resulted in true or non-zero + +The external script will be called for all status changes. 
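To see how the ordered statuses combine with the hysteresis thresholds from the CPU usage example earlier in this document, here is a small Python sketch. It is a simplified model, not Netdata's evaluator: the numeric values of `Status` are illustrative only, `next_status` and `replay` are hypothetical names, and the sketch assumes that a true `crit` expression takes precedence over a true `warn` expression.

```python
from enum import IntEnum


class Status(IntEnum):
    # Ordered so that comparisons such as `$status > $CLEAR` work.
    # The numeric values are illustrative, not Netdata's internal ones.
    REMOVED = -2
    UNINITIALIZED = -1
    UNDEFINED = 0
    CLEAR = 1
    WARNING = 2
    CRITICAL = 3


def next_status(value: float, status: Status) -> Status:
    """Evaluate the two hysteresis expressions for one sample."""
    # warn: $this > (($status >= $WARNING) ? (75) : (85))
    warn_threshold = 75 if status >= Status.WARNING else 85
    # crit: $this > (($status == $CRITICAL) ? (85) : (95))
    crit_threshold = 85 if status == Status.CRITICAL else 95

    if value > crit_threshold:  # assumption: critical wins over warning
        return Status.CRITICAL
    if value > warn_threshold:
        return Status.WARNING
    return Status.CLEAR


def replay(values):
    """Feed a series of values through the alarm, starting from CLEAR."""
    status = Status.CLEAR
    history = []
    for value in values:
        status = next_status(value, status)
        history.append(status.name)
    return history


history = replay([80, 86, 80, 96, 90, 84, 74])
```

In this replay the value 80 leaves the alarm `CLEAR` at first, but the same value keeps it in `WARNING` once 86 has raised it, which is the point of the hysteresis.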
+ +## Example alarms + +Check the `health/health.d/` directory for all alarms shipped with Netdata. + +Here are a few examples: + +### Example 1 + +A simple check of whether an apache server is alive: + +```yaml +template: apache_last_collected_secs + on: apache.requests + calc: $now - $last_collected_t + every: 10s + warn: $this > ( 5 * $update_every) + crit: $this > (10 * $update_every) +``` + +The above checks that Netdata is able to collect data from apache. In detail: + +```yaml +template: apache_last_collected_secs +``` + +The above defines a **template** named `apache_last_collected_secs`. +The name is important since `$apache_last_collected_secs` resolves to the result of the `calc` line. +So, try to give it a descriptive name. + +```yaml + on: apache.requests +``` + +The above applies the **template** to all charts that have `context = apache.requests` +(i.e. all your apache servers). + +```yaml + calc: $now - $last_collected_t +``` + +- `$now` is a standard variable that resolves to the current timestamp. + +- `$last_collected_t` is the last data collection timestamp of the chart. + So this calculation gives the number of seconds passed since the last data collection. + +```yaml + every: 10s +``` + +The alarm will be evaluated every 10 seconds. + +```yaml + warn: $this > ( 5 * $update_every) + crit: $this > (10 * $update_every) +``` + +If these result in non-zero or true, they trigger the alarm. + +- `$this` refers to the value of this alarm (i.e. the result of the `calc` line). + We could also use `$apache_last_collected_secs`. + +`$update_every` is the update frequency of the chart, in seconds. + +So, the warning condition checks if we have not collected data from apache for 5 +iterations and the critical condition checks for 10 iterations. 
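The arithmetic of Example 1 can be traced with a few lines of Python. This is a minimal sketch with made-up timestamps; `last_collected_status` is a hypothetical helper, and it assumes that the critical condition takes precedence when both conditions are true.

```python
def last_collected_status(now: int, last_collected_t: int, update_every: int):
    """Mirror the calc/warn/crit lines of `apache_last_collected_secs`."""
    secs = now - last_collected_t       # calc: $now - $last_collected_t
    if secs > 10 * update_every:        # crit: $this > (10 * $update_every)
        return secs, "CRITICAL"
    if secs > 5 * update_every:         # warn: $this > ( 5 * $update_every)
        return secs, "WARNING"
    return secs, "CLEAR"


# Sample values: the chart updates every second and `now` is 1000.
fresh = last_collected_status(now=1000, last_collected_t=998, update_every=1)
stale = last_collected_status(now=1000, last_collected_t=994, update_every=1)
dead = last_collected_status(now=1000, last_collected_t=980, update_every=1)
```

With a one-second update frequency, a gap of 6 seconds exceeds 5 iterations and a gap of 20 seconds exceeds 10 iterations.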
+ +### Example 2 + +Check if any of the disks is critically low on disk space: + +```yaml +template: disk_full_percent + on: disk.space + calc: $used * 100 / ($avail + $used) + every: 1m + warn: $this > 80 + crit: $this > 95 + repeat: warning 120s critical 10s +``` + +`$used` and `$avail` are the `used` and `avail` chart dimensions as shown on the dashboard. + +So, the `calc` line finds the percentage of used space. `$this` resolves to this percentage. + +This is a repeating alarm: if the alarm becomes CRITICAL, it repeats the notifications every 10 seconds. It also +repeats notifications every 2 minutes if the alarm goes into WARNING mode. + +### Example 3 + +Predict if any disk will run out of space in the near future. + +We do this in 2 steps: + +Calculate the disk fill rate: + +```yaml + template: disk_fill_rate + on: disk.space + lookup: max -1s at -30m unaligned of avail + calc: ($this - $avail) / (30 * 60) + every: 15s +``` + +In the `calc` line: `$this` is the result of the `lookup` line (i.e. the free space 30 minutes +ago) and `$avail` is the current disk free space. So the `calc` line will either have a positive +number of GB/second if the disk is filling up, or a negative number of GB/second if the disk is +freeing up space. + +There are no `warn` or `crit` lines here. So, this template will just do the calculation and +nothing more. + +Predict the hours after which the disk will run out of space: + +```yaml + template: disk_full_after_hours + on: disk.space + calc: $avail / $disk_fill_rate / 3600 + every: 10s + warn: $this > 0 and $this < 48 + crit: $this > 0 and $this < 24 +``` + +The `calc` line estimates the time, in hours, until we run out of disk space. Of course, only +positive values are interesting for this check, so the warning and critical conditions check +for positive values below 48 and 24 hours respectively, meaning the disk is predicted to fill up within that time. 
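The two-step prediction above can be worked through numerically. The following Python sketch uses made-up sample figures (GB of free space) and hypothetical function names; it treats a zero fill rate as an infinite time to full, in line with the division-by-zero rule described under Expressions.

```python
def disk_fill_rate(avail_30m_ago: float, avail_now: float,
                   window_secs: int = 30 * 60) -> float:
    """calc: ($this - $avail) / (30 * 60), in GB per second."""
    return (avail_30m_ago - avail_now) / window_secs


def disk_full_after_hours(avail_now: float, fill_rate: float) -> float:
    """calc: $avail / $disk_fill_rate / 3600, in hours."""
    if fill_rate == 0:
        return float("inf")  # nothing is changing, so the disk never fills
    return avail_now / fill_rate / 3600


def status(hours: float) -> str:
    """warn: 0 < $this < 48, crit: 0 < $this < 24 (critical checked first)."""
    if 0 < hours < 24:
        return "CRITICAL"
    if 0 < hours < 48:
        return "WARNING"
    return "CLEAR"


# Filling up: 109 GB free 30 minutes ago, 100 GB free now.
rate = disk_fill_rate(109.0, 100.0)
hours = disk_full_after_hours(100.0, rate)

# Freeing up: 90 GB free 30 minutes ago, 100 GB free now.
negative_hours = disk_full_after_hours(100.0, disk_fill_rate(90.0, 100.0))
```

Losing 9 GB in 30 minutes gives a rate of 0.005 GB/second, so 100 GB of free space lasts about 5.6 hours. When the disk is freeing up space, the estimate is negative and neither condition matches.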
+ +Once this alarm triggers we will receive an email like this: + +![image](https://cloud.githubusercontent.com/assets/2662304/17839993/87872b32-6802-11e6-8e08-b2e4afef93bb.png) + +### Example 4 + +Check if any network interface is dropping packets: + +```yaml +template: 30min_packet_drops + on: net.drops + lookup: sum -30m unaligned absolute + every: 10s + crit: $this > 0 +``` + +The `lookup` line will calculate the sum of all dropped packets in the last 30 minutes. + +The `crit` line will issue a critical alarm if even a single packet has been dropped. + +Note that the drops chart does not exist if a network interface has never dropped a single packet. +When Netdata detects a dropped packet, it will add the chart and it will automatically attach this +alarm to it. + +### Example 5 + +Check if the user or system dimension is using more than 50% of cpu: + +```yaml + alarm: dim_template + on: system.cpu + os: linux +lookup: average -3s percentage foreach system,user + units: % + every: 10s + warn: $this > 50 + crit: $this > 80 +``` + +The `lookup` line will calculate the average CPU usage from system and user in the last 3 seconds. Because we have +the foreach in the `lookup` line, Netdata will create two independent alarms called `dim_template_system` +and `dim_template_user` that will have all the other parameters shared among them. + +### Example 6 + +Check if all dimensions are using more than 50% of cpu: + +```yaml + alarm: dim_template + on: system.cpu + os: linux +lookup: average -3s percentage foreach * + units: % + every: 10s + warn: $this > 50 + crit: $this > 80 +``` + +The `lookup` line will calculate the average CPU usage of each dimension in the last 3 seconds. In this case +Netdata will create alarms for all dimensions of the chart. 
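The naming of the alarms created by `foreach` in Examples 5 and 6 can be sketched as follows. This is an illustration only: `expand_foreach` is a hypothetical helper, the dimension list is a made-up sample for `system.cpu`, and the sketch handles just a comma-separated list or `*`, whereas Netdata accepts richer simple patterns.

```python
def expand_foreach(alarm_name: str, chart_dimensions: list, foreach: str) -> list:
    """Return the names of the per-dimension alarms for a `foreach` clause."""
    if foreach.strip() == "*":
        # Example 6: one alarm per dimension of the chart.
        selected = list(chart_dimensions)
    else:
        # Example 5: only the listed dimensions that exist on the chart.
        wanted = [name.strip() for name in foreach.split(",")]
        selected = [name for name in wanted if name in chart_dimensions]
    return [f"{alarm_name}_{dimension}" for dimension in selected]


# Sample dimensions; a real `system.cpu` chart may expose a different set.
cpu_dimensions = ["user", "system", "nice", "iowait", "idle"]

two_alarms = expand_foreach("dim_template", cpu_dimensions, "system,user")
all_alarms = expand_foreach("dim_template", cpu_dimensions, "*")
```

Each generated alarm shares the remaining lines (`units`, `every`, `warn`, `crit`) of the original definition, while `$this` is computed from its own dimension.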
+ +## Troubleshooting + +You can compile Netdata with [debugging](/daemon/README.md#debugging) and then set in `netdata.conf`: + +```yaml +[global] + debug flags = 0x0000000000800000 +``` + +Then check your `/var/log/netdata/debug.log`. It will show you how it works. Important: this will generate a lot of +output in debug.log. + +You can find the context of charts by looking up the chart in either `http://NODE:19999/netdata.conf` or +`http://NODE:19999/api/v1/charts`, replacing `NODE` with the IP address or hostname for your Agent dashboard. + +You can find how Netdata interpreted the expressions by examining the alarm at +`http://NODE:19999/api/v1/alarms?all`. For each expression, Netdata will return the expression as given in its +config file, and the same expression with additional parentheses added to indicate the evaluation flow of the +expression. + +## Disabling health checks or silencing notifications at runtime + +It's currently not possible to schedule notifications from within the alarm template. For those scenarios where you need +to temporarily disable notifications (for instance, when running backups triggers a disk alert), you can disable or silence +notifications at runtime. The health checks can be controlled at runtime via the [health management +api](/web/api/health/README.md). 
+ +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Freference%2F&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/health/health.c b/health/health.c new file mode 100644 index 0000000..b81361e --- /dev/null +++ b/health/health.c @@ -0,0 +1,1047 @@ +// SPDX-License-Identifier: GPL-3.0-or-later + +#include "health.h" + +unsigned int default_health_enabled = 1; +char *silencers_filename; + +// the queue of executed alarm notifications that haven't been waited for yet +static struct { + ALARM_ENTRY *head; // oldest + ALARM_ENTRY *tail; // latest +} alarm_notifications_in_progress = {NULL, NULL}; + +static inline void enqueue_alarm_notify_in_progress(ALARM_ENTRY *ae) +{ + ae->prev_in_progress = NULL; + ae->next_in_progress = NULL; + + if (NULL != alarm_notifications_in_progress.tail) { + ae->prev_in_progress = alarm_notifications_in_progress.tail; + alarm_notifications_in_progress.tail->next_in_progress = ae; + } + if (NULL == alarm_notifications_in_progress.head) { + alarm_notifications_in_progress.head = ae; + } + alarm_notifications_in_progress.tail = ae; + +} + +static inline void unlink_alarm_notify_in_progress(ALARM_ENTRY *ae) +{ + struct alarm_entry *prev = ae->prev_in_progress; + struct alarm_entry *next = ae->next_in_progress; + + if (NULL != prev) { + prev->next_in_progress = next; + } + if (NULL != next) { + next->prev_in_progress = prev; + } + if (ae == alarm_notifications_in_progress.head) { + alarm_notifications_in_progress.head = next; + } + if (ae == alarm_notifications_in_progress.tail) { + alarm_notifications_in_progress.tail = prev; + } +} +// ---------------------------------------------------------------------------- +// health initialization + +/** + * User Config directory + * + * Get the config directory for health and return it. 
+ * + * @return a pointer to the user config directory + */ +inline char *health_user_config_dir(void) { + char buffer[FILENAME_MAX + 1]; + snprintfz(buffer, FILENAME_MAX, "%s/health.d", netdata_configured_user_config_dir); + return config_get(CONFIG_SECTION_HEALTH, "health configuration directory", buffer); +} + +/** + * Stock Config Directory + * + * Get the Stock config directory and return it. + * + * @return a pointer to the stock config directory. + */ +inline char *health_stock_config_dir(void) { + char buffer[FILENAME_MAX + 1]; + snprintfz(buffer, FILENAME_MAX, "%s/health.d", netdata_configured_stock_config_dir); + return config_get(CONFIG_SECTION_HEALTH, "stock health configuration directory", buffer); +} + +/** + * Silencers init + * + * Function used to initialize the silencer structure. + */ +static void health_silencers_init(void) { + FILE *fd = fopen(silencers_filename, "r"); + if (fd) { + fseek(fd, 0 , SEEK_END); + off_t length = (off_t) ftell(fd); + fseek(fd, 0 , SEEK_SET); + + if (length > 0 && length < HEALTH_SILENCERS_MAX_FILE_LEN) { + char *str = mallocz((length+1)* sizeof(char)); + if(str) { + size_t copied; + copied = fread(str, sizeof(char), length, fd); + if (copied == (length* sizeof(char))) { + str[length] = 0x00; + json_parse(str, NULL, health_silencers_json_read_callback); + info("Parsed health silencers file %s", silencers_filename); + } else { + error("Cannot read the data from health silencers file %s", silencers_filename); + } + freez(str); + } + } else { + error("Health silencers file %s has the size %ld that is out of range[ 1 , %d ]. Aborting read.", silencers_filename, length, HEALTH_SILENCERS_MAX_FILE_LEN); + } + fclose(fd); + } else { + info("Cannot open the file %s, so Netdata will work with the default health configuration.",silencers_filename); + } +} + +/** + * Health Init + * + * Initialize the health thread. 
+ */ +void health_init(void) { + debug(D_HEALTH, "Health configuration initializing"); + + if(!(default_health_enabled = (unsigned int)config_get_boolean(CONFIG_SECTION_HEALTH, "enabled", default_health_enabled))) { + debug(D_HEALTH, "Health is disabled."); + return; + } + + health_silencers_init(); +} + +// ---------------------------------------------------------------------------- +// re-load health configuration + +/** + * Reload host + * + * Reload configuration for a specific host. + * + * @param host the structure of the host that the function will reload the configuration. + */ +static void health_reload_host(RRDHOST *host) { + if(unlikely(!host->health_enabled)) + return; + + char *user_path = health_user_config_dir(); + char *stock_path = health_stock_config_dir(); + + // free all running alarms + rrdhost_wrlock(host); + + while(host->templates) + rrdcalctemplate_unlink_and_free(host, host->templates); + + RRDCALCTEMPLATE *rt,*next; + for(rt = host->alarms_template_with_foreach; rt ; rt = next) { + next = rt->next; + rrdcalctemplate_free(rt); + } + host->alarms_template_with_foreach = NULL; + + while(host->alarms) + rrdcalc_unlink_and_free(host, host->alarms); + + RRDCALC *rc,*nc; + for(rc = host->alarms_with_foreach; rc ; rc = nc) { + nc = rc->next; + rrdcalc_free(rc); + } + host->alarms_with_foreach = NULL; + + rrdhost_unlock(host); + + // invalidate all previous entries in the alarm log + ALARM_ENTRY *t; + for(t = host->health_log.alarms ; t ; t = t->next) { + if(t->new_status != RRDCALC_STATUS_REMOVED) + t->flags |= HEALTH_ENTRY_FLAG_UPDATED; + } + + rrdhost_rdlock(host); + // reset all thresholds to all charts + RRDSET *st; + rrdset_foreach_read(st, host) { + st->green = NAN; + st->red = NAN; + } + rrdhost_unlock(host); + + // load the new alarms + rrdhost_wrlock(host); + health_readdir(host, user_path, stock_path, NULL); + + //Discard alarms with labels that do not apply to host + rrdcalc_labels_unlink_alarm_from_host(host); + + // link the loaded 
alarms to their charts + RRDDIM *rd; + rrdset_foreach_write(st, host) { + if (rrdset_flag_check(st, RRDSET_FLAG_ARCHIVED)) + continue; + rrdsetcalc_link_matching(st); + rrdcalctemplate_link_matching(st); + + //This loop must be the last, because ` rrdcalctemplate_link_matching` will create alarms related to it. + rrdset_rdlock(st); + rrddim_foreach_read(rd, st) { + rrdcalc_link_to_rrddim(rd, st, host); + } + rrdset_unlock(st); + } + + rrdhost_unlock(host); +} + +/** + * Reload + * + * Reload the host configuration for all hosts. + */ +void health_reload(void) { +#ifdef ENABLE_ACLK + if (netdata_cloud_setting) + aclk_single_update_disable(); +#endif + rrd_rdlock(); + + RRDHOST *host; + rrdhost_foreach_read(host) + health_reload_host(host); + + rrd_unlock(); +#ifdef ENABLE_ACLK + if (netdata_cloud_setting) { + aclk_single_update_enable(); + aclk_alarm_reload(); + } +#endif +} + +// ---------------------------------------------------------------------------- +// health main thread and friends + +static inline RRDCALC_STATUS rrdcalc_value2status(calculated_number n) { + if(isnan(n) || isinf(n)) return RRDCALC_STATUS_UNDEFINED; + if(n) return RRDCALC_STATUS_RAISED; + return RRDCALC_STATUS_CLEAR; +} + +#define ALARM_EXEC_COMMAND_LENGTH 8192 + +static inline void health_alarm_execute(RRDHOST *host, ALARM_ENTRY *ae) { + ae->flags |= HEALTH_ENTRY_FLAG_PROCESSED; + + if(unlikely(ae->new_status < RRDCALC_STATUS_CLEAR)) { + // do not send notifications for internal statuses + debug(D_HEALTH, "Health not sending notification for alarm '%s.%s' status %s (internal statuses)", ae->chart, ae->name, rrdcalc_status2string(ae->new_status)); + goto done; + } + + if(unlikely(ae->new_status <= RRDCALC_STATUS_CLEAR && (ae->flags & HEALTH_ENTRY_FLAG_NO_CLEAR_NOTIFICATION))) { + // do not send notifications for disabled statuses + debug(D_HEALTH, "Health not sending notification for alarm '%s.%s' status %s (it has no-clear-notification enabled)", ae->chart, ae->name, 
rrdcalc_status2string(ae->new_status)); + // mark it as run, so that we will send the same alarm if it happens again + goto done; + } + + // find the previous notification for the same alarm + // which we have run the exec script + // exception: alarms with HEALTH_ENTRY_FLAG_NO_CLEAR_NOTIFICATION set + if(likely(!(ae->flags & HEALTH_ENTRY_FLAG_NO_CLEAR_NOTIFICATION))) { + uint32_t id = ae->alarm_id; + ALARM_ENTRY *t; + for(t = ae->next; t ; t = t->next) { + if(t->alarm_id == id && t->flags & HEALTH_ENTRY_FLAG_EXEC_RUN) + break; + } + + if(likely(t)) { + // we have executed this alarm notification in the past + if(t && t->new_status == ae->new_status) { + // don't send the notification for the same status again + debug(D_HEALTH, "Health not sending again notification for alarm '%s.%s' status %s", ae->chart, ae->name + , rrdcalc_status2string(ae->new_status)); + goto done; + } + } + else { + // we have not executed this alarm notification in the past + // so, don't send CLEAR notifications + if(unlikely(ae->new_status == RRDCALC_STATUS_CLEAR)) { + if((!(ae->flags & HEALTH_ENTRY_RUN_ONCE)) || (ae->flags & HEALTH_ENTRY_RUN_ONCE && ae->old_status < RRDCALC_STATUS_RAISED) ) { + debug(D_HEALTH, "Health not sending notification for first initialization of alarm '%s.%s' status %s" + , ae->chart, ae->name, rrdcalc_status2string(ae->new_status)); + goto done; + } + } + } + } + + // Check if alarm notifications are silenced + if (ae->flags & HEALTH_ENTRY_FLAG_SILENCED) { + info("Health not sending notification for alarm '%s.%s' status %s (command API has disabled notifications)", ae->chart, ae->name, rrdcalc_status2string(ae->new_status)); + goto done; + } + + static char command_to_run[ALARM_EXEC_COMMAND_LENGTH + 1]; + + const char *exec = (ae->exec) ? ae->exec : host->health_default_exec; + const char *recipient = (ae->recipient) ? 
ae->recipient : host->health_default_recipient; + + int n_warn=0, n_crit=0; + RRDCALC *rc; + EVAL_EXPRESSION *expr=NULL; + + for(rc = host->alarms; rc ; rc = rc->next) { + if(unlikely(!rc->rrdset || !rc->rrdset->last_collected_time.tv_sec)) + continue; + + if(unlikely(rc->status == RRDCALC_STATUS_WARNING)) { + n_warn++; + if (ae->alarm_id == rc->id) + expr=rc->warning; + } else if (unlikely(rc->status == RRDCALC_STATUS_CRITICAL)) { + n_crit++; + if (ae->alarm_id == rc->id) + expr=rc->critical; + } else if (unlikely(rc->status == RRDCALC_STATUS_CLEAR)) { + if (ae->alarm_id == rc->id) + expr=rc->warning; + } + } + + snprintfz(command_to_run, ALARM_EXEC_COMMAND_LENGTH, "exec %s '%s' '%s' '%u' '%u' '%u' '%lu' '%s' '%s' '%s' '%s' '%s' '" CALCULATED_NUMBER_FORMAT_ZERO "' '" CALCULATED_NUMBER_FORMAT_ZERO "' '%s' '%u' '%u' '%s' '%s' '%s' '%s' '%s' '%s' '%d' '%d'", + exec, + recipient, + host->registry_hostname, + ae->unique_id, + ae->alarm_id, + ae->alarm_event_id, + (unsigned long)ae->when, + ae->name, + ae->chart?ae->chart:"NOCHART", + ae->family?ae->family:"NOFAMILY", + rrdcalc_status2string(ae->new_status), + rrdcalc_status2string(ae->old_status), + ae->new_value, + ae->old_value, + ae->source?ae->source:"UNKNOWN", + (uint32_t)ae->duration, + (uint32_t)ae->non_clear_duration, + ae->units?ae->units:"", + ae->info?ae->info:"", + ae->new_value_string, + ae->old_value_string, + (expr && expr->source)?expr->source:"NOSOURCE", + (expr && expr->error_msg)?buffer_tostring(expr->error_msg):"NOERRMSG", + n_warn, + n_crit + ); + + ae->flags |= HEALTH_ENTRY_FLAG_EXEC_RUN; + ae->exec_run_timestamp = now_realtime_sec(); /* will be updated by real time after spawning */ + + debug(D_HEALTH, "executing command '%s'", command_to_run); + ae->flags |= HEALTH_ENTRY_FLAG_EXEC_IN_PROGRESS; + ae->exec_spawn_serial = spawn_enq_cmd(command_to_run); + enqueue_alarm_notify_in_progress(ae); + + return; //health_alarm_wait_for_execution +done: + health_alarm_log_save(host, ae); +} + +static inline 
void health_alarm_wait_for_execution(ALARM_ENTRY *ae) { + if (!(ae->flags & HEALTH_ENTRY_FLAG_EXEC_IN_PROGRESS)) + return; + + spawn_wait_cmd(ae->exec_spawn_serial, &ae->exec_code, &ae->exec_run_timestamp); + debug(D_HEALTH, "done executing command - returned with code %d", ae->exec_code); + ae->flags &= ~HEALTH_ENTRY_FLAG_EXEC_IN_PROGRESS; + + if(ae->exec_code != 0) + ae->flags |= HEALTH_ENTRY_FLAG_EXEC_FAILED; + + unlink_alarm_notify_in_progress(ae); +} + +static inline void health_process_notifications(RRDHOST *host, ALARM_ENTRY *ae) { + debug(D_HEALTH, "Health alarm '%s.%s' = " CALCULATED_NUMBER_FORMAT_AUTO " - changed status from %s to %s", + ae->chart?ae->chart:"NOCHART", ae->name, + ae->new_value, + rrdcalc_status2string(ae->old_status), + rrdcalc_status2string(ae->new_status) + ); + + health_alarm_execute(host, ae); +} + +static inline void health_alarm_log_process(RRDHOST *host) { + uint32_t first_waiting = (host->health_log.alarms)?host->health_log.alarms->unique_id:0; + time_t now = now_realtime_sec(); + + netdata_rwlock_rdlock(&host->health_log.alarm_log_rwlock); + + ALARM_ENTRY *ae; + for(ae = host->health_log.alarms; ae && ae->unique_id >= host->health_last_processed_id; ae = ae->next) { + if(likely(!alarm_entry_isrepeating(host, ae))) { + if(unlikely( + !(ae->flags & HEALTH_ENTRY_FLAG_PROCESSED) && + !(ae->flags & HEALTH_ENTRY_FLAG_UPDATED) + )) { + if(unlikely(ae->unique_id < first_waiting)) + first_waiting = ae->unique_id; + + if(likely(now >= ae->delay_up_to_timestamp)) + health_process_notifications(host, ae); + } + } + } + + // remember this for the next iteration + host->health_last_processed_id = first_waiting; + + netdata_rwlock_unlock(&host->health_log.alarm_log_rwlock); + + if(host->health_log.count <= host->health_log.max) + return; + + // cleanup excess entries in the log + netdata_rwlock_wrlock(&host->health_log.alarm_log_rwlock); + + ALARM_ENTRY *last = NULL; + unsigned int count = host->health_log.max * 2 / 3; + for(ae = 
host->health_log.alarms; ae && count ; count--, last = ae, ae = ae->next) ; + + if(ae && last && last->next == ae) + last->next = NULL; + else + ae = NULL; + + while(ae) { + debug(D_HEALTH, "Health removing alarm log entry with id: %u", ae->unique_id); + + ALARM_ENTRY *t = ae->next; + + if(likely(!alarm_entry_isrepeating(host, ae))) { + health_alarm_wait_for_execution(ae); + health_alarm_log_free_one_nochecks_nounlink(ae); + host->health_log.count--; + } + + ae = t; + } + + netdata_rwlock_unlock(&host->health_log.alarm_log_rwlock); +} + +static inline int rrdcalc_isrunnable(RRDCALC *rc, time_t now, time_t *next_run) { + if(unlikely(!rc->rrdset)) { + debug(D_HEALTH, "Health not running alarm '%s.%s'. It is not linked to a chart.", rc->chart?rc->chart:"NOCHART", rc->name); + return 0; + } + + if(unlikely(rc->next_update > now)) { + if (unlikely(*next_run > rc->next_update)) { + // update the next_run time of the main loop + // to run this alarm precisely the time required + *next_run = rc->next_update; + } + + debug(D_HEALTH, "Health not examining alarm '%s.%s' yet (will do in %d secs).", rc->chart?rc->chart:"NOCHART", rc->name, (int) (rc->next_update - now)); + return 0; + } + + if(unlikely(!rc->update_every)) { + debug(D_HEALTH, "Health not running alarm '%s.%s'. It does not have an update frequency", rc->chart?rc->chart:"NOCHART", rc->name); + return 0; + } + + if(unlikely(rrdset_flag_check(rc->rrdset, RRDSET_FLAG_OBSOLETE))) { + debug(D_HEALTH, "Health not running alarm '%s.%s'. The chart has been marked as obsolete", rc->chart?rc->chart:"NOCHART", rc->name); + return 0; + } + + if(unlikely(!rrdset_flag_check(rc->rrdset, RRDSET_FLAG_ENABLED))) { + debug(D_HEALTH, "Health not running alarm '%s.%s'. The chart is not enabled", rc->chart?rc->chart:"NOCHART", rc->name); + return 0; + } + + if(unlikely(rrdset_flag_check(rc->rrdset, RRDSET_FLAG_ARCHIVED))) { + debug(D_HEALTH, "Health not running alarm '%s.%s'. 
The chart has been marked as archived", rc->chart?rc->chart:"NOCHART", rc->name); + return 0; + } + + if(unlikely(!rc->rrdset->last_collected_time.tv_sec || rc->rrdset->counter_done < 2)) { + debug(D_HEALTH, "Health not running alarm '%s.%s'. Chart is not fully collected yet.", rc->chart?rc->chart:"NOCHART", rc->name); + return 0; + } + + int update_every = rc->rrdset->update_every; + rrdset_rdlock(rc->rrdset); + time_t first = rrdset_first_entry_t_nolock(rc->rrdset); + time_t last = rrdset_last_entry_t_nolock(rc->rrdset); + rrdset_unlock(rc->rrdset); + + if(unlikely(now + update_every < first /* || now - update_every > last */)) { + debug(D_HEALTH + , "Health not examining alarm '%s.%s' yet (wanted time is out of bounds - we need %lu but got %lu - %lu)." + , rc->chart ? rc->chart : "NOCHART", rc->name, (unsigned long) now, (unsigned long) first + , (unsigned long) last); + return 0; + } + + if(RRDCALC_HAS_DB_LOOKUP(rc)) { + time_t needed = now + rc->before + rc->after; + + if(needed + update_every < first || needed - update_every > last) { + debug(D_HEALTH + , "Health not examining alarm '%s.%s' yet (not enough data yet - we need %lu but got %lu - %lu)." + , rc->chart ? 
rc->chart : "NOCHART", rc->name, (unsigned long) needed, (unsigned long) first + , (unsigned long) last); + return 0; + } + } + + return 1; +} + +static inline int check_if_resumed_from_suspention(void) { + static usec_t last_realtime = 0, last_monotonic = 0; + usec_t realtime = now_realtime_usec(), monotonic = now_monotonic_usec(); + int ret = 0; + + // detect if monotonic and realtime have twice the difference + // in which case we assume the system was just waken from hibernation + + if(last_realtime && last_monotonic && realtime - last_realtime > 2 * (monotonic - last_monotonic)) + ret = 1; + + last_realtime = realtime; + last_monotonic = monotonic; + + return ret; +} + +static void health_main_cleanup(void *ptr) { + struct netdata_static_thread *static_thread = (struct netdata_static_thread *)ptr; + static_thread->enabled = NETDATA_MAIN_THREAD_EXITING; + + info("cleaning up..."); + + static_thread->enabled = NETDATA_MAIN_THREAD_EXITED; +} + +static SILENCE_TYPE check_silenced(RRDCALC *rc, char* host, SILENCERS *silencers) { + SILENCER *s; + debug(D_HEALTH, "Checking if alarm was silenced via the command API. 
Alarm info name:%s context:%s chart:%s host:%s family:%s", + rc->name, (rc->rrdset)?rc->rrdset->context:"", rc->chart, host, (rc->rrdset)?rc->rrdset->family:""); + + for (s = silencers->silencers; s!=NULL; s=s->next){ + if ( + (!s->alarms_pattern || (rc->name && s->alarms_pattern && simple_pattern_matches(s->alarms_pattern,rc->name))) && + (!s->contexts_pattern || (rc->rrdset && rc->rrdset->context && s->contexts_pattern && simple_pattern_matches(s->contexts_pattern,rc->rrdset->context))) && + (!s->hosts_pattern || (host && s->hosts_pattern && simple_pattern_matches(s->hosts_pattern,host))) && + (!s->charts_pattern || (rc->chart && s->charts_pattern && simple_pattern_matches(s->charts_pattern,rc->chart))) && + (!s->families_pattern || (rc->rrdset && rc->rrdset->family && s->families_pattern && simple_pattern_matches(s->families_pattern,rc->rrdset->family))) + ) { + debug(D_HEALTH, "Alarm matches command API silence entry %s:%s:%s:%s:%s", s->alarms,s->charts, s->contexts, s->hosts, s->families); + if (unlikely(silencers->stype == STYPE_NONE)) { + debug(D_HEALTH, "Alarm %s matched a silence entry, but no SILENCE or DISABLE command was issued via the command API. The match has no effect.", rc->name); + } else { + debug(D_HEALTH, "Alarm %s via the command API - name:%s context:%s chart:%s host:%s family:%s" + , (silencers->stype == STYPE_DISABLE_ALARMS)?"Disabled":"Silenced" + , rc->name + , (rc->rrdset)?rc->rrdset->context:"" + , rc->chart + , host + , (rc->rrdset)?rc->rrdset->family:"" + ); + } + return silencers->stype; + } + } + return STYPE_NONE; +} + +/** + * Update Disabled Silenced + * + * Update the variable rrdcalc_flags of the structure RRDCALC according with the values of the host structure + * + * @param host structure that contains information about the host monitored. 
+ * @param rc structure with information about the alarm + * + * @return Returns 1 if RRDCALC_FLAG_DISABLED is set in rrdcalc_flags, 0 otherwise + */ +static int update_disabled_silenced(RRDHOST *host, RRDCALC *rc) { + uint32_t rrdcalc_flags_old = rc->rrdcalc_flags; + // Clear the flags + rc->rrdcalc_flags &= ~(RRDCALC_FLAG_DISABLED | RRDCALC_FLAG_SILENCED); + if (unlikely(silencers->all_alarms)) { + if (silencers->stype == STYPE_DISABLE_ALARMS) rc->rrdcalc_flags |= RRDCALC_FLAG_DISABLED; + else if (silencers->stype == STYPE_SILENCE_NOTIFICATIONS) rc->rrdcalc_flags |= RRDCALC_FLAG_SILENCED; + } else { + SILENCE_TYPE st = check_silenced(rc, host->hostname, silencers); + if (st == STYPE_DISABLE_ALARMS) rc->rrdcalc_flags |= RRDCALC_FLAG_DISABLED; + else if (st == STYPE_SILENCE_NOTIFICATIONS) rc->rrdcalc_flags |= RRDCALC_FLAG_SILENCED; + } + + if (rrdcalc_flags_old != rc->rrdcalc_flags) { + info("Alarm silencing changed for host '%s' alarm '%s': Disabled %s->%s Silenced %s->%s", + host->hostname, + rc->name, + (rrdcalc_flags_old & RRDCALC_FLAG_DISABLED)?"true":"false", + (rc->rrdcalc_flags & RRDCALC_FLAG_DISABLED)?"true":"false", + (rrdcalc_flags_old & RRDCALC_FLAG_SILENCED)?"true":"false", + (rc->rrdcalc_flags & RRDCALC_FLAG_SILENCED)?"true":"false" + ); + } + if (rc->rrdcalc_flags & RRDCALC_FLAG_DISABLED) + return 1; + else + return 0; +} + +/** + * Health Main + * + * The main thread of the health system. In this function all the alarms will be processed. + * + * @param ptr is a pointer to the netdata_static_thread structure. 
+ * + * @return It always returns NULL + */ +void *health_main(void *ptr) { + netdata_thread_cleanup_push(health_main_cleanup, ptr); + + int min_run_every = (int)config_get_number(CONFIG_SECTION_HEALTH, "run at least every seconds", 10); + if(min_run_every < 1) min_run_every = 1; + + time_t now = now_realtime_sec(); + time_t hibernation_delay = config_get_number(CONFIG_SECTION_HEALTH, "postpone alarms during hibernation for seconds", 60); + + rrdcalc_labels_unlink(); + + unsigned int loop = 0; + while(!netdata_exit) { + loop++; + debug(D_HEALTH, "Health monitoring iteration no %u started", loop); + + int runnable = 0, apply_hibernation_delay = 0; + time_t next_run = now + min_run_every; + RRDCALC *rc; + + if (unlikely(check_if_resumed_from_suspention())) { + apply_hibernation_delay = 1; + + info("Postponing alarm checks for %ld seconds, because it seems that the system was just resumed from suspension.", + hibernation_delay + ); + } + + if (unlikely(silencers->all_alarms && silencers->stype == STYPE_DISABLE_ALARMS)) { + static int logged=0; + if (!logged) { + info("Skipping health checks, because all alarms are disabled via a %s command.", + HEALTH_CMDAPI_CMD_DISABLEALL); + logged = 1; + } + } + + rrd_rdlock(); + + RRDHOST *host; + rrdhost_foreach_read(host) { + if (unlikely(!host->health_enabled)) + continue; + + if (unlikely(apply_hibernation_delay)) { + + info("Postponing health checks for %ld seconds, on host '%s'.", hibernation_delay, host->hostname + ); + + host->health_delay_up_to = now + hibernation_delay; + } + + if (unlikely(host->health_delay_up_to)) { + if (unlikely(now < host->health_delay_up_to)) + continue; + + info("Resuming health checks on host '%s'.", host->hostname); + host->health_delay_up_to = 0; + } + + rrdhost_rdlock(host); + + // the first loop is to lookup values from the db + for (rc = host->alarms; rc; rc = rc->next) { + + if (update_disabled_silenced(host, rc)) + continue; + + if (unlikely(!rrdcalc_isrunnable(rc, now, &next_run))) { + 
if (unlikely(rc->rrdcalc_flags & RRDCALC_FLAG_RUNNABLE)) + rc->rrdcalc_flags &= ~RRDCALC_FLAG_RUNNABLE; + continue; + } + + runnable++; + rc->old_value = rc->value; + rc->rrdcalc_flags |= RRDCALC_FLAG_RUNNABLE; + + // ------------------------------------------------------------ + // if there is database lookup, do it + + if (unlikely(RRDCALC_HAS_DB_LOOKUP(rc))) { + /* time_t old_db_timestamp = rc->db_before; */ + int value_is_null = 0; + + int ret = rrdset2value_api_v1(rc->rrdset, NULL, &rc->value, rc->dimensions, 1, rc->after, + rc->before, rc->group, 0, rc->options, &rc->db_after, + &rc->db_before, &value_is_null + ); + + if (unlikely(ret != 200)) { + // database lookup failed + rc->value = NAN; + rc->rrdcalc_flags |= RRDCALC_FLAG_DB_ERROR; + + debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': database lookup returned error %d", + host->hostname, rc->chart ? rc->chart : "NOCHART", rc->name, ret + ); + } else + rc->rrdcalc_flags &= ~RRDCALC_FLAG_DB_ERROR; + + /* - RRDCALC_FLAG_DB_STALE not currently used + if (unlikely(old_db_timestamp == rc->db_before)) { + // database is stale + + debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': database is stale", host->hostname, rc->chart?rc->chart:"NOCHART", rc->name); + + if (unlikely(!(rc->rrdcalc_flags & RRDCALC_FLAG_DB_STALE))) { + rc->rrdcalc_flags |= RRDCALC_FLAG_DB_STALE; + error("Health on host '%s', alarm '%s.%s': database is stale", host->hostname, rc->chart?rc->chart:"NOCHART", rc->name); + } + } + else if (unlikely(rc->rrdcalc_flags & RRDCALC_FLAG_DB_STALE)) + rc->rrdcalc_flags &= ~RRDCALC_FLAG_DB_STALE; + */ + + if (unlikely(value_is_null)) { + // collected value is null + rc->value = NAN; + rc->rrdcalc_flags |= RRDCALC_FLAG_DB_NAN; + + debug(D_HEALTH, + "Health on host '%s', alarm '%s.%s': database lookup returned empty value (possibly value is not collected yet)", + host->hostname, rc->chart ? 
rc->chart : "NOCHART", rc->name + ); + } else + rc->rrdcalc_flags &= ~RRDCALC_FLAG_DB_NAN; + + debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': database lookup gave value " + CALCULATED_NUMBER_FORMAT, host->hostname, rc->chart ? rc->chart : "NOCHART", rc->name, + rc->value + ); + } + + // ------------------------------------------------------------ + // if there is calculation expression, run it + + if (unlikely(rc->calculation)) { + if (unlikely(!expression_evaluate(rc->calculation))) { + // calculation failed + rc->value = NAN; + rc->rrdcalc_flags |= RRDCALC_FLAG_CALC_ERROR; + + debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': expression '%s' failed: %s", + host->hostname, rc->chart ? rc->chart : "NOCHART", rc->name, + rc->calculation->parsed_as, buffer_tostring(rc->calculation->error_msg) + ); + } else { + rc->rrdcalc_flags &= ~RRDCALC_FLAG_CALC_ERROR; + + debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': expression '%s' gave value " + CALCULATED_NUMBER_FORMAT + ": %s (source: %s)", host->hostname, rc->chart ? 
rc->chart : "NOCHART", rc->name, + rc->calculation->parsed_as, rc->calculation->result, + buffer_tostring(rc->calculation->error_msg), rc->source + ); + + rc->value = rc->calculation->result; + + if (rc->local) rc->local->last_updated = now; + if (rc->family) rc->family->last_updated = now; + if (rc->hostid) rc->hostid->last_updated = now; + if (rc->hostname) rc->hostname->last_updated = now; + } + } + } + + rrdhost_unlock(host); + + if (unlikely(runnable && !netdata_exit)) { + rrdhost_rdlock(host); + + for (rc = host->alarms; rc; rc = rc->next) { + if (unlikely(!(rc->rrdcalc_flags & RRDCALC_FLAG_RUNNABLE))) + continue; + + if (rc->rrdcalc_flags & RRDCALC_FLAG_DISABLED) { + continue; + } + RRDCALC_STATUS warning_status = RRDCALC_STATUS_UNDEFINED; + RRDCALC_STATUS critical_status = RRDCALC_STATUS_UNDEFINED; + + // -------------------------------------------------------- + // check the warning expression + + if (likely(rc->warning)) { + if (unlikely(!expression_evaluate(rc->warning))) { + // calculation failed + rc->rrdcalc_flags |= RRDCALC_FLAG_WARN_ERROR; + + debug(D_HEALTH, + "Health on host '%s', alarm '%s.%s': warning expression failed with error: %s", + host->hostname, rc->chart ? rc->chart : "NOCHART", rc->name, + buffer_tostring(rc->warning->error_msg) + ); + } else { + rc->rrdcalc_flags &= ~RRDCALC_FLAG_WARN_ERROR; + debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': warning expression gave value " + CALCULATED_NUMBER_FORMAT + ": %s (source: %s)", host->hostname, rc->chart ? 
rc->chart : "NOCHART", + rc->name, rc->warning->result, buffer_tostring(rc->warning->error_msg), rc->source + ); + warning_status = rrdcalc_value2status(rc->warning->result); + } + } + + // -------------------------------------------------------- + // check the critical expression + + if (likely(rc->critical)) { + if (unlikely(!expression_evaluate(rc->critical))) { + // calculation failed + rc->rrdcalc_flags |= RRDCALC_FLAG_CRIT_ERROR; + + debug(D_HEALTH, + "Health on host '%s', alarm '%s.%s': critical expression failed with error: %s", + host->hostname, rc->chart ? rc->chart : "NOCHART", rc->name, + buffer_tostring(rc->critical->error_msg) + ); + } else { + rc->rrdcalc_flags &= ~RRDCALC_FLAG_CRIT_ERROR; + debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': critical expression gave value " + CALCULATED_NUMBER_FORMAT + ": %s (source: %s)", host->hostname, rc->chart ? rc->chart : "NOCHART", + rc->name, rc->critical->result, buffer_tostring(rc->critical->error_msg), + rc->source + ); + critical_status = rrdcalc_value2status(rc->critical->result); + } + } + + // -------------------------------------------------------- + // decide the final alarm status + + RRDCALC_STATUS status = RRDCALC_STATUS_UNDEFINED; + + switch (warning_status) { + case RRDCALC_STATUS_CLEAR: + status = RRDCALC_STATUS_CLEAR; + break; + + case RRDCALC_STATUS_RAISED: + status = RRDCALC_STATUS_WARNING; + break; + + default: + break; + } + + switch (critical_status) { + case RRDCALC_STATUS_CLEAR: + if (status == RRDCALC_STATUS_UNDEFINED) + status = RRDCALC_STATUS_CLEAR; + break; + + case RRDCALC_STATUS_RAISED: + status = RRDCALC_STATUS_CRITICAL; + break; + + default: + break; + } + + // -------------------------------------------------------- + // check if the new status and the old differ + + if (status != rc->status) { + int delay = 0; + + // apply trigger hysteresis + + if (now > rc->delay_up_to_timestamp) { + rc->delay_up_current = rc->delay_up_duration; + rc->delay_down_current = 
rc->delay_down_duration; + rc->delay_last = 0; + rc->delay_up_to_timestamp = 0; + } else { + rc->delay_up_current = (int) (rc->delay_up_current * rc->delay_multiplier); + if (rc->delay_up_current > rc->delay_max_duration) + rc->delay_up_current = rc->delay_max_duration; + + rc->delay_down_current = (int) (rc->delay_down_current * rc->delay_multiplier); + if (rc->delay_down_current > rc->delay_max_duration) + rc->delay_down_current = rc->delay_max_duration; + } + + if (status > rc->status) + delay = rc->delay_up_current; + else + delay = rc->delay_down_current; + + // COMMENTED: because we do need to send raising alarms + // if(now + delay < rc->delay_up_to_timestamp) + // delay = (int)(rc->delay_up_to_timestamp - now); + + rc->delay_last = delay; + rc->delay_up_to_timestamp = now + delay; + + if(likely(!rrdcalc_isrepeating(rc))) { + ALARM_ENTRY *ae = health_create_alarm_entry( + host, rc->id, rc->next_event_id++, now, rc->name, rc->rrdset->id, + rc->rrdset->family, rc->exec, rc->recipient, now - rc->last_status_change, + rc->old_value, rc->value, rc->status, status, rc->source, rc->units, rc->info, + rc->delay_last, + ( + ((rc->options & RRDCALC_FLAG_NO_CLEAR_NOTIFICATION)? HEALTH_ENTRY_FLAG_NO_CLEAR_NOTIFICATION : 0) | + ((rc->rrdcalc_flags & RRDCALC_FLAG_SILENCED)? 
HEALTH_ENTRY_FLAG_SILENCED : 0) + ) + ); + health_alarm_log(host, ae); + } + rc->last_status_change = now; + rc->old_status = rc->status; + rc->status = status; + } + + rc->last_updated = now; + rc->next_update = now + rc->update_every; + + if (next_run > rc->next_update) + next_run = rc->next_update; + } + + // process repeating alarms + RRDCALC *rc; + for(rc = host->alarms; rc ; rc = rc->next) { + int repeat_every = 0; + if(unlikely(rrdcalc_isrepeating(rc))) { + if(unlikely(rc->status == RRDCALC_STATUS_WARNING)) { + rc->rrdcalc_flags &= ~RRDCALC_FLAG_RUN_ONCE; + repeat_every = rc->warn_repeat_every; + } else if(unlikely(rc->status == RRDCALC_STATUS_CRITICAL)) { + rc->rrdcalc_flags &= ~RRDCALC_FLAG_RUN_ONCE; + repeat_every = rc->crit_repeat_every; + } else if(unlikely(rc->status == RRDCALC_STATUS_CLEAR)) { + if(!(rc->rrdcalc_flags & RRDCALC_FLAG_RUN_ONCE)) { + if(rc->old_status == RRDCALC_STATUS_CRITICAL) { + repeat_every = rc->crit_repeat_every; + } else if (rc->old_status == RRDCALC_STATUS_WARNING) { + repeat_every = rc->warn_repeat_every; + } + } + } + } + + if(unlikely(repeat_every > 0 && (rc->last_repeat + repeat_every) <= now)) { + rc->last_repeat = now; + ALARM_ENTRY *ae = health_create_alarm_entry( + host, rc->id, rc->next_event_id++, now, rc->name, rc->rrdset->id, + rc->rrdset->family, rc->exec, rc->recipient, now - rc->last_status_change, + rc->old_value, rc->value, rc->old_status, rc->status, rc->source, rc->units, rc->info, + rc->delay_last, + ( + ((rc->options & RRDCALC_FLAG_NO_CLEAR_NOTIFICATION)? HEALTH_ENTRY_FLAG_NO_CLEAR_NOTIFICATION : 0) | + ((rc->rrdcalc_flags & RRDCALC_FLAG_SILENCED)? 
HEALTH_ENTRY_FLAG_SILENCED : 0) + ) + ); + ae->last_repeat = rc->last_repeat; + if (!(rc->rrdcalc_flags & RRDCALC_FLAG_RUN_ONCE) && rc->status == RRDCALC_STATUS_CLEAR) { + ae->flags |= HEALTH_ENTRY_RUN_ONCE; + } + rc->rrdcalc_flags |= RRDCALC_FLAG_RUN_ONCE; + health_process_notifications(host, ae); + debug(D_HEALTH, "Notification sent for the repeating alarm %u.", ae->alarm_id); + health_alarm_wait_for_execution(ae); + health_alarm_log_free_one_nochecks_nounlink(ae); + } + } + + rrdhost_unlock(host); + } + + if (unlikely(netdata_exit)) + break; + + // execute notifications + // and cleanup + health_alarm_log_process(host); + + if (unlikely(netdata_exit)) { + // wait for all notifications to finish before allowing health to be cleaned up + ALARM_ENTRY *ae; + while (NULL != (ae = alarm_notifications_in_progress.head)) { + health_alarm_wait_for_execution(ae); + } + break; + } + + } /* rrdhost_foreach */ + + // wait for all notifications to finish before allowing health to be cleaned up + ALARM_ENTRY *ae; + while (NULL != (ae = alarm_notifications_in_progress.head)) { + health_alarm_wait_for_execution(ae); + } + + rrd_unlock(); + + + if(unlikely(netdata_exit)) + break; + + now = now_realtime_sec(); + if(now < next_run) { + debug(D_HEALTH, "Health monitoring iteration no %u done. Next iteration in %d secs", loop, (int) (next_run - now)); + sleep_usec(USEC_PER_SEC * (usec_t) (next_run - now)); + now = now_realtime_sec(); + } + else + debug(D_HEALTH, "Health monitoring iteration no %u done. 
Next iteration now", loop); + + } // forever + + netdata_thread_cleanup_pop(1); + return NULL; +} diff --git a/health/health.d/adaptec_raid.conf b/health/health.d/adaptec_raid.conf new file mode 100644 index 0000000..a1301ce --- /dev/null +++ b/health/health.d/adaptec_raid.conf @@ -0,0 +1,24 @@ + +# logical device status check + +template: adapter_raid_ld_status + on: adapter_raid.ld_status + lookup: max -5s + units: bool + every: 10s + crit: $this > 0 + delay: down 5m multiplier 1.5 max 1h + info: at least 1 logical device is failed or degraded + to: sysadmin + +# physical device state check + +template: adapter_raid_pd_state + on: adapter_raid.pd_state + lookup: max -5s + units: bool + every: 10s + crit: $this > 0 + delay: down 5m multiplier 1.5 max 1h + info: at least 1 physical device is not in online state + to: sysadmin diff --git a/health/health.d/am2320.conf b/health/health.d/am2320.conf new file mode 100644 index 0000000..ddf8b70 --- /dev/null +++ b/health/health.d/am2320.conf @@ -0,0 +1,12 @@ +# make sure am2320 is sending stats + +template: am2320_last_collected_secs + on: am2320.temperature + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: webmaster
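The `warn` and `crit` lines of the am2320 template above make each threshold depend on the alarm's current `$status`, which gives the alarm hysteresis: it is raised only once data is more than 5 (or 60) update intervals old, but once raised it stays raised until the data is at most one interval old. A minimal C sketch of that logic follows. The enum, its numeric values, and the function names are illustrative assumptions and not netdata's internal API; only the relative ordering of the statuses matters here.

```c
#include <stdbool.h>

/* Hypothetical status ordering, for illustration only. */
enum status { STATUS_CLEAR = 1, STATUS_WARNING = 3, STATUS_CRITICAL = 4 };

/* Mirrors: warn: $this > (($status >= $WARNING) ? ($update_every) : (5 * $update_every)) */
static bool warn_raised(double secs_since_collection, enum status current, int update_every) {
    double threshold = (current >= STATUS_WARNING) ? (double) update_every
                                                   : 5.0 * update_every;
    return secs_since_collection > threshold;
}

/* Mirrors: crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) */
static bool crit_raised(double secs_since_collection, enum status current, int update_every) {
    double threshold = (current == STATUS_CRITICAL) ? (double) update_every
                                                    : 60.0 * update_every;
    return secs_since_collection > threshold;
}
```

With `update_every` set to 1, a 3-second gap does not raise a warning from a clear state, but it does keep an existing warning raised, so the alarm does not flap around a single threshold.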
\ No newline at end of file diff --git a/health/health.d/anomalies.conf b/health/health.d/anomalies.conf new file mode 100644 index 0000000..a2d248e --- /dev/null +++ b/health/health.d/anomalies.conf @@ -0,0 +1,17 @@ +# raise a warning alarm if an anomaly probability is consistently above 50% + +template: anomaly_probabilities + on: anomalies.probability + lookup: average -2m foreach * + every: 1m + warn: $this > 50 + info: average anomaly probability > 50% for last 2 minutes + +# raise a warning alarm if an anomaly flag is consistently firing + +template: anomaly_flags + on: anomalies.anomaly + lookup: sum -2m foreach * + every: 1m + warn: $this > 10 + info: count of anomalies > 10 for last 2 minutes diff --git a/health/health.d/apache.conf b/health/health.d/apache.conf new file mode 100644 index 0000000..0c98b87 --- /dev/null +++ b/health/health.d/apache.conf @@ -0,0 +1,14 @@ + +# make sure apache is running + +template: apache_last_collected_secs + on: apache.requests + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: webmaster + diff --git a/health/health.d/apcupsd.conf b/health/health.d/apcupsd.conf new file mode 100644 index 0000000..4f86037 --- /dev/null +++ b/health/health.d/apcupsd.conf @@ -0,0 +1,40 @@ +# you can disable an alarm notification by setting the 'to' line to: silent + +template: 10min_ups_load + on: apcupsd.load + os: * + hosts: * + lookup: average -10m unaligned of percentage + units: % + every: 1m + warn: $this > (($status >= $WARNING) ? (70) : (80)) + crit: $this > (($status == $CRITICAL) ? 
(85) : (95)) + delay: down 10m multiplier 1.5 max 1h + info: average UPS load for the last 10 minutes + to: sitemgr + +# Discussion in https://github.com/netdata/netdata/pull/3928: +# Fire the alarm as soon as it's going on battery (99% charge) and clear only when full. +template: ups_charge + on: apcupsd.charge + os: * + hosts: * + lookup: average -60s unaligned of charge + units: % + every: 60s + warn: $this < 100 + crit: $this < (($status == $CRITICAL) ? (60) : (50)) + delay: down 10m multiplier 1.5 max 1h + info: current UPS charge, averaged over the last 60 seconds to reduce measurement errors + to: sitemgr + +template: apcupsd_last_collected_secs + on: apcupsd.load + calc: $now - $last_collected_t + every: 10s + units: seconds ago + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: sitemgr diff --git a/health/health.d/apps_plugin.conf b/health/health.d/apps_plugin.conf new file mode 100644 index 0000000..9a27bc6 --- /dev/null +++ b/health/health.d/apps_plugin.conf @@ -0,0 +1,15 @@ +# you can disable an alarm notification by setting the 'to' line to: silent + +# disabled due to https://github.com/netdata/netdata/issues/10327 +# +# alarm: used_file_descriptors +# on: apps.files +# hosts: * +# calc: $fdperc +# units: % +# every: 5s +# warn: $this > (($status >= $WARNING) ? (75) : (80)) +# crit: $this > (($status == $CRITICAL) ? 
(85) : (90)) +# delay: down 5m multiplier 1.5 max 1h +# info: Peak percentage of file descriptors used +# to: sysadmin diff --git a/health/health.d/backend.conf b/health/health.d/backend.conf new file mode 100644 index 0000000..e51b8aa --- /dev/null +++ b/health/health.d/backend.conf @@ -0,0 +1,56 @@ +# Alert that backends subsystem will be disabled soon + alarm: backend_metrics_eol + on: netdata.backend_metrics + units: boolean + calc: $now - $last_collected_t + every: 1m + warn: $this > 0 + delay: down 5m multiplier 1.5 max 1h + info: The backends subsystem is deprecated and will be removed soon. Migrate your configuration to exporting.conf. + to: sysadmin + +# make sure we are sending data to backend + + alarm: backend_last_buffering + on: netdata.backend_metrics + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful buffering of backend data + to: dba + + alarm: backend_metrics_sent + on: netdata.backend_metrics + units: % + calc: abs($sent) * 100 / abs($buffered) + every: 10s + warn: $this != 100 + delay: down 5m multiplier 1.5 max 1h + info: percentage of metrics sent to the backend server + to: dba + + alarm: backend_metrics_lost + on: netdata.backend_metrics + units: metrics + calc: abs($lost) + every: 10s + crit: ($this != 0) || ($status == $CRITICAL && abs($sent) == 0) + delay: down 5m multiplier 1.5 max 1h + info: number of metrics lost due to repeating failures to contact the backend server + to: dba + + +# this chart has been removed from netdata +# alarm: backend_slow +# on: netdata.backend_latency +# units: % +# calc: $latency * 100 / ($update_every * 1000) +# every: 10s +# warn: $this > 50 +# crit: $this > 100 +# delay: down 5m multiplier 1.5 max 1h +# info: the percentage of time between 
iterations needed by the backend time to process the data sent by netdata +# to: dba diff --git a/health/health.d/bcache.conf b/health/health.d/bcache.conf new file mode 100644 index 0000000..f0da9ac --- /dev/null +++ b/health/health.d/bcache.conf @@ -0,0 +1,22 @@ + +template: bcache_cache_errors + on: disk.bcache_cache_read_races + lookup: sum -10m unaligned absolute + units: errors + every: 1m + warn: $this > 0 + crit: $this > ( ($status >= $CRITICAL) ? (0) : (10) ) + delay: down 1h multiplier 1.5 max 2h + info: the number of times bcache had issues using the cache, during the last 10 mins (this usually means your SSD cache is failing) + to: sysadmin + +template: bcache_cache_dirty + on: disk.bcache_cache_alloc + calc: $dirty + $metadata + $undefined + units: % + every: 1m + warn: $this > ( ($status >= $WARNING ) ? ( 70 ) : ( 90 ) ) + crit: $this > ( ($status >= $CRITICAL) ? ( 90 ) : ( 95 ) ) + delay: up 1m down 1h multiplier 1.5 max 2h + info: the percentage of cache space used for dirty and metadata (this usually means your SSD cache is too small) + to: sysadmin diff --git a/health/health.d/beanstalkd.conf b/health/health.d/beanstalkd.conf new file mode 100644 index 0000000..30dc273 --- /dev/null +++ b/health/health.d/beanstalkd.conf @@ -0,0 +1,36 @@ +# get the number of buried jobs in all queues + +template: server_buried_jobs + on: beanstalk.current_jobs + calc: $buried + units: jobs + every: 10s + warn: $this > 0 + crit: $this > 10 + delay: up 0 down 5m multiplier 1.2 max 1h + info: the number of buried jobs aggregated across all tubes + to: sysadmin + +# get the number of buried jobs per queue + +#template: tube_buried_jobs +# on: beanstalk.jobs +# calc: $buried +# units: jobs +# every: 10s +# warn: $this > 0 +# crit: $this > 10 +# delay: up 0 down 5m multiplier 1.2 max 1h +# info: the number of jobs buried per tube +# to: sysadmin + +# get the current number of tubes + +#template: number_of_tubes +# on: beanstalk.current_tubes +# calc: $tubes +# every: 10s 
+# warn: $this < 5 +# delay: up 0 down 5m multiplier 1.2 max 1h +# info: the current number of tubes on the server +# to: sysadmin diff --git a/health/health.d/bind_rndc.conf b/health/health.d/bind_rndc.conf new file mode 100644 index 0000000..4145e77 --- /dev/null +++ b/health/health.d/bind_rndc.conf @@ -0,0 +1,9 @@ + template: bind_rndc_stats_file_size + on: bind_rndc.stats_size + units: megabytes + every: 60 + calc: $stats_size + warn: $this > 512 + crit: $this > 1024 + info: Bind stats file is very large! Consider to create logrotate conf file for it! + to: sysadmin diff --git a/health/health.d/boinc.conf b/health/health.d/boinc.conf new file mode 100644 index 0000000..43c588d --- /dev/null +++ b/health/health.d/boinc.conf @@ -0,0 +1,62 @@ +# Alarms for various BOINC issues. + +# Warn on any compute errors encountered. +template: boinc_compute_errors + on: boinc.states + os: * + hosts: * +families: * + lookup: average -10m unaligned of comperror + units: tasks + every: 1m + warn: $this > 0 + crit: $this > 1 + delay: up 1m down 5m multiplier 1.5 max 1h + info: the total number of compute errors over the past 10 minutes + to: sysadmin + +# Warn on lots of upload errors +template: boinc_upload_errors + on: boinc.states + os: * + hosts: * +families: * + lookup: average -10m unaligned of upload_failed + units: tasks + every: 1m + warn: $this > 0 + crit: $this > 1 + delay: up 1m down 5m multiplier 1.5 max 1h + info: the average number of failed uploads over the past 10 minutes + to: sysadmin + +# Warn on the task queue being empty +template: boinc_total_tasks + on: boinc.tasks + os: * + hosts: * +families: * + lookup: average -10m unaligned of total + units: tasks + every: 1m + warn: $this < 1 + crit: $this < 0.1 + delay: up 5m down 10m multiplier 1.5 max 1h + info: the total number of locally available tasks + to: sysadmin + +# Warn on no active tasks with a non-empty queue +template: boinc_active_tasks + on: boinc.tasks + os: * + hosts: * +families: * + lookup: 
average -10m unaligned of active + calc: ($boinc_total_tasks >= 1) ? ($this) : (inf) + units: tasks + every: 1m + warn: $this < 1 + crit: $this < 0.1 + delay: up 5m down 10m multiplier 1.5 max 1h + info: the total number of active tasks + to: sysadmin diff --git a/health/health.d/btrfs.conf b/health/health.d/btrfs.conf new file mode 100644 index 0000000..b27aa54 --- /dev/null +++ b/health/health.d/btrfs.conf @@ -0,0 +1,57 @@ + +template: btrfs_allocated + on: btrfs.disk + os: * + hosts: * +families: * + calc: 100 - ($unallocated * 100 / ($unallocated + $data_used + $data_free + $meta_used + $meta_free + $sys_used + $sys_free)) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (90) : (95)) + crit: $this > (($status == $CRITICAL) ? (95) : (98)) + delay: up 1m down 15m multiplier 1.5 max 1h + info: the percentage of allocated BTRFS physical disk space + to: sysadmin + +template: btrfs_data + on: btrfs.data + os: * + hosts: * +families: * + calc: $used * 100 / ($used + $free) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (90) : (95)) && $btrfs_allocated > 98 + crit: $this > (($status == $CRITICAL) ? (95) : (98)) && $btrfs_allocated > 98 + delay: up 1m down 15m multiplier 1.5 max 1h + info: the percentage of used BTRFS data space + to: sysadmin + +template: btrfs_metadata + on: btrfs.metadata + os: * + hosts: * +families: * + calc: ($used + $reserved) * 100 / ($used + $free + $reserved) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (90) : (95)) && $btrfs_allocated > 98 + crit: $this > (($status == $CRITICAL) ? (95) : (98)) && $btrfs_allocated > 98 + delay: up 1m down 15m multiplier 1.5 max 1h + info: the percentage of used BTRFS metadata space + to: sysadmin + +template: btrfs_system + on: btrfs.system + os: * + hosts: * +families: * + calc: $used * 100 / ($used + $free) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (90) : (95)) && $btrfs_allocated > 98 + crit: $this > (($status == $CRITICAL) ? 
(95) : (98)) && $btrfs_allocated > 98 + delay: up 1m down 15m multiplier 1.5 max 1h + info: the percentage of used BTRFS system space + to: sysadmin + diff --git a/health/health.d/ceph.conf b/health/health.d/ceph.conf new file mode 100644 index 0000000..de16f7b --- /dev/null +++ b/health/health.d/ceph.conf @@ -0,0 +1,13 @@ +# low ceph disk available + +template: cluster_space_usage + on: ceph.general_usage + calc: $avail * 100 / ($avail + $used) + units: % + every: 10s + warn: $this < 10 + crit: $this < 1 + delay: down 5m multiplier 1.2 max 1h + info: ceph disk usage is almost full + to: sysadmin + diff --git a/health/health.d/cgroups.conf b/health/health.d/cgroups.conf new file mode 100644 index 0000000..79ece53 --- /dev/null +++ b/health/health.d/cgroups.conf @@ -0,0 +1,41 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + +template: cgroup_10min_cpu_usage + on: cgroup.cpu_limit + os: linux + hosts: * + lookup: average -10m unaligned + units: % + every: 1m + warn: $this > (($status >= $WARNING) ? (75) : (85)) + crit: $this > (($status == $CRITICAL) ? (85) : (95)) + delay: down 15m multiplier 1.5 max 1h + info: cpu utilization for the last 10 minutes + to: sysadmin + +template: cgroup_ram_in_use + on: cgroup.mem_usage + os: linux + hosts: * + calc: ($ram) * 100 / $memory_limit + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (80) : (90)) + crit: $this > (($status == $CRITICAL) ? (90) : (98)) + delay: down 15m multiplier 1.5 max 1h + info: RAM used by cgroup + to: sysadmin + +template: cgroup_ram_and_swap_in_use + on: cgroup.mem_usage + os: linux + hosts: * + calc: ($ram + $swap) * 100 / $memory_and_swap_limit + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (80) : (90)) + crit: $this > (($status == $CRITICAL) ? 
(90) : (98)) + delay: down 15m multiplier 1.5 max 1h + info: RAM and Swap used by cgroup + to: sysadmin diff --git a/health/health.d/cockroachdb.conf b/health/health.d/cockroachdb.conf new file mode 100644 index 0000000..8ab2c9d --- /dev/null +++ b/health/health.d/cockroachdb.conf @@ -0,0 +1,91 @@ + +# Availability + +template: cockroachdb_last_collected_secs + on: cockroachdb.live_nodes + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: dba + +# Capacity + +template: cockroachdb_used_storage_capacity + on: cockroachdb.storage_used_capacity_percentage + calc: $capacity_used_percent + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (80) : (85)) + crit: $this > (($status == $CRITICAL) ? (85) : (95)) + delay: down 15m multiplier 1.5 max 1h + info: entire disk usage percentage + to: dba + +template: cockroachdb_used_usable_storage_capacity + on: cockroachdb.storage_used_capacity_percentage + calc: $capacity_usable_used_percent + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (80) : (85)) + crit: $this > (($status == $CRITICAL) ? 
(85) : (95)) + delay: down 15m multiplier 1.5 max 1h + info: usable space usage percentage + to: dba + +# Replication + +template: cockroachdb_unavailable_ranges + on: cockroachdb.ranges_replication_problem + calc: $ranges_unavailable + units: num + every: 10s + warn: $this > 0 + delay: down 15m multiplier 1.5 max 1h + info: number of ranges with fewer live replicas than the replication target + to: dba + +template: cockroachdb_replicas_leaders_not_leaseholders + on: cockroachdb.replicas_leaders + calc: $replicas_leaders_not_leaseholders + units: num + every: 10s + warn: $this > 0 + delay: down 15m multiplier 1.5 max 1h + info: number of replicas that are Raft leaders whose range lease is held by another store + to: dba + +# FD + +template: cockroachdb_open_file_descriptors_limit + on: cockroachdb.process_file_descriptors + calc: $sys_fd_open/$sys_fd_softlimit * 100 + units: % + every: 10s + warn: $this > 80 + delay: down 15m multiplier 1.5 max 1h + info: open file descriptors usage percentage + to: dba + +# SQL + +template: cockroachdb_sql_active_connections + on: cockroachdb.sql_connections + calc: $sql_conns + units: active connections + every: 10s + info: number of active SQL connections + to: dba + +template: cockroachdb_sql_executed_statements_total_last_5m + on: cockroachdb.sql_statements_total + lookup: sum -5m absolute of sql_query_count + units: statements + every: 10s + warn: $this == 0 AND $cockroachdb_sql_active_connections != 0 + delay: down 15m up 30s multiplier 1.5 max 1h + info: number of executed SQL statements in the last 5 minutes + to: dba diff --git a/health/health.d/couchdb.conf b/health/health.d/couchdb.conf new file mode 100644 index 0000000..4a28952 --- /dev/null +++ b/health/health.d/couchdb.conf @@ -0,0 +1,13 @@ + +# make sure couchdb is running + +template: couchdb_last_collected_secs + on: couchdb.request_methods + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? 
($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: dba diff --git a/health/health.d/cpu.conf b/health/health.d/cpu.conf new file mode 100644 index 0000000..fa81898 --- /dev/null +++ b/health/health.d/cpu.conf @@ -0,0 +1,55 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + +template: 10min_cpu_usage + on: system.cpu + os: linux + hosts: * + lookup: average -10m unaligned of user,system,softirq,irq,guest + units: % + every: 1m + warn: $this > (($status >= $WARNING) ? (75) : (85)) + crit: $this > (($status == $CRITICAL) ? (85) : (95)) + delay: down 15m multiplier 1.5 max 1h + info: average cpu utilization for the last 10 minutes (excluding iowait, nice and steal) + to: sysadmin + +template: 10min_cpu_iowait + on: system.cpu + os: linux + hosts: * + lookup: average -10m unaligned of iowait + units: % + every: 1m + warn: $this > (($status >= $WARNING) ? (20) : (40)) + crit: $this > (($status == $CRITICAL) ? (40) : (50)) + delay: down 15m multiplier 1.5 max 1h + info: average CPU wait I/O for the last 10 minutes + to: sysadmin + +template: 20min_steal_cpu + on: system.cpu + os: linux + hosts: * + lookup: average -20m unaligned of steal + units: % + every: 5m + warn: $this > (($status >= $WARNING) ? (5) : (10)) + crit: $this > (($status == $CRITICAL) ? (20) : (30)) + delay: down 1h multiplier 1.5 max 2h + info: average CPU steal time for the last 20 minutes + to: sysadmin + +## FreeBSD +template: 10min_cpu_usage + on: system.cpu + os: freebsd + hosts: * + lookup: average -10m unaligned of user,system,interrupt + units: % + every: 1m + warn: $this > (($status >= $WARNING) ? (75) : (85)) + crit: $this > (($status == $CRITICAL) ? 
(85) : (95)) + delay: down 15m multiplier 1.5 max 1h + info: average cpu utilization for the last 10 minutes (excluding nice) + to: sysadmin diff --git a/health/health.d/dbengine.conf b/health/health.d/dbengine.conf new file mode 100644 index 0000000..274673e --- /dev/null +++ b/health/health.d/dbengine.conf @@ -0,0 +1,50 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + + alarm: 10min_dbengine_global_fs_errors + on: netdata.dbengine_global_errors + os: linux freebsd macos + hosts: * +lookup: sum -10m unaligned of fs_errors + units: errors + every: 10s + crit: $this > 0 + delay: down 15m multiplier 1.5 max 1h + info: number of File-System errors dbengine came across the last 10 minutes (too many open files, wrong permissions etc) + to: sysadmin + + alarm: 10min_dbengine_global_io_errors + on: netdata.dbengine_global_errors + os: linux freebsd macos + hosts: * +lookup: sum -10m unaligned of io_errors + units: errors + every: 10s + crit: $this > 0 + delay: down 1h multiplier 1.5 max 3h + info: number of IO errors dbengine came across the last 10 minutes (CRC errors, out of space, bad disk etc) + to: sysadmin + + alarm: 10min_dbengine_global_flushing_warnings + on: netdata.dbengine_global_errors + os: linux freebsd macos + hosts: * +lookup: sum -10m unaligned of pg_cache_over_half_dirty_events + units: errors + every: 10s + warn: $this > 0 + delay: down 1h multiplier 1.5 max 3h + info: number of times in the last 10 minutes that dbengine dirty pages were over 50% of the instance's page cache, metric data at risk of not being stored in the database, please reduce disk load or use faster disks + to: sysadmin + + alarm: 10min_dbengine_global_flushing_errors + on: netdata.dbengine_long_term_page_stats + os: linux freebsd macos + hosts: * +lookup: sum -10m unaligned of flushing_pressure_deletions + units: pages + every: 10s + crit: $this != 0 + delay: down 1h multiplier 1.5 max 3h + info: number of pages deleted due to failure to flush 
data to disk in the last 10 minutes, metric data were lost to unblock data collection, please reduce disk load or use faster disks + to: sysadmin diff --git a/health/health.d/disks.conf b/health/health.d/disks.conf new file mode 100644 index 0000000..9c194ce --- /dev/null +++ b/health/health.d/disks.conf @@ -0,0 +1,167 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + + +# ----------------------------------------------------------------------------- +# low disk space + +# checking the latest collected values +# raise an alarm if the disk is low on +# available disk space + +template: disk_space_usage + on: disk.space + os: linux freebsd + hosts: * +families: !/dev !/dev/* !/run !/run/* * + calc: $used * 100 / ($avail + $used) + units: % + every: 1m + warn: $this > (($status >= $WARNING ) ? (80) : (90)) + crit: $this > (($status == $CRITICAL) ? (90) : (98)) + delay: up 1m down 15m multiplier 1.5 max 1h + info: current disk space usage + to: sysadmin + +template: disk_inode_usage + on: disk.inodes + os: linux freebsd + hosts: * +families: !/dev !/dev/* !/run !/run/* * + calc: $used * 100 / ($avail + $used) + units: % + every: 1m + warn: $this > (($status >= $WARNING) ? (80) : (90)) + crit: $this > (($status == $CRITICAL) ? 
(90) : (98)) + delay: up 1m down 15m multiplier 1.5 max 1h + info: current disk inode usage + to: sysadmin + + +# ----------------------------------------------------------------------------- +# disk fill rate + +# calculate the rate the disk fills +# use as base, the available space change +# during the last hour + +# this is just a calculation - it has no alarm +# we will use it in the next template to find +# the hours remaining + +template: disk_fill_rate + on: disk.space + os: linux freebsd + hosts: * +families: * + lookup: min -10m at -50m unaligned of avail + calc: ($this - $avail) / (($now - $after) / 3600) + every: 1m + units: GB/hour + info: average rate the disk fills up (positive), or frees up (negative) space, for the last hour + + +# calculate the hours remaining +# if the disk continues to fill +# in this rate + +template: out_of_disk_space_time + on: disk.space + os: linux freebsd + hosts: * +families: * + calc: ($disk_fill_rate > 0) ? ($avail / $disk_fill_rate) : (inf) + units: hours + every: 10s + warn: $this > 0 and $this < (($status >= $WARNING) ? (48) : (8)) + crit: $this > 0 and $this < (($status == $CRITICAL) ? 
(24) : (2)) + delay: down 15m multiplier 1.2 max 1h + info: estimated time the disk will run out of space, if the system continues to add data with the rate of the last hour + to: sysadmin + + +# ----------------------------------------------------------------------------- +# disk inode fill rate + +# calculate the rate the disk inodes are allocated +# use as base, the available inodes change +# during the last hour + +# this is just a calculation - it has no alarm +# we will use it in the next template to find +# the hours remaining + +template: disk_inode_rate + on: disk.inodes + os: linux freebsd + hosts: * +families: * + lookup: min -10m at -50m unaligned of avail + calc: ($this - $avail) / (($now - $after) / 3600) + every: 1m + units: inodes/hour + info: average rate at which disk inodes are allocated (positive), or freed (negative), for the last hour + +# calculate the hours remaining +# if the disk inodes are allocated +# in this rate + +template: out_of_disk_inodes_time + on: disk.inodes + os: linux freebsd + hosts: * +families: * + calc: ($disk_inode_rate > 0) ? ($avail / $disk_inode_rate) : (inf) + units: hours + every: 10s + warn: $this > 0 and $this < (($status >= $WARNING) ? (48) : (8)) + crit: $this > 0 and $this < (($status == $CRITICAL) ? (24) : (2)) + delay: down 15m multiplier 1.2 max 1h + info: estimated time the disk will run out of inodes, if the system continues to allocate inodes with the rate of the last hour + to: sysadmin + + +# ----------------------------------------------------------------------------- +# disk congestion + +# raise an alarm if the disk is congested +# by calculating the average disk utilization +# for the last 10 minutes + +template: 10min_disk_utilization + on: disk.util + os: linux freebsd + hosts: * +families: * + lookup: average -10m unaligned + units: % + every: 1m + green: 90 + red: 98 + warn: $this > $green * (($status >= $WARNING) ? (0.7) : (1)) + crit: $this > $red * (($status == $CRITICAL) ? 
(0.7) : (1)) + delay: down 15m multiplier 1.2 max 1h + info: the percentage of time the disk was busy, during the last 10 minutes + to: sysadmin + + +# raise an alarm if the disk backlog +# is above 1000ms (1s) per second +# for 10 minutes +# (i.e. the disk cannot catch up) + +template: 10min_disk_backlog + on: disk.backlog + os: linux + hosts: * +families: * + lookup: average -10m unaligned + units: ms + every: 1m + green: 2000 + red: 5000 + warn: $this > $green * (($status >= $WARNING) ? (0.7) : (1)) + crit: $this > $red * (($status == $CRITICAL) ? (0.7) : (1)) + delay: down 15m multiplier 1.2 max 1h + info: average of the kernel estimated disk backlog, for the last 10 minutes + to: sysadmin diff --git a/health/health.d/dns_query.conf b/health/health.d/dns_query.conf new file mode 100644 index 0000000..113c950 --- /dev/null +++ b/health/health.d/dns_query.conf @@ -0,0 +1,12 @@ + +# detect dns query failure + +template: dns_query_time_query_time + on: dns_query_time.query_time + lookup: average -10s unaligned foreach * + units: ms + every: 10s + warn: $this == nan + delay: up 20s down 5m multiplier 1.5 max 1h + info: query round trip time + to: sysadmin diff --git a/health/health.d/dnsmasq_dhcp.conf b/health/health.d/dnsmasq_dhcp.conf new file mode 100644 index 0000000..ecf3b84 --- /dev/null +++ b/health/health.d/dnsmasq_dhcp.conf @@ -0,0 +1,12 @@ +# dhcp-range utilization + +template: dnsmasq_dhcp_dhcp_range_utilization + on: dnsmasq_dhcp.dhcp_range_utilization + every: 10s + units: % + calc: $used + warn: $this > ( ($status >= $WARNING ) ? ( 80 ) : ( 90 ) ) + crit: $this > ( ($status >= $CRITICAL) ? ( 90 ) : ( 95 ) ) + delay: down 5m + info: dhcp-range utilization above threshold! 
+ to: sysadmin diff --git a/health/health.d/dockerd.conf b/health/health.d/dockerd.conf new file mode 100644 index 0000000..729906c --- /dev/null +++ b/health/health.d/dockerd.conf @@ -0,0 +1,8 @@ +template: docker_unhealthy_containers + on: docker.unhealthy_containers + units: unhealthy containers + every: 10s + lookup: average -10s + crit: $this > 0 + info: number of unhealthy containers + to: sysadmin diff --git a/health/health.d/elasticsearch.conf b/health/health.d/elasticsearch.conf new file mode 100644 index 0000000..f442344 --- /dev/null +++ b/health/health.d/elasticsearch.conf @@ -0,0 +1,12 @@ + +# make sure elasticsearch is running + +template: elasticsearch_last_collected + on: elasticsearch.cluster_health_status + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + info: number of seconds since the last successful data collection + to: sysadmin diff --git a/health/health.d/entropy.conf b/health/health.d/entropy.conf new file mode 100644 index 0000000..66d44ec --- /dev/null +++ b/health/health.d/entropy.conf @@ -0,0 +1,16 @@ + +# check if entropy is too low +# the alarm is checked every 1 minute +# and examines the last hour of data + + alarm: lowest_entropy + on: system.entropy + os: linux + hosts: * + lookup: min -10m unaligned + units: entries + every: 5m + warn: $this < (($status >= $WARNING) ? 
(200) : (100)) + delay: down 1h multiplier 1.5 max 2h + info: minimum entries in the random numbers pool in the last 10 minutes + to: silent diff --git a/health/health.d/exporting.conf b/health/health.d/exporting.conf new file mode 100644 index 0000000..506cb0c --- /dev/null +++ b/health/health.d/exporting.conf @@ -0,0 +1,34 @@ + +template: exporting_last_buffering +families: * + on: exporting_data_size + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful buffering of exporting data + to: dba + +template: exporting_metrics_sent +families: * + on: exporting_data_size + units: % + calc: abs($sent) * 100 / abs($buffered) + every: 10s + warn: $this != 100 + delay: down 5m multiplier 1.5 max 1h + info: percentage of metrics sent to the external database server + to: dba + +template: exporting_metrics_lost +families: * + on: exporting_data_size + units: metrics + calc: abs($lost) + every: 10s + crit: ($this != 0) || ($status == $CRITICAL && abs($sent) == 0) + delay: down 5m multiplier 1.5 max 1h + info: number of metrics lost due to repeating failures to contact the external database server + to: dba diff --git a/health/health.d/fping.conf b/health/health.d/fping.conf new file mode 100644 index 0000000..43658fe --- /dev/null +++ b/health/health.d/fping.conf @@ -0,0 +1,53 @@ + +template: fping_last_collected_secs +families: * + on: fping.latency + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? 
($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: sysadmin + +template: host_reachable +families: * + on: fping.latency + calc: $average != nan + units: up/down + every: 10s + crit: $this == 0 + info: states if the remote host is reachable + delay: down 30m multiplier 1.5 max 2h + to: sysadmin + +template: host_latency +families: * + on: fping.latency + lookup: average -10s unaligned of average + units: ms + every: 10s + green: 500 + red: 1000 + warn: $this > $green OR $max > $red + crit: $this > $red + info: average round trip delay during the last 10 seconds + delay: down 30m multiplier 1.5 max 2h + to: sysadmin + +template: packet_loss +families: * + on: fping.quality + lookup: average -10m unaligned of returned + calc: 100 - $this + green: 1 + red: 10 + units: % + every: 10s + warn: $this > $green + crit: $this > $red + info: packet loss percentage + delay: down 30m multiplier 1.5 max 2h + to: sysadmin + diff --git a/health/health.d/fronius.conf b/health/health.d/fronius.conf new file mode 100644 index 0000000..cdf6c8f --- /dev/null +++ b/health/health.d/fronius.conf @@ -0,0 +1,11 @@ +template: fronius_last_collected_secs +families: * + on: fronius.power + calc: $now - $last_collected_t + every: 10s + units: seconds ago + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? 
($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: sitemgr diff --git a/health/health.d/gearman.conf b/health/health.d/gearman.conf new file mode 100644 index 0000000..e3863ae --- /dev/null +++ b/health/health.d/gearman.conf @@ -0,0 +1,22 @@ +# make sure Gearman is running +template: gearman_last_collected_secs + on: gearman.total_jobs + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: sysadmin + +template: gearman_workers_queued + on: gearman.single_job + lookup: average -10m unaligned match-names of Queued + units: workers + every: 10s + warn: $this > 30000 + crit: $this > 100000 + delay: down 5m multiplier 1.5 max 1h + info: number of queued jobs + to: sysadmin
\ No newline at end of file diff --git a/health/health.d/haproxy.conf b/health/health.d/haproxy.conf new file mode 100644 index 0000000..e49c70d --- /dev/null +++ b/health/health.d/haproxy.conf @@ -0,0 +1,27 @@ +template: haproxy_backend_server_status + on: haproxy_hs.down + units: failed servers + every: 10s + lookup: average -10s + crit: $this > 0 + info: number of failed haproxy backend servers + to: sysadmin + +template: haproxy_backend_status + on: haproxy_hb.down + units: failed backend + every: 10s + lookup: average -10s + crit: $this > 0 + info: number of failed haproxy backends + to: sysadmin + +template: haproxy_last_collected + on: haproxy_hb.down + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + info: number of seconds since the last successful data collection + to: sysadmin diff --git a/health/health.d/hdfs.conf b/health/health.d/hdfs.conf new file mode 100644 index 0000000..678faab --- /dev/null +++ b/health/health.d/hdfs.conf @@ -0,0 +1,75 @@ + +# make sure hdfs is running + +template: hdfs_last_collected_secs + on: hdfs.heap_memory + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: webmaster + + +# Common + +template: hdfs_capacity_usage + on: hdfs.capacity + calc: ($used) * 100 / ($used + $remaining) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (70) : (80)) + crit: $this > (($status == $CRITICAL) ? 
(80) : (98)) + delay: down 15m multiplier 1.5 max 1h + info: used capacity + to: sysadmin + + +# NameNode + +template: hdfs_missing_blocks + on: hdfs.blocks + calc: $missing + units: missing blocks + every: 10s + warn: $this > 0 + delay: down 15m multiplier 1.5 max 1h + info: missing blocks + to: sysadmin + + +template: hdfs_stale_nodes + on: hdfs.data_nodes + calc: $stale + units: dead nodes + every: 10s + warn: $this > 0 + delay: down 15m multiplier 1.5 max 1h + info: stale data nodes + to: sysadmin + + +template: hdfs_dead_nodes + on: hdfs.data_nodes + calc: $dead + units: dead nodes + every: 10s + crit: $this > 0 + delay: down 15m multiplier 1.5 max 1h + info: dead data nodes + to: sysadmin + + +# DataNode + +template: hdfs_num_failed_volumes + on: hdfs.num_failed_volumes + calc: $fsds_num_failed_volumes + units: failed volumes + every: 10s + warn: $this > 0 + delay: down 15m multiplier 1.5 max 1h + info: failed volumes + to: sysadmin diff --git a/health/health.d/httpcheck.conf b/health/health.d/httpcheck.conf new file mode 100644 index 0000000..0ddf35e --- /dev/null +++ b/health/health.d/httpcheck.conf @@ -0,0 +1,99 @@ +template: httpcheck_last_collected_secs +families: * + on: httpcheck.status + calc: $now - $last_collected_t + every: 10s + units: seconds ago + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: sysadmin + +# This is a fast-reacting no-notification alarm ideal for custom dashboards or badges +template: web_service_up +families: * + on: httpcheck.status + lookup: average -1m unaligned percentage of success + calc: ($this < 75) ? 
(0) : ($this) + every: 5s + units: up/down + info: at least 75% verified responses during last 60 seconds, ideal for badges + to: silent + +template: web_service_bad_content +families: * + on: httpcheck.status + lookup: average -5m unaligned percentage of bad_content + every: 10s + units: % + warn: $this >= 10 AND $this < 40 + crit: $this >= 40 + delay: down 5m multiplier 1.5 max 1h + info: average of unexpected http response content during the last 5 minutes + options: no-clear-notification + to: webmaster + +template: web_service_bad_status +families: * + on: httpcheck.status + lookup: average -5m unaligned percentage of bad_status + every: 10s + units: % + warn: $this >= 10 AND $this < 40 + crit: $this >= 40 + delay: down 5m multiplier 1.5 max 1h + info: average of unexpected http status during the last 5 minutes + options: no-clear-notification + to: webmaster + +template: web_service_timeouts +families: * + on: httpcheck.status + lookup: average -5m unaligned percentage of timeout + every: 10s + units: % + info: average of timeouts during the last 5 minutes + +template: no_web_service_connections +families: * + on: httpcheck.status + lookup: average -5m unaligned percentage of no_connection + every: 10s + units: % + info: average of failed requests during the last 5 minutes + +# combined timeout & no connection alarm +template: web_service_unreachable +families: * + on: httpcheck.status + calc: ($no_web_service_connections >= $web_service_timeouts) ? 
($no_web_service_connections) : ($web_service_timeouts) + units: % + every: 10s + warn: ($no_web_service_connections >= 10 OR $web_service_timeouts >= 10) AND ($no_web_service_connections < 40 OR $web_service_timeouts < 40) + crit: $no_web_service_connections >= 40 OR $web_service_timeouts >= 40 + delay: down 5m multiplier 1.5 max 1h + info: average of failed requests either due to timeouts or no connection during the last 5 minutes + options: no-clear-notification + to: webmaster + +template: 1h_web_service_response_time +families: * + on: httpcheck.responsetime + lookup: average -1h unaligned of time + every: 30s + units: ms + info: average response time over the last hour + +template: web_service_slow +families: * + on: httpcheck.responsetime + lookup: average -3m unaligned of time + units: ms + every: 10s + warn: ($this > ($1h_web_service_response_time * 2) ) + crit: ($this > ($1h_web_service_response_time * 3) ) + info: average response time over the last 3 minutes, compared to the average over the last hour + delay: down 5m multiplier 1.5 max 1h + options: no-clear-notification + to: webmaster diff --git a/health/health.d/ioping.conf b/health/health.d/ioping.conf new file mode 100644 index 0000000..59a5c8e --- /dev/null +++ b/health/health.d/ioping.conf @@ -0,0 +1,13 @@ +template: disk_latency +families: * + on: ioping.latency + lookup: average -10s unaligned of average + units: ms + every: 10s + green: 500 + red: 1000 + warn: $this > $green OR $max > $red + crit: $this > $red + info: average round trip delay during the last 10 seconds + delay: down 30m multiplier 1.5 max 2h + to: sysadmin diff --git a/health/health.d/ipc.conf b/health/health.d/ipc.conf new file mode 100644 index 0000000..989d6e9 --- /dev/null +++ b/health/health.d/ipc.conf @@ -0,0 +1,28 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + + alarm: semaphores_used + on: system.ipc_semaphores + os: linux + hosts: * + calc: $semaphores * 100 / $ipc_semaphores_max 
+ units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (70) : (80)) + crit: $this > (($status == $CRITICAL) ? (70) : (90)) + delay: down 5m multiplier 1.5 max 1h + info: the percentage of IPC semaphores used + to: sysadmin + + alarm: semaphore_arrays_used + on: system.ipc_semaphore_arrays + os: linux + hosts: * + calc: $arrays * 100 / $ipc_semaphores_arrays_max + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (70) : (80)) + crit: $this > (($status == $CRITICAL) ? (70) : (90)) + delay: down 5m multiplier 1.5 max 1h + info: the percentage of IPC semaphore arrays used + to: sysadmin diff --git a/health/health.d/ipfs.conf b/health/health.d/ipfs.conf new file mode 100644 index 0000000..3f77572 --- /dev/null +++ b/health/health.d/ipfs.conf @@ -0,0 +1,11 @@ + +template: ipfs_datastore_usage + on: ipfs.repo_size + calc: $size * 100 / $avail + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (80) : (90)) + crit: $this > (($status == $CRITICAL) ? (90) : (98)) + delay: down 15m multiplier 1.5 max 1h + info: ipfs Datastore close to running out of space + to: sysadmin diff --git a/health/health.d/ipmi.conf b/health/health.d/ipmi.conf new file mode 100644 index 0000000..c255819 --- /dev/null +++ b/health/health.d/ipmi.conf @@ -0,0 +1,20 @@ + alarm: ipmi_sensors_states + on: ipmi.sensors_states + calc: $warning + $critical + units: sensors + every: 10s + warn: $this > 0 + crit: $critical > 0 + delay: up 5m down 15m multiplier 1.5 max 1h + info: the number IPMI sensors in non-nominal state + to: sysadmin + + alarm: ipmi_events + on: ipmi.events + calc: $events + units: events + every: 10s + warn: $this > 0 + delay: up 5m down 15m multiplier 1.5 max 1h + info: the number of events in the IPMI System Event Log (SEL) + to: sysadmin diff --git a/health/health.d/isc_dhcpd.conf b/health/health.d/isc_dhcpd.conf new file mode 100644 index 0000000..8054656 --- /dev/null +++ b/health/health.d/isc_dhcpd.conf @@ -0,0 +1,10 @@ + template: 
isc_dhcpd_leases_size + on: isc_dhcpd.leases_total + units: KB + every: 60 + calc: $leases_size + warn: $this > 3072 + crit: $this > 6144 + delay: up 2m down 5m + info: dhcpd.leases file too big! Module can slow down your server. + to: sysadmin diff --git a/health/health.d/kubelet.conf b/health/health.d/kubelet.conf new file mode 100644 index 0000000..d2ef24b --- /dev/null +++ b/health/health.d/kubelet.conf @@ -0,0 +1,115 @@ +# you can disable an alarm notification by setting the 'to' line to: silent + +# ----------------------------------------------------------------------------- + +# True (1) if the node is experiencing a configuration-related error, false (0) otherwise. + + template: node_config_error + on: k8s_kubelet.kubelet_node_config_error + calc: $kubelet_node_config_error + units: bool + every: 10s + warn: $this == 1 + delay: down 1m multiplier 1.5 max 2h + info: the node is experiencing a configuration-related error + to: sysadmin + +# Failed Token() requests to the alternate token source + + template: token_requests + lookup: sum -10s of token_fail_count + on: k8s_kubelet.kubelet_token_requests + units: failed requests + every: 10s + warn: $this > 0 + delay: down 1m multiplier 1.5 max 2h + info: failed token requests to alternate token source + to: sysadmin + +# Docker and runtime operation errors + + template: kubelet_operations_error + lookup: sum -1m + on: k8s_kubelet.kubelet_operations_errors + units: errors + every: 10s + warn: $this > (($status >= $WARNING) ? (0) : (20)) + delay: up 30s down 1m multiplier 1.5 max 2h + info: operations error + to: sysadmin + +# ----------------------------------------------------------------------------- + +# Pod Lifecycle Event Generator Relisting Latency + +# 1. calculate the pleg relisting latency for 1m (quantile 0.5, quantile 0.9, quantile 0.99) +# 2. do the same for the last 10s +# 3. 
raise an alarm if the later is: +# - 2x the first for quantile 0.5 +# - 4x the first for quantile 0.9 +# - 8x the first for quantile 0.99 +# +# we assume the minimum latency is 1000 microseconds + +# quantile 0.5 + +template: 1m_kubelet_pleg_relist_latency_quantile_05 + on: k8s_kubelet.kubelet_pleg_relist_latency_microseconds + lookup: average -1m unaligned of kubelet_pleg_relist_latency_05 + units: microseconds + every: 10s + info: the average value of pleg relisting latency during the last minute (quantile 0.5) + +template: 10s_kubelet_pleg_relist_latency_quantile_05 + on: k8s_kubelet.kubelet_pleg_relist_latency_microseconds + lookup: average -10s unaligned of kubelet_pleg_relist_latency_05 + calc: $this * 100 / (($1m_kubelet_pleg_relist_latency_quantile_05 < 1000)?(1000):($1m_kubelet_pleg_relist_latency_quantile_05)) + every: 10s + units: % + warn: $this > (($status >= $WARNING)?(100):(200)) + crit: $this > (($status >= $WARNING)?(200):(400)) + delay: down 1m multiplier 1.5 max 2h + info: the % of the pleg relisting latency in the last 10 seconds, compared to the last minute (quantile 0.5) + to: sysadmin + +# quantile 0.9 + +template: 1m_kubelet_pleg_relist_latency_quantile_09 + on: k8s_kubelet.kubelet_pleg_relist_latency_microseconds + lookup: average -1m unaligned of kubelet_pleg_relist_latency_09 + units: microseconds + every: 10s + info: the average value of pleg relisting latency during the last minute (quantile 0.9) + +template: 10s_kubelet_pleg_relist_latency_quantile_09 + on: k8s_kubelet.kubelet_pleg_relist_latency_microseconds + lookup: average -10s unaligned of kubelet_pleg_relist_latency_09 + calc: $this * 100 / (($1m_kubelet_pleg_relist_latency_quantile_09 < 1000)?(1000):($1m_kubelet_pleg_relist_latency_quantile_09)) + every: 10s + units: % + warn: $this > (($status >= $WARNING)?(200):(400)) + crit: $this > (($status >= $WARNING)?(400):(800)) + delay: down 1m multiplier 1.5 max 2h + info: the % of the pleg relisting latency in the last 10 seconds, 
compared to the last minute (quantile 0.9) + to: sysadmin + +# quantile 0.99 + +template: 1m_kubelet_pleg_relist_latency_quantile_099 + on: k8s_kubelet.kubelet_pleg_relist_latency_microseconds + lookup: average -1m unaligned of kubelet_pleg_relist_latency_099 + units: microseconds + every: 10s + info: the average value of pleg relisting latency during the last minute (quantile 0.99) + +template: 10s_kubelet_pleg_relist_latency_quantile_099 + on: k8s_kubelet.kubelet_pleg_relist_latency_microseconds + lookup: average -10s unaligned of kubelet_pleg_relist_latency_099 + calc: $this * 100 / (($1m_kubelet_pleg_relist_latency_quantile_099 < 1000)?(1000):($1m_kubelet_pleg_relist_latency_quantile_099)) + every: 10s + units: % + warn: $this > (($status >= $WARNING)?(400):(800)) + crit: $this > (($status >= $WARNING)?(800):(1200)) + delay: down 1m multiplier 1.5 max 2h + info: the % of the pleg relisting latency in the last 10 seconds, compared to the last minute (quantile 0.99) + to: sysadmin diff --git a/health/health.d/lighttpd.conf b/health/health.d/lighttpd.conf new file mode 100644 index 0000000..915907a --- /dev/null +++ b/health/health.d/lighttpd.conf @@ -0,0 +1,14 @@ + +# make sure lighttpd is running + +template: lighttpd_last_collected_secs + on: lighttpd.requests + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: webmaster + diff --git a/health/health.d/linux_power_supply.conf b/health/health.d/linux_power_supply.conf new file mode 100644 index 0000000..38727be --- /dev/null +++ b/health/health.d/linux_power_supply.conf @@ -0,0 +1,12 @@ +# Alert on low battery capacity. 
+ +template: linux_power_supply_capacity + on: powersupply.capacity + calc: $capacity + units: % + every: 10s + warn: $this < 10 + crit: $this < 5 + delay: up 30s down 5m multiplier 1.2 max 1h + info: the percentage remaining capacity of the power supply + to: sysadmin diff --git a/health/health.d/load.conf b/health/health.d/load.conf new file mode 100644 index 0000000..ee0c54b --- /dev/null +++ b/health/health.d/load.conf @@ -0,0 +1,56 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + +# Calculate the base trigger point for the load average alarms. +# This is the maximum number of CPU's in the system over the past 1 +# minute, with a special case for a single CPU of setting the trigger at 2. + alarm: load_trigger + on: system.load + os: linux + hosts: * + calc: ($active_processors == nan or $active_processors == inf or $active_processors < 2) ? ( 2 ) : ( $active_processors ) + units: cpus + every: 1m + info: trigger point for load average alarms + +# Send alarms if the load average is unusually high. +# These intentionally _do not_ calculate the average over the sampled +# time period because the values being checked already are averages. + alarm: load_average_15 + on: system.load + os: linux + hosts: * + lookup: max -1m unaligned of load15 + units: load + every: 1m + warn: $this > (($status >= $WARNING) ? (1.75 * $load_trigger) : (2 * $load_trigger)) + crit: $this > (($status == $CRITICAL) ? (3.5 * $load_trigger) : (4 * $load_trigger)) + delay: down 15m multiplier 1.5 max 1h + info: fifteen-minute load average + to: sysadmin + + alarm: load_average_5 + on: system.load + os: linux + hosts: * + lookup: max -1m unaligned of load5 + units: load + every: 1m + warn: $this > (($status >= $WARNING) ? (3.5 * $load_trigger) : (4 * $load_trigger)) + crit: $this > (($status == $CRITICAL) ? 
(7 * $load_trigger) : (8 * $load_trigger)) + delay: down 15m multiplier 1.5 max 1h + info: five-minute load average + to: sysadmin + + alarm: load_average_1 + on: system.load + os: linux + hosts: * + lookup: max -1m unaligned of load1 + units: load + every: 1m + warn: $this > (($status >= $WARNING) ? (7 * $load_trigger) : (8 * $load_trigger)) + crit: $this > (($status == $CRITICAL) ? (14 * $load_trigger) : (16 * $load_trigger)) + delay: down 15m multiplier 1.5 max 1h + info: one-minute load average + to: sysadmin diff --git a/health/health.d/mdstat.conf b/health/health.d/mdstat.conf new file mode 100644 index 0000000..2f906e1 --- /dev/null +++ b/health/health.d/mdstat.conf @@ -0,0 +1,38 @@ +template: mdstat_last_collected + on: md.disks + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + info: number of seconds since the last successful data collection + to: sysadmin + +template: mdstat_disks + on: md.disks + units: failed devices + every: 10s + calc: $down + crit: $this > 0 + info: Array is degraded! + to: sysadmin + +template: mdstat_mismatch_cnt + on: md.mismatch_cnt + units: unsynchronized blocks + calc: $count + every: 60s + warn: $this > 1024 + delay: up 30m + info: Mismatch count! + to: sysadmin + +template: mdstat_nonredundant_last_collected + on: md.nonredundant + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + info: number of seconds since the last successful data collection + to: sysadmin
\ No newline at end of file diff --git a/health/health.d/megacli.conf b/health/health.d/megacli.conf new file mode 100644 index 0000000..6e81a2a --- /dev/null +++ b/health/health.d/megacli.conf @@ -0,0 +1,48 @@ +template: adapter_state + on: megacli.adapter_degraded + units: is degraded + lookup: sum -10s + every: 10s + crit: $this > 0 + info: adapter state + to: sysadmin + +template: bbu_relative_charge + on: megacli.bbu_relative_charge + units: percent + lookup: average -10s + every: 10s + warn: $this <= (($status >= $WARNING) ? (85) : (80)) + crit: $this <= (($status == $CRITICAL) ? (50) : (40)) + info: BBU relative state of charge + to: sysadmin + +template: bbu_cycle_count + on: megacli.bbu_cycle_count + units: cycle count + lookup: average -10s + every: 10s + warn: $this >= 100 + crit: $this >= 500 + info: BBU cycle count + to: sysadmin + +template: pd_media_errors + on: megacli.pd_media_error + units: media errors + lookup: sum -10s + every: 10s + warn: $this > 0 + delay: down 1m multiplier 2 max 10m + info: physical drive media errors + to: sysadmin + +template: pd_predictive_failures + on: megacli.pd_predictive_failure + units: predictive failures + lookup: sum -10s + every: 10s + warn: $this > 0 + delay: down 1m multiplier 2 max 10m + info: physical drive predictive failures + to: sysadmin diff --git a/health/health.d/memcached.conf b/health/health.d/memcached.conf new file mode 100644 index 0000000..d248ef5 --- /dev/null +++ b/health/health.d/memcached.conf @@ -0,0 +1,52 @@ + +# make sure memcached is running + +template: memcached_last_collected_secs + on: memcached.cache + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? 
($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: dba + + +# detect if memcached cache is full + +template: memcached_cache_memory_usage + on: memcached.cache + calc: $used * 100 / ($used + $available) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (70) : (80)) + crit: $this > (($status == $CRITICAL) ? (80) : (90)) + delay: up 0 down 15m multiplier 1.5 max 1h + info: current cache memory usage + to: dba + + +# find the rate memcached cache is filling + +template: cache_fill_rate + on: memcached.cache + lookup: min -10m at -50m unaligned of available + calc: ($this - $available) / (($now - $after) / 3600) + units: KB/hour + every: 1m + info: average rate the cache fills up (positive), or frees up (negative) space, for the last hour + + +# find the hours remaining until memcached cache is full + +template: out_of_cache_space_time + on: memcached.cache + calc: ($cache_fill_rate > 0) ? ($available / $cache_fill_rate) : (inf) + units: hours + every: 10s + warn: $this > 0 and $this < (($status >= $WARNING) ? (48) : (8)) + crit: $this > 0 and $this < (($status == $CRITICAL) ? 
(24) : (2)) + delay: down 15m multiplier 1.5 max 1h + info: estimated time the cache will run out of space, if the system continues to add data with the rate of the last hour + to: dba diff --git a/health/health.d/memory.conf b/health/health.d/memory.conf new file mode 100644 index 0000000..4a0e6e5 --- /dev/null +++ b/health/health.d/memory.conf @@ -0,0 +1,38 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + + alarm: 1hour_ecc_memory_correctable + on: mem.ecc_ce + os: linux + hosts: * + lookup: sum -10m unaligned + units: errors + every: 1m + warn: $this > 0 + delay: down 1h multiplier 1.5 max 1h + info: number of ECC correctable errors during the last hour + to: sysadmin + + alarm: 1hour_ecc_memory_uncorrectable + on: mem.ecc_ue + os: linux + hosts: * + lookup: sum -10m unaligned + units: errors + every: 1m + crit: $this > 0 + delay: down 1h multiplier 1.5 max 1h + info: number of ECC uncorrectable errors during the last hour + to: sysadmin + + alarm: 1hour_memory_hw_corrupted + on: mem.hwcorrupt + os: linux + hosts: * + calc: $HardwareCorrupted + units: MB + every: 10s + warn: $this > 0 + delay: down 1h multiplier 1.5 max 1h + info: amount of memory corrupted due to a hardware failure + to: sysadmin diff --git a/health/health.d/mongodb.conf b/health/health.d/mongodb.conf new file mode 100644 index 0000000..a80cb31 --- /dev/null +++ b/health/health.d/mongodb.conf @@ -0,0 +1,13 @@ + +# make sure mongodb is running + +template: mongodb_last_collected_secs + on: mongodb.read_operations + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? 
($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: dba diff --git a/health/health.d/mysql.conf b/health/health.d/mysql.conf new file mode 100644 index 0000000..62cef5a --- /dev/null +++ b/health/health.d/mysql.conf @@ -0,0 +1,146 @@ + +# make sure mysql is running + +template: mysql_last_collected_secs + on: mysql.queries + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: dba + + +# ----------------------------------------------------------------------------- +# slow queries + +template: mysql_10s_slow_queries + on: mysql.queries + lookup: sum -10s of slow_queries + units: slow queries + every: 10s + warn: $this > (($status >= $WARNING) ? (5) : (10)) + crit: $this > (($status == $CRITICAL) ? (10) : (20)) + delay: down 5m multiplier 1.5 max 1h + info: number of mysql slow queries over the last 10 seconds + to: dba + + +# ----------------------------------------------------------------------------- +# lock waits + +template: mysql_10s_table_locks_immediate + on: mysql.table_locks + lookup: sum -10s absolute of immediate + units: immediate locks + every: 10s + info: number of table immediate locks over the last 10 seconds + to: dba + +template: mysql_10s_table_locks_waited + on: mysql.table_locks + lookup: sum -10s absolute of waited + units: waited locks + every: 10s + info: number of table waited locks over the last 10 seconds + to: dba + +template: mysql_10s_waited_locks_ratio + on: mysql.table_locks + calc: ( ($mysql_10s_table_locks_waited + $mysql_10s_table_locks_immediate) > 0 ) ? 
(($mysql_10s_table_locks_waited * 100) / ($mysql_10s_table_locks_waited + $mysql_10s_table_locks_immediate)) : 0 + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (10) : (25)) + crit: $this > (($status == $CRITICAL) ? (25) : (50)) + delay: down 30m multiplier 1.5 max 1h + info: the ratio of mysql waited table locks, for the last 10 seconds + to: dba + + +# ----------------------------------------------------------------------------- +# connections + +template: mysql_connections + on: mysql.connections_active + calc: $active * 100 / $limit + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (60) : (70)) + crit: $this > (($status == $CRITICAL) ? (80) : (90)) + delay: down 15m multiplier 1.5 max 1h + info: the ratio of current active connections vs the maximum possible number of connections + to: dba + + +# ----------------------------------------------------------------------------- +# replication + +template: mysql_replication + on: mysql.slave_status + calc: ($sql_running <= 0 OR $io_running <= 0)?0:1 + units: ok/failed + every: 10s + crit: $this == 0 + delay: down 5m multiplier 1.5 max 1h + info: checks if mysql replication has stopped + to: dba + +template: mysql_replication_lag + on: mysql.slave_behind + calc: $seconds + units: seconds + every: 10s + warn: $this > (($status >= $WARNING) ? (5) : (10)) + crit: $this > (($status == $CRITICAL) ? 
(10) : (30)) + delay: down 15m multiplier 1.5 max 1h + info: the number of seconds mysql replication is behind this master + to: dba + + +# ----------------------------------------------------------------------------- +# galera cluster size + +template: mysql_galera_cluster_size_max_2m + on: mysql.galera_cluster_size + lookup: max -2m absolute + units: nodes + every: 10s + info: max cluster size 2 minute + to: dba + +template: mysql_galera_cluster_size + on: mysql.galera_cluster_size + calc: $nodes + units: nodes + every: 10s + warn: $this > $mysql_galera_cluster_size_max_2m + crit: $this < $mysql_galera_cluster_size_max_2m + delay: up 20s down 5m multiplier 1.5 max 1h + info: cluster size has changed + to: dba + +# galera node state + +template: mysql_galera_cluster_state + on: mysql.galera_cluster_state + calc: $state + every: 10s + warn: $this < 4 + crit: $this < 2 + delay: up 30s down 5m multiplier 1.5 max 1h + info: node state (0: undefined, 1: joining, 2: donor/desynced, 3: joined, 4: synced) + to: dba + + +# galera node status + +template: mysql_galera_cluster_status + on: mysql.galera_cluster_status + calc: $wsrep_cluster_status + every: 10s + crit: $mysql_galera_cluster_state != nan AND $this != 0 + delay: up 30s down 5m multiplier 1.5 max 1h + info: node and cluster status (-1: unknown, 0: primary/quorum present, 1: non-primary/quorum lost, 2: disconnected) + to: dba diff --git a/health/health.d/named.conf b/health/health.d/named.conf new file mode 100644 index 0000000..4fc65c8 --- /dev/null +++ b/health/health.d/named.conf @@ -0,0 +1,14 @@ + +# make sure named is running + +template: named_last_collected_secs + on: named.global_queries + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? 
($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: domainadmin + diff --git a/health/health.d/net.conf b/health/health.d/net.conf new file mode 100644 index 0000000..261290e --- /dev/null +++ b/health/health.d/net.conf @@ -0,0 +1,195 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + +# ----------------------------------------------------------------------------- +# net traffic overflow + + template: interface_speed + on: net.net + os: * + hosts: * + families: * + calc: ( $nic_speed_max > 0 ) ? ( $nic_speed_max) : ( nan ) + units: Mbit + every: 10s + info: The current speed of the physical network interface + + template: 1m_received_traffic_overflow + on: net.net + os: linux + hosts: * + families: * + lookup: average -1m unaligned absolute of received + calc: ($interface_speed > 0) ? ($this * 100 / ($interface_speed * 1000)) : ( nan ) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (80) : (85)) + crit: $this > (($status == $CRITICAL) ? (85) : (90)) + delay: down 1m multiplier 1.5 max 1h + info: interface received bandwidth usage over net device speed max + to: sysadmin + + template: 1m_sent_traffic_overflow + on: net.net + os: linux + hosts: * + families: * + lookup: average -1m unaligned absolute of sent + calc: ($interface_speed > 0) ? ($this * 100 / ($interface_speed * 1000)) : ( nan ) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (80) : (85)) + crit: $this > (($status == $CRITICAL) ? 
(85) : (90)) + delay: down 1m multiplier 1.5 max 1h + info: interface sent bandwidth usage over net device speed max + to: sysadmin + +# ----------------------------------------------------------------------------- +# dropped packets + +# check if an interface is dropping packets +# the alarm is checked every 1 minute +# and examines the last 10 minutes of data +# +# it is possible to have expected packet drops on an interface for some network configurations +# look at the Monitoring Network Interfaces section in the proc.plugin documentation for more information + +template: inbound_packets_dropped + on: net.drops + os: linux + hosts: * +families: * + lookup: sum -10m unaligned absolute of inbound + units: packets + every: 1m + warn: $this >= 5 + delay: down 1h multiplier 1.5 max 2h + info: interface inbound dropped packets in the last 10 minutes + to: sysadmin + +template: outbound_packets_dropped + on: net.drops + os: linux + hosts: * +families: * + lookup: sum -10m unaligned absolute of outbound + units: packets + every: 1m + warn: $this >= 5 + delay: down 1h multiplier 1.5 max 2h + info: interface outbound dropped packets in the last 10 minutes + to: sysadmin + +template: inbound_packets_dropped_ratio + on: net.packets + os: linux + hosts: * +families: * + lookup: sum -10m unaligned absolute of received + calc: (($inbound_packets_dropped != nan AND $this > 0) ? ($inbound_packets_dropped * 100 / $this) : (0)) + units: % + every: 1m + warn: $this >= 0.1 + crit: $this >= 2 + delay: down 1h multiplier 1.5 max 2h + info: the ratio of inbound dropped packets vs the total number of received packets of the network interface, during the last 10 minutes + to: sysadmin + +template: outbound_packets_dropped_ratio + on: net.packets + os: linux + hosts: * +families: * + lookup: sum -10m unaligned absolute of sent + calc: (($outbound_packets_dropped != nan AND $this > 0) ? 
($outbound_packets_dropped * 100 / $this) : (0)) + units: % + every: 1m + warn: $this >= 0.1 + crit: $this >= 2 + delay: down 1h multiplier 1.5 max 2h + info: the ratio of outbound dropped packets vs the total number of sent packets of the network interface, during the last 10 minutes + to: sysadmin + +# ----------------------------------------------------------------------------- +# interface errors + +template: interface_inbound_errors + on: net.errors + os: freebsd + hosts: * +families: * + lookup: sum -10m unaligned absolute of inbound + units: errors + every: 1m + warn: $this >= 5 + delay: down 1h multiplier 1.5 max 2h + info: interface inbound errors in the last 10 minutes + to: sysadmin + +template: interface_outbound_errors + on: net.errors + os: freebsd + hosts: * +families: * + lookup: sum -10m unaligned absolute of outbound + units: errors + every: 1m + warn: $this >= 5 + delay: down 1h multiplier 1.5 max 2h + info: interface outbound errors in the last 10 minutes + to: sysadmin + +# ----------------------------------------------------------------------------- +# FIFO errors + +# check if an interface is having FIFO +# buffer errors +# the alarm is checked every 1 minute +# and examines the last 10 minutes of data + +template: 10min_fifo_errors + on: net.fifo + os: linux + hosts: * +families: * + lookup: sum -10m unaligned absolute + units: errors + every: 1m + warn: $this > 0 + delay: down 1h multiplier 1.5 max 2h + info: interface fifo errors in the last 10 minutes + to: sysadmin + +# ----------------------------------------------------------------------------- +# check for packet storms + +# 1. calculate the rate packets are received in 1m: 1m_received_packets_rate +# 2. do the same for the last 10s +# 3. 
raise an alarm if the latter is 10x or 20x the first +# we assume the minimum packet storm should at least have +# 10000 packets/s, average of the last 10 seconds + +template: 1m_received_packets_rate + on: net.packets + os: linux freebsd + hosts: * +families: * + lookup: average -1m unaligned of received + units: packets + every: 10s + info: the average number of packets received during the last minute + +template: 10s_received_packets_storm + on: net.packets + os: linux freebsd + hosts: * +families: * + lookup: average -10s unaligned of received + calc: $this * 100 / (($1m_received_packets_rate < 1000)?(1000):($1m_received_packets_rate)) + every: 10s + units: % + warn: $this > (($status >= $WARNING)?(200):(5000)) + crit: $this > (($status >= $WARNING)?(5000):(6000)) + options: no-clear-notification + info: the % of the rate of received packets in the last 10 seconds, compared to the rate of the last minute (clear notification for this alarm will not be sent) + to: sysadmin diff --git a/health/health.d/netfilter.conf b/health/health.d/netfilter.conf new file mode 100644 index 0000000..1d07752 --- /dev/null +++ b/health/health.d/netfilter.conf @@ -0,0 +1,29 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + + alarm: netfilter_last_collected_secs + on: netfilter.conntrack_sockets + os: linux + hosts: * + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: sysadmin + + alarm: netfilter_conntrack_full + on: netfilter.conntrack_sockets + os: linux + hosts: * + lookup: max -10s unaligned of connections + calc: $this * 100 / $netfilter_conntrack_max + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? 
(70) : (80)) + crit: $this > (($status == $CRITICAL) ? (80) : (90)) + delay: down 5m multiplier 1.5 max 1h + info: the number of connections tracked by the netfilter connection tracker, as a percentage of the connection tracker table size + to: sysadmin diff --git a/health/health.d/nginx.conf b/health/health.d/nginx.conf new file mode 100644 index 0000000..a686c3d --- /dev/null +++ b/health/health.d/nginx.conf @@ -0,0 +1,14 @@ + +# make sure nginx is running + +template: nginx_last_collected_secs + on: nginx.requests + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: webmaster + diff --git a/health/health.d/nginx_plus.conf b/health/health.d/nginx_plus.conf new file mode 100644 index 0000000..5a171a7 --- /dev/null +++ b/health/health.d/nginx_plus.conf @@ -0,0 +1,14 @@ + +# make sure nginx_plus is running + +template: nginx_plus_last_collected_secs + on: nginx_plus.requests_total + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: webmaster + diff --git a/health/health.d/phpfpm.conf b/health/health.d/phpfpm.conf new file mode 100644 index 0000000..ec7ae74 --- /dev/null +++ b/health/health.d/phpfpm.conf @@ -0,0 +1,14 @@ + +# make sure phpfpm is running + +template: phpfpm_last_collected_secs + on: phpfpm.requests + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? 
($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: webmaster + diff --git a/health/health.d/pihole.conf b/health/health.d/pihole.conf new file mode 100644 index 0000000..b255d35 --- /dev/null +++ b/health/health.d/pihole.conf @@ -0,0 +1,65 @@ + +# Make sure Pi-hole is responding. + +template: pihole_last_collected_secs + on: pihole.dns_queries_total + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: webmaster + +# Blocked DNS queries. + +template: pihole_blocked_queries + on: pihole.dns_queries_percentage + every: 10s + units: % + calc: $blocked + warn: $this > ( ($status >= $WARNING ) ? ( 45 ) : ( 55 ) ) + crit: $this > ( ($status >= $CRITICAL) ? ( 55 ) : ( 75 ) ) + delay: up 2m down 5m + info: percentage of blocked dns queries for the last 24 hour + to: sysadmin + + +# Blocklist last update time. +# Default update interval is a week. + +template: pihole_blocklist_last_update + on: pihole.blocklist_last_update + every: 10s + units: seconds + calc: $ago + warn: $this > 60 * 60 * 24 * 8 + crit: $this > 60 * 60 * 24 * 8 * 2 + info: blocklist last update time + to: sysadmin + +# Gravity file check (gravity.list). + +template: pihole_blocklist_gravity_file + on: pihole.blocklist_last_update + every: 10s + units: boolean + calc: $file_exists + crit: $this != 1 + delay: up 2m down 5m + info: gravity file existence + to: sysadmin + +# Pi-hole's ability to block unwanted domains. +# Should be enabled. The whole point of Pi-hole! 
+ +template: pihole_status + on: pihole.unwanted_domains_blocking_status + every: 10s + units: boolean + calc: $enabled + warn: $this != 1 + delay: up 2m down 5m + info: unwanted domains blocking status + to: sysadmin diff --git a/health/health.d/portcheck.conf b/health/health.d/portcheck.conf new file mode 100644 index 0000000..696333f --- /dev/null +++ b/health/health.d/portcheck.conf @@ -0,0 +1,46 @@ +template: portcheck_last_collected_secs +families: * + on: portcheck.status + calc: $now - $last_collected_t + every: 10s + units: seconds ago + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: sysadmin + +# This is a fast-reacting no-notification alarm ideal for custom dashboards or badges +template: service_reachable +families: * + on: portcheck.status + lookup: average -1m unaligned percentage of success + calc: ($this < 75) ? 
(0) : ($this) + every: 5s + units: up/down + info: at least 75% successful connections during last 60 seconds, ideal for badges + to: silent + +template: connection_timeouts +families: * + on: portcheck.status + lookup: average -5m unaligned percentage of timeout + every: 10s + units: % + warn: $this >= 10 AND $this < 40 + crit: $this >= 40 + delay: down 5m multiplier 1.5 max 1h + info: average of timeouts during the last 5 minutes + to: sysadmin + +template: connection_fails +families: * + on: portcheck.status + lookup: average -5m unaligned percentage of no_connection,failed + every: 10s + units: % + warn: $this >= 10 AND $this < 40 + crit: $this >= 40 + delay: down 5m multiplier 1.5 max 1h + info: average of failed connections during the last 5 minutes + to: sysadmin diff --git a/health/health.d/postgres.conf b/health/health.d/postgres.conf new file mode 100644 index 0000000..4e0583b --- /dev/null +++ b/health/health.d/postgres.conf @@ -0,0 +1,13 @@ + +# make sure postgres is running + +template: postgres_last_collected_secs + on: postgres.db_stat_transactions + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: dba diff --git a/health/health.d/processes.conf b/health/health.d/processes.conf new file mode 100644 index 0000000..293f1aa --- /dev/null +++ b/health/health.d/processes.conf @@ -0,0 +1,13 @@ +# you can disable an alarm notification by setting the 'to' line to: silent + + alarm: active_processes + on: system.active_processes + hosts: * + calc: $active * 100 / $pidmax + units: % + every: 5s + warn: $this > (($status >= $WARNING) ? (75) : (80)) + crit: $this > (($status == $CRITICAL) ? 
(85) : (90)) + delay: down 5m multiplier 1.5 max 1h + info: the percentage of active processes + to: sysadmin diff --git a/health/health.d/pulsar.conf b/health/health.d/pulsar.conf new file mode 100644 index 0000000..0147894 --- /dev/null +++ b/health/health.d/pulsar.conf @@ -0,0 +1,13 @@ + +# Availability + +template: pulsar_last_collected_secs + on: pulsar.broker_components + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: sysadmin diff --git a/health/health.d/qos.conf b/health/health.d/qos.conf new file mode 100644 index 0000000..7290d15 --- /dev/null +++ b/health/health.d/qos.conf @@ -0,0 +1,18 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + +# check if a QoS class is dropping packets +# the alarm is checked every 10 seconds +# and examines the last minute of data + +#template: 10min_qos_packet_drops +# on: tc.qos_dropped +# os: linux +# hosts: * +# lookup: sum -10m unaligned absolute +# every: 30s +# warn: $this > 0 +# delay: up 0 down 30m multiplier 1.5 max 1h +# units: packets +# info: dropped packets in the last 30 minutes +# to: sysadmin diff --git a/health/health.d/ram.conf b/health/health.d/ram.conf new file mode 100644 index 0000000..0a71dac --- /dev/null +++ b/health/health.d/ram.conf @@ -0,0 +1,64 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + + alarm: used_ram_to_ignore + on: system.ram + os: linux freebsd + hosts: * + calc: ($zfs.arc_size.arcsz = nan)?(0):($zfs.arc_size.arcsz - $zfs.arc_size.min) + every: 10s + info: the amount of memory that is reported as used, but it is actually capable for resizing itself based on the system needs (eg. 
ZFS ARC) + + alarm: ram_in_use + on: system.ram + os: linux + hosts: * +# calc: $used * 100 / ($used + $cached + $free) + calc: ($used - $used_ram_to_ignore) * 100 / ($used + $cached + $free) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (80) : (90)) + crit: $this > (($status == $CRITICAL) ? (90) : (98)) + delay: down 15m multiplier 1.5 max 1h + info: system RAM used + to: sysadmin + + alarm: ram_available + on: mem.available + os: linux + hosts: * + calc: ($avail + $system.ram.used_ram_to_ignore) * 100 / ($system.ram.used + $system.ram.cached + $system.ram.free + $system.ram.buffers) + units: % + every: 10s + warn: $this < (($status >= $WARNING) ? (15) : (10)) + crit: $this < (($status == $CRITICAL) ? (10) : ( 5)) + delay: down 15m multiplier 1.5 max 1h + info: estimated amount of RAM available for userspace processes, without causing swapping + to: sysadmin + +## FreeBSD + alarm: ram_in_use + on: system.ram + os: freebsd + hosts: * + calc: ($active + $wired + $laundry + $buffers - $used_ram_to_ignore) * 100 / ($active + $wired + $laundry + $buffers - $used_ram_to_ignore + $cache + $free + $inactive) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (80) : (90)) + crit: $this > (($status == $CRITICAL) ? (90) : (98)) + delay: down 15m multiplier 1.5 max 1h + info: system RAM usage + to: sysadmin + + alarm: ram_available + on: system.ram + os: freebsd + hosts: * + calc: ($free + $inactive + $used_ram_to_ignore) * 100 / ($free + $active + $inactive + $wired + $cache + $laundry + $buffers) + units: % + every: 10s + warn: $this < (($status >= $WARNING) ? (15) : (10)) + crit: $this < (($status == $CRITICAL) ? 
(10) : ( 5)) + delay: down 15m multiplier 1.5 max 1h + info: estimated amount of RAM available for userspace processes, without causing swapping + to: sysadmin diff --git a/health/health.d/redis.conf b/health/health.d/redis.conf new file mode 100644 index 0000000..c08a884 --- /dev/null +++ b/health/health.d/redis.conf @@ -0,0 +1,34 @@ + +# make sure redis is running + +template: redis_last_collected_secs + on: redis.operations + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: dba + +template: redis_bgsave_broken +families: * + on: redis.bgsave_health + every: 10s + crit: $rdb_last_bgsave_status != 0 + units: ok/failed + info: states if redis bgsave is working + delay: down 5m multiplier 1.5 max 1h + to: dba + +template: redis_bgsave_slow +families: * + on: redis.bgsave_now + every: 10s + warn: $rdb_bgsave_in_progress > 600 + crit: $rdb_bgsave_in_progress > 1200 + units: seconds + info: the time redis needs to save its database + delay: down 5m multiplier 1.5 max 1h + to: dba diff --git a/health/health.d/retroshare.conf b/health/health.d/retroshare.conf new file mode 100644 index 0000000..2344b60 --- /dev/null +++ b/health/health.d/retroshare.conf @@ -0,0 +1,25 @@ +# make sure RetroShare is running + +template: retroshare_last_collected_secs + on: retroshare.peers + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? 
($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: sysadmin + +# make sure the DHT is fine when active + +template: retroshare_dht_working + on: retroshare.dht + calc: $dht_size_all + units: peers + every: 1m + warn: $this < (($status >= $WARNING) ? (120) : (100)) + crit: $this < (($status == $CRITICAL) ? (10) : (1)) + delay: up 0 down 15m multiplier 1.5 max 1h + info: Checks if the DHT has enough peers to operate + to: sysadmin diff --git a/health/health.d/riakkv.conf b/health/health.d/riakkv.conf new file mode 100644 index 0000000..7453027 --- /dev/null +++ b/health/health.d/riakkv.conf @@ -0,0 +1,80 @@ +# Ensure that Riak is running. template: riak_last_collected_secs +template: riak_last_collected_secs + on: riak.kv.throughput + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: dba + +# Warn if a list keys operation is running. 
+template: riak_list_keys_active + on: riak.core.fsm_active + calc: $list_fsm_active + units: state machines + every: 10s + warn: $list_fsm_active > 0 + info: number of currently running list keys finite state machines + to: dba + + +## Timing healthchecks +# KV GET +template: 1h_kv_get_mean_latency + on: riak.kv.latency.get + calc: $node_get_fsm_time_mean + lookup: average -1h unaligned of time + every: 30s + units: ms + info: mean average KV GET latency over the last hour + +template: riak_kv_get_slow + on: riak.kv.latency.get + calc: $mean + lookup: average -3m unaligned of time + units: ms + every: 10s + warn: ($this > ($1h_kv_get_mean_latency * 2) ) + crit: ($this > ($1h_kv_get_mean_latency * 3) ) + info: average KV GET time over the last 3 minutes, compared to the average over the last hour + delay: down 5m multiplier 1.5 max 1h + to: dba + +# KV PUT +template: 1h_kv_put_mean_latency + on: riak.kv.latency.put + calc: $node_put_fsm_time_mean + lookup: average -1h unaligned of time + every: 30s + units: ms + info: mean average KV PUT latency over the last hour + +template: riak_kv_put_slow + on: riak.kv.latency.put + calc: $mean + lookup: average -3m unaligned of time + units: ms + every: 10s + warn: ($this > ($1h_kv_put_mean_latency * 2) ) + crit: ($this > ($1h_kv_put_mean_latency * 3) ) + info: average KV PUT time over the last 3 minutes, compared to the average over the last hour + delay: down 5m multiplier 1.5 max 1h + to: dba + + +## VM healthchecks + +# Default Erlang VM process limit: 262144 +# On systems observed, this is < 2000, but may grow depending on load. 
+template: riak_vm_high_process_count + on: riak.vm + calc: $sys_process_count + units: processes + every: 10s + warn: $this > 10000 + crit: $this > 100000 + info: number of processes running in the Erlang VM (the default limit on ERTS 10.2.4 is 262144) + to: dba diff --git a/health/health.d/scaleio.conf b/health/health.d/scaleio.conf new file mode 100644 index 0000000..1a3088a --- /dev/null +++ b/health/health.d/scaleio.conf @@ -0,0 +1,38 @@ + +# make sure scaleio is running + +template: scaleio_last_collected_secs + on: scaleio.system_capacity_total + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: sysadmin + +# make sure Storage Pool capacity utilization is under limit + +template: scaleio_storage_pool_capacity_utilization + on: scaleio.storage_pool_capacity_utilization + calc: $used + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (80) : (90)) + crit: $this > (($status == $CRITICAL) ? 
(90) : (98)) + delay: down 15m multiplier 1.5 max 1h + info: Storage Pool capacity utilization + to: sysadmin + + +# make sure Sdc is connected to MDM + +template: scaleio_sdc_mdm_connection_state + on: scaleio.sdc_mdm_connection_state + calc: $connected + every: 10s + warn: $this != 1 + delay: up 30s down 5m multiplier 1.5 max 1h + info: Sdc connection to MDM state + to: sysadmin diff --git a/health/health.d/softnet.conf b/health/health.d/softnet.conf new file mode 100644 index 0000000..f835f2a --- /dev/null +++ b/health/health.d/softnet.conf @@ -0,0 +1,40 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + +# check for common /proc/net/softnet_stat errors + + alarm: 1min_netdev_backlog_exceeded + on: system.softnet_stat + os: linux + hosts: * + lookup: average -1m unaligned absolute of dropped + units: packets + every: 10s + warn: $this > (($status >= $WARNING) ? (0) : (10)) + delay: down 1h multiplier 1.5 max 2h + info: average number of packets dropped in the last 1min, because sysctl net.core.netdev_max_backlog was exceeded (this can be a cause for dropped packets) + to: sysadmin + + alarm: 1min_netdev_budget_ran_outs + on: system.softnet_stat + os: linux + hosts: * + lookup: average -1m unaligned absolute of squeezed + units: events + every: 10s + warn: $this > (($status >= $WARNING) ? (0) : (10)) + delay: down 1h multiplier 1.5 max 2h + info: average number of times, during the last 1min, ksoftirq ran out of sysctl net.core.netdev_budget or net.core.netdev_budget_usecs, with work remaining (this can be a cause for dropped packets) + to: silent + + alarm: 10min_netisr_backlog_exceeded + on: system.softnet_stat + os: freebsd + hosts: * + lookup: average -1m unaligned absolute of qdrops + units: packets + every: 10s + warn: $this > (($status >= $WARNING) ? 
(0) : (10)) + delay: down 1h multiplier 1.5 max 2h + info: average number of drops in the last 1min, because sysctl net.route.netisr_maxqlen was exceeded (this can be a cause for dropped packets) + to: sysadmin diff --git a/health/health.d/squid.conf b/health/health.d/squid.conf new file mode 100644 index 0000000..06cc967 --- /dev/null +++ b/health/health.d/squid.conf @@ -0,0 +1,14 @@ + +# make sure squid is running + +template: squid_last_collected_secs + on: squid.clients_requests + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: proxyadmin + diff --git a/health/health.d/stiebeleltron.conf b/health/health.d/stiebeleltron.conf new file mode 100644 index 0000000..e0361eb --- /dev/null +++ b/health/health.d/stiebeleltron.conf @@ -0,0 +1,11 @@ +template: stiebeleltron_last_collected_secs +families: * + on: stiebeleltron.heating.hc1 + calc: $now - $last_collected_t + every: 10s + units: seconds ago + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: sitemgr diff --git a/health/health.d/swap.conf b/health/health.d/swap.conf new file mode 100644 index 0000000..f920b08 --- /dev/null +++ b/health/health.d/swap.conf @@ -0,0 +1,43 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + + alarm: 30min_ram_swapped_out + on: system.swapio + os: linux freebsd + hosts: * + lookup: sum -30m unaligned absolute of out + # we have to convert KB to MB by dividing $this (i.e. 
the result of the lookup) with 1024 + calc: $this / 1024 * 100 / ( $system.ram.used + $system.ram.cached + $system.ram.free ) + units: % of RAM + every: 1m + warn: $this > (($status >= $WARNING) ? (10) : (20)) + crit: $this > (($status == $CRITICAL) ? (20) : (30)) + delay: up 0 down 15m multiplier 1.5 max 1h + info: the amount of memory swapped in the last 30 minutes, as a percentage of the system RAM + to: sysadmin + + alarm: ram_in_swap + on: system.swap + os: linux + hosts: * + calc: $used * 100 / ( $system.ram.used + $system.ram.cached + $system.ram.free ) + units: % of RAM + every: 10s + warn: $this > (($status >= $WARNING) ? (15) : (20)) + crit: $this > (($status == $CRITICAL) ? (40) : (50)) + delay: up 30s down 15m multiplier 1.5 max 1h + info: the swap memory used, as a percentage of the system RAM + to: sysadmin + + alarm: used_swap + on: system.swap + os: linux freebsd + hosts: * + calc: $used * 100 / ( $used + $free ) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (80) : (90)) + crit: $this > (($status == $CRITICAL) ? (90) : (98)) + delay: up 30s down 15m multiplier 1.5 max 1h + info: the percentage of swap memory used + to: sysadmin diff --git a/health/health.d/tcp_conn.conf b/health/health.d/tcp_conn.conf new file mode 100644 index 0000000..7aa9a98 --- /dev/null +++ b/health/health.d/tcp_conn.conf @@ -0,0 +1,19 @@ + +# +# ${tcp_max_connections} may be nan or -1 if the system +# supports dynamic threshold for TCP connections. +# In this case, the alarm will always be zero. +# + + alarm: tcp_connections + on: ipv4.tcpsock + os: linux + hosts: * + calc: (${tcp_max_connections} > 0) ? ( ${connections} * 100 / ${tcp_max_connections} ) : 0 + units: % + every: 10s + warn: $this > (($status >= $WARNING ) ? ( 60 ) : ( 80 )) + crit: $this > (($status >= $CRITICAL) ? 
( 80 ) : ( 90 )) + delay: up 0 down 5m multiplier 1.5 max 1h + info: the percentage of IPv4 TCP connections over the max allowed + to: sysadmin diff --git a/health/health.d/tcp_listen.conf b/health/health.d/tcp_listen.conf new file mode 100644 index 0000000..3b30725 --- /dev/null +++ b/health/health.d/tcp_listen.conf @@ -0,0 +1,83 @@ +# +# There are two queues involved when incoming TCP connections are handled +# (both at the kernel): +# +# SYN queue +# The SYN queue tracks TCP handshakes until connections are fully established. +# It overflows when too many incoming TCP connection requests hang in the +# half-open state and the server is not configured to fall back to SYN cookies. +# Overflows are usually caused by SYN flood DoS attacks (i.e. someone sends +# lots of SYN packets and never completes the handshakes). +# +# Accept queue +# The accept queue holds fully established TCP connections waiting to be handled +# by the listening application. It overflows when the server application fails +# to accept new connections at the rate they are coming in. +# +# +# ----------------------------------------------------------------------------- +# tcp accept queue (at the kernel) + + alarm: 1m_tcp_accept_queue_overflows + on: ip.tcp_accept_queue + os: linux + hosts: * + lookup: average -60s unaligned absolute of ListenOverflows + units: overflows + every: 10s + warn: $this > 1 + crit: $this > (($status == $CRITICAL) ? (1) : (5)) + delay: up 0 down 5m multiplier 1.5 max 1h + info: the average number of times the TCP accept queue of the kernel overflowed, during the last minute + to: sysadmin + +# THIS IS TOO GENERIC +# CHECK: https://github.com/netdata/netdata/issues/3234#issuecomment-423935842 + alarm: 1m_tcp_accept_queue_drops + on: ip.tcp_accept_queue + os: linux + hosts: * + lookup: average -60s unaligned absolute of ListenDrops + units: drops + every: 10s + warn: $this > 1 + crit: $this > (($status == $CRITICAL) ? 
(1) : (5)) + delay: up 0 down 5m multiplier 1.5 max 1h + info: the average number of times the TCP accept queue of the kernel dropped packets, during the last minute (includes bogus packets received) + to: sysadmin + + +# ----------------------------------------------------------------------------- +# tcp SYN queue (at the kernel) + +# When the SYN queue is full, either TcpExtTCPReqQFullDoCookies or +# TcpExtTCPReqQFullDrop is incremented, depending on whether SYN cookies are +# enabled or not. In both cases this probably indicates a SYN flood attack, +# so i guess a notification should be sent. + + alarm: 1m_tcp_syn_queue_drops + on: ip.tcp_syn_queue + os: linux + hosts: * + lookup: average -60s unaligned absolute of TCPReqQFullDrop + units: drops + every: 10s + warn: $this > 1 + crit: $this > (($status == $CRITICAL) ? (0) : (5)) + delay: up 10 down 5m multiplier 1.5 max 1h + info: the number of times the TCP SYN queue of the kernel was full and dropped packets, during the last minute + to: sysadmin + + alarm: 1m_tcp_syn_queue_cookies + on: ip.tcp_syn_queue + os: linux + hosts: * + lookup: average -60s unaligned absolute of TCPReqQFullDoCookies + units: cookies + every: 10s + warn: $this > 1 + crit: $this > (($status == $CRITICAL) ? (0) : (5)) + delay: up 10 down 5m multiplier 1.5 max 1h + info: the number of times the TCP SYN queue of the kernel was full and sent SYN cookies, during the last minute + to: sysadmin + diff --git a/health/health.d/tcp_mem.conf b/health/health.d/tcp_mem.conf new file mode 100644 index 0000000..6927d57 --- /dev/null +++ b/health/health.d/tcp_mem.conf @@ -0,0 +1,20 @@ +# +# check +# http://blog.tsunanet.net/2011/03/out-of-socket-memory.html +# +# We give a warning when TCP is under memory pressure +# and a critical when TCP is 90% of its upper memory limit +# + + alarm: tcp_memory + on: ipv4.sockstat_tcp_mem + os: linux + hosts: * + calc: ${mem} * 100 / ${tcp_mem_high} + units: % + every: 10s + warn: ${mem} > (($status >= $WARNING ) ? 
( ${tcp_mem_pressure} * 0.8 ) : ( ${tcp_mem_pressure} )) + crit: ${mem} > (($status >= $CRITICAL ) ? ( ${tcp_mem_pressure} ) : ( ${tcp_mem_high} * 0.9 )) + delay: up 0 down 5m multiplier 1.5 max 1h + info: the amount of TCP memory as a percentage of its max memory limit + to: sysadmin diff --git a/health/health.d/tcp_orphans.conf b/health/health.d/tcp_orphans.conf new file mode 100644 index 0000000..280d659 --- /dev/null +++ b/health/health.d/tcp_orphans.conf @@ -0,0 +1,21 @@ + +# +# check +# http://blog.tsunanet.net/2011/03/out-of-socket-memory.html +# +# The kernel may penalize orphans by 2x or even 4x +# so we alarm warning at 25% and critical at 50% +# + + alarm: tcp_orphans + on: ipv4.sockstat_tcp_sockets + os: linux + hosts: * + calc: ${orphan} * 100 / ${tcp_max_orphans} + units: % + every: 10s + warn: $this > (($status >= $WARNING ) ? ( 20 ) : ( 25 )) + crit: $this > (($status >= $CRITICAL) ? ( 25 ) : ( 50 )) + delay: up 0 down 5m multiplier 1.5 max 1h + info: the percentage of orphan IPv4 TCP sockets over the max allowed (this may lead to too-many-orphans errors) + to: sysadmin diff --git a/health/health.d/tcp_resets.conf b/health/health.d/tcp_resets.conf new file mode 100644 index 0000000..36a550a --- /dev/null +++ b/health/health.d/tcp_resets.conf @@ -0,0 +1,67 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + +# ----------------------------------------------------------------------------- + + alarm: ipv4_tcphandshake_last_collected_secs + on: ipv4.tcphandshake + os: linux freebsd + hosts: * + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? 
($update_every) : (60 * $update_every)) + delay: up 0 down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: sysadmin + +# ----------------------------------------------------------------------------- +# tcp resets this host sends + + alarm: 1m_ipv4_tcp_resets_sent + on: ipv4.tcphandshake + os: linux + hosts: * + lookup: average -1m at -10s unaligned absolute of OutRsts + units: tcp resets/s + every: 10s + info: average TCP RESETS this host is sending, over the last minute + + alarm: 10s_ipv4_tcp_resets_sent + on: ipv4.tcphandshake + os: linux + hosts: * + lookup: average -10s unaligned absolute of OutRsts + units: tcp resets/s + every: 10s + warn: $this > ((($1m_ipv4_tcp_resets_sent < 5)?(5):($1m_ipv4_tcp_resets_sent)) * (($status >= $WARNING) ? (1) : (20))) + delay: up 20s down 60m multiplier 1.2 max 2h + options: no-clear-notification + info: average TCP RESETS this host is sending, over the last 10 seconds (this can be an indication that a port scan is made, or that a service running on this host has crashed; clear notification for this alarm will not be sent) + to: sysadmin + +# ----------------------------------------------------------------------------- +# tcp resets this host receives + + alarm: 1m_ipv4_tcp_resets_received + on: ipv4.tcphandshake + os: linux freebsd + hosts: * + lookup: average -1m at -10s unaligned absolute of AttemptFails + units: tcp resets/s + every: 10s + info: average TCP RESETS this host is receiving, over the last minute + + alarm: 10s_ipv4_tcp_resets_received + on: ipv4.tcphandshake + os: linux freebsd + hosts: * + lookup: average -10s unaligned absolute of AttemptFails + units: tcp resets/s + every: 10s + warn: $this > ((($1m_ipv4_tcp_resets_received < 5)?(5):($1m_ipv4_tcp_resets_received)) * (($status >= $WARNING) ? 
(1) : (10))) + delay: up 20s down 60m multiplier 1.2 max 2h + options: no-clear-notification + info: average TCP RESETS this host is receiving, over the last 10 seconds (this can be an indication that a service this host needs, has crashed; clear notification for this alarm will not be sent) + to: sysadmin diff --git a/health/health.d/udp_errors.conf b/health/health.d/udp_errors.conf new file mode 100644 index 0000000..1e47b5c --- /dev/null +++ b/health/health.d/udp_errors.conf @@ -0,0 +1,49 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + +# ----------------------------------------------------------------------------- + + alarm: ipv4_udperrors_last_collected_secs + on: ipv4.udperrors + os: linux freebsd + hosts: * + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: up 0 down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: sysadmin + +# ----------------------------------------------------------------------------- +# UDP receive buffer errors + + alarm: 1m_ipv4_udp_receive_buffer_errors + on: ipv4.udperrors + os: linux freebsd + hosts: * + lookup: average -1m unaligned absolute of RcvbufErrors + units: errors + every: 10s + warn: $this > 1 + crit: $this > (($status == $CRITICAL) ? (0) : (10)) + info: average number of UDP receive buffer errors during the last minute + delay: up 0 down 60m multiplier 1.2 max 2h + to: sysadmin + +# ----------------------------------------------------------------------------- +# UDP send buffer errors + + alarm: 1m_ipv4_udp_send_buffer_errors + on: ipv4.udperrors + os: linux + hosts: * + lookup: average -1m unaligned absolute of SndbufErrors + units: errors + every: 10s + warn: $this > 1 + crit: $this > (($status == $CRITICAL) ? 
(0) : (10)) + info: number of UDP send buffer errors during the last minute + delay: up 0 down 60m multiplier 1.2 max 2h + to: sysadmin diff --git a/health/health.d/unbound.conf b/health/health.d/unbound.conf new file mode 100644 index 0000000..bdedc11 --- /dev/null +++ b/health/health.d/unbound.conf @@ -0,0 +1,35 @@ + +# make sure unbound is running + +template: unbound_last_collected_secs + on: unbound.queries + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: sysadmin + +# make sure there is no overwritten/dropped queries in the request-list + +template: unbound_request_list_overwritten + on: unbound.request_list_jostle_list + lookup: average -60s unaligned absolute match-names of overwritten + units: queries + every: 10s + warn: $this > 5 + delay: up 10 down 5m multiplier 1.5 max 1h + info: the number of overwritten queries in the request-list + to: sysadmin + +template: unbound_request_list_dropped + on: unbound.request_list_jostle_list + lookup: average -60s unaligned absolute match-names of dropped + units: queries + every: 10s + warn: $this > 0 + delay: up 10 down 5m multiplier 1.5 max 1h + info: the number of dropped queries in the request-list + to: sysadmin diff --git a/health/health.d/varnish.conf b/health/health.d/varnish.conf new file mode 100644 index 0000000..cca7446 --- /dev/null +++ b/health/health.d/varnish.conf @@ -0,0 +1,9 @@ + alarm: varnish_last_collected + on: varnish.uptime + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? 
($update_every) : (60 * $update_every)) + info: number of seconds since the last successful data collection + to: sysadmin diff --git a/health/health.d/vcsa.conf b/health/health.d/vcsa.conf new file mode 100644 index 0000000..7bb98a9 --- /dev/null +++ b/health/health.d/vcsa.conf @@ -0,0 +1,122 @@ + +# make sure vcsa is running and responding + +template: vcsa_last_collected_secs + on: vcsa.system_health + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: sysadmin + +# Overall system health: +# - 0: all components are healthy. +# - 1: one or more components might become overloaded soon. +# - 2: one or more components in the appliance might be degraded. +# - 3: one or more components might be in an unusable status and the appliance might become unresponsive soon. +# - 4: no health data is available. + +template: vcsa_system_health + on: vcsa.system_health + lookup: max -10s unaligned of system + units: status + every: 10s + warn: ($this == 1) || ($this == 2) + crit: $this == 3 + delay: down 1m multiplier 1.5 max 1h + info: overall system health status + to: sysadmin + +# Components health: +# - 0: healthy. +# - 1: healthy, but may have some problems. +# - 2: degraded, and may have serious problems. +# - 3: unavailable, or will stop functioning soon. +# - 4: no health data is available. 
+ +template: vcsa_swap_health + on: vcsa.components_health + lookup: max -10s unaligned of swap + units: status + every: 10s + warn: $this == 1 + crit: ($this == 2) || ($this == 3) + delay: down 1m multiplier 1.5 max 1h + info: swap health status + to: sysadmin + +template: vcsa_storage_health + on: vcsa.components_health + lookup: max -10s unaligned of storage + units: status + every: 10s + warn: $this == 1 + crit: ($this == 2) || ($this == 3) + delay: down 1m multiplier 1.5 max 1h + info: storage health status + to: sysadmin + +template: vcsa_mem_health + on: vcsa.components_health + lookup: max -10s unaligned of mem + units: status + every: 10s + warn: $this == 1 + crit: ($this == 2) || ($this == 3) + delay: down 1m multiplier 1.5 max 1h + info: mem health status + to: sysadmin + +template: vcsa_load_health + on: vcsa.components_health + lookup: max -10s unaligned of load + units: status + every: 10s + warn: $this == 1 + crit: ($this == 2) || ($this == 3) + delay: down 1m multiplier 1.5 max 1h + info: load health status + to: sysadmin + +template: vcsa_database_storage_health + on: vcsa.components_health + lookup: max -10s unaligned of database_storage + units: status + every: 10s + warn: $this == 1 + crit: ($this == 2) || ($this == 3) + delay: down 1m multiplier 1.5 max 1h + info: database storage health status + to: sysadmin + +template: vcsa_applmgmt_health + on: vcsa.components_health + lookup: max -10s unaligned of applmgmt + units: status + every: 10s + warn: $this == 1 + crit: ($this == 2) || ($this == 3) + delay: down 1m multiplier 1.5 max 1h + info: appl mgmt health status + to: sysadmin + + +# Software updates health: +# - 0: no updates available. +# - 2: non-security updates are available. +# - 3: security updates are available. +# - 4: an error retrieving information on software updates. 
+ +template: vcsa_software_updates_health + on: vcsa.software_updates_health + lookup: max -10s unaligned of software_packages + units: status + every: 10s + warn: $this == 4 + crit: $this == 3 + delay: down 1m multiplier 1.5 max 1h + info: software packages health status + to: sysadmin diff --git a/health/health.d/vernemq.conf b/health/health.d/vernemq.conf new file mode 100644 index 0000000..36bbaf8 --- /dev/null +++ b/health/health.d/vernemq.conf @@ -0,0 +1,399 @@ + +# Availability + +template: vernemq_last_collected_secs + on: vernemq.node_uptime + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: sysadmin + +# Socket errors + +template: vernemq_socket_errors + on: vernemq.socket_errors + lookup: sum -1m unaligned absolute of socket_error + units: errors + every: 10s + warn: $this > (($status == $WARNING) ? (0) : (5)) + delay: down 5m multiplier 1.5 max 2h + info: socket errors in the last minute + to: sysadmin + +# Queues dropped/expired/unhandled PUBLISH messages + +template: vernemq_queue_message_drop + on: vernemq.queue_undelivered_messages + lookup: sum -1m unaligned absolute of queue_message_drop + units: dropped messages + every: 10s + warn: $this > (($status == $WARNING) ? (0) : (5)) + delay: down 5m multiplier 1.5 max 2h + info: dropped messages due to full queues in the last minute + to: sysadmin + +template: vernemq_queue_message_expired + on: vernemq.queue_undelivered_messages + lookup: sum -1m unaligned absolute of queue_message_expired + units: expired messages + every: 10s + warn: $this > (($status == $WARNING) ? 
(0) : (15)) + delay: down 5m multiplier 1.5 max 2h + info: messages which expired before delivery in the last minute + to: sysadmin + +template: vernemq_queue_message_unhandled + on: vernemq.queue_undelivered_messages + lookup: sum -1m unaligned absolute of queue_message_unhandled + units: unhandled messages + every: 10s + warn: $this > (($status == $WARNING) ? (0) : (5)) + delay: down 5m multiplier 1.5 max 2h + info: unhandled messages (connections with clean session=true) in the last minute + to: sysadmin + +# Erlang VM + +template: vernemq_average_scheduler_utilization + on: vernemq.average_scheduler_utilization + lookup: average -10m unaligned + units: % + every: 1m + warn: $this > (($status >= $WARNING) ? (75) : (85)) + crit: $this > (($status == $CRITICAL) ? (85) : (95)) + delay: down 15m multiplier 1.5 max 1h + info: average scheduler utilization for the last 10 minutes + to: sysadmin + +# Cluster communication and netsplits + +template: vernemq_cluster_dropped + on: vernemq.cluster_dropped + lookup: average -1m unaligned + units: KiB/s + every: 10s + warn: $this > 0 + delay: down 5m multiplier 1.5 max 1h + info: the amount of traffic dropped during communication with the cluster nodes in the last minute + to: sysadmin + +template: vernemq_netsplits + on: vernemq.netsplits + lookup: sum -1m unaligned absolute of netsplit_detected + units: netsplits + every: 10s + warn: $this > 0 + delay: down 5m multiplier 1.5 max 2h + info: detected netsplits in the last minute + to: sysadmin + +# Unsuccessful CONNACK + +template: vernemq_mqtt_connack_sent_reason_success + on: vernemq.mqtt_connack_sent_reason + lookup: sum -1m unaligned absolute match-names of success + units: packets + every: 10s + info: successful v3/v5 CONNACK sent in the last minute + to: sysadmin + +template: vernemq_mqtt_connack_sent_reason_unsuccessful + on: vernemq.mqtt_connack_sent_reason + lookup: sum -1m unaligned absolute + calc: $this - $vernemq_mqtt_connack_sent_reason_success + units: packets 
+ every: 10s + warn: $this > 0 + delay: down 5m multiplier 1.5 max 2h + info: unsuccessful v3/v5 CONNACK sent in the last minute + to: sysadmin + +# Not normal DISCONNECT + +template: vernemq_mqtt_disconnect_received_reason_normal_disconnect + on: vernemq.mqtt_disconnect_received_reason + lookup: sum -1m unaligned absolute match-names of normal_disconnect + units: packets + every: 10s + info: normal v5 DISCONNECT received in the last minute + to: sysadmin + +template: vernemq_mqtt_disconnect_sent_reason_normal_disconnect + on: vernemq.mqtt_disconnect_sent_reason + lookup: sum -1m unaligned absolute match-names of normal_disconnect + units: packets + every: 10s + info: normal v5 DISCONNECT sent in the last minute + to: sysadmin + +template: vernemq_mqtt_disconnect_received_reason_not_normal + on: vernemq.mqtt_disconnect_received_reason + lookup: sum -1m unaligned absolute + calc: $this - $vernemq_mqtt_disconnect_received_reason_normal_disconnect + units: packets + every: 10s + warn: $this > 0 + delay: down 5m multiplier 1.5 max 2h + info: not normal v5 DISCONNECT received in the last minute + to: sysadmin + +template: vernemq_mqtt_disconnect_sent_reason_not_normal + on: vernemq.mqtt_disconnect_sent_reason + lookup: sum -1m unaligned absolute + calc: $this - $vernemq_mqtt_disconnect_sent_reason_normal_disconnect + units: packets + every: 10s + warn: $this > 0 + delay: down 5m multiplier 1.5 max 2h + info: not normal v5 DISCONNECT sent in the last minute + to: sysadmin + +# SUBSCRIBE errors and unauthorized attempts + +template: vernemq_mqtt_subscribe_error + on: vernemq.mqtt_subscribe_error + lookup: sum -1m unaligned absolute + units: failed ops + every: 10s + warn: $this > 0 + delay: down 5m multiplier 1.5 max 2h + info: failed v3/v5 SUBSCRIBE operations in the last minute + to: sysadmin + +template: vernemq_mqtt_subscribe_auth_error + on: vernemq.mqtt_subscribe_auth_error + lookup: sum -1m unaligned absolute + units: attempts + every: 10s + warn: $this > 0 + 
delay: down 5m multiplier 1.5 max 2h + info: unauthorized v3/v5 SUBSCRIBE attempts in the last minute + to: sysadmin + +# UNSUBSCRIBE errors + +template: vernemq_mqtt_unsubscribe_error + on: vernemq.mqtt_unsubscribe_error + lookup: sum -1m unaligned absolute + units: failed ops + every: 10s + warn: $this > 0 + delay: down 5m multiplier 1.5 max 2h + info: failed v3/v5 UNSUBSCRIBE operations in the last minute + to: sysadmin + +# PUBLISH errors and unauthorized attempts + +template: vernemq_mqtt_publish_errors + on: vernemq.mqtt_publish_errors + lookup: sum -1m unaligned absolute + units: failed ops + every: 10s + warn: $this > 0 + delay: down 5m multiplier 1.5 max 2h + info: failed v3/v5 PUBLISH operations in the last minute + to: sysadmin + +template: vernemq_mqtt_publish_auth_errors + on: vernemq.mqtt_publish_auth_errors + lookup: sum -1m unaligned absolute + units: attempts + every: 10s + warn: $this > 0 + delay: down 5m multiplier 1.5 max 2h + info: unauthorized v3/v5 PUBLISH attempts in the last minute + to: sysadmin + +# Unsuccessful and unexpected PUBACK + +template: vernemq_mqtt_puback_received_reason_success + on: vernemq.mqtt_puback_received_reason + lookup: sum -1m unaligned absolute match-names of success + units: packets + every: 10s + info: successful v5 PUBACK received in the last minute + to: sysadmin + +template: vernemq_mqtt_puback_sent_reason_success + on: vernemq.mqtt_puback_sent_reason + lookup: sum -1m unaligned absolute match-names of success + units: packets + every: 10s + info: successful v5 PUBACK sent in the last minute + to: sysadmin + +template: vernemq_mqtt_puback_received_reason_unsuccessful + on: vernemq.mqtt_puback_received_reason + lookup: sum -1m unaligned absolute + calc: $this - $vernemq_mqtt_puback_received_reason_success + units: packets + every: 10s + warn: $this > 0 + delay: down 5m multiplier 1.5 max 2h + info: unsuccessful v5 PUBACK received in the last minute + to: sysadmin + +template: 
vernemq_mqtt_puback_sent_reason_unsuccessful + on: vernemq.mqtt_puback_sent_reason + lookup: sum -1m unaligned absolute + calc: $this - $vernemq_mqtt_puback_sent_reason_success + units: packets + every: 10s + warn: $this > 0 + delay: down 5m multiplier 1.5 max 2h + info: unsuccessful v5 PUBACK sent in the last minute + to: sysadmin + +template: vernemq_mqtt_puback_unexpected + on: vernemq.mqtt_puback_invalid_error + lookup: sum -1m unaligned absolute + units: messages + every: 10s + warn: $this > 0 + delay: down 5m multiplier 1.5 max 2h + info: unexpected v3/v5 PUBACK received in the last minute + to: sysadmin + +# Unsuccessful and unexpected PUBREC + +template: vernemq_mqtt_pubrec_received_reason_success + on: vernemq.mqtt_pubrec_received_reason + lookup: sum -1m unaligned absolute match-names of success + units: packets + every: 10s + info: successful v5 PUBREC received in the last minute + to: sysadmin + +template: vernemq_mqtt_pubrec_sent_reason_success + on: vernemq.mqtt_pubrec_sent_reason + lookup: sum -1m unaligned absolute match-names of success + units: packets + every: 10s + info: successful v5 PUBREC sent in the last minute + to: sysadmin + +template: vernemq_mqtt_pubrec_received_reason_unsuccessful + on: vernemq.mqtt_pubrec_received_reason + lookup: sum -1m unaligned absolute + calc: $this - $vernemq_mqtt_pubrec_received_reason_success + units: packets + every: 10s + warn: $this > 0 + delay: down 5m multiplier 1.5 max 2h + info: unsuccessful v5 PUBREC received in the last minute + to: sysadmin + +template: vernemq_mqtt_pubrec_sent_reason_unsuccessful + on: vernemq.mqtt_pubrec_sent_reason + lookup: sum -1m unaligned absolute + calc: $this - $vernemq_mqtt_pubrec_sent_reason_success + units: packets + every: 10s + warn: $this > 0 + delay: down 5m multiplier 1.5 max 2h + info: unsuccessful v5 PUBREC sent in the last minute + to: sysadmin + +template: vernemq_mqtt_pubrec_invalid_error + on: vernemq.mqtt_pubrec_invalid_error + lookup: sum -1m unaligned 
absolute
+   units: messages
+   every: 10s
+    warn: $this > 0
+   delay: down 5m multiplier 1.5 max 2h
+    info: unexpected v3 PUBREC received in the last minute
+      to: sysadmin
+
+# Unsuccessful PUBREL
+
+template: vernemq_mqtt_pubrel_received_reason_success
+      on: vernemq.mqtt_pubrel_received_reason
+  lookup: sum -1m unaligned absolute match-names of success
+   units: packets
+   every: 10s
+    info: successful v5 PUBREL received in the last minute
+      to: sysadmin
+
+template: vernemq_mqtt_pubrel_sent_reason_success
+      on: vernemq.mqtt_pubrel_sent_reason
+  lookup: sum -1m unaligned absolute match-names of success
+   units: packets
+   every: 10s
+    info: successful v5 PUBREL sent in the last minute
+      to: sysadmin
+
+template: vernemq_mqtt_pubrel_received_reason_unsuccessful
+      on: vernemq.mqtt_pubrel_received_reason
+  lookup: sum -1m unaligned absolute
+    calc: $this - $vernemq_mqtt_pubrel_received_reason_success
+   units: packets
+   every: 10s
+    warn: $this > 0
+   delay: down 5m multiplier 1.5 max 2h
+    info: unsuccessful v5 PUBREL received in the last minute
+      to: sysadmin
+
+template: vernemq_mqtt_pubrel_sent_reason_unsuccessful
+      on: vernemq.mqtt_pubrel_sent_reason
+  lookup: sum -1m unaligned absolute
+    calc: $this - $vernemq_mqtt_pubrel_sent_reason_success
+   units: packets
+   every: 10s
+    warn: $this > 0
+   delay: down 5m multiplier 1.5 max 2h
+    info: unsuccessful v5 PUBREL sent in the last minute
+      to: sysadmin
+
+# Unsuccessful and unexpected PUBCOMP
+
+template: vernemq_mqtt_pubcomp_received_reason_success
+      on: vernemq.mqtt_pubcomp_received_reason
+  lookup: sum -1m unaligned absolute match-names of success
+   units: packets
+   every: 10s
+    info: successful v5 PUBCOMP received in the last minute
+      to: sysadmin
+
+template: vernemq_mqtt_pubcomp_sent_reason_success
+      on: vernemq.mqtt_pubcomp_sent_reason
+  lookup: sum -1m unaligned absolute match-names of success
+   units: packets
+   every: 10s
+    info: successful v5 PUBCOMP sent in the last minute
+      to: sysadmin
+
+template: 
vernemq_mqtt_pubcomp_received_reason_unsuccessful
+      on: vernemq.mqtt_pubcomp_received_reason
+  lookup: sum -1m unaligned absolute
+    calc: $this - $vernemq_mqtt_pubcomp_received_reason_success
+   units: packets
+   every: 10s
+    warn: $this > 0
+   delay: down 5m multiplier 1.5 max 2h
+    info: unsuccessful v5 PUBCOMP received in the last minute
+      to: sysadmin
+
+template: vernemq_mqtt_pubcomp_sent_reason_unsuccessful
+      on: vernemq.mqtt_pubcomp_sent_reason
+  lookup: sum -1m unaligned absolute
+    calc: $this - $vernemq_mqtt_pubcomp_sent_reason_success
+   units: packets
+   every: 10s
+    warn: $this > 0
+   delay: down 5m multiplier 1.5 max 2h
+    info: unsuccessful v5 PUBCOMP sent in the last minute
+      to: sysadmin
+
+template: vernemq_mqtt_pubcomp_unexpected
+      on: vernemq.mqtt_pubcomp_invalid_error
+  lookup: sum -1m unaligned absolute
+   units: messages
+   every: 10s
+    warn: $this > 0
+   delay: down 5m multiplier 1.5 max 2h
+    info: unexpected v3/v5 PUBCOMP received in the last minute
+      to: sysadmin
diff --git a/health/health.d/vsphere.conf b/health/health.d/vsphere.conf
new file mode 100644
index 0000000..d8b2be1
--- /dev/null
+++ b/health/health.d/vsphere.conf
@@ -0,0 +1,157 @@
+
+# you can disable an alarm notification by setting the 'to' line to: silent
+
+# -----------------------------------------------VM Specific------------------------------------------------------------
+# Memory
+
+template: vsphere_vm_mem_usage
+      on: vsphere.vm_mem_usage_percentage
+   hosts: *
+    calc: $used
+   units: %
+   every: 20s
+    warn: $this > (($status >= $WARNING) ? (80) : (90))
+    crit: $this > (($status == $CRITICAL) ? (90) : (98))
+   delay: down 15m multiplier 1.5 max 1h
+    info: used RAM
+
+# -----------------------------------------------HOST Specific----------------------------------------------------------
+# Memory
+
+template: vsphere_host_mem_usage
+      on: vsphere.host_mem_usage_percentage
+   hosts: *
+    calc: $used
+   units: %
+   every: 20s
+    warn: $this > (($status >= $WARNING) ? 
(80) : (90))
+    crit: $this > (($status == $CRITICAL) ? (90) : (98))
+   delay: down 15m multiplier 1.5 max 1h
+    info: used RAM
+
+# Network errors
+
+template: vsphere_inbound_packets_errors
+      on: vsphere.net_errors_total
+   hosts: *
+families: *
+  lookup: sum -10m unaligned absolute match-names of rx
+   units: packets
+   every: 1m
+    warn: $this >= 5
+   delay: down 1h multiplier 1.5 max 2h
+    info: interface inbound dropped packets in the last 10 minutes
+      to: sysadmin
+
+template: vsphere_outbound_packets_errors
+      on: vsphere.net_errors_total
+   hosts: *
+families: *
+  lookup: sum -10m unaligned absolute match-names of tx
+   units: packets
+   every: 1m
+    warn: $this >= 5
+   delay: down 1h multiplier 1.5 max 2h
+    info: interface outbound dropped packets in the last 10 minutes
+      to: sysadmin
+
+# Network errors ratio
+
+template: vsphere_inbound_packets_errors_ratio
+      on: vsphere.net_packets_total
+   hosts: *
+families: *
+  lookup: sum -10m unaligned absolute match-names of rx
+    calc: (($vsphere_inbound_packets_errors != nan AND $this > 0) ? ($vsphere_inbound_packets_errors * 100 / $this) : (0))
+   units: %
+   every: 1m
+    warn: $this >= 0.1
+    crit: $this >= 2
+   delay: down 1h multiplier 1.5 max 2h
+    info: the ratio of inbound errors vs the total number of received packets of the network interface, during the last 10 minutes
+      to: sysadmin
+
+template: vsphere_outbound_packets_errors_ratio
+      on: vsphere.net_packets_total
+   hosts: *
+families: *
+  lookup: sum -10m unaligned absolute match-names of tx
+    calc: (($vsphere_outbound_packets_errors != nan AND $this > 0) ? 
($vsphere_outbound_packets_errors * 100 / $this) : (0))
+   units: %
+   every: 1m
+    warn: $this >= 0.1
+    crit: $this >= 2
+   delay: down 1h multiplier 1.5 max 2h
+    info: the ratio of outbound errors vs the total number of sent packets of the network interface, during the last 10 minutes
+      to: sysadmin
+
+# -----------------------------------------------Common-------------------------------------------------------------------
+# CPU
+
+template: vsphere_cpu_usage
+      on: vsphere.cpu_usage_total
+   hosts: *
+  lookup: average -10m unaligned match-names of used
+   units: %
+   every: 20s
+    warn: $this > (($status >= $WARNING) ? (75) : (85))
+    crit: $this > (($status == $CRITICAL) ? (85) : (95))
+   delay: down 15m multiplier 1.5 max 1h
+    info: cpu utilization for the last 10 minutes
+      to: sysadmin
+
+# Network drops
+
+template: vsphere_inbound_packets_dropped
+      on: vsphere.net_drops_total
+   hosts: *
+families: *
+  lookup: sum -10m unaligned absolute match-names of rx
+   units: packets
+   every: 1m
+    warn: $this >= 5
+   delay: down 1h multiplier 1.5 max 2h
+    info: interface inbound dropped packets in the last 10 minutes
+      to: sysadmin
+
+template: vsphere_outbound_packets_dropped
+      on: vsphere.net_drops_total
+   hosts: *
+families: *
+  lookup: sum -10m unaligned absolute match-names of tx
+   units: packets
+   every: 1m
+    warn: $this >= 5
+   delay: down 1h multiplier 1.5 max 2h
+    info: interface outbound dropped packets in the last 10 minutes
+      to: sysadmin
+
+# Network drops ratio
+
+template: vsphere_inbound_packets_dropped_ratio
+      on: vsphere.net_packets_total
+   hosts: *
+families: *
+  lookup: sum -10m unaligned absolute match-names of rx
+    calc: (($vsphere_inbound_packets_dropped != nan AND $this > 0) ? 
($vsphere_inbound_packets_dropped * 100 / $this) : (0))
+   units: %
+   every: 1m
+    warn: $this >= 0.1
+    crit: $this >= 2
+   delay: down 1h multiplier 1.5 max 2h
+    info: the ratio of inbound dropped packets vs the total number of received packets of the network interface, during the last 10 minutes
+      to: sysadmin
+
+template: vsphere_outbound_packets_dropped_ratio
+      on: vsphere.net_packets_total
+   hosts: *
+families: *
+  lookup: sum -10m unaligned absolute match-names of tx
+    calc: (($vsphere_outbound_packets_dropped != nan AND $this > 0) ? ($vsphere_outbound_packets_dropped * 100 / $this) : (0))
+   units: %
+   every: 1m
+    warn: $this >= 0.1
+    crit: $this >= 2
+   delay: down 1h multiplier 1.5 max 2h
+    info: the ratio of outbound dropped packets vs the total number of sent packets of the network interface, during the last 10 minutes
+      to: sysadmin
diff --git a/health/health.d/web_log.conf b/health/health.d/web_log.conf
new file mode 100644
index 0000000..44de38a
--- /dev/null
+++ b/health/health.d/web_log.conf
@@ -0,0 +1,385 @@
+
+# make sure we can collect web log data
+
+template: last_collected_secs
+      on: web_log.response_codes
+families: *
+    calc: $now - $last_collected_t
+   units: seconds ago
+   every: 10s
+    warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every))
+    crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every))
+   delay: down 5m multiplier 1.5 max 1h
+    info: number of seconds since the last successful data collection
+      to: webmaster
+
+
+# -----------------------------------------------------------------------------
+# high level response code alarms
+
+# the following alarms trigger only when there are enough data.
+# we assume there are enough data when:
+#
+# $1m_requests > 120
+#
+# i.e. 
when there are at least 120 requests during the last minute
+
+template: 1m_requests
+      on: web_log.response_statuses
+families: *
+  lookup: sum -1m unaligned
+    calc: ($this == 0)?(1):($this)
+   units: requests
+   every: 10s
+    info: the sum of all HTTP requests over the last minute
+
+template: 1m_successful
+      on: web_log.response_statuses
+families: *
+  lookup: sum -1m unaligned of successful_requests
+    calc: $this * 100 / $1m_requests
+   units: %
+   every: 10s
+    warn: ($1m_requests > 120) ? ($this < (($status >= $WARNING ) ? ( 95 ) : ( 85 )) ) : ( 0 )
+    crit: ($1m_requests > 120) ? ($this < (($status == $CRITICAL) ? ( 85 ) : ( 75 )) ) : ( 0 )
+   delay: up 2m down 15m multiplier 1.5 max 1h
+    info: the ratio of successful HTTP responses (1xx, 2xx, 304, 401) over the last minute
+      to: webmaster
+
+template: 1m_redirects
+      on: web_log.response_statuses
+families: *
+  lookup: sum -1m unaligned of redirects
+    calc: $this * 100 / $1m_requests
+   units: %
+   every: 10s
+    warn: ($1m_requests > 120) ? ($this > (($status >= $WARNING ) ? ( 1 ) : ( 20 )) ) : ( 0 )
+    crit: ($1m_requests > 120) ? ($this > (($status == $CRITICAL) ? ( 20 ) : ( 30 )) ) : ( 0 )
+   delay: up 2m down 15m multiplier 1.5 max 1h
+    info: the ratio of HTTP redirects (3xx except 304) over the last minute
+      to: webmaster
+
+template: 1m_bad_requests
+      on: web_log.response_statuses
+families: *
+  lookup: sum -1m unaligned of bad_requests
+    calc: $this * 100 / $1m_requests
+   units: %
+   every: 10s
+    warn: ($1m_requests > 120) ? ($this > (($status >= $WARNING) ? ( 10 ) : ( 30 )) ) : ( 0 )
+    crit: ($1m_requests > 120) ? ($this > (($status == $CRITICAL) ? 
( 30 ) : ( 50 )) ) : ( 0 )
+   delay: up 2m down 15m multiplier 1.5 max 1h
+    info: the ratio of HTTP bad requests (4xx except 401) over the last minute
+      to: webmaster
+
+template: 1m_internal_errors
+      on: web_log.response_statuses
+families: *
+  lookup: sum -1m unaligned of server_errors
+    calc: $this * 100 / $1m_requests
+   units: %
+   every: 10s
+    warn: ($1m_requests > 120) ? ($this > (($status >= $WARNING) ? ( 1 ) : ( 2 )) ) : ( 0 )
+    crit: ($1m_requests > 120) ? ($this > (($status == $CRITICAL) ? ( 2 ) : ( 5 )) ) : ( 0 )
+   delay: up 2m down 15m multiplier 1.5 max 1h
+    info: the ratio of HTTP internal server errors (5xx), over the last minute
+      to: webmaster
+
+# unmatched lines
+
+# the following alarms trigger only when there are enough data.
+# we assume there are enough data when:
+#
+# $1m_total_requests > 120
+#
+# i.e. when there are at least 120 requests during the last minute
+
+template: 1m_total_requests
+      on: web_log.response_codes
+families: *
+  lookup: sum -1m unaligned
+    calc: ($this == 0)?(1):($this)
+   units: requests
+   every: 10s
+    info: the sum of all HTTP requests over the last minute
+
+template: 1m_unmatched
+on: web_log.response_codes
+families: *
+  lookup: sum -1m unaligned of unmatched
+    calc: $this * 100 / $1m_total_requests
+   units: %
+   every: 10s
+    warn: ($1m_total_requests > 120) ? ($this > 1) : ( 0 )
+   delay: up 1m down 5m multiplier 1.5 max 1h
+    info: the ratio of unmatched lines, over the last minute
+      to: webmaster
+
+# -----------------------------------------------------------------------------
+# web slow
+
+# the following alarms trigger only when there are enough data.
+# we assume there are enough data when:
+#
+# $1m_requests > 120
+#
+# i.e. 
when there are at least 120 requests during the last minute
+
+template: 10m_response_time
+      on: web_log.response_time
+families: *
+  lookup: average -10m unaligned of avg
+   units: ms
+   every: 30s
+    info: the average time to respond to HTTP requests, over the last 10 minutes
+
+template: web_slow
+      on: web_log.response_time
+families: *
+  lookup: average -1m unaligned of avg
+   units: ms
+   every: 10s
+   green: 500
+     red: 1000
+    warn: ($1m_requests > 120) ? ($this > $green && $this > ($10m_response_time * 2) ) : ( 0 )
+    crit: ($1m_requests > 120) ? ($this > $red && $this > ($10m_response_time * 4) ) : ( 0 )
+   delay: down 15m multiplier 1.5 max 1h
+    info: the average time to respond to HTTP requests, over the last 1 minute
+ options: no-clear-notification
+      to: webmaster
+
+# -----------------------------------------------------------------------------
+# web too many or too few requests
+
+# the following alarms trigger only when there are enough data.
+# we assume there are enough data when:
+#
+# $5m_successful_old > 120
+#
+# i.e. when there were at least 120 requests during the 5 minutes starting
+# at -10m and ending at -5m
+
+template: 5m_successful_old
+      on: web_log.response_statuses
+families: *
+  lookup: average -5m at -5m unaligned of successful_requests
+   units: requests/s
+   every: 30s
+    info: average rate of successful HTTP requests over the last 5 minutes
+
+template: 5m_successful
+      on: web_log.response_statuses
+families: *
+  lookup: average -5m unaligned of successful_requests
+   units: requests/s
+   every: 30s
+    info: average successful HTTP requests over the last 5 minutes
+
+template: 5m_requests_ratio
+      on: web_log.response_codes
+families: *
+    calc: ($5m_successful_old > 0)?($5m_successful * 100 / $5m_successful_old):(100)
+   units: %
+   every: 30s
+    warn: ($5m_successful_old > 120) ? ($this > 200 OR $this < 50) : (0)
+    crit: ($5m_successful_old > 120) ? 
($this > 400 OR $this < 25) : (0)
+   delay: down 15m multiplier 1.5 max 1h
+options: no-clear-notification
+    info: the percentage of successful web requests over the last 5 minutes, \
+          compared with the previous 5 minutes \
+          (clear notification for this alarm will not be sent)
+      to: webmaster
+
+
+
+# ---------------------------------------------------GO-VERSION---------------------------------------------------------
+
+# make sure we can collect web log data
+
+template: web_log_last_collected_secs
+      on: web_log.requests
+families: *
+    calc: $now - $last_collected_t
+   units: seconds ago
+   every: 10s
+    warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every))
+    crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every))
+   delay: down 5m multiplier 1.5 max 1h
+    info: number of seconds since the last successful data collection
+      to: webmaster
+
+# unmatched lines
+
+# the following alarms trigger only when there are enough data.
+# we assume there are enough data when:
+#
+# $1m_total_requests > 120
+#
+# i.e. when there are at least 120 requests during the last minute
+
+template: web_log_1m_total_requests
+      on: web_log.requests
+families: *
+  lookup: sum -1m unaligned
+    calc: ($this == 0)?(1):($this)
+   units: requests
+   every: 10s
+    info: the sum of all HTTP requests over the last minute
+
+template: web_log_1m_unmatched
+      on: web_log.excluded_requests
+families: *
+  lookup: sum -1m unaligned of unmatched
+    calc: $this * 100 / $web_log_1m_total_requests
+   units: %
+   every: 10s
+    warn: ($web_log_1m_total_requests > 120) ? ($this > 1) : ( 0 )
+   delay: up 1m down 5m multiplier 1.5 max 1h
+    info: the ratio of unmatched lines, over the last minute
+      to: webmaster
+
+# -----------------------------------------------------------------------------
+# high level response code alarms
+
+# the following alarms trigger only when there are enough data.
+# we assume there are enough data when:
+#
+# $1m_requests > 120
+#
+# i.e. 
when there are at least 120 requests during the last minute
+
+template: web_log_1m_requests
+      on: web_log.type_requests
+families: *
+  lookup: sum -1m unaligned
+    calc: ($this == 0)?(1):($this)
+   units: requests
+   every: 10s
+    info: the sum of all HTTP requests over the last minute
+
+template: web_log_1m_successful
+      on: web_log.type_requests
+families: *
+  lookup: sum -1m unaligned of success
+    calc: $this * 100 / $web_log_1m_requests
+   units: %
+   every: 10s
+    warn: ($web_log_1m_requests > 120) ? ($this < (($status >= $WARNING ) ? ( 95 ) : ( 85 )) ) : ( 0 )
+    crit: ($web_log_1m_requests > 120) ? ($this < (($status == $CRITICAL) ? ( 85 ) : ( 75 )) ) : ( 0 )
+   delay: up 2m down 15m multiplier 1.5 max 1h
+    info: the ratio of successful HTTP responses (1xx, 2xx, 304, 401) over the last minute
+      to: webmaster
+
+template: web_log_1m_redirects
+      on: web_log.type_requests
+families: *
+  lookup: sum -1m unaligned of redirect
+    calc: $this * 100 / $web_log_1m_requests
+   units: %
+   every: 10s
+    warn: ($web_log_1m_requests > 120) ? ($this > (($status >= $WARNING ) ? ( 1 ) : ( 20 )) ) : ( 0 )
+    crit: ($web_log_1m_requests > 120) ? ($this > (($status == $CRITICAL) ? ( 20 ) : ( 30 )) ) : ( 0 )
+   delay: up 2m down 15m multiplier 1.5 max 1h
+    info: the ratio of HTTP redirects (3xx except 304) over the last minute
+      to: webmaster
+
+template: web_log_1m_bad_requests
+      on: web_log.type_requests
+families: *
+  lookup: sum -1m unaligned of bad
+    calc: $this * 100 / $web_log_1m_requests
+   units: %
+   every: 10s
+    warn: ($web_log_1m_requests > 120) ? ($this > (($status >= $WARNING) ? ( 10 ) : ( 30 )) ) : ( 0 )
+    crit: ($web_log_1m_requests > 120) ? ($this > (($status == $CRITICAL) ? 
( 30 ) : ( 50 )) ) : ( 0 )
+   delay: up 2m down 15m multiplier 1.5 max 1h
+    info: the ratio of HTTP bad requests (4xx except 401) over the last minute
+      to: webmaster
+
+template: web_log_1m_internal_errors
+      on: web_log.type_requests
+families: *
+  lookup: sum -1m unaligned of error
+    calc: $this * 100 / $web_log_1m_requests
+   units: %
+   every: 10s
+    warn: ($web_log_1m_requests > 120) ? ($this > (($status >= $WARNING) ? ( 1 ) : ( 2 )) ) : ( 0 )
+    crit: ($web_log_1m_requests > 120) ? ($this > (($status == $CRITICAL) ? ( 2 ) : ( 5 )) ) : ( 0 )
+   delay: up 2m down 15m multiplier 1.5 max 1h
+    info: the ratio of HTTP internal server errors (5xx), over the last minute
+      to: webmaster
+
+# -----------------------------------------------------------------------------
+# web slow
+
+# the following alarms trigger only when there are enough data.
+# we assume there are enough data when:
+#
+# $1m_requests > 120
+#
+# i.e. when there are at least 120 requests during the last minute
+
+template: web_log_10m_response_time
+      on: web_log.request_processing_time
+families: *
+  lookup: average -10m unaligned of avg
+   units: ms
+   every: 30s
+    info: the average time to respond to HTTP requests, over the last 10 minutes
+
+template: web_log_web_slow
+      on: web_log.request_processing_time
+families: *
+  lookup: average -1m unaligned of avg
+   units: ms
+   every: 10s
+   green: 500
+     red: 1000
+    warn: ($web_log_1m_requests > 120) ? ($this > $green && $this > ($web_log_10m_response_time * 2) ) : ( 0 )
+    crit: ($web_log_1m_requests > 120) ? ($this > $red && $this > ($web_log_10m_response_time * 4) ) : ( 0 )
+   delay: down 15m multiplier 1.5 max 1h
+    info: the average time to respond to HTTP requests, over the last 1 minute
+ options: no-clear-notification
+      to: webmaster
+
+# -----------------------------------------------------------------------------
+# web too many or too few requests
+
+# the following alarms trigger only when there are enough data.
+# we assume there are enough data when:
+#
+# $5m_successful_old > 120
+#
+# i.e. when there were at least 120 requests during the 5 minutes starting
+# at -10m and ending at -5m
+
+template: web_log_5m_successful_old
+      on: web_log.type_requests
+families: *
+  lookup: average -5m at -5m unaligned of success
+   units: requests/s
+   every: 30s
+    info: average rate of successful HTTP requests over the last 5 minutes
+
+template: web_log_5m_successful
+      on: web_log.type_requests
+families: *
+  lookup: average -5m unaligned of success
+   units: requests/s
+   every: 30s
+    info: average successful HTTP requests over the last 5 minutes
+
+template: web_log_5m_requests_ratio
+      on: web_log.type_requests
+families: *
+    calc: ($web_log_5m_successful_old > 0)?($web_log_5m_successful * 100 / $web_log_5m_successful_old):(100)
+   units: %
+   every: 30s
+    warn: ($web_log_5m_successful_old > 120) ? ($this > 200 OR $this < 50) : (0)
+    crit: ($web_log_5m_successful_old > 120) ? ($this > 400 OR $this < 25) : (0)
+   delay: down 15m multiplier 1.5 max 1h
+options: no-clear-notification
+    info: the percentage of successful web requests over the last 5 minutes, \
+          compared with the previous 5 minutes \
+          (clear notification for this alarm will not be sent)
+      to: webmaster
diff --git a/health/health.d/whoisquery.conf b/health/health.d/whoisquery.conf
new file mode 100644
index 0000000..275e11d
--- /dev/null
+++ b/health/health.d/whoisquery.conf
@@ -0,0 +1,24 @@
+
+# make sure whoisquery is running
+
+template: whoisquery_last_collected_secs
+      on: whoisquery.time_until_expiration
+    calc: $now - $last_collected_t
+   units: seconds ago
+   every: 60s
+    warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every))
+    crit: $this > (($status == $CRITICAL) ? 
($update_every) : (60 * $update_every))
+   delay: down 5m multiplier 1.5 max 1h
+    info: number of seconds since the last successful data collection
+      to: webmaster
+
+
+template: whoisquery_days_until_expiration
+      on: whoisquery.time_until_expiration
+    calc: $expiry
+   units: seconds
+   every: 60s
+    warn: $this < $days_until_expiration_warning*24*60*60
+    crit: $this < $days_until_expiration_critical*24*60*60
+    info: domain time until expiration
+      to: webmaster
diff --git a/health/health.d/wmi.conf b/health/health.d/wmi.conf
new file mode 100644
index 0000000..0441fc1
--- /dev/null
+++ b/health/health.d/wmi.conf
@@ -0,0 +1,130 @@
+
+# you can disable an alarm notification by setting the 'to' line to: silent
+
+## Availability
+
+template: wmi_last_collected_secs
+      on: cpu.collector_duration
+    calc: $now - $last_collected_t
+   units: seconds ago
+   every: 10s
+    warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every))
+    crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every))
+   delay: down 5m multiplier 1.5 max 1h
+    info: number of seconds since the last successful data collection
+      to: sysadmin
+
+## CPU
+
+template: wmi_10min_cpu_usage
+      on: wmi.cpu_utilization_total
+      os: linux
+   hosts: *
+  lookup: average -10m unaligned match-names of dpc,user,privileged,interrupt
+   units: %
+   every: 1m
+    warn: $this > (($status >= $WARNING) ? (75) : (85))
+    crit: $this > (($status == $CRITICAL) ? (85) : (95))
+   delay: down 15m multiplier 1.5 max 1h
+    info: cpu utilization for the last 10 minutes
+      to: sysadmin
+
+
+## Memory
+
+template: wmi_ram_in_use
+      on: wmi.memory_utilization
+      os: linux
+   hosts: *
+    calc: ($used) * 100 / ($used + $available)
+   units: %
+   every: 10s
+    warn: $this > (($status >= $WARNING) ? (80) : (90))
+    crit: $this > (($status == $CRITICAL) ? 
(90) : (98))
+   delay: down 15m multiplier 1.5 max 1h
+    info: used RAM
+      to: sysadmin
+
+template: wmi_swap_in_use
+      on: wmi.memory_swap_utilization
+      os: linux
+   hosts: *
+    calc: ($used) * 100 / ($used + $available)
+   units: %
+   every: 10s
+    warn: $this > (($status >= $WARNING) ? (80) : (90))
+    crit: $this > (($status == $CRITICAL) ? (90) : (98))
+   delay: down 15m multiplier 1.5 max 1h
+    info: used Swap
+      to: sysadmin
+
+
+## Network
+
+template: inbound_packets_discarded
+      on: wmi.net_discarded
+      os: linux
+   hosts: *
+families: *
+  lookup: sum -10m unaligned absolute match-names of inbound
+   units: packets
+   every: 1m
+    warn: $this >= 5
+   delay: down 1h multiplier 1.5 max 2h
+    info: interface inbound discarded packets in the last 10 minutes
+      to: sysadmin
+
+template: outbound_packets_discarded
+      on: wmi.net_discarded
+      os: linux
+   hosts: *
+families: *
+  lookup: sum -10m unaligned absolute match-names of outbound
+   units: packets
+   every: 1m
+    warn: $this >= 5
+   delay: down 1h multiplier 1.5 max 2h
+    info: interface outbound discarded packets in the last 10 minutes
+      to: sysadmin
+
+template: inbound_packets_errors
+      on: wmi.net_errors
+      os: linux
+   hosts: *
+families: *
+  lookup: sum -10m unaligned absolute match-names of inbound
+   units: packets
+   every: 1m
+    warn: $this >= 5
+   delay: down 1h multiplier 1.5 max 2h
+    info: interface inbound errors in the last 10 minutes
+      to: sysadmin
+
+template: outbound_packets_errors
+      on: wmi.net_errors
+      os: linux
+   hosts: *
+families: *
+  lookup: sum -10m unaligned absolute match-names of outbound
+   units: packets
+   every: 1m
+    warn: $this >= 5
+   delay: down 1h multiplier 1.5 max 2h
+    info: interface outbound errors in the last 10 minutes
+      to: sysadmin
+
+
+## Disk
+
+template: wmi_disk_in_use
+      on: wmi.logical_disk_utilization
+      os: linux
+   hosts: *
+    calc: ($used) * 100 / ($used + $free)
+   units: %
+   every: 10s
+    warn: $this > (($status >= $WARNING) ? (80) : (90))
+    crit: $this > (($status == $CRITICAL) ? 
(90) : (98))
+   delay: down 15m multiplier 1.5 max 1h
+    info: used disk space
+      to: sysadmin
diff --git a/health/health.d/x509check.conf b/health/health.d/x509check.conf
new file mode 100644
index 0000000..dfca377
--- /dev/null
+++ b/health/health.d/x509check.conf
@@ -0,0 +1,32 @@
+
+# make sure x509check is running
+
+template: x509check_last_collected_secs
+      on: x509check.time_until_expiration
+    calc: $now - $last_collected_t
+   units: seconds ago
+   every: 60s
+    warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every))
+    crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every))
+   delay: down 5m multiplier 1.5 max 1h
+    info: number of seconds since the last successful data collection
+      to: webmaster
+
+
+template: x509check_days_until_expiration
+      on: x509check.time_until_expiration
+    calc: $expiry
+   units: seconds
+   every: 60s
+    warn: $this < $days_until_expiration_warning*24*60*60
+    crit: $this < $days_until_expiration_critical*24*60*60
+    info: certificate time until expiration
+      to: webmaster
+
+template: x509check_revocation_status
+      on: x509check.revocation_status
+    calc: $revoked
+   every: 60s
+    crit: $this != nan AND $this != 0
+    info: certificate revocation status
+      to: webmaster
diff --git a/health/health.d/zfs.conf b/health/health.d/zfs.conf
new file mode 100644
index 0000000..af73824
--- /dev/null
+++ b/health/health.d/zfs.conf
@@ -0,0 +1,10 @@
+
+   alarm: zfs_memory_throttle
+      on: zfs.memory_ops
+  lookup: sum -10m unaligned absolute of throttled
+   units: events
+   every: 1m
+    warn: $this > 0
+   delay: down 1h multiplier 1.5 max 2h
+    info: the number of times ZFS had to limit the ARC growth in the last 10 minutes
+      to: sysadmin
diff --git a/health/health.d/zookeeper.conf b/health/health.d/zookeeper.conf
new file mode 100644
index 0000000..ffbe31b
--- /dev/null
+++ b/health/health.d/zookeeper.conf
@@ -0,0 +1,14 @@
+
+# make sure zookeeper is running
+
+template: zookeeper_last_collected_secs
+      on: zookeeper.requests
+ 
calc: $now - $last_collected_t
+   units: seconds ago
+   every: 10s
+    warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every))
+    crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every))
+   delay: down 5m multiplier 1.5 max 1h
+    info: number of seconds since the last successful data collection
+      to: webmaster
+
diff --git a/health/health.h b/health/health.h
new file mode 100644
index 0000000..5281e16
--- /dev/null
+++ b/health/health.h
@@ -0,0 +1,112 @@
+// SPDX-License-Identifier: GPL-3.0-or-later
+
+#ifndef NETDATA_HEALTH_H
+#define NETDATA_HEALTH_H 1
+
+#include "../daemon/common.h"
+
+#define NETDATA_PLUGIN_HOOK_HEALTH \
+    { \
+        .name = "HEALTH", \
+        .config_section = NULL, \
+        .config_name = NULL, \
+        .enabled = 1, \
+        .thread = NULL, \
+        .init_routine = NULL, \
+        .start_routine = health_main \
+    },
+
+extern unsigned int default_health_enabled;
+
+#define HEALTH_ENTRY_FLAG_PROCESSED 0x00000001
+#define HEALTH_ENTRY_FLAG_UPDATED 0x00000002
+#define HEALTH_ENTRY_FLAG_EXEC_RUN 0x00000004
+#define HEALTH_ENTRY_FLAG_EXEC_FAILED 0x00000008
+#define HEALTH_ENTRY_FLAG_SILENCED 0x00000010
+#define HEALTH_ENTRY_RUN_ONCE 0x00000020
+#define HEALTH_ENTRY_FLAG_EXEC_IN_PROGRESS 0x00000040
+
+#define HEALTH_ENTRY_FLAG_SAVED 0x10000000
+#define HEALTH_ENTRY_FLAG_NO_CLEAR_NOTIFICATION 0x80000000
+
+#ifndef HEALTH_LISTEN_PORT
+#define HEALTH_LISTEN_PORT 19998
+#endif
+
+#ifndef HEALTH_LISTEN_BACKLOG
+#define HEALTH_LISTEN_BACKLOG 4096
+#endif
+
+#define HEALTH_ON_KEY "on"
+#define HEALTH_EVERY_KEY "every"
+#define HEALTH_GREEN_KEY "green"
+#define HEALTH_RED_KEY "red"
+#define HEALTH_WARN_KEY "warn"
+#define HEALTH_CRIT_KEY "crit"
+#define HEALTH_EXEC_KEY "exec"
+#define HEALTH_RECIPIENT_KEY "to"
+#define HEALTH_UNITS_KEY "units"
+#define HEALTH_INFO_KEY "info"
+#define HEALTH_DELAY_KEY "delay"
+#define HEALTH_OPTIONS_KEY "options"
+#define HEALTH_FOREACH_KEY "foreach"
+
+#define HEALTH_SILENCERS_MAX_FILE_LEN 10000
+
+extern char 
*silencers_filename;
+
+extern void health_init(void);
+extern void *health_main(void *ptr);
+
+extern void health_reload(void);
+
+extern int health_variable_lookup(const char *variable, uint32_t hash, RRDCALC *rc, calculated_number *result);
+extern void health_aggregate_alarms(RRDHOST *host, BUFFER *wb, BUFFER* context, RRDCALC_STATUS status);
+extern void health_alarms2json(RRDHOST *host, BUFFER *wb, int all);
+extern void health_alarms_values2json(RRDHOST *host, BUFFER *wb, int all);
+extern void health_alarm_log2json(RRDHOST *host, BUFFER *wb, uint32_t after);
+
+void health_api_v1_chart_variables2json(RRDSET *st, BUFFER *buf);
+void health_api_v1_chart_custom_variables2json(RRDSET *st, BUFFER *buf);
+
+extern int health_alarm_log_open(RRDHOST *host);
+extern void health_alarm_log_save(RRDHOST *host, ALARM_ENTRY *ae);
+extern void health_alarm_log_load(RRDHOST *host);
+
+extern ALARM_ENTRY* health_create_alarm_entry(
+        RRDHOST *host,
+        uint32_t alarm_id,
+        uint32_t alarm_event_id,
+        time_t when,
+        const char *name,
+        const char *chart,
+        const char *family,
+        const char *exec,
+        const char *recipient,
+        time_t duration,
+        calculated_number old_value,
+        calculated_number new_value,
+        RRDCALC_STATUS old_status,
+        RRDCALC_STATUS new_status,
+        const char *source,
+        const char *units,
+        const char *info,
+        int delay,
+        uint32_t flags);
+
+extern void health_alarm_log(RRDHOST *host, ALARM_ENTRY *ae);
+
+extern void health_readdir(RRDHOST *host, const char *user_path, const char *stock_path, const char *subpath);
+extern char *health_user_config_dir(void);
+extern char *health_stock_config_dir(void);
+extern void health_alarm_log_free(RRDHOST *host);
+
+extern void health_alarm_log_free_one_nochecks_nounlink(ALARM_ENTRY *ae);
+
+extern void *health_cmdapi_thread(void *ptr);
+
+extern void health_label_log_save(RRDHOST *host);
+
+extern SIMPLE_PATTERN *health_pattern_from_foreach(char *s);
+
+#endif //NETDATA_HEALTH_H
diff --git a/health/health_config.c 
b/health/health_config.c
new file mode 100644
index 0000000..1acf369
--- /dev/null
+++ b/health/health_config.c
@@ -0,0 +1,1035 @@
+// SPDX-License-Identifier: GPL-3.0-or-later
+
+#include "health.h"
+
+#define HEALTH_CONF_MAX_LINE 4096
+
+#define HEALTH_ALARM_KEY "alarm"
+#define HEALTH_TEMPLATE_KEY "template"
+#define HEALTH_ON_KEY "on"
+#define HEALTH_HOST_KEY "hosts"
+#define HEALTH_OS_KEY "os"
+#define HEALTH_FAMILIES_KEY "families"
+#define HEALTH_PLUGIN_KEY "plugin"
+#define HEALTH_MODULE_KEY "module"
+#define HEALTH_LOOKUP_KEY "lookup"
+#define HEALTH_CALC_KEY "calc"
+#define HEALTH_EVERY_KEY "every"
+#define HEALTH_GREEN_KEY "green"
+#define HEALTH_RED_KEY "red"
+#define HEALTH_WARN_KEY "warn"
+#define HEALTH_CRIT_KEY "crit"
+#define HEALTH_EXEC_KEY "exec"
+#define HEALTH_RECIPIENT_KEY "to"
+#define HEALTH_UNITS_KEY "units"
+#define HEALTH_INFO_KEY "info"
+#define HEALTH_DELAY_KEY "delay"
+#define HEALTH_OPTIONS_KEY "options"
+#define HEALTH_REPEAT_KEY "repeat"
+#define HEALTH_HOST_LABEL_KEY "host labels"
+
+static inline int rrdcalc_add_alarm_from_config(RRDHOST *host, RRDCALC *rc) {
+    if(!rc->chart) {
+        error("Health configuration for alarm '%s' does not have a chart", rc->name);
+        return 0;
+    }
+
+    if(!rc->update_every) {
+        error("Health configuration for alarm '%s.%s' has no frequency (parameter 'every'). 
Ignoring it.", rc->chart?rc->chart:"NOCHART", rc->name); + return 0; + } + + if(!RRDCALC_HAS_DB_LOOKUP(rc) && !rc->calculation && !rc->warning && !rc->critical) { + error("Health configuration for alarm '%s.%s' is useless (no db lookup, no calculation, no warning and no critical expressions)", rc->chart?rc->chart:"NOCHART", rc->name); + return 0; + } + + if (rrdcalc_exists(host, rc->chart, rc->name, rc->hash_chart, rc->hash)) + return 0; + + rc->id = rrdcalc_get_unique_id(host, rc->chart, rc->name, &rc->next_event_id); + + debug(D_HEALTH, "Health configuration adding alarm '%s.%s' (%u): exec '%s', recipient '%s', green " CALCULATED_NUMBER_FORMAT_AUTO ", red " CALCULATED_NUMBER_FORMAT_AUTO ", lookup: group %d, after %d, before %d, options %u, dimensions '%s', for each dimension '%s', update every %d, calculation '%s', warning '%s', critical '%s', source '%s', delay up %d, delay down %d, delay max %d, delay_multiplier %f, warn_repeat_every %u, crit_repeat_every %u", + rc->chart?rc->chart:"NOCHART", + rc->name, + rc->id, + (rc->exec)?rc->exec:"DEFAULT", + (rc->recipient)?rc->recipient:"DEFAULT", + rc->green, + rc->red, + (int)rc->group, + rc->after, + rc->before, + rc->options, + (rc->dimensions)?rc->dimensions:"NONE", + (rc->foreachdim)?rc->foreachdim:"NONE", + rc->update_every, + (rc->calculation)?rc->calculation->parsed_as:"NONE", + (rc->warning)?rc->warning->parsed_as:"NONE", + (rc->critical)?rc->critical->parsed_as:"NONE", + rc->source, + rc->delay_up_duration, + rc->delay_down_duration, + rc->delay_max_duration, + rc->delay_multiplier, + rc->warn_repeat_every, + rc->crit_repeat_every + ); + + rrdcalc_add_to_host(host, rc); + + return 1; +} + +static inline int rrdcalctemplate_add_template_from_config(RRDHOST *host, RRDCALCTEMPLATE *rt) { + if(unlikely(!rt->context)) { + error("Health configuration for template '%s' does not have a context", rt->name); + return 0; + } + + if(unlikely(!rt->update_every)) { + error("Health configuration for template '%s' has no 
frequency (parameter 'every'). Ignoring it.", rt->name); + return 0; + } + + if(unlikely(!RRDCALCTEMPLATE_HAS_DB_LOOKUP(rt) && !rt->calculation && !rt->warning && !rt->critical)) { + error("Health configuration for template '%s' is useless (no calculation, no warning and no critical evaluation)", rt->name); + return 0; + } + + RRDCALCTEMPLATE *t, *last = NULL; + if(!rt->foreachdim) { + for (t = host->templates; t ; last = t, t = t->next) { + if(unlikely(t->hash_name == rt->hash_name + && !strcmp(t->name, rt->name) + && !strcmp(t->family_match?t->family_match:"*", rt->family_match?rt->family_match:"*") + )) { + error("Health configuration template '%s' already exists for host '%s'.", rt->name, host->hostname); + return 0; + } + } + + if(likely(last)) { + last->next = rt; + } + else { + rt->next = host->templates; + host->templates = rt; + } + } else { + for (t = host->alarms_template_with_foreach; t ; last = t, t = t->next) { + if(unlikely(t->hash_name == rt->hash_name + && !strcmp(t->name, rt->name) + && !strcmp(t->family_match?t->family_match:"*", rt->family_match?rt->family_match:"*") + )) { + error("Health configuration template '%s' already exists for host '%s'.", rt->name, host->hostname); + return 0; + } + } + + if(likely(last)) { + last->next = rt; + } + else { + rt->next = host->alarms_template_with_foreach; + host->alarms_template_with_foreach = rt; + } + } + + debug(D_HEALTH, "Health configuration adding template '%s': context '%s', exec '%s', recipient '%s', green " CALCULATED_NUMBER_FORMAT_AUTO ", red " CALCULATED_NUMBER_FORMAT_AUTO ", lookup: group %d, after %d, before %d, options %u, dimensions '%s', for each dimension '%s', update every %d, calculation '%s', warning '%s', critical '%s', source '%s', delay up %d, delay down %d, delay max %d, delay_multiplier %f, warn_repeat_every %u, crit_repeat_every %u", + rt->name, + (rt->context)?rt->context:"NONE", + (rt->exec)?rt->exec:"DEFAULT", + (rt->recipient)?rt->recipient:"DEFAULT", + rt->green, + rt->red, 
+ (int)rt->group, + rt->after, + rt->before, + rt->options, + (rt->dimensions)?rt->dimensions:"NONE", + (rt->foreachdim)?rt->foreachdim:"NONE", + rt->update_every, + (rt->calculation)?rt->calculation->parsed_as:"NONE", + (rt->warning)?rt->warning->parsed_as:"NONE", + (rt->critical)?rt->critical->parsed_as:"NONE", + rt->source, + rt->delay_up_duration, + rt->delay_down_duration, + rt->delay_max_duration, + rt->delay_multiplier, + rt->warn_repeat_every, + rt->crit_repeat_every + ); + + + return 1; +} + +static inline int health_parse_delay( + size_t line, const char *filename, char *string, + int *delay_up_duration, + int *delay_down_duration, + int *delay_max_duration, + float *delay_multiplier) { + + char given_up = 0; + char given_down = 0; + char given_max = 0; + char given_multiplier = 0; + + char *s = string; + while(*s) { + char *key = s; + + while(*s && !isspace(*s)) s++; + while(*s && isspace(*s)) *s++ = '\0'; + + if(!*key) break; + + char *value = s; + while(*s && !isspace(*s)) s++; + while(*s && isspace(*s)) *s++ = '\0'; + + if(!strcasecmp(key, "up")) { + if (!config_parse_duration(value, delay_up_duration)) { + error("Health configuration at line %zu of file '%s': invalid value '%s' for '%s' keyword", + line, filename, value, key); + } + else given_up = 1; + } + else if(!strcasecmp(key, "down")) { + if (!config_parse_duration(value, delay_down_duration)) { + error("Health configuration at line %zu of file '%s': invalid value '%s' for '%s' keyword", + line, filename, value, key); + } + else given_down = 1; + } + else if(!strcasecmp(key, "multiplier")) { + *delay_multiplier = strtof(value, NULL); + if(isnan(*delay_multiplier) || isinf(*delay_multiplier) || islessequal(*delay_multiplier, 0)) { + error("Health configuration at line %zu of file '%s': invalid value '%s' for '%s' keyword", + line, filename, value, key); + } + else given_multiplier = 1; + } + else if(!strcasecmp(key, "max")) { + if (!config_parse_duration(value, delay_max_duration)) { + 
error("Health configuration at line %zu of file '%s': invalid value '%s' for '%s' keyword", + line, filename, value, key); + } + else given_max = 1; + } + else { + error("Health configuration at line %zu of file '%s': unknown keyword '%s'", + line, filename, key); + } + } + + if(!given_up) + *delay_up_duration = 0; + + if(!given_down) + *delay_down_duration = 0; + + if(!given_multiplier) + *delay_multiplier = 1.0; + + if(!given_max) { + if((*delay_max_duration) < (*delay_up_duration) * (*delay_multiplier)) + *delay_max_duration = (int)((*delay_up_duration) * (*delay_multiplier)); + + if((*delay_max_duration) < (*delay_down_duration) * (*delay_multiplier)) + *delay_max_duration = (int)((*delay_down_duration) * (*delay_multiplier)); + } + + return 1; +} + +static inline uint32_t health_parse_options(const char *s) { + uint32_t options = 0; + char buf[100+1] = ""; + + while(*s) { + buf[0] = '\0'; + + // skip spaces + while(*s && isspace(*s)) + s++; + + // find the next space + size_t count = 0; + while(*s && count < 100 && !isspace(*s)) + buf[count++] = *s++; + + if(buf[0]) { + buf[count] = '\0'; + + if(!strcasecmp(buf, "no-clear-notification") || !strcasecmp(buf, "no-clear")) + options |= RRDCALC_FLAG_NO_CLEAR_NOTIFICATION; + else + error("Ignoring unknown alarm option '%s'", buf); + } + } + + return options; +} + +static inline int health_parse_repeat( + size_t line, + const char *file, + char *string, + uint32_t *warn_repeat_every, + uint32_t *crit_repeat_every +) { + + char *s = string; + while(*s) { + char *key = s; + + while(*s && !isspace(*s)) s++; + while(*s && isspace(*s)) *s++ = '\0'; + + if(!*key) break; + + char *value = s; + while(*s && !isspace(*s)) s++; + while(*s && isspace(*s)) *s++ = '\0'; + + if(!strcasecmp(key, "off")) { + *warn_repeat_every = 0; + *crit_repeat_every = 0; + return 1; + } + if(!strcasecmp(key, "warning")) { + if (!config_parse_duration(value, (int*)warn_repeat_every)) { + error("Health configuration at line %zu of file '%s': invalid 
value '%s' for '%s' keyword", + line, file, value, key); + } + } + else if(!strcasecmp(key, "critical")) { + if (!config_parse_duration(value, (int*)crit_repeat_every)) { + error("Health configuration at line %zu of file '%s': invalid value '%s' for '%s' keyword", + line, file, value, key); + } + } + } + + return 1; +} + +/** + * Health pattern from Foreach + * + * Create a new simple pattern using the user input + * + * @param s the string that will be used to create the simple pattern. + */ +SIMPLE_PATTERN *health_pattern_from_foreach(char *s) { + char *convert= strdupz(s); + SIMPLE_PATTERN *val = NULL; + if(convert) { + dimension_remove_pipe_comma(convert); + val = simple_pattern_create(convert, NULL, SIMPLE_PATTERN_EXACT); + + freez(convert); + } + + return val; +} + +static inline int health_parse_db_lookup( + size_t line, const char *filename, char *string, + RRDR_GROUPING *group_method, int *after, int *before, int *every, + uint32_t *options, char **dimensions, char **foreachdim +) { + debug(D_HEALTH, "Health configuration parsing database lookup %zu@%s: %s", line, filename, string); + + if(*dimensions) freez(*dimensions); + if(*foreachdim) freez(*foreachdim); + *dimensions = NULL; + *foreachdim = NULL; + *after = 0; + *before = 0; + *every = 0; + *options = 0; + + char *s = string, *key; + + // first is the group method + key = s; + while(*s && !isspace(*s)) s++; + while(*s && isspace(*s)) *s++ = '\0'; + if(!*s) { + error("Health configuration invalid chart calculation at line %zu of file '%s': expected group method followed by the 'after' time, but got '%s'", + line, filename, key); + return 0; + } + + if((*group_method = web_client_api_request_v1_data_group(key, RRDR_GROUPING_UNDEFINED)) == RRDR_GROUPING_UNDEFINED) { + error("Health configuration at line %zu of file '%s': invalid group method '%s'", + line, filename, key); + return 0; + } + + // then is the 'after' time + key = s; + while(*s && !isspace(*s)) s++; + while(*s && isspace(*s)) *s++ = '\0'; + 
+ if(!config_parse_duration(key, after)) { + error("Health configuration at line %zu of file '%s': invalid duration '%s' after group method", + line, filename, key); + return 0; + } + + // sane defaults + *every = abs(*after); + + // now we may have optional parameters + while(*s) { + key = s; + while(*s && !isspace(*s)) s++; + while(*s && isspace(*s)) *s++ = '\0'; + if(!*key) break; + + if(!strcasecmp(key, "at")) { + char *value = s; + while(*s && !isspace(*s)) s++; + while(*s && isspace(*s)) *s++ = '\0'; + + if (!config_parse_duration(value, before)) { + error("Health configuration at line %zu of file '%s': invalid duration '%s' for '%s' keyword", + line, filename, value, key); + } + } + else if(!strcasecmp(key, HEALTH_EVERY_KEY)) { + char *value = s; + while(*s && !isspace(*s)) s++; + while(*s && isspace(*s)) *s++ = '\0'; + + if (!config_parse_duration(value, every)) { + error("Health configuration at line %zu of file '%s': invalid duration '%s' for '%s' keyword", + line, filename, value, key); + } + } + else if(!strcasecmp(key, "absolute") || !strcasecmp(key, "abs") || !strcasecmp(key, "absolute_sum")) { + *options |= RRDR_OPTION_ABSOLUTE; + } + else if(!strcasecmp(key, "min2max")) { + *options |= RRDR_OPTION_MIN2MAX; + } + else if(!strcasecmp(key, "null2zero")) { + *options |= RRDR_OPTION_NULL2ZERO; + } + else if(!strcasecmp(key, "percentage")) { + *options |= RRDR_OPTION_PERCENTAGE; + } + else if(!strcasecmp(key, "unaligned")) { + *options |= RRDR_OPTION_NOT_ALIGNED; + } + else if(!strcasecmp(key, "match-ids") || !strcasecmp(key, "match_ids")) { + *options |= RRDR_OPTION_MATCH_IDS; + } + else if(!strcasecmp(key, "match-names") || !strcasecmp(key, "match_names")) { + *options |= RRDR_OPTION_MATCH_NAMES; + } + else if(!strcasecmp(key, "of")) { + char *find = NULL; + if(*s && strcasecmp(s, "all") != 0) { + find = strcasestr(s, " foreach"); + if(find) { + *find = '\0'; + } + *dimensions = strdupz(s); + } + + if(!find) { + break; + } + s = ++find; + } + else 
if(!strcasecmp(key, HEALTH_FOREACH_KEY )) { + *foreachdim = strdupz(s); + break; + } + else { + error("Health configuration at line %zu of file '%s': unknown keyword '%s'", + line, filename, key); + } + } + + return 1; +} + +static inline char *health_source_file(size_t line, const char *file) { + char buffer[FILENAME_MAX + 1]; + snprintfz(buffer, FILENAME_MAX, "%zu@%s", line, file); + return strdupz(buffer); +} + +static inline void strip_quotes(char *s) { + while(*s) { + if(*s == '\'' || *s == '"') *s = ' '; + s++; + } +} + +static int health_readfile(const char *filename, void *data) { + RRDHOST *host = (RRDHOST *)data; + + debug(D_HEALTH, "Health configuration reading file '%s'", filename); + + static uint32_t + hash_alarm = 0, + hash_template = 0, + hash_os = 0, + hash_on = 0, + hash_host = 0, + hash_families = 0, + hash_plugin = 0, + hash_module = 0, + hash_calc = 0, + hash_green = 0, + hash_red = 0, + hash_warn = 0, + hash_crit = 0, + hash_exec = 0, + hash_every = 0, + hash_lookup = 0, + hash_units = 0, + hash_info = 0, + hash_recipient = 0, + hash_delay = 0, + hash_options = 0, + hash_repeat = 0, + hash_host_label = 0; + + char buffer[HEALTH_CONF_MAX_LINE + 1]; + + if(unlikely(!hash_alarm)) { + hash_alarm = simple_uhash(HEALTH_ALARM_KEY); + hash_template = simple_uhash(HEALTH_TEMPLATE_KEY); + hash_on = simple_uhash(HEALTH_ON_KEY); + hash_os = simple_uhash(HEALTH_OS_KEY); + hash_host = simple_uhash(HEALTH_HOST_KEY); + hash_families = simple_uhash(HEALTH_FAMILIES_KEY); + hash_plugin = simple_uhash(HEALTH_PLUGIN_KEY); + hash_module = simple_uhash(HEALTH_MODULE_KEY); + hash_calc = simple_uhash(HEALTH_CALC_KEY); + hash_lookup = simple_uhash(HEALTH_LOOKUP_KEY); + hash_green = simple_uhash(HEALTH_GREEN_KEY); + hash_red = simple_uhash(HEALTH_RED_KEY); + hash_warn = simple_uhash(HEALTH_WARN_KEY); + hash_crit = simple_uhash(HEALTH_CRIT_KEY); + hash_exec = simple_uhash(HEALTH_EXEC_KEY); + hash_every = simple_uhash(HEALTH_EVERY_KEY); + hash_units = 
simple_hash(HEALTH_UNITS_KEY); + hash_info = simple_hash(HEALTH_INFO_KEY); + hash_recipient = simple_hash(HEALTH_RECIPIENT_KEY); + hash_delay = simple_uhash(HEALTH_DELAY_KEY); + hash_options = simple_uhash(HEALTH_OPTIONS_KEY); + hash_repeat = simple_uhash(HEALTH_REPEAT_KEY); + hash_host_label = simple_uhash(HEALTH_HOST_LABEL_KEY); + } + + FILE *fp = fopen(filename, "r"); + if(!fp) { + error("Health configuration cannot read file '%s'.", filename); + return 0; + } + + RRDCALC *rc = NULL; + RRDCALCTEMPLATE *rt = NULL; + + int ignore_this = 0; + size_t line = 0, append = 0; + char *s; + while((s = fgets(&buffer[append], (int)(HEALTH_CONF_MAX_LINE - append), fp)) || append) { + int stop_appending = !s; + line++; + s = trim(buffer); + if(!s || *s == '#') continue; + + append = strlen(s); + if(!stop_appending && s[append - 1] == '\\') { + s[append - 1] = ' '; + append = &s[append] - buffer; + if(append < HEALTH_CONF_MAX_LINE) + continue; + else { + error("Health configuration has too long multi-line at line %zu of file '%s'.", line, filename); + } + } + append = 0; + + char *key = s; + while(*s && *s != ':') s++; + if(!*s) { + error("Health configuration has invalid line %zu of file '%s'. It does not contain a ':'. Ignoring it.", line, filename); + continue; + } + *s = '\0'; + s++; + + char *value = s; + key = trim_all(key); + value = trim_all(value); + + if(!key) { + error("Health configuration has invalid line %zu of file '%s'. Keyword is empty. Ignoring it.", line, filename); + continue; + } + + if(!value) { + error("Health configuration has invalid line %zu of file '%s'. value is empty. 
Ignoring it.", line, filename); + continue; + } + + uint32_t hash = simple_uhash(key); + + if(hash == hash_alarm && !strcasecmp(key, HEALTH_ALARM_KEY)) { + if(rc) { + if(ignore_this || !rrdcalc_add_alarm_from_config(host, rc)) { + rrdcalc_free(rc); + } + // health_add_alarms_loop(host, rc, ignore_this) ; + } + + if(rt) { + if (ignore_this || !rrdcalctemplate_add_template_from_config(host, rt)) + rrdcalctemplate_free(rt); + + rt = NULL; + } + + rc = callocz(1, sizeof(RRDCALC)); + rc->next_event_id = 1; + rc->name = strdupz(value); + rc->hash = simple_hash(rc->name); + rc->source = health_source_file(line, filename); + rc->green = NAN; + rc->red = NAN; + rc->value = NAN; + rc->old_value = NAN; + rc->delay_multiplier = 1.0; + rc->old_status = RRDCALC_STATUS_UNINITIALIZED; + rc->warn_repeat_every = host->health_default_warn_repeat_every; + rc->crit_repeat_every = host->health_default_crit_repeat_every; + + if(rrdvar_fix_name(rc->name)) + error("Health configuration renamed alarm '%s' to '%s'", value, rc->name); + + ignore_this = 0; + } + else if(hash == hash_template && !strcasecmp(key, HEALTH_TEMPLATE_KEY)) { + if(rc) { +// health_add_alarms_loop(host, rc, ignore_this) ; + if(ignore_this || !rrdcalc_add_alarm_from_config(host, rc)) { + rrdcalc_free(rc); + } + + rc = NULL; + } + + if(rt) { + if(ignore_this || !rrdcalctemplate_add_template_from_config(host, rt)) + rrdcalctemplate_free(rt); + } + + rt = callocz(1, sizeof(RRDCALCTEMPLATE)); + rt->name = strdupz(value); + rt->hash_name = simple_hash(rt->name); + rt->source = health_source_file(line, filename); + rt->green = NAN; + rt->red = NAN; + rt->delay_multiplier = 1.0; + rt->warn_repeat_every = host->health_default_warn_repeat_every; + rt->crit_repeat_every = host->health_default_crit_repeat_every; + + if(rrdvar_fix_name(rt->name)) + error("Health configuration renamed template '%s' to '%s'", value, rt->name); + + ignore_this = 0; + } + else if(hash == hash_os && !strcasecmp(key, HEALTH_OS_KEY)) { + char *os_match = 
value; + SIMPLE_PATTERN *os_pattern = simple_pattern_create(os_match, NULL, SIMPLE_PATTERN_EXACT); + + if(!simple_pattern_matches(os_pattern, host->os)) { + if(rc) + debug(D_HEALTH, "HEALTH on '%s' ignoring alarm '%s' defined at %zu@%s: host O/S does not match '%s'", host->hostname, rc->name, line, filename, os_match); + + if(rt) + debug(D_HEALTH, "HEALTH on '%s' ignoring template '%s' defined at %zu@%s: host O/S does not match '%s'", host->hostname, rt->name, line, filename, os_match); + + ignore_this = 1; + } + + simple_pattern_free(os_pattern); + } + else if(hash == hash_host && !strcasecmp(key, HEALTH_HOST_KEY)) { + char *host_match = value; + SIMPLE_PATTERN *host_pattern = simple_pattern_create(host_match, NULL, SIMPLE_PATTERN_EXACT); + + if(!simple_pattern_matches(host_pattern, host->hostname)) { + if(rc) + debug(D_HEALTH, "HEALTH on '%s' ignoring alarm '%s' defined at %zu@%s: hostname does not match '%s'", host->hostname, rc->name, line, filename, host_match); + + if(rt) + debug(D_HEALTH, "HEALTH on '%s' ignoring template '%s' defined at %zu@%s: hostname does not match '%s'", host->hostname, rt->name, line, filename, host_match); + + ignore_this = 1; + } + + simple_pattern_free(host_pattern); + } + else if(rc) { + if(hash == hash_on && !strcasecmp(key, HEALTH_ON_KEY)) { + if(rc->chart) { + if(strcmp(rc->chart, value) != 0) + error("Health configuration at line %zu of file '%s' for alarm '%s' has key '%s' twice, once with value '%s' and later with value '%s'. 
Using ('%s').", + line, filename, rc->name, key, rc->chart, value, value); + + freez(rc->chart); + } + rc->chart = strdupz(value); + rc->hash_chart = simple_hash(rc->chart); + } + else if(hash == hash_lookup && !strcasecmp(key, HEALTH_LOOKUP_KEY)) { + health_parse_db_lookup(line, filename, value, &rc->group, &rc->after, &rc->before, + &rc->update_every, &rc->options, &rc->dimensions, &rc->foreachdim); + if(rc->foreachdim) { + rc->spdim = health_pattern_from_foreach(rc->foreachdim); + } + } + else if(hash == hash_every && !strcasecmp(key, HEALTH_EVERY_KEY)) { + if(!config_parse_duration(value, &rc->update_every)) + error("Health configuration at line %zu of file '%s' for alarm '%s' at key '%s' cannot parse duration: '%s'.", + line, filename, rc->name, key, value); + } + else if(hash == hash_green && !strcasecmp(key, HEALTH_GREEN_KEY)) { + char *e; + rc->green = str2ld(value, &e); + if(e && *e) { + error("Health configuration at line %zu of file '%s' for alarm '%s' at key '%s' leaves this string unmatched: '%s'.", + line, filename, rc->name, key, e); + } + } + else if(hash == hash_red && !strcasecmp(key, HEALTH_RED_KEY)) { + char *e; + rc->red = str2ld(value, &e); + if(e && *e) { + error("Health configuration at line %zu of file '%s' for alarm '%s' at key '%s' leaves this string unmatched: '%s'.", + line, filename, rc->name, key, e); + } + } + else if(hash == hash_calc && !strcasecmp(key, HEALTH_CALC_KEY)) { + const char *failed_at = NULL; + int error = 0; + rc->calculation = expression_parse(value, &failed_at, &error); + if(!rc->calculation) { + error("Health configuration at line %zu of file '%s' for alarm '%s' at key '%s' has unparse-able expression '%s': %s at '%s'", + line, filename, rc->name, key, value, expression_strerror(error), failed_at); + } + } + else if(hash == hash_warn && !strcasecmp(key, HEALTH_WARN_KEY)) { + const char *failed_at = NULL; + int error = 0; + rc->warning = expression_parse(value, &failed_at, &error); + if(!rc->warning) { + 
error("Health configuration at line %zu of file '%s' for alarm '%s' at key '%s' has unparse-able expression '%s': %s at '%s'", + line, filename, rc->name, key, value, expression_strerror(error), failed_at); + } + } + else if(hash == hash_crit && !strcasecmp(key, HEALTH_CRIT_KEY)) { + const char *failed_at = NULL; + int error = 0; + rc->critical = expression_parse(value, &failed_at, &error); + if(!rc->critical) { + error("Health configuration at line %zu of file '%s' for alarm '%s' at key '%s' has unparse-able expression '%s': %s at '%s'", + line, filename, rc->name, key, value, expression_strerror(error), failed_at); + } + } + else if(hash == hash_exec && !strcasecmp(key, HEALTH_EXEC_KEY)) { + if(rc->exec) { + if(strcmp(rc->exec, value) != 0) + error("Health configuration at line %zu of file '%s' for alarm '%s' has key '%s' twice, once with value '%s' and later with value '%s'. Using ('%s').", + line, filename, rc->name, key, rc->exec, value, value); + + freez(rc->exec); + } + rc->exec = strdupz(value); + } + else if(hash == hash_recipient && !strcasecmp(key, HEALTH_RECIPIENT_KEY)) { + if(rc->recipient) { + if(strcmp(rc->recipient, value) != 0) + error("Health configuration at line %zu of file '%s' for alarm '%s' has key '%s' twice, once with value '%s' and later with value '%s'. Using ('%s').", + line, filename, rc->name, key, rc->recipient, value, value); + + freez(rc->recipient); + } + rc->recipient = strdupz(value); + } + else if(hash == hash_units && !strcasecmp(key, HEALTH_UNITS_KEY)) { + if(rc->units) { + if(strcmp(rc->units, value) != 0) + error("Health configuration at line %zu of file '%s' for alarm '%s' has key '%s' twice, once with value '%s' and later with value '%s'. 
Using ('%s').", + line, filename, rc->name, key, rc->units, value, value); + + freez(rc->units); + } + rc->units = strdupz(value); + strip_quotes(rc->units); + } + else if(hash == hash_info && !strcasecmp(key, HEALTH_INFO_KEY)) { + if(rc->info) { + if(strcmp(rc->info, value) != 0) + error("Health configuration at line %zu of file '%s' for alarm '%s' has key '%s' twice, once with value '%s' and later with value '%s'. Using ('%s').", + line, filename, rc->name, key, rc->info, value, value); + + freez(rc->info); + } + rc->info = strdupz(value); + strip_quotes(rc->info); + } + else if(hash == hash_delay && !strcasecmp(key, HEALTH_DELAY_KEY)) { + health_parse_delay(line, filename, value, &rc->delay_up_duration, &rc->delay_down_duration, &rc->delay_max_duration, &rc->delay_multiplier); + } + else if(hash == hash_options && !strcasecmp(key, HEALTH_OPTIONS_KEY)) { + rc->options |= health_parse_options(value); + } + else if(hash == hash_repeat && !strcasecmp(key, HEALTH_REPEAT_KEY)){ + health_parse_repeat(line, filename, value, + &rc->warn_repeat_every, + &rc->crit_repeat_every); + } + else if(hash == hash_host_label && !strcasecmp(key, HEALTH_HOST_LABEL_KEY)) { + if(rc->labels) { + if(strcmp(rc->labels, value) != 0) + error("Health configuration at line %zu of file '%s' for alarm '%s' has key '%s' twice, once with value '%s' and later with value '%s'.", + line, filename, rc->name, key, rc->labels, value); + + freez(rc->labels); + simple_pattern_free(rc->splabels); + } + + rc->labels = simple_pattern_trim_around_equal(value); + rc->splabels = simple_pattern_create(rc->labels, NULL, SIMPLE_PATTERN_EXACT); + } + else if(hash == hash_plugin && !strcasecmp(key, HEALTH_PLUGIN_KEY)) { + freez(rc->plugin_match); + simple_pattern_free(rc->plugin_pattern); + + rc->plugin_match = strdupz(value); + rc->plugin_pattern = simple_pattern_create(rc->plugin_match, NULL, SIMPLE_PATTERN_EXACT); + } + else if(hash == hash_module && !strcasecmp(key, HEALTH_MODULE_KEY)) { + freez(rc->module_match); + 
simple_pattern_free(rc->module_pattern); + + rc->module_match = strdupz(value); + rc->module_pattern = simple_pattern_create(rc->module_match, NULL, SIMPLE_PATTERN_EXACT); + } + else { + error("Health configuration at line %zu of file '%s' for alarm '%s' has unknown key '%s'.", + line, filename, rc->name, key); + } + } + else if(rt) { + if(hash == hash_on && !strcasecmp(key, HEALTH_ON_KEY)) { + if(rt->context) { + if(strcmp(rt->context, value) != 0) + error("Health configuration at line %zu of file '%s' for template '%s' has key '%s' twice, once with value '%s' and later with value '%s'. Using ('%s').", + line, filename, rt->name, key, rt->context, value, value); + + freez(rt->context); + } + rt->context = strdupz(value); + rt->hash_context = simple_hash(rt->context); + } + else if(hash == hash_families && !strcasecmp(key, HEALTH_FAMILIES_KEY)) { + freez(rt->family_match); + simple_pattern_free(rt->family_pattern); + + rt->family_match = strdupz(value); + rt->family_pattern = simple_pattern_create(rt->family_match, NULL, SIMPLE_PATTERN_EXACT); + } + else if(hash == hash_plugin && !strcasecmp(key, HEALTH_PLUGIN_KEY)) { + freez(rt->plugin_match); + simple_pattern_free(rt->plugin_pattern); + + rt->plugin_match = strdupz(value); + rt->plugin_pattern = simple_pattern_create(rt->plugin_match, NULL, SIMPLE_PATTERN_EXACT); + } + else if(hash == hash_module && !strcasecmp(key, HEALTH_MODULE_KEY)) { + freez(rt->module_match); + simple_pattern_free(rt->module_pattern); + + rt->module_match = strdupz(value); + rt->module_pattern = simple_pattern_create(rt->module_match, NULL, SIMPLE_PATTERN_EXACT); + } + else if(hash == hash_lookup && !strcasecmp(key, HEALTH_LOOKUP_KEY)) { + health_parse_db_lookup(line, filename, value, &rt->group, &rt->after, &rt->before, + &rt->update_every, &rt->options, &rt->dimensions, &rt->foreachdim); + if(rt->foreachdim) { + rt->spdim = health_pattern_from_foreach(rt->foreachdim); + } + } + else if(hash == hash_every && !strcasecmp(key, 
HEALTH_EVERY_KEY)) { + if(!config_parse_duration(value, &rt->update_every)) + error("Health configuration at line %zu of file '%s' for template '%s' at key '%s' cannot parse duration: '%s'.", + line, filename, rt->name, key, value); + } + else if(hash == hash_green && !strcasecmp(key, HEALTH_GREEN_KEY)) { + char *e; + rt->green = str2ld(value, &e); + if(e && *e) { + error("Health configuration at line %zu of file '%s' for template '%s' at key '%s' leaves this string unmatched: '%s'.", + line, filename, rt->name, key, e); + } + } + else if(hash == hash_red && !strcasecmp(key, HEALTH_RED_KEY)) { + char *e; + rt->red = str2ld(value, &e); + if(e && *e) { + error("Health configuration at line %zu of file '%s' for template '%s' at key '%s' leaves this string unmatched: '%s'.", + line, filename, rt->name, key, e); + } + } + else if(hash == hash_calc && !strcasecmp(key, HEALTH_CALC_KEY)) { + const char *failed_at = NULL; + int error = 0; + rt->calculation = expression_parse(value, &failed_at, &error); + if(!rt->calculation) { + error("Health configuration at line %zu of file '%s' for template '%s' at key '%s' has unparse-able expression '%s': %s at '%s'", + line, filename, rt->name, key, value, expression_strerror(error), failed_at); + } + } + else if(hash == hash_warn && !strcasecmp(key, HEALTH_WARN_KEY)) { + const char *failed_at = NULL; + int error = 0; + rt->warning = expression_parse(value, &failed_at, &error); + if(!rt->warning) { + error("Health configuration at line %zu of file '%s' for template '%s' at key '%s' has unparse-able expression '%s': %s at '%s'", + line, filename, rt->name, key, value, expression_strerror(error), failed_at); + } + } + else if(hash == hash_crit && !strcasecmp(key, HEALTH_CRIT_KEY)) { + const char *failed_at = NULL; + int error = 0; + rt->critical = expression_parse(value, &failed_at, &error); + if(!rt->critical) { + error("Health configuration at line %zu of file '%s' for template '%s' at key '%s' has unparse-able expression '%s': %s at 
'%s'", + line, filename, rt->name, key, value, expression_strerror(error), failed_at); + } + } + else if(hash == hash_exec && !strcasecmp(key, HEALTH_EXEC_KEY)) { + if(rt->exec) { + if(strcmp(rt->exec, value) != 0) + error("Health configuration at line %zu of file '%s' for template '%s' has key '%s' twice, once with value '%s' and later with value '%s'. Using ('%s').", + line, filename, rt->name, key, rt->exec, value, value); + + freez(rt->exec); + } + rt->exec = strdupz(value); + } + else if(hash == hash_recipient && !strcasecmp(key, HEALTH_RECIPIENT_KEY)) { + if(rt->recipient) { + if(strcmp(rt->recipient, value) != 0) + error("Health configuration at line %zu of file '%s' for template '%s' has key '%s' twice, once with value '%s' and later with value '%s'. Using ('%s').", + line, filename, rt->name, key, rt->recipient, value, value); + + freez(rt->recipient); + } + rt->recipient = strdupz(value); + } + else if(hash == hash_units && !strcasecmp(key, HEALTH_UNITS_KEY)) { + if(rt->units) { + if(strcmp(rt->units, value) != 0) + error("Health configuration at line %zu of file '%s' for template '%s' has key '%s' twice, once with value '%s' and later with value '%s'. Using ('%s').", + line, filename, rt->name, key, rt->units, value, value); + + freez(rt->units); + } + rt->units = strdupz(value); + strip_quotes(rt->units); + } + else if(hash == hash_info && !strcasecmp(key, HEALTH_INFO_KEY)) { + if(rt->info) { + if(strcmp(rt->info, value) != 0) + error("Health configuration at line %zu of file '%s' for template '%s' has key '%s' twice, once with value '%s' and later with value '%s'. 
Using ('%s').", + line, filename, rt->name, key, rt->info, value, value); + + freez(rt->info); + } + rt->info = strdupz(value); + strip_quotes(rt->info); + } + else if(hash == hash_delay && !strcasecmp(key, HEALTH_DELAY_KEY)) { + health_parse_delay(line, filename, value, &rt->delay_up_duration, &rt->delay_down_duration, &rt->delay_max_duration, &rt->delay_multiplier); + } + else if(hash == hash_options && !strcasecmp(key, HEALTH_OPTIONS_KEY)) { + rt->options |= health_parse_options(value); + } + else if(hash == hash_repeat && !strcasecmp(key, HEALTH_REPEAT_KEY)){ + health_parse_repeat(line, filename, value, + &rt->warn_repeat_every, + &rt->crit_repeat_every); + } + else if(hash == hash_host_label && !strcasecmp(key, HEALTH_HOST_LABEL_KEY)) { + if(rt->labels) { + if(strcmp(rt->labels, value) != 0) + error("Health configuration at line %zu of file '%s' for template '%s' has key '%s' twice, once with value '%s' and later with value '%s'. Using ('%s').", + line, filename, rt->name, key, rt->labels, value, value); + + freez(rt->labels); + simple_pattern_free(rt->splabels); + } + + rt->labels = simple_pattern_trim_around_equal(value); + rt->splabels = simple_pattern_create(rt->labels, NULL, SIMPLE_PATTERN_EXACT); + } + else { + error("Health configuration at line %zu of file '%s' for template '%s' has unknown key '%s'.", + line, filename, rt->name, key); + } + } + else { + error("Health configuration at line %zu of file '%s' has unknown key '%s'. 
Expected either '" HEALTH_ALARM_KEY "' or '" HEALTH_TEMPLATE_KEY "'.", + line, filename, key); + } + } + + if(rc) { + //health_add_alarms_loop(host, rc, ignore_this) ; + if(ignore_this || !rrdcalc_add_alarm_from_config(host, rc)) { + rrdcalc_free(rc); + } + } + + if(rt) { + if(ignore_this || !rrdcalctemplate_add_template_from_config(host, rt)) + rrdcalctemplate_free(rt); + } + + fclose(fp); + return 1; +} + +void health_readdir(RRDHOST *host, const char *user_path, const char *stock_path, const char *subpath) { + if(unlikely(!host->health_enabled)) { + debug(D_HEALTH, "CONFIG health is not enabled for host '%s'", host->hostname); + return; + } + + int stock_enabled = (int)config_get_boolean(CONFIG_SECTION_HEALTH, "enable stock health configuration", + CONFIG_BOOLEAN_YES); + + if (!stock_enabled) { + info("Netdata will not load stock alarms."); + stock_path = user_path; + } + + recursive_config_double_dir_load(user_path, stock_path, subpath, health_readfile, (void *) host, 0); +} diff --git a/health/health_json.c b/health/health_json.c new file mode 100644 index 0000000..7b5a1e3 --- /dev/null +++ b/health/health_json.c @@ -0,0 +1,368 @@ +// SPDX-License-Identifier: GPL-3.0-or-later + +#include "health.h" + +static inline void health_string2json(BUFFER *wb, const char *prefix, const char *label, const char *value, const char *suffix) { + if(value && *value) { + buffer_sprintf(wb, "%s\"%s\":\"", prefix, label); + buffer_strcat_htmlescape(wb, value); + buffer_strcat(wb, "\""); + buffer_strcat(wb, suffix); + } + else + buffer_sprintf(wb, "%s\"%s\":null%s", prefix, label, suffix); +} + +inline void health_alarm_entry2json_nolock(BUFFER *wb, ALARM_ENTRY *ae, RRDHOST *host) { + buffer_sprintf(wb, + "\n\t{\n" + "\t\t\"hostname\": \"%s\",\n" + "\t\t\"unique_id\": %u,\n" + "\t\t\"alarm_id\": %u,\n" + "\t\t\"alarm_event_id\": %u,\n" + "\t\t\"name\": \"%s\",\n" + "\t\t\"chart\": \"%s\",\n" + "\t\t\"family\": \"%s\",\n" + "\t\t\"processed\": %s,\n" + "\t\t\"updated\": %s,\n" + 
"\t\t\"exec_run\": %lu,\n" + "\t\t\"exec_failed\": %s,\n" + "\t\t\"exec\": \"%s\",\n" + "\t\t\"recipient\": \"%s\",\n" + "\t\t\"exec_code\": %d,\n" + "\t\t\"source\": \"%s\",\n" + "\t\t\"units\": \"%s\",\n" + "\t\t\"when\": %lu,\n" + "\t\t\"duration\": %lu,\n" + "\t\t\"non_clear_duration\": %lu,\n" + "\t\t\"status\": \"%s\",\n" + "\t\t\"old_status\": \"%s\",\n" + "\t\t\"delay\": %d,\n" + "\t\t\"delay_up_to_timestamp\": %lu,\n" + "\t\t\"updated_by_id\": %u,\n" + "\t\t\"updates_id\": %u,\n" + "\t\t\"value_string\": \"%s\",\n" + "\t\t\"old_value_string\": \"%s\",\n" + "\t\t\"last_repeat\": \"%lu\",\n" + "\t\t\"silenced\": \"%s\",\n" + , host->hostname + , ae->unique_id + , ae->alarm_id + , ae->alarm_event_id + , ae->name + , ae->chart + , ae->family + , (ae->flags & HEALTH_ENTRY_FLAG_PROCESSED)?"true":"false" + , (ae->flags & HEALTH_ENTRY_FLAG_UPDATED)?"true":"false" + , (unsigned long)ae->exec_run_timestamp + , (ae->flags & HEALTH_ENTRY_FLAG_EXEC_FAILED)?"true":"false" + , ae->exec?ae->exec:host->health_default_exec + , ae->recipient?ae->recipient:host->health_default_recipient + , ae->exec_code + , ae->source + , ae->units?ae->units:"" + , (unsigned long)ae->when + , (unsigned long)ae->duration + , (unsigned long)ae->non_clear_duration + , rrdcalc_status2string(ae->new_status) + , rrdcalc_status2string(ae->old_status) + , ae->delay + , (unsigned long)ae->delay_up_to_timestamp + , ae->updated_by_id + , ae->updates_id + , ae->new_value_string + , ae->old_value_string + , (unsigned long)ae->last_repeat + , (ae->flags & HEALTH_ENTRY_FLAG_SILENCED)?"true":"false" + ); + + health_string2json(wb, "\t\t", "info", ae->info?ae->info:"", ",\n"); + + if(unlikely(ae->flags & HEALTH_ENTRY_FLAG_NO_CLEAR_NOTIFICATION)) { + buffer_strcat(wb, "\t\t\"no_clear_notification\": true,\n"); + } + + buffer_strcat(wb, "\t\t\"value\":"); + buffer_rrd_value(wb, ae->new_value); + buffer_strcat(wb, ",\n"); + + buffer_strcat(wb, "\t\t\"old_value\":"); + buffer_rrd_value(wb, ae->old_value); + 
buffer_strcat(wb, "\n"); + + buffer_strcat(wb, "\t}"); +} + +void health_alarm_log2json(RRDHOST *host, BUFFER *wb, uint32_t after) { + netdata_rwlock_rdlock(&host->health_log.alarm_log_rwlock); + + buffer_strcat(wb, "["); + + unsigned int max = host->health_log.max; + unsigned int count = 0; + ALARM_ENTRY *ae; + for(ae = host->health_log.alarms; ae && count < max ; count++, ae = ae->next) { + if(ae->unique_id > after) { + if(likely(count)) buffer_strcat(wb, ","); + health_alarm_entry2json_nolock(wb, ae, host); + } + } + + buffer_strcat(wb, "\n]\n"); + + netdata_rwlock_unlock(&host->health_log.alarm_log_rwlock); +} + +static inline void health_rrdcalc_values2json_nolock(RRDHOST *host, BUFFER *wb, RRDCALC *rc) { + (void)host; + buffer_sprintf(wb, + "\t\t\"%s.%s\": {\n" + "\t\t\t\"id\": %lu,\n" + , rc->chart, rc->name + , (unsigned long)rc->id); + + buffer_strcat(wb, "\t\t\t\"value\":"); + buffer_rrd_value(wb, rc->value); + buffer_strcat(wb, ",\n"); + + buffer_sprintf(wb, + "\t\t\t\"status\": \"%s\"\n" + , rrdcalc_status2string(rc->status)); + + buffer_strcat(wb, "\t\t}"); +} + +static inline void health_rrdcalc2json_nolock(RRDHOST *host, BUFFER *wb, RRDCALC *rc) { + char value_string[100 + 1]; + format_value_and_unit(value_string, 100, rc->value, rc->units, -1); + + buffer_sprintf(wb, + "\t\t\"%s.%s\": {\n" + "\t\t\t\"id\": %lu,\n" + "\t\t\t\"name\": \"%s\",\n" + "\t\t\t\"chart\": \"%s\",\n" + "\t\t\t\"family\": \"%s\",\n" + "\t\t\t\"active\": %s,\n" + "\t\t\t\"disabled\": %s,\n" + "\t\t\t\"silenced\": %s,\n" + "\t\t\t\"exec\": \"%s\",\n" + "\t\t\t\"recipient\": \"%s\",\n" + "\t\t\t\"source\": \"%s\",\n" + "\t\t\t\"units\": \"%s\",\n" + "\t\t\t\"info\": \"%s\",\n" + "\t\t\t\"status\": \"%s\",\n" + "\t\t\t\"last_status_change\": %lu,\n" + "\t\t\t\"last_updated\": %lu,\n" + "\t\t\t\"next_update\": %lu,\n" + "\t\t\t\"update_every\": %d,\n" + "\t\t\t\"delay_up_duration\": %d,\n" + "\t\t\t\"delay_down_duration\": %d,\n" + "\t\t\t\"delay_max_duration\": %d,\n" + 
"\t\t\t\"delay_multiplier\": %f,\n" + "\t\t\t\"delay\": %d,\n" + "\t\t\t\"delay_up_to_timestamp\": %lu,\n" + "\t\t\t\"warn_repeat_every\": \"%u\",\n" + "\t\t\t\"crit_repeat_every\": \"%u\",\n" + "\t\t\t\"value_string\": \"%s\",\n" + "\t\t\t\"last_repeat\": \"%lu\",\n" + , rc->chart, rc->name + , (unsigned long)rc->id + , rc->name + , rc->chart + , (rc->rrdset && rc->rrdset->family)?rc->rrdset->family:"" + , (rc->rrdset)?"true":"false" + , (rc->rrdcalc_flags & RRDCALC_FLAG_DISABLED)?"true":"false" + , (rc->rrdcalc_flags & RRDCALC_FLAG_SILENCED)?"true":"false" + , rc->exec?rc->exec:host->health_default_exec + , rc->recipient?rc->recipient:host->health_default_recipient + , rc->source + , rc->units?rc->units:"" + , rc->info?rc->info:"" + , rrdcalc_status2string(rc->status) + , (unsigned long)rc->last_status_change + , (unsigned long)rc->last_updated + , (unsigned long)rc->next_update + , rc->update_every + , rc->delay_up_duration + , rc->delay_down_duration + , rc->delay_max_duration + , rc->delay_multiplier + , rc->delay_last + , (unsigned long)rc->delay_up_to_timestamp + , rc->warn_repeat_every + , rc->crit_repeat_every + , value_string + , (unsigned long)rc->last_repeat + ); + + if(unlikely(rc->options & RRDCALC_FLAG_NO_CLEAR_NOTIFICATION)) { + buffer_strcat(wb, "\t\t\t\"no_clear_notification\": true,\n"); + } + + if(RRDCALC_HAS_DB_LOOKUP(rc)) { + if(rc->dimensions && *rc->dimensions) + health_string2json(wb, "\t\t\t", "lookup_dimensions", rc->dimensions, ",\n"); + + buffer_sprintf(wb, + "\t\t\t\"db_after\": %lu,\n" + "\t\t\t\"db_before\": %lu,\n" + "\t\t\t\"lookup_method\": \"%s\",\n" + "\t\t\t\"lookup_after\": %d,\n" + "\t\t\t\"lookup_before\": %d,\n" + "\t\t\t\"lookup_options\": \"", + (unsigned long) rc->db_after, + (unsigned long) rc->db_before, + group_method2string(rc->group), + rc->after, + rc->before + ); + buffer_data_options2string(wb, rc->options); + buffer_strcat(wb, "\",\n"); + } + + if(rc->calculation) { + health_string2json(wb, "\t\t\t", "calc", 
rc->calculation->source, ",\n"); + health_string2json(wb, "\t\t\t", "calc_parsed", rc->calculation->parsed_as, ",\n"); + } + + if(rc->warning) { + health_string2json(wb, "\t\t\t", "warn", rc->warning->source, ",\n"); + health_string2json(wb, "\t\t\t", "warn_parsed", rc->warning->parsed_as, ",\n"); + } + + if(rc->critical) { + health_string2json(wb, "\t\t\t", "crit", rc->critical->source, ",\n"); + health_string2json(wb, "\t\t\t", "crit_parsed", rc->critical->parsed_as, ",\n"); + } + + buffer_strcat(wb, "\t\t\t\"green\":"); + buffer_rrd_value(wb, rc->green); + buffer_strcat(wb, ",\n"); + + buffer_strcat(wb, "\t\t\t\"red\":"); + buffer_rrd_value(wb, rc->red); + buffer_strcat(wb, ",\n"); + + buffer_strcat(wb, "\t\t\t\"value\":"); + buffer_rrd_value(wb, rc->value); + buffer_strcat(wb, "\n"); + + buffer_strcat(wb, "\t\t}"); +} + +//void health_rrdcalctemplate2json_nolock(BUFFER *wb, RRDCALCTEMPLATE *rt) { +// +//} + +void health_aggregate_alarms(RRDHOST *host, BUFFER *wb, BUFFER* contexts, RRDCALC_STATUS status) { + RRDCALC *rc; + int numberOfAlarms = 0; + char *tok = NULL; + char *p = NULL; + + rrdhost_rdlock(host); + + if (contexts) { + p = (char*)buffer_tostring(contexts); + while(p && *p && (tok = mystrsep(&p, ", |"))) { + if(!*tok) continue; + + for(rc = host->alarms; rc ; rc = rc->next) { + if(unlikely(!rc->rrdset || !rc->rrdset->last_collected_time.tv_sec)) + continue; + if(unlikely(rc->rrdset && rc->rrdset->hash_context == simple_hash(tok) + && !strcmp(rc->rrdset->context, tok) + && ((status==RRDCALC_STATUS_RAISED)?(rc->status >= RRDCALC_STATUS_WARNING):rc->status == status))) + numberOfAlarms++; + } + } + } + else { + for(rc = host->alarms; rc ; rc = rc->next) { + if(unlikely(!rc->rrdset || !rc->rrdset->last_collected_time.tv_sec)) + continue; + + if(unlikely((status==RRDCALC_STATUS_RAISED)?(rc->status >= RRDCALC_STATUS_WARNING):rc->status == status)) + numberOfAlarms++; + } + } + + buffer_sprintf(wb, "%d", numberOfAlarms); + rrdhost_unlock(host); +} + +static 
void health_alarms2json_fill_alarms(RRDHOST *host, BUFFER *wb, int all, void (*fp)(RRDHOST *, BUFFER *, RRDCALC *)) { + RRDCALC *rc; + int i; + for(i = 0, rc = host->alarms; rc ; rc = rc->next) { + if(unlikely(!rc->rrdset || !rc->rrdset->last_collected_time.tv_sec)) + continue; + + if(likely(!all && !(rc->status == RRDCALC_STATUS_WARNING || rc->status == RRDCALC_STATUS_CRITICAL))) + continue; + + if(likely(i)) buffer_strcat(wb, ",\n"); + fp(host, wb, rc); + i++; + } +} + +void health_alarms2json(RRDHOST *host, BUFFER *wb, int all) { + rrdhost_rdlock(host); + buffer_sprintf(wb, "{\n\t\"hostname\": \"%s\"," + "\n\t\"latest_alarm_log_unique_id\": %u," + "\n\t\"status\": %s," + "\n\t\"now\": %lu," + "\n\t\"alarms\": {\n", + host->hostname, + (host->health_log.next_log_id > 0)?(host->health_log.next_log_id - 1):0, + host->health_enabled?"true":"false", + (unsigned long)now_realtime_sec()); + + health_alarms2json_fill_alarms(host, wb, all, health_rrdcalc2json_nolock); + +// buffer_strcat(wb, "\n\t},\n\t\"templates\": {"); +// RRDCALCTEMPLATE *rt; +// for(rt = host->templates; rt ; rt = rt->next) +// health_rrdcalctemplate2json_nolock(wb, rt); + + buffer_strcat(wb, "\n\t}\n}\n"); + rrdhost_unlock(host); +} + +void health_alarms_values2json(RRDHOST *host, BUFFER *wb, int all) { + rrdhost_rdlock(host); + buffer_sprintf(wb, "{\n\t\"hostname\": \"%s\"," + "\n\t\"alarms\": {\n", + host->hostname); + + health_alarms2json_fill_alarms(host, wb, all, health_rrdcalc_values2json_nolock); + + buffer_strcat(wb, "\n\t}\n}\n"); + rrdhost_unlock(host); +} + + +void health_active_log_alarms_2json(RRDHOST *host, BUFFER *wb) { + netdata_rwlock_rdlock(&host->health_log.alarm_log_rwlock); + + buffer_sprintf(wb, "[\n"); + + unsigned int max = host->health_log.max; + unsigned int count = 0; + ALARM_ENTRY *ae; + for(ae = host->health_log.alarms; ae && count < max ; ae = ae->next) { + if (!ae->updated_by_id && + ((ae->new_status == RRDCALC_STATUS_WARNING || ae->new_status == 
RRDCALC_STATUS_CRITICAL) || + ((ae->old_status == RRDCALC_STATUS_WARNING || ae->old_status == RRDCALC_STATUS_CRITICAL) && + ae->new_status == RRDCALC_STATUS_REMOVED))) { + if (likely(count)) + buffer_strcat(wb, ","); + health_alarm_entry2json_nolock(wb, ae, host); + count++; + } + } + buffer_strcat(wb, "]"); + + netdata_rwlock_unlock(&host->health_log.alarm_log_rwlock); +} diff --git a/health/health_log.c b/health/health_log.c new file mode 100644 index 0000000..8c0bc5c --- /dev/null +++ b/health/health_log.c @@ -0,0 +1,573 @@ +// SPDX-License-Identifier: GPL-3.0-or-later + +#include "health.h" + +// ---------------------------------------------------------------------------- +// health alarm log load/save +// no need for locking - only one thread is reading / writing the alarms log + +inline int health_alarm_log_open(RRDHOST *host) { + if(host->health_log_fp) + fclose(host->health_log_fp); + + host->health_log_fp = fopen(host->health_log_filename, "a"); + + if(host->health_log_fp) { + if (setvbuf(host->health_log_fp, NULL, _IOLBF, 0) != 0) + error("HEALTH [%s]: cannot set line buffering on health log file '%s'.", host->hostname, host->health_log_filename); + return 0; + } + + error("HEALTH [%s]: cannot open health log file '%s'. 
Health data will be lost in case of netdata or server crash.", host->hostname, host->health_log_filename); + return -1; +} + +static inline void health_alarm_log_close(RRDHOST *host) { + if(host->health_log_fp) { + fclose(host->health_log_fp); + host->health_log_fp = NULL; + } +} + +static inline void health_log_rotate(RRDHOST *host) { + static size_t rotate_every = 0; + + if(unlikely(rotate_every == 0)) { + rotate_every = (size_t)config_get_number(CONFIG_SECTION_HEALTH, "rotate log every lines", 2000); + if(rotate_every < 100) rotate_every = 100; + } + + if(unlikely(host->health_log_entries_written > rotate_every)) { + health_alarm_log_close(host); + + char old_filename[FILENAME_MAX + 1]; + snprintfz(old_filename, FILENAME_MAX, "%s.old", host->health_log_filename); + + if(unlink(old_filename) == -1 && errno != ENOENT) + error("HEALTH [%s]: cannot remove old alarms log file '%s'", host->hostname, old_filename); + + if(link(host->health_log_filename, old_filename) == -1 && errno != ENOENT) + error("HEALTH [%s]: cannot move file '%s' to '%s'.", host->hostname, host->health_log_filename, old_filename); + + if(unlink(host->health_log_filename) == -1 && errno != ENOENT) + error("HEALTH [%s]: cannot remove old alarms log file '%s'", host->hostname, host->health_log_filename); + + // open it with truncate + host->health_log_fp = fopen(host->health_log_filename, "w"); + + if(host->health_log_fp) + fclose(host->health_log_fp); + else + error("HEALTH [%s]: cannot truncate health log '%s'", host->hostname, host->health_log_filename); + + host->health_log_fp = NULL; + + host->health_log_entries_written = 0; + health_alarm_log_open(host); + } +} + +inline void health_label_log_save(RRDHOST *host) { + health_log_rotate(host); + + if(likely(host->health_log_fp)) { + BUFFER *wb = buffer_create(1024); + rrdhost_check_rdlock(host); + netdata_rwlock_rdlock(&host->labels.labels_rwlock); + struct label *l=localhost->labels.head; + while (l != NULL) { + buffer_sprintf(wb,"%s=%s\t ", 
l->key, l->value); + l = l->next; + } + netdata_rwlock_unlock(&host->labels.labels_rwlock); + + char *write = (char *) buffer_tostring(wb) ; + + write[wb->len-2] = '\n'; + write[wb->len-1] = '\0'; + + if (unlikely(fprintf(host->health_log_fp, "L\t%s" + , write + ) < 0)) + error("HEALTH [%s]: failed to save alarm log entry to '%s'. Health data may be lost in case of abnormal restart.", + host->hostname, host->health_log_filename); + else { + host->health_log_entries_written++; + } + + buffer_free(wb); + } +} + +inline void health_alarm_log_save(RRDHOST *host, ALARM_ENTRY *ae) { + health_log_rotate(host); + if(likely(host->health_log_fp)) { + if(unlikely(fprintf(host->health_log_fp + , "%c\t%s" + "\t%08x\t%08x\t%08x\t%08x\t%08x" + "\t%08x\t%08x\t%08x" + "\t%08x\t%08x\t%08x" + "\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s" + "\t%d\t%d\t%d\t%d" + "\t" CALCULATED_NUMBER_FORMAT_AUTO "\t" CALCULATED_NUMBER_FORMAT_AUTO + "\t%016lx" + "\n" + , (ae->flags & HEALTH_ENTRY_FLAG_SAVED)?'U':'A' + , host->hostname + + , ae->unique_id + , ae->alarm_id + , ae->alarm_event_id + , ae->updated_by_id + , ae->updates_id + + , (uint32_t)ae->when + , (uint32_t)ae->duration + , (uint32_t)ae->non_clear_duration + , (uint32_t)ae->flags + , (uint32_t)ae->exec_run_timestamp + , (uint32_t)ae->delay_up_to_timestamp + + , (ae->name)?ae->name:"" + , (ae->chart)?ae->chart:"" + , (ae->family)?ae->family:"" + , (ae->exec)?ae->exec:"" + , (ae->recipient)?ae->recipient:"" + , (ae->source)?ae->source:"" + , (ae->units)?ae->units:"" + , (ae->info)?ae->info:"" + + , ae->exec_code + , ae->new_status + , ae->old_status + , ae->delay + + , ae->new_value + , ae->old_value + , (uint64_t)ae->last_repeat + ) < 0)) + error("HEALTH [%s]: failed to save alarm log entry to '%s'. 
Health data may be lost in case of abnormal restart.", host->hostname, host->health_log_filename); + else { + ae->flags |= HEALTH_ENTRY_FLAG_SAVED; + host->health_log_entries_written++; + } + } +#ifdef ENABLE_ACLK + if (netdata_cloud_setting) { + if ((ae->new_status == RRDCALC_STATUS_WARNING || ae->new_status == RRDCALC_STATUS_CRITICAL) || + ((ae->old_status == RRDCALC_STATUS_WARNING || ae->old_status == RRDCALC_STATUS_CRITICAL))) { + aclk_update_alarm(host, ae); + } + } +#endif +} + +static uint32_t is_valid_alarm_id(RRDHOST *host, const char *chart, const char *name, uint32_t alarm_id) +{ + uint32_t hash_chart = simple_hash(chart); + uint32_t hash_name = simple_hash(name); + + ALARM_ENTRY *ae; + for(ae = host->health_log.alarms; ae ;ae = ae->next) { + if (unlikely( + ae->alarm_id == alarm_id && (!(ae->hash_name == hash_name && ae->hash_chart == hash_chart && + !strcmp(name, ae->name) && !strcmp(chart, ae->chart))))) { + return 0; + } + } + return 1; +} + +static inline ssize_t health_alarm_log_read(RRDHOST *host, FILE *fp, const char *filename) { + errno = 0; + + char *s, *buf = mallocz(65536 + 1); + size_t line = 0, len = 0; + ssize_t loaded = 0, updated = 0, errored = 0, duplicate = 0; + + netdata_rwlock_rdlock(&host->health_log.alarm_log_rwlock); + + while((s = fgets_trim_len(buf, 65536, fp, &len))) { + host->health_log_entries_written++; + line++; + + int max_entries = 30, entries = 0; + char *pointers[max_entries]; + + pointers[entries++] = s++; + while(*s) { + if(unlikely(*s == '\t')) { + *s = '\0'; + pointers[entries++] = ++s; + if(entries >= max_entries) { + error("HEALTH [%s]: line %zu of file '%s' has more than %d entries. 
Ignoring excessive entries.", host->hostname, line, filename, max_entries); + break; + } + } + else s++; + } + + if(likely(*pointers[0] == 'L')) + continue; + + if(likely(*pointers[0] == 'U' || *pointers[0] == 'A')) { + ALARM_ENTRY *ae = NULL; + + if(entries < 26) { + error("HEALTH [%s]: line %zu of file '%s' should have at least 26 entries, but it has %d. Ignoring it.", host->hostname, line, filename, entries); + errored++; + continue; + } + + // check that we have valid ids + uint32_t unique_id = (uint32_t)strtoul(pointers[2], NULL, 16); + if(!unique_id) { + error("HEALTH [%s]: line %zu of file '%s' states alarm entry with invalid unique id %u (%s). Ignoring it.", host->hostname, line, filename, unique_id, pointers[2]); + errored++; + continue; + } + + uint32_t alarm_id = (uint32_t)strtoul(pointers[3], NULL, 16); + if(!alarm_id) { + error("HEALTH [%s]: line %zu of file '%s' states alarm entry for invalid alarm id %u (%s). Ignoring it.", host->hostname, line, filename, alarm_id, pointers[3]); + errored++; + continue; + } + + // Check if we got last_repeat field + time_t last_repeat = 0; + if(entries > 27) { + char* alarm_name = pointers[13]; + last_repeat = (time_t)strtoul(pointers[27], NULL, 16); + + RRDCALC *rc = alarm_max_last_repeat(host, alarm_name,simple_hash(alarm_name)); + if (!rc) { + for(rc = host->alarms; rc ; rc = rc->next) { + RRDCALC *rdcmp = (RRDCALC *) avl_insert_lock(&(host)->alarms_idx_name, (avl *)rc); + if(rdcmp != rc) { + error("Cannot insert the alarm index ID using log %s", rc->name); + } + } + + rc = alarm_max_last_repeat(host, alarm_name,simple_hash(alarm_name)); + } + + if(unlikely(rc)) { + if (rrdcalc_isrepeating(rc)) { + rc->last_repeat = last_repeat; + // We iterate through repeating alarm entries only to + // find the latest last_repeat timestamp. Otherwise, + // there is no need to keep them in memory. 
+ continue; + } + } + } + + if(unlikely(*pointers[0] == 'A')) { + // make sure it is properly numbered + if(unlikely(host->health_log.alarms && unique_id < host->health_log.alarms->unique_id)) { + error( "HEALTH [%s]: line %zu of file '%s' has alarm log entry %u in wrong order. Ignoring it." + , host->hostname, line, filename, unique_id); + errored++; + continue; + } + + ae = callocz(1, sizeof(ALARM_ENTRY)); + } + else if(unlikely(*pointers[0] == 'U')) { + // find the original + for(ae = host->health_log.alarms; ae ; ae = ae->next) { + if(unlikely(unique_id == ae->unique_id)) { + if(unlikely(*pointers[0] == 'A')) { + error("HEALTH [%s]: line %zu of file '%s' adds duplicate alarm log entry %u. Using the later." + , host->hostname, line, filename, unique_id); + *pointers[0] = 'U'; + duplicate++; + } + break; + } + else if(unlikely(unique_id > ae->unique_id)) { + // no need to continue + // the linked list is sorted + ae = NULL; + break; + } + } + } + + // if not found, skip this line + if(unlikely(!ae)) { + // error("HEALTH [%s]: line %zu of file '%s' updates alarm log entry with unique id %u, but it is not found.", host->hostname, line, filename, unique_id); + continue; + } + + // check for a possible host missmatch + //if(strcmp(pointers[1], host->hostname)) + // error("HEALTH [%s]: line %zu of file '%s' provides an alarm for host '%s' but this is named '%s'.", host->hostname, line, filename, pointers[1], host->hostname); + + ae->unique_id = unique_id; + if (!is_valid_alarm_id(host, pointers[14], pointers[13], alarm_id)) + alarm_id = rrdcalc_get_unique_id(host, pointers[14], pointers[13], NULL); + ae->alarm_id = alarm_id; + ae->alarm_event_id = (uint32_t)strtoul(pointers[4], NULL, 16); + ae->updated_by_id = (uint32_t)strtoul(pointers[5], NULL, 16); + ae->updates_id = (uint32_t)strtoul(pointers[6], NULL, 16); + + ae->when = (uint32_t)strtoul(pointers[7], NULL, 16); + ae->duration = (uint32_t)strtoul(pointers[8], NULL, 16); + ae->non_clear_duration = 
(uint32_t)strtoul(pointers[9], NULL, 16); + + ae->flags = (uint32_t)strtoul(pointers[10], NULL, 16); + ae->flags |= HEALTH_ENTRY_FLAG_SAVED; + + ae->exec_run_timestamp = (uint32_t)strtoul(pointers[11], NULL, 16); + ae->delay_up_to_timestamp = (uint32_t)strtoul(pointers[12], NULL, 16); + + freez(ae->name); + ae->name = strdupz(pointers[13]); + ae->hash_name = simple_hash(ae->name); + + freez(ae->chart); + ae->chart = strdupz(pointers[14]); + ae->hash_chart = simple_hash(ae->chart); + + freez(ae->family); + ae->family = strdupz(pointers[15]); + + freez(ae->exec); + ae->exec = strdupz(pointers[16]); + if(!*ae->exec) { freez(ae->exec); ae->exec = NULL; } + + freez(ae->recipient); + ae->recipient = strdupz(pointers[17]); + if(!*ae->recipient) { freez(ae->recipient); ae->recipient = NULL; } + + freez(ae->source); + ae->source = strdupz(pointers[18]); + if(!*ae->source) { freez(ae->source); ae->source = NULL; } + + freez(ae->units); + ae->units = strdupz(pointers[19]); + if(!*ae->units) { freez(ae->units); ae->units = NULL; } + + freez(ae->info); + ae->info = strdupz(pointers[20]); + if(!*ae->info) { freez(ae->info); ae->info = NULL; } + + ae->exec_code = str2i(pointers[21]); + ae->new_status = str2i(pointers[22]); + ae->old_status = str2i(pointers[23]); + ae->delay = str2i(pointers[24]); + + ae->new_value = str2l(pointers[25]); + ae->old_value = str2l(pointers[26]); + + ae->last_repeat = last_repeat; + + char value_string[100 + 1]; + freez(ae->old_value_string); + freez(ae->new_value_string); + ae->old_value_string = strdupz(format_value_and_unit(value_string, 100, ae->old_value, ae->units, -1)); + ae->new_value_string = strdupz(format_value_and_unit(value_string, 100, ae->new_value, ae->units, -1)); + + // add it to host if not already there + if(unlikely(*pointers[0] == 'A')) { + ae->next = host->health_log.alarms; + host->health_log.alarms = ae; + loaded++; + } + else updated++; + + if(unlikely(ae->unique_id > host->health_max_unique_id)) + host->health_max_unique_id 
= ae->unique_id; + + if(unlikely(ae->alarm_id >= host->health_max_alarm_id)) + host->health_max_alarm_id = ae->alarm_id; + } + else { + error("HEALTH [%s]: line %zu of file '%s' is invalid (unrecognized entry type '%s').", host->hostname, line, filename, pointers[0]); + errored++; + } + } + + netdata_rwlock_unlock(&host->health_log.alarm_log_rwlock); + + freez(buf); + + if(!host->health_max_unique_id) host->health_max_unique_id = (uint32_t)now_realtime_sec(); + if(!host->health_max_alarm_id) host->health_max_alarm_id = (uint32_t)now_realtime_sec(); + + host->health_log.next_log_id = host->health_max_unique_id + 1; + if (unlikely(!host->health_log.next_alarm_id || host->health_log.next_alarm_id <= host->health_max_alarm_id)) + host->health_log.next_alarm_id = host->health_max_alarm_id + 1; + + debug(D_HEALTH, "HEALTH [%s]: loaded file '%s' with %zd new alarm entries, updated %zd alarms, errors %zd entries, duplicate %zd", host->hostname, filename, loaded, updated, errored, duplicate); + return loaded; +} + +inline void health_alarm_log_load(RRDHOST *host) { + health_alarm_log_close(host); + + char filename[FILENAME_MAX + 1]; + snprintfz(filename, FILENAME_MAX, "%s.old", host->health_log_filename); + FILE *fp = fopen(filename, "r"); + if(!fp) + error("HEALTH [%s]: cannot open health file: %s", host->hostname, filename); + else { + health_alarm_log_read(host, fp, filename); + fclose(fp); + } + + host->health_log_entries_written = 0; + fp = fopen(host->health_log_filename, "r"); + if(!fp) + error("HEALTH [%s]: cannot open health file: %s", host->hostname, host->health_log_filename); + else { + health_alarm_log_read(host, fp, host->health_log_filename); + fclose(fp); + } + + health_alarm_log_open(host); +} + + +// ---------------------------------------------------------------------------- +// health alarm log management + +inline ALARM_ENTRY* health_create_alarm_entry( + RRDHOST *host, + uint32_t alarm_id, + uint32_t alarm_event_id, + time_t when, + const char *name, + 
const char *chart, + const char *family, + const char *exec, + const char *recipient, + time_t duration, + calculated_number old_value, + calculated_number new_value, + RRDCALC_STATUS old_status, + RRDCALC_STATUS new_status, + const char *source, + const char *units, + const char *info, + int delay, + uint32_t flags +) { + debug(D_HEALTH, "Health adding alarm log entry with id: %u", host->health_log.next_log_id); + + ALARM_ENTRY *ae = callocz(1, sizeof(ALARM_ENTRY)); + ae->name = strdupz(name); + ae->hash_name = simple_hash(ae->name); + + if(chart) { + ae->chart = strdupz(chart); + ae->hash_chart = simple_hash(ae->chart); + } + + if(family) + ae->family = strdupz(family); + + if(exec) ae->exec = strdupz(exec); + if(recipient) ae->recipient = strdupz(recipient); + if(source) ae->source = strdupz(source); + if(units) ae->units = strdupz(units); + if(info) ae->info = strdupz(info); + + ae->unique_id = host->health_log.next_log_id++; + ae->alarm_id = alarm_id; + ae->alarm_event_id = alarm_event_id; + ae->when = when; + ae->old_value = old_value; + ae->new_value = new_value; + + char value_string[100 + 1]; + ae->old_value_string = strdupz(format_value_and_unit(value_string, 100, ae->old_value, ae->units, -1)); + ae->new_value_string = strdupz(format_value_and_unit(value_string, 100, ae->new_value, ae->units, -1)); + + ae->old_status = old_status; + ae->new_status = new_status; + ae->duration = duration; + ae->delay = delay; + ae->delay_up_to_timestamp = when + delay; + ae->flags |= flags; + + ae->last_repeat = 0; + + if(ae->old_status == RRDCALC_STATUS_WARNING || ae->old_status == RRDCALC_STATUS_CRITICAL) + ae->non_clear_duration += ae->duration; + + return ae; +} + +inline void health_alarm_log( + RRDHOST *host, + ALARM_ENTRY *ae +) { + debug(D_HEALTH, "Health adding alarm log entry with id: %u", ae->unique_id); + + if(unlikely(alarm_entry_isrepeating(host, ae))) { + error("Repeating alarms cannot be added to host's alarm log entries. 
It seems somewhere in the logic, API is being misused. Alarm id: %u", ae->alarm_id); + return; + } + // link it + netdata_rwlock_wrlock(&host->health_log.alarm_log_rwlock); + ae->next = host->health_log.alarms; + host->health_log.alarms = ae; + host->health_log.count++; + netdata_rwlock_unlock(&host->health_log.alarm_log_rwlock); + + // match previous alarms + netdata_rwlock_rdlock(&host->health_log.alarm_log_rwlock); + ALARM_ENTRY *t; + for(t = host->health_log.alarms ; t ; t = t->next) { + if(t != ae && t->alarm_id == ae->alarm_id) { + if(!(t->flags & HEALTH_ENTRY_FLAG_UPDATED) && !t->updated_by_id) { + t->flags |= HEALTH_ENTRY_FLAG_UPDATED; + t->updated_by_id = ae->unique_id; + ae->updates_id = t->unique_id; + + if((t->new_status == RRDCALC_STATUS_WARNING || t->new_status == RRDCALC_STATUS_CRITICAL) && + (t->old_status == RRDCALC_STATUS_WARNING || t->old_status == RRDCALC_STATUS_CRITICAL)) + ae->non_clear_duration += t->non_clear_duration; + + health_alarm_log_save(host, t); + } + + // no need to continue + break; + } + } + netdata_rwlock_unlock(&host->health_log.alarm_log_rwlock); + + health_alarm_log_save(host, ae); +} + +inline void health_alarm_log_free_one_nochecks_nounlink(ALARM_ENTRY *ae) { + freez(ae->name); + freez(ae->chart); + freez(ae->family); + freez(ae->exec); + freez(ae->recipient); + freez(ae->source); + freez(ae->units); + freez(ae->info); + freez(ae->old_value_string); + freez(ae->new_value_string); + freez(ae); +} + +inline void health_alarm_log_free(RRDHOST *host) { + rrdhost_check_wrlock(host); + + netdata_rwlock_wrlock(&host->health_log.alarm_log_rwlock); + + ALARM_ENTRY *ae; + while((ae = host->health_log.alarms)) { + host->health_log.alarms = ae->next; + health_alarm_log_free_one_nochecks_nounlink(ae); + } + + netdata_rwlock_unlock(&host->health_log.alarm_log_rwlock); +} diff --git a/health/notifications/Makefile.am b/health/notifications/Makefile.am new file mode 100644 index 0000000..e6b4213 --- /dev/null +++ 
b/health/notifications/Makefile.am
@@ -0,0 +1,51 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+
+AUTOMAKE_OPTIONS = subdir-objects
+MAINTAINERCLEANFILES = $(srcdir)/Makefile.in
+
+CLEANFILES = \
+    alarm-notify.sh \
+    $(NULL)
+
+include $(top_srcdir)/build/subst.inc
+SUFFIXES = .in
+
+dist_libconfig_DATA = \
+    health_alarm_notify.conf \
+    health_email_recipients.conf \
+    $(NULL)
+
+dist_plugins_SCRIPTS = \
+    alarm-notify.sh \
+    alarm-email.sh \
+    alarm-test.sh \
+    $(NULL)
+
+dist_noinst_DATA = \
+    alarm-notify.sh.in \
+    README.md \
+    $(NULL)
+
+include alerta/Makefile.inc
+include awssns/Makefile.inc
+include discord/Makefile.inc
+include email/Makefile.inc
+include flock/Makefile.inc
+include hangouts/Makefile.inc
+include irc/Makefile.inc
+include kavenegar/Makefile.inc
+include messagebird/Makefile.inc
+include opsgenie/Makefile.inc
+include pagerduty/Makefile.inc
+include pushbullet/Makefile.inc
+include pushover/Makefile.inc
+include rocketchat/Makefile.inc
+include slack/Makefile.inc
+include smstools3/Makefile.inc
+include stackpulse/Makefile.inc
+include syslog/Makefile.inc
+include telegram/Makefile.inc
+include twilio/Makefile.inc
+include web/Makefile.inc
+include matrix/Makefile.inc
+include custom/Makefile.inc
diff --git a/health/notifications/README.md b/health/notifications/README.md
new file mode 100644
index 0000000..5354198
--- /dev/null
+++ b/health/notifications/README.md
@@ -0,0 +1,86 @@
+<!--
+title: "Alarm notifications"
+description: "Reference documentation for Netdata's alarm notification feature, which supports dozens of endpoints, user roles, and more."
+custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/README.md
+-->
+
+# Alarm notifications
+
+The `exec` line in health configuration defines an external script that will be called once
+the alarm is triggered. The default script is `alarm-notify.sh`.
+
+You can change the default script globally by editing `/etc/netdata/netdata.conf`.
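As a rough sketch of how to check whether that default has been overridden, the helper below prints the value of `script to execute on alarm` from the `[health]` section of a `netdata.conf`-style file. The section name, the key name, and the function name `health_exec_setting` are assumptions made for illustration and are not taken from this commit, so verify them against your own `netdata.conf`. Commented-out lines are skipped, so the helper prints nothing when no explicit override is set.

```sh
# Hypothetical helper: print an explicit "script to execute on alarm" override.
# Assumes the setting lives in the [health] section of netdata.conf.
health_exec_setting() {
    conf="${1:-/etc/netdata/netdata.conf}"
    awk '
        # A section header such as [health] switches the current section.
        /^[ \t]*\[/ { in_health = ($0 ~ /^[ \t]*\[health\]/); next }
        # Skip commented-out settings (these are the defaults).
        /^[ \t]*#/ { next }
        # Inside [health], print everything after the first "=" and stop.
        in_health && /script to execute on alarm[ \t]*=/ {
            sub(/^[^=]*=[ \t]*/, "")
            print
            exit
        }
    ' "$conf"
}
```

Calling `health_exec_setting` with no argument reads `/etc/netdata/netdata.conf`; pass another path to inspect a different file.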
+
+`alarm-notify.sh` is capable of sending notifications:
+
+- to multiple recipients
+- using multiple notification methods
+- filtering severity per recipient
+
+It uses **roles**. For example `sysadmin`, `webmaster`, `dba`, etc.
+
+Each alarm is assigned to one or more roles, using the `to` line of the alarm configuration. Then `alarm-notify.sh` uses
+its own configuration file `/etc/netdata/health_alarm_notify.conf`. To edit it on your system, run
+`/etc/netdata/edit-config health_alarm_notify.conf` and find the destination address of the notification for each
+method.
+
+Each role may have one or more destinations.
+
+So, for example the `sysadmin` role may send:
+
+1. emails to admin1@example.com and admin2@example.com
+2. pushover.net notifications to USERTOKENS `A`, `B` and `C`.
+3. pushbullet.com push notifications to admin1@example.com and admin2@example.com
+4. messages to slack.com channel `#alarms` and `#systems`.
+5. messages to Discord channels `#alarms` and `#systems`.
+
+## Configuration
+
+Edit `/etc/netdata/health_alarm_notify.conf` by running `/etc/netdata/edit-config health_alarm_notify.conf`:
+
+- settings per notification method:
+
+    all notification methods except email, require some configuration
+    (i.e. API keys, tokens, destination rooms, channels, etc).
+
+- **recipients** per **role** per **notification method**
+
+```sh
+grep sysadmin /etc/netdata/health_alarm_notify.conf
+
+role_recipients_email[sysadmin]="${DEFAULT_RECIPIENT_EMAIL}"
+role_recipients_pushover[sysadmin]="${DEFAULT_RECIPIENT_PUSHOVER}"
+role_recipients_pushbullet[sysadmin]="${DEFAULT_RECIPIENT_PUSHBULLET}"
+role_recipients_telegram[sysadmin]="${DEFAULT_RECIPIENT_TELEGRAM}"
+role_recipients_slack[sysadmin]="${DEFAULT_RECIPIENT_SLACK}"
+...
+``` + +## Testing Notifications + +You can run the following command by hand, to test alarms configuration: + +```sh +# become user netdata +su -s /bin/bash netdata + +# enable debugging info on the console +export NETDATA_ALARM_NOTIFY_DEBUG=1 + +# send test alarms to sysadmin +/usr/libexec/netdata/plugins.d/alarm-notify.sh test + +# send test alarms to any role +/usr/libexec/netdata/plugins.d/alarm-notify.sh test "ROLE" +``` + +Note that in versions before 1.16, the plugins.d directory may be installed in a different location in certain OSs (e.g. under `/usr/lib/netdata`). You can always find the location of the alarm-notify.sh script in `netdata.conf`. + +If you need to dig even deeper, you can trace the execution with `bash -x`. Note that in test mode, alarm-notify.sh calls itself with many more arguments. So first do + +```sh +bash -x /usr/libexec/netdata/plugins.d/alarm-notify.sh test +``` + + Then look in the output for the alarm-notify.sh calls and run the one you want to trace with `bash -x`. 
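The per-recipient severity filtering listed among the features of `alarm-notify.sh` appears to be driven by a `|critical` suffix on a recipient, judging by the script's `filter_recipient_by_criticality` function, which reports any other suffix as invalid. The sketch below shows how such an entry is split into recipient and filter. The addresses and the `split_recipient` helper are placeholders for illustration, not part of Netdata, and the POSIX expansions stand in for the bash forms `${x/|*/}` and `${x/*|/}` that the script uses.

```sh
# In health_alarm_notify.conf the entry would look like this
# (addresses are placeholders):
#   role_recipients_email[sysadmin]="admin1@example.com admin2@example.com|critical"
recipients="admin1@example.com admin2@example.com|critical"

# Hypothetical helper: split one entry into recipient and severity filter.
split_recipient() {
  r="${1%%|*}"   # text before the first "|": the recipient
  s="${1##*|}"   # text after the last "|": the severity filter
  # Without a "|" both expansions return the whole entry: no filter is set.
  [ "${r}" = "${s}" ] && s="none"
  echo "${r} ${s}"
}

for entry in ${recipients}; do
  split_recipient "${entry}"
done
# prints:
#   admin1@example.com none
#   admin2@example.com critical
```

With this setup, `admin2@example.com` would be skipped for a WARNING unless the same alarm had already reached CRITICAL, while `admin1@example.com` would receive every status change.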
+[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/health/notifications/alarm-email.sh b/health/notifications/alarm-email.sh new file mode 100755 index 0000000..69c4c3f --- /dev/null +++ b/health/notifications/alarm-email.sh @@ -0,0 +1,7 @@ +#!/usr/bin/env bash +# SPDX-License-Identifier: GPL-3.0-or-later + +# OBSOLETE - REPLACED WITH +# alarm-notify.sh + +${0/alarm-email.sh/alarm-notify.sh} "${@}" diff --git a/health/notifications/alarm-notify.sh.in b/health/notifications/alarm-notify.sh.in new file mode 100755 index 0000000..3bf8db5 --- /dev/null +++ b/health/notifications/alarm-notify.sh.in @@ -0,0 +1,2759 @@ +#!/usr/bin/env bash +#shellcheck source=/dev/null disable=SC2086,SC2154 + +# netdata +# real-time performance and health monitoring, done right! 
+# (C) 2017 Costa Tsaousis <costa@tsaousis.gr> +# SPDX-License-Identifier: GPL-3.0-or-later +# +# Script to send alarm notifications for netdata +# +# Features: +# - multiple notification methods +# - multiple roles per alarm +# - multiple recipients per role +# - severity filtering per recipient +# +# Supported notification methods: +# - emails by @ktsaou +# - slack.com notifications by @ktsaou +# - alerta.io notifications by @kattunga +# - discordapp.com notifications by @lowfive +# - pushover.net notifications by @ktsaou +# - pushbullet.com push notifications by Tiago Peralta @tperalta82 #1070 +# - telegram.org notifications by @hashworks #1002 +# - twilio.com notifications by Levi Blaney @shadycuz #1211 +# - kafka notifications by @ktsaou #1342 +# - pagerduty.com notifications by Jim Cooley @jimcooley #1373 +# - messagebird.com notifications by @tech_no_logical #1453 +# - hipchat notifications by @ktsaou #1561 +# - fleep notifications by @Ferroin +# - prowlapp.com notifications by @Ferroin +# - irc notifications by @manosf +# - custom notifications by @ktsaou +# - syslog messages by @Ferroin +# - Microsoft Team notification by @tioumen +# - RocketChat notifications by @Hermsi1337 #3777 +# - Google Hangouts Chat notifications by @EnzoAkira and @hendrikhofstadt +# - Dynatrace Event by @illumine +# - Stackpulse Event by @thiagoftsm +# - Opsgenie by @thiagoftsm #9858 + +# ----------------------------------------------------------------------------- +# testing notifications + +if { [ "${1}" = "test" ] || [ "${2}" = "test" ]; } && [ "${#}" -le 2 ]; then + if [ "${2}" = "test" ]; then + recipient="${1}" + else + recipient="${2}" + fi + + [ -z "${recipient}" ] && recipient="sysadmin" + + id=1 + last="CLEAR" + test_res=0 + for x in "WARNING" "CRITICAL" "CLEAR"; do + echo >&2 + echo >&2 "# SENDING TEST ${x} ALARM TO ROLE: ${recipient}" + + "${0}" "${recipient}" "$(hostname)" 1 1 "${id}" "$(date +%s)" "test_alarm" "test.chart" "test.family" "${x}" "${last}" 100 90 "${0}" 1 
$((0 + id)) "units" "this is a test alarm to verify notifications work" "new value" "old value" "evaluated expression" "expression variable values" 0 0 + #shellcheck disable=SC2181 + if [ $? -ne 0 ]; then + echo >&2 "# FAILED" + test_res=1 + else + echo >&2 "# OK" + fi + + last="${x}" + id=$((id + 1)) + done + + exit $test_res +fi + +export PATH="${PATH}:/sbin:/usr/sbin:/usr/local/sbin" +export LC_ALL=C + +# ----------------------------------------------------------------------------- + +PROGRAM_NAME="$(basename "${0}")" + +logdate() { + date "+%Y-%m-%d %H:%M:%S" +} + +log() { + local status="${1}" + shift + + echo >&2 "$(logdate): ${PROGRAM_NAME}: ${status}: ${*}" + +} + +warning() { + log WARNING "${@}" +} + +error() { + log ERROR "${@}" +} + +info() { + log INFO "${@}" +} + +fatal() { + log FATAL "${@}" + exit 1 +} + +debug=${NETDATA_ALARM_NOTIFY_DEBUG-0} +debug() { + [ "${debug}" = "1" ] && log DEBUG "${@}" +} + +docurl() { + if [ -z "${curl}" ]; then + error "\${curl} is unset." + return 1 + fi + + if [ "${debug}" = "1" ]; then + echo >&2 "--- BEGIN curl command ---" + printf >&2 "%q " ${curl} "${@}" + echo >&2 + echo >&2 "--- END curl command ---" + + local out code ret + out=$(mktemp /tmp/netdata-health-alarm-notify-XXXXXXXX) + code=$(${curl} ${curl_options} --write-out "%{http_code}" --output "${out}" --silent --show-error "${@}") + ret=$? + echo >&2 "--- BEGIN received response ---" + cat >&2 "${out}" + echo >&2 + echo >&2 "--- END received response ---" + echo >&2 "RECEIVED HTTP RESPONSE CODE: ${code}" + rm "${out}" + echo "${code}" + return ${ret} + fi + + ${curl} ${curl_options} --write-out "%{http_code}" --output /dev/null --silent --show-error "${@}" + return $? +} + +# ----------------------------------------------------------------------------- +# List of all the notification mechanisms we support. +# Used in a couple of places to write more compact code. 
+ +method_names=" +email +pushover +pushbullet +telegram +slack +alerta +flock +discord +hipchat +twilio +messagebird +pd +fleep +syslog +custom +msteam +kavenegar +prowl +irc +awssns +rocketchat +sms +hangouts +dynatrace +matrix +" + +# ----------------------------------------------------------------------------- +# this is to be overwritten by the config file + +custom_sender() { + info "not sending custom notification for ${status} of '${host}.${chart}.${name}'" +} + +# ----------------------------------------------------------------------------- + +# check for BASH v4+ (required for associative arrays) +if [ ${BASH_VERSINFO[0]} -lt 4 ]; then + fatal "BASH version 4 or later is required (this is ${BASH_VERSION})." +fi + +# ----------------------------------------------------------------------------- +# defaults to allow running this script by hand + +[ -z "${NETDATA_USER_CONFIG_DIR}" ] && NETDATA_USER_CONFIG_DIR="@configdir_POST@" +[ -z "${NETDATA_STOCK_CONFIG_DIR}" ] && NETDATA_STOCK_CONFIG_DIR="@libconfigdir_POST@" +[ -z "${NETDATA_CACHE_DIR}" ] && NETDATA_CACHE_DIR="@cachedir_POST@" +[ -z "${NETDATA_REGISTRY_URL}" ] && NETDATA_REGISTRY_URL="https://registry.my-netdata.io" +[ -z "${NETDATA_REGISTRY_CLOUD_BASE_URL}" ] && NETDATA_REGISTRY_CLOUD_BASE_URL="https://app.netdata.cloud" + +# ----------------------------------------------------------------------------- +# parse command line parameters + +if [[ ${1} = "unittest" ]]; then + unittest=1 # enable unit testing mode + roles="${2}" # the role that should be used for unit testing + cfgfile="${3}" # the location of the config file to use for unit testing + status="${4}" # the current status : REMOVED, UNINITIALIZED, UNDEFINED, CLEAR, WARNING, CRITICAL + old_status="${5}" # the previous status: REMOVED, UNINITIALIZED, UNDEFINED, CLEAR, WARNING, CRITICAL +else + roles="${1}" # the roles that should be notified for this event + args_host="${2}" # the host generated this event + unique_id="${3}" # the unique id of 
this event + alarm_id="${4}" # the unique id of the alarm that generated this event + event_id="${5}" # the incremental id of the event, for this alarm id + when="${6}" # the timestamp this event occurred + name="${7}" # the name of the alarm, as given in netdata health.d entries + chart="${8}" # the name of the chart (type.id) + family="${9}" # the family of the chart + status="${10}" # the current status : REMOVED, UNINITIALIZED, UNDEFINED, CLEAR, WARNING, CRITICAL + old_status="${11}" # the previous status: REMOVED, UNINITIALIZED, UNDEFINED, CLEAR, WARNING, CRITICAL + value="${12}" # the current value of the alarm + old_value="${13}" # the previous value of the alarm + src="${14}" # the line number and file the alarm has been configured + duration="${15}" # the duration in seconds of the previous alarm state + non_clear_duration="${16}" # the total duration in seconds this is/was non-clear + units="${17}" # the units of the value + info="${18}" # a short description of the alarm + value_string="${19}" # friendly value (with units) + # shellcheck disable=SC2034 + # variable is unused, but https://github.com/netdata/netdata/pull/5164#discussion_r255572947 + old_value_string="${20}" # friendly old value (with units), previously named "old_value_string" + calc_expression="${21}" # contains the expression that was evaluated to trigger the alarm + calc_param_values="${22}" # the values of the parameters in the expression, at the time of the evaluation + total_warnings="${23}" # Total number of alarms in WARNING state + total_critical="${24}" # Total number of alarms in CRITICAL state +fi + +# ----------------------------------------------------------------------------- +# find a suitable hostname to use, if netdata did not supply a hostname + +if [ -z ${args_host} ]; then + this_host=$(hostname -s 2>/dev/null) + host="${this_host}" + args_host="${this_host}" +else + host="${args_host}" +fi + +# 
----------------------------------------------------------------------------- +# screen statuses we don't need to send a notification + +# don't do anything if this is not WARNING, CRITICAL or CLEAR +if [ "${status}" != "WARNING" ] && [ "${status}" != "CRITICAL" ] && [ "${status}" != "CLEAR" ]; then + info "not sending notification for ${status} of '${host}.${chart}.${name}'" + exit 1 +fi + +# don't do anything if this is CLEAR, but it was not WARNING or CRITICAL +if [ "${clear_alarm_always}" != "YES" ] && [ "${old_status}" != "WARNING" ] && [ "${old_status}" != "CRITICAL" ] && [ "${status}" = "CLEAR" ]; then + info "not sending notification for ${status} of '${host}.${chart}.${name}' (last status was ${old_status})" + exit 1 +fi + +# ----------------------------------------------------------------------------- +# load configuration + +# By default fetch images from the global public registry. +# This is required by default, since all notification methods need to download +# images via the Internet, and private registries might not be reachable. +# This can be overwritten at the configuration file. 
+images_base_url="https://registry.my-netdata.io" + +# curl options to use +curl_options="" + +# hostname handling +use_fqdn="NO" + +# needed commands +# if empty they will be searched in the system path +curl= +sendmail= + +# enable / disable features +for method_name in ${method_names^^}; do + declare SEND_${method_name}="YES" + declare DEFAULT_RECIPIENT_${method_name} +done + +for method_name in ${method_names}; do + declare -A role_recipients_${method_name} +done + +# slack configs +SLACK_WEBHOOK_URL= + +# Microsoft Team configs +MSTEAM_WEBHOOK_URL= + +# rocketchat configs +ROCKETCHAT_WEBHOOK_URL= + +# alerta configs +ALERTA_WEBHOOK_URL= +ALERTA_API_KEY= + +# flock configs +FLOCK_WEBHOOK_URL= + +# discord configs +DISCORD_WEBHOOK_URL= + +# pushover configs +PUSHOVER_APP_TOKEN= + +# pushbullet configs +PUSHBULLET_ACCESS_TOKEN= +PUSHBULLET_SOURCE_DEVICE= + +# twilio configs +TWILIO_ACCOUNT_SID= +TWILIO_ACCOUNT_TOKEN= +TWILIO_NUMBER= + +# hipchat configs +HIPCHAT_SERVER= +HIPCHAT_AUTH_TOKEN= + +# messagebird configs +MESSAGEBIRD_ACCESS_KEY= +MESSAGEBIRD_NUMBER= + +# kavenegar configs +KAVENEGAR_API_KEY= +KAVENEGAR_SENDER= + +# telegram configs +TELEGRAM_BOT_TOKEN= + +# kafka configs +SEND_KAFKA="YES" +KAFKA_URL= +KAFKA_SENDER_IP= + +# pagerduty.com configs +PD_SERVICE_KEY= +USE_PD_VERSION= + +# fleep.io configs +FLEEP_SENDER="${host}" + +# Amazon SNS configs +AWSSNS_MESSAGE_FORMAT= + +# Matrix configs +MATRIX_HOMESERVER= +MATRIX_ACCESSTOKEN= + +# syslog configs +SYSLOG_FACILITY= + +# email configs +EMAIL_SENDER= +EMAIL_CHARSET=$(locale charmap 2>/dev/null) +EMAIL_THREADING= +EMAIL_PLAINTEXT_ONLY= + +# irc configs +IRC_NICKNAME= +IRC_REALNAME= +IRC_NETWORK= + +# hangouts configs +declare -A HANGOUTS_WEBHOOK_URI +declare -A HANGOUTS_WEBHOOK_THREAD + +# dynatrace configs +DYNATRACE_SPACE= +DYNATRACE_SERVER= +DYNATRACE_TOKEN= +DYNATRACE_TAG_VALUE= +DYNATRACE_ANNOTATION_TYPE= +DYNATRACE_EVENT= +SEND_DYNATRACE= + +# stackpulse configs +STACKPULSE_WEBHOOK= + +# opsgenie 
configs +OPSGENIE_API_KEY= + +# load the stock and user configuration files +# these will overwrite the variables above + +if [ ${unittest} ]; then + if ! source "${cfgfile}"; then + error "Failed to load requested config file." + exit 1 + fi +else + for CONFIG in "${NETDATA_STOCK_CONFIG_DIR}/health_alarm_notify.conf" "${NETDATA_USER_CONFIG_DIR}/health_alarm_notify.conf"; do + if [ -f "${CONFIG}" ]; then + debug "Loading config file '${CONFIG}'..." + source "${CONFIG}" || error "Failed to load config file '${CONFIG}'." + else + warning "Cannot find file '${CONFIG}'." + fi + done +fi + +OPSGENIE_API_URL=${OPSGENIE_API_URL:-"https://api.opsgenie.com"} + +# If we didn't autodetect the character set for e-mail and it wasn't +# set by the user, we need to set it to a reasonable default. UTF-8 +# should be correct for almost all modern UNIX systems. +if [ -z ${EMAIL_CHARSET} ]; then + EMAIL_CHARSET="UTF-8" +fi + +# If we've been asked to use FQDN's for the URL's in the alarm, do so, +# unless we're sending an alarm for a child system which we can't get the +# FQDN of easily. +if [ "${use_fqdn}" = "YES" ] && [ "${host}" = "$(hostname -s 2>/dev/null)" ]; then + host="$(hostname -f 2>/dev/null)" +fi + +# ----------------------------------------------------------------------------- +# filter a recipient based on alarm event severity + +filter_recipient_by_criticality() { + local method="${1}" x="${2}" r s + shift + + r="${x/|*/}" # the recipient + s="${x/*|/}" # the severity required for notifying this recipient + + # no severity filtering for this person + [ "${r}" = "${s}" ] && return 0 + + # the severity is invalid + s="${s^^}" + if [ "${s}" != "CRITICAL" ]; then + error "SEVERITY FILTERING for ${x} VIA ${method}: invalid severity '${s,,}', only 'critical' is supported." + return 0 + fi + + # create the status tracking directory for this user + [ ! 
-d "${NETDATA_CACHE_DIR}/alarm-notify/${method}/${r}" ] && + mkdir -p "${NETDATA_CACHE_DIR}/alarm-notify/${method}/${r}" + + case "${status}" in + CRITICAL) + # make sure he will get future notifications for this alarm too + touch "${NETDATA_CACHE_DIR}/alarm-notify/${method}/${r}/${alarm_id}" + debug "SEVERITY FILTERING for ${x} VIA ${method}: ALLOW: the alarm is CRITICAL (will now receive next status change)" + return 0 + ;; + + WARNING) + if [ -f "${NETDATA_CACHE_DIR}/alarm-notify/${method}/${r}/${alarm_id}" ]; then + # we do not remove the file, so that he will get future notifications of this alarm + debug "SEVERITY FILTERING for ${x} VIA ${method}: ALLOW: recipient has been notified for this alarm in the past (will still receive next status change)" + return 0 + fi + ;; + + *) + if [ -f "${NETDATA_CACHE_DIR}/alarm-notify/${method}/${r}/${alarm_id}" ]; then + # remove the file, so that he will only receive notifications for CRITICAL states for this alarm + rm "${NETDATA_CACHE_DIR}/alarm-notify/${method}/${r}/${alarm_id}" + debug "SEVERITY FILTERING for ${x} VIA ${method}: ALLOW: recipient has been notified for this alarm (will only receive CRITICAL notifications from now on)" + return 0 + fi + ;; + esac + + debug "SEVERITY FILTERING for ${x} VIA ${method}: BLOCK: recipient should not receive this notification" + return 1 +} + +# ----------------------------------------------------------------------------- +# verify the delivery methods supported + +# check slack +[ -z "${SLACK_WEBHOOK_URL}" ] && SEND_SLACK="NO" + +# check rocketchat +[ -z "${ROCKETCHAT_WEBHOOK_URL}" ] && SEND_ROCKETCHAT="NO" + +# check alerta +[ -z "${ALERTA_WEBHOOK_URL}" ] && SEND_ALERTA="NO" + +# check flock +[ -z "${FLOCK_WEBHOOK_URL}" ] && SEND_FLOCK="NO" + +# check discord +[ -z "${DISCORD_WEBHOOK_URL}" ] && SEND_DISCORD="NO" + +# check pushover +[ -z "${PUSHOVER_APP_TOKEN}" ] && SEND_PUSHOVER="NO" + +# check pushbullet +[ -z "${PUSHBULLET_ACCESS_TOKEN}" ] && SEND_PUSHBULLET="NO" + +# 
check twilio +{ [ -z "${TWILIO_ACCOUNT_TOKEN}" ] || [ -z "${TWILIO_ACCOUNT_SID}" ] || [ -z "${TWILIO_NUMBER}" ]; } && SEND_TWILIO="NO" + +# check hipchat +[ -z "${HIPCHAT_AUTH_TOKEN}" ] && SEND_HIPCHAT="NO" + +# check messagebird +{ [ -z "${MESSAGEBIRD_ACCESS_KEY}" ] || [ -z "${MESSAGEBIRD_NUMBER}" ]; } && SEND_MESSAGEBIRD="NO" + +# check kavenegar +{ [ -z "${KAVENEGAR_API_KEY}" ] || [ -z "${KAVENEGAR_SENDER}" ]; } && SEND_KAVENEGAR="NO" + +# check telegram +[ -z "${TELEGRAM_BOT_TOKEN}" ] && SEND_TELEGRAM="NO" + +# check kafka +{ [ -z "${KAFKA_URL}" ] || [ -z "${KAFKA_SENDER_IP}" ]; } && SEND_KAFKA="NO" + +# check irc +[ -z "${IRC_NETWORK}" ] && SEND_IRC="NO" + +# check hangouts +[ ${#HANGOUTS_WEBHOOK_URI[@]} -eq 0 ] && SEND_HANGOUTS="NO" + +# check fleep +#shellcheck disable=SC2153 +{ [ -z "${FLEEP_SERVER}" ] || [ -z "${FLEEP_SENDER}" ]; } && SEND_FLEEP="NO" + +# check dynatrace +{ [ -z "${DYNATRACE_SPACE}" ] || + [ -z "${DYNATRACE_SERVER}" ] || + [ -z "${DYNATRACE_TOKEN}" ] || + [ -z "${DYNATRACE_TAG_VALUE}" ] || + [ -z "${DYNATRACE_EVENT}" ]; } && SEND_DYNATRACE="NO" + +# check opsgenie +[ -z "${OPSGENIE_API_KEY}" ] && SEND_OPSGENIE="NO" + +# check matrix +{ [ -z "${MATRIX_HOMESERVER}" ] || [ -z "${MATRIX_ACCESSTOKEN}" ]; } && SEND_MATRIX="NO" + +# check stackpulse +[ -z "${STACKPULSE_WEBHOOK}" ] && SEND_STACKPULSE="NO" + +if [ "${SEND_PUSHOVER}" = "YES" ] || + [ "${SEND_SLACK}" = "YES" ] || + [ "${SEND_ROCKETCHAT}" = "YES" ] || + [ "${SEND_ALERTA}" = "YES" ] || + [ "${SEND_PD}" = "YES" ] || + [ "${SEND_FLOCK}" = "YES" ] || + [ "${SEND_DISCORD}" = "YES" ] || + [ "${SEND_HIPCHAT}" = "YES" ] || + [ "${SEND_TWILIO}" = "YES" ] || + [ "${SEND_MESSAGEBIRD}" = "YES" ] || + [ "${SEND_KAVENEGAR}" = "YES" ] || + [ "${SEND_TELEGRAM}" = "YES" ] || + [ "${SEND_PUSHBULLET}" = "YES" ] || + [ "${SEND_KAFKA}" = "YES" ] || + [ "${SEND_FLEEP}" = "YES" ] || + [ "${SEND_PROWL}" = "YES" ] || + [ "${SEND_HANGOUTS}" = "YES" ] || + [ "${SEND_MATRIX}" = "YES" ] || + [ "${SEND_CUSTOM}" = 
"YES" ] || + [ "${SEND_MSTEAM}" = "YES" ] || + [ "${SEND_DYNATRACE}" = "YES" ] || + [ "${SEND_STACKPULSE}" = "YES" ] || + [ "${SEND_OPSGENIE}" = "YES" ]; then + # if we need curl, check for the curl command + if [ -z "${curl}" ]; then + curl="$(command -v curl 2>/dev/null)" + fi + if [ -z "${curl}" ]; then + error "Cannot find curl command in the system path. Disabling all curl based notifications." + SEND_PUSHOVER="NO" + SEND_PUSHBULLET="NO" + SEND_TELEGRAM="NO" + SEND_SLACK="NO" + SEND_MSTEAM="NO" + SEND_ROCKETCHAT="NO" + SEND_ALERTA="NO" + SEND_PD="NO" + SEND_FLOCK="NO" + SEND_DISCORD="NO" + SEND_TWILIO="NO" + SEND_HIPCHAT="NO" + SEND_MESSAGEBIRD="NO" + SEND_KAVENEGAR="NO" + SEND_KAFKA="NO" + SEND_FLEEP="NO" + SEND_PROWL="NO" + SEND_HANGOUTS="NO" + SEND_MATRIX="NO" + SEND_CUSTOM="NO" + SEND_DYNATRACE="NO" + SEND_STACKPULSE="NO" + SEND_OPSGENIE="NO" + fi +fi + +if [ "${SEND_SMS}" = "YES" ]; then + if [ -z "${sendsms}" ]; then + sendsms="$(command -v sendsms 2>/dev/null)" + fi + if [ -z "${sendsms}" ]; then + SEND_SMS="NO" + fi +fi +# if we need sendmail, check for the sendmail command +if [ "${SEND_EMAIL}" = "YES" ] && [ -z "${sendmail}" ]; then + sendmail="$(command -v sendmail 2>/dev/null)" + if [ -z "${sendmail}" ]; then + debug "Cannot find sendmail command in the system path. Disabling email notifications." + SEND_EMAIL="NO" + fi +fi + +# if we need logger, check for the logger command +if [ "${SEND_SYSLOG}" = "YES" ] && [ -z "${logger}" ]; then + logger="$(command -v logger 2>/dev/null)" + if [ -z "${logger}" ]; then + debug "Cannot find logger command in the system path. Disabling syslog notifications." + SEND_SYSLOG="NO" + fi +fi + +# if we need aws, check for the aws command +if [ "${SEND_AWSSNS}" = "YES" ] && [ -z "${aws}" ]; then + aws="$(command -v aws 2>/dev/null)" + if [ -z "${aws}" ]; then + debug "Cannot find aws command in the system path. Disabling Amazon SNS notifications." 
+ SEND_AWSSNS="NO" + fi +fi + +# ----------------------------------------------------------------------------- +# find the recipients' addresses per method + +# netdata may call us with multiple roles, and roles may have multiple but +# overlapping recipients - so, here we find the unique recipients. +for method_name in ${method_names}; do + send_var="SEND_${method_name^^}" + if [ "${!send_var}" = "NO" ]; then + continue + fi + + declare -A arr_var=() + + for x in ${roles//,/ }; do + # the roles 'silent' and 'disabled' mean: + # don't send a notification for this role + if [ "${x}" = "silent" ] || [ "${x}" = "disabled" ]; then + continue + fi + + role_recipients="role_recipients_${method_name}[$x]" + default_recipient_var="DEFAULT_RECIPIENT_${method_name^^}" + + a="${!role_recipients}" + [ -z "${a}" ] && a="${!default_recipient_var}" + for r in ${a//,/ }; do + [ "${r}" != "disabled" ] && filter_recipient_by_criticality ${method_name} "${r}" && arr_var[${r/|*/}]="1" + done + done + + # build the list of recipients + to_var="to_${method_name}" + declare to_${method_name}="${!arr_var[*]}" + + [ -z "${!to_var}" ] && declare ${send_var}="NO" +done + +# ----------------------------------------------------------------------------- +# handle fixup of the email recipient list. 
+ +fix_to_email() { + to_email= + while [ -n "${1}" ]; do + [ -n "${to_email}" ] && to_email="${to_email}, " + to_email="${to_email}${1}" + shift 1 + done +} + +# ${to_email} without quotes here +fix_to_email ${to_email} + +# ----------------------------------------------------------------------------- +# handle output if we're running in unit test mode +if [ ${unittest} ]; then + for method_name in ${method_names}; do + to_var="to_${method_name}" + echo "results: ${method_name}: ${!to_var}" + done + exit 0 +fi + +# ----------------------------------------------------------------------------- +# check that we have at least a method enabled +proceed=0 +for method in "${SEND_EMAIL}" \ + "${SEND_PUSHOVER}" \ + "${SEND_TELEGRAM}" \ + "${SEND_SLACK}" \ + "${SEND_ROCKETCHAT}" \ + "${SEND_ALERTA}" \ + "${SEND_FLOCK}" \ + "${SEND_DISCORD}" \ + "${SEND_TWILIO}" \ + "${SEND_HIPCHAT}" \ + "${SEND_MESSAGEBIRD}" \ + "${SEND_KAVENEGAR}" \ + "${SEND_PUSHBULLET}" \ + "${SEND_KAFKA}" \ + "${SEND_PD}" \ + "${SEND_FLEEP}" \ + "${SEND_PROWL}" \ + "${SEND_MATRIX}" \ + "${SEND_CUSTOM}" \ + "${SEND_IRC}" \ + "${SEND_HANGOUTS}" \ + "${SEND_AWSSNS}" \ + "${SEND_SYSLOG}" \ + "${SEND_SMS}" \ + "${SEND_MSTEAM}" \ + "${SEND_DYNATRACE}" \ + "${SEND_STACKPULSE}" \ + "${SEND_OPSGENIE}" ; do + + if [ "${method}" == "YES" ]; then + proceed=1 + break + fi +done +if [ "$proceed" -eq 0 ]; then + fatal "All notification methods are disabled. Not sending notification for host '${host}', chart '${chart}' to '${roles}' for '${name}' = '${value}' for status '${status}'." 
+fi + +# ----------------------------------------------------------------------------- +# get the date the alarm happened + +date=$(date --date=@${when} "${date_format}" 2>/dev/null) +[ -z "${date}" ] && date=$(date "${date_format}" 2>/dev/null) +[ -z "${date}" ] && date=$(date --date=@${when} 2>/dev/null) +[ -z "${date}" ] && date=$(date 2>/dev/null) + +# ---------------------------------------------------------------------------- +# prepare some extra headers if we've been asked to thread e-mails +if [ "${SEND_EMAIL}" == "YES" ] && [ "${EMAIL_THREADING}" != "NO" ]; then + email_thread_headers="In-Reply-To: <${chart}-${name}@${host}>\\r\\nReferences: <${chart}-${name}@${host}>" +else + email_thread_headers= +fi + +# ----------------------------------------------------------------------------- +# function to URL encode a string + +urlencode() { + local string="${1}" strlen encoded pos c o + + strlen=${#string} + for ((pos = 0; pos < strlen; pos++)); do + c=${string:pos:1} + case "${c}" in + [-_.~a-zA-Z0-9]) + o="${c}" + ;; + + *) + printf -v o '%%%02x' "'${c}" + ;; + esac + encoded+="${o}" + done + + REPLY="${encoded}" + echo "${REPLY}" +} + +# ----------------------------------------------------------------------------- +# function to convert a duration in seconds, to a human readable duration +# using DAYS, MINUTES, SECONDS + +duration4human() { + local s="${1}" d=0 h=0 m=0 ds="day" hs="hour" ms="minute" ss="second" ret + d=$((s / 86400)) + s=$((s - (d * 86400))) + h=$((s / 3600)) + s=$((s - (h * 3600))) + m=$((s / 60)) + s=$((s - (m * 60))) + + if [ ${d} -gt 0 ]; then + [ ${m} -ge 30 ] && h=$((h + 1)) + [ ${d} -gt 1 ] && ds="days" + [ ${h} -gt 1 ] && hs="hours" + if [ ${h} -gt 0 ]; then + ret="${d} ${ds} and ${h} ${hs}" + else + ret="${d} ${ds}" + fi + elif [ ${h} -gt 0 ]; then + [ ${s} -ge 30 ] && m=$((m + 1)) + [ ${h} -gt 1 ] && hs="hours" + [ ${m} -gt 1 ] && ms="minutes" + if [ ${m} -gt 0 ]; then + ret="${h} ${hs} and ${m} ${ms}" + else + ret="${h} ${hs}" + 
fi + elif [ ${m} -gt 0 ]; then + [ ${m} -gt 1 ] && ms="minutes" + [ ${s} -gt 1 ] && ss="seconds" + if [ ${s} -gt 0 ]; then + ret="${m} ${ms} and ${s} ${ss}" + else + ret="${m} ${ms}" + fi + else + [ ${s} -gt 1 ] && ss="seconds" + ret="${s} ${ss}" + fi + + REPLY="${ret}" + echo "${REPLY}" +} + +# ----------------------------------------------------------------------------- +# email sender + +send_email() { + local ret opts=() sender_email="${EMAIL_SENDER}" sender_name= + if [ "${SEND_EMAIL}" = "YES" ]; then + + if [ -n "${EMAIL_SENDER}" ]; then + if [[ ${EMAIL_SENDER} =~ ^\".*\"\ \<.*\>$ ]]; then + # the name includes double quotes + sender_email="$(echo "${EMAIL_SENDER}" | cut -d '<' -f 2 | cut -d '>' -f 1)" + sender_name="$(echo "${EMAIL_SENDER}" | cut -d '"' -f 2)" + elif [[ ${EMAIL_SENDER} =~ ^\'.*\'\ \<.*\>$ ]]; then + # the name includes single quotes + sender_email="$(echo "${EMAIL_SENDER}" | cut -d '<' -f 2 | cut -d '>' -f 1)" + sender_name="$(echo "${EMAIL_SENDER}" | cut -d "'" -f 2)" + elif [[ ${EMAIL_SENDER} =~ ^.*\ \<.*\>$ ]]; then + # the name does not have any quotes + sender_email="$(echo "${EMAIL_SENDER}" | cut -d '<' -f 2 | cut -d '>' -f 1)" + sender_name="$(echo "${EMAIL_SENDER}" | cut -d '<' -f 1)" + fi + fi + + [ -n "${sender_email}" ] && opts+=(-f "${sender_email}") + [ -n "${sender_name}" ] && sendmail --help 2>&1 | grep -q "\-F " && opts+=(-F "${sender_name}") + + if [ "${debug}" = "1" ]; then + echo >&2 "--- BEGIN sendmail command ---" + printf >&2 "%q " "${sendmail}" -t "${opts[@]}" + echo >&2 + echo >&2 "--- END sendmail command ---" + fi + + "${sendmail}" -t "${opts[@]}" + ret=$? + + if [ ${ret} -eq 0 ]; then + info "sent email notification for: ${host} ${chart}.${name} is ${status} to '${to_email}'" + return 0 + else + error "failed to send email notification for: ${host} ${chart}.${name} is ${status} to '${to_email}' with error code ${ret}." 
+ return 1 + fi + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# pushover sender + +send_pushover() { + local apptoken="${1}" usertokens="${2}" when="${3}" url="${4}" status="${5}" title="${6}" message="${7}" httpcode sent=0 user priority + + if [ "${SEND_PUSHOVER}" = "YES" ] && [ -n "${apptoken}" ] && [ -n "${usertokens}" ] && [ -n "${title}" ] && [ -n "${message}" ]; then + + # https://pushover.net/api + priority=-2 + case "${status}" in + CLEAR) priority=-1 ;; # low priority: no sound or vibration + WARNING) priority=0 ;; # normal priority: respect quiet hours + CRITICAL) priority=1 ;; # high priority: bypass quiet hours + *) priority=-2 ;; # lowest priority: no notification at all + esac + + for user in ${usertokens}; do + httpcode=$(docurl \ + --form-string "token=${apptoken}" \ + --form-string "user=${user}" \ + --form-string "html=1" \ + --form-string "title=${title}" \ + --form-string "message=${message}" \ + --form-string "timestamp=${when}" \ + --form-string "url=${url}" \ + --form-string "url_title=Open netdata dashboard to view the alarm" \ + --form-string "priority=${priority}" \ + https://api.pushover.net/1/messages.json) + + if [ "${httpcode}" = "200" ]; then + info "sent pushover notification for: ${host} ${chart}.${name} is ${status} to '${user}'" + sent=$((sent + 1)) + else + error "failed to send pushover notification for: ${host} ${chart}.${name} is ${status} to '${user}' with HTTP response status code ${httpcode}." 
+ fi + done + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# pushbullet sender + +send_pushbullet() { + local userapikey="${1}" source_device="${2}" recipients="${3}" url="${4}" title="${5}" message="${6}" httpcode sent=0 user + if [ "${SEND_PUSHBULLET}" = "YES" ] && [ -n "${userapikey}" ] && [ -n "${recipients}" ] && [ -n "${message}" ] && [ -n "${title}" ]; then + #https://docs.pushbullet.com/#create-push + for user in ${recipients}; do + httpcode=$(docurl \ + --header 'Access-Token: '${userapikey}'' \ + --header 'Content-Type: application/json' \ + --data-binary @<( + cat <<EOF + {"title": "${title}", + "type": "link", + "email": "${user}", + "body": "$(echo -n ${message})", + "url": "${url}", + "source_device_iden": "${source_device}"} +EOF + ) "https://api.pushbullet.com/v2/pushes" -X POST) + + if [ "${httpcode}" = "200" ]; then + info "sent pushbullet notification for: ${host} ${chart}.${name} is ${status} to '${user}'" + sent=$((sent + 1)) + else + error "failed to send pushbullet notification for: ${host} ${chart}.${name} is ${status} to '${user}' with HTTP response status code ${httpcode}." 
+ fi + done + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# kafka sender + +send_kafka() { + local httpcode sent=0 + if [ "${SEND_KAFKA}" = "YES" ]; then + httpcode=$(docurl -X POST \ + --data "{host_ip:\"${KAFKA_SENDER_IP}\",when:${when},name:\"${name}\",chart:\"${chart}\",family:\"${family}\",status:\"${status}\",old_status:\"${old_status}\",value:${value},old_value:${old_value},duration:${duration},non_clear_duration:${non_clear_duration},units:\"${units}\",info:\"${info}\"}" \ + "${KAFKA_URL}") + + if [ "${httpcode}" = "204" ]; then + info "sent kafka data for: ${host} ${chart}.${name} is ${status} and ip '${KAFKA_SENDER_IP}'" + sent=$((sent + 1)) + else + error "failed to send kafka data for: ${host} ${chart}.${name} is ${status} and ip '${KAFKA_SENDER_IP}' with HTTP response status code ${httpcode}." + fi + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# pagerduty.com sender + +send_pd() { + local recipients="${1}" sent=0 severity current_time payload url response_code + unset t + case ${status} in + CLEAR) t='resolve' ; severity='info' ;; + WARNING) t='trigger' ; severity='warning' ;; + CRITICAL) t='trigger' ; severity='critical' ;; + esac + + if [ ${SEND_PD} = "YES" ] && [ -n "${t}" ]; then + if [ "$(uname)" == "Linux" ]; then + current_time=$(date -d @${when} +'%Y-%m-%dT%H:%M:%S.000') + else + current_time=$(date -r ${when} +'%Y-%m-%dT%H:%M:%S.000') + fi + for PD_SERVICE_KEY in ${recipients}; do + d="${status} ${name} = ${value_string} - ${host}, ${family}" + if [ ${USE_PD_VERSION} = "2" ]; then + payload="$( + cat <<EOF + { + "payload" : { + "summary": "${info:0:1024}", + "source" : "${args_host}", + "severity" : "${severity}", + "timestamp" : "${current_time}", + "group" : "${family}", + "class" : "${chart}", + "custom_details": { + "value_w_units": "${value_string}", + 
"when": "${when}", + "duration" : "${duration}", + "roles": "${roles}", + "alarm_id" : "${alarm_id}", + "name" : "${name}", + "chart" : "${chart}", + "family" : "${family}", + "status" : "${status}", + "old_status" : "${old_status}", + "value" : "${value}", + "old_value" : "${old_value}", + "src" : "${src}", + "non_clear_duration" : "${non_clear_duration}", + "units" : "${units}", + "info" : "${info}" + } + }, + "routing_key": "${PD_SERVICE_KEY}", + "event_action": "${t}", + "dedup_key": "${unique_id}" + } +EOF + )" + url="https://events.pagerduty.com/v2/enqueue" + response_code="202" + else + payload="$( + cat <<EOF + { + "service_key": "${PD_SERVICE_KEY}", + "event_type": "${t}", + "incident_key" : "${alarm_id}", + "description": "${d}", + "details": { + "value_w_units": "${value_string}", + "when": "${when}", + "duration" : "${duration}", + "roles": "${roles}", + "alarm_id" : "${alarm_id}", + "name" : "${name}", + "chart" : "${chart}", + "family" : "${family}", + "status" : "${status}", + "old_status" : "${old_status}", + "value" : "${value}", + "old_value" : "${old_value}", + "src" : "${src}", + "non_clear_duration" : "${non_clear_duration}", + "units" : "${units}", + "info" : "${info}" + } + } +EOF + )" + url="https://events.pagerduty.com/generic/2010-04-15/create_event.json" + response_code="200" + fi + httpcode=$(docurl -X POST --data "${payload}" ${url}) + if [ "${httpcode}" = "${response_code}" ]; then + info "sent pagerduty notification for: ${host} ${chart}.${name} is ${status}'" + sent=$((sent + 1)) + else + error "failed to send pagerduty notification for: ${host} ${chart}.${name} is ${status}, with HTTP response status code ${httpcode}." 
+ fi + done + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# twilio sender + +send_twilio() { + local accountsid="${1}" accounttoken="${2}" twilionumber="${3}" recipients="${4}" title="${5}" message="${6}" httpcode sent=0 user + if [ "${SEND_TWILIO}" = "YES" ] && [ -n "${accountsid}" ] && [ -n "${accounttoken}" ] && [ -n "${twilionumber}" ] && [ -n "${recipients}" ] && [ -n "${message}" ] && [ -n "${title}" ]; then + #https://www.twilio.com/packages/labs/code/bash/twilio-sms + for user in ${recipients}; do + httpcode=$(docurl -X POST \ + --data-urlencode "From=${twilionumber}" \ + --data-urlencode "To=${user}" \ + --data-urlencode "Body=${title} ${message}" \ + -u "${accountsid}:${accounttoken}" \ + "https://api.twilio.com/2010-04-01/Accounts/${accountsid}/Messages.json") + + if [ "${httpcode}" = "201" ]; then + info "sent Twilio SMS for: ${host} ${chart}.${name} is ${status} to '${user}'" + sent=$((sent + 1)) + else + error "failed to send Twilio SMS for: ${host} ${chart}.${name} is ${status} to '${user}' with HTTP response status code ${httpcode}." + fi + done + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# hipchat sender + +send_hipchat() { + local authtoken="${1}" recipients="${2}" message="${3}" httpcode sent=0 room color msg_format notify + + # remove <small></small> from the message + message="${message//<small>/}" + message="${message//<\/small>/}" + + if [ "${SEND_HIPCHAT}" = "YES" ] && [ -n "${HIPCHAT_SERVER}" ] && [ -n "${authtoken}" ] && [ -n "${recipients}" ] && [ -n "${message}" ]; then + # Valid values: html, text. + # Defaults to 'html'. + msg_format="html" + + # Background color for message. Valid values: yellow, green, red, purple, gray, random. Defaults to 'yellow'. 
+ case "${status}" in + WARNING) color="yellow" ;; + CRITICAL) color="red" ;; + CLEAR) color="green" ;; + *) color="gray" ;; + esac + + # Whether this message should trigger a user notification (change the tab color, play a sound, notify mobile phones, etc). + # Each recipient's notification preferences are taken into account. + # Defaults to false. + notify="true" + + for room in ${recipients}; do + httpcode=$(docurl -X POST \ + -H "Content-type: application/json" \ + -H "Authorization: Bearer ${authtoken}" \ + -d "{\"color\": \"${color}\", \"from\": \"${host}\", \"message_format\": \"${msg_format}\", \"message\": \"${message}\", \"notify\": \"${notify}\"}" \ + "https://${HIPCHAT_SERVER}/v2/room/${room}/notification") + + if [ "${httpcode}" = "204" ]; then + info "sent HipChat notification for: ${host} ${chart}.${name} is ${status} to '${room}'" + sent=$((sent + 1)) + else + error "failed to send HipChat notification for: ${host} ${chart}.${name} is ${status} to '${room}' with HTTP response status code ${httpcode}." 
+ fi + done + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# messagebird sender + +send_messagebird() { + local accesskey="${1}" messagebirdnumber="${2}" recipients="${3}" title="${4}" message="${5}" httpcode sent=0 user + if [ "${SEND_MESSAGEBIRD}" = "YES" ] && [ -n "${accesskey}" ] && [ -n "${messagebirdnumber}" ] && [ -n "${recipients}" ] && [ -n "${message}" ] && [ -n "${title}" ]; then + #https://developers.messagebird.com/docs/messaging + for user in ${recipients}; do + httpcode=$(docurl -X POST \ + --data-urlencode "originator=${messagebirdnumber}" \ + --data-urlencode "recipients=${user}" \ + --data-urlencode "body=${title} ${message}" \ + --data-urlencode "datacoding=auto" \ + -H "Authorization: AccessKey ${accesskey}" \ + "https://rest.messagebird.com/messages") + + if [ "${httpcode}" = "201" ]; then + info "sent Messagebird SMS for: ${host} ${chart}.${name} is ${status} to '${user}'" + sent=$((sent + 1)) + else + error "failed to send Messagebird SMS for: ${host} ${chart}.${name} is ${status} to '${user}' with HTTP response status code ${httpcode}." 
+ fi + done + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# kavenegar sender + +send_kavenegar() { + local API_KEY="${1}" kavenegarsender="${2}" recipients="${3}" title="${4}" message="${5}" httpcode sent=0 user + if [ "${SEND_KAVENEGAR}" = "YES" ] && [ -n "${API_KEY}" ] && [ -n "${kavenegarsender}" ] && [ -n "${recipients}" ] && [ -n "${message}" ] && [ -n "${title}" ]; then + # http://api.kavenegar.com/v1/{API-KEY}/sms/send.json + for user in ${recipients}; do + httpcode=$(docurl -X POST http://api.kavenegar.com/v1/${API_KEY}/sms/send.json \ + --data-urlencode "sender=${kavenegarsender}" \ + --data-urlencode "receptor=${user}" \ + --data-urlencode "message=${title} ${message}") + + if [ "${httpcode}" = "200" ]; then + info "sent Kavenegar SMS for: ${host} ${chart}.${name} is ${status} to '${user}'" + sent=$((sent + 1)) + else + error "failed to send Kavenegar SMS for: ${host} ${chart}.${name} is ${status} to '${user}' with HTTP response status code ${httpcode}." 
+ fi + done + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# telegram sender + +send_telegram() { + local bottoken="${1}" chatids="${2}" message="${3}" httpcode sent=0 chatid emoji disableNotification="" + + if [ "${status}" = "CLEAR" ]; then disableNotification="--data-urlencode disable_notification=true"; fi + + case "${status}" in + WARNING) emoji="⚠️" ;; + CRITICAL) emoji="🔴" ;; + CLEAR) emoji="✅" ;; + *) emoji="⚪️" ;; + esac + + if [ "${SEND_TELEGRAM}" = "YES" ] && [ -n "${bottoken}" ] && [ -n "${chatids}" ] && [ -n "${message}" ]; then + for chatid in ${chatids}; do + # https://core.telegram.org/bots/api#sendmessage + httpcode=$(docurl ${disableNotification} \ + --data-urlencode "parse_mode=HTML" \ + --data-urlencode "disable_web_page_preview=true" \ + --data-urlencode "text=${emoji} ${message}" \ + "https://api.telegram.org/bot${bottoken}/sendMessage?chat_id=${chatid}") + + if [ "${httpcode}" = "200" ]; then + info "sent telegram notification for: ${host} ${chart}.${name} is ${status} to '${chatid}'" + sent=$((sent + 1)) + elif [ "${httpcode}" = "401" ]; then + error "failed to send telegram notification for: ${host} ${chart}.${name} is ${status} to '${chatid}': Wrong bot token." + else + error "failed to send telegram notification for: ${host} ${chart}.${name} is ${status} to '${chatid}' with HTTP response status code ${httpcode}." 
+ fi + done + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# Microsoft Teams sender + +send_msteam() { + + local webhook="${1}" channels="${2}" httpcode sent=0 channel color payload cur_webhook + + [ "${SEND_MSTEAM}" != "YES" ] && return 1 + + case "${status}" in + WARNING) icon="${MSTEAM_ICON_WARNING}" && color="${MSTEAM_COLOR_WARNING}" ;; + CRITICAL) icon="${MSTEAM_ICON_CRITICAL}" && color="${MSTEAM_COLOR_CRITICAL}" ;; + CLEAR) icon="${MSTEAM_ICON_CLEAR}" && color="${MSTEAM_COLOR_CLEAR}" ;; + *) icon="${MSTEAM_ICON_DEFAULT}" && color="${MSTEAM_COLOR_DEFAULT}" ;; + esac + + for channel in ${channels}; do + ## More details are available here regarding the payload syntax options : https://docs.microsoft.com/en-us/outlook/actionable-messages/message-card-reference + ## Online designer : https://acdesignerbeta.azurewebsites.net/ + payload="$( + cat <<EOF + { + "@context": "http://schema.org/extensions", + "@type": "MessageCard", + "themeColor": "${color}", + "title": "$icon Alert ${status} from netdata for ${host}", + "text": "${host} ${status_message}, ${chart} (_${family}_), *${alarm}*", + "potentialAction": [ + { + "@type": "OpenUri", + "name": "Netdata", + "targets": [ + { "os": "default", "uri": "${goto_url}" } + ] + } + ] + } +EOF + )" + + # Replace the CHANNEL placeholder in the webhook with the MS Teams channel name from the conf file. + # Use a per-iteration copy so the placeholder is still present for the next channel. + cur_webhook="${webhook//CHANNEL/${channel}}" + + httpcode=$(docurl -H "Content-Type: application/json" -d "${payload}" "${cur_webhook}") + + if [ "${httpcode}" = "200" ]; then + info "sent Microsoft Teams notification for: ${host} ${chart}.${name} is ${status} to '${cur_webhook}'" + sent=$((sent + 1)) + else + error "failed to send Microsoft Teams notification for: ${host} ${chart}.${name} is ${status} to '${cur_webhook}', with HTTP response status code ${httpcode}."
+ fi + done + + [ ${sent} -gt 0 ] && return 0 + + return 1 +} + +# slack sender + +send_slack() { + local webhook="${1}" channels="${2}" httpcode sent=0 channel color payload + + [ "${SEND_SLACK}" != "YES" ] && return 1 + + case "${status}" in + WARNING) color="warning" ;; + CRITICAL) color="danger" ;; + CLEAR) color="good" ;; + *) color="#777777" ;; + esac + + for channel in ${channels}; do + # Default entry in the recipient is without a hash in front (backwards-compatible). Accept specification of channel or user. + if [ "${channel::1}" != "#" ] && [ "${channel::1}" != "@" ]; then channel="#$channel"; fi + + # If channel is equal to "#" then do not send the channel attribute at all. Slack also defines channels and users in webhooks. + if [ "${channel}" = "#" ]; then + ch="" + chstr="without specifying a channel" + else + ch="\"channel\": \"${channel}\"," + chstr="to '${channel}'" + fi + + payload="$( + cat <<EOF + { + $ch + "username": "netdata on ${host}", + "icon_url": "${images_base_url}/images/banner-icon-144x144.png", + "text": "${host} ${status_message}, \`${chart}\` (_${family}_), *${alarm}*", + "attachments": [ + { + "fallback": "${alarm} - ${chart} (${family}) - ${info}", + "color": "${color}", + "title": "${alarm}", + "title_link": "${goto_url}", + "text": "${info}", + "fields": [ + { + "title": "${chart}", + "short": true + }, + { + "title": "${family}", + "short": true + } + ], + "thumb_url": "${image}", + "footer": "by ${host}", + "ts": ${when} + } + ] + } +EOF + )" + + httpcode=$(docurl -X POST --data-urlencode "payload=${payload}" "${webhook}") + if [ "${httpcode}" = "200" ]; then + info "sent slack notification for: ${host} ${chart}.${name} is ${status} ${chstr}" + sent=$((sent + 1)) + else + error "failed to send slack notification for: ${host} ${chart}.${name} is ${status} ${chstr}, with HTTP response status code ${httpcode}." 
+ fi + done + + [ ${sent} -gt 0 ] && return 0 + + return 1 +} + +# ----------------------------------------------------------------------------- +# rocketchat sender + +send_rocketchat() { + local webhook="${1}" channels="${2}" httpcode sent=0 channel color payload + + [ "${SEND_ROCKETCHAT}" != "YES" ] && return 1 + + case "${status}" in + WARNING) color="warning" ;; + CRITICAL) color="danger" ;; + CLEAR) color="good" ;; + *) color="#777777" ;; + esac + + for channel in ${channels}; do + payload="$( + cat <<EOF + { + "channel": "#${channel}", + "alias": "netdata on ${host}", + "avatar": "${images_base_url}/images/banner-icon-144x144.png", + "text": "${host} ${status_message}, \`${chart}\` (_${family}_), *${alarm}*", + "attachments": [ + { + "color": "${color}", + "title": "${alarm}", + "title_link": "${goto_url}", + "text": "${info}", + "fields": [ + { + "title": "${chart}", + "short": true, + "value": "chart" + }, + { + "title": "${family}", + "short": true, + "value": "family" + } + ], + "thumb_url": "${image}", + "ts": "${when}" + } + ] + } +EOF + )" + + httpcode=$(docurl -X POST --data-urlencode "payload=${payload}" "${webhook}") + if [ "${httpcode}" = "200" ]; then + info "sent rocketchat notification for: ${host} ${chart}.${name} is ${status} to '${channel}'" + sent=$((sent + 1)) + else + error "failed to send rocketchat notification for: ${host} ${chart}.${name} is ${status} to '${channel}', with HTTP response status code ${httpcode}." 
+ fi + done + + [ ${sent} -gt 0 ] && return 0 + + return 1 +} + +# ----------------------------------------------------------------------------- +# alerta sender + +send_alerta() { + local webhook="${1}" channels="${2}" httpcode sent=0 channel severity resource event payload auth + + [ "${SEND_ALERTA}" != "YES" ] && return 1 + + case "${status}" in + CRITICAL) severity="critical" ;; + WARNING) severity="warning" ;; + CLEAR) severity="cleared" ;; + *) severity="indeterminate" ;; + esac + + if [[ ${chart} == httpcheck* ]]; then + resource=$chart + event=$name + else + resource="${host}:${family}" + event="${chart}.${name}" + fi + + for channel in ${channels}; do + payload="$( + cat <<EOF + { + "resource": "${resource}", + "event": "${event}", + "environment": "${channel}", + "severity": "${severity}", + "service": ["Netdata"], + "group": "Performance", + "value": "${value_string}", + "text": "${info}", + "tags": ["alarm_id:${alarm_id}"], + "attributes": { + "roles": "${roles}", + "name": "${name}", + "chart": "${chart}", + "family": "${family}", + "source": "${src}", + "moreInfo": "<a href=\"${goto_url}\">View Netdata</a>" + }, + "origin": "netdata/${host}", + "type": "netdataAlarm", + "rawData": "${BASH_ARGV[@]}" + } +EOF + )" + + if [ -n "${ALERTA_API_KEY}" ]; then + auth="Key ${ALERTA_API_KEY}" + fi + + httpcode=$(docurl -X POST "${webhook}/alert" -H "Content-Type: application/json" -H "Authorization: $auth" --data "${payload}") + + if [ "${httpcode}" = "200" ] || [ "${httpcode}" = "201" ]; then + info "sent alerta notification for: ${host} ${chart}.${name} is ${status} to '${channel}'" + sent=$((sent + 1)) + elif [ "${httpcode}" = "202" ]; then + info "suppressed alerta notification for: ${host} ${chart}.${name} is ${status} to '${channel}'" + else + error "failed to send alerta notification for: ${host} ${chart}.${name} is ${status} to '${channel}', with HTTP response status code ${httpcode}." 
+ fi + done + + [ ${sent} -gt 0 ] && return 0 + + return 1 +} + +# ----------------------------------------------------------------------------- +# flock sender + +send_flock() { + local webhook="${1}" channels="${2}" httpcode sent=0 channel color payload + + [ "${SEND_FLOCK}" != "YES" ] && return 1 + + case "${status}" in + WARNING) color="warning" ;; + CRITICAL) color="danger" ;; + CLEAR) color="good" ;; + *) color="#777777" ;; + esac + + for channel in ${channels}; do + httpcode=$(docurl -X POST "${webhook}" -H "Content-Type: application/json" -d "{ + \"sendAs\": { + \"name\" : \"netdata on ${host}\", + \"profileImage\" : \"${images_base_url}/images/banner-icon-144x144.png\" + }, + \"text\": \"${host} *${status_message}*\", + \"timestamp\": \"${when}\", + \"attachments\": [ + { + \"description\": \"${chart} (${family}) - ${info}\", + \"color\": \"${color}\", + \"title\": \"${alarm}\", + \"url\": \"${goto_url}\", + \"text\": \"${info}\", + \"views\": { + \"image\": { + \"original\": { \"src\": \"${image}\", \"width\": 400, \"height\": 400 }, + \"thumbnail\": { \"src\": \"${image}\", \"width\": 50, \"height\": 50 }, + \"filename\": \"${image}\" + } + } + } + ] + }") + if [ "${httpcode}" = "200" ]; then + info "sent flock notification for: ${host} ${chart}.${name} is ${status} to '${channel}'" + sent=$((sent + 1)) + else + error "failed to send flock notification for: ${host} ${chart}.${name} is ${status} to '${channel}', with HTTP response status code ${httpcode}." 
+ fi + done + + [ ${sent} -gt 0 ] && return 0 + + return 1 +} + +# ----------------------------------------------------------------------------- +# discord sender + +send_discord() { + local webhook="${1}/slack" channels="${2}" httpcode sent=0 channel color payload username + + [ "${SEND_DISCORD}" != "YES" ] && return 1 + + case "${status}" in + WARNING) color="warning" ;; + CRITICAL) color="danger" ;; + CLEAR) color="good" ;; + *) color="#777777" ;; + esac + + for channel in ${channels}; do + username="netdata on ${host}" + [ ${#username} -gt 32 ] && username="${username:0:29}..." + + payload="$( + cat <<EOF + { + "channel": "#${channel}", + "username": "${username}", + "text": "${host} ${status_message}, \`${chart}\` (_${family}_), *${alarm}*", + "icon_url": "${images_base_url}/images/banner-icon-144x144.png", + "attachments": [ + { + "color": "${color}", + "title": "${alarm}", + "title_link": "${goto_url}", + "text": "${info}", + "fields": [ + { + "title": "${chart}", + "value": "${family}" + } + ], + "thumb_url": "${image}", + "footer_icon": "${images_base_url}/images/banner-icon-144x144.png", + "footer": "${host}", + "ts": ${when} + } + ] + } +EOF + )" + + httpcode=$(docurl -X POST --data-urlencode "payload=${payload}" "${webhook}") + if [ "${httpcode}" = "200" ]; then + info "sent discord notification for: ${host} ${chart}.${name} is ${status} to '${channel}'" + sent=$((sent + 1)) + else + error "failed to send discord notification for: ${host} ${chart}.${name} is ${status} to '${channel}', with HTTP response status code ${httpcode}." 
+ fi + done + + [ ${sent} -gt 0 ] && return 0 + + return 1 +} + +# ----------------------------------------------------------------------------- +# fleep sender + +send_fleep() { + local httpcode sent=0 webhooks="${1}" data message + if [ "${SEND_FLEEP}" = "YES" ]; then + message="${host} ${status_message}, \`${chart}\` (${family}), *${alarm}*\\n${info}" + + for hook in ${webhooks}; do + data="{ " + data="${data} 'message': '${message}', " + data="${data} 'user': '${FLEEP_SENDER}' " + data="${data} }" + + httpcode=$(docurl -X POST --data "${data}" "https://fleep.io/hook/${hook}") + + if [ "${httpcode}" = "200" ]; then + info "sent fleep data for: ${host} ${chart}.${name} is ${status} and user '${FLEEP_SENDER}'" + sent=$((sent + 1)) + else + error "failed to send fleep data for: ${host} ${chart}.${name} is ${status} and user '${FLEEP_SENDER}' with HTTP response status code ${httpcode}." + fi + done + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# Prowl sender + +send_prowl() { + local httpcode sent=0 data message keys prio=0 alarm_url event + if [ "${SEND_PROWL}" = "YES" ]; then + message="$(urlencode "${host} ${status_message}, \`${chart}\` (${family}), *${alarm}*\\n${info}")" + message="description=${message}" + keys="$(urlencode "$(echo "${1}" | tr ' ' ,)")" + keys="apikey=${keys}" + app="application=Netdata" + + case "${status}" in + CRITICAL) + prio=2 + ;; + WARNING) + prio=1 + ;; + esac + prio="priority=${prio}" + + alarm_url="$(urlencode ${goto_url})" + alarm_url="url=${alarm_url}" + event="$(urlencode "${host} ${status_message}")" + event="event=${event}" + + data="${keys}&${prio}&${alarm_url}&${app}&${event}&${message}" + + httpcode=$(docurl -X POST --data "${data}" "https://api.prowlapp.com/publicapi/add") + + if [ "${httpcode}" = "200" ]; then + info "sent prowl data for: ${host} ${chart}.${name} is ${status}" + sent=1 + else + error "failed to send prowl data for: 
${host} ${chart}.${name} is ${status} with error code ${httpcode}." + fi + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# irc sender + +send_irc() { + local NICKNAME="${1}" REALNAME="${2}" CHANNELS="${3}" NETWORK="${4}" SERVERNAME="${5}" MESSAGE="${6}" sent=0 channel color send_alarm reply_codes error + + if [ "${SEND_IRC}" = "YES" ] && [ -n "${NICKNAME}" ] && [ -n "${REALNAME}" ] && [ -n "${CHANNELS}" ] && [ -n "${NETWORK}" ] && [ -n "${SERVERNAME}" ]; then + case "${status}" in + WARNING) color="warning" ;; + CRITICAL) color="danger" ;; + CLEAR) color="good" ;; + *) color="#777777" ;; + esac + + SNDMESSAGE="${MESSAGE//$'\n'/", "}" + for CHANNEL in ${CHANNELS}; do + error=0 + send_alarm=$(echo -e "USER ${NICKNAME} guest ${REALNAME} ${SERVERNAME}\\nNICK ${NICKNAME}\\nJOIN ${CHANNEL}\\nPRIVMSG ${CHANNEL} :${SNDMESSAGE}\\nQUIT\\n" \ | nc "${NETWORK}" 6667) + reply_codes=$(echo "${send_alarm}" | cut -d ' ' -f 2 | grep -o '[0-9]*') + for code in ${reply_codes}; do + if [ "${code}" -ge 400 ] && [ "${code}" -le 599 ]; then + error=1 + break + fi + done + + if [ "${error}" -eq 0 ]; then + info "sent irc notification for: ${host} ${chart}.${name} is ${status} to '${CHANNEL}'" + sent=$((sent + 1)) + else + error "failed to send irc notification for: ${host} ${chart}.${name} is ${status} to '${CHANNEL}', with error code ${code}." + fi + done + fi + + [ ${sent} -gt 0 ] && return 0 + + return 1 +} + +# ----------------------------------------------------------------------------- +# Amazon SNS sender + +send_awssns() { + local targets="${1}" message='' sent=0 region='' + local default_format="${status} on ${host} at ${date}: ${chart} ${value_string}" + + [ "${SEND_AWSSNS}" = "YES" ] || return 1 + + message=${AWSSNS_MESSAGE_FORMAT:-${default_format}} + + for target in ${targets}; do + # Extract the region from the target ARN.
We need to explicitly specify the region so that it matches up correctly. + region="$(echo ${target} | cut -f 4 -d ':')" + if ${aws} sns publish --region "${region}" --subject "${host} ${status_message} - ${name//_/ } - ${chart}" --message "${message}" --target-arn ${target} &>/dev/null; then + info "sent Amazon SNS notification for: ${host} ${chart}.${name} is ${status} to '${target}'" + sent=$((sent + 1)) + else + error "failed to send Amazon SNS notification for: ${host} ${chart}.${name} is ${status} to '${target}'" + fi + done + + [ ${sent} -gt 0 ] && return 0 + + return 1 +} + +# ----------------------------------------------------------------------------- +# Matrix sender + +send_matrix() { + local homeserver="${1}" webhook accesstoken rooms="${2}" httpcode sent=0 payload + + [ "${SEND_MATRIX}" != "YES" ] && return 1 + [ -z "${MATRIX_ACCESSTOKEN}" ] && return 1 + + accesstoken="${MATRIX_ACCESSTOKEN}" + + case "${status}" in + WARNING) emoji="⚠️" ;; + CRITICAL) emoji="🔴" ;; + CLEAR) emoji="✅" ;; + *) emoji="⚪️" ;; + esac + + for room in ${rooms}; do + webhook="$homeserver/_matrix/client/r0/rooms/$(urlencode $room)/send/m.room.message?access_token=$accesstoken" + payload="$( + cat <<EOF + { + "msgtype": "m.notice", + "format": "org.matrix.custom.html", + "formatted_body": "${emoji} ${host} ${status_message} - <b>${name//_/ }</b><br>${chart} (${family})<br><a href=\"${goto_url}\">${alarm}</a><br><i>${info}</i>", + "body": "${emoji} ${host} ${status_message} - ${name//_/ } ${chart} (${family}) ${goto_url} ${alarm} ${info}" + } +EOF + )" + + httpcode=$(docurl -X POST --data "${payload}" "${webhook}") + if [ "${httpcode}" == "200" ]; then + info "sent Matrix notification for: ${host} ${chart}.${name} is ${status} to '${room}'" + sent=$((sent + 1)) + else + error "failed to send Matrix notification for: ${host} ${chart}.${name} is ${status} to '${room}', with HTTP response status code ${httpcode}." 
+ fi + done + + [ ${sent} -gt 0 ] && return 0 + + return 1 +} + +# ----------------------------------------------------------------------------- +# syslog sender + +send_syslog() { + local facility=${SYSLOG_FACILITY:-"local6"} level='info' targets="${1}" + local priority='' message='' server='' port='' prefix='' + local temp1='' temp2='' options='' + + [ "${SEND_SYSLOG}" = "YES" ] || return 1 + + if [ "${status}" = "CRITICAL" ]; then + level='crit' + elif [ "${status}" = "WARNING" ]; then + level='warning' + fi + + for target in ${targets}; do + priority="${facility}.${level}" + message='' + server='' + port='' + prefix='' + temp1='' + temp2='' + # start from the configured options for every target, so that the server and port of one target do not leak into the next + options="${logger_options}" + + prefix=$(echo ${target} | cut -d '/' -f 2) + temp1=$(echo ${target} | cut -d '/' -f 1) + + if [ ${prefix} != ${temp1} ]; then + if (echo ${temp1} | grep -q '@'); then + temp2=$(echo ${temp1} | cut -d '@' -f 1) + server=$(echo ${temp1} | cut -d '@' -f 2) + + if [ ${temp2} != ${server} ]; then + priority=${temp2} + fi + + port=$(echo ${server} | rev | cut -d ':' -f 1 | rev) + + if (echo ${server} | grep -E -q '\[.*\]'); then + if (echo ${port} | grep -q ']'); then + port='' + else + server=$(echo ${server} | rev | cut -d ':' -f 2- | rev) + fi + else + if [ ${port} = ${server} ]; then + port='' + else + server=$(echo ${server} | cut -d ':' -f 1) + fi + fi + else + priority=${temp1} + fi + fi + + message="${prefix} ${status} on ${host} at ${date}: ${chart} ${value_string}" + + if [ -n "${server}" ]; then + options="${options} -n ${server}" + if [ -n "${port}" ]; then + options="${options} -P ${port}" + fi + fi + + ${logger} -p ${priority} ${options} "${message}" + done + + return $?
+} + +# ----------------------------------------------------------------------------- +# SMS sender + +send_sms() { + local recipients="${1}" errcode errmessage sent=0 phone + + # Human readable SMS + local msg="${host} ${status_message}: ${chart} (${family}), ${alarm}" + + # limit it to 160 characters + msg="${msg:0:160}" + + if [ "${SEND_SMS}" = "YES" ] && [ -n "${sendsms}" ] && [ -n "${recipients}" ] && [ -n "${msg}" ]; then + # send one SMS per recipient phone number using the configured sendsms command + for phone in ${recipients}; do + errmessage=$($sendsms "${phone}" "$msg" 2>&1) + errcode=$? + if [ ${errcode} -eq 0 ]; then + info "sent smstools3 SMS for: ${host} ${chart}.${name} is ${status} to '${phone}'" + sent=$((sent + 1)) + else + error "failed to send smstools3 SMS for: ${host} ${chart}.${name} is ${status} to '${phone}' with error code ${errcode}: ${errmessage}." + fi + done + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# hangouts sender + +send_hangouts() { + local rooms="${1}" httpcode sent=0 room color payload webhook thread + + [ "${SEND_HANGOUTS}" != "YES" ] && return 1 + + case "${status}" in + WARNING) color="#ffa700" ;; + CRITICAL) color="#d62d20" ;; + CLEAR) color="#008744" ;; + *) color="#777777" ;; + esac + + for room in ${rooms}; do + if [ -z "${HANGOUTS_WEBHOOK_URI[$room]}" ] ; then + info "Can't send Hangouts notification for: ${host} ${chart}.${name} to room ${room}.
HANGOUTS_WEBHOOK_URI[$room] not defined" + else + if [ -n "${HANGOUTS_WEBHOOK_THREAD[$room]}" ]; then + thread="\"name\" : \"${HANGOUTS_WEBHOOK_THREAD[$room]}\"" + fi + webhook="${HANGOUTS_WEBHOOK_URI[$room]}" + payload="$( + cat <<EOF + { + "cards": [ + { + "header": { + "title": "Netdata on ${host}", + "imageUrl": "${images_base_url}/images/banner-icon-144x144.png", + "imageStyle": "IMAGE" + }, + "sections": [ + { + "header": "<b>${host}</b>", + "widgets": [ + { + "keyValue": { + "topLabel": "Status Message", + "content": "<b>${status_message}</b>", + "contentMultiline": "true", + "iconUrl": "${image}", + "onClick": { + "openLink": { + "url": "${goto_url}" + } + } + } + }, + { + "keyValue": { + "topLabel": "${chart} | ${family}", + "content": "<font color=${color}>${alarm}</font>", + "contentMultiline": "true" + } + } + ] + }, + { + "widgets": [ + { + "textParagraph": { + "text": "<font color=\"#0057e7\">@ ${date}\n<b>${info}</b></font>" + } + } + ] + }, + { + "widgets": [ + { + "buttons": [ + { + "textButton": { + "text": "Go to ${host}", + "onClick": { + "openLink": { + "url": "${goto_url}" + } + } + } + } + ] + } + ] + } + ] + } + ], + "thread": { + $thread + } + } +EOF + )" + + httpcode=$(docurl -H "Content-Type: application/json" -X POST -d "${payload}" "${webhook}") + + if [ "${httpcode}" = "200" ]; then + info "sent hangouts notification for: ${host} ${chart}.${name} is ${status} to '${room}'" + sent=$((sent + 1)) + else + error "failed to send hangouts notification for: ${host} ${chart}.${name} is ${status} to '${room}', with HTTP response status code ${httpcode}." 
+ fi + fi + done + + [ ${sent} -gt 0 ] && return 0 + + return 1 +} + +# ----------------------------------------------------------------------------- +# Dynatrace sender + +send_dynatrace() { + [ "${SEND_DYNATRACE}" != "YES" ] && return 1 + + local dynatrace_url="${DYNATRACE_SERVER}/e/${DYNATRACE_SPACE}/api/v1/events" + local description="NetData Notification for: ${host} ${chart}.${name} is ${status}" + local payload="" httpcode ret + + payload=$(cat <<EOF +{ + "title": "NetData Alarm from ${host}", + "source" : "${DYNATRACE_ANNOTATION_TYPE}", + "description" : "${description}", + "eventType": "${DYNATRACE_EVENT}", + "attachRules":{ + "tagRule":[{ + "meTypes":["HOST"], + "tags":["${DYNATRACE_TAG_VALUE}"] + }] + }, + "customProperties":{ + "description": "${description}" + } +} +EOF +) + + # echo ${payload} + + httpcode=$(docurl -X POST -H "Authorization: Api-token ${DYNATRACE_TOKEN}" -H "Content-Type: application/json" -d "${payload}" "${dynatrace_url}") + ret=$? + + + if [ ${ret} -eq 0 ]; then + if [ "${httpcode}" = "200" ]; then + info "sent ${DYNATRACE_EVENT} to ${DYNATRACE_SERVER}" + return 0 + else + warning "Dynatrace ${DYNATRACE_SERVER} responded with HTTP status code ${httpcode}; notification for: ${host} ${chart}.${name} is ${status} was not sent!" + return 1 + fi + else + error "failed to send ${DYNATRACE_EVENT} notification for: ${host} ${chart}.${name} is ${status} to ${DYNATRACE_SERVER} with error code ${ret}."
+ return 1 + fi +} + + +# ----------------------------------------------------------------------------- +# Stackpulse sender + +send_stackpulse() { + local payload httpcode oldv currv + [ "${SEND_STACKPULSE}" != "YES" ] && return 1 + + # We are sending null when values are nan to avoid errors while JSON message is parsed + [ "${old_value}" != "nan" ] && oldv="${old_value}" || oldv="null" + [ "${value}" != "nan" ] && currv="${value}" || currv="null" + + payload=$(cat <<EOF + { + "Node" : "${host}", + "Chart" : "${chart}", + "OldValue" : ${oldv}, + "Value" : ${currv}, + "Units" : "${units}", + "OldStatus" : "${old_status}", + "Status" : "${status}", + "Alarm" : "${name}", + "Date": ${when}, + "Duration": ${duration}, + "NonClearDuration": ${non_clear_duration}, + "Description" : "${status_message}, ${info}", + "CalcExpression" : "${calc_expression}", + "CalcParamValues" : "${calc_param_values}", + "TotalWarnings" : "${total_warnings}", + "TotalCritical" : "${total_critical}", + "ID" : ${alarm_id} + } +EOF +) + + httpcode=$(docurl -X POST -H "Content-Type: application/json" -d "${payload}" ${STACKPULSE_WEBHOOK}) + if [ "${httpcode}" = "200" ]; then + info "sent stackpulse notification for: ${host} ${chart}.${name} is ${status}" + else + error "failed to send stackpulse notification for: ${host} ${chart}.${name} is ${status}, with HTTP response status code ${httpcode}." 
+ return 1 + fi + + return 0 +} +# ----------------------------------------------------------------------------- +# Opsgenie sender + +send_opsgenie() { + local payload httpcode oldv currv + [ "${SEND_OPSGENIE}" != "YES" ] && return 1 + + if [ -z "${OPSGENIE_API_KEY}" ] ; then + info "Can't send Opsgenie notification, because OPSGENIE_API_KEY is not defined" + return 1 + fi + + # We are sending null when values are nan to avoid errors while JSON message is parsed + [ "${old_value}" != "nan" ] && oldv="${old_value}" || oldv="null" + [ "${value}" != "nan" ] && currv="${value}" || currv="null" + + payload=$(cat <<EOF + { + "host" : "${host}", + "unique_id" : "${unique_id}", + "alarmId" : ${alarm_id}, + "eventId" : ${event_id}, + "chart" : "${chart}", + "when": ${when}, + "name" : "${name}", + "family" : "${family}", + "status" : "${status}", + "old_status" : "${old_status}", + "value" : ${currv}, + "old_value" : ${oldv}, + "duration": ${duration}, + "non_clear_duration": ${non_clear_duration}, + "units" : "${units}", + "info" : "${status_message}, ${info}", + "calc_expression" : "${calc_expression}", + "total_warnings" : "${total_warnings}", + "total_critical" : "${total_critical}", + "src" : "${src}" + } +EOF +) + + httpcode=$(docurl -X POST -H "Content-Type: application/json" -d "${payload}" "${OPSGENIE_API_URL}/v1/json/integrations/webhooks/netdata?apiKey=${OPSGENIE_API_KEY}") + # https://docs.opsgenie.com/docs/alert-api#create-alert + if [ "${httpcode}" = "200" ]; then + info "sent opsgenie notification for: ${host} ${chart}.${name} is ${status}" + else + error "failed to send opsgenie notification for: ${host} ${chart}.${name} is ${status}, with HTTP error code ${httpcode}." 
+ return 1 + fi + + return 0 +} + +# ----------------------------------------------------------------------------- +# prepare the content of the notification + +# the url to send the user on click +urlencode "${args_host}" >/dev/null +url_host="${REPLY}" +urlencode "${chart}" >/dev/null +url_chart="${REPLY}" +urlencode "${family}" >/dev/null +url_family="${REPLY}" +urlencode "${name}" >/dev/null +url_name="${REPLY}" + +redirect_params="host=${url_host}&chart=${url_chart}&family=${url_family}&alarm=${url_name}&alarm_unique_id=${unique_id}&alarm_id=${alarm_id}&alarm_event_id=${event_id}&alarm_when=${when}" +GOTOCLOUD=0 + +if [ "${NETDATA_REGISTRY_URL}" == "https://registry.my-netdata.io" ]; then + if [ -z "${NETDATA_REGISTRY_UNIQUE_ID}" ]; then + if [ -f "@registrydir_POST@/netdata.public.unique.id" ]; then + NETDATA_REGISTRY_UNIQUE_ID="$(cat "@registrydir_POST@/netdata.public.unique.id")" + fi + fi + if [ -n "${NETDATA_REGISTRY_UNIQUE_ID}" ]; then + GOTOCLOUD=1 + fi +fi + +if [ ${GOTOCLOUD} -eq 0 ]; then + goto_url="${NETDATA_REGISTRY_URL}/goto-host-from-alarm.html?${redirect_params}" +else + # Temporarily disable alarm redirection, as the cloud endpoint no longer exists. This functionality will be restored after discussion on #9487. 
For now, just lead to netdata.cloud + #goto_url="${NETDATA_REGISTRY_CLOUD_BASE_URL}/alarms/redirect?agentID=${NETDATA_REGISTRY_UNIQUE_ID}&${redirect_params}" + goto_url="${NETDATA_REGISTRY_CLOUD_BASE_URL}" +fi + +# the severity of the alarm +severity="${status}" + +# the time the alarm was raised +duration4human ${duration} >/dev/null +duration_txt="${REPLY}" +duration4human ${non_clear_duration} >/dev/null +non_clear_duration_txt="${REPLY}" +raised_for="(was ${old_status,,} for ${duration_txt})" + +# the key status message +status_message="status unknown" + +# the color of the alarm +color="grey" + +# the alarm value +alarm="${name//_/ } = ${value_string}" + +# the image of the alarm +image="${images_base_url}/images/banner-icon-144x144.png" + +# prepare the title based on status +case "${status}" in +CRITICAL) + image="${images_base_url}/images/alert-128-red.png" + status_message="is critical" + color="#ca414b" + ;; + +WARNING) + image="${images_base_url}/images/alert-128-orange.png" + status_message="needs attention" + color="#ffc107" + ;; + +CLEAR) + image="${images_base_url}/images/check-mark-2-128-green.png" + status_message="recovered" + color="#77ca6d" + ;; +esac + +if [ "${status}" = "CLEAR" ]; then + severity="Recovered from ${old_status}" + if [ ${non_clear_duration} -gt ${duration} ]; then + raised_for="(alarm was raised for ${non_clear_duration_txt})" + fi + + # don't show the value when the status is CLEAR + # for certain alarms, this value might not have any meaning + alarm="${name//_/ } ${raised_for}" + +elif { [ "${old_status}" = "WARNING" ] && [ "${status}" = "CRITICAL" ]; }; then + severity="Escalated to ${status}" + if [ ${non_clear_duration} -gt ${duration} ]; then + raised_for="(alarm is raised for ${non_clear_duration_txt})" + fi + +elif { [ "${old_status}" = "CRITICAL" ] && [ "${status}" = "WARNING" ]; }; then + severity="Demoted to ${status}" + if [ ${non_clear_duration} -gt ${duration} ]; then + raised_for="(alarm is raised for 
${non_clear_duration_txt})" + fi + +else + raised_for= +fi + +# prepare HTML versions of elements +info_html= +[ -n "${info}" ] && info_html=" <small><br/>${info}</small>" + +raised_for_html= +[ -n "${raised_for}" ] && raised_for_html="<br/><small>${raised_for}</small>" + +# ----------------------------------------------------------------------------- +# send the slack notification + +# slack aggregates posts from the same username +# so we use "${host} ${status}" as the bot username, to make them diff + +send_slack "${SLACK_WEBHOOK_URL}" "${to_slack}" +SENT_SLACK=$? + +# ----------------------------------------------------------------------------- +# send the hangouts notification + +# hangouts aggregates posts from the same room +# so we use "${host} ${status}" as the room, to make them diff + +send_hangouts "${to_hangouts}" +SENT_HANGOUTS=$? + +# ----------------------------------------------------------------------------- +# send the Microsoft notification + +# Microsoft team aggregates posts from the same username +# so we use "${host} ${status}" as the bot username, to make them diff + +send_msteam "${MSTEAM_WEBHOOK_URL}" "${to_msteam}" +SENT_MSTEAM=$? + +# ----------------------------------------------------------------------------- +# send the rocketchat notification + +# rocketchat aggregates posts from the same username +# so we use "${host} ${status}" as the bot username, to make them diff + +send_rocketchat "${ROCKETCHAT_WEBHOOK_URL}" "${to_rocketchat}" +SENT_ROCKETCHAT=$? + +# ----------------------------------------------------------------------------- +# send the alerta notification + +# alerta aggregates posts from the same username +# so we use "${host} ${status}" as the bot username, to make them diff + +send_alerta "${ALERTA_WEBHOOK_URL}" "${to_alerta}" +SENT_ALERTA=$? 
+ +# ----------------------------------------------------------------------------- +# send the flock notification + +# flock aggregates posts from the same username +# so we use "${host} ${status}" as the bot username, to make them diff + +send_flock "${FLOCK_WEBHOOK_URL}" "${to_flock}" +SENT_FLOCK=$? + +# ----------------------------------------------------------------------------- +# send the discord notification + +# discord aggregates posts from the same username +# so we use "${host} ${status}" as the bot username, to make them diff + +send_discord "${DISCORD_WEBHOOK_URL}" "${to_discord}" +SENT_DISCORD=$? + +# ----------------------------------------------------------------------------- +# send the pushover notification + +send_pushover "${PUSHOVER_APP_TOKEN}" "${to_pushover}" "${when}" "${goto_url}" "${status}" "${host} ${status_message} - ${name//_/ } - ${chart}" " +<font color=\"${color}\"><b>${alarm}</b></font>${info_html}<br/> +<small><b>${chart}</b><br/>Chart<br/> </small> +<small><b>${family}</b><br/>Family<br/> </small> +<small><b>${severity}</b><br/>Severity<br/> </small> +<small><b>${date}${raised_for_html}</b><br/>Time<br/> </small> +<a href=\"${goto_url}\">View Netdata</a><br/> +<small><small>The source of this alarm is line ${src}</small></small> +" + +SENT_PUSHOVER=$? + +# ----------------------------------------------------------------------------- +# send the pushbullet notification + +send_pushbullet "${PUSHBULLET_ACCESS_TOKEN}" "${PUSHBULLET_SOURCE_DEVICE}" "${to_pushbullet}" "${goto_url}" "${host} ${status_message} - ${name//_/ } - ${chart}" "${alarm}\\n +Severity: ${severity}\\n +Chart: ${chart}\\n +Family: ${family}\\n +${date}\\n +The source of this alarm is line ${src}" + +SENT_PUSHBULLET=$? 
+ +# ----------------------------------------------------------------------------- +# send the twilio SMS + +send_twilio "${TWILIO_ACCOUNT_SID}" "${TWILIO_ACCOUNT_TOKEN}" "${TWILIO_NUMBER}" "${to_twilio}" "${host} ${status_message} - ${name//_/ } - ${chart}" "${alarm} +Severity: ${severity} +Chart: ${chart} +Family: ${family} +${info}" + +SENT_TWILIO=$? + +# ----------------------------------------------------------------------------- +# send the messagebird SMS + +send_messagebird "${MESSAGEBIRD_ACCESS_KEY}" "${MESSAGEBIRD_NUMBER}" "${to_messagebird}" "${host} ${status_message} - ${name//_/ } - ${chart}" "${alarm} +Severity: ${severity} +Chart: ${chart} +Family: ${family} +${info}" + +SENT_MESSAGEBIRD=$? + +# ----------------------------------------------------------------------------- +# send the kavenegar SMS + +send_kavenegar "${KAVENEGAR_API_KEY}" "${KAVENEGAR_SENDER}" "${to_kavenegar}" "${host} ${status_message} - ${name//_/ } - ${chart}" "${alarm} +Severity: ${severity} +Chart: ${chart} +Family: ${family} +${info}" + +SENT_KAVENEGAR=$? + +# ----------------------------------------------------------------------------- +# send the telegram.org message + +# https://core.telegram.org/bots/api#formatting-options +send_telegram "${TELEGRAM_BOT_TOKEN}" "${to_telegram}" "${host} ${status_message} - <b>${name//_/ }</b> +${chart} (${family}) +<a href=\"${goto_url}\">${alarm}</a> +<i>${info}</i>" + +SENT_TELEGRAM=$? + +# ----------------------------------------------------------------------------- +# send the kafka message + +send_kafka +SENT_KAFKA=$? + +# ----------------------------------------------------------------------------- +# send the pagerduty.com message + +send_pd "${to_pd}" +SENT_PD=$? + +# ----------------------------------------------------------------------------- +# send the fleep message + +send_fleep "${to_fleep}" +SENT_FLEEP=$? 
+ +# ----------------------------------------------------------------------------- +# send the Prowl message + +send_prowl "${to_prowl}" +SENT_PROWL=$? + +# ----------------------------------------------------------------------------- +# send the irc message + +send_irc "${IRC_NICKNAME}" "${IRC_REALNAME}" "${to_irc}" "${IRC_NETWORK}" "${host}" "${host} ${status_message} - ${name//_/ } - ${chart} ----- ${alarm} +Severity: ${severity} +Chart: ${chart} +Family: ${family} +${info}" + +SENT_IRC=$? + +# ----------------------------------------------------------------------------- +# send the SMS message with smstools3 + +send_sms "${to_sms}" + +SENT_SMS=$? + +# ----------------------------------------------------------------------------- +# send the custom message + +send_custom() { + # is it enabled? + [ "${SEND_CUSTOM}" != "YES" ] && return 1 + + # do we have any sender? + [ -z "${1}" ] && return 1 + + # call the custom_sender function + custom_sender "${@}" +} + +send_custom "${to_custom}" +SENT_CUSTOM=$? + +# ----------------------------------------------------------------------------- +# send hipchat message + +send_hipchat "${HIPCHAT_AUTH_TOKEN}" "${to_hipchat}" " \ +${host} ${status_message}<br/> \ +<b>${alarm}</b> ${info_html}<br/> \ +<b>${chart}</b> (family <b>${family}</b>)<br/> \ +<b>${date}${raised_for_html}</b><br/> \ +<a href=\\\"${goto_url}\\\">View netdata dashboard</a> \ +(source of alarm ${src}) \ +" + +SENT_HIPCHAT=$? + +# ----------------------------------------------------------------------------- +# send the Amazon SNS message + +send_awssns "${to_awssns}" + +SENT_AWSSNS=$? + +# ----------------------------------------------------------------------------- +# send the Matrix message +send_matrix "${MATRIX_HOMESERVER}" "${to_matrix}" + +SENT_MATRIX=$? + + +# ----------------------------------------------------------------------------- +# send the syslog message + +send_syslog "${to_syslog}" + +SENT_SYSLOG=$? 
+ +# ----------------------------------------------------------------------------- +# send the email + +IFS='' read -r -d '' email_plaintext_part <<EOF +Content-Type: text/plain; encoding=${EMAIL_CHARSET} +Content-Disposition: inline +Content-Transfer-Encoding: 8bit + +${host} ${status_message} + +${alarm} ${info} +${raised_for} + +Chart : ${chart} +Family : ${family} +Severity: ${severity} +URL : ${goto_url} +Source : ${src} +Date : ${date} +Notification generated on ${host} + +Evaluated Expression : ${calc_expression} +Expression Variables : ${calc_param_values} + +The host has ${total_warnings} WARNING and ${total_critical} CRITICAL alarm(s) raised. +EOF + +if [[ "${EMAIL_PLAINTEXT_ONLY}" == "YES" ]]; then + +send_email <<EOF +To: ${to_email} +Subject: ${host} ${status_message} - ${name//_/ } - ${chart} +MIME-Version: 1.0 +Content-Type: multipart/alternative; boundary="multipart-boundary" +${email_thread_headers} + +This is a MIME-encoded multipart message + +--multipart-boundary +${email_plaintext_part} +--multipart-boundary-- +EOF + +else + +IFS='' read -r -d '' email_html_part <<EOF +Content-Type: text/html; encoding=${EMAIL_CHARSET} +Content-Disposition: inline +Content-Transfer-Encoding: 8bit + +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> +<html xmlns="http://www.w3.org/1999/xhtml" style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; box-sizing: border-box; font-size: 14px; margin: 0; padding: 0;"> +<body style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 14px; width: 100% !important; min-height: 100%; line-height: 1.6; background: #f6f6f6; margin:0; padding: 0;"> +<table> + <tbody> + <tr> + <td style="vertical-align: top;" valign="top"></td> + <td width="700" style="vertical-align: top; display: block !important; max-width: 700px !important; clear: both !important; margin: 0 auto; padding: 0;" valign="top"> + <div style="max-width: 
700px; display: block; margin: 0 auto; padding: 20px;"> + <table width="100%" cellpadding="0" cellspacing="0" style="background: #fff; border: 1px solid #e9e9e9;"> + <tbody> + <tr> + <td bgcolor="#eee" style="padding: 5px 20px 5px 20px; background-color: #eee;"> + <div style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 20px; color: #777; font-weight: bold;">netdata notification</div> + </td> + </tr> + <tr> + <td bgcolor="${color}" style="font-size: 16px; vertical-align: top; font-weight: 400; text-align: center; margin: 0; padding: 10px; color: #ffffff; background: ${color} !important; border: 1px solid ${color}; border-top-color: ${color};" align="center" valign="top"> + <h1 style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-weight: 400; margin: 0;">${host} ${status_message}</h1> + </td> + </tr> + <tr> + <td style="vertical-align: top;" valign="top"> + <div style="margin: 0; padding: 20px; max-width: 700px;"> + <table width="100%" cellpadding="0" cellspacing="0" style="max-width:700px"> + <tbody> + <tr> + <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 18px; vertical-align: top; margin: 0; padding:0 0 20px;" align="left" valign="top"> + <span>${chart}</span> + <span style="display: block; color: #666666; font-size: 12px; font-weight: 300; line-height: 1; text-transform: uppercase;">Chart</span> + </td> + </tr> + <tr style="margin: 0; padding: 0;"> + <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 18px; vertical-align: top; margin: 0; padding: 0 0 20px;" align="left" valign="top"> + <span><b>${alarm}</b>${info_html}</span> + <span style="display: block; color: #666666; font-size: 12px; font-weight: 300; line-height: 1; text-transform: uppercase;">Alarm</span> + </td> + </tr> + <tr> + <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 18px; vertical-align: top; margin: 0; padding: 0 0 20px;" align="left" 
valign="top"> + <span>${family}</span> + <span style="display: block; color: #666666; font-size: 12px; font-weight: 300; line-height: 1; text-transform: uppercase;">Family</span> + </td> + </tr> + <tr style="margin: 0; padding: 0;"> + <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 18px; vertical-align: top; margin: 0; padding: 0 0 20px;" align="left" valign="top"> + <span>${severity}</span> + <span style="display: block; color: #666666; font-size: 12px; font-weight: 300; line-height: 1; text-transform: uppercase;">Severity</span> + </td> + </tr> + <tr style="margin: 0; padding: 0;"> + <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 18px; vertical-align: top; margin: 0; padding: 0 0 20px;" align="left" valign="top"><span>${date}</span> + <span>${raised_for_html}</span> <span style="display: block; color: #666666; font-size: 12px; font-weight: 300; line-height: 1; text-transform: uppercase;">Time</span> + </td> + </tr> + <tr style="margin: 0; padding: 0;"> + <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 18px; vertical-align: top; margin: 0; padding: 0 0 20px;" align="left" valign="top"> + <span>${calc_expression}</span> + <span style="display: block; color: #666666; font-size: 12px; font-weight: 300; line-height: 1; text-transform: uppercase;">Evaluated Expression</span> + </td> + </tr> + <tr style="margin: 0; padding: 0;"> + <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 18px; vertical-align: top; margin: 0; padding: 0 0 20px;" align="left" valign="top"> + <span>${calc_param_values}</span> + <span style="display: block; color: #666666; font-size: 12px; font-weight: 300; line-height: 1; text-transform: uppercase;">Expression Variables</span> + </td> + </tr> + <tr style="margin: 0; padding: 0;"> + <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 18px; vertical-align: top; margin: 0; padding: 0 
0 20px;" align="left" valign="top"> + The host has ${total_warnings} WARNING and ${total_critical} CRITICAL alarm(s) raised. + </td> + </tr> + + <tr style="margin: 0; padding: 0;"> + <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 18px; vertical-align: top; margin: 0; padding: 0 0 20px;"> + <a href="${goto_url}" style="font-size: 14px; color: #ffffff; text-decoration: none; line-height: 1.5; font-weight: bold; text-align: center; display: inline-block; text-transform: capitalize; background: #35568d; border-width: 1px; border-style: solid; border-color: #2b4c86; margin: 0; padding: 10px 15px;" target="_blank">View Netdata</a> + </td> + </tr> + <tr style="text-align: center; margin: 0; padding: 0;"> + <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 11px; vertical-align: top; margin: 0; padding: 10px 0 0 0; color: #666666;" align="center" valign="bottom">The source of this alarm is line <code>${src}</code><br/>(alarms are configurable, edit this file to adapt the alarm to your needs) + </td> + </tr> + <tr style="text-align: center; margin: 0; padding: 0;"> + <td style="font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif; font-size: 12px; vertical-align: top; margin:0; padding: 20px 0 0 0; color: #666666; border-top: 1px solid #f0f0f0;" align="center" valign="bottom">Sent by + <a href="https://mynetdata.io/" target="_blank">netdata</a>, the real-time performance and health monitoring, on <code>${host}</code>. 
+ </td> + </tr> + </tbody> + </table> + </div> + </td> + </tr> + </tbody> + </table> + </div> + </td> + </tr> + </tbody> +</table> +</body> +</html> +EOF + +send_email <<EOF +To: ${to_email} +Subject: ${host} ${status_message} - ${name//_/ } - ${chart} +MIME-Version: 1.0 +Content-Type: multipart/alternative; boundary="multipart-boundary" +${email_thread_headers} + +This is a MIME-encoded multipart message + +--multipart-boundary +${email_plaintext_part} +--multipart-boundary +${email_html_part} +--multipart-boundary-- +EOF + +fi + +SENT_EMAIL=$? + +# ----------------------------------------------------------------------------- +# send the EVENT to Dynatrace +send_dynatrace "${host}" "${chart}" "${name}" "${status}" +SENT_DYNATRACE=$? + +# ----------------------------------------------------------------------------- +# send the EVENT to Stackpulse +send_stackpulse +SENT_STACKPULSE=$? + +# ----------------------------------------------------------------------------- +# send messages to Opsgenie +send_opsgenie +SENT_OPSGENIE=$? 
+ +# ----------------------------------------------------------------------------- +# let netdata know +for state in "${SENT_EMAIL}" \ + "${SENT_PUSHOVER}" \ + "${SENT_TELEGRAM}" \ + "${SENT_SLACK}" \ + "${SENT_HANGOUTS}" \ + "${SENT_ROCKETCHAT}" \ + "${SENT_ALERTA}" \ + "${SENT_FLOCK}" \ + "${SENT_DISCORD}" \ + "${SENT_TWILIO}" \ + "${SENT_HIPCHAT}" \ + "${SENT_MESSAGEBIRD}" \ + "${SENT_KAVENEGAR}" \ + "${SENT_PUSHBULLET}" \ + "${SENT_KAFKA}" \ + "${SENT_PD}" \ + "${SENT_FLEEP}" \ + "${SENT_PROWL}" \ + "${SENT_CUSTOM}" \ + "${SENT_IRC}" \ + "${SENT_AWSSNS}" \ + "${SENT_MATRIX}" \ + "${SENT_SYSLOG}" \ + "${SENT_SMS}" \ + "${SENT_MSTEAM}" \ + "${SENT_DYNATRACE}" \ + "${SENT_STACKPULSE}" \ + "${SENT_OPSGENIE}"; do + if [ "${state}" -eq 0 ]; then + # we sent something + exit 0 + fi +done +# we did not send anything +exit 1 diff --git a/health/notifications/alarm-test.sh b/health/notifications/alarm-test.sh new file mode 100755 index 0000000..828aa75 --- /dev/null +++ b/health/notifications/alarm-test.sh @@ -0,0 +1,12 @@ +#!/usr/bin/env bash + +# netdata +# real-time performance and health monitoring, done right! +# (C) 2017 Costa Tsaousis <costa@tsaousis.gr> +# SPDX-License-Identifier: GPL-3.0-or-later +# +# Script to test alarm notifications for netdata + +dir="$(dirname "${0}")" +"${dir}/alarm-notify.sh" test "${1}" +exit $? 
diff --git a/health/notifications/alerta/Makefile.inc b/health/notifications/alerta/Makefile.inc new file mode 100644 index 0000000..10f26b0 --- /dev/null +++ b/health/notifications/alerta/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + alerta/README.md \ + alerta/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/alerta/README.md b/health/notifications/alerta/README.md new file mode 100644 index 0000000..8f1679a --- /dev/null +++ b/health/notifications/alerta/README.md @@ -0,0 +1,94 @@ +<!-- +title: "alerta.io" +description: "Send alarm notifications to Alerta to see the latest health status updates from multiple nodes in a single interface." +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/alerta/README.md +--> + +# alerta.io + +The [Alerta](https://alerta.io) monitoring system is a tool used to +consolidate and de-duplicate alerts from multiple sources for quick +‘at-a-glance’ visualisation. With just one system you can monitor +alerts from many other monitoring tools on a single screen. + +![](https://docs.alerta.io/en/latest/_images/alerta-screen-shot-3.png) + +Netdata alarms can be sent to Alerta so you can see in one place +alerts coming from many Netdata hosts, or from a multi-host +Netdata configuration. The big advantage over other notification +systems is that there is a main view of all active alarms with +the most recent state, and it is also possible to view alarm history. + +## Deploying Alerta + +It is recommended to set up Alerta on a separate server, VM or +container. If you already run an Nginx or Apache server in your organization, +it is recommended to proxy to this new server. + +The easiest way to install Alerta is to use the Docker image available +on [Docker hub][1].
Alternatively, follow the ["getting started"][2] +tutorial to deploy Alerta to an Ubuntu server. More advanced +configurations are out of scope of this tutorial, but information +about different deployment scenarios can be found in the [docs][3]. + +[1]: https://hub.docker.com/r/alerta/alerta-web/ + +[2]: http://alerta.readthedocs.io/en/latest/gettingstarted/tutorial-1-deploy-alerta.html + +[3]: http://docs.alerta.io/en/latest/deployment.html + +## Send alarms to Alerta + +Step 1. Create an API key (if authentication is enabled) + +You will need an API key to send messages from any source, if +Alerta is configured to use authentication (recommended). To +create an API key go to "Configuration -> API Keys" and create +a new API key called "netdata" with `write:alerts` permission. + +Step 2. Configure Netdata to send alarms to Alerta + +On your system run: + +```sh +/etc/netdata/edit-config health_alarm_notify.conf +``` + +and modify the file as below: + +``` +# enable/disable sending alerta notifications +SEND_ALERTA="YES" + +# here set your alerta server API url +# this is the API url you defined when you installed the Alerta server, +# it is the same for all users. Do not include a trailing slash. +ALERTA_WEBHOOK_URL="http://yourserver/alerta/api" + +# Login with an administrative user to your Alerta server and create an API KEY +# with write permissions.
+ALERTA_API_KEY="INSERT_YOUR_API_KEY_HERE" + +# you can define environments in /etc/alertad.conf option ALLOWED_ENVIRONMENTS +# standard environments are Production and Development +# if a role's recipients are not configured, a notification will be sent to +# this Environment (empty = do not send a notification for unconfigured roles): +DEFAULT_RECIPIENT_ALERTA="Production" +``` + +## Test alarms + +We can test alarms using the standard approach: + +```sh +/opt/netdata/netdata-plugins/plugins.d/alarm-notify.sh test +``` + +Note: Netdata will send 3 alarms, and because the last alarm is "CLEAR", +you will not see them on the main Alerta page. To see them, select +"closed" alarms in the top-right lookup. A small change to `alarm-notify.sh` +that lets us test each state one by one would be useful. + +For more information see <https://docs.alerta.io> + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Falerta%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/health/notifications/awssns/Makefile.inc b/health/notifications/awssns/Makefile.inc new file mode 100644 index 0000000..ee86f4b --- /dev/null +++ b/health/notifications/awssns/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + awssns/README.md \ + awssns/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/awssns/README.md b/health/notifications/awssns/README.md new file mode 100644 index 0000000..c682513 --- /dev/null +++ b/health/notifications/awssns/README.md @@ -0,0 +1,43 @@ +<!-- +title: "Amazon SNS" +description: "hello" +custom_edit_url:
https://github.com/netdata/netdata/edit/master/health/notifications/awssns/README.md +--> + +# Amazon SNS + +As part of its AWS suite, Amazon provides a notification broker service called 'Simple Notification Service' or SNS. Amazon SNS works similarly to Netdata's own notification system, allowing dispatch of a single notification to multiple subscribers of different types. Among other things, SNS supports sending notifications to: + +- Email addresses. +- Mobile phones via SMS. +- HTTP or HTTPS web hooks. +- AWS Lambda functions. +- AWS SQS queues. +- Mobile applications via push notifications. + +To get this working, you will need: + +- The Amazon Web Services CLI tools. Most distributions provide these with the package name `awscli`. +- An actual home directory for the user you run Netdata as, instead of just using `/` as a home directory. Setting this up is distribution specific. `/var/lib/netdata` is the recommended directory (because the permissions will already be correct) if you are using a dedicated user (which is how most distributions work). +- An Amazon SNS topic to send notifications to with one or more subscribers. The [Getting + Started](https://docs.aws.amazon.com/sns/latest/dg/sns-getting-started.html) section of the Amazon SNS documentation + covers the basics of how to set this up. Make note of the Topic ARN when you create the topic. +- While not mandatory, it is highly recommended to create a dedicated IAM user on your account for Netdata to send notifications. This user needs to have programmatic access, and should only be allowed to access SNS. If you're really paranoid, you can create one for each system or group of systems. + +Once you have all the above, run the following command as the user Netdata runs under: + +``` +aws configure +``` + +This will prompt you for the access key and secret key for accessing Amazon SNS (as well as the default region and output format, but you can leave those blank because we don't use them).
+ +Once that's done, you're ready to go and can specify the desired topic ARN as a recipient. + +Notes: + +- Netdata's native email notification support is far better in almost all respects than its support through Amazon SNS. If you want email notifications, use the native support, not SNS. +- If you need to change the notification format for SNS notifications, you can do so by specifying the format in `AWSSNS_MESSAGE_FORMAT` in the configuration. This variable supports all the same variables you can use in custom notifications. +- While Amazon SNS supports sending differently formatted messages for different delivery methods, Netdata does not currently support this functionality. + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Fawssns%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/health/notifications/custom/Makefile.inc b/health/notifications/custom/Makefile.inc new file mode 100644 index 0000000..c64ebda --- /dev/null +++ b/health/notifications/custom/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + custom/README.md \ + custom/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/custom/README.md b/health/notifications/custom/README.md new file mode 100644 index 0000000..04376d5 --- /dev/null +++ b/health/notifications/custom/README.md @@ -0,0 +1,88 @@ +<!-- +title: "Custom" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/custom/README.md +--> + +# Custom + +Netdata allows you to send custom notifications to any endpoint you choose.
+ +To configure custom notifications, you will need to customize `health_alarm_notify.conf`. You can look at the other senders in `/usr/libexec/netdata/plugins.d/alarm-notify.sh` for examples of how to modify the `custom_sender()` function in `health_alarm_notify.conf`. When changing any configuration file, follow the instructions to [persist your configuration](/docs/configuration-guide.md#persist-my-configuration). + +As with other notifications, you will also need to define the recipient list in `DEFAULT_RECIPIENT_CUSTOM` and/or the `role_recipients_custom` array. + +The following is a sample `custom_sender` function in `health_alarm_notify.conf` that sends an SMS through an imaginary HTTPS endpoint of an SMS gateway: + +``` + custom_sender() { + # example human readable SMS + local msg="${host} ${status_message}: ${alarm} ${raised_for}" + + # limit it to 160 characters and encode it for use in a URL + urlencode "${msg:0:160}" >/dev/null; msg="${REPLY}" + + # a space separated list of the recipients to send alarms to + to="${1}" + + for phone in ${to}; do + httpcode=$(docurl -X POST \ + --data-urlencode "From=XXX" \ + --data-urlencode "To=${phone}" \ + --data-urlencode "Body=${msg}" \ + -u "${accountsid}:${accounttoken}" \ + https://domain.website.com/) + + if [ "${httpcode}" = "200" ]; then + info "sent custom notification ${msg} to ${phone}" + sent=$((sent + 1)) + else + error "failed to send custom notification ${msg} to ${phone} with HTTP error code ${httpcode}."
+ fi + done +} +``` + +Variables available to the custom_sender: + +- `${to_custom}` the list of recipients for the alarm +- `${host}` the host that generated this event +- `${url_host}` same as `${host}` but URL encoded +- `${unique_id}` the unique id of this event +- `${alarm_id}` the unique id of the alarm that generated this event +- `${event_id}` the incremental id of the event, for this alarm id +- `${when}` the timestamp at which this event occurred +- `${name}` the name of the alarm, as given in Netdata health.d entries +- `${url_name}` same as `${name}` but URL encoded +- `${chart}` the name of the chart (type.id) +- `${url_chart}` same as `${chart}` but URL encoded +- `${family}` the family of the chart +- `${url_family}` same as `${family}` but URL encoded +- `${status}` the current status: REMOVED, UNINITIALIZED, UNDEFINED, CLEAR, WARNING, CRITICAL +- `${old_status}` the previous status: REMOVED, UNINITIALIZED, UNDEFINED, CLEAR, WARNING, CRITICAL +- `${value}` the current value of the alarm +- `${old_value}` the previous value of the alarm +- `${src}` the line number and file in which the alarm has been configured +- `${duration}` the duration in seconds of the previous alarm state +- `${duration_txt}` same as `${duration}` for humans +- `${non_clear_duration}` the total duration in seconds this is/was non-clear +- `${non_clear_duration_txt}` same as `${non_clear_duration}` for humans +- `${units}` the units of the value +- `${info}` a short description of the alarm +- `${value_string}` friendly value (with units) +- `${old_value_string}` friendly old value (with units) +- `${image}` the URL of an image to represent the status of the alarm +- `${color}` a color in #AABBCC format for the alarm +- `${goto_url}` the URL the user can click to see the Netdata dashboard +- `${calc_expression}` the expression evaluated to provide the value for the alarm +- `${calc_param_values}` the value of the variables in the evaluated expression +- `${total_warnings}` the total number of alarms
in WARNING state on the host +- `${total_critical}` the total number of alarms in CRITICAL state on the host + +The following are more human friendly: + +- `${alarm}` like "name = value units" +- `${status_message}` like "needs attention", "recovered", "is critical" +- `${severity}` like "Escalated to CRITICAL", "Recovered from WARNING" +- `${raised_for}` like "(alarm was raised for 10 minutes)" + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Fcustom%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/health/notifications/discord/Makefile.inc b/health/notifications/discord/Makefile.inc new file mode 100644 index 0000000..78de723 --- /dev/null +++ b/health/notifications/discord/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + discord/README.md \ + discord/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/discord/README.md b/health/notifications/discord/README.md new file mode 100644 index 0000000..1650d9c --- /dev/null +++ b/health/notifications/discord/README.md @@ -0,0 +1,50 @@ +<!-- +title: "Discordapp.com" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/discord/README.md +--> + +# Discordapp.com + +This is what you will get: + +![image](https://cloud.githubusercontent.com/assets/7321975/22215935/b49ede7e-e162-11e6-98d0-ae8541e6b92e.png) + +You need: + +1. The **incoming webhook URL** as given by Discord. Create a webhook by following the official [Discord documentation](https://support.discordapp.com/hc/en-us/articles/228383668-Intro-to-Webhooks). 
You can use the same webhook on all your Netdata servers (or you can have multiple if you like - your decision). +2. One or more Discord channels to post the messages to. + +Set them in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: + +``` +############################################################################### +# sending discord notifications + +# note: multiple recipients can be given like this: +# "CHANNEL1 CHANNEL2 ..." + +# enable/disable sending discord notifications +SEND_DISCORD="YES" + +# Create a webhook by following the official documentation - +# https://support.discordapp.com/hc/en-us/articles/228383668-Intro-to-Webhooks +DISCORD_WEBHOOK_URL="https://discordapp.com/api/webhooks/XXXXXXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" + +# if a role's recipients are not configured, a notification will be sent to +# this discord channel (empty = do not send a notification for unconfigured +# roles): +DEFAULT_RECIPIENT_DISCORD="alarms" +``` + +You can define multiple channels like this: `alarms systems`. +You can give different channels per **role** using these (in the same file): + +``` +role_recipients_discord[sysadmin]="systems" +role_recipients_discord[dba]="databases systems" +role_recipients_discord[webmaster]="marketing development" +``` + +The keywords `systems`, `databases`, `marketing`, `development` are discordapp.com channels (they should already exist within your discord server). 
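A minimal sketch of how a space-separated recipient list such as `alarms systems` breaks down into individual channels. The `list_channels` helper is hypothetical and only illustrates the list format; it is not taken from `alarm-notify.sh`.

```sh
# Hypothetical helper, for illustration only (not part of alarm-notify.sh):
# print each channel of a space-separated recipient list on its own line.
list_channels() {
    # $1 is left unquoted on purpose so the shell splits it on whitespace
    for channel in $1; do
        printf '%s\n' "$channel"
    done
}

DEFAULT_RECIPIENT_DISCORD="alarms systems"
list_channels "$DEFAULT_RECIPIENT_DISCORD"
```

Run as written, this prints `alarms` and `systems` on separate lines, one per channel that would receive the notification.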
+ +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Fdiscord%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/health/notifications/dynatrace/Makefile.inc b/health/notifications/dynatrace/Makefile.inc new file mode 100644 index 0000000..a2ae623 --- /dev/null +++ b/health/notifications/dynatrace/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + dynatrace/README.md \ + dynatrace/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/dynatrace/README.md b/health/notifications/dynatrace/README.md new file mode 100644 index 0000000..cc82ee7 --- /dev/null +++ b/health/notifications/dynatrace/README.md @@ -0,0 +1,36 @@ +<!-- +title: "Dynatrace" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/dynatrace/README.md +--> + +# Dynatrace + +Dynatrace allows you to receive notifications using their Events REST API. + +See [the Dynatrace documentation](https://www.dynatrace.com/support/help/extend-dynatrace/dynatrace-api/environment-api/events/post-event/) about POSTing an event in the Events API for more details. + + + +You need: + +1. Dynatrace Server. You can use the same on all your Netdata servers but make sure the server is network visible from your Netdata hosts. +The Dynatrace server should be with protocol prefixed (`http://` or `https://`). For example: `https://monitor.example.com` +This is a required parameter. +2. API Token. Generate a secure access API token that enables access to your Dynatrace monitoring data via the REST-based API. +Generate a Dynatrace API authentication token. 
On your Dynatrace server, go to **Settings** --> **Integration** --> **Dynatrace API** --> **Generate token**. +See [Dynatrace API - Authentication](https://www.dynatrace.com/support/help/extend-dynatrace/dynatrace-api/basics/dynatrace-api-authentication/) for more details. +This is a required parameter. +3. API Space. This is the part of the URL of the page where you generate the API token. For example, the URL + for a generated API token might look like: + `https://monitor.illumineit.com/e/2a93fe0e-4cd5-469a-9d0d-1a064235cfce/#settings/integration/apikeys;gf=all`. In that + case, the space is _2a93fe0e-4cd5-469a-9d0d-1a064235cfce_. This is a required parameter. +4. Generate a Server Tag. On your Dynatrace Server, go to **Settings** --> **Tags** --> **Manually applied tags** and create the Tag. +The Netdata alarm is sent as a Dynatrace Event to be correlated with all the hosts tagged with the Tag you have created. +This is a required parameter. +5. Specify the Dynatrace event. This can be one of `CUSTOM_INFO`, `CUSTOM_ANNOTATION`, `CUSTOM_CONFIGURATION`, and `CUSTOM_DEPLOYMENT`. +The default value is `CUSTOM_INFO`. +This is a required parameter. +6. Specify the annotation type. This is the source of the Dynatrace event. Use whatever suits you, for example, +_Netdata Alarm_, which is also the default value. 
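The six parameters above correspond to variables in `health_alarm_notify.conf`. A minimal sketch, assuming placeholder values for the server, token, space and tag (the tag name `netdata` in particular is made up for illustration); replace them with your own:

```sh
# enable/disable sending Dynatrace notifications
SEND_DYNATRACE="YES"

# 1. Dynatrace server, including the protocol prefix (placeholder host)
DYNATRACE_SERVER="https://monitor.example.com"

# 2. API token generated on the Dynatrace server (placeholder value)
DYNATRACE_TOKEN="XXXXXXXXXXXXXXXXXXXXX"

# 3. API space, taken from the URL of the token page (placeholder value)
DYNATRACE_SPACE="2a93fe0e-4cd5-469a-9d0d-1a064235cfce"

# 4. manually applied tag created on the Dynatrace server (hypothetical name)
DYNATRACE_TAG_VALUE="netdata"

# 5. event type: CUSTOM_INFO, CUSTOM_ANNOTATION, CUSTOM_CONFIGURATION
#    or CUSTOM_DEPLOYMENT
DYNATRACE_EVENT="CUSTOM_INFO"

# 6. annotation type, shown as the source of the event
DYNATRACE_ANNOTATION_TYPE="Netdata Alarm"
```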
+ +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Fdynatrace%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]() diff --git a/health/notifications/email/Makefile.inc b/health/notifications/email/Makefile.inc new file mode 100644 index 0000000..95dc7cf --- /dev/null +++ b/health/notifications/email/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + email/README.md \ + email/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/email/README.md b/health/notifications/email/README.md new file mode 100644 index 0000000..827a9c0 --- /dev/null +++ b/health/notifications/email/README.md @@ -0,0 +1,63 @@ +<!-- +title: "Email" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/email/README.md +--> + +# Email + +You need a working `sendmail` command for email alerts to work. Almost all MTAs provide a `sendmail` interface. + +Netdata sends all emails as user `netdata`, so make sure your `sendmail` works for local users. + +email notifications look like this: + +![image](https://cloud.githubusercontent.com/assets/2662304/18407294/e9218c68-7714-11e6-8739-e4dd8a498252.png) + +## configuration + +To edit `health_alarm_notify.conf` on your system run `/etc/netdata/edit-config health_alarm_notify.conf`. + +You can configure recipients in [`/etc/netdata/health_alarm_notify.conf`](https://github.com/netdata/netdata/blob/99d44b7d0c4e006b11318a28ba4a7e7d3f9b3bae/conf.d/health_alarm_notify.conf#L101). 
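As a minimal sketch, the default email recipients are a space-separated list in `DEFAULT_RECIPIENT_EMAIL`, and appending `|critical` to an address limits it to alarms that have at some point become critical. The addresses below are placeholders.

```sh
# enable/disable sending emails
SEND_EMAIL="YES"

# the first address receives all alarms; the second only receives
# notifications for alarms that have at some point become critical
DEFAULT_RECIPIENT_EMAIL="admin1@example.com admin2@example.com|critical"
```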
+ +You can also configure per role recipients [in the same file, a few lines below](https://github.com/netdata/netdata/blob/99d44b7d0c4e006b11318a28ba4a7e7d3f9b3bae/conf.d/health_alarm_notify.conf#L313). + +Changes to this file do not require a Netdata restart. + +You can test your configuration by issuing the commands: + +```sh +# become user netdata +sudo su -s /bin/bash netdata + +# send a test alarm +/usr/libexec/netdata/plugins.d/alarm-notify.sh test [ROLE] +``` + +Where `[ROLE]` is the role you want to test. The default (if you don't give a `[ROLE]`) is `sysadmin`. + +Note that in versions before 1.16, the plugins.d directory may be installed in a different location in certain OSs (e.g. under `/usr/lib/netdata`). +You can always find the location of the alarm-notify.sh script in `netdata.conf`. + +## Simple SMTP transport configuration + +If you want an alternative to `sendmail` in order to have a simple MTA configuration for sending emails and authenticating to an existing SMTP server, you can do the following: + +- Install `msmtp`. +- Modify the `sendmail` path in `health_alarm_notify.conf` to point to the location of `msmtp`: +``` +# The full path to the sendmail command. +# If empty, the system $PATH will be searched for it. +# If not found, email notifications will be disabled (silently). +sendmail="/usr/bin/msmtp" +``` +- Log in as netdata: +```sh +(sudo) su -s /bin/bash netdata +``` +- Configure `~/.msmtprc` as shown [in the documentation](https://marlam.de/msmtp/documentation/). 
+- Finally, set the appropriate permissions on the `.msmtprc` file: +```sh +chmod 600 ~/.msmtprc +``` + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Femail%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/health/notifications/flock/Makefile.inc b/health/notifications/flock/Makefile.inc new file mode 100644 index 0000000..5bde161 --- /dev/null +++ b/health/notifications/flock/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + flock/README.md \ + flock/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/flock/README.md b/health/notifications/flock/README.md new file mode 100644 index 0000000..b24ecdb --- /dev/null +++ b/health/notifications/flock/README.md @@ -0,0 +1,37 @@ +<!-- +title: "Flock" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/flock/README.md +--> + +# Flock + +This is what you will get: + +![Flock](https://i.imgur.com/ok9bRzw.png) + +You need: + +The **incoming webhook URL** as given by flock.com. +You can use the same on all your Netdata servers (or you can have multiple if you like - your decision). 
+ +Get them here: <https://admin.flock.com/webhooks> + +Set them in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: + +``` +############################################################################### +# sending flock notifications + +# enable/disable sending flock notifications +SEND_FLOCK="YES" + +# Login to flock.com and create an incoming webhook. 
+# You need only one for all your Netdata servers. +# Without it, Netdata cannot send flock notifications. +FLOCK_WEBHOOK_URL="https://api.flock.com/hooks/sendMessage/XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" + +# if a role recipient is not configured, no notification will be sent +DEFAULT_RECIPIENT_FLOCK="alarms" +``` + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Fflock%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/health/notifications/hangouts/Makefile.inc b/health/notifications/hangouts/Makefile.inc new file mode 100644 index 0000000..6ff1dff --- /dev/null +++ b/health/notifications/hangouts/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + hangouts/README.md \ + hangouts/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/hangouts/README.md b/health/notifications/hangouts/README.md new file mode 100644 index 0000000..886abfc --- /dev/null +++ b/health/notifications/hangouts/README.md @@ -0,0 +1,55 @@ +<!-- +title: "Send notifications to Google Hangouts" +description: "Send alerts to Send notifications to Google Hangouts any time an anomaly or performance issue strikes a node in your infrastructure." +sidebar_label: "Google Hangouts" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/hangouts/README.md +--> + +# Send notifications to Google Hangouts + +[Google Hangouts](https://hangouts.google.com/) is a cross-platform messaging app developed by Google. 
You can configure +Netdata to send alarm notifications to a Hangouts room in order to stay aware of possible health or performance issues +on your nodes. Here's an example of the notification in action: + +![Netdata on Hangouts](https://user-images.githubusercontent.com/1153921/66427166-47de6900-e9c8-11e9-8322-b4b03f084dc1.png) + +To receive notifications in Google Hangouts, you need the following in your Hangouts setup: + +1. One or more rooms. +2. An **incoming webhook** for each room. + +Follow [Google's documentation](https://developers.google.com/hangouts/chat/how-tos/webhooks) to create an incoming +webhook for each room you want to send Netdata notifications to. + +Set the webhook URIs and room names in `health_alarm_notify.conf`. To edit it on your system, run +`/etc/netdata/edit-config health_alarm_notify.conf`. + +## Threads (optional) + +Instead of receiving alarms on different threads, you can have Netdata collect them in a single thread by +setting the variable `HANGOUTS_WEBHOOK_THREAD[NAME]`. + +``` +#------------------------------------------------------------------------------ +# hangouts (google hangouts chat) global notification options +# enable/disable sending hangouts notifications +SEND_HANGOUTS="YES" +# On Hangouts, in the room you choose, create an incoming webhook, +# copy the link and paste it below and also identify the room name. +# Without it, netdata cannot send hangouts notifications to that room. +# HANGOUTS_WEBHOOK_URI[ROOM_NAME]="URLforroom1" +HANGOUTS_WEBHOOK_URI[systems]="https://chat.googleapis.com/v1/spaces/AAAAXXXXXXX/..." +HANGOUTS_WEBHOOK_URI[development]="https://chat.googleapis.com/v1/spaces/AAAAYYYYY/..." +# On Hangouts, copy a thread link and change the values for space and thread +# HANGOUTS_WEBHOOK_THREAD[systems]="spaces/AAAAXXXXXXX/threads/XXXXXXXXXXX" +# if DEFAULT_RECIPIENT_HANGOUTS is not configured, +# notifications won't be sent to hangouts rooms. 
+# DEFAULT_RECIPIENT_HANGOUTS="systems development|critical" +DEFAULT_RECIPIENT_HANGOUTS="sysadmin devops alarms|critical" +``` + +You can define multiple rooms like this: `sysadmin devops alarms|critical`. + +The keywords `sysadmin`, `devops`, and `alarms` are Hangouts rooms. + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Fhangouts%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/health/notifications/health_alarm_notify.conf b/health/notifications/health_alarm_notify.conf new file mode 100755 index 0000000..be669e1 --- /dev/null +++ b/health/notifications/health_alarm_notify.conf @@ -0,0 +1,1251 @@ +# Configuration for alarm notifications +# +# This configuration is used by: alarm-notify.sh +# changes take effect immediately (the next alarm will use them). +# +# alarm-notify.sh can send: +# - e-mails (using the sendmail command), +# - push notifications to your mobile phone (pushover.net), +# - messages to your slack team (slack.com), +# - messages to your alerta server (alerta.io), +# - messages to your flock team (flock.com), +# - messages to your discord guild (discordapp.com), +# - messages to your telegram chat / group chat (telegram.org) +# - sms messages to your cell phone or any sms enabled device (twilio.com) +# - sms messages to your cell phone or any sms enabled device (messagebird.com) +# - sms messages to your cell phone or any sms enabled device (smstools3) +# - notifications to users on pagerduty.com +# - push notifications to iOS devices (via prowlapp.com) +# - notifications to Amazon SNS topics (aws.amazon.com) +# - messages to your irc channel on your selected network +# - messages to a local or remote syslog daemon +# - message to Microsoft Team (through webhook) +# - message to Rocket.Chat (through webhook) +# - message to Google Hangouts 
Chat (through webhook) +# +# The 'to' line given at netdata alarms defines a *role*, so that many +# people can be notified for each role. +# +# This file is a BASH script itself. +# +# +#------------------------------------------------------------------------------ +# proxy configuration +# +# If you need to send curl based notifications (pushover, pushbullet, slack, alerta, +# flock, discord, telegram) via a proxy, set these to your proxy address: +#export http_proxy="http://10.0.0.1:3128/" +#export https_proxy="http://10.0.0.1:3128/" + + +#------------------------------------------------------------------------------ +# notifications images +# +# Images in notifications need to be downloaded from an Internet facing site. +# To allow notification providers fetch the icons/images, by default we set +# the URL of the global public netdata registry. +# If you have an Internet facing netdata (or you have copied the images/ folder +# of netdata to your web server), set its URL here, to fetch the notification +# images from it. +#images_base_url="http://my.public.netdata.server:19999" + + +#------------------------------------------------------------------------------ +# date handling +# +# You can configure netdata alerts to send dates in any format you want. +# This uses standard `date` command format strings. See `man date` for +# more info on what you can put in here. Note that this has to start with a '+', otherwise it won't work. +# +# For ISO 8601 dates, use '+%FT%T%z' +# For RFC 5322 dates, use '+%a, %d %b %Y %H:%M:%S %z' +# For RFC 3339 dates, use '+%F %T%:z' +# For RFC 1123 dates, use '+%a, %d %b %Y %H:%M:%S %Z' +# For RFC 1036 dates, use '+%A, %d-%b-%y %H:%M:%S %Z' +# For a reasonably local date and time (in that order), use '+%x %X' +# For the old default behavior (compatible with ANSI C's asctime() function), leave this empty. 
+date_format='' + + +#------------------------------------------------------------------------------ +# hostname handling +# +# By default, Netdata will use the simple hostname for the system (the +# hostname with everything after the first `.` removed) when displaying +# the hostname in alert notifications. If you prefer, you can uncomment +# the line below to have Netdata instead use the host's fully qualified +# domain name. +# +# This does not report correct FQDN's for child systems for which this +# system is a parent. +# +# Additionally, if the system host name is overridden in /etc/netdata.conf +# with the `hostname` option, that name will be used unconditionally +# instead of this. +#use_fqdn='YES' + + +#------------------------------------------------------------------------------ +# external commands + +# The full path to the sendmail command. +# If empty, the system $PATH will be searched for it. +# If not found, email notifications will be disabled (silently). +sendmail="" + +# The full path of the curl command. +# If empty, the system $PATH will be searched for it. +# If not found, most notifications will be silently disabled. +curl="" + +# The full path of the nc command. +# If empty, the system $PATH will be searched for it. +# If not found, irc notifications will be silently disabled. +nc="" + +# The full path of the logger command. +# If empty, the system $PATH will be searched for it. +# If not found, syslog notifications will be silently disabled. +logger="" + +# The full path of the aws command. +# If empty, the system $PATH will be searched for it. +# If not found, Amazon SNS notifications will be silently disabled. +aws="" + +# The full path of the sendsms command (smstools3). +# If empty, the system $PATH will be searched for it. +# If not found, SMS notifications will be silently disabled. 
+sendsms="" + +#------------------------------------------------------------------------------ +# extra options for external commands +# +# In some cases, you may need to change what options get passed to an +# external command. Such cases are covered here. + +# Extra options to pass to curl. In most cases, you shouldn't need to add anything +# to this. If you're having issues with HTTPS connections, you might try adding +# '--insecure' here, but be warned that it will make it much easier for +# third-parties to block notification delivery, and may allow disclosure +# of potentially sensitive information. +#curl_options="--insecure" + +# Extra options to pass to logger. You shouldn't have to specify anything +# here in most cases. +#logger_options="" + +#------------------------------------------------------------------------------ +# extra options + +# By default don't do anything if this is CLEAR, but it was not WARNING or CRITICAL. +# You can send it always if your system makes deduplication for alarms. +#clear_alarm_always='YES' + +# +#------------------------------------------------------------------------------ +# NOTE ABOUT RECIPIENTS +# +# When you define recipients (all types): +# +# - emails addresses +# - pushover user tokens +# - telegram chat ids +# - slack channels +# - alerta environment +# - flock rooms +# - discord channels +# - hipchat rooms +# - sms phone numbers +# - pagerduty.com (pd) services +# - irc channels +# +# You can append |critical to limit the notifications to be sent. +# +# In these examples, the first recipient receives all the alarms +# while the second one receives only notifications for alarms that +# have at some point become critical. The second user may still receive +# warning and clear notifications, but only for the event that previously +# caused a critical alarm. 
+# +# email : "user1@example.com user2@example.com|critical" +# pushover : "2987343...9437837 8756278...2362736|critical" +# telegram : "111827421 112746832|critical" +# slack : "alarms disasters|critical" +# alerta : "alarms disasters|critical" +# flock : "alarms disasters|critical" +# discord : "alarms disasters|critical" +# twilio : "+15555555555 +17777777777|critical" +# messagebird: "+15555555555 +17777777777|critical" +# kavenegar : "09155555555 09177777777|critical" +# pd : "<pd_service_key_1> <pd_service_key_2>|critical" +# irc : "<irc_channel_1> <irc_channel_2>|critical" +# hangouts : "alarms disasters|critical" +# +# If a recipient is set to empty string, the default recipient of the given +# notification method (email, pushover, telegram, slack, alerta, etc) will be used. +# To disable a notification, use the recipient called: disabled +# This works for all notification methods (including the default recipients). + + +#------------------------------------------------------------------------------ +# email global notification options + +# multiple recipients can be given like this: +# "admin1@example.com admin2@example.com ..." + +# the email address sending email notifications +# the default is the system user netdata runs as (usually: netdata) +# The following formats are supported: +# EMAIL_SENDER="user@domain" +# EMAIL_SENDER="User Name <user@domain>" +# EMAIL_SENDER="'User Name' <user@domain>" +# EMAIL_SENDER="\"User Name\" <user@domain>" +EMAIL_SENDER="" + +# enable/disable sending emails +SEND_EMAIL="YES" + +# if a role recipient is not configured, an email will be send to: +DEFAULT_RECIPIENT_EMAIL="root" +# to receive only critical alarms, set it to "root|critical" + +# Optionally specify the encoding to list in the Content-Type header. +# This doesn't change what encoding the e-mail is sent with, just what +# the headers say it was encoded as. +# This shouldn't need to be changed as it will almost always be +# autodetected from the environment. 
+#EMAIL_CHARSET="UTF-8" + +# You can also have netdata add headers to the message that will +# cause most e-mail clients to treat all notifications for a given +# chart+alarm+host combination as a single thread. This can help +# simplify tracking of alarms, as it provides an easy way for scripts +# to correlate messages and also will cause most clients to group all the +# messages together. This is enabled by default; uncomment the line +# below if you want to disable it. +#EMAIL_THREADING="NO" + +# By default, netdata sends HTML and Plain Text emails, some clients +# do not parse HTML emails such as command line clients. +# To make emails readable in these clients, you can configure netdata +# to not send HTML but Plain Text only emails. +#EMAIL_PLAINTEXT_ONLY="YES" + +#------------------------------------------------------------------------------ +# Dynatrace global notification options +#------------------------------------------------------------------------------ +# enable/disable sending Dynatrace notifications +SEND_DYNATRACE="YES" + +# The Dynatrace server with protocol prefix (http:// or https://), example https://monitor.illumineit.com +# Required +DYNATRACE_SERVER="" + +# Generate a Dynatrace API authentication token +# Read https://www.dynatrace.com/support/help/extend-dynatrace/dynatrace-api/basics/dynatrace-api-authentication/ +# On Dynatrace server go to Settings --> Integration --> Dynatrace API --> Generate token +# Required +DYNATRACE_TOKEN="" + +# Beware: Space is taken from dynatrace URL from browser when you create the TOKEN +# Required +DYNATRACE_SPACE="" + +# Generate a Server Tag. On the Dynatrace Server go to Settings --> Tags --> Manually applied tags and create the Tag +# The NetData alarm will be sent as a Dynatrace Event to be correlated with all those hosts tagged with this Tag +# you created. 
+# Required +DYNATRACE_TAG_VALUE="" + +# Change this to what you want +DYNATRACE_ANNOTATION_TYPE="NetData Alarm" + +# This can be CUSTOM_INFO, CUSTOM_ANNOTATION, CUSTOM_CONFIGURATION, CUSTOM_DEPLOYMENT +# Applying default value +# Required +DYNATRACE_EVENT="CUSTOM_INFO" + + +DEFAULT_RECIPIENT_DYNATRACE="" + +#------------------------------------------------------------------------------ +# Stackpulse global notification options +SEND_STACKPULSE="YES" + +# Webhook +STACKPULSE_WEBHOOK="" + +DEFAULT_RECIPIENT_STACKPULSE="" + +#------------------------------------------------------------------------------ +# opsgenie global notification options +SEND_OPSGENIE="YES" + +# Api key +OPSGENIE_API_KEY="" +OPSGENIE_API_URL="" + +DEFAULT_RECIPIENT_OPSGENIE="" + +#------------------------------------------------------------------------------ +# hangouts (google hangouts chat) global notification options + +# enable/disable sending hangouts notifications +SEND_HANGOUTS="YES" + +# On Hangouts, in the room you choose, create an incoming webhook, +# copy the link and paste it below and also give it a room name. +# Without it, netdata cannot send hangouts notifications to that room. +# You will then use the same room name in your recipients list. For each URI, you need +# HANGOUTS_WEBHOOK_URI[room_name]="WEBHOOK_URI" +# e.g. to define systems and development rooms/recipients: +# HANGOUTS_WEBHOOK_URI[systems]="URLforroom1" +# HANGOUTS_WEBHOOK_URI[development]="URLforroom2" + +# if a DEFAULT_RECIPIENT_HANGOUTS is not configured, +# notifications won't be send to hangouts rooms. For the example above, +# a valid recipients list is the following +# DEFAULT_RECIPIENT_HANGOUTS="systems development|critical" +DEFAULT_RECIPIENT_HANGOUTS="" + +#------------------------------------------------------------------------------ +# pushover (pushover.net) global notification options + +# multiple recipients can be given like this: +# "USERTOKEN1 USERTOKEN2 ..." 
+ +# enable/disable sending pushover notifications +SEND_PUSHOVER="YES" + +# Login to pushover.net to get your pushover app token. +# You need only one for all your netdata servers (or you can have one for +# each of your netdata - your call). +# Without an app token, netdata cannot send pushover notifications. +PUSHOVER_APP_TOKEN="" + +# if a role's recipients are not configured, a notification will be send to +# this pushover user token (empty = do not send a notification for unconfigured +# roles): +DEFAULT_RECIPIENT_PUSHOVER="" + + +#------------------------------------------------------------------------------ +# pushbullet (pushbullet.com) push notification options + +# multiple recipients can be given like this: +# "user1@email.com user2@mail.com" + +# enable/disable sending pushbullet notifications +SEND_PUSHBULLET="YES" + +# Signup and Login to pushbullet.com +# To get your Access Token, go to https://www.pushbullet.com/#settings/account +# Create a new access token and paste it below. +# Then just set the recipients' emails. +# Please note that the if the email in the DEFAULT_RECIPIENT_PUSHBULLET does +# not have a pushbullet account, the pushbullet service will send an email +# to that address instead. + +# Without an access token, netdata cannot send pushbullet notifications. +PUSHBULLET_ACCESS_TOKEN="" +DEFAULT_RECIPIENT_PUSHBULLET="" + +# Device iden of the sending device. Optional. +PUSHBULLET_SOURCE_DEVICE="" + + +#------------------------------------------------------------------------------ +# Twilio (twilio.com) SMS options + +# multiple recipients can be given like this: +# "+15555555555 +17777777777" + +# enable/disable sending twilio SMS +SEND_TWILIO="YES" + +# Signup for free trial and select a SMS capable Twilio Number +# To get your Account SID and Token, go to https://www.twilio.com/console +# Place your sid, token and number below. +# Then just set the recipients' phone numbers. 
+# The trial account is only allowed to use the number specified when set up. + +# Without an account sid and token, netdata cannot send Twilio text messages. +TWILIO_ACCOUNT_SID="" +TWILIO_ACCOUNT_TOKEN="" +TWILIO_NUMBER="" +DEFAULT_RECIPIENT_TWILIO="" + + +#------------------------------------------------------------------------------ +# Messagebird (messagebird.com) SMS options + +# multiple recipients can be given like this: +# "+15555555555 +17777777777" + +# enable/disable sending messagebird SMS +SEND_MESSAGEBIRD="YES" + +# to get an access key, create a free account at https://www.messagebird.com +# verify and activate the account (no CC info needed) +# login to your account and enter your phonenumber to get some free credits +# to get the API key, click on 'API' in the sidebar, then 'API Access (REST)' +# click 'Add access key' and fill in data (you want a live key to send SMS) + +# Without an access key, netdata cannot send Messagebird text messages. +MESSAGEBIRD_ACCESS_KEY="" +MESSAGEBIRD_NUMBER="" +DEFAULT_RECIPIENT_MESSAGEBIRD="" + + +#------------------------------------------------------------------------------ +# Kavenegar (Kavenegar.com) SMS options + +# multiple recipients can be given like this: +# "09155555555 09177777777" + +# enable/disable sending kavenegar SMS +SEND_KAVENEGAR="YES" + +# to get an access key, after selecting and purchasing your desired service +# at http://kavenegar.com/pricing.html +# login to your account, go to your dashboard and my account are +# https://panel.kavenegar.com/Client/setting/account from API Key +# copy your api key. You can generate new API Key too. +# You can find and select kevenegar sender number from this place. + +# Without an API key, netdata cannot send KAVENEGAR text messages. 
+KAVENEGAR_API_KEY="" +KAVENEGAR_SENDER="" +DEFAULT_RECIPIENT_KAVENEGAR="" + + +#------------------------------------------------------------------------------ +# telegram (telegram.org) global notification options + +# multiple recipients can be given like this: +# "CHAT_ID_1 CHAT_ID_2 ..." + +# enable/disable sending telegram messages +SEND_TELEGRAM="YES" + +# Contact the bot @BotFather to create a new bot and receive a bot token. +# Without it, netdata cannot send telegram messages. +TELEGRAM_BOT_TOKEN="" + +# To get your chat ID send the command /getid to telegram bot @myidbot +# (https://t.me/myidbot). Each user also needs to open a conversation with the +# bot that will be sending notifications. +# If a role's recipients are not configured, a message will be sent to +# this chat id (empty = do not send a notification for unconfigured roles): +DEFAULT_RECIPIENT_TELEGRAM="" + + +#------------------------------------------------------------------------------ +# slack (slack.com) global notification options + +# multiple recipients can be given like this: +# "RECIPIENT1 RECIPIENT2 ..." + +# enable/disable sending slack notifications +SEND_SLACK="YES" + +# Login to your slack.com workspace and create an incoming webhook, using the "Incoming Webhooks" App: https://slack.com/apps/A0F7XDUAZ-incoming-webhooks +# Do not use the instructions in https://api.slack.com/incoming-webhooks#enable_webhooks, as those webhooks work only for a single channel. +# You need only one for all your netdata servers (or you can have one for each of your netdata). +# Without the app and a webhook, netdata cannot send slack notifications. 
+SLACK_WEBHOOK_URL="" + +# if a role's recipients are not configured, a notification will be sent to: +# - A slack channel (syntax: '#channel' or 'channel') +# - A slack user (syntax: '@user') +# - The channel or user defined in slack for the webhook (syntax: '#') +# empty = do not send a notification for unconfigured roles +DEFAULT_RECIPIENT_SLACK="" + +#------------------------------------------------------------------------------ +# Microsoft Teams (office.com) global notification options +# More details regarding the payload syntax options are available here: https://docs.microsoft.com/en-us/outlook/actionable-messages/message-card-reference +# Online designer: https://acdesignerbeta.azurewebsites.net/ +# multiple recipients can be given like this: +# "CHANNEL1 CHANNEL2 ..." + +# enable/disable sending Teams notifications +SEND_MSTEAM="YES" + +# if a role's recipients are not configured, a notification will be sent to +# this Teams channel (empty = do not send a notification for unconfigured +# roles): +# For Teams, the channel name is encoded in the URI after ....IncomingWebhook/___/..... +# This value will be replaced in the webhook value to publish to several channels in the same Team. +# In order to get it working properly, you have to replace the value between [] ....IncomingWebhook/[___]/..... with the "CHANNEL" string. +DEFAULT_RECIPIENT_MSTEAM="" +# Because of the way MS Teams works, put the different channels here like this: "CHANNEL1 CHANNEL2 ..." 
+# AT LEAST ONE CHANNEL IS MANDATORY +MSTEAM_WEBHOOK_URL="" + +# Define the default color scheme for alerts sent to MS Teams - icon and color +# Icons - go to https://emojipedia.org/bomb/ +MSTEAM_ICON_DEFAULT="♡" +MSTEAM_ICON_CLEAR="💚" +MSTEAM_ICON_WARNING="⚠️" +MSTEAM_ICON_CRITICAL="🔥" + +# Colors +MSTEAM_COLOR_DEFAULT="0076D7" +MSTEAM_COLOR_CLEAR="65A677" +MSTEAM_COLOR_WARNING="FFA500" +MSTEAM_COLOR_CRITICAL="D93F3C" + + +#------------------------------------------------------------------------------ +# rocketchat (rocket.chat) global notification options + +# multiple recipients can be given like this: +# "CHANNEL1 CHANNEL2 ..." + +# enable/disable sending rocketchat notifications +SEND_ROCKETCHAT="YES" + +# Login to rocket.chat and create an incoming webhook. You need only one for all +# your netdata servers (or you can have one for each of your netdata). +# Without it, netdata cannot send rocketchat notifications. +ROCKETCHAT_WEBHOOK_URL="" + +# if a role's recipients are not configured, a notification will be sent to +# this rocketchat channel (empty = do not send a notification for unconfigured +# roles): +DEFAULT_RECIPIENT_ROCKETCHAT="" + + +#------------------------------------------------------------------------------ +# alerta (alerta.io) global notification options + +# multiple recipients (Environments) can be given like this: +# "Production Development ..." + +# enable/disable sending alerta notifications +SEND_ALERTA="YES" + +# set your alerta server API url here +# this is the API url you defined when you installed the Alerta server, +# it is the same for all users. Do not include a trailing slash. +# ALERTA_WEBHOOK_URL="https://<server>/alerta/api" +ALERTA_WEBHOOK_URL="" + +# Login with an administrative user to your Alerta server and create an API KEY +# with write permissions. 
+ALERTA_API_KEY="" + +# you can define environments in /etc/alertad.conf option ALLOWED_ENVIRONMENTS +# standard environments are Production and Development +# if a role's recipients are not configured, a notification will be sent to +# this Environment (empty = do not send a notification for unconfigured roles): +DEFAULT_RECIPIENT_ALERTA="" + + +#------------------------------------------------------------------------------ +# flock (flock.com) global notification options + +# enable/disable sending flock notifications +SEND_FLOCK="YES" + +# Login to flock.com and create an incoming webhook. You need only one for all +# your netdata servers (or you can have one for each of your netdata). +# Without it, netdata cannot send flock notifications. +FLOCK_WEBHOOK_URL="" + +# if a role recipient is not configured, no notification will be sent +DEFAULT_RECIPIENT_FLOCK="" + + +#------------------------------------------------------------------------------ +# discord (discordapp.com) global notification options + +# multiple recipients can be given like this: +# "CHANNEL1 CHANNEL2 ..." + +# enable/disable sending discord notifications +SEND_DISCORD="YES" + +# Create a webhook by following the official documentation - +# https://support.discordapp.com/hc/en-us/articles/228383668-Intro-to-Webhooks +DISCORD_WEBHOOK_URL="" + +# if a role's recipients are not configured, a notification will be sent to +# this discord channel (empty = do not send a notification for unconfigured +# roles): +DEFAULT_RECIPIENT_DISCORD="" + + +#------------------------------------------------------------------------------ +# hipchat global notification options + +# multiple recipients can be given like this: +# "ROOM1 ROOM2 ..." + +# enable/disable sending hipchat notifications +SEND_HIPCHAT="YES" + +# define hipchat server +HIPCHAT_SERVER="api.hipchat.com" + +# api.hipchat.com authorization token +# Without this, netdata cannot send hipchat notifications. 
+HIPCHAT_AUTH_TOKEN="" + +# if a role's recipients are not configured, a notification will be sent to +# this hipchat room (empty = do not send a notification for unconfigured +# roles): +DEFAULT_RECIPIENT_HIPCHAT="" + + +#------------------------------------------------------------------------------ +# kafka notification options + +# enable/disable sending kafka notifications +SEND_KAFKA="YES" + +# The URL to POST kafka alarm data to. It should be the full URL. +KAFKA_URL="" + +# The IP to be used in the kafka message as the sender. +KAFKA_SENDER_IP="" + + +#------------------------------------------------------------------------------ +# pagerduty.com notification options +# +# pagerduty.com notifications require a "Generic API" (Events v1) +# pagerduty service. +# https://support.pagerduty.com/docs/services-and-integrations + +# multiple recipients can be given like this: +# "<pd_service_key_1> <pd_service_key_2> ..." + +# enable/disable sending pagerduty notifications +SEND_PD="YES" + +# if a role's recipients are not configured, a notification will be sent to +# the "Generic API" pagerduty.com service that uses this service key. +# (empty = do not send a notification for unconfigured roles): +DEFAULT_RECIPIENT_PD="" + +# Which PD API version should be used? For version 2 or newer, you must first make a request to PagerDuty +# before setting the version (https://developer.pagerduty.com/docs/events-api-v2/overview/). +USE_PD_VERSION="1" + +#------------------------------------------------------------------------------ +# fleep notification options +# +# To send fleep.io notifications, you will need a webhook for the +# conversation you want to send to. + +# Fleep recipients are specified as the last part of the webhook URL. +# So, for a webhook URL of: https://fleep.io/hook/IJONmBuuSlWlkb_ttqyXJg, the +# recipient name would be: 'IJONmBuuSlWlkb_ttqyXJg'. 
+ +# enable/disable sending fleep notifications +SEND_FLEEP="YES" + +# if a role's recipients are not configured, a notification will not be sent. +# (empty = do not send a notification for unconfigured roles): +DEFAULT_RECIPIENT_FLEEP="" + +# The user name to label the messages with. If this is unset, +# the hostname of the system the notification is for will be used. +FLEEP_SENDER="" + + +#------------------------------------------------------------------------------ +# irc notification options +# +# irc notifications require only the nc utility to be installed. + +# multiple recipients can be given like this: +# "<irc_channel_1> <irc_channel_2> ..." + +# enable/disable sending irc notifications +SEND_IRC="YES" + +# if a role's recipients are not configured, a notification will not be sent. +# (empty = do not send a notification for unconfigured roles): +DEFAULT_RECIPIENT_IRC="" + +# The irc network to which the recipients belong. It must be the full network. +# e.g. "irc.freenode.net" +IRC_NETWORK="" + +# The irc nickname which is required to send the notification. It must not be +# an already registered name as the connection's MODE is defined as a 'guest'. +IRC_NICKNAME="" + +# The irc realname which is required in order to make the connection and is an +# extra identifier. +IRC_REALNAME="" + + +#------------------------------------------------------------------------------ +# syslog notifications +# +# syslog notifications only need you to have a working logger command, which +# should be the case on pretty much any Linux system. + +# enable/disable sending syslog notifications +# NOTE: make sure you have everything else configured the way you want +# it _before_ turning this on. +SEND_SYSLOG="NO" + +# A note on log levels and facilities: +# +# The traditional UNIX syslog mechanism has the concept of both log +# levels and facilities. 
A log level indicates the relative severity of +# the message, while a facility specifies a generic source for the message +# (for example, the `mail` facility is where sendmail and postfix log +# their messages). All major syslog daemons have the ability to filter +# messages based on both log level and facility, and can often also make +# routing decisions for messages based on both factors. +# +# On Linux, the eight log levels in decreasing order of severity are: +# emerg, alert, crit, err, warning, notice, info, debug +# +# By default, warnings will be logged at the warning level, critical +# alerts at the crit level, and clear notifications at the info level. +# +# And the 19 facilities you can log to are: +# auth, authpriv, cron, daemon, ftp, lpr, mail, news, syslog, user, +# uucp, local0, local1, local2, local3, local4, local5, local6, and local7 +# +# By default, netdata alerts will be logged to the local6 facility. +# +# Depending on your distribution, this means that either all your +# netdata alerts will by default end up in the main system log (usually +# /var/log/messages), or they won't be logged to a file at all. +# Neither of these is likely to be what you actually want, but any +# configuration to change that needs to happen in the syslog daemon +# configuration, not here. + +# This controls which facility is used by default for logging. Defaults +# to local6. +SYSLOG_FACILITY='' + +# If a role's recipients are not configured, use the following. +# (empty = do not send a notification for unconfigured roles) +# +# Syslog recipients use the following format: +# [[facility.level][@host[:port]]/]prefix +# +# `prefix` is prepended to all log messages generated for +# that recipient. The prefix is mandatory. +# 'host' and 'port' can be used to specify a remote syslog server to +# send messages to. Leave these out if you want messages to be delivered +# locally. 'host' can be either a hostname or an IP address. 
+# IPv6 addresses must have square brackets around them. +# 'facility' and 'level' are used to override the default logging facility +# set above and the log level. If one is specified, both must be present. +# +# For example, to send messages with a 'netdata' prefix to a syslog +# daemon listening on port 514 on 'loghost' using the daemon facility and +# notice log level: +# DEFAULT_RECIPIENT_SYSLOG='daemon.notice@loghost:514/netdata' +# +DEFAULT_RECIPIENT_SYSLOG="netdata" + +#------------------------------------------------------------------------------ +# iOS Push Notifications + +# enable/disable sending iOS push notifications +SEND_PROWL="YES" + +# If a role's recipients are not configured, use the following. +# (empty = do not send a notification for unconfigured roles) +# +# Recipients for iOS push notifications are Prowl API keys. +# +# A recipient may also consist of multiple Prowl API keys separated by +# commas, in which case notifications will be simultaneously sent for all +# of those API keys. +DEFAULT_RECIPIENT_PROWL="" + +#------------------------------------------------------------------------------ +# Amazon SNS notifications +# +# This method requires potentially complex manual configuration. See the +# netdata wiki for information on what is needed. + +# enable/disable sending Amazon SNS notifications +SEND_AWSSNS="YES" + +# Specify a template for the Amazon SNS notifications. This supports +# the same set of variables that are usable in the `custom_sender()` +# function in the custom notification configuration below. +# +AWSSNS_MESSAGE_FORMAT="${status} on ${host} at ${date}: ${chart} ${value_string}" + +# If a role's recipients are not configured, use the following. +# (empty = do not send a notification for unconfigured roles) +# +# Recipients for AWS SNS notifications are specified as topic ARNs. 
+# +DEFAULT_RECIPIENT_AWSSNS="" + +#------------------------------------------------------------------------------ +# SMS Server Tools 3 (smstools3) global notification options + +# enable/disable sending SMS Server Tools 3 SMS notifications +SEND_SMS="YES" + +# if a role's recipients are not configured, a notification will be sent to +# this SMS channel (empty = do not send a notification for unconfigured +# roles). Multiple recipients can be given like this: "PHONE1 PHONE2 ..." + +DEFAULT_RECIPIENT_SMS="" + +# Matrix notifications +# + +# enable/disable Matrix notifications +SEND_MATRIX="YES" + +# The url of the Matrix homeserver +# e.g. https://matrix.org:8448 +MATRIX_HOMESERVER= + +# An access token from a valid Matrix account. Tokens usually don't expire and +# can be managed from a Matrix client. +# See https://matrix.org/docs/guides/client-server.html +MATRIX_ACCESSTOKEN= + +# Specify the default rooms to receive the notification if no rooms are provided +# in a role's recipients. +# The format is !roomid:homeservername +DEFAULT_RECIPIENT_MATRIX="" + +#------------------------------------------------------------------------------ +# custom notifications +# + +# enable/disable sending custom notifications +SEND_CUSTOM="YES" + +# if a role's recipients are not configured, use the following. 
+# (empty = do not send a notification for unconfigured roles) +DEFAULT_RECIPIENT_CUSTOM="" + +# The custom_sender() is a custom function to do whatever you need to do +custom_sender() { + # variables you can use: + # ${host} the host generated this event + # ${url_host} same as ${host} but URL encoded + # ${unique_id} the unique id of this event + # ${alarm_id} the unique id of the alarm that generated this event + # ${event_id} the incremental id of the event, for this alarm id + # ${when} the timestamp this event occurred + # ${name} the name of the alarm, as given in netdata health.d entries + # ${url_name} same as ${name} but URL encoded + # ${chart} the name of the chart (type.id) + # ${url_chart} same as ${chart} but URL encoded + # ${family} the family of the chart + # ${url_family} same as ${family} but URL encoded + # ${status} the current status : REMOVED, UNINITIALIZED, UNDEFINED, CLEAR, WARNING, CRITICAL + # ${old_status} the previous status: REMOVED, UNINITIALIZED, UNDEFINED, CLEAR, WARNING, CRITICAL + # ${value} the current value of the alarm + # ${old_value} the previous value of the alarm + # ${src} the line number and file the alarm has been configured + # ${duration} the duration in seconds of the previous alarm state + # ${duration_txt} same as ${duration} for humans + # ${non_clear_duration} the total duration in seconds this is/was non-clear + # ${non_clear_duration_txt} same as ${non_clear_duration} for humans + # ${units} the units of the value + # ${info} a short description of the alarm + # ${value_string} friendly value (with units) + # ${old_value_string} friendly old value (with units) + # ${image} the URL of an image to represent the status of the alarm + # ${color} a color in #AABBCC format for the alarm + # ${goto_url} the URL the user can click to see the netdata dashboard + # ${calc_expression} the expression evaluated to provide the value for the alarm + # ${calc_param_values} the value of the variables in the evaluated expression 
+ # ${total_warnings} the total number of alarms in WARNING state on the host + # ${total_critical} the total number of alarms in CRITICAL state on the host + + # these are more human friendly: + # ${alarm} like "name = value units" + # ${status_message} like "needs attention", "recovered", "is critical" + # ${severity} like "Escalated to CRITICAL", "Recovered from WARNING" + # ${raised_for} like "(alarm was raised for 10 minutes)" + + # example human readable SMS + local msg="${host} ${status_message}: ${alarm} ${raised_for}" + + # limit it to 160 characters and encode it for use in a URL + urlencode "${msg:0:160}" >/dev/null; msg="${REPLY}" + + # a space separated list of the recipients to send alarms to + to="${1}" + + # Sample send SMS to an imaginary SMS gateway accessible via HTTPS + #for phone in ${to}; do + # httpcode=$(docurl -X POST \ + # --data-urlencode "From=XXX" \ + # --data-urlencode "To=${phone}" \ + # --data-urlencode "Body=${msg}" \ + # -u "${accountsid}:${accounttoken}" \ + # https://domain.website.com/) + # + # if [ "${httpcode}" = "200" ]; then + # info "sent custom notification ${msg} to ${phone}" + # sent=$((sent + 1)) + # else + # error "failed to send custom notification ${msg} to ${phone} with HTTP error code ${httpcode}." + # fi + #done + + info "not sending custom notification to ${to}, for ${status} of '${host}.${chart}.${name}' - custom_sender() is not configured." 
+} + + +############################################################################### +# RECIPIENTS PER ROLE + +# ----------------------------------------------------------------------------- +# generic system alarms +# CPU, disks, network interfaces, entropy, etc + +role_recipients_email[sysadmin]="${DEFAULT_RECIPIENT_EMAIL}" + +role_recipients_hangouts[sysadmin]="${DEFAULT_RECIPIENT_HANGOUTS}" + +role_recipients_pushover[sysadmin]="${DEFAULT_RECIPIENT_PUSHOVER}" + +role_recipients_pushbullet[sysadmin]="${DEFAULT_RECIPIENT_PUSHBULLET}" + +role_recipients_telegram[sysadmin]="${DEFAULT_RECIPIENT_TELEGRAM}" + +role_recipients_slack[sysadmin]="${DEFAULT_RECIPIENT_SLACK}" + +role_recipients_alerta[sysadmin]="${DEFAULT_RECIPIENT_ALERTA}" + +role_recipients_flock[sysadmin]="${DEFAULT_RECIPIENT_FLOCK}" + +role_recipients_discord[sysadmin]="${DEFAULT_RECIPIENT_DISCORD}" + +role_recipients_hipchat[sysadmin]="${DEFAULT_RECIPIENT_HIPCHAT}" + +role_recipients_twilio[sysadmin]="${DEFAULT_RECIPIENT_TWILIO}" + +role_recipients_messagebird[sysadmin]="${DEFAULT_RECIPIENT_MESSAGEBIRD}" + +role_recipients_kavenegar[sysadmin]="${DEFAULT_RECIPIENT_KAVENEGAR}" + +role_recipients_pd[sysadmin]="${DEFAULT_RECIPIENT_PD}" + +role_recipients_fleep[sysadmin]="${DEFAULT_RECIPIENT_FLEEP}" + +role_recipients_irc[sysadmin]="${DEFAULT_RECIPIENT_IRC}" + +role_recipients_syslog[sysadmin]="${DEFAULT_RECIPIENT_SYSLOG}" + +role_recipients_prowl[sysadmin]="${DEFAULT_RECIPIENT_PROWL}" + +role_recipients_awssns[sysadmin]="${DEFAULT_RECIPIENT_AWSSNS}" + +role_recipients_custom[sysadmin]="${DEFAULT_RECIPIENT_CUSTOM}" + +role_recipients_msteam[sysadmin]="${DEFAULT_RECIPIENT_MSTEAM}" + +role_recipients_rocketchat[sysadmin]="${DEFAULT_RECIPIENT_ROCKETCHAT}" + +role_recipients_dynatrace[sysadmin]="${DEFAULT_RECIPIENT_DYNATRACE}" + +role_recipients_opsgenie[sysadmin]="${DEFAULT_RECIPIENT_OPSGENIE}" + +role_recipients_matrix[sysadmin]="${DEFAULT_RECIPIENT_MATRIX}" + 
+role_recipients_stackpulse[sysadmin]="${DEFAULT_RECIPIENT_STACKPULSE}" + +# ----------------------------------------------------------------------------- +# DNS related alarms + +role_recipients_email[domainadmin]="${DEFAULT_RECIPIENT_EMAIL}" + +role_recipients_hangouts[domainadmin]="${DEFAULT_RECIPIENT_HANGOUTS}" + +role_recipients_pushover[domainadmin]="${DEFAULT_RECIPIENT_PUSHOVER}" + +role_recipients_pushbullet[domainadmin]="${DEFAULT_RECIPIENT_PUSHBULLET}" + +role_recipients_telegram[domainadmin]="${DEFAULT_RECIPIENT_TELEGRAM}" + +role_recipients_slack[domainadmin]="${DEFAULT_RECIPIENT_SLACK}" + +role_recipients_alerta[domainadmin]="${DEFAULT_RECIPIENT_ALERTA}" + +role_recipients_flock[domainadmin]="${DEFAULT_RECIPIENT_FLOCK}" + +role_recipients_discord[domainadmin]="${DEFAULT_RECIPIENT_DISCORD}" + +role_recipients_hipchat[domainadmin]="${DEFAULT_RECIPIENT_HIPCHAT}" + +role_recipients_twilio[domainadmin]="${DEFAULT_RECIPIENT_TWILIO}" + +role_recipients_messagebird[domainadmin]="${DEFAULT_RECIPIENT_MESSAGEBIRD}" + +role_recipients_kavenegar[domainadmin]="${DEFAULT_RECIPIENT_KAVENEGAR}" + +role_recipients_pd[domainadmin]="${DEFAULT_RECIPIENT_PD}" + +role_recipients_fleep[domainadmin]="${DEFAULT_RECIPIENT_FLEEP}" + +role_recipients_irc[domainadmin]="${DEFAULT_RECIPIENT_IRC}" + +role_recipients_syslog[domainadmin]="${DEFAULT_RECIPIENT_SYSLOG}" + +role_recipients_prowl[domainadmin]="${DEFAULT_RECIPIENT_PROWL}" + +role_recipients_awssns[domainadmin]="${DEFAULT_RECIPIENT_AWSSNS}" + +role_recipients_custom[domainadmin]="${DEFAULT_RECIPIENT_CUSTOM}" + +role_recipients_msteam[domainadmin]="${DEFAULT_RECIPIENT_MSTEAM}" + +role_recipients_rocketchat[domainadmin]="${DEFAULT_RECIPIENT_ROCKETCHAT}" + +role_recipients_sms[domainadmin]="${DEFAULT_RECIPIENT_SMS}" + +role_recipients_dynatrace[domainadmin]="${DEFAULT_RECIPIENT_DYNATRACE}" + +role_recipients_opsgenie[domainadmin]="${DEFAULT_RECIPIENT_OPSGENIE}" + +role_recipients_matrix[domainadmin]="${DEFAULT_RECIPIENT_MATRIX}" 
+ +role_recipients_stackpulse[domainadmin]="${DEFAULT_RECIPIENT_STACKPULSE}" + +# ----------------------------------------------------------------------------- +# database servers alarms +# mysql, redis, memcached, postgres, etc + +role_recipients_email[dba]="${DEFAULT_RECIPIENT_EMAIL}" + +role_recipients_hangouts[dba]="${DEFAULT_RECIPIENT_HANGOUTS}" + +role_recipients_pushover[dba]="${DEFAULT_RECIPIENT_PUSHOVER}" + +role_recipients_pushbullet[dba]="${DEFAULT_RECIPIENT_PUSHBULLET}" + +role_recipients_telegram[dba]="${DEFAULT_RECIPIENT_TELEGRAM}" + +role_recipients_slack[dba]="${DEFAULT_RECIPIENT_SLACK}" + +role_recipients_alerta[dba]="${DEFAULT_RECIPIENT_ALERTA}" + +role_recipients_flock[dba]="${DEFAULT_RECIPIENT_FLOCK}" + +role_recipients_discord[dba]="${DEFAULT_RECIPIENT_DISCORD}" + +role_recipients_hipchat[dba]="${DEFAULT_RECIPIENT_HIPCHAT}" + +role_recipients_twilio[dba]="${DEFAULT_RECIPIENT_TWILIO}" + +role_recipients_messagebird[dba]="${DEFAULT_RECIPIENT_MESSAGEBIRD}" + +role_recipients_kavenegar[dba]="${DEFAULT_RECIPIENT_KAVENEGAR}" + +role_recipients_pd[dba]="${DEFAULT_RECIPIENT_PD}" + +role_recipients_fleep[dba]="${DEFAULT_RECIPIENT_FLEEP}" + +role_recipients_irc[dba]="${DEFAULT_RECIPIENT_IRC}" + +role_recipients_syslog[dba]="${DEFAULT_RECIPIENT_SYSLOG}" + +role_recipients_prowl[dba]="${DEFAULT_RECIPIENT_PROWL}" + +role_recipients_awssns[dba]="${DEFAULT_RECIPIENT_AWSSNS}" + +role_recipients_custom[dba]="${DEFAULT_RECIPIENT_CUSTOM}" + +role_recipients_msteam[dba]="${DEFAULT_RECIPIENT_MSTEAM}" + +role_recipients_rocketchat[dba]="${DEFAULT_RECIPIENT_ROCKETCHAT}" + +role_recipients_sms[dba]="${DEFAULT_RECIPIENT_SMS}" + +role_recipients_dynatrace[dba]="${DEFAULT_RECIPIENT_DYNATRACE}" + +role_recipients_opsgenie[dba]="${DEFAULT_RECIPIENT_OPSGENIE}" + +role_recipients_matrix[dba]="${DEFAULT_RECIPIENT_MATRIX}" + +role_recipients_stackpulse[dba]="${DEFAULT_RECIPIENT_STACKPULSE}" + +# ----------------------------------------------------------------------------- +# 
web servers alarms +# apache, nginx, lighttpd, etc + +role_recipients_email[webmaster]="${DEFAULT_RECIPIENT_EMAIL}" + +role_recipients_hangouts[webmaster]="${DEFAULT_RECIPIENT_HANGOUTS}" + +role_recipients_pushover[webmaster]="${DEFAULT_RECIPIENT_PUSHOVER}" + +role_recipients_pushbullet[webmaster]="${DEFAULT_RECIPIENT_PUSHBULLET}" + +role_recipients_telegram[webmaster]="${DEFAULT_RECIPIENT_TELEGRAM}" + +role_recipients_slack[webmaster]="${DEFAULT_RECIPIENT_SLACK}" + +role_recipients_alerta[webmaster]="${DEFAULT_RECIPIENT_ALERTA}" + +role_recipients_flock[webmaster]="${DEFAULT_RECIPIENT_FLOCK}" + +role_recipients_discord[webmaster]="${DEFAULT_RECIPIENT_DISCORD}" + +role_recipients_hipchat[webmaster]="${DEFAULT_RECIPIENT_HIPCHAT}" + +role_recipients_twilio[webmaster]="${DEFAULT_RECIPIENT_TWILIO}" + +role_recipients_messagebird[webmaster]="${DEFAULT_RECIPIENT_MESSAGEBIRD}" + +role_recipients_kavenegar[webmaster]="${DEFAULT_RECIPIENT_KAVENEGAR}" + +role_recipients_pd[webmaster]="${DEFAULT_RECIPIENT_PD}" + +role_recipients_fleep[webmaster]="${DEFAULT_RECIPIENT_FLEEP}" + +role_recipients_irc[webmaster]="${DEFAULT_RECIPIENT_IRC}" + +role_recipients_syslog[webmaster]="${DEFAULT_RECIPIENT_SYSLOG}" + +role_recipients_prowl[webmaster]="${DEFAULT_RECIPIENT_PROWL}" + +role_recipients_awssns[webmaster]="${DEFAULT_RECIPIENT_AWSSNS}" + +role_recipients_custom[webmaster]="${DEFAULT_RECIPIENT_CUSTOM}" + +role_recipients_msteam[webmaster]="${DEFAULT_RECIPIENT_MSTEAM}" + +role_recipients_rocketchat[webmaster]="${DEFAULT_RECIPIENT_ROCKETCHAT}" + +role_recipients_sms[webmaster]="${DEFAULT_RECIPIENT_SMS}" + +role_recipients_dynatrace[webmaster]="${DEFAULT_RECIPIENT_DYNATRACE}" + +role_recipients_opsgenie[webmaster]="${DEFAULT_RECIPIENT_OPSGENIE}" + +role_recipients_matrix[webmaster]="${DEFAULT_RECIPIENT_MATRIX}" + +role_recipients_stackpulse[webmaster]="${DEFAULT_RECIPIENT_STACKPULSE}" + +# ----------------------------------------------------------------------------- +# proxy servers 
alarms +# squid, etc + +role_recipients_email[proxyadmin]="${DEFAULT_RECIPIENT_EMAIL}" + +role_recipients_hangouts[proxyadmin]="${DEFAULT_RECIPIENT_HANGOUTS}" + +role_recipients_pushover[proxyadmin]="${DEFAULT_RECIPIENT_PUSHOVER}" + +role_recipients_pushbullet[proxyadmin]="${DEFAULT_RECIPIENT_PUSHBULLET}" + +role_recipients_telegram[proxyadmin]="${DEFAULT_RECIPIENT_TELEGRAM}" + +role_recipients_slack[proxyadmin]="${DEFAULT_RECIPIENT_SLACK}" + +role_recipients_alerta[proxyadmin]="${DEFAULT_RECIPIENT_ALERTA}" + +role_recipients_flock[proxyadmin]="${DEFAULT_RECIPIENT_FLOCK}" + +role_recipients_discord[proxyadmin]="${DEFAULT_RECIPIENT_DISCORD}" + +role_recipients_hipchat[proxyadmin]="${DEFAULT_RECIPIENT_HIPCHAT}" + +role_recipients_twilio[proxyadmin]="${DEFAULT_RECIPIENT_TWILIO}" + +role_recipients_messagebird[proxyadmin]="${DEFAULT_RECIPIENT_MESSAGEBIRD}" + +role_recipients_kavenegar[proxyadmin]="${DEFAULT_RECIPIENT_KAVENEGAR}" + +role_recipients_pd[proxyadmin]="${DEFAULT_RECIPIENT_PD}" + +role_recipients_fleep[proxyadmin]="${DEFAULT_RECIPIENT_FLEEP}" + +role_recipients_irc[proxyadmin]="${DEFAULT_RECIPIENT_IRC}" + +role_recipients_syslog[proxyadmin]="${DEFAULT_RECIPIENT_SYSLOG}" + +role_recipients_prowl[proxyadmin]="${DEFAULT_RECIPIENT_PROWL}" + +role_recipients_awssns[proxyadmin]="${DEFAULT_RECIPIENT_AWSSNS}" + +role_recipients_custom[proxyadmin]="${DEFAULT_RECIPIENT_CUSTOM}" + +role_recipients_msteam[proxyadmin]="${DEFAULT_RECIPIENT_MSTEAM}" + +role_recipients_rocketchat[proxyadmin]="${DEFAULT_RECIPIENT_ROCKETCHAT}" + +role_recipients_sms[proxyadmin]="${DEFAULT_RECIPIENT_SMS}" + +role_recipients_dynatrace[proxyadmin]="${DEFAULT_RECIPIENT_DYNATRACE}" + +role_recipients_opsgenie[proxyadmin]="${DEFAULT_RECIPIENT_OPSGENIE}" + +role_recipients_matrix[proxyadmin]="${DEFAULT_RECIPIENT_MATRIX}" + +role_recipients_stackpulse[proxyadmin]="${DEFAULT_RECIPIENT_STACKPULSE}" + +# ----------------------------------------------------------------------------- +# peripheral devices 
+# UPS, photovoltaics, etc + +role_recipients_email[sitemgr]="${DEFAULT_RECIPIENT_EMAIL}" + +role_recipients_hangouts[sitemgr]="${DEFAULT_RECIPIENT_HANGOUTS}" + +role_recipients_pushover[sitemgr]="${DEFAULT_RECIPIENT_PUSHOVER}" + +role_recipients_pushbullet[sitemgr]="${DEFAULT_RECIPIENT_PUSHBULLET}" + +role_recipients_telegram[sitemgr]="${DEFAULT_RECIPIENT_TELEGRAM}" + +role_recipients_slack[sitemgr]="${DEFAULT_RECIPIENT_SLACK}" + +role_recipients_alerta[sitemgr]="${DEFAULT_RECIPIENT_ALERTA}" + +role_recipients_flock[sitemgr]="${DEFAULT_RECIPIENT_FLOCK}" + +role_recipients_discord[sitemgr]="${DEFAULT_RECIPIENT_DISCORD}" + +role_recipients_hipchat[sitemgr]="${DEFAULT_RECIPIENT_HIPCHAT}" + +role_recipients_twilio[sitemgr]="${DEFAULT_RECIPIENT_TWILIO}" + +role_recipients_messagebird[sitemgr]="${DEFAULT_RECIPIENT_MESSAGEBIRD}" + +role_recipients_kavenegar[sitemgr]="${DEFAULT_RECIPIENT_KAVENEGAR}" + +role_recipients_pd[sitemgr]="${DEFAULT_RECIPIENT_PD}" + +role_recipients_fleep[sitemgr]="${DEFAULT_RECIPIENT_FLEEP}" + +role_recipients_syslog[sitemgr]="${DEFAULT_RECIPIENT_SYSLOG}" + +role_recipients_prowl[sitemgr]="${DEFAULT_RECIPIENT_PROWL}" + +role_recipients_awssns[sitemgr]="${DEFAULT_RECIPIENT_AWSSNS}" + +role_recipients_custom[sitemgr]="${DEFAULT_RECIPIENT_CUSTOM}" + +role_recipients_msteam[sitemgr]="${DEFAULT_RECIPIENT_MSTEAM}" + +role_recipients_rocketchat[sitemgr]="${DEFAULT_RECIPIENT_ROCKETCHAT}" + +role_recipients_sms[sitemgr]="${DEFAULT_RECIPIENT_SMS}" + +role_recipients_dynatrace[sitemgr]="${DEFAULT_RECIPIENT_DYNATRACE}" + +role_recipients_opsgenie[sitemgr]="${DEFAULT_RECIPIENT_OPSGENIE}" + +role_recipients_matrix[sitemgr]="${DEFAULT_RECIPIENT_MATRIX}" + +role_recipients_stackpulse[sitemgr]="${DEFAULT_RECIPIENT_STACKPULSE}" diff --git a/health/notifications/health_email_recipients.conf b/health/notifications/health_email_recipients.conf new file mode 100644 index 0000000..f56c6c6 --- /dev/null +++ b/health/notifications/health_email_recipients.conf @@ -0,0 
+1,2 @@ +# OBSOLETE FILE +# REPLACED WITH health_alarm_notify.conf diff --git a/health/notifications/irc/Makefile.inc b/health/notifications/irc/Makefile.inc new file mode 100644 index 0000000..1a68f65 --- /dev/null +++ b/health/notifications/irc/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + irc/README.md \ + irc/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/irc/README.md b/health/notifications/irc/README.md new file mode 100644 index 0000000..e7f22e1 --- /dev/null +++ b/health/notifications/irc/README.md @@ -0,0 +1,78 @@ +<!-- +title: "IRC" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/irc/README.md +--> + +# IRC + +This is what you will get: + +IRCCloud web client:\ +![image](https://user-images.githubusercontent.com/31221999/36793487-3735673e-1ca6-11e8-8880-d1d8b6cd3bc0.png) + +Irssi terminal client: +![image](https://user-images.githubusercontent.com/31221999/36793486-3713ada6-1ca6-11e8-8c12-70d956ad801e.png) + +You need: + +1. The `nc` utility. If you do not set the path, Netdata will search for it in your system `$PATH`. + +Set the path for `nc` in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: + +``` +#------------------------------------------------------------------------------ +# external commands +# +# The full path of the nc command. +# If empty, the system $PATH will be searched for it. +# If not found, irc notifications will be silently disabled. +nc="/usr/bin/nc" +``` + +2. An `IRC_NETWORK` to which your preferred channels belong. +3. One or more channels ( `DEFAULT_RECIPIENT_IRC` ) to post the messages to. +4. An `IRC_NICKNAME` and an `IRC_REALNAME` to identify in IRC. 
+ +Set them in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: + +``` +#------------------------------------------------------------------------------ +# irc notification options +# +# irc notifications require only the nc utility to be installed. + +# multiple recipients can be given like this: +# "<irc_channel_1> <irc_channel_2> ..." + +# enable/disable sending irc notifications +SEND_IRC="YES" + +# if a role's recipients are not configured, a notification will not be sent. +# (empty = do not send a notification for unconfigured roles): +DEFAULT_RECIPIENT_IRC="#system-alarms" + +# The irc network to which the recipients belong. It must be the full network. +IRC_NETWORK="irc.freenode.net" + +# The irc nickname which is required to send the notification. It must not be +# an already registered name as the connection's MODE is defined as a 'guest'. +IRC_NICKNAME="netdata-alarm-user" + +# The irc realname which is required in order to make the connection and is an +# extra identifier. +IRC_REALNAME="netdata-user" +``` + +You can define multiple channels like this: `#system-alarms #networking-alarms`.\ +You can also filter the notifications like this: `#system-alarms|critical`.\ +You can give different channels per **role** using these (at the same file): + +``` +role_recipients_irc[sysadmin]="#user-alarms #networking-alarms #system-alarms" +role_recipients_irc[dba]="#databases-alarms" +role_recipients_irc[webmaster]="#networking-alarms" +``` + +The keywords `#user-alarms`, `#networking-alarms`, `#system-alarms`, `#databases-alarms` are irc channels which belong to the specified IRC network. 
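The recipient syntax described above (space-separated channels, each optionally followed by a `|critical` filter) can be illustrated with a small shell sketch. This is only an illustration under stated assumptions: `split_recipient` is a hypothetical helper, not a function from `alarm-notify.sh`, and treating a missing suffix as "all severities" is an assumption about how the filter behaves.

```sh
#!/bin/sh
# Hypothetical helper (not part of Netdata): split "recipient|modifier"
# into the recipient and its filter. A missing modifier is reported as "all".
split_recipient() (
    entry="$1"
    name="${entry%%|*}"          # text before the first '|'
    modifier="all"
    if [ "$entry" != "$name" ]; then
        modifier="${entry#*|}"   # text after the first '|'
    fi
    printf '%s %s\n' "$name" "$modifier"
)

# Example recipient list in the format shown in the README.
recipients='#system-alarms|critical #networking-alarms'
for r in $recipients; do
    split_recipient "$r"
done
```

Each entry is handled independently, so one channel can be limited to critical alarms while another receives everything.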
+ +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Firc%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/health/notifications/kavenegar/Makefile.inc b/health/notifications/kavenegar/Makefile.inc new file mode 100644 index 0000000..b98e794 --- /dev/null +++ b/health/notifications/kavenegar/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + kavenegar/README.md \ + kavenegar/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/kavenegar/README.md b/health/notifications/kavenegar/README.md new file mode 100644 index 0000000..b59799f --- /dev/null +++ b/health/notifications/kavenegar/README.md @@ -0,0 +1,46 @@ +<!-- +title: "Kavenegar" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/kavenegar/README.md +--> + +# Kavenegar + +[Kavenegar](https://kavenegar.com/) as service for software developers, based in Iran, provides send and receive SMS, calling voice by using its APIs. + +Will look like this on your Android device: + +![image](https://cloud.githubusercontent.com/assets/17090999/20034652/620b6100-a39b-11e6-96af-4f83b8e830e2.png) + +You will need: + +1. Signup and Login to kavenegar.com +2. Get your APIKEY and Sender from `http://panel.kavenegar.com/client/setting/account` +3. Fill in KAVENEGAR_API_KEY="" KAVENEGAR_SENDER="" +4. 
Add the recipient phone numbers to DEFAULT_RECIPIENT_KAVENEGAR="" + +Set them in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: + +``` +############################################################################### +# Kavenegar (kavenegar.com) SMS options + +# multiple recipients can be given like this: +# "09155555555 09177777777" + +# enable/disable sending kavenegar SMS +SEND_KAVENEGAR="YES" + +# to get an access key, after selecting and purchasing your desired service +# at http://kavenegar.com/pricing.html +# login to your account, go to your dashboard and my account are +# https://panel.kavenegar.com/Client/setting/account from API Key +# copy your api key. You can generate new API Key too. +# You can find and select kevenegar sender number from this place. + +# Without an API key, Netdata cannot send KAVENEGAR text messages. +KAVENEGAR_API_KEY="" +KAVENEGAR_SENDER="" +DEFAULT_RECIPIENT_KAVENEGAR="" +``` + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Fkavenegar%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/health/notifications/matrix/Makefile.inc b/health/notifications/matrix/Makefile.inc new file mode 100644 index 0000000..9937d80 --- /dev/null +++ b/health/notifications/matrix/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + matrix/README.md \ + matrix/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/matrix/README.md b/health/notifications/matrix/README.md new file mode 100644 index 0000000..ea22b4a --- /dev/null +++ 
b/health/notifications/matrix/README.md @@ -0,0 +1,58 @@ +<!-- +title: "Send Netdata notifications to Matrix network rooms" +description: "Stay aware of warning or critical anomalies by sending health alarms to Matrix network rooms with Netdata's health monitoring watchdog." +sidebar_label: "Matrix" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/matrix/README.md +--> + +# Matrix + +Send notifications to [Matrix](https://matrix.org/) network rooms. + +The requirements for this notification method are: + +1. The url of the homeserver (`https://homeserver:port`). +2. Credentials for connecting to the homeserver, in the form of a valid access token for your account (or for a + dedicated notification account). These tokens usually don't expire. +3. The room ids that you want to send the notification to. + +To obtain the access token, you can use the following `curl` command: + +```bash +curl -XPOST -d '{"type":"m.login.password", "user":"example", "password":"wordpass"}' "https://homeserver:8448/_matrix/client/r0/login" +``` + +The room ids are unique identifiers and can be obtained from the room settings in a Matrix client (e.g. Riot). Their +format is `!uniqueid:homeserver`. + +Multiple room ids can be defined by separating them with a space character. + +Detailed information about the Matrix client API is available at the [official +site](https://matrix.org/docs/guides/client-server.html). + +Your `health_alarm_notify.conf` should look like this: + +```conf +############################################################################### +# Matrix notifications +# + +# enable/disable Matrix notifications +SEND_MATRIX="YES" + +# The url of the Matrix homeserver +# e.g. https://matrix.org:8448 +MATRIX_HOMESERVER="https://matrix.org:8448" + +# An access token from a valid Matrix account. Tokens usually don't expire and +# can be controlled from a Matrix client. 
+# See https://matrix.org/docs/guides/client-server.html +MATRIX_ACCESSTOKEN="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" + +# Specify the default rooms to receive the notification if no rooms are provided +# in a role's recipients. +# The format is !roomid:homeservername +DEFAULT_RECIPIENT_MATRIX="!XXXXXXXXXXXX:matrix.org" +``` + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Fmatrix%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/health/notifications/messagebird/Makefile.inc b/health/notifications/messagebird/Makefile.inc new file mode 100644 index 0000000..f8d2332 --- /dev/null +++ b/health/notifications/messagebird/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + messagebird/README.md \ + messagebird/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/messagebird/README.md b/health/notifications/messagebird/README.md new file mode 100644 index 0000000..8e3d1a5 --- /dev/null +++ b/health/notifications/messagebird/README.md @@ -0,0 +1,45 @@ +<!-- +title: "Messagebird" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/messagebird/README.md +--> + +# Messagebird + +The messagebird notifications will look like this on your Android device: + +![image](https://cloud.githubusercontent.com/assets/17090999/20034652/620b6100-a39b-11e6-96af-4f83b8e830e2.png) + +You will need: + +1. Signup and Login to messagebird.com +2. Pick an SMS capable number after sign up to get some free credits +3. Go to <https://www.messagebird.com/app/settings/developers/access> +4. 
Create a new access key under 'API ACCESS (REST)' (you will want a live key) +5. Fill in MESSAGEBIRD_ACCESS_KEY="XXXXXXXX" MESSAGEBIRD_NUMBER="+XXXXXXXXXXX" +6. Add the recipient phone numbers to DEFAULT_RECIPIENT_MESSAGEBIRD="+XXXXXXXXXXX" + +Set them in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: + +``` +#------------------------------------------------------------------------------ +# Messagebird (messagebird.com) SMS options + +# multiple recipients can be given like this: +# "+15555555555 +17777777777" + +# enable/disable sending messagebird SMS +SEND_MESSAGEBIRD="YES" + +# to get an access key, create a free account at https://www.messagebird.com +# verify and activate the account (no CC info needed) +# login to your account and enter your phonenumber to get some free credits +# to get the API key, click on 'API' in the sidebar, then 'API Access (REST)' +# click 'Add access key' and fill in data (you want a live key to send SMS) + +# Without an access key, Netdata cannot send Messagebird text messages. 
+MESSAGEBIRD_ACCESS_KEY="XXXXXXXX" +MESSAGEBIRD_NUMBER="XXXXXXX" +DEFAULT_RECIPIENT_MESSAGEBIRD="XXXXXXX" +``` + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Fmessagebird%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/health/notifications/opsgenie/Makefile.inc b/health/notifications/opsgenie/Makefile.inc new file mode 100644 index 0000000..c85bb7c --- /dev/null +++ b/health/notifications/opsgenie/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + opsgenie/README.md \ + opsgenie/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/opsgenie/README.md b/health/notifications/opsgenie/README.md new file mode 100644 index 0000000..10b9f52 --- /dev/null +++ b/health/notifications/opsgenie/README.md @@ -0,0 +1,62 @@ +<!-- +title: "Send notifications to Opsgenie" +description: "Send alerts to your Opsgenie incident response account any time an anomaly or performance issue strikes a node in your infrastructure." +sidebar_label: "Opsgenie" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/opsgenie/README.md +--> + +# Send notifications to Opsgenie + +[Opsgenie](https://www.atlassian.com/software/opsgenie) is an alerting and incident response tool. It is designed to +group and filter alarms, build custom routing rules for on-call teams, and correlate deployments and commits to +incidents. + +The first step is to create a [Netdata integration](https://docs.opsgenie.com/docs/api-integration) in the +[Opsgenie](https://www.atlassian.com/software/opsgenie) dashboard. 
After this, you need to edit +`health_alarm_notify.conf` on your system, by running the following from your [config +directory](/docs/configure/nodes.md): + +```bash +./edit-config health_alarm_notify.conf +``` + +Change the variable `OPSGENIE_API_KEY` with the API key you got from Opsgenie. `OPSGENIE_API_URL` defaults to +`https://api.opsgenie.com`, however there are region-specific API URLs such as `https://eu.api.opsgenie.com`, so set +this if required. + +```conf +SEND_OPSGENIE="YES" + +# Api key +# Default Opsgenie API +OPSGENIE_API_KEY="11111111-2222-3333-4444-555555555555" +OPSGENIE_API_URL="" +``` + +Changes to `health_alarm_notify.conf` do not require a Netdata restart. You can test your Opsgenie notifications +configuration by issuing the commands, replacing `ROLE` with your preferred role: + +```sh +# become user netdata +sudo su -s /bin/bash netdata + +# send a test alarm +/usr/libexec/netdata/plugins.d/alarm-notify.sh test ROLE +``` + +If everything works, you'll see alarms in your Opsgenie platform: + +![Example alarm notifications in +Opsgenie](https://user-images.githubusercontent.com/49162938/92184518-f725f900-ee40-11ea-9afa-e7c639c72206.png) + +If sending the test notifications fails, you can look in `/var/log/netdata/error.log` to find the relevant error +message: + +```log +2020-09-03 23:07:00: alarm-notify.sh: ERROR: failed to send opsgenie notification for: hades test.chart.test_alarm is CRITICAL, with HTTP error code 401. +``` + +You can find more details about the Opsgenie error codes in their [response +docs](https://docs.opsgenie.com/docs/response). 
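When a test notification fails, the HTTP status in the log line is usually the quickest clue. The sketch below pulls the code out of the sample log line shown above with `sed`. The hint text is an assumption based on general HTTP semantics (401 means the request was not authenticated), not on anything specific to Netdata or Opsgenie.

```sh
#!/bin/sh
# Sample line copied from the error log excerpt above.
line='2020-09-03 23:07:00: alarm-notify.sh: ERROR: failed to send opsgenie notification for: hades test.chart.test_alarm is CRITICAL, with HTTP error code 401.'

# Extract the digits that follow "HTTP error code".
http_code="$(printf '%s\n' "$line" | sed -n 's/.*HTTP error code \([0-9][0-9]*\).*/\1/p')"

# Illustrative hints only (assumptions, not Netdata output).
case "$http_code" in
    401) hint="authentication failed, check OPSGENIE_API_KEY" ;;
    "")  hint="no HTTP error code found in this line" ;;
    *)   hint="see the Opsgenie response docs" ;;
esac

echo "HTTP ${http_code}: ${hint}"
```

To apply the same extraction to a real log, replace the `line` assignment with a line read from `/var/log/netdata/error.log`.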
+ +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Fopsgenie%2FREADME%2FDonations-netdata-has-received&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/health/notifications/pagerduty/Makefile.inc b/health/notifications/pagerduty/Makefile.inc new file mode 100644 index 0000000..ee9b091 --- /dev/null +++ b/health/notifications/pagerduty/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + pagerduty/README.md \ + pagerduty/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/pagerduty/README.md b/health/notifications/pagerduty/README.md new file mode 100644 index 0000000..b1f60d4 --- /dev/null +++ b/health/notifications/pagerduty/README.md @@ -0,0 +1,46 @@ +<!-- +title: "PagerDuty" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/pagerduty/README.md +--> + +# PagerDuty + +[PagerDuty](https://www.pagerduty.com/company/) is the enterprise incident resolution service that integrates with ITOps and DevOps monitoring stacks to improve operational reliability and agility. From enriching and aggregating events to correlating them into incidents, PagerDuty streamlines the incident management process by reducing alert noise and resolution times. 
+ +Here is an example of a PagerDuty dashboard with Netdata notifications: + +![PagerDuty dashboard with Netdata notifications](https://cloud.githubusercontent.com/assets/19278582/21233877/b466a08a-c2a5-11e6-8d66-ee6eed43818f.png) + +To have Netdata send notifications to PagerDuty, you'll first need to set up a PagerDuty `Generic API` service and install the PagerDuty agent on the host running Netdata. See the following guide for details: + +<https://www.pagerduty.com/docs/guides/agent-install-guide/> + +During the setup of the `Generic API` PagerDuty service, you'll obtain a `pagerduty service key`. Keep this **service key** handy. + +Once the PagerDuty agent is installed on your host and can send notifications from your host to your `Generic API` service on PagerDuty, add the **service key** to `DEFAULT_RECIPIENT_PD` in `health_alarm_notify.conf`: + +``` +#------------------------------------------------------------------------------ +# pagerduty.com notification options +# +# pagerduty.com notifications require the pagerduty agent to be installed and +# a "Generic API" pagerduty service. +# https://www.pagerduty.com/docs/guides/agent-install-guide/ + +# multiple recipients can be given like this: +# "<pd_service_key_1> <pd_service_key_2> ..." + +# enable/disable sending pagerduty notifications +SEND_PD="YES" + +# if a role's recipients are not configured, a notification will be sent to +# the "Generic API" pagerduty.com service that uses this service key. +# (empty = do not send a notification for unconfigured roles): +DEFAULT_RECIPIENT_PD="<service key>" + +# Which PD API version should be used? For version 2 or newer, it is necessary to make a request to PagerDuty +# before setting the version (https://developer.pagerduty.com/docs/events-api-v2/overview/). 
+USE_PD_VERSION="1" +``` + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Fpagerduty%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/health/notifications/prowl/Makefile.inc b/health/notifications/prowl/Makefile.inc new file mode 100644 index 0000000..64a1deb --- /dev/null +++ b/health/notifications/prowl/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + prowl/README.md \ + prowl/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/prowl/README.md b/health/notifications/prowl/README.md new file mode 100644 index 0000000..7c60de2 --- /dev/null +++ b/health/notifications/prowl/README.md @@ -0,0 +1,29 @@ +<!-- +title: "Prowl" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/prowl/README.md +--> + +# Prowl + +[Prowl](https://www.prowlapp.com/) is a push notification service for iOS devices. Netdata +supports delivering notifications to iOS devices through Prowl. + +Because of how Netdata integrates with Prowl, there is a hard limit of +at most 1000 notifications per hour (starting from the first notification +sent). Any alerts beyond the first thousand in an hour will be dropped. + +Warning messages will be sent with the 'High' priority, critical messages +will be sent with the 'Emergency' priority, and all other messages will +be sent with the normal priority. Opening the notification's associated +URL will take you to the Netdata dashboard of the system that issued +the alert, directly to the chart that it triggered on. 
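The severity-to-priority mapping described above can be sketched as a small shell function. The numeric values follow the priority scale documented by the Prowl API (`0` normal, `1` high, `2` emergency). The function name `prowl_priority` and the upper-case status strings are assumptions made for illustration, not necessarily what Netdata uses internally.

```sh
#!/bin/sh
# Hypothetical helper: map an alarm status to a Prowl priority value.
#   2 = Emergency, 1 = High, 0 = Normal (Prowl API priority scale)
prowl_priority() {
    case "$1" in
        CRITICAL) echo 2 ;;   # critical alarms: 'Emergency'
        WARNING)  echo 1 ;;   # warning alarms: 'High'
        *)        echo 0 ;;   # everything else: normal priority
    esac
}

for status in WARNING CRITICAL CLEAR; do
    echo "${status} -> priority $(prowl_priority "$status")"
done
```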
+ +## configuration + +To use this, you will need a Prowl API key, which can be requested through +the Prowl website after registering. + +Once you have an API key, simply specify that as a recipient for Prowl +notifications. + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Fprowl%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]() diff --git a/health/notifications/pushbullet/Makefile.inc b/health/notifications/pushbullet/Makefile.inc new file mode 100644 index 0000000..d3a9459 --- /dev/null +++ b/health/notifications/pushbullet/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + pushbullet/README.md \ + pushbullet/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/pushbullet/README.md b/health/notifications/pushbullet/README.md new file mode 100644 index 0000000..7a098d6 --- /dev/null +++ b/health/notifications/pushbullet/README.md @@ -0,0 +1,48 @@ +<!-- +title: "PushBullet" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/pushbullet/README.md +--> + +# PushBullet + +Will look like this on your browser: +![image](https://cloud.githubusercontent.com/assets/4300670/19109636/278b1c0c-8aee-11e6-8a09-7fc94fdbfec8.png) + +And like this on your Android device: + +![image](https://cloud.githubusercontent.com/assets/4300670/19109635/278a1dde-8aee-11e6-9984-0bc87a13312d.png) + +You will need: + +1. Signup and Login to pushbullet.com +2. Get your Access Token, go to <https://www.pushbullet.com/#settings/account> and create a new one +3. Fill in the PUSHBULLET_ACCESS_TOKEN with that value +4. 
Add the recipient emails to DEFAULT_RECIPIENT_PUSHBULLET + !!PLEASE NOTE THAT IF THE RECIPIENT DOES NOT HAVE A PUSHBULLET ACCOUNT, PUSHBULLET SERVICE WILL SEND AN EMAIL!! + +Set them in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: + +``` +############################################################################### +# pushbullet (pushbullet.com) push notification options + +# multiple recipients can be given like this: +# "user1@email.com user2@mail.com" + +# enable/disable sending pushbullet notifications +SEND_PUSHBULLET="YES" + +# Signup and Login to pushbullet.com +# To get your Access Token, go to https://www.pushbullet.com/#settings/account +# And create a new access token +# Then just set the recipients emails +# Please note that the if the email in the DEFAULT_RECIPIENT_PUSHBULLET does +# not have a pushbullet account, the pushbullet service will send an email +# to that address instead + +# Without an access token, Netdata cannot send pushbullet notifications. 
+PUSHBULLET_ACCESS_TOKEN="o.Sometokenhere" +DEFAULT_RECIPIENT_PUSHBULLET="admin1@example.com admin3@somemail.com" +``` + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Fpushbullet%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/health/notifications/pushover/Makefile.inc b/health/notifications/pushover/Makefile.inc new file mode 100644 index 0000000..9b703a1 --- /dev/null +++ b/health/notifications/pushover/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + pushover/README.md \ + pushover/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/pushover/README.md b/health/notifications/pushover/README.md new file mode 100644 index 0000000..3ba97fb --- /dev/null +++ b/health/notifications/pushover/README.md @@ -0,0 +1,23 @@ +<!-- +title: "PushOver" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/pushover/README.md +--> + +# PushOver + +pushover.net allows you to receive push notifications on your mobile phone. The service seems free for up to 7.500 messages per month. + +Netdata will send warning messages with priority `0` and critical messages with priority `1`. pushover.net allows you to select do-not-disturb hours. The way this is configured, critical notifications will ring and vibrate your phone, even during the do-not-disturb-hours. All other notifications will be delivered silently. + +You need: + +1. APP TOKEN. You can use the same on all your Netdata servers. +2. USER TOKEN for each user you are going to send notifications to. This is the actual recipient of the notification. 
+ +The configuration is like above (slack messages). + +pushover.net notifications look like this: + +![image](https://cloud.githubusercontent.com/assets/2662304/18407319/839c10c4-7715-11e6-92c0-12f8215128d3.png) + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Fpushover%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/health/notifications/rocketchat/Makefile.inc b/health/notifications/rocketchat/Makefile.inc new file mode 100644 index 0000000..58f210b --- /dev/null +++ b/health/notifications/rocketchat/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + rocketchat/README.md \ + rocketchat/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/rocketchat/README.md b/health/notifications/rocketchat/README.md new file mode 100644 index 0000000..a54f582 --- /dev/null +++ b/health/notifications/rocketchat/README.md @@ -0,0 +1,52 @@ +<!-- +title: "Rocket.Chat" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/rocketchat/README.md +--> + +# Rocket.Chat + +This is what you will get: +![Netdata on RocketChat](https://i.imgur.com/Zu4t3j3.png) +You need: + +1. The **incoming webhook URL** as given by RocketChat. You can use the same on all your Netdata servers (or you can have multiple if you like - your decision). +2. One or more channels to post the messages to. 
+ +Get them here: <https://rocket.chat/docs/administrator-guides/integrations/index.html#how-to-create-a-new-incoming-webhook> + +Set them in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: + +``` +#------------------------------------------------------------------------------ +# rocketchat (rocket.chat) global notification options + +# multiple recipients can be given like this: +# "CHANNEL1 CHANNEL2 ..." + +# enable/disable sending rocketchat notifications +SEND_ROCKETCHAT="YES" + +# Login to rocket.chat and create an incoming webhook. You need only one for all +# your Netdata servers (or you can have one for each of your Netdata servers). +# Without it, Netdata cannot send rocketchat notifications. +ROCKETCHAT_WEBHOOK_URL="<your_incoming_webhook_url>" + +# if a role's recipients are not configured, a notification will be sent to +# this rocketchat channel (empty = do not send a notification for unconfigured +# roles). +DEFAULT_RECIPIENT_ROCKETCHAT="monitoring_alarms" +``` + +You can define multiple channels like this: `alarms systems`. +You can give different channels per **role** using these (in the same file): + +``` +role_recipients_rocketchat[sysadmin]="systems" +role_recipients_rocketchat[dba]="databases systems" +role_recipients_rocketchat[webmaster]="marketing development" +``` + +The keywords `systems`, `databases`, `marketing`, `development` are RocketChat channels (they should already exist). +Both public and private channels can be used, even if they differ from the channel configured in your RocketChat incoming webhook. 
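The fallback rule above (a role without its own recipients uses `DEFAULT_RECIPIENT_ROCKETCHAT`) can be sketched in portable shell. This is a simplified illustration: the real configuration file uses `role_recipients_rocketchat[...]` array entries, while the hypothetical `recipients_for_role` function below mimics the lookup with a `case` statement so that it runs in any POSIX shell.

```sh
#!/bin/sh
DEFAULT_RECIPIENT_ROCKETCHAT="monitoring_alarms"

# Hypothetical lookup mirroring the role examples above; roles that are
# not listed fall back to the default channel.
recipients_for_role() {
    case "$1" in
        sysadmin)  echo "systems" ;;
        dba)       echo "databases systems" ;;
        webmaster) echo "marketing development" ;;
        *)         echo "$DEFAULT_RECIPIENT_ROCKETCHAT" ;;
    esac
}

# A space-separated value expands to one notification per channel.
for channel in $(recipients_for_role dba); do
    echo "would post to channel: ${channel}"
done
```

An unlisted role, for example `recipients_for_role proxyadmin`, yields the default channel `monitoring_alarms`.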
+ +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Frocketchat%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/health/notifications/slack/Makefile.inc b/health/notifications/slack/Makefile.inc new file mode 100644 index 0000000..043bfaf --- /dev/null +++ b/health/notifications/slack/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + slack/README.md \ + slack/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/slack/README.md b/health/notifications/slack/README.md new file mode 100644 index 0000000..e338e9a --- /dev/null +++ b/health/notifications/slack/README.md @@ -0,0 +1,50 @@ +<!-- +title: "Slack" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/slack/README.md +--> + +# Slack + +This is what you will get: +![image](https://cloud.githubusercontent.com/assets/2662304/18407116/bbd0fee6-7710-11e6-81cf-58c0defaee2b.png) + +You need: + +1. The **incoming webhook URL** as given by slack.com. You can use the same on all your Netdata servers (or you can have multiple if you like - your decision). +2. One or more channels to post the messages to. + +To get a webhook that works on multiple channels, you will need to login to your slack.com workspace and create an incoming webhook using the [Incoming Webhooks App](https://slack.com/apps/A0F7XDUAZ-incoming-webhooks). +Do NOT use the instructions in <https://api.slack.com/incoming-webhooks#enable_webhooks>, as the particular webhooks work only for a single channel. 
+ +Set the webhook and the recipients in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: + +``` +SEND_SLACK="YES" + +SLACK_WEBHOOK_URL="https://hooks.slack.com/services/XXXXXXXX/XXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" + +# if a role's recipients are not configured, a notification will be send to: +# - A slack channel (syntax: '#channel' or 'channel') +# - A slack user (syntax: '@user') +# - The channel or user defined in slack for the webhook (syntax: '#') +# empty = do not send a notification for unconfigured roles +DEFAULT_RECIPIENT_SLACK="alarms" +``` + +You can define multiple recipients like this: `# #alarms systems @myuser`. +This example will send the alarm to: + +- The recipient defined in slack for the webhook (not known to Netdata) +- The channel 'alarms' +- The channel 'systems' +- The user @myuser + +You can give different recipients per **role** using these (at the same file): + +``` +role_recipients_slack[sysadmin]="systems" +role_recipients_slack[dba]="databases systems" +role_recipients_slack[webmaster]="marketing development" +``` + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Fslack%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/health/notifications/smstools3/Makefile.inc b/health/notifications/smstools3/Makefile.inc new file mode 100644 index 0000000..4764b9e --- /dev/null +++ b/health/notifications/smstools3/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + smstools3/README.md \ + smstools3/Makefile.inc \ + $(NULL) + diff --git 
a/health/notifications/smstools3/README.md b/health/notifications/smstools3/README.md new file mode 100644 index 0000000..6d90e70 --- /dev/null +++ b/health/notifications/smstools3/README.md @@ -0,0 +1,44 @@ +<!-- +title: "SMS Server Tools 3" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/smstools3/README.md +--> + +# SMS Server Tools 3 + +The [SMS Server Tools 3](http://smstools3.kekekasvi.com/) is a SMS Gateway software which can send and receive short messages through GSM modems and mobile phones. + +To have Netdata send notifications via SMS Server Tools 3, you'll first need to [install](http://smstools3.kekekasvi.com/index.php?p=compiling) and [configure](http://smstools3.kekekasvi.com/index.php?p=configure) smsd. + +Ensure that the user `netdata` can execute `sendsms`. Any user executing `sendsms` needs to: + +- Have write permissions to `/tmp` and `/var/spool/sms/outgoing` +- Be a member of group `smsd` + +To ensure that the steps above are successful, just `su netdata` and execute `sendsms phone message`. + +You then just need to configure the recipient phone numbers in `health_alarm_notify.conf`: + +```sh +#------------------------------------------------------------------------------ +# SMS Server Tools 3 (smstools3) global notification options + +# enable/disable sending SMS Server Tools 3 SMS notifications +SEND_SMS="YES" + +# if a role's recipients are not configured, a notification will be sent to +# this SMS channel (empty = do not send a notification for unconfigured +# roles). Multiple recipients can be given like this: "PHONE1 PHONE2 ..." + +DEFAULT_RECIPIENT_SMS="" +``` + +Netdata uses the script `sendsms` that is installed by `smstools3` and just passes a phone number and a message to it. If `sendsms` is not in `$PATH`, you can pass its location in `health_alarm_notify.conf`: + +```sh +# The full path of the sendsms command (smstools3). +# If empty, the system $PATH will be searched for it. 
+# If not found, SMS notifications will be silently disabled. +sendsms="" +``` + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Fsmstools3%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/health/notifications/stackpulse/Makefile.inc b/health/notifications/stackpulse/Makefile.inc new file mode 100644 index 0000000..eabcb4b --- /dev/null +++ b/health/notifications/stackpulse/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + stackpulse/README.md \ + stackpulse/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/stackpulse/README.md b/health/notifications/stackpulse/README.md new file mode 100644 index 0000000..13d2f72 --- /dev/null +++ b/health/notifications/stackpulse/README.md @@ -0,0 +1,80 @@ +<!-- +title: "Send notifications to StackPulse" +description: "Send alerts to your StackPulse Netdata integration any time an anomaly or performance issue strikes a node in your infrastructure." +sidebar_label: "StackPulse" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/stackpulse/README.md +--> + +# Send notifications to StackPulse + +[StackPulse](https://stackpulse.com/) is a software-as-a-service platform for site reliability engineering. +It helps SREs, DevOps Engineers and Software Developers reduce toil and alert fatigue while improving reliability of +software services by managing, analyzing and automating incident response activities. 
+ +Sending Netdata alarm notifications to StackPulse allows you to create smart automated response workflows +(StackPulse playbooks) that will help you drive down your MTTD and MTTR by performing any of the following: + +- Enriching the incident with data from multiple sources +- Performing triage actions and analyzing their results +- Orchestrating incident management and notification flows +- Performing automatic and semi-automatic remediation actions +- Analyzing incident data and remediation patterns to improve reliability of your services + +To send notifications, you need to: + +1. Create a Netdata integration in the `StackPulse Administration Portal`, and copy the `Endpoint` URL. + +![Creating a Netdata integration in StackPulse](https://user-images.githubusercontent.com/49162938/93023348-d9455a80-f5dd-11ea-8e05-67d07dce93e4.png) + +2. On your node, navigate to `/etc/netdata/` and run the following command: + +```sh +$ ./edit-config health_alarm_notify.conf +``` + +3. Set the `STACKPULSE_WEBHOOK` variable to the `Endpoint` URL you copied earlier: + +``` +SEND_STACKPULSE="YES" +STACKPULSE_WEBHOOK="https://hooks.stackpulse.io/v1/webhooks/YOUR_UNIQUE_ID" +``` + +4. Now [restart Netdata](/docs/getting-started.md#start-stop-and-restart-netdata). When your node creates an alarm, you + can see the associated notification on your StackPulse Administration Portal. + +## React to alarms with playbooks + +StackPulse allows users to create `Playbooks` giving additional information about events that happen in specific +scenarios. For example, you could create a Playbook that responds to a "low disk space" alarm by compressing and +cleaning up storage partitions with dynamic data. 
+ +![image](https://user-images.githubusercontent.com/49162938/93207961-4c201400-f74b-11ea-94d1-42a29d007b62.png) + +![The StackPulse Administration Portal with a Netdata +alarm](https://user-images.githubusercontent.com/49162938/93208199-bfc22100-f74b-11ea-83c4-728be23dcf4d.png) +### Create Playbooks for Netdata alarms + +To create a Playbook, you need to access the StackPulse Administration Portal. After the initial setup, you need to +access the **TRIGGER** tab to define the scenarios used to trigger the event. The following variables are available: + +- `Hostname`: The host that generated the event. +- `Chart`: The name of the chart. +- `OldValue` : The previous value of the alarm. +- `Value`: The current value of the alarm. +- `Units` : The units of the value. +- `OldStatus` : The previous status: REMOVED, UNINITIALIZED, UNDEFINED, CLEAR, WARNING, CRITICAL. +- `State`: The current alarm status. The acceptable values are the same as for `OldStatus`. +- `Alarm` : The name of the alarm, as given in Netdata's health.d entries. +- `Date` : The timestamp this event occurred. +- `Duration` : The duration in seconds of the previous alarm state. +- `NonClearDuration` : The total duration in seconds this is/was non-clear. +- `Description` : A short description of the alarm copied from the alarm definition. +- `CalcExpression` : The expression that was evaluated to trigger the alarm. +- `CalcParamValues` : The values of the parameters in the expression, at the time of the evaluation. +- `TotalWarnings` : Total number of alarms in WARNING state. +- `TotalCritical` : Total number of alarms in CRITICAL state. +- `ID` : The unique id of the alarm that generated this event. + +For more details on how to create a scenario, take a look at the [StackPulse documentation](https://docs.stackpulse.io). 
+ +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Fopsgenie%2FREADME%2FDonations-netdata-has-received&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/health/notifications/syslog/Makefile.inc b/health/notifications/syslog/Makefile.inc new file mode 100644 index 0000000..94a8acc --- /dev/null +++ b/health/notifications/syslog/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + syslog/README.md \ + syslog/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/syslog/README.md b/health/notifications/syslog/README.md new file mode 100644 index 0000000..456394d --- /dev/null +++ b/health/notifications/syslog/README.md @@ -0,0 +1,34 @@ +<!-- +title: "Syslog" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/syslog/README.md +--> + +# Syslog + +You need a working `logger` command for this to work. This is the case on pretty much every Linux system in existence, and most BSD systems. + +Logged messages will look like this: + +``` +netdata WARNING on hostname at Tue Apr 3 09:00:00 EDT 2018: disk_space._ out of disk space time = 5h +``` + +## configuration + +System log targets are configured as recipients in [`/etc/netdata/health_alarm_notify.conf`](https://github.com/netdata/netdata/blob/36bedc044584dea791fd29455bdcd287c3306cb2/conf.d/health_alarm_notify.conf#L534) (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`). + +You can also configure per-role targets in the same file a bit further down. 
+ +Targets are defined as follows: + +``` +[[facility.level][@host[:port]]/]prefix +``` + +`prefix` defines what the log messages are prefixed with. By default, all lines are prefixed with 'netdata'. + +The `facility` and `level` are the standard syslog facility and level options, for more info on them see your local `logger` and `syslog` documentation. By default, Netdata will log to the `local6` facility, with a log level dependent on the type of message (`crit` for CRITICAL, `warning` for WARNING, and `info` for everything else). + +You can configure sending directly to remote log servers by specifying a host (and optionally a port). However, this has a somewhat high overhead, so it is much preferred to use your local syslog daemon to handle the forwarding of messages to remote systems (pretty much all of them allow at least simple forwarding, and most of the really popular ones support complex queueing and routing of messages to remote log servers). + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Fsyslog%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/health/notifications/telegram/Makefile.inc b/health/notifications/telegram/Makefile.inc new file mode 100644 index 0000000..ffca071 --- /dev/null +++ b/health/notifications/telegram/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + telegram/README.md \ + telegram/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/telegram/README.md b/health/notifications/telegram/README.md new file mode 100644 index 0000000..c1c6f2a --- /dev/null +++ b/health/notifications/telegram/README.md @@ -0,0 
+1,45 @@ +<!-- +title: "Telegram" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/telegram/README.md +--> + +# Telegram + +[Telegram](https://telegram.org/) is a messaging app with a focus on speed and security, it’s super-fast, simple and free. You can use Telegram on all your devices at the same time — your messages sync seamlessly across any number of your phones, tablets or computers. + +With Telegram, you can send messages, photos, videos and files of any type (doc, zip, mp3, etc), as well as create groups for up to 100,000 people or channels for broadcasting to unlimited audiences. You can write to your phone contacts and find people by their usernames. As a result, Telegram is like SMS and email combined — and can take care of all your personal or business messaging needs. + +Netdata will send warning messages without vibration. + +You need to: + +1. Get a bot token. To get one, contact the [@BotFather](https://t.me/BotFather) bot and send the command `/newbot`. Follow the instructions. +2. Start a conversation with your bot or invite it into a group where you want it to send messages. +3. Find the chat ID for every chat you want to send messages to. Contact the [@myidbot](https://t.me/myidbot) bot and send the `/getid` command to get your personal chat ID or invite it into a group and use the `/getgroupid` command to get the group chat ID. Group IDs start with a hyphen, supergroup IDs start with `-100`. + Alternatively, you can get the chat ID directly from the bot API. Send *your* bot a command in the chat you want to use, then check `https://api.telegram.org/bot{YourBotToken}/getUpdates`, eg. `https://api.telegram.org/bot111122223:7OpFlFFRzRBbrUUmIjj5HF9Ox2pYJZy5/getUpdates` +4. 
Set the bot token and the chat ID of the recipient in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: +``` +SEND_TELEGRAM="YES" +TELEGRAM_BOT_TOKEN="111122223:7OpFlFFRzRBbrUUmIjj5HF9Ox2pYJZy5" +DEFAULT_RECIPIENT_TELEGRAM="-100233335555" +``` + +You can define multiple recipients like this: `"-100311112222 212341234|critical"`. +This example will send: + +- All alerts to the group with ID -100311112222 +- Critical alerts to the user with ID 212341234 + +You can give different recipients per **role** using these (in the same file): + +``` +role_recipients_telegram[sysadmin]="212341234" +role_recipients_telegram[dba]="-1004444333321" +role_recipients_telegram[webmaster]="49999333322 -1009999222255" +``` + +Telegram messages look like this: + +![Netdata notifications via Telegram](https://user-images.githubusercontent.com/1153921/66612223-f07dfb80-eb75-11e9-976f-5734ffd93ecd.png) + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Ftelegram%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/health/notifications/twilio/Makefile.inc b/health/notifications/twilio/Makefile.inc new file mode 100644 index 0000000..0f2d8d8 --- /dev/null +++ b/health/notifications/twilio/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + twilio/README.md \ + twilio/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/twilio/README.md b/health/notifications/twilio/README.md new file mode 100644 index 0000000..b36d40b --- /dev/null +++ b/health/notifications/twilio/README.md @@ -0,0 +1,47 
@@ +<!-- +title: "Twilio" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/twilio/README.md +--> + +# Twilio + +Will look like this on your Android device: + +![image](https://cloud.githubusercontent.com/assets/17090999/20034652/620b6100-a39b-11e6-96af-4f83b8e830e2.png) + +You will need: + +1. Signup and Login to twilio.com +2. Pick an SMS capable number during sign up. +3. Get your SID, and Token from <https://www.twilio.com/console> +4. Fill in TWILIO_ACCOUNT_SID="XXXXXXXX" TWILIO_ACCOUNT_TOKEN="XXXXXXXXX" TWILIO_NUMBER="+XXXXXXXXXXX" +5. Add the recipient phone numbers to DEFAULT_RECIPIENT_TWILIO="+XXXXXXXXXXX" + +!!PLEASE NOTE THAT IF YOUR ACCOUNT IS A TRIAL ACCOUNT YOU WILL ONLY BE ABLE TO SEND NOTIFICATIONS TO THE NUMBER YOU SIGNED UP WITH + +Set them in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: + +``` +############################################################################### +# Twilio (twilio.com) SMS options + +# multiple recipients can be given like this: +# "+15555555555 +17777777777" + +# enable/disable sending twilio SMS +SEND_TWILIO="YES" + +# Signup for free trial and select a SMS capable Twilio Number +# To get your Account SID and Token, go to https://www.twilio.com/console +# Place your sid, token and number below. +# Then just set the recipients' phone numbers. +# The trial account is only allowed to use the number specified when set up. + +# Without an account sid and token, Netdata cannot send Twilio text messages. 
+TWILIO_ACCOUNT_SID="xxxxxxxxx" +TWILIO_ACCOUNT_TOKEN="xxxxxxxxxx" +TWILIO_NUMBER="xxxxxxxxxxx" +DEFAULT_RECIPIENT_TWILIO="+15555555555" +``` + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Ftwilio%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) diff --git a/health/notifications/web/Makefile.inc b/health/notifications/web/Makefile.inc new file mode 100644 index 0000000..b564d83 --- /dev/null +++ b/health/notifications/web/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + web/README.md \ + web/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/web/README.md b/health/notifications/web/README.md new file mode 100644 index 0000000..9e49186 --- /dev/null +++ b/health/notifications/web/README.md @@ -0,0 +1,13 @@ +<!-- +title: "Dashboard" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/web/README.md +--> + +# Dashboard + +The Netdata dashboard shows HTML notifications, when it is open. + +Such web notifications look like this: +![image](https://cloud.githubusercontent.com/assets/2662304/18407279/82bac6a6-7714-11e6-847e-c2e84eeacbfb.png) + +[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fhealth%2Fnotifications%2Fweb%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) |