author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-05-04 14:31:17 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-05-04 14:31:17 +0000 |
commit | 8020f71afd34d7696d7933659df2d763ab05542f (patch) | |
tree | 2fdf1b5447ffd8bdd61e702ca183e814afdcb4fc /health | |
parent | Initial commit. (diff) | |
Adding upstream version 1.37.1. [upstream/1.37.1, upstream]
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'health')
146 files changed, 16670 insertions, 0 deletions
diff --git a/health/Makefile.am b/health/Makefile.am new file mode 100644 index 0000000..7c8d7f9 --- /dev/null +++ b/health/Makefile.am @@ -0,0 +1,103 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +AUTOMAKE_OPTIONS = subdir-objects +MAINTAINERCLEANFILES = $(srcdir)/Makefile.in + +SUBDIRS = \ + notifications \ + $(NULL) + +CLEANFILES = \ + $(NULL) + +dist_noinst_DATA = \ + README.md \ + $(NULL) + +userhealthconfigdir=$(configdir)/health.d +dist_userhealthconfig_DATA = \ + $(NULL) + +# Explicitly install directories to avoid permission issues due to umask +install-exec-local: + $(INSTALL) -d $(DESTDIR)$(userhealthconfigdir) + +healthconfigdir=$(libconfigdir)/health.d +dist_healthconfig_DATA = \ + health.d/adaptec_raid.conf \ + health.d/anomalies.conf \ + health.d/apcupsd.conf \ + health.d/bcache.conf \ + health.d/beanstalkd.conf \ + health.d/bind_rndc.conf \ + health.d/boinc.conf \ + health.d/btrfs.conf \ + health.d/ceph.conf \ + health.d/cgroups.conf \ + health.d/cpu.conf \ + health.d/cockroachdb.conf \ + health.d/disks.conf \ + health.d/dnsmasq_dhcp.conf \ + health.d/dns_query.conf \ + health.d/dockerd.conf \ + health.d/entropy.conf \ + health.d/exporting.conf \ + health.d/fping.conf \ + health.d/geth.conf \ + health.d/ioping.conf \ + health.d/gearman.conf \ + health.d/go.d.plugin.conf \ + health.d/haproxy.conf \ + health.d/hdfs.conf \ + health.d/httpcheck.conf \ + health.d/ipc.conf \ + health.d/ipfs.conf \ + health.d/ipmi.conf \ + health.d/isc_dhcpd.conf \ + health.d/kubelet.conf \ + health.d/linux_power_supply.conf \ + health.d/load.conf \ + health.d/mdstat.conf \ + health.d/megacli.conf \ + health.d/memcached.conf \ + health.d/memory.conf \ + health.d/ml.conf \ + health.d/mysql.conf \ + health.d/net.conf \ + health.d/netfilter.conf \ + health.d/nvme.conf \ + health.d/nut.conf \ + health.d/pihole.conf \ + health.d/ping.conf \ + health.d/postgres.conf \ + health.d/portcheck.conf \ + health.d/processes.conf \ + health.d/python.d.plugin.conf \ + 
health.d/qos.conf \ + health.d/ram.conf \ + health.d/redis.conf \ + health.d/retroshare.conf \ + health.d/riakkv.conf \ + health.d/scaleio.conf \ + health.d/softnet.conf \ + health.d/synchronization.conf \ + health.d/swap.conf \ + health.d/systemdunits.conf \ + health.d/timex.conf \ + health.d/tcp_conn.conf \ + health.d/tcp_listen.conf \ + health.d/tcp_mem.conf \ + health.d/tcp_orphans.conf \ + health.d/tcp_resets.conf \ + health.d/udp_errors.conf \ + health.d/unbound.conf \ + health.d/vcsa.conf \ + health.d/vernemq.conf \ + health.d/vsphere.conf \ + health.d/web_log.conf \ + health.d/whoisquery.conf \ + health.d/wmi.conf \ + health.d/x509check.conf \ + health.d/zfs.conf \ + health.d/dbengine.conf \ + $(NULL) diff --git a/health/QUICKSTART.md b/health/QUICKSTART.md new file mode 100644 index 0000000..bc2da2d --- /dev/null +++ b/health/QUICKSTART.md @@ -0,0 +1,143 @@ +<!-- +title: "Health quickstart" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/QUICKSTART.md +--> + +# Health quickstart + +In this quickstart guide, you'll learn the basics of editing health configuration files. With this knowledge, you +will be able to customize how and when Netdata triggers alarms based on the health and performance of your system or +infrastructure. + +To learn about more advanced health configurations, visit the [health reference guide](/health/REFERENCE.md). + +## Edit health configuration files + +You should [use `edit-config`](/docs/configure/nodes.md) to edit Netdata's health configuration files. `edit-config` +will open your system's default terminal editor for you to make your changes. Once you've saved and closed the editor, +`edit-config` will copy your edited file into `/etc/netdata/health.d/`, which will override the stock file in +`/usr/lib/netdata/conf.d/health.d/` and ensure your customizations are persistent between updates. 
+ +For example, to edit the `cpu.conf` health configuration file, you would run: + +```bash +cd /etc/netdata/ # Replace with your Netdata configuration directory, if not /etc/netdata/ +./edit-config health.d/cpu.conf +``` + +Each health configuration file contains one or more health entities, which always begin with an `alarm:` or `template:` +line. You can edit these entities based on your needs. To make any changes live, be sure to [reload your health +configuration](#reload-health-configuration). + +## Reference Netdata's stock health configuration files + +While you should always [use `edit-config`](#edit-health-configuration-files), you might also want to view the stock +health configuration files Netdata ships with. Stock files can be useful as reference material, or to determine which +file you should edit with `edit-config`. + +By default, Netdata will put health configuration files in `/usr/lib/netdata/conf.d/health.d`. However, you can +double-check the location of these files by navigating to `http://NODE:19999/netdata.conf`, replacing `NODE` with the IP +address or hostname for your Agent dashboard, and looking for the `stock health config` option. The value +here will show the correct path for your installation. + +```conf +[directories] + ... + # stock health config = /usr/lib/netdata/conf.d/health.d +``` + +Navigate to the health configuration directory to see all the available files and open them for reading. + +```bash +cd /usr/lib/netdata/conf.d/health.d/ +ls +adaptec_raid.conf entropy.conf memory.conf squid.conf +am2320.conf fping.conf mongodb.conf +apache.conf mysql.conf swap.conf +... +``` + +> ⚠️ If you edit configuration files in your stock health configuration directory, Netdata will overwrite them during +> any updates. Please use `edit-config` as described in the [section above](#edit-health-configuration-files). 
+ +## Write a new health entity + +While tuning existing alarms may work in some cases, you may need to write entirely new health entities based on how +your systems and applications work. + +To write a new health entity, let's create a new file inside of the `health.d/` directory. We'll name our file +`example.conf` for now. + +```bash +./edit-config health.d/example.conf +``` + +As an example, let's build a health entity that triggers an alarm when your system's RAM usage goes above 80%. Copy and paste +the following into the editor: + +```yaml + alarm: ram_usage + on: system.ram +lookup: average -1m percentage of used + units: % + every: 1m + warn: $this > 80 + crit: $this > 90 + info: The percentage of RAM used by the system. +``` + +Let's look into each of the lines to see how they create a working health entity. + +- `alarm`: The name for your new entity. The name needs to follow these requirements: + - Any alphabet letter or number. + - The symbols `.` and `_`. + - Cannot be `chart name`, `dimension name`, `family name`, or `chart variable names`. +- `on`: Which chart the entity listens to. +- `lookup`: Which metrics the alarm monitors, the duration of time to monitor, and how to process the metrics into a + usable format. + - `average`: Calculate the average of all the metrics collected. + - `-1m`: Use metrics from 1 minute ago until now to calculate that average. + - `percentage`: Clarify that we're calculating a percentage of RAM usage. + - `of used`: Specify which dimension (`used`) on the `system.ram` chart you want to monitor with this entity. +- `units`: Use percentages rather than absolute units. +- `every`: How often to perform the `lookup` calculation to decide whether or not to trigger this alarm. +- `warn`/`crit`: The value at which Netdata should trigger a warning or critical alarm. +- `info`: A description of the alarm, which will appear in the dashboard and notifications. + +Let's put all these lines into a human-readable format. 
+ +This health entity, named **ram_usage**, watches the **system.ram** chart. It looks up the last **1 minute** of +metrics from the **used** dimension and calculates the **average** of all those metrics in a **percentage** format, +using a **% unit**. The entity performs this lookup **every minute**. If the average RAM usage percentage over the last +1 minute is **more than 80%**, the entity triggers a warning alarm. If the usage is **more than 90%**, the entity +triggers a critical alarm. + +Now that you've written a new health entity, you need to reload it to see it live on the dashboard. + +## Reload health configuration + +To make any changes to your health configuration live, you must reload Netdata's health monitoring system. To do that +without restarting all of Netdata, run the following: + +```bash +netdatacli reload-health +``` + +If you receive an error like `command not found`, this means that `netdatacli` is not installed in your `$PATH`. In that + case, you can reload only the health component by sending a `SIGUSR2` to Netdata: + +```bash +killall -USR2 netdata +``` +## What's next? + +To learn about all of Netdata's health configuration options, view the [reference guide](/health/REFERENCE.md) and +[daemon configuration](/daemon/config/README.md#health-section-options) for additional options available in the +`[health]` section of `netdata.conf`. + +Or, get guided insights into specific health configurations with our [health guides](/health/README.md#guides). + +Finally, move on to Netdata's [notification system](/health/notifications/README.md) to learn more about how Netdata can +let you know when the health of your systems or apps goes awry. 
+ + diff --git a/health/README.md b/health/README.md new file mode 100644 index 0000000..2b1caf5 --- /dev/null +++ b/health/README.md @@ -0,0 +1,38 @@ +<!-- +title: "Health monitoring" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/README.md +--> + +# Health monitoring + +The Netdata Agent is a health watchdog for the health and performance of your systems, services, and applications. We've +worked closely with our community of DevOps engineers, SREs, and developers to define hundreds of production-ready +alarms that work without any configuration. + +The Agent's health monitoring system is also dynamic and fully customizable. You can write entirely new alarms, tune the +community-configured alarms for every app/service [the Agent collects metrics from](/collectors/COLLECTORS.md), or +silence anything you're not interested in. You can even power complex lookups by running statistical algorithms against +your metrics. + +Ready to take the next steps with health monitoring? + +[Quickstart](/health/QUICKSTART.md) + +[Configuration reference](/health/REFERENCE.md) + +## Guides + +Every infrastructure is different, so we're not interested in mandating how you should configure Netdata's health +monitoring features. Instead, these guides should give you the details you need to tweak alarms to your heart's +content. + +[Stopping notifications for individual alarms](/docs/guides/monitor/stop-notifications-alarms.md) + +[Use dimension templates to create dynamic alarms](/docs/guides/monitor/dimension-templates.md) + +## Related features + +**[Notifications](/health/notifications/README.md)**: Get notified about ongoing alarms from your Agents via your +favorite platform(s), such as Slack, Discord, PagerDuty, email, and much more. 
+ + diff --git a/health/REFERENCE.md b/health/REFERENCE.md new file mode 100644 index 0000000..90da410 --- /dev/null +++ b/health/REFERENCE.md @@ -0,0 +1,1023 @@ +<!-- +title: "Health configuration reference" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/REFERENCE.md +--> + +# Health configuration reference + +Welcome to the health configuration reference. + +This guide contains information about editing health configuration files to tweak existing alarms or create new health +entities that are customized to the needs of your infrastructure. + +To learn the basics of locating and editing health configuration files, see the [health +quickstart](/health/QUICKSTART.md). + +## Health configuration files + +You can configure the Agent's health watchdog service by editing files in two locations: + +- The `[health]` section in `netdata.conf`. By editing the daemon's behavior, you can disable health monitoring + altogether, run health checks more or less often, and more. See [daemon + configuration](/daemon/config/README.md#health-section-options) for a table of all the available settings, their + default values, and what they control. +- The individual `.conf` files in `health.d/`. These health entity files are organized by the type of metric they are + performing calculations on or their associated collector. You should edit these files using the `edit-config` + script. For example: `sudo ./edit-config health.d/cpu.conf`. + +## Health entity reference + +The following reference contains information about the syntax and options of _health entities_, which Netdata attaches +to charts in order to trigger alarms. + +### Entity types + +There are two entity types: **alarms** and **templates**. They have the same format and feature set; the only difference +is their label. + +**Alarms** are attached to specific charts and use the `alarm` label. + +**Templates** define rules that apply to all charts of a specific context, and use the `template` label. 
Templates help +you apply one entity to all disks, all network interfaces, all MySQL databases, and so on. + +Alarms have higher precedence and will override templates. If an alarm and template entity have the same name and attach +to the same chart, Netdata will use the alarm. + +### Entity format + +Netdata parses the following lines. Beneath the table is an in-depth explanation of each line's purpose and syntax. + +- The `alarm` or `template` line must be the first line of any entity. +- The `on` line is **always required**. +- The `every` line is **required** if not using `lookup`. +- Each entity **must** have at least one of the following lines: `lookup`, `calc`, `warn`, or `crit`. +- A few lines use space-separated lists to define how the entity behaves. You can use `*` as a wildcard or prefix with + `!` for a negative match. Order is important, too! See our [simple patterns docs](/libnetdata/simple_pattern/README.md) for + more examples. +- Lines terminated by a `\` are spliced together with the next line. The backslash is removed and the following line is + joined with the current one. No space is inserted, so you may split a line anywhere, even in the middle of a word. + This comes in handy if your `info` line consists of several sentences. + +| line | required | functionality | +| --------------------------------------------------- | --------------- | ------------------------------------------------------------------------------------- | +| [`alarm`/`template`](#alarm-line-alarm-or-template) | yes | Name of the alarm/template. | +| [`on`](#alarm-line-on) | yes | The chart this alarm should attach to. | +| [`class`](#alarm-line-class) | no | The general alarm classification. | +| [`type`](#alarm-line-type) | no | What area of the system the alarm monitors. | +| [`component`](#alarm-line-component) | no | Specific component of the type of the alarm. | +| [`os`](#alarm-line-os) | no | Which operating systems to run this chart. 
| +| [`hosts`](#alarm-line-hosts) | no | Which hostnames will run this alarm. | +| [`plugin`](#alarm-line-plugin) | no | Restrict an alarm or template to only a certain plugin. | +| [`module`](#alarm-line-module) | no | Restrict an alarm or template to only a certain module. | +| [`charts`](#alarm-line-charts) | no | Restrict an alarm or template to only certain charts. | +| [`families`](#alarm-line-families) | no | Restrict a template to only certain families. | +| [`lookup`](#alarm-line-lookup) | yes | The database lookup to find and process metrics for the chart specified through `on`. | +| [`calc`](#alarm-line-calc) | yes (see above) | A calculation to apply to the value found via `lookup` or another variable. | +| [`every`](#alarm-line-every) | no | The frequency of the alarm. | +| [`green`/`red`](#alarm-lines-green-and-red) | no | Set the green and red thresholds of a chart. | +| [`warn`/`crit`](#alarm-lines-warn-and-crit) | yes (see above) | Expressions evaluating to true or false, and when true, will trigger the alarm. | +| [`to`](#alarm-line-to) | no | A list of roles to send notifications to. | +| [`exec`](#alarm-line-exec) | no | The script to execute when the alarm changes status. | +| [`delay`](#alarm-line-delay) | no | Optional hysteresis settings to prevent floods of notifications. | +| [`repeat`](#alarm-line-repeat) | no | The interval for sending notifications when an alarm is in WARNING or CRITICAL mode. | +| [`options`](#alarm-line-options) | no | Add an option to not clear alarms. | +| [`host labels`](#alarm-line-host-labels) | no | List of labels present on a host. | +| [`info`](#alarm-line-info) | no | A brief description of the alarm. | + +The `alarm` or `template` line must be the first line of any entity. + +#### Alarm line `alarm` or `template` + +This line starts an alarm or template based on the [entity type](#entity-types) you're interested in creating. 
+ +**Alarm:** + +```yaml +alarm: NAME +``` + +**Template:** + +```yaml +template: NAME +``` + +`NAME` can be any alpha character, with `.` (period) and `_` (underscore) as the only allowed symbols, but the names +cannot be `chart name`, `dimension name`, `family name`, or `chart variables names`. + +#### Alarm line `on` + +This line defines the chart this alarm should attach to. + +**Alarms:** + +```yaml +on: CHART +``` + +The value `CHART` should be the unique ID or name of the chart you're interested in, as shown on the dashboard. In the +image below, the unique ID is `system.cpu`. + +![Finding the unique ID of a +chart](https://user-images.githubusercontent.com/1153921/67443082-43b16e80-f5b8-11e9-8d33-d6ee052c6678.png) + +**Template:** + +```yaml +on: CONTEXT +``` + +The value `CONTEXT` should be the context you want this template to attach to. + +Need to find the context? Hover over the date on any given chart and look at the tooltip. In the image below, which +shows a disk I/O chart, the tooltip reads: `proc:/proc/diskstats, disk.io`. + +![Finding the context of a chart via the tooltip](https://user-images.githubusercontent.com/1153921/68882856-2b230880-06cd-11ea-923b-b28c4632d479.png) + +You're interested in what comes after the comma: `disk.io`. That's the name of the chart's context. + +If you create a template using the `disk.io` context, it will apply an alarm to every disk available on your system. + +#### Alarm line `class` + +This indicates the type of error (or general problem area) that the alarm or template applies to. For example, `Latency` can be used for alarms that trigger on latency issues on network interfaces, web servers, or database systems. 
Example: + +```yaml +class: Latency +``` + +<details> +<summary>Netdata's stock alarms use the following `class` attributes by default:</summary> + +| Class | +| ----------------| +| Errors | +| Latency | +| Utilization | +| Workload | + + +</details> + +`class` will default to `Unknown` if the line is missing from the alarm configuration. + +#### Alarm line `type` + +Type can be used to indicate the broader area of the system that the alarm applies to. For example, under the general `Database` type, you can group together alarms that operate on various database systems, like `MySQL`, `CockroachDB`, `CouchDB` etc. Example: + +```yaml +type: Database +``` +<details> +<summary>Netdata's stock alarms use the following `type` attributes by default, but feel free to adjust for your own requirements.</summary> + +| Type | Description | +| ------------------------ | ------------------------------------------------------------------------------------------------ | +| Ad Filtering | Services related to Ad Filtering (like pi-hole) | +| Certificates | Certificates monitoring related | +| Cgroups | Alerts for cpu and memory usage of control groups | +| Computing | Alerts for shared computing applications (e.g. boinc) | +| Containers | Container related alerts (e.g. docker instances) | +| Database | Database systems (e.g. MySQL, PostgreSQL, etc) | +| Data Sharing | Used to group together alerts for data sharing applications | +| DHCP | Alerts for dhcp related services | +| DNS | Alerts for dns related services | +| Kubernetes | Alerts for kubernetes nodes monitoring | +| KV Storage | Key-Value pairs services alerts (e.g. memcached) | +| Linux | Services specific to Linux (e.g. systemd) | +| Messaging | Alerts for message passing services (e.g. vernemq) | +| Netdata | Internal Netdata components monitoring | +| Other | When an alert doesn't fit in other types. | +| Power Supply | Alerts from power supply related services (e.g. 
apcupsd) | +| Search engine | Alerts for search services (e.g. elasticsearch) | +| Storage | Class for alerts dealing with storage services (storage devices typically live under `System`) | +| System | General system alarms (e.g. cpu, network, etc.) | +| Virtual Machine | Virtual Machine software | +| Web Proxy | Web proxy software (e.g. squid) | +| Web Server | Web server software (e.g. Apache, nginx, etc.) | +| Windows | Alerts for monitoring WMI services | + +</details> + +If an alarm configuration is missing the `type` line, its value will default to `Unknown`. + +#### Alarm line `component` + +Component can be used to narrow down what the previous `type` value specifies for each alarm or template. Continuing from the previous example, `component` might include `MySQL`, `CockroachDB`, `MongoDB`, all under the same `Database` type. Example: + +```yaml +component: MySQL +``` +As with the `class` and `type` lines, if `component` is missing from the configuration, its value will default to `Unknown`. + +#### Alarm line `os` + +The alarm or template will be used only if the operating system of the host matches the list specified in `os`. The +value is a space-separated list. + +The following example enables the entity on Linux, FreeBSD, and macOS, but no other operating systems. + +```yaml +os: linux freebsd macos +``` + +#### Alarm line `hosts` + +The alarm or template will be used only if the hostname of the host matches this space-separated list. + +The following example will load on systems with the hostnames `server1` and `server2`, and any system with hostnames that +begin with `database`. It _will not load_ on the host `redis3`, but will load on any _other_ systems with hostnames that +begin with `redis`. + +```yaml +hosts: server1 server2 database* !redis3 redis* +``` + +#### Alarm line `plugin` + +The `plugin` line filters which plugin within the context this alarm should apply to. 
The value is a space-separated +list of [simple patterns](/libnetdata/simple_pattern/README.md). For example, +you can create a filter for an alarm that applies specifically to `python.d.plugin`: + +```yaml +plugin: python.d.plugin +``` + +The `plugin` line is best used with other options like `module`. When used alone, the `plugin` line creates a very +inclusive filter that is unlikely to be of much use in production. See [`module`](#alarm-line-module) for a +comprehensive example using both. + +#### Alarm line `module` + +The `module` line filters which module within the context this alarm should apply to. The value is a space-separated +list of [simple patterns](/libnetdata/simple_pattern/README.md). For +example, you can create an alarm that applies only on the `isc_dhcpd` module started by `python.d.plugin`: + +```yaml +plugin: python.d.plugin +module: isc_dhcpd +``` + +#### Alarm line `charts` + +The `charts` line filters which chart this alarm should apply to. It is only available on entities using the +[`template`](#alarm-line-alarm-or-template) line. +The value is a space-separated list of [simple patterns](/libnetdata/simple_pattern/README.md). For +example, a template that applies to `disk.svctm` (Average Service Time) context, but excludes the disk `sdb` from alarms: + +```yaml +template: disk_svctm_alarm + on: disk.svctm + charts: !*sdb* * +``` + +#### Alarm line `families` + +The `families` line, used only alongside templates, filters which families within the context this alarm should apply +to. + +The value is a space-separated list of simple patterns. See our [simple patterns docs](/libnetdata/simple_pattern/README.md) for +some examples. + +For example, you can create a template on the `disk.io` context, but filter it to only the `sda` and `sdb` families: + +```yaml +families: sda sdb +``` + +#### Alarm line `lookup` + +This line makes a database lookup to find a value. 
The result of this lookup is available as `$this`. + +The format is: + +```yaml +lookup: METHOD AFTER [at BEFORE] [every DURATION] [OPTIONS] [of DIMENSIONS] [foreach DIMENSIONS] +``` + +The syntax is the same as for [badges](/web/api/badges/README.md). In short: + +- `METHOD` is one of `average`, `min`, `max`, `sum`, `incremental-sum`. + This is required. + +- `AFTER` is a relative number of seconds, but it also accepts a single letter for changing + the units, like `-1s` = 1 second in the past, `-1m` = 1 minute in the past, `-1h` = 1 hour + in the past, `-1d` = 1 day in the past. You need a negative number (i.e. how far in the past + to look for the value). **This is required**. + +- `at BEFORE` is by default 0 and is not required. Using this you can define the end of the + lookup. So data will be evaluated between `AFTER` and `BEFORE`. + +- `every DURATION` sets the update frequency of the lookup (supports single letter units as + above too). + +- `OPTIONS` is a space separated list of `percentage`, `absolute`, `min2max`, `unaligned`, + `match-ids`, `match-names`. Check the [badges](/web/api/badges/README.md) documentation for more info. + +- `of DIMENSIONS` is optional and has to be the last parameter. Dimensions have to be separated + by `,` or `|`. The space characters found in dimensions will be kept as-is (a few dimensions + have spaces in their names). This accepts Netdata simple patterns _(with `words` separated by + `,` or `|` instead of spaces)_ and the `match-ids` and `match-names` options affect the searches + for dimensions. + +- `foreach DIMENSIONS` is optional, will always be the last parameter, and uses the same `,`/`|` + rules as the `of` parameter. Each dimension you specify in `foreach` will use the same rule + to trigger an alarm. If you set both `of` and `foreach`, Netdata will ignore the `of` parameter + and replace it with one of the dimensions you gave to `foreach`. 
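The `lookup` format described above can be sketched with a concrete entity. This is an editorial illustration only and is not part of the files added by this commit: the alarm name `cpu_user_system_10min` and the warning threshold are hypothetical, and the sketch assumes the `system.cpu` chart exposes `user` and `system` dimensions.

```yaml
# Hypothetical entity, for illustration only (not a stock alarm).
# In the lookup line: METHOD = average, AFTER = -10m (the last 10 minutes),
# OPTIONS = unaligned, DIMENSIONS = user,system (comma-separated, placed last).
 alarm: cpu_user_system_10min
    on: system.cpu
lookup: average -10m unaligned of user,system
 units: %
 every: 1m
  warn: $this > 75
```

Here `$this` holds the value returned by the lookup, which is what the `warn` expression compares against the assumed threshold.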
+ +The result of the lookup will be available as `$this` and `$NAME` in expressions. +The timestamps of the timeframe evaluated by the database lookup are available as variables +`$after` and `$before` (both are unix timestamps). + +#### Alarm line `calc` + +A `calc` is designed to apply some calculation to the values or variables available to the entity. The result of the +calculation will be made available as the `$this` variable, overwriting the value from your `lookup`, to use in warning +and critical expressions. + +When paired with `lookup`, `calc` will perform the calculation just after `lookup` has retrieved a value from Netdata's +database. + +You can use `calc` without `lookup` if you are using [other available variables](#variables). + +The `calc` line uses [expressions](#expressions) for its syntax. + +```yaml +calc: EXPRESSION +``` + +#### Alarm line `every` + +Sets the update frequency of this alarm. This is the same as the `every DURATION` given +in the `lookup` line. + +Format: + +```yaml +every: DURATION +``` + +`DURATION` accepts `s` for seconds, `m` for minutes, `h` for hours, `d` for days. + +#### Alarm lines `green` and `red` + +Set the green and red thresholds of a chart. Both are available as `$green` and `$red` in expressions. If multiple +alarms define different thresholds, the ones defined by the first alarm will be used. These will eventually be visualized +on the dashboard, so only one set of them is allowed. If you need multiple sets of them in different alarms, use +absolute numbers instead of `$red` and `$green`. + +Format: + +```yaml +green: NUMBER +red: NUMBER +``` + +#### Alarm lines `warn` and `crit` + +Define the expression that triggers either a warning or critical alarm. These are optional, and should evaluate to +either true or false (or zero/non-zero). + +The format uses Netdata's [expressions syntax](#expressions). 
+ +```yaml +warn: EXPRESSION +crit: EXPRESSION +``` + +#### Alarm line `to` + +This will be the first parameter of the script to be executed when the alarm switches status. Its meaning is left up to +the `exec` script. + +The default `exec` script, `alarm-notify.sh`, uses this field as a space separated list of roles, which are then +consulted to find the exact recipients per notification method. + +Format: + +```yaml +to: ROLE1 ROLE2 ROLE3 ... +``` + +#### Alarm line `exec` + +The script that will be executed when the alarm changes status. + +Format: + +```yaml +exec: SCRIPT +``` + +The default `SCRIPT` is Netdata's `alarm-notify.sh`, which supports all the notifications methods Netdata supports, +including custom hooks. + +#### Alarm line `delay` + +This is used to provide optional hysteresis settings for the notifications, to defend against notification floods. These +settings do not affect the actual alarm - only the time the `exec` script is executed. + +Format: + +```yaml +delay: [[[up U] [down D] multiplier M] max X] +``` + +- `up U` defines the delay to be applied to a notification for an alarm that raised its status + (i.e. CLEAR to WARNING, CLEAR to CRITICAL, WARNING to CRITICAL). For example, `up 10s`, the + notification for this event will be sent 10 seconds after the actual event. This is used in + hope the alarm will get back to its previous state within the duration given. The default `U` + is zero. + +- `down D` defines the delay to be applied to a notification for an alarm that moves to lower + state (i.e. CRITICAL to WARNING, CRITICAL to CLEAR, WARNING to CLEAR). For example, `down 1m` + will delay the notification by 1 minute. This is used to prevent notifications for flapping + alarms. The default `D` is zero. + +- `multiplier M` multiplies `U` and `D` when an alarm changes state, while a notification is + delayed. The default multiplier is `1.0`. + +- `max X` defines the maximum absolute notification delay an alarm may get. 
The default `X` + is `max(U * M, D * M)` (i.e. the max duration of `U` or `D` multiplied once with `M`). + + Example: + + `delay: up 10s down 15m multiplier 2 max 1h` + + The time is `00:00:00` and the status of the alarm is CLEAR. + + | time of event | new status | delay | notification will be sent | why | + | ------------- | ---------- | --- | ------------------------- | --- | + | 00:00:01 | WARNING | `up 10s` | 00:00:11 | first state switch | + | 00:00:05 | CLEAR | `down 15m x2` | 00:30:05 | the alarm changes state while a notification is delayed, so it was multiplied | + | 00:00:06 | WARNING | `up 10s x2 x2` | 00:00:26 | multiplied twice | + | 00:00:07 | CLEAR | `down 15m x2 x2 x2` | 00:45:07 | multiplied 3 times. | + + So: + + - `U` and `D` are multiplied by `M` every time the alarm changes state (any state, not just + their matching one) and a delay is in place. + - All are reset to their defaults when the alarm switches state without a delay in place. + +#### Alarm line `repeat` + +Defines the interval between repeating notifications for the alarms in CRITICAL or WARNING mode. This will override the +default interval settings inherited from health settings in `netdata.conf`. The default settings for repeating +notifications are `default repeat warning = DURATION` and `default repeat critical = DURATION`, which can be found in +the health stock configuration. When one of these intervals is bigger than 0, Netdata will activate the repeat notification +for `CRITICAL`, `CLEAR` and `WARNING` messages. + +Format: + +```yaml +repeat: [off] [warning DURATION] [critical DURATION] +``` + +- `off`: Turns off the repeating feature for the current alarm. This is effective when the default repeat settings have + been enabled in health configuration. +- `warning DURATION`: Defines the interval when the alarm is in WARNING state. Use `0s` to turn off the repeating + notification for WARNING mode. +- `critical DURATION`: Defines the interval when the alarm is in CRITICAL state. 
Use `0s` to turn off the repeating + notification for CRITICAL mode. + +#### Alarm line `options` + +The only possible value for the `options` line is + +```yaml +options: no-clear-notification +``` + +For some alarms we need to compare two time-frames to detect anomalies. For example, `health.d/httpcheck.conf` has an +alarm template called `web_service_slow` that compares the average HTTP call response time over the last 3 minutes +to the average over the last hour. It triggers a warning alarm when the average of the last 3 minutes is twice +the average of the last hour. In such cases, it is easy to trigger the alarm, but difficult to tell when the alarm is +cleared. As time passes, the newest window moves into the older, so the average response time of the last hour will keep +increasing. Eventually, the comparison will find the averages in the two time-frames close enough to clear the alarm. +However, the issue was not resolved; it's just a matter of the newer data "polluting" the old. For such alarms, it's a +good idea to tell Netdata to not clear the notification, by using the `no-clear-notification` option. + +#### Alarm line `host labels` + +Defines the list of labels present on a host. See our [host labels guide](/docs/guides/using-host-labels.md) for +an explanation of host labels and how to implement them. + +For example, let's suppose that `netdata.conf` is configured with the following labels: + +```yaml +[host labels] + installed = 20191211 + room = server +``` + +And more labels in `netdata.conf` for workstations: + +```yaml +[host labels] + installed = 201705 + room = workstation +``` + +By defining labels inside of `netdata.conf`, you can now apply labels to alarms. For example, you can add the following +line to any alarms you'd like to apply to hosts that have the label `room = server`. + +```yaml +host labels: room = server +``` + +The `host labels` line is a space-separated list that accepts simple patterns.
For example, you can create an alarm +that will be applied to all hosts installed in the last decade with the following line: + +```yaml +host labels: installed = 201* +``` + +See our [simple patterns docs](/libnetdata/simple_pattern/README.md) for more examples. + +#### Alarm line `info` + +The info field can contain a small piece of text describing the alarm or template. This will be rendered in +notifications and UI elements whenever the specific alarm is in focus. An example for the `ram_available` alarm is: + +```yaml +info: percentage of estimated amount of RAM available for userspace processes, without causing swapping +``` + +info fields can contain special variables in their text that will be replaced during run-time to provide more specific +alert information. Current variables supported are: + +| variable | description | +| ---------| ----------- | +| $family | Will be replaced by the family instance for the alert (e.g. eth0) | +| $label: | Followed by a chart label name, this will replace the variable with the chart label's value | + +For example, an info field like the following: + +```yaml +info: average inbound utilization for the network interface $family over the last minute +``` + +Will be rendered on the alert acting on interface `eth0` as: + +```yaml +info: average inbound utilization for the network interface eth0 over the last minute +``` + +An alert acting on a chart that has a chart label named e.g. `target`, with a value of `https://netdata.cloud/`, +can be enriched as follows: + +```yaml +info: average ratio of HTTP responses with unexpected status over the last 5 minutes for the site $label:target +``` + +Will become: + +```yaml +info: average ratio of HTTP responses with unexpected status over the last 5 minutes for the site https://netdata.cloud/ +``` + +> Please note that variable names are case sensitive. + +## Expressions + +Netdata has an internal [infix expression parser](/libnetdata/eval). 
This parses expressions and creates an internal +structure that allows fast execution of them. + +These operators are supported `+`, `-`, `*`, `/`, `<`, `==`, `<=`, `<>`, `!=`, `>`, `>=`, `&&`, `||`, `!`, `AND`, `OR`, `NOT`. +Boolean operators result in either `1` (true) or `0` (false). + +The conditional evaluation operator `?` is supported too. Using this operator IF-THEN-ELSE conditional statements can be +specified. The format is: `(condition) ? (true expression) : (false expression)`. So, Netdata will first evaluate the +`condition` and based on the result will either evaluate `true expression` or `false expression`. + +Example: `($this > 0) ? ($avail * 2) : ($used / 2)`. + +Nested such expressions are also supported (i.e. `true expression` and `false expression` can contain conditional +evaluations). + +Expressions also support the `abs()` function. + +Expressions can have variables. Variables start with `$`. Check below for more information. + +There are two special values you can use: + +- `nan`, for example `$this != nan` will check if the variable `this` is available. A variable can be `nan` if the + database lookup failed. All calculations (i.e. addition, multiplication, etc) with a `nan` result in a `nan`. + +- `inf`, for example `$this != inf` will check if `this` is not infinite. A value or variable can be set to infinite + if divided by zero. All calculations (i.e. addition, multiplication, etc) with a `inf` result in a `inf`. + +### Special use of the conditional operator + +A common (but not necessarily obvious) use of the conditional evaluation operator is to provide +[hysteresis](https://en.wikipedia.org/wiki/Hysteresis) around the critical or warning thresholds. This usage helps to +avoid bogus messages resulting from small variations in the value when it is varying regularly but staying close to the +threshold value, without needing to delay sending messages at all. 
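
To see why status-dependent thresholds suppress flapping, here is a minimal Python sketch of the idea. It is a simplified model, not Netdata's evaluator: the function names, the numeric status codes (only their ordering matters) and the thresholds 75/85/95 are assumptions chosen for illustration.

```python
# Hypothetical status codes; only the ordering CLEAR < WARNING < CRITICAL matters.
CLEAR, WARNING, CRITICAL = 0, 1, 2


def next_status(value, status):
    """Evaluate one iteration, using the previous status to pick the thresholds."""
    # Once in WARNING (or above), a lower threshold keeps the alarm raised.
    warn = value > (75 if status >= WARNING else 85)
    # Once in CRITICAL, a lower threshold keeps the alarm critical.
    crit = value > (85 if status == CRITICAL else 95)
    if crit:
        return CRITICAL
    if warn:
        return WARNING
    return CLEAR


def run(values, status=CLEAR):
    """Feed a series of values through next_status and collect the statuses."""
    history = []
    for value in values:
        status = next_status(value, status)
        history.append(status)
    return history


# A value hovering around 85 raises WARNING once and holds it until it drops to 75 or lower.
print(run([80, 86, 80, 74]))
```

With a single fixed threshold of 85, the same series would flip between CLEAR and WARNING on every crossing. The status-dependent thresholds need no notification delay to avoid that.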
+ +An example of such usage from the default CPU usage alarms bundled with Netdata is: + +```yaml +warn: $this > (($status >= $WARNING) ? (75) : (85)) +crit: $this > (($status == $CRITICAL) ? (85) : (95)) +``` + +The above says: + +- If the alarm is currently a warning (or critical), then the threshold for being considered a warning is 75, otherwise it's 85. + +- If the alarm is currently critical, then the threshold for being considered critical is 85, otherwise it's 95. + +Which, in turn, results in the following behavior: + +- While the value is rising, it will trigger a warning when it exceeds 85, and a critical alert when it exceeds 95. + +- While the value is falling, it will return to a warning state when it goes below 85, and a normal state when it goes + below 75. + +- If the value is constantly varying between 80 and 90, then it will trigger a warning the first time it goes above + 85, but will remain a warning until it goes below 75 (or goes above 95). + +- If the value is constantly varying between 90 and 100, then it will trigger a critical alert the first time it goes + above 95, but will remain a critical alert until it goes below 85 (at which point it will return to being a warning). + +## Variables + +You can find all the variables that can be used for a given chart, using +`http://NODE:19999/api/v1/alarm_variables?chart=CHART_NAME`, replacing `NODE` with the IP address or hostname for your +Agent dashboard. For example, [variables for the `system.cpu` chart of the +registry](https://registry.my-netdata.io/api/v1/alarm_variables?chart=system.cpu). + +> If you don't know how to find the CHART_NAME, you can read about it [here](/web/README.md#charts). + +Netdata supports 3 internal indexes for variables that will be used in health monitoring.
+ +<details markdown="1"><summary>The variables below can be used in both chart alarms and context templates.</summary> + +Although the `alarm_variables` link shows you variables for a particular chart, the same variables can also be used in +templates for charts belonging to a given [context](/web/README.md#contexts). The reason is that all charts of a given +context are essentially identical, with the only difference being the [family](/web/README.md#families) that +identifies a particular hardware or software instance. Charts and templates do not apply to specific families anyway, +unless you explicitly limit an alarm with the [alarm line `families`](#alarm-line-families). + +</details> + +- **chart local variables**. All the dimensions of the chart are exposed as local variables. The value of `$this` for + the other configured alarms of the chart also appears, under the name of each configured alarm. + + Charts also define a few special variables: + + - `$last_collected_t` is the unix timestamp of the last data collection + - `$collected_total_raw` is the sum of all the dimensions (their last collected values) + - `$update_every` is the update frequency of the chart + - `$green` and `$red` are the thresholds defined in alarms (these are per chart - the chart + inherits them from the first alarm that defined them) + + Chart dimensions define their last calculated (i.e. interpolated) value, exactly as + shown on the charts, but also a variable with their name and suffix `_raw` that resolves + to the last collected value - as collected - and another with suffix `_last_collected_t` + that resolves to the unix timestamp at which the dimension was last collected (there may be dimensions + that fail to be collected while others continue normally). + +- **family variables**. Families are used to group charts together. For example, all `eth0` + charts have `family = eth0`. This index includes all local variables, but if there are + overlapping variables, only the first are exposed.
+ +- **host variables**. All the dimensions of all charts, including all alarms, in fullname. + Fullname is `CHART.VARIABLE`, where `CHART` is either the chart id or the chart name (both + are supported). + +- **special variables\*** are: + + - `$this`, which is resolved to the value of the current alarm. + + - `$status`, which is resolved to the current status of the alarm (the current = the last + status, i.e. before the current database lookup and the evaluation of the `calc` line). + This value can be compared with `$REMOVED`, `$UNINITIALIZED`, `$UNDEFINED`, `$CLEAR`, + `$WARNING`, `$CRITICAL`. These values are incremental, i.e. `$status > $CLEAR` works as + expected. + + - `$now`, which is resolved to current unix timestamp. + +## Alarm statuses + +Alarms can have the following statuses: + +- `REMOVED` - the alarm has been deleted (this happens when a SIGUSR2 is sent to Netdata + to reload health configuration) + +- `UNINITIALIZED` - the alarm is not initialized yet + +- `UNDEFINED` - the alarm failed to be calculated (i.e. the database lookup failed, + a division by zero occurred, etc) + +- `CLEAR` - the alarm is not armed / raised (i.e. is OK) + +- `WARNING` - the warning expression resulted in true or non-zero + +- `CRITICAL` - the critical expression resulted in true or non-zero + +The external script will be called for all status changes. + +## Example alarms + +Check the `health/health.d/` directory for all alarms shipped with Netdata. + +Here are a few examples: + +### Example 1 - check server alive + +A simple check if an apache server is alive: + +```yaml +template: apache_last_collected_secs + on: apache.requests + calc: $now - $last_collected_t + every: 10s + warn: $this > ( 5 * $update_every) + crit: $this > (10 * $update_every) +``` + +The above checks that Netdata is able to collect data from apache. In detail: + +```yaml +template: apache_last_collected_secs +``` + +The above defines a **template** named `apache_last_collected_secs`.
+The name is important since `$apache_last_collected_secs` resolves to the result of the `calc` line. +So, try to give something descriptive. + +```yaml + on: apache.requests +``` + +The above applies the **template** to all charts that have `context = apache.requests` +(i.e. all your apache servers). + +```yaml + calc: $now - $last_collected_t +``` + +- `$now` is a standard variable that resolves to the current timestamp. + +- `$last_collected_t` is the last data collection timestamp of the chart. + So this calculation gives the number of seconds passed since the last data collection. + +```yaml + every: 10s +``` + +The alarm will be evaluated every 10 seconds. + +```yaml + warn: $this > ( 5 * $update_every) + crit: $this > (10 * $update_every) +``` + +If these result in non-zero or true, they trigger the alarm. + +- `$this` refers to the value of this alarm (i.e. the result of the `calc` line). + We could also use `$apache_last_collected_secs`. + +`$update_every` is the update frequency of the chart, in seconds. + +So, the warning condition checks if we have not collected data from apache for 5 +iterations and the critical condition checks for 10 iterations. + +### Example 2 - disk space + +Check if any of the disks is critically low on disk space: + +```yaml +template: disk_full_percent + on: disk.space + calc: $used * 100 / ($avail + $used) + every: 1m + warn: $this > 80 + crit: $this > 95 + repeat: warning 120s critical 10s +``` + +`$used` and `$avail` are the `used` and `avail` chart dimensions as shown on the dashboard. + +So, the `calc` line finds the percentage of used space. `$this` resolves to this percentage. + +This is a repeating alarm and if the alarm becomes CRITICAL it repeats the notifications every 10 seconds. It also +repeats notifications every 2 minutes if the alarm goes into WARNING mode. + +### Example 3 - disk fill rate + +Predict if any disk will run out of space in the near future.
+ +We do this in 2 steps: + +Calculate the disk fill rate: + +```yaml + template: disk_fill_rate + on: disk.space + lookup: max -1s at -30m unaligned of avail + calc: ($this - $avail) / (30 * 60) + every: 15s +``` + +In the `calc` line: `$this` is the result of the `lookup` line (i.e. the free space 30 minutes +ago) and `$avail` is the current disk free space. So the `calc` line will either have a positive +number of GB/second if the disk is filling up, or a negative number of GB/second if the disk is +freeing up space. + +There are no `warn` or `crit` lines here. So, this template will just do the calculation and +nothing more. + +Predict the hours after which the disk will run out of space: + +```yaml + template: disk_full_after_hours + on: disk.space + calc: $avail / $disk_fill_rate / 3600 + every: 10s + warn: $this > 0 and $this < 48 + crit: $this > 0 and $this < 24 +``` + +The `calc` line estimates the time, in hours, until we run out of disk space. Of course, only +positive values are interesting for this check, so the warning and critical conditions check +for positive values that are below 48 and 24 hours respectively. + +Once this alarm triggers we will receive an email like this: + +![image](https://cloud.githubusercontent.com/assets/2662304/17839993/87872b32-6802-11e6-8e08-b2e4afef93bb.png) + +### Example 4 - dropped packets + +Check if any network interface is dropping packets: + +```yaml +template: 30min_packet_drops + on: net.drops + lookup: sum -30m unaligned absolute + every: 10s + crit: $this > 0 +``` + +The `lookup` line will calculate the sum of all dropped packets in the last 30 minutes. + +The `crit` line will issue a critical alarm if even a single packet has been dropped. + +Note that the drops chart does not exist if a network interface has never dropped a single packet. +When Netdata detects a dropped packet, it will add the chart and it will automatically attach this +alarm to it.
+ +### Example 5 - CPU usage + +Check if the user or system dimension is using more than 50% of CPU: + +```yaml + alarm: dim_template + on: system.cpu + os: linux +lookup: average -3s percentage foreach system,user + units: % + every: 10s + warn: $this > 50 + crit: $this > 80 +``` + +The `lookup` line will calculate the average CPU usage of the system and user dimensions in the last 3 seconds. Because the +`lookup` line uses `foreach`, Netdata will create two independent alarms called `dim_template_system` +and `dim_template_user` that will have all the other parameters shared among them. + +### Example 6 - CPU usage + +Check if any dimension is using more than 50% of CPU: + +```yaml + alarm: dim_template + on: system.cpu + os: linux +lookup: average -3s percentage foreach * + units: % + every: 10s + warn: $this > 50 + crit: $this > 80 +``` + +The `lookup` line will calculate the average CPU usage of each dimension in the last 3 seconds. In this case +Netdata will create alarms for all dimensions of the chart. + +### Example 7 - Z-Score based alarm + +Derive a "[Z Score](https://en.wikipedia.org/wiki/Standard_score)" based alarm on `user` dimension of the `system.cpu` chart: + +```yaml + alarm: cpu_user_mean + on: system.cpu +lookup: mean -60s of user + every: 10s + + alarm: cpu_user_stddev + on: system.cpu +lookup: stddev -60s of user + every: 10s + + alarm: cpu_user_zscore + on: system.cpu +lookup: mean -10s of user + calc: ($this - $cpu_user_mean) / $cpu_user_stddev + every: 10s + warn: $this < -2 or $this > 2 + crit: $this < -3 or $this > 3 +``` + +Since [`z = (x - mean) / stddev`](https://en.wikipedia.org/wiki/Standard_score) we create two input alarms, one for `mean` and one for `stddev` and then use them both as inputs in our final `cpu_user_zscore` alarm.
+ +### Example 8 - [Anomaly rate](https://learn.netdata.cloud/docs/agent/ml#anomaly-rate) based CPU dimensions alarm + +Warning if 5 minute rolling [anomaly rate](https://learn.netdata.cloud/docs/agent/ml#anomaly-rate) for any CPU dimension is above 5%, critical if it goes above 20%: + +```yaml +template: ml_5min_cpu_dims + on: system.cpu + os: linux + hosts: * + lookup: average -5m anomaly-bit foreach * + calc: $this + units: % + every: 30s + warn: $this > (($status >= $WARNING) ? (5) : (20)) + crit: $this > (($status == $CRITICAL) ? (20) : (100)) + info: rolling 5min anomaly rate for each system.cpu dimension +``` + +The `lookup` line will calculate the average anomaly rate of each `system.cpu` dimension over the last 5 minutes. In this case +Netdata will create alarms for all dimensions of the chart. + +### Example 9 - [Anomaly rate](https://learn.netdata.cloud/docs/agent/ml#anomaly-rate) based CPU chart alarm + +Warning if 5 minute rolling [anomaly rate](https://learn.netdata.cloud/docs/agent/ml#anomaly-rate) averaged across all CPU dimensions is above 5%, critical if it goes above 20%: + +```yaml +template: ml_5min_cpu_chart + on: system.cpu + os: linux + hosts: * + lookup: average -5m anomaly-bit of * + calc: $this + units: % + every: 30s + warn: $this > (($status >= $WARNING) ? (5) : (20)) + crit: $this > (($status == $CRITICAL) ? (20) : (100)) + info: rolling 5min anomaly rate for system.cpu chart +``` + +The `lookup` line will calculate the average anomaly rate across all `system.cpu` dimensions over the last 5 minutes. In this case +Netdata will create one alarm for the chart.
+ +### Example 10 - [Anomaly rate](https://learn.netdata.cloud/docs/agent/ml#anomaly-rate) based node level alarm + +Warning if 5 minute rolling [anomaly rate](https://learn.netdata.cloud/docs/agent/ml#anomaly-rate) averaged across all ML enabled dimensions is above 5%, critical if it goes above 20%: + +```yaml +template: ml_5min_node + on: anomaly_detection.anomaly_rate + os: linux + hosts: * + lookup: average -5m of anomaly_rate + calc: $this + units: % + every: 30s + warn: $this > (($status >= $WARNING) ? (5) : (20)) + crit: $this > (($status == $CRITICAL) ? (20) : (100)) + info: rolling 5min anomaly rate for all ML enabled dims +``` + +The `lookup` line will use the `anomaly_rate` dimension of the `anomaly_detection.anomaly_rate` ML chart to calculate the average [node level anomaly rate](https://learn.netdata.cloud/docs/agent/ml#node-anomaly-rate) over the last 5 minutes. + +## Troubleshooting + +You can compile Netdata with [debugging](/daemon/README.md#debugging) and then set in `netdata.conf`: + +```yaml +[global] + debug flags = 0x0000000000800000 +``` + +Then check your `/var/log/netdata/debug.log`. It will show you how it works. Important: this will generate a lot of +output in `debug.log`. + +You can find the context of charts by looking up the chart in either `http://NODE:19999/netdata.conf` or +`http://NODE:19999/api/v1/charts`, replacing `NODE` with the IP address or hostname for your Agent dashboard. + +You can find how Netdata interpreted the expressions by examining the alarm at +`http://NODE:19999/api/v1/alarms?all`. For each expression, Netdata will return the expression as given in its +config file, and the same expression with additional parentheses added to indicate the evaluation flow of the +expression. + +## Disabling health checks or silencing notifications at runtime + +It's currently not possible to schedule notifications from within the alarm template.
For those scenarios where you need +to temporarily disable notifications (for instance when running backups triggers a disk alert) you can disable or silence +notifications at runtime. The health checks can be controlled at runtime via the [health management +api](/web/api/health/README.md). + + diff --git a/health/health.c b/health/health.c new file mode 100644 index 0000000..3784e0f --- /dev/null +++ b/health/health.c @@ -0,0 +1,1581 @@ +// SPDX-License-Identifier: GPL-3.0-or-later + +#include "health.h" + +#define WORKER_HEALTH_JOB_RRD_LOCK 0 +#define WORKER_HEALTH_JOB_HOST_LOCK 1 +#define WORKER_HEALTH_JOB_DB_QUERY 2 +#define WORKER_HEALTH_JOB_CALC_EVAL 3 +#define WORKER_HEALTH_JOB_WARNING_EVAL 4 +#define WORKER_HEALTH_JOB_CRITICAL_EVAL 5 +#define WORKER_HEALTH_JOB_ALARM_LOG_ENTRY 6 +#define WORKER_HEALTH_JOB_ALARM_LOG_PROCESS 7 +#define WORKER_HEALTH_JOB_DELAYED_INIT_RRDSET 8 +#define WORKER_HEALTH_JOB_DELAYED_INIT_RRDDIM 9 + +#if WORKER_UTILIZATION_MAX_JOB_TYPES < 10 +#error WORKER_UTILIZATION_MAX_JOB_TYPES has to be at least 10 +#endif + +static bool prepare_command(BUFFER *wb, + const char *exec, + const char *recipient, + const char *registry_hostname, + uint32_t unique_id, + uint32_t alarm_id, + uint32_t alarm_event_id, + uint32_t when, + const char *alert_name, + const char *alert_chart_name, + const char *alert_family, + const char *new_status, + const char *old_status, + NETDATA_DOUBLE new_value, + NETDATA_DOUBLE old_value, + const char *alert_source, + uint32_t duration, + uint32_t non_clear_duration, + const char *alert_units, + const char *alert_info, + const char *new_value_string, + const char *old_value_string, + const char *source, + const char *error_msg, + int n_warn, + int n_crit, + const char *warn_alarms, + const char *crit_alarms, + const char *classification, + const char *edit_command, + const char *machine_guid) +{ + char buf[8192]; + size_t n = 8192 - 1; + + buffer_strcat(wb, "exec"); + + if (!sanitize_command_argument_string(buf, exec,
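
The arithmetic behind the Z-score alarm can be checked by hand with a small Python sketch. This is an illustration only: the function names and sample values are hypothetical, and the population standard deviation (`statistics.pstdev`) is an assumption, since this document does not say which variant Netdata's `stddev` lookup uses.

```python
from statistics import mean, pstdev


def zscore(recent, baseline):
    """z = (x - mean) / stddev, where x is the mean of the short recent window."""
    mu = mean(baseline)        # plays the role of $cpu_user_mean (60s window)
    sigma = pstdev(baseline)   # plays the role of $cpu_user_stddev (60s window)
    if sigma == 0:
        # A division by zero cannot produce a usable score.
        return float("nan")
    return (mean(recent) - mu) / sigma


def status(z):
    """Apply the warn/crit conditions from the cpu_user_zscore alarm."""
    if z != z:  # nan never equals itself
        return "UNDEFINED"
    if z < -3 or z > 3:
        return "CRITICAL"
    if z < -2 or z > 2:
        return "WARNING"
    return "CLEAR"


baseline = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical user CPU %, long window
recent = [9, 11]                      # hypothetical user CPU %, short window
z = zscore(recent, baseline)
print(z, status(z))  # prints: 2.5 WARNING
```

Here the baseline has mean 5 and standard deviation 2, so a recent average of 10 sits 2.5 standard deviations above normal. That is enough for a warning but not for a critical alert.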
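
The difference between the per-dimension alarm (`foreach *`, Example 8) and the chart-level alarm (`of *`, Example 9) can be sketched in Python. This follows the prose description only. How Netdata aggregates anomaly bits internally is not shown in this document, so the averaging below, the function names and the sample bits are all assumptions.

```python
def anomaly_rate(bits):
    """Percentage of samples in the window flagged as anomalous (bits are 0 or 1)."""
    return 100.0 * sum(bits) / len(bits)


def per_dimension_rates(window):
    """One rate per dimension, as in the `foreach *` template (one alarm each)."""
    return {dim: anomaly_rate(bits) for dim, bits in window.items()}


def chart_rate(window):
    """A single rate averaged across dimensions, as in the `of *` template."""
    rates = per_dimension_rates(window)
    return sum(rates.values()) / len(rates)


# Hypothetical anomaly bits for three system.cpu dimensions over ten samples.
window = {
    "user": [0, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    "system": [0] * 10,
    "iowait": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}

print(per_dimension_rates(window))
print(chart_rate(window))
```

In this sample, the `user` dimension alone reaches 20%, while the chart-level figure is diluted to 10% by the quieter dimensions. That dilution is the practical trade-off between the two templates.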
n)) + return false; + buffer_sprintf(wb, " '%s'", buf); + + if (!sanitize_command_argument_string(buf, recipient, n)) + return false; + buffer_sprintf(wb, " '%s'", buf); + + if (!sanitize_command_argument_string(buf, registry_hostname, n)) + return false; + buffer_sprintf(wb, " '%s'", buf); + + buffer_sprintf(wb, " '%u'", unique_id); + + buffer_sprintf(wb, " '%u'", alarm_id); + + buffer_sprintf(wb, " '%u'", alarm_event_id); + + buffer_sprintf(wb, " '%u'", when); + + if (!sanitize_command_argument_string(buf, alert_name, n)) + return false; + buffer_sprintf(wb, " '%s'", buf); + + if (!sanitize_command_argument_string(buf, alert_chart_name, n)) + return false; + buffer_sprintf(wb, " '%s'", buf); + + if (!sanitize_command_argument_string(buf, alert_family, n)) + return false; + buffer_sprintf(wb, " '%s'", buf); + + if (!sanitize_command_argument_string(buf, new_status, n)) + return false; + buffer_sprintf(wb, " '%s'", buf); + + if (!sanitize_command_argument_string(buf, old_status, n)) + return false; + buffer_sprintf(wb, " '%s'", buf); + + buffer_sprintf(wb, " '" NETDATA_DOUBLE_FORMAT_ZERO "'", new_value); + + buffer_sprintf(wb, " '" NETDATA_DOUBLE_FORMAT_ZERO "'", old_value); + + if (!sanitize_command_argument_string(buf, alert_source, n)) + return false; + buffer_sprintf(wb, " '%s'", buf); + + buffer_sprintf(wb, " '%u'", duration); + + buffer_sprintf(wb, " '%u'", non_clear_duration); + + if (!sanitize_command_argument_string(buf, alert_units, n)) + return false; + buffer_sprintf(wb, " '%s'", buf); + + if (!sanitize_command_argument_string(buf, alert_info, n)) + return false; + buffer_sprintf(wb, " '%s'", buf); + + if (!sanitize_command_argument_string(buf, new_value_string, n)) + return false; + buffer_sprintf(wb, " '%s'", buf); + + if (!sanitize_command_argument_string(buf, old_value_string, n)) + return false; + buffer_sprintf(wb, " '%s'", buf); + + if (!sanitize_command_argument_string(buf, source, n)) + return false; + buffer_sprintf(wb, " '%s'", buf); + + if 
(!sanitize_command_argument_string(buf, error_msg, n)) + return false; + buffer_sprintf(wb, " '%s'", buf); + + buffer_sprintf(wb, " '%d'", n_warn); + + buffer_sprintf(wb, " '%d'", n_crit); + + if (!sanitize_command_argument_string(buf, warn_alarms, n)) + return false; + buffer_sprintf(wb, " '%s'", buf); + + if (!sanitize_command_argument_string(buf, crit_alarms, n)) + return false; + buffer_sprintf(wb, " '%s'", buf); + + if (!sanitize_command_argument_string(buf, classification, n)) + return false; + buffer_sprintf(wb, " '%s'", buf); + + if (!sanitize_command_argument_string(buf, edit_command, n)) + return false; + buffer_sprintf(wb, " '%s'", buf); + + if (!sanitize_command_argument_string(buf, machine_guid, n)) + return false; + buffer_sprintf(wb, " '%s'", buf); + + return true; +} + +unsigned int default_health_enabled = 1; +char *silencers_filename; + +// the queue of executed alarm notifications that haven't been waited for yet +static __thread struct { + ALARM_ENTRY *head; // oldest + ALARM_ENTRY *tail; // latest +} alarm_notifications_in_progress = {NULL, NULL}; + +typedef struct active_alerts { + char *name; + time_t last_status_change; + RRDCALC_STATUS status; +} active_alerts_t; + +static inline void enqueue_alarm_notify_in_progress(ALARM_ENTRY *ae) +{ + ae->prev_in_progress = NULL; + ae->next_in_progress = NULL; + + if (NULL != alarm_notifications_in_progress.tail) { + ae->prev_in_progress = alarm_notifications_in_progress.tail; + alarm_notifications_in_progress.tail->next_in_progress = ae; + } + if (NULL == alarm_notifications_in_progress.head) { + alarm_notifications_in_progress.head = ae; + } + alarm_notifications_in_progress.tail = ae; + +} + +static inline void unlink_alarm_notify_in_progress(ALARM_ENTRY *ae) +{ + struct alarm_entry *prev = ae->prev_in_progress; + struct alarm_entry *next = ae->next_in_progress; + + if (NULL != prev) { + prev->next_in_progress = next; + } + if (NULL != next) { + next->prev_in_progress = prev; + } + if (ae == 
alarm_notifications_in_progress.head) { + alarm_notifications_in_progress.head = next; + } + if (ae == alarm_notifications_in_progress.tail) { + alarm_notifications_in_progress.tail = prev; + } +} +// ---------------------------------------------------------------------------- +// health initialization + +/** + * User Config directory + * + * Get the config directory for health and return it. + * + * @return a pointer to the user config directory + */ +inline char *health_user_config_dir(void) { + char buffer[FILENAME_MAX + 1]; + snprintfz(buffer, FILENAME_MAX, "%s/health.d", netdata_configured_user_config_dir); + return config_get(CONFIG_SECTION_DIRECTORIES, "health config", buffer); +} + +/** + * Stock Config Directory + * + * Get the Stock config directory and return it. + * + * @return a pointer to the stock config directory. + */ +inline char *health_stock_config_dir(void) { + char buffer[FILENAME_MAX + 1]; + snprintfz(buffer, FILENAME_MAX, "%s/health.d", netdata_configured_stock_config_dir); + return config_get(CONFIG_SECTION_DIRECTORIES, "stock health config", buffer); +} + +/** + * Silencers init + * + * Function used to initialize the silencer structure. + */ +static void health_silencers_init(void) { + FILE *fd = fopen(silencers_filename, "r"); + if (fd) { + fseek(fd, 0 , SEEK_END); + off_t length = (off_t) ftell(fd); + fseek(fd, 0 , SEEK_SET); + + if (length > 0 && length < HEALTH_SILENCERS_MAX_FILE_LEN) { + char *str = mallocz((length+1)* sizeof(char)); + if(str) { + size_t copied; + copied = fread(str, sizeof(char), length, fd); + if (copied == (length* sizeof(char))) { + str[length] = 0x00; + json_parse(str, NULL, health_silencers_json_read_callback); + info("Parsed health silencers file %s", silencers_filename); + } else { + error("Cannot read the data from health silencers file %s", silencers_filename); + } + freez(str); + } + } else { + error( + "Health silencers file %s has the size %" PRId64 " that is out of range[ 1 , %d ]. 
Aborting read.", + silencers_filename, + (int64_t)length, + HEALTH_SILENCERS_MAX_FILE_LEN); + } + fclose(fd); + } else { + info("Cannot open the file %s, so Netdata will work with the default health configuration.",silencers_filename); + } +} + +/** + * Health Init + * + * Initialize the health thread. + */ +void health_init(void) { + debug(D_HEALTH, "Health configuration initializing"); + + if(!(default_health_enabled = (unsigned int)config_get_boolean(CONFIG_SECTION_HEALTH, "enabled", default_health_enabled))) { + debug(D_HEALTH, "Health is disabled."); + return; + } + + health_silencers_init(); +} + +// ---------------------------------------------------------------------------- +// re-load health configuration + +/** + * Reload host + * + * Reload configuration for a specific host. + * + * @param host the structure of the host that the function will reload the configuration. + */ +static void health_reload_host(RRDHOST *host) { + if(unlikely(!host->health_enabled) && !rrdhost_flag_check(host, RRDHOST_FLAG_INITIALIZED_HEALTH)) + return; + + log_health("[%s]: Reloading health.", rrdhost_hostname(host)); + + char *user_path = health_user_config_dir(); + char *stock_path = health_stock_config_dir(); + + // free all running alarms + rrdcalc_delete_all(host); + rrdcalctemplate_delete_all(host); + + // invalidate all previous entries in the alarm log + netdata_rwlock_rdlock(&host->health_log.alarm_log_rwlock); + ALARM_ENTRY *t; + for(t = host->health_log.alarms ; t ; t = t->next) { + if(t->new_status != RRDCALC_STATUS_REMOVED) + t->flags |= HEALTH_ENTRY_FLAG_UPDATED; + } + netdata_rwlock_unlock(&host->health_log.alarm_log_rwlock); + + // reset all thresholds to all charts + RRDSET *st; + rrdset_foreach_read(st, host) { + st->green = NAN; + st->red = NAN; + } + rrdset_foreach_done(st); + + // load the new alarms + health_readdir(host, user_path, stock_path, NULL); + + //Discard alarms with labels that do not apply to host + 
rrdcalc_delete_alerts_not_matching_host_labels_from_this_host(host); + + // link the loaded alarms to their charts + rrdset_foreach_write(st, host) { + if (rrdset_flag_check(st, RRDSET_FLAG_ARCHIVED)) + continue; + + rrdcalc_link_matching_alerts_to_rrdset(st); + rrdcalctemplate_link_matching_templates_to_rrdset(st); + } + rrdset_foreach_done(st); + host->aclk_alert_reloaded = 1; +} + +/** + * Reload + * + * Reload the host configuration for all hosts. + */ +void health_reload(void) { + sql_refresh_hashes(); + + rrd_rdlock(); + + RRDHOST *host; + rrdhost_foreach_read(host) + health_reload_host(host); + + rrd_unlock(); +} + +// ---------------------------------------------------------------------------- +// health main thread and friends + +static inline RRDCALC_STATUS rrdcalc_value2status(NETDATA_DOUBLE n) { + if(isnan(n) || isinf(n)) return RRDCALC_STATUS_UNDEFINED; + if(n) return RRDCALC_STATUS_RAISED; + return RRDCALC_STATUS_CLEAR; +} + +#define ACTIVE_ALARMS_LIST_EXAMINE 500 +#define ACTIVE_ALARMS_LIST 15 + +static inline int compare_active_alerts(const void * a, const void * b) { + active_alerts_t *active_alerts_a = (active_alerts_t *)a; + active_alerts_t *active_alerts_b = (active_alerts_t *)b; + + return ( active_alerts_b->last_status_change - active_alerts_a->last_status_change ); +} + +static inline void health_alarm_execute(RRDHOST *host, ALARM_ENTRY *ae) { + ae->flags |= HEALTH_ENTRY_FLAG_PROCESSED; + + if(unlikely(ae->new_status < RRDCALC_STATUS_CLEAR)) { + // do not send notifications for internal statuses + debug(D_HEALTH, "Health not sending notification for alarm '%s.%s' status %s (internal statuses)", ae_chart_name(ae), ae_name(ae), rrdcalc_status2string(ae->new_status)); + goto done; + } + + if(unlikely(ae->new_status <= RRDCALC_STATUS_CLEAR && (ae->flags & HEALTH_ENTRY_FLAG_NO_CLEAR_NOTIFICATION))) { + // do not send notifications for disabled statuses + debug(D_HEALTH, "Health not sending notification for alarm '%s.%s' status %s (it has 
no-clear-notification enabled)", ae_chart_name(ae), ae_name(ae), rrdcalc_status2string(ae->new_status)); + log_health("[%s]: Health not sending notification for alarm '%s.%s' status %s (it has no-clear-notification enabled)", rrdhost_hostname(host), ae_chart_name(ae), ae_name(ae), rrdcalc_status2string(ae->new_status)); + // mark it as run, so that we will send the same alarm if it happens again + goto done; + } + + // find the previous notification for the same alarm + // which we have run the exec script + // exception: alarms with HEALTH_ENTRY_FLAG_NO_CLEAR_NOTIFICATION set + if(likely(!(ae->flags & HEALTH_ENTRY_FLAG_NO_CLEAR_NOTIFICATION))) { + uint32_t id = ae->alarm_id; + ALARM_ENTRY *t; + for(t = ae->next; t ; t = t->next) { + if(t->alarm_id == id && t->flags & HEALTH_ENTRY_FLAG_EXEC_RUN) + break; + } + + if(likely(t)) { + // we have executed this alarm notification in the past + if(t && t->new_status == ae->new_status) { + // don't send the notification for the same status again + debug(D_HEALTH, "Health not sending again notification for alarm '%s.%s' status %s", ae_chart_name(ae), ae_name(ae) + , rrdcalc_status2string(ae->new_status)); + log_health("[%s]: Health not sending again notification for alarm '%s.%s' status %s", rrdhost_hostname(host), ae_chart_name(ae), ae_name(ae) + , rrdcalc_status2string(ae->new_status)); + goto done; + } + } + else { + // we have not executed this alarm notification in the past + // so, don't send CLEAR notifications + if(unlikely(ae->new_status == RRDCALC_STATUS_CLEAR)) { + if((!(ae->flags & HEALTH_ENTRY_RUN_ONCE)) || (ae->flags & HEALTH_ENTRY_RUN_ONCE && ae->old_status < RRDCALC_STATUS_RAISED) ) { + debug(D_HEALTH, "Health not sending notification for first initialization of alarm '%s.%s' status %s" + , ae_chart_name(ae), ae_name(ae), rrdcalc_status2string(ae->new_status)); + goto done; + } + } + } + } + + // Check if alarm notifications are silenced + if (ae->flags & HEALTH_ENTRY_FLAG_SILENCED) { + log_health("[%s]: 
Health not sending notification for alarm '%s.%s' status %s (command API has disabled notifications)", rrdhost_hostname(host), ae_chart_name(ae), ae_name(ae), rrdcalc_status2string(ae->new_status)); + goto done; + } + + log_health("[%s]: Sending notification for alarm '%s.%s' status %s.", rrdhost_hostname(host), ae_chart_name(ae), ae_name(ae), rrdcalc_status2string(ae->new_status)); + + const char *exec = (ae->exec) ? ae_exec(ae) : string2str(host->health_default_exec); + const char *recipient = (ae->recipient) ? ae_recipient(ae) : string2str(host->health_default_recipient); + + int n_warn=0, n_crit=0; + RRDCALC *rc; + EVAL_EXPRESSION *expr=NULL; + BUFFER *warn_alarms, *crit_alarms; + active_alerts_t *active_alerts = callocz(ACTIVE_ALARMS_LIST_EXAMINE, sizeof(active_alerts_t)); + + warn_alarms = buffer_create(NETDATA_WEB_RESPONSE_INITIAL_SIZE); + crit_alarms = buffer_create(NETDATA_WEB_RESPONSE_INITIAL_SIZE); + + foreach_rrdcalc_in_rrdhost_read(host, rc) { + if(unlikely(!rc->rrdset || !rc->rrdset->last_collected_time.tv_sec)) + continue; + + if(unlikely((n_warn + n_crit) >= ACTIVE_ALARMS_LIST_EXAMINE)) + break; + + if (unlikely(rc->status == RRDCALC_STATUS_WARNING)) { + if (likely(ae->alarm_id != rc->id) || likely(ae->alarm_event_id != rc->next_event_id - 1)) { + active_alerts[n_warn+n_crit].name = (char *)rrdcalc_name(rc); + active_alerts[n_warn+n_crit].last_status_change = rc->last_status_change; + active_alerts[n_warn+n_crit].status = rc->status; + n_warn++; + } else if (ae->alarm_id == rc->id) + expr = rc->warning; + } else if (unlikely(rc->status == RRDCALC_STATUS_CRITICAL)) { + if (likely(ae->alarm_id != rc->id) || likely(ae->alarm_event_id != rc->next_event_id - 1)) { + active_alerts[n_warn+n_crit].name = (char *)rrdcalc_name(rc); + active_alerts[n_warn+n_crit].last_status_change = rc->last_status_change; + active_alerts[n_warn+n_crit].status = rc->status; + n_crit++; + } else if (ae->alarm_id == rc->id) + expr = rc->critical; + } else if 
(unlikely(rc->status == RRDCALC_STATUS_CLEAR)) { + if (ae->alarm_id == rc->id) + expr = rc->warning; + } + } + foreach_rrdcalc_in_rrdhost_done(rc); + + if (n_warn+n_crit>1) + qsort (active_alerts, n_warn+n_crit, sizeof(active_alerts_t), compare_active_alerts); + + int count_w = 0, count_c = 0; + while (count_w + count_c < n_warn + n_crit && count_w + count_c < ACTIVE_ALARMS_LIST) { + if (active_alerts[count_w+count_c].status == RRDCALC_STATUS_WARNING) { + if (count_w) + buffer_strcat(warn_alarms, ","); + buffer_strcat(warn_alarms, active_alerts[count_w+count_c].name); + buffer_strcat(warn_alarms, "="); + buffer_snprintf(warn_alarms, 11, "%"PRId64"", (int64_t)active_alerts[count_w+count_c].last_status_change); + count_w++; + } + else if (active_alerts[count_w+count_c].status == RRDCALC_STATUS_CRITICAL) { + if (count_c) + buffer_strcat(crit_alarms, ","); + buffer_strcat(crit_alarms, active_alerts[count_w+count_c].name); + buffer_strcat(crit_alarms, "="); + buffer_snprintf(crit_alarms, 11, "%"PRId64"", (int64_t)active_alerts[count_w+count_c].last_status_change); + count_c++; + } + } + + char *edit_command = ae->source ? 
health_edit_command_from_source(ae_source(ae)) : strdupz("UNKNOWN=0=UNKNOWN"); + + BUFFER *wb = buffer_create(8192); + bool ok = prepare_command(wb, + exec, + recipient, + rrdhost_registry_hostname(host), + ae->unique_id, + ae->alarm_id, + ae->alarm_event_id, + (unsigned long)ae->when, + ae_name(ae), + ae->chart?ae_chart_name(ae):"NOCHART", + ae->family?ae_family(ae):"NOFAMILY", + rrdcalc_status2string(ae->new_status), + rrdcalc_status2string(ae->old_status), + ae->new_value, + ae->old_value, + ae->source?ae_source(ae):"UNKNOWN", + (uint32_t)ae->duration, + (uint32_t)ae->non_clear_duration, + ae_units(ae), + ae_info(ae), + ae_new_value_string(ae), + ae_old_value_string(ae), + (expr && expr->source)?expr->source:"NOSOURCE", + (expr && expr->error_msg)?buffer_tostring(expr->error_msg):"NOERRMSG", + n_warn, + n_crit, + buffer_tostring(warn_alarms), + buffer_tostring(crit_alarms), + ae->classification?ae_classification(ae):"Unknown", + edit_command, + host != localhost ? host->machine_guid:""); + + const char *command_to_run = buffer_tostring(wb); + if (ok) { + ae->flags |= HEALTH_ENTRY_FLAG_EXEC_RUN; + ae->exec_run_timestamp = now_realtime_sec(); /* will be updated by real time after spawning */ + + debug(D_HEALTH, "executing command '%s'", command_to_run); + ae->flags |= HEALTH_ENTRY_FLAG_EXEC_IN_PROGRESS; + ae->exec_spawn_serial = spawn_enq_cmd(command_to_run); + enqueue_alarm_notify_in_progress(ae); + } else { + error("Failed to format command arguments"); + } + + buffer_free(wb); + freez(edit_command); + buffer_free(warn_alarms); + buffer_free(crit_alarms); + freez(active_alerts); + + return; //health_alarm_wait_for_execution +done: + health_alarm_log_save(host, ae); +} + +static inline void health_alarm_wait_for_execution(ALARM_ENTRY *ae) { + if (!(ae->flags & HEALTH_ENTRY_FLAG_EXEC_IN_PROGRESS)) + return; + + spawn_wait_cmd(ae->exec_spawn_serial, &ae->exec_code, &ae->exec_run_timestamp); + debug(D_HEALTH, "done executing command - returned with code %d", 
ae->exec_code); + ae->flags &= ~HEALTH_ENTRY_FLAG_EXEC_IN_PROGRESS; + + if(ae->exec_code != 0) + ae->flags |= HEALTH_ENTRY_FLAG_EXEC_FAILED; + + unlink_alarm_notify_in_progress(ae); +} + +static inline void health_process_notifications(RRDHOST *host, ALARM_ENTRY *ae) { + debug(D_HEALTH, "Health alarm '%s.%s' = " NETDATA_DOUBLE_FORMAT_AUTO " - changed status from %s to %s", + ae->chart?ae_chart_name(ae):"NOCHART", ae_name(ae), + ae->new_value, + rrdcalc_status2string(ae->old_status), + rrdcalc_status2string(ae->new_status) + ); + + health_alarm_execute(host, ae); +} + +static inline void health_alarm_log_process(RRDHOST *host) { + uint32_t first_waiting = (host->health_log.alarms)?host->health_log.alarms->unique_id:0; + time_t now = now_realtime_sec(); + + netdata_rwlock_rdlock(&host->health_log.alarm_log_rwlock); + + ALARM_ENTRY *ae; + for(ae = host->health_log.alarms; ae && ae->unique_id >= host->health_last_processed_id; ae = ae->next) { + if(likely(!(ae->flags & HEALTH_ENTRY_FLAG_IS_REPEATING))) { + if(unlikely( + !(ae->flags & HEALTH_ENTRY_FLAG_PROCESSED) && + !(ae->flags & HEALTH_ENTRY_FLAG_UPDATED) + )) { + if(unlikely(ae->unique_id < first_waiting)) + first_waiting = ae->unique_id; + + if(likely(now >= ae->delay_up_to_timestamp)) + health_process_notifications(host, ae); + } + } + } + + netdata_rwlock_unlock(&host->health_log.alarm_log_rwlock); + + // remember this for the next iteration + host->health_last_processed_id = first_waiting; + + bool cleanup_excess_log_entries = host->health_log.count > host->health_log.max; + + if (!cleanup_excess_log_entries) + return; + + // cleanup excess entries in the log + netdata_rwlock_wrlock(&host->health_log.alarm_log_rwlock); + + ALARM_ENTRY *last = NULL; + unsigned int count = host->health_log.max * 2 / 3; + for(ae = host->health_log.alarms; ae && count ; count--, last = ae, ae = ae->next) ; + + if(ae && last && last->next == ae) + last->next = NULL; + else + ae = NULL; + + while(ae) { + debug(D_HEALTH, "Health 
removing alarm log entry with id: %u", ae->unique_id); + + ALARM_ENTRY *t = ae->next; + + if(likely(!(ae->flags & HEALTH_ENTRY_FLAG_IS_REPEATING))) { + health_alarm_wait_for_execution(ae); + health_alarm_log_free_one_nochecks_nounlink(ae); + host->health_log.count--; + } + + ae = t; + } + + netdata_rwlock_unlock(&host->health_log.alarm_log_rwlock); +} + +static inline int rrdcalc_isrunnable(RRDCALC *rc, time_t now, time_t *next_run) { + if(unlikely(!rc->rrdset)) { + debug(D_HEALTH, "Health not running alarm '%s.%s'. It is not linked to a chart.", rrdcalc_chart_name(rc), rrdcalc_name(rc)); + return 0; + } + + if(unlikely(rc->next_update > now)) { + if (unlikely(*next_run > rc->next_update)) { + // update the next_run time of the main loop + // to run this alarm precisely the time required + *next_run = rc->next_update; + } + + debug(D_HEALTH, "Health not examining alarm '%s.%s' yet (will do in %d secs).", rrdcalc_chart_name(rc), rrdcalc_name(rc), (int) (rc->next_update - now)); + return 0; + } + + if(unlikely(!rc->update_every)) { + debug(D_HEALTH, "Health not running alarm '%s.%s'. It does not have an update frequency", rrdcalc_chart_name(rc), rrdcalc_name(rc)); + return 0; + } + + if(unlikely(rrdset_flag_check(rc->rrdset, RRDSET_FLAG_OBSOLETE))) { + debug(D_HEALTH, "Health not running alarm '%s.%s'. The chart has been marked as obsolete", rrdcalc_chart_name(rc), rrdcalc_name(rc)); + return 0; + } + + if(unlikely(rrdset_flag_check(rc->rrdset, RRDSET_FLAG_ARCHIVED))) { + debug(D_HEALTH, "Health not running alarm '%s.%s'. The chart has been marked as archived", rrdcalc_chart_name(rc), rrdcalc_name(rc)); + return 0; + } + + if(unlikely(!rc->rrdset->last_collected_time.tv_sec || rc->rrdset->counter_done < 2)) { + debug(D_HEALTH, "Health not running alarm '%s.%s'. 
Chart is not fully collected yet.", rrdcalc_chart_name(rc), rrdcalc_name(rc)); + return 0; + } + + int update_every = rc->rrdset->update_every; + time_t first = rrdset_first_entry_t(rc->rrdset); + time_t last = rrdset_last_entry_t(rc->rrdset); + + if(unlikely(now + update_every < first /* || now - update_every > last */)) { + debug(D_HEALTH + , "Health not examining alarm '%s.%s' yet (wanted time is out of bounds - we need %lu but got %lu - %lu)." + , rrdcalc_chart_name(rc), rrdcalc_name(rc), (unsigned long) now, (unsigned long) first + , (unsigned long) last); + return 0; + } + + if(RRDCALC_HAS_DB_LOOKUP(rc)) { + time_t needed = now + rc->before + rc->after; + + if(needed + update_every < first || needed - update_every > last) { + debug(D_HEALTH + , "Health not examining alarm '%s.%s' yet (not enough data yet - we need %lu but got %lu - %lu)." + , rrdcalc_chart_name(rc), rrdcalc_name(rc), (unsigned long) needed, (unsigned long) first + , (unsigned long) last); + return 0; + } + } + + return 1; +} + +static inline int check_if_resumed_from_suspension(void) { + static __thread usec_t last_realtime = 0, last_monotonic = 0; + usec_t realtime = now_realtime_usec(), monotonic = now_monotonic_usec(); + int ret = 0; + + // detect if monotonic and realtime have twice the difference + // in which case we assume the system was just waken from hibernation + + if(last_realtime && last_monotonic && realtime - last_realtime > 2 * (monotonic - last_monotonic)) + ret = 1; + + last_realtime = realtime; + last_monotonic = monotonic; + + return ret; +} + +static void health_thread_cleanup(void *ptr) { + worker_unregister(); + + struct health_state *h = ptr; + h->host->health_spawn = 0; + + netdata_thread_cancel(netdata_thread_self()); + log_health("[%s]: Health thread ended.", rrdhost_hostname(h->host)); + debug(D_HEALTH, "HEALTH %s: Health thread ended.", rrdhost_hostname(h->host)); +} + +static void initialize_health(RRDHOST *host, int is_localhost) { + if(!host->health_enabled || 
rrdhost_flag_check(host, RRDHOST_FLAG_INITIALIZED_HEALTH)) return; + rrdhost_flag_set(host, RRDHOST_FLAG_INITIALIZED_HEALTH); + + log_health("[%s]: Initializing health.", rrdhost_hostname(host)); + + host->health_default_warn_repeat_every = config_get_duration(CONFIG_SECTION_HEALTH, "default repeat warning", "never"); + host->health_default_crit_repeat_every = config_get_duration(CONFIG_SECTION_HEALTH, "default repeat critical", "never"); + + host->health_log.next_log_id = 1; + host->health_log.next_alarm_id = 1; + host->health_log.max = 1000; + host->health_log.next_log_id = (uint32_t)now_realtime_sec(); + host->health_log.next_alarm_id = 0; + + long n = config_get_number(CONFIG_SECTION_HEALTH, "in memory max health log entries", host->health_log.max); + if(n < 10) { + error("Host '%s': health configuration has invalid max log entries %ld. Using default %u", rrdhost_hostname(host), n, host->health_log.max); + config_set_number(CONFIG_SECTION_HEALTH, "in memory max health log entries", (long)host->health_log.max); + } + else + host->health_log.max = (unsigned int)n; + + netdata_rwlock_init(&host->health_log.alarm_log_rwlock); + + char filename[FILENAME_MAX + 1]; + + if(!is_localhost) { + int r = mkdir(host->varlib_dir, 0775); + if (r != 0 && errno != EEXIST) + error("Host '%s': cannot create directory '%s'", rrdhost_hostname(host), host->varlib_dir); + } + + { + snprintfz(filename, FILENAME_MAX, "%s/health", host->varlib_dir); + int r = mkdir(filename, 0775); + if(r != 0 && errno != EEXIST) + error("Host '%s': cannot create directory '%s'", rrdhost_hostname(host), filename); + } + snprintfz(filename, FILENAME_MAX, "%s/health/health-log.db", host->varlib_dir); + host->health_log_filename = strdupz(filename); + + snprintfz(filename, FILENAME_MAX, "%s/alarm-notify.sh", netdata_configured_primary_plugins_dir); + host->health_default_exec = string_strdupz(config_get(CONFIG_SECTION_HEALTH, "script to execute on alarm", filename)); + host->health_default_recipient = 
string_strdupz("root"); + + if (!file_is_migrated(host->health_log_filename)) { + int rc = sql_create_health_log_table(host); + if (unlikely(rc)) { + log_health("[%s]: Failed to create health log table in the database", rrdhost_hostname(host)); + health_alarm_log_load(host); + health_alarm_log_open(host); + } + else { + health_alarm_log_load(host); + add_migrated_file(host->health_log_filename, 0); + } + } else { + // TODO: This needs to go to the metadata thread + // Health should wait before accessing the table (needs to be created by the metadata thread) + sql_create_health_log_table(host); + sql_health_alarm_log_load(host); + } + + // ------------------------------------------------------------------------ + // load health configuration + + health_readdir(host, health_user_config_dir(), health_stock_config_dir(), NULL); + + // link the loaded alarms to their charts + RRDSET *st; + rrdset_foreach_write(st, host) { + if (rrdset_flag_check(st, RRDSET_FLAG_ARCHIVED)) + continue; + + rrdcalc_link_matching_alerts_to_rrdset(st); + rrdcalctemplate_link_matching_templates_to_rrdset(st); + } + rrdset_foreach_done(st); + + //Discard alarms with labels that do not apply to host + rrdcalc_delete_alerts_not_matching_host_labels_from_this_host(host); + + health_silencers_init(); +} + +static void health_sleep(time_t next_run, unsigned int loop __maybe_unused, RRDHOST *host) { + time_t now = now_realtime_sec(); + if(now < next_run) { + worker_is_idle(); + debug(D_HEALTH, "Health monitoring iteration no %u done. Next iteration in %d secs", loop, (int) (next_run - now)); + while (now < next_run && host->health_enabled && !netdata_exit) { + sleep_usec(USEC_PER_SEC); + now = now_realtime_sec(); + } + } + else { + debug(D_HEALTH, "Health monitoring iteration no %u done. Next iteration now", loop); + } +} + +static SILENCE_TYPE check_silenced(RRDCALC *rc, const char *host, SILENCERS *silencers) { + SILENCER *s; + debug(D_HEALTH, "Checking if alarm was silenced via the command API. 
Alarm info name:%s context:%s chart:%s host:%s family:%s", + rrdcalc_name(rc), (rc->rrdset)?rrdset_context(rc->rrdset):"", rrdcalc_chart_name(rc), host, (rc->rrdset)?rrdset_family(rc->rrdset):""); + + for (s = silencers->silencers; s!=NULL; s=s->next){ + if ( + (!s->alarms_pattern || (rc->name && s->alarms_pattern && simple_pattern_matches(s->alarms_pattern, rrdcalc_name(rc)))) && + (!s->contexts_pattern || (rc->rrdset && rc->rrdset->context && s->contexts_pattern && simple_pattern_matches(s->contexts_pattern, rrdset_context(rc->rrdset)))) && + (!s->hosts_pattern || (host && s->hosts_pattern && simple_pattern_matches(s->hosts_pattern,host))) && + (!s->charts_pattern || (rc->chart && s->charts_pattern && simple_pattern_matches(s->charts_pattern, rrdcalc_chart_name(rc)))) && + (!s->families_pattern || (rc->rrdset && rc->rrdset->family && s->families_pattern && simple_pattern_matches(s->families_pattern, rrdset_family(rc->rrdset)))) + ) { + debug(D_HEALTH, "Alarm matches command API silence entry %s:%s:%s:%s:%s", s->alarms,s->charts, s->contexts, s->hosts, s->families); + if (unlikely(silencers->stype == STYPE_NONE)) { + debug(D_HEALTH, "Alarm %s matched a silence entry, but no SILENCE or DISABLE command was issued via the command API. The match has no effect.", rrdcalc_name(rc)); + } else { + debug(D_HEALTH, "Alarm %s via the command API - name:%s context:%s chart:%s host:%s family:%s" + , (silencers->stype == STYPE_DISABLE_ALARMS)?"Disabled":"Silenced" + , rrdcalc_name(rc) + , (rc->rrdset)?rrdset_context(rc->rrdset):"" + , rrdcalc_chart_name(rc) + , host + , (rc->rrdset)?rrdset_family(rc->rrdset):"" + ); + } + return silencers->stype; + } + } + return STYPE_NONE; +} + +/** + * Update Disabled Silenced + * + * Update the variable rrdcalc_flags of the structure RRDCALC according with the values of the host structure + * + * @param host structure that contains information about the host monitored. 
+ * @param rc structure with information about the alarm + * + * @return It returns 1 case rrdcalc_flags is DISABLED or 0 otherwise + */ +static int update_disabled_silenced(RRDHOST *host, RRDCALC *rc) { + uint32_t rrdcalc_flags_old = rc->run_flags; + // Clear the flags + rc->run_flags &= ~(RRDCALC_FLAG_DISABLED | RRDCALC_FLAG_SILENCED); + if (unlikely(silencers->all_alarms)) { + if (silencers->stype == STYPE_DISABLE_ALARMS) rc->run_flags |= RRDCALC_FLAG_DISABLED; + else if (silencers->stype == STYPE_SILENCE_NOTIFICATIONS) rc->run_flags |= RRDCALC_FLAG_SILENCED; + } else { + SILENCE_TYPE st = check_silenced(rc, rrdhost_hostname(host), silencers); + if (st == STYPE_DISABLE_ALARMS) rc->run_flags |= RRDCALC_FLAG_DISABLED; + else if (st == STYPE_SILENCE_NOTIFICATIONS) rc->run_flags |= RRDCALC_FLAG_SILENCED; + } + + if (rrdcalc_flags_old != rc->run_flags) { + info("Alarm silencing changed for host '%s' alarm '%s': Disabled %s->%s Silenced %s->%s", + rrdhost_hostname(host), + rrdcalc_name(rc), + (rrdcalc_flags_old & RRDCALC_FLAG_DISABLED)?"true":"false", + (rc->run_flags & RRDCALC_FLAG_DISABLED)?"true":"false", + (rrdcalc_flags_old & RRDCALC_FLAG_SILENCED)?"true":"false", + (rc->run_flags & RRDCALC_FLAG_SILENCED)?"true":"false" + ); + } + if (rc->run_flags & RRDCALC_FLAG_DISABLED) + return 1; + else + return 0; +} + +static void health_execute_delayed_initializations(RRDHOST *host) { + RRDSET *st; + + if (!rrdhost_flag_check(host, RRDHOST_FLAG_PENDING_HEALTH_INITIALIZATION)) return; + rrdhost_flag_clear(host, RRDHOST_FLAG_PENDING_HEALTH_INITIALIZATION); + + rrdset_foreach_reentrant(st, host) { + if(!rrdset_flag_check(st, RRDSET_FLAG_PENDING_HEALTH_INITIALIZATION)) continue; + rrdset_flag_clear(st, RRDSET_FLAG_PENDING_HEALTH_INITIALIZATION); + + worker_is_busy(WORKER_HEALTH_JOB_DELAYED_INIT_RRDSET); + + if(!st->rrdfamily) + st->rrdfamily = rrdfamily_add_and_acquire(host, rrdset_family(st)); + + if(!st->rrdvars) + st->rrdvars = rrdvariables_create(); + + 
rrddimvar_index_init(st); + + rrdsetvar_add_and_leave_released(st, "last_collected_t", RRDVAR_TYPE_TIME_T, &st->last_collected_time.tv_sec, RRDVAR_FLAG_NONE); + rrdsetvar_add_and_leave_released(st, "green", RRDVAR_TYPE_CALCULATED, &st->green, RRDVAR_FLAG_NONE); + rrdsetvar_add_and_leave_released(st, "red", RRDVAR_TYPE_CALCULATED, &st->red, RRDVAR_FLAG_NONE); + rrdsetvar_add_and_leave_released(st, "update_every", RRDVAR_TYPE_INT, &st->update_every, RRDVAR_FLAG_NONE); + + rrdcalc_link_matching_alerts_to_rrdset(st); + rrdcalctemplate_link_matching_templates_to_rrdset(st); + + RRDDIM *rd; + rrddim_foreach_read(rd, st) { + if(!rrddim_flag_check(rd, RRDDIM_FLAG_PENDING_HEALTH_INITIALIZATION)) continue; + rrddim_flag_clear(rd, RRDDIM_FLAG_PENDING_HEALTH_INITIALIZATION); + + worker_is_busy(WORKER_HEALTH_JOB_DELAYED_INIT_RRDDIM); + + rrddimvar_add_and_leave_released(rd, RRDVAR_TYPE_CALCULATED, NULL, NULL, &rd->last_stored_value, RRDVAR_FLAG_NONE); + rrddimvar_add_and_leave_released(rd, RRDVAR_TYPE_COLLECTED, NULL, "_raw", &rd->last_collected_value, RRDVAR_FLAG_NONE); + rrddimvar_add_and_leave_released(rd, RRDVAR_TYPE_TIME_T, NULL, "_last_collected_t", &rd->last_collected_time.tv_sec, RRDVAR_FLAG_NONE); + + RRDCALCTEMPLATE *rt; + foreach_rrdcalctemplate_read(host, rt) { + if(!rt->foreach_dimension_pattern) + continue; + + if(rrdcalctemplate_check_rrdset_conditions(rt, st, host)) + rrdcalctemplate_check_rrddim_conditions_and_link(rt, st, rd, host); + } + foreach_rrdcalctemplate_done(rt); + } + rrddim_foreach_done(rd); + } + rrdset_foreach_done(st); +} + +/** + * Health Main + * + * The main thread of the health system. In this function all the alarms will be processed. + * + * @param ptr is a pointer to the netdata_static_thread structure. 
+ * + * @return It always returns NULL + */ + +void *health_main(void *ptr) { + worker_register("HEALTH"); + worker_register_job_name(WORKER_HEALTH_JOB_RRD_LOCK, "rrd lock"); + worker_register_job_name(WORKER_HEALTH_JOB_HOST_LOCK, "host lock"); + worker_register_job_name(WORKER_HEALTH_JOB_DB_QUERY, "db lookup"); + worker_register_job_name(WORKER_HEALTH_JOB_CALC_EVAL, "calc eval"); + worker_register_job_name(WORKER_HEALTH_JOB_WARNING_EVAL, "warning eval"); + worker_register_job_name(WORKER_HEALTH_JOB_CRITICAL_EVAL, "critical eval"); + worker_register_job_name(WORKER_HEALTH_JOB_ALARM_LOG_ENTRY, "alarm log entry"); + worker_register_job_name(WORKER_HEALTH_JOB_ALARM_LOG_PROCESS, "alarm log process"); + worker_register_job_name(WORKER_HEALTH_JOB_DELAYED_INIT_RRDSET, "rrdset init"); + worker_register_job_name(WORKER_HEALTH_JOB_DELAYED_INIT_RRDDIM, "rrddim init"); + + struct health_state *h = ptr; + netdata_thread_cleanup_push(health_thread_cleanup, ptr); + + RRDHOST *host = h->host; + initialize_health(host, host == localhost); + + int min_run_every = (int)config_get_number(CONFIG_SECTION_HEALTH, "run at least every seconds", 10); + if(min_run_every < 1) min_run_every = 1; + + int cleanup_sql_every_loop = 7200 / min_run_every; + + time_t now = now_realtime_sec(); + time_t hibernation_delay = config_get_number(CONFIG_SECTION_HEALTH, "postpone alarms during hibernation for seconds", 60); + + bool health_running_logged = false; + + rrdhost_rdlock(host); //CHECK + rrdcalc_delete_alerts_not_matching_host_labels_from_this_host(host); + rrdhost_unlock(host); + + unsigned int loop = 0; +#ifdef ENABLE_ACLK + unsigned int marked_aclk_reload_loop = 0; +#endif + while(!netdata_exit && host->health_enabled) { + loop++; + debug(D_HEALTH, "Health monitoring iteration no %u started", loop); + + now = now_realtime_sec(); + int runnable = 0, apply_hibernation_delay = 0; + time_t next_run = now + min_run_every; + RRDCALC *rc; + + if (unlikely(check_if_resumed_from_suspension())) { + 
apply_hibernation_delay = 1; + + log_health( + "[%s]: Postponing alarm checks for %"PRId64" seconds, " + "because it seems that the system was just resumed from suspension.", + rrdhost_hostname(host), + (int64_t)hibernation_delay); + } + + if (unlikely(silencers->all_alarms && silencers->stype == STYPE_DISABLE_ALARMS)) { + static __thread int logged=0; + if (!logged) { + log_health("[%s]: Skipping health checks, because all alarms are disabled via a %s command.", + rrdhost_hostname(host), + HEALTH_CMDAPI_CMD_DISABLEALL); + logged = 1; + } + } + +#ifdef ENABLE_ACLK + if (host->aclk_alert_reloaded && !marked_aclk_reload_loop) + marked_aclk_reload_loop = loop; +#endif + + if (unlikely(apply_hibernation_delay)) { + log_health( + "[%s]: Postponing health checks for %"PRId64" seconds.", + rrdhost_hostname(host), + (int64_t)hibernation_delay); + + host->health_delay_up_to = now + hibernation_delay; + next_run = now + hibernation_delay; + health_sleep(next_run, loop, host); + } + + if (unlikely(host->health_delay_up_to)) { + if (unlikely(now < host->health_delay_up_to)) { + next_run = host->health_delay_up_to; + health_sleep(next_run, loop, host); + continue; + } + + log_health("[%s]: Resuming health checks after delay.", rrdhost_hostname(host)); + host->health_delay_up_to = 0; + } + + // wait until cleanup of obsolete charts on children is complete + if (host != localhost) { + if (unlikely(host->trigger_chart_obsoletion_check == 1)) { + log_health("[%s]: Waiting for chart obsoletion check.", rrdhost_hostname(host)); + health_sleep(next_run, loop, host); + continue; + } + } + + if (!health_running_logged) { + log_health("[%s]: Health is running.", rrdhost_hostname(host)); + health_running_logged = true; + } + + if(likely(!host->health_log_fp) && (loop == 1 || loop % cleanup_sql_every_loop == 0)) + sql_health_alarm_log_cleanup(host); + + health_execute_delayed_initializations(host); + + worker_is_busy(WORKER_HEALTH_JOB_HOST_LOCK); + + // the first loop is to lookup values 
from the db + foreach_rrdcalc_in_rrdhost_read(host, rc) { + + rrdcalc_update_info_using_rrdset_labels(rc); + + if (update_disabled_silenced(host, rc)) + continue; + + // create an alert removed event if the chart is obsolete and + // has stopped being collected for 60 seconds + if (unlikely(rc->rrdset && rc->status != RRDCALC_STATUS_REMOVED && + rrdset_flag_check(rc->rrdset, RRDSET_FLAG_OBSOLETE) && + now > (rc->rrdset->last_collected_time.tv_sec + 60))) { + if (!rrdcalc_isrepeating(rc)) { + worker_is_busy(WORKER_HEALTH_JOB_ALARM_LOG_ENTRY); + time_t now = now_realtime_sec(); + + ALARM_ENTRY *ae = health_create_alarm_entry( + host, + rc->id, + rc->next_event_id++, + rc->config_hash_id, + now, + rc->name, + rc->rrdset->id, + rc->rrdset->context, + rc->rrdset->family, + rc->classification, + rc->component, + rc->type, + rc->exec, + rc->recipient, + now - rc->last_status_change, + rc->value, + NAN, + rc->status, + RRDCALC_STATUS_REMOVED, + rc->source, + rc->units, + rc->info, + 0, + rrdcalc_isrepeating(rc)?HEALTH_ENTRY_FLAG_IS_REPEATING:0); + + if (ae) { + health_alarm_log_add_entry(host, ae); + rc->old_status = rc->status; + rc->status = RRDCALC_STATUS_REMOVED; + rc->last_status_change = now; + rc->last_updated = now; + rc->value = NAN; + +#ifdef ENABLE_ACLK + if (netdata_cloud_setting && likely(!host->aclk_alert_reloaded)) + sql_queue_alarm_to_aclk(host, ae, 1); +#endif + } + } + } + + if (unlikely(!rrdcalc_isrunnable(rc, now, &next_run))) { + if (unlikely(rc->run_flags & RRDCALC_FLAG_RUNNABLE)) + rc->run_flags &= ~RRDCALC_FLAG_RUNNABLE; + continue; + } + + runnable++; + rc->old_value = rc->value; + rc->run_flags |= RRDCALC_FLAG_RUNNABLE; + + // ------------------------------------------------------------ + // if there is database lookup, do it + + if (unlikely(RRDCALC_HAS_DB_LOOKUP(rc))) { + worker_is_busy(WORKER_HEALTH_JOB_DB_QUERY); + + /* time_t old_db_timestamp = rc->db_before; */ + int value_is_null = 0; + + int ret = rrdset2value_api_v1(rc->rrdset, NULL, 
&rc->value, rrdcalc_dimensions(rc), 1, + rc->after, rc->before, rc->group, NULL, + 0, rc->options, + &rc->db_after,&rc->db_before, + NULL, NULL, NULL, + &value_is_null, NULL, 0, 0, + QUERY_SOURCE_HEALTH); + + if (unlikely(ret != 200)) { + // database lookup failed + rc->value = NAN; + rc->run_flags |= RRDCALC_FLAG_DB_ERROR; + + debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': database lookup returned error %d", + rrdhost_hostname(host), rrdcalc_chart_name(rc), rrdcalc_name(rc), ret + ); + } else + rc->run_flags &= ~RRDCALC_FLAG_DB_ERROR; + + /* - RRDCALC_FLAG_DB_STALE not currently used + if (unlikely(old_db_timestamp == rc->db_before)) { + // database is stale + + debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': database is stale", host->hostname, rc->chart?rc->chart:"NOCHART", rc->name); + + if (unlikely(!(rc->rrdcalc_flags & RRDCALC_FLAG_DB_STALE))) { + rc->rrdcalc_flags |= RRDCALC_FLAG_DB_STALE; + error("Health on host '%s', alarm '%s.%s': database is stale", host->hostname, rc->chart?rc->chart:"NOCHART", rc->name); + } + } + else if (unlikely(rc->rrdcalc_flags & RRDCALC_FLAG_DB_STALE)) + rc->rrdcalc_flags &= ~RRDCALC_FLAG_DB_STALE; + */ + + if (unlikely(value_is_null)) { + // collected value is null + rc->value = NAN; + rc->run_flags |= RRDCALC_FLAG_DB_NAN; + + debug(D_HEALTH, + "Health on host '%s', alarm '%s.%s': database lookup returned empty value (possibly value is not collected yet)", + rrdhost_hostname(host), rrdcalc_chart_name(rc), rrdcalc_name(rc) + ); + } else + rc->run_flags &= ~RRDCALC_FLAG_DB_NAN; + + debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': database lookup gave value " NETDATA_DOUBLE_FORMAT, + rrdhost_hostname(host), rrdcalc_chart_name(rc), rrdcalc_name(rc), rc->value + ); + } + + // ------------------------------------------------------------ + // if there is calculation expression, run it + + if (unlikely(rc->calculation)) { + worker_is_busy(WORKER_HEALTH_JOB_CALC_EVAL); + + if 
(unlikely(!expression_evaluate(rc->calculation))) { + // calculation failed + rc->value = NAN; + rc->run_flags |= RRDCALC_FLAG_CALC_ERROR; + + debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': expression '%s' failed: %s", + rrdhost_hostname(host), rrdcalc_chart_name(rc), rrdcalc_name(rc), + rc->calculation->parsed_as, buffer_tostring(rc->calculation->error_msg) + ); + } else { + rc->run_flags &= ~RRDCALC_FLAG_CALC_ERROR; + + debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': expression '%s' gave value " + NETDATA_DOUBLE_FORMAT + ": %s (source: %s)", rrdhost_hostname(host), rrdcalc_chart_name(rc), rrdcalc_name(rc), + rc->calculation->parsed_as, rc->calculation->result, + buffer_tostring(rc->calculation->error_msg), rrdcalc_source(rc) + ); + + rc->value = rc->calculation->result; + } + } + } + foreach_rrdcalc_in_rrdhost_done(rc); + + if (unlikely(runnable && !netdata_exit)) { + foreach_rrdcalc_in_rrdhost_read(host, rc) { + if (unlikely(!(rc->run_flags & RRDCALC_FLAG_RUNNABLE))) + continue; + + if (rc->run_flags & RRDCALC_FLAG_DISABLED) { + continue; + } + RRDCALC_STATUS warning_status = RRDCALC_STATUS_UNDEFINED; + RRDCALC_STATUS critical_status = RRDCALC_STATUS_UNDEFINED; + + // -------------------------------------------------------- + // check the warning expression + + if (likely(rc->warning)) { + worker_is_busy(WORKER_HEALTH_JOB_WARNING_EVAL); + + if (unlikely(!expression_evaluate(rc->warning))) { + // calculation failed + rc->run_flags |= RRDCALC_FLAG_WARN_ERROR; + + debug(D_HEALTH, + "Health on host '%s', alarm '%s.%s': warning expression failed with error: %s", + rrdhost_hostname(host), rrdcalc_chart_name(rc), rrdcalc_name(rc), + buffer_tostring(rc->warning->error_msg) + ); + } else { + rc->run_flags &= ~RRDCALC_FLAG_WARN_ERROR; + debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': warning expression gave value " + NETDATA_DOUBLE_FORMAT + ": %s (source: %s)", rrdhost_hostname(host), rrdcalc_chart_name(rc), + rrdcalc_name(rc), rc->warning->result, 
buffer_tostring(rc->warning->error_msg), rrdcalc_source(rc) + ); + warning_status = rrdcalc_value2status(rc->warning->result); + } + } + + // -------------------------------------------------------- + // check the critical expression + + if (likely(rc->critical)) { + worker_is_busy(WORKER_HEALTH_JOB_CRITICAL_EVAL); + + if (unlikely(!expression_evaluate(rc->critical))) { + // calculation failed + rc->run_flags |= RRDCALC_FLAG_CRIT_ERROR; + + debug(D_HEALTH, + "Health on host '%s', alarm '%s.%s': critical expression failed with error: %s", + rrdhost_hostname(host), rrdcalc_chart_name(rc), rrdcalc_name(rc), + buffer_tostring(rc->critical->error_msg) + ); + } else { + rc->run_flags &= ~RRDCALC_FLAG_CRIT_ERROR; + debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': critical expression gave value " + NETDATA_DOUBLE_FORMAT + ": %s (source: %s)", rrdhost_hostname(host), rrdcalc_chart_name(rc), + rrdcalc_name(rc), rc->critical->result, buffer_tostring(rc->critical->error_msg), + rrdcalc_source(rc) + ); + critical_status = rrdcalc_value2status(rc->critical->result); + } + } + + // -------------------------------------------------------- + // decide the final alarm status + + RRDCALC_STATUS status = RRDCALC_STATUS_UNDEFINED; + + switch (warning_status) { + case RRDCALC_STATUS_CLEAR: + status = RRDCALC_STATUS_CLEAR; + break; + + case RRDCALC_STATUS_RAISED: + status = RRDCALC_STATUS_WARNING; + break; + + default: + break; + } + + switch (critical_status) { + case RRDCALC_STATUS_CLEAR: + if (status == RRDCALC_STATUS_UNDEFINED) + status = RRDCALC_STATUS_CLEAR; + break; + + case RRDCALC_STATUS_RAISED: + status = RRDCALC_STATUS_CRITICAL; + break; + + default: + break; + } + + // -------------------------------------------------------- + // check if the new status and the old differ + + if (status != rc->status) { + worker_is_busy(WORKER_HEALTH_JOB_ALARM_LOG_ENTRY); + int delay = 0; + + // apply trigger hysteresis + + if (now > rc->delay_up_to_timestamp) { + rc->delay_up_current = 
rc->delay_up_duration; + rc->delay_down_current = rc->delay_down_duration; + rc->delay_last = 0; + rc->delay_up_to_timestamp = 0; + } else { + rc->delay_up_current = (int) (rc->delay_up_current * rc->delay_multiplier); + if (rc->delay_up_current > rc->delay_max_duration) + rc->delay_up_current = rc->delay_max_duration; + + rc->delay_down_current = (int) (rc->delay_down_current * rc->delay_multiplier); + if (rc->delay_down_current > rc->delay_max_duration) + rc->delay_down_current = rc->delay_max_duration; + } + + if (status > rc->status) + delay = rc->delay_up_current; + else + delay = rc->delay_down_current; + + // COMMENTED: because we do need to send raising alarms + // if(now + delay < rc->delay_up_to_timestamp) + // delay = (int)(rc->delay_up_to_timestamp - now); + + rc->delay_last = delay; + rc->delay_up_to_timestamp = now + delay; + + ALARM_ENTRY *ae = health_create_alarm_entry( + host, + rc->id, + rc->next_event_id++, + rc->config_hash_id, + now, + rc->name, + rc->rrdset->id, + rc->rrdset->context, + rc->rrdset->family, + rc->classification, + rc->component, + rc->type, + rc->exec, + rc->recipient, + now - rc->last_status_change, + rc->old_value, + rc->value, + rc->status, + status, + rc->source, + rc->units, + rc->info, + rc->delay_last, + ( + ((rc->options & RRDCALC_OPTION_NO_CLEAR_NOTIFICATION)? HEALTH_ENTRY_FLAG_NO_CLEAR_NOTIFICATION : 0) | + ((rc->run_flags & RRDCALC_FLAG_SILENCED)? 
HEALTH_ENTRY_FLAG_SILENCED : 0) | + (rrdcalc_isrepeating(rc)?HEALTH_ENTRY_FLAG_IS_REPEATING:0) + ) + ); + + health_alarm_log_add_entry(host, ae); + + log_health("[%s]: Alert event for [%s.%s], value [%s], status [%s].", rrdhost_hostname(host), ae_chart_name(ae), ae_name(ae), ae_new_value_string(ae), rrdcalc_status2string(ae->new_status)); + + rc->last_status_change = now; + rc->old_status = rc->status; + rc->status = status; + } + + rc->last_updated = now; + rc->next_update = now + rc->update_every; + + if (next_run > rc->next_update) + next_run = rc->next_update; + } + foreach_rrdcalc_in_rrdhost_done(rc); + + // process repeating alarms + foreach_rrdcalc_in_rrdhost_read(host, rc) { + int repeat_every = 0; + if(unlikely(rrdcalc_isrepeating(rc) && rc->delay_up_to_timestamp <= now)) { + if(unlikely(rc->status == RRDCALC_STATUS_WARNING)) { + rc->run_flags &= ~RRDCALC_FLAG_RUN_ONCE; + repeat_every = rc->warn_repeat_every; + } else if(unlikely(rc->status == RRDCALC_STATUS_CRITICAL)) { + rc->run_flags &= ~RRDCALC_FLAG_RUN_ONCE; + repeat_every = rc->crit_repeat_every; + } else if(unlikely(rc->status == RRDCALC_STATUS_CLEAR)) { + if(!(rc->run_flags & RRDCALC_FLAG_RUN_ONCE)) { + if(rc->old_status == RRDCALC_STATUS_CRITICAL) { + repeat_every = 1; + } else if (rc->old_status == RRDCALC_STATUS_WARNING) { + repeat_every = 1; + } + } + } + } else { + continue; + } + + if(unlikely(repeat_every > 0 && (rc->last_repeat + repeat_every) <= now)) { + worker_is_busy(WORKER_HEALTH_JOB_ALARM_LOG_ENTRY); + rc->last_repeat = now; + if (likely(rc->times_repeat < UINT32_MAX)) rc->times_repeat++; + + ALARM_ENTRY *ae = health_create_alarm_entry( + host, + rc->id, + rc->next_event_id++, + rc->config_hash_id, + now, + rc->name, + rc->rrdset->id, + rc->rrdset->context, + rc->rrdset->family, + rc->classification, + rc->component, + rc->type, + rc->exec, + rc->recipient, + now - rc->last_status_change, + rc->old_value, + rc->value, + rc->old_status, + rc->status, + rc->source, + rc->units, + 
rc->info, + rc->delay_last, + ( + ((rc->options & RRDCALC_OPTION_NO_CLEAR_NOTIFICATION)? HEALTH_ENTRY_FLAG_NO_CLEAR_NOTIFICATION : 0) | + ((rc->run_flags & RRDCALC_FLAG_SILENCED)? HEALTH_ENTRY_FLAG_SILENCED : 0) | + (rrdcalc_isrepeating(rc)?HEALTH_ENTRY_FLAG_IS_REPEATING:0) + ) + ); + + ae->last_repeat = rc->last_repeat; + if (!(rc->run_flags & RRDCALC_FLAG_RUN_ONCE) && rc->status == RRDCALC_STATUS_CLEAR) { + ae->flags |= HEALTH_ENTRY_RUN_ONCE; + } + rc->run_flags |= RRDCALC_FLAG_RUN_ONCE; + health_process_notifications(host, ae); + debug(D_HEALTH, "Notification sent for the repeating alarm %u.", ae->alarm_id); + health_alarm_wait_for_execution(ae); + health_alarm_log_free_one_nochecks_nounlink(ae); + } + } + foreach_rrdcalc_in_rrdhost_done(rc); + } + + if (unlikely(netdata_exit)) + break; + + // execute notifications + // and cleanup + worker_is_busy(WORKER_HEALTH_JOB_ALARM_LOG_PROCESS); + health_alarm_log_process(host); + + if (unlikely(netdata_exit)) { + // wait for all notifications to finish before allowing health to be cleaned up + ALARM_ENTRY *ae; + while (NULL != (ae = alarm_notifications_in_progress.head)) { + health_alarm_wait_for_execution(ae); + } + break; + } + + // wait for all notifications to finish before allowing health to be cleaned up + ALARM_ENTRY *ae; + while (NULL != (ae = alarm_notifications_in_progress.head)) { + health_alarm_wait_for_execution(ae); + } + +#ifdef ENABLE_ACLK + if (netdata_cloud_setting && unlikely(host->aclk_alert_reloaded) && loop > (marked_aclk_reload_loop + 2)) { + sql_queue_removed_alerts_to_aclk(host); + host->aclk_alert_reloaded = 0; + marked_aclk_reload_loop = 0; + } +#endif + + if(unlikely(netdata_exit)) + break; + + health_sleep(next_run, loop, host); + + } // forever + + netdata_thread_cleanup_pop(1); + return NULL; +} + +void health_add_host_labels(void) { + DICTIONARY *labels = localhost->rrdlabels; + + int is_ephemeral = appconfig_get_boolean(&netdata_config, CONFIG_SECTION_HEALTH, "is ephemeral", 
CONFIG_BOOLEAN_NO); + rrdlabels_add(labels, "_is_ephemeral", is_ephemeral ? "true" : "false", RRDLABEL_SRC_CONFIG); + + int has_unstable_connection = appconfig_get_boolean(&netdata_config, CONFIG_SECTION_HEALTH, "has unstable connection", CONFIG_BOOLEAN_NO); + rrdlabels_add(labels, "_has_unstable_connection", has_unstable_connection ? "true" : "false", RRDLABEL_SRC_CONFIG); +} + +void health_thread_spawn(RRDHOST * host) { + if(!host->health_spawn) { + char tag[NETDATA_THREAD_TAG_MAX + 1]; + snprintfz(tag, NETDATA_THREAD_TAG_MAX, "HEALTH[%s]", rrdhost_hostname(host)); + struct health_state *health = callocz(1, sizeof(*health)); + health->host = host; + + if(netdata_thread_create(&host->health_thread, tag, NETDATA_THREAD_OPTION_JOINABLE, health_main, (void *) health)) { + log_health("[%s]: Failed to create new thread for client.", rrdhost_hostname(host)); + error("HEALTH [%s]: Failed to create new thread for client.", rrdhost_hostname(host)); + } + else { + log_health("[%s]: Created new thread for client.", rrdhost_hostname(host)); + host->health_spawn = 1; + host->aclk_alert_reloaded = 1; + } + } +} diff --git a/health/health.d/adaptec_raid.conf b/health/health.d/adaptec_raid.conf new file mode 100644 index 0000000..1d823ad --- /dev/null +++ b/health/health.d/adaptec_raid.conf @@ -0,0 +1,30 @@ + +# logical device status check + + template: adaptec_raid_ld_status + on: adaptec_raid.ld_status + class: Errors + type: System +component: RAID + lookup: max -10s foreach * + units: bool + every: 10s + crit: $this > 0 + delay: down 5m multiplier 1.5 max 1h + info: logical device status is failed or degraded + to: sysadmin + +# physical device state check + + template: adaptec_raid_pd_state + on: adaptec_raid.pd_state + class: Errors + type: System +component: RAID + lookup: max -10s foreach * + units: bool + every: 10s + crit: $this > 0 + delay: down 5m multiplier 1.5 max 1h + info: physical device state is not online + to: sysadmin diff --git 
a/health/health.d/anomalies.conf b/health/health.d/anomalies.conf new file mode 100644 index 0000000..269ae54 --- /dev/null +++ b/health/health.d/anomalies.conf @@ -0,0 +1,23 @@ +# raise a warning alarm if an anomaly probability is consistently above 50% + + template: anomalies_anomaly_probabilities + on: anomalies.probability + class: Errors + type: Netdata +component: ML + lookup: average -2m foreach * + every: 1m + warn: $this > 50 + info: average anomaly probability over the last 2 minutes + +# raise a warning alarm if an anomaly flag is consistently firing + + template: anomalies_anomaly_flags + on: anomalies.anomaly + class: Errors + type: Netdata +component: ML + lookup: sum -2m foreach * + every: 1m + warn: $this > 10 + info: number of anomalies in the last 2 minutes diff --git a/health/health.d/apcupsd.conf b/health/health.d/apcupsd.conf new file mode 100644 index 0000000..65f1a69 --- /dev/null +++ b/health/health.d/apcupsd.conf @@ -0,0 +1,49 @@ +# you can disable an alarm notification by setting the 'to' line to: silent + + template: apcupsd_10min_ups_load + on: apcupsd.load + class: Utilization + type: Power Supply +component: UPS + os: * + hosts: * + lookup: average -10m unaligned of percentage + units: % + every: 1m + warn: $this > (($status >= $WARNING) ? (70) : (80)) + crit: $this > (($status == $CRITICAL) ? (85) : (95)) + delay: down 10m multiplier 1.5 max 1h + info: average UPS load over the last 10 minutes + to: sitemgr + +# Discussion in https://github.com/netdata/netdata/pull/3928: +# Fire the alarm as soon as it's going on battery (99% charge) and clear only when full. + template: apcupsd_ups_charge + on: apcupsd.charge + class: Errors + type: Power Supply +component: UPS + os: * + hosts: * + lookup: average -60s unaligned of charge + units: % + every: 60s + warn: $this < 100 + crit: $this < (($status == $CRITICAL) ? 
(60) : (50)) + delay: down 10m multiplier 1.5 max 1h + info: average UPS charge over the last minute + to: sitemgr + + template: apcupsd_last_collected_secs + on: apcupsd.load + class: Latency + type: Power Supply +component: UPS device + calc: $now - $last_collected_t + every: 10s + units: seconds ago + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: sitemgr diff --git a/health/health.d/bcache.conf b/health/health.d/bcache.conf new file mode 100644 index 0000000..49cb5ad --- /dev/null +++ b/health/health.d/bcache.conf @@ -0,0 +1,30 @@ + + template: bcache_cache_errors + on: disk.bcache_cache_read_races + class: Errors + type: System +component: Disk + lookup: sum -1m unaligned absolute + units: errors + every: 1m + warn: $this > 0 + delay: up 2m down 1h multiplier 1.5 max 2h + info: number of times data was read from the cache, \ + the bucket was reused and invalidated in the last 10 minutes \ + (when this occurs the data is reread from the backing device) + to: sysadmin + + template: bcache_cache_dirty + on: disk.bcache_cache_alloc + class: Utilization + type: System +component: Disk + calc: $dirty + $metadata + $undefined + units: % + every: 1m + warn: $this > ( ($status >= $WARNING ) ? ( 70 ) : ( 90 ) ) + crit: $this > ( ($status == $CRITICAL) ? 
( 90 ) : ( 95 ) ) + delay: up 1m down 1h multiplier 1.5 max 2h + info: percentage of cache space used for dirty data and metadata \ + (this usually means your SSD cache is too small) + to: sysadmin diff --git a/health/health.d/beanstalkd.conf b/health/health.d/beanstalkd.conf new file mode 100644 index 0000000..13ac8c1 --- /dev/null +++ b/health/health.d/beanstalkd.conf @@ -0,0 +1,41 @@ +# get the number of buried jobs in all queues + + template: beanstalk_server_buried_jobs + on: beanstalk.current_jobs + class: Workload + type: Messaging +component: Beanstalk + calc: $buried + units: jobs + every: 10s + warn: $this > 0 + crit: $this > 10 + delay: up 0 down 5m multiplier 1.2 max 1h + info: number of buried jobs across all tubes. \ + You need to manually kick them so they can be processed. \ + Presence of buried jobs in a tube does not affect new jobs. + to: sysadmin + +# get the number of buried jobs per queue + +#template: beanstalk_tube_buried_jobs +# on: beanstalk.jobs +# calc: $buried +# units: jobs +# every: 10s +# warn: $this > 0 +# crit: $this > 10 +# delay: up 0 down 5m multiplier 1.2 max 1h +# info: the number of jobs buried per tube +# to: sysadmin + +# get the current number of tubes + +#template: beanstalk_number_of_tubes +# on: beanstalk.current_tubes +# calc: $tubes +# every: 10s +# warn: $this < 5 +# delay: up 0 down 5m multiplier 1.2 max 1h +# info: the current number of tubes on the server +# to: sysadmin diff --git a/health/health.d/bind_rndc.conf b/health/health.d/bind_rndc.conf new file mode 100644 index 0000000..7c09225 --- /dev/null +++ b/health/health.d/bind_rndc.conf @@ -0,0 +1,12 @@ + template: bind_rndc_stats_file_size + on: bind_rndc.stats_size + class: Utilization + type: DNS +component: BIND + units: megabytes + every: 60 + calc: $stats_size + warn: $this > 512 + crit: $this > 1024 + info: BIND statistics-file size + to: sysadmin diff --git a/health/health.d/boinc.conf b/health/health.d/boinc.conf new file mode 100644 index 
0000000..7d7a4fd --- /dev/null +++ b/health/health.d/boinc.conf @@ -0,0 +1,74 @@ +# Alarms for various BOINC issues. + +# Warn on any compute errors encountered. + template: boinc_compute_errors + on: boinc.states + class: Errors + type: Computing +component: BOINC + os: * + hosts: * + families: * + lookup: average -10m unaligned of comperror + units: tasks + every: 1m + warn: $this > 0 + crit: $this > 1 + delay: up 1m down 5m multiplier 1.5 max 1h + info: average number of compute errors over the last 10 minutes + to: sysadmin + +# Warn on lots of upload errors + template: boinc_upload_errors + on: boinc.states + class: Errors + type: Computing +component: BOINC + os: * + hosts: * + families: * + lookup: average -10m unaligned of upload_failed + units: tasks + every: 1m + warn: $this > 0 + crit: $this > 1 + delay: up 1m down 5m multiplier 1.5 max 1h + info: average number of failed uploads over the last 10 minutes + to: sysadmin + +# Warn on the task queue being empty + template: boinc_total_tasks + on: boinc.tasks + class: Utilization + type: Computing +component: BOINC + os: * + hosts: * + families: * + lookup: average -10m unaligned of total + units: tasks + every: 1m + warn: $this < 1 + crit: $this < 0.1 + delay: up 5m down 10m multiplier 1.5 max 1h + info: average number of total tasks over the last 10 minutes + to: sysadmin + +# Warn on no active tasks with a non-empty queue + template: boinc_active_tasks + on: boinc.tasks + class: Utilization + type: Computing +component: BOINC + os: * + hosts: * + families: * + lookup: average -10m unaligned of active + calc: ($boinc_total_tasks >= 1) ? 
($this) : (inf) + units: tasks + every: 1m + warn: $this < 1 + crit: $this < 0.1 + delay: up 5m down 10m multiplier 1.5 max 1h + info: average number of active tasks over the last 10 minutes + to: sysadmin diff --git a/health/health.d/btrfs.conf b/health/health.d/btrfs.conf new file mode 100644 index 0000000..8d197aa --- /dev/null +++ b/health/health.d/btrfs.conf @@ -0,0 +1,68 @@ + + template: btrfs_allocated + on: btrfs.disk + class: Utilization + type: System +component: File system + os: * + hosts: * + families: * + calc: 100 - ($unallocated * 100 / ($unallocated + $data_used + $data_free + $meta_used + $meta_free + $sys_used + $sys_free)) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (90) : (95)) + crit: $this > (($status == $CRITICAL) ? (95) : (98)) + delay: up 1m down 15m multiplier 1.5 max 1h + info: percentage of allocated BTRFS physical disk space + to: sysadmin + + template: btrfs_data + on: btrfs.data + class: Utilization + type: System +component: File system + os: * + hosts: * + families: * + calc: $used * 100 / ($used + $free) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (90) : (95)) && $btrfs_allocated > 98 + crit: $this > (($status == $CRITICAL) ? (95) : (98)) && $btrfs_allocated > 98 + delay: up 1m down 15m multiplier 1.5 max 1h + info: utilization of BTRFS data space + to: sysadmin + + template: btrfs_metadata + on: btrfs.metadata + class: Utilization + type: System +component: File system + os: * + hosts: * + families: * + calc: ($used + $reserved) * 100 / ($used + $free + $reserved) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (90) : (95)) && $btrfs_allocated > 98 + crit: $this > (($status == $CRITICAL) ? 
(95) : (98)) && $btrfs_allocated > 98 + delay: up 1m down 15m multiplier 1.5 max 1h + info: utilization of BTRFS metadata space + to: sysadmin + + template: btrfs_system + on: btrfs.system + class: Utilization + type: System +component: File system + os: * + hosts: * + families: * + calc: $used * 100 / ($used + $free) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (90) : (95)) && $btrfs_allocated > 98 + crit: $this > (($status == $CRITICAL) ? (95) : (98)) && $btrfs_allocated > 98 + delay: up 1m down 15m multiplier 1.5 max 1h + info: utilization of BTRFS system space + to: sysadmin diff --git a/health/health.d/ceph.conf b/health/health.d/ceph.conf new file mode 100644 index 0000000..1f9da25 --- /dev/null +++ b/health/health.d/ceph.conf @@ -0,0 +1,15 @@ +# low ceph disk available + + template: ceph_cluster_space_usage + on: ceph.general_usage + class: Utilization + type: Storage +component: Ceph + calc: $used * 100 / ($used + $avail) + units: % + every: 1m + warn: $this > (($status >= $WARNING ) ? (85) : (90)) + crit: $this > (($status == $CRITICAL) ? (90) : (98)) + delay: down 5m multiplier 1.2 max 1h + info: cluster disk space utilization + to: sysadmin diff --git a/health/health.d/cgroups.conf b/health/health.d/cgroups.conf new file mode 100644 index 0000000..4bfe38b --- /dev/null +++ b/health/health.d/cgroups.conf @@ -0,0 +1,141 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + + template: cgroup_10min_cpu_usage + on: cgroup.cpu_limit + class: Utilization + type: Cgroups +component: CPU + os: linux + hosts: * + lookup: average -10m unaligned + units: % + every: 1m + warn: $this > (($status >= $WARNING) ? (75) : (85)) + crit: $this > (($status == $CRITICAL) ? 
(85) : (95)) + delay: down 15m multiplier 1.5 max 1h + info: average cgroup CPU utilization over the last 10 minutes + to: sysadmin + + template: cgroup_ram_in_use + on: cgroup.mem_usage + class: Utilization + type: Cgroups +component: Memory + os: linux + hosts: * + calc: ($ram) * 100 / $memory_limit + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (80) : (90)) + crit: $this > (($status == $CRITICAL) ? (90) : (98)) + delay: down 15m multiplier 1.5 max 1h + info: cgroup memory utilization + to: sysadmin + +# ----------------------------------------------------------------------------- +# check for packet storms + +# 1. calculate the rate packets are received in 1m: 1m_received_packets_rate +# 2. do the same for the last 10s +# 3. raise an alarm if the latter is 10x or 20x the first +# we assume the minimum packet storm should at least have +# 10000 packets/s, average of the last 10 seconds + + template: cgroup_1m_received_packets_rate + on: cgroup.net_packets + class: Workload + type: Cgroups +component: Network + hosts: * + lookup: average -1m unaligned of received + units: packets + every: 10s + info: average number of packets received by the network interface $family over the last minute + + template: cgroup_10s_received_packets_storm + on: cgroup.net_packets + class: Workload + type: Cgroups +component: Network + hosts: * + lookup: average -10s unaligned of received + calc: $this * 100 / (($cgroup_1m_received_packets_rate < 1000)?(1000):($cgroup_1m_received_packets_rate)) + every: 10s + units: % + warn: $this > (($status >= $WARNING)?(200):(5000)) + crit: $this > (($status == $CRITICAL)?(5000):(6000)) + options: no-clear-notification + info: ratio of average number of received packets for the network interface $family over the last 10 seconds, \ + compared to the rate over the last minute + to: sysadmin + +# ---------------------------------K8s containers-------------------------------------------- + + template: k8s_cgroup_10min_cpu_usage + on: 
k8s.cgroup.cpu_limit + class: Utilization + type: Cgroups +component: CPU + os: linux + hosts: * + lookup: average -10m unaligned + units: % + every: 1m + warn: $this > (($status >= $WARNING) ? (75) : (85)) + crit: $this > (($status == $CRITICAL) ? (85) : (95)) + delay: down 15m multiplier 1.5 max 1h + info: average cgroup CPU utilization over the last 10 minutes + to: sysadmin + + template: k8s_cgroup_ram_in_use + on: k8s.cgroup.mem_usage + class: Utilization + type: Cgroups +component: Memory + os: linux + hosts: * + calc: ($ram) * 100 / $memory_limit + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (80) : (90)) + crit: $this > (($status == $CRITICAL) ? (90) : (98)) + delay: down 15m multiplier 1.5 max 1h + info: cgroup memory utilization + to: sysadmin + +# check for packet storms + +# 1. calculate the rate packets are received in 1m: 1m_received_packets_rate +# 2. do the same for the last 10s +# 3. raise an alarm if the latter is 10x or 20x the first +# we assume the minimum packet storm should at least have +# 10000 packets/s, average of the last 10 seconds + + template: k8s_cgroup_1m_received_packets_rate + on: k8s.cgroup.net_packets + class: Workload + type: Cgroups +component: Network + hosts: * + lookup: average -1m unaligned of received + units: packets + every: 10s + info: average number of packets received by the network interface $family over the last minute + + template: k8s_cgroup_10s_received_packets_storm + on: k8s.cgroup.net_packets + class: Workload + type: Cgroups +component: Network + hosts: * + lookup: average -10s unaligned of received + calc: $this * 100 / (($k8s_cgroup_1m_received_packets_rate < 1000)?(1000):($k8s_cgroup_1m_received_packets_rate)) + every: 10s + units: % + warn: $this > (($status >= $WARNING)?(200):(5000)) + crit: $this > (($status == $CRITICAL)?(5000):(6000)) + options: no-clear-notification + info: ratio of average number of received packets for the network interface $family over the last 10 seconds, \ 
+ compared to the rate over the last minute + to: sysadmin diff --git a/health/health.d/cockroachdb.conf b/health/health.d/cockroachdb.conf new file mode 100644 index 0000000..1f22784 --- /dev/null +++ b/health/health.d/cockroachdb.conf @@ -0,0 +1,73 @@ + +# Capacity + + template: cockroachdb_used_storage_capacity + on: cockroachdb.storage_used_capacity_percentage + class: Utilization + type: Database +component: CockroachDB + calc: $capacity_used_percent + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (80) : (85)) + crit: $this > (($status == $CRITICAL) ? (85) : (95)) + delay: down 15m multiplier 1.5 max 1h + info: storage capacity utilization + to: dba + + template: cockroachdb_used_usable_storage_capacity + on: cockroachdb.storage_used_capacity_percentage + class: Utilization + type: Database +component: CockroachDB + calc: $capacity_usable_used_percent + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (80) : (85)) + crit: $this > (($status == $CRITICAL) ? 
(85) : (95)) + delay: down 15m multiplier 1.5 max 1h + info: storage usable space utilization + to: dba + +# Replication + + template: cockroachdb_unavailable_ranges + on: cockroachdb.ranges_replication_problem + class: Errors + type: Database +component: CockroachDB + calc: $ranges_unavailable + units: num + every: 10s + warn: $this > 0 + delay: down 15m multiplier 1.5 max 1h + info: number of ranges with fewer live replicas than needed for quorum + to: dba + + template: cockroachdb_underreplicated_ranges + on: cockroachdb.ranges_replication_problem + class: Errors + type: Database +component: CockroachDB + calc: $ranges_underreplicated + units: num + every: 10s + warn: $this > 0 + delay: down 15m multiplier 1.5 max 1h + info: number of ranges with fewer live replicas than the replication target + to: dba + +# FD + + template: cockroachdb_open_file_descriptors_limit + on: cockroachdb.process_file_descriptors + class: Utilization + type: Database +component: CockroachDB + calc: $sys_fd_open/$sys_fd_softlimit * 100 + units: % + every: 10s + warn: $this > 80 + delay: down 15m multiplier 1.5 max 1h + info: open file descriptors utilization (against softlimit) + to: dba diff --git a/health/health.d/cpu.conf b/health/health.d/cpu.conf new file mode 100644 index 0000000..ad69528 --- /dev/null +++ b/health/health.d/cpu.conf @@ -0,0 +1,67 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + + template: 10min_cpu_usage + on: system.cpu + class: Utilization + type: System +component: CPU + os: linux + hosts: * + lookup: average -10m unaligned of user,system,softirq,irq,guest + units: % + every: 1m + warn: $this > (($status >= $WARNING) ? (75) : (85)) + crit: $this > (($status == $CRITICAL) ? 
(85) : (95)) + delay: down 15m multiplier 1.5 max 1h + info: average CPU utilization over the last 10 minutes (excluding iowait, nice and steal) + to: sysadmin + + template: 10min_cpu_iowait + on: system.cpu + class: Utilization + type: System +component: CPU + os: linux + hosts: * + lookup: average -10m unaligned of iowait + units: % + every: 1m + warn: $this > (($status >= $WARNING) ? (20) : (40)) + crit: $this > (($status == $CRITICAL) ? (40) : (50)) + delay: down 15m multiplier 1.5 max 1h + info: average CPU iowait time over the last 10 minutes + to: sysadmin + + template: 20min_steal_cpu + on: system.cpu + class: Latency + type: System +component: CPU + os: linux + hosts: * + lookup: average -20m unaligned of steal + units: % + every: 5m + warn: $this > (($status >= $WARNING) ? (5) : (10)) + crit: $this > (($status == $CRITICAL) ? (20) : (30)) + delay: down 1h multiplier 1.5 max 2h + info: average CPU steal time over the last 20 minutes + to: sysadmin + +## FreeBSD + template: 10min_cpu_usage + on: system.cpu + class: Utilization + type: System +component: CPU + os: freebsd + hosts: * + lookup: average -10m unaligned of user,system,interrupt + units: % + every: 1m + warn: $this > (($status >= $WARNING) ? (75) : (85)) + crit: $this > (($status == $CRITICAL) ? 
(85) : (95)) + delay: down 15m multiplier 1.5 max 1h + info: average CPU utilization over the last 10 minutes (excluding nice) + to: sysadmin diff --git a/health/health.d/dbengine.conf b/health/health.d/dbengine.conf new file mode 100644 index 0000000..65c41b8 --- /dev/null +++ b/health/health.d/dbengine.conf @@ -0,0 +1,64 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + + alarm: 10min_dbengine_global_fs_errors + on: netdata.dbengine_global_errors + class: Errors + type: Netdata +component: DB engine + os: linux freebsd macos + hosts: * + lookup: sum -10m unaligned of fs_errors + units: errors + every: 10s + crit: $this > 0 + delay: down 15m multiplier 1.5 max 1h + info: number of filesystem errors in the last 10 minutes (too many open files, wrong permissions, etc) + to: sysadmin + + alarm: 10min_dbengine_global_io_errors + on: netdata.dbengine_global_errors + class: Errors + type: Netdata +component: DB engine + os: linux freebsd macos + hosts: * + lookup: sum -10m unaligned of io_errors + units: errors + every: 10s + crit: $this > 0 + delay: down 1h multiplier 1.5 max 3h + info: number of IO errors in the last 10 minutes (CRC errors, out of space, bad disk, etc) + to: sysadmin + + alarm: 10min_dbengine_global_flushing_warnings + on: netdata.dbengine_global_errors + class: Errors + type: Netdata +component: DB engine + os: linux freebsd macos + hosts: * + lookup: sum -10m unaligned of pg_cache_over_half_dirty_events + units: errors + every: 10s + warn: $this > 0 + delay: down 1h multiplier 1.5 max 3h + info: number of times when dbengine dirty pages were over 50% of the instance's page cache in the last 10 minutes. \ + Metric data are at risk of not being stored in the database. To remedy, reduce disk load or use faster disks. 
+ to: sysadmin + + alarm: 10min_dbengine_global_flushing_errors + on: netdata.dbengine_long_term_page_stats + class: Errors + type: Netdata +component: DB engine + os: linux freebsd macos + hosts: * + lookup: sum -10m unaligned of flushing_pressure_deletions + units: pages + every: 10s + crit: $this != 0 + delay: down 1h multiplier 1.5 max 3h + info: number of pages deleted due to failure to flush data to disk in the last 10 minutes. \ + Metric data were lost to unblock data collection. To fix, reduce disk load or use faster disks. + to: sysadmin diff --git a/health/health.d/disks.conf b/health/health.d/disks.conf new file mode 100644 index 0000000..5daff61 --- /dev/null +++ b/health/health.d/disks.conf @@ -0,0 +1,173 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + + +# ----------------------------------------------------------------------------- +# low disk space + +# checking the latest collected values +# raise an alarm if the disk is low on +# available disk space + + template: disk_space_usage + on: disk.space + class: Utilization + type: System +component: Disk + os: linux freebsd + hosts: * + families: !/dev !/dev/* !/run !/run/* * + calc: $used * 100 / ($avail + $used) + units: % + every: 1m + warn: $this > (($status >= $WARNING ) ? (80) : (90)) + crit: $this > (($status == $CRITICAL) ? (90) : (98)) + delay: up 1m down 15m multiplier 1.5 max 1h + info: disk $family space utilization + to: sysadmin + + template: disk_inode_usage + on: disk.inodes + class: Utilization + type: System +component: Disk + os: linux freebsd + hosts: * + families: !/dev !/dev/* !/run !/run/* * + calc: $used * 100 / ($avail + $used) + units: % + every: 1m + warn: $this > (($status >= $WARNING) ? (80) : (90)) + crit: $this > (($status == $CRITICAL) ? 
(90) : (98)) + delay: up 1m down 15m multiplier 1.5 max 1h + info: disk $family inode utilization + to: sysadmin + + +# ----------------------------------------------------------------------------- +# disk fill rate + +# calculate the rate the disk fills +# use as base, the available space change +# during the last hour + +# this is just a calculation - it has no alarm +# we will use it in the next template to find +# the hours remaining + +# template: disk_fill_rate +# on: disk.space +# os: linux freebsd +# hosts: * +# families: * +# lookup: min -10m at -50m unaligned of avail +# calc: ($this - $avail) / (($now - $after) / 3600) +# every: 1m +# units: GB/hour +# info: average rate the disk fills up (positive), or frees up (negative) space, for the last hour + + +# calculate the hours remaining +# if the disk continues to fill +# in this rate + +# template: out_of_disk_space_time +# on: disk.space +# os: linux freebsd +# hosts: * +# families: * +# calc: ($disk_fill_rate > 0) ? ($avail / $disk_fill_rate) : (inf) +# units: hours +# every: 10s +# warn: $this > 0 and $this < (($status >= $WARNING) ? (48) : (8)) +# crit: $this > 0 and $this < (($status == $CRITICAL) ? 
(24) : (2)) +# delay: down 15m multiplier 1.2 max 1h +# info: estimated time the disk will run out of space, if the system continues to add data with the rate of the last hour +# to: sysadmin + + +# ----------------------------------------------------------------------------- +# disk inode fill rate + +# calculate the rate the disk inodes are allocated +# use as base, the available inodes change +# during the last hour + +# this is just a calculation - it has no alarm +# we will use it in the next template to find +# the hours remaining + +# template: disk_inode_rate +# on: disk.inodes +# os: linux freebsd +# hosts: * +# families: * +# lookup: min -10m at -50m unaligned of avail +# calc: ($this - $avail) / (($now - $after) / 3600) +# every: 1m +# units: inodes/hour +# info: average rate at which disk inodes are allocated (positive), or freed (negative), for the last hour + +# calculate the hours remaining +# if the disk inodes are allocated +# in this rate + +# template: out_of_disk_inodes_time +# on: disk.inodes +# os: linux freebsd +# hosts: * +# families: * +# calc: ($disk_inode_rate > 0) ? ($avail / $disk_inode_rate) : (inf) +# units: hours +# every: 10s +# warn: $this > 0 and $this < (($status >= $WARNING) ? (48) : (8)) +# crit: $this > 0 and $this < (($status == $CRITICAL) ? (24) : (2)) +# delay: down 15m multiplier 1.2 max 1h +# info: estimated time the disk will run out of inodes, if the system continues to allocate inodes with the rate of the last hour +# to: sysadmin + + +# ----------------------------------------------------------------------------- +# disk congestion + +# raise an alarm if the disk is congested +# by calculating the average disk utilization +# for the last 10 minutes + + template: 10min_disk_utilization + on: disk.util + class: Utilization + type: System +component: Disk + os: linux freebsd + hosts: * + families: * + lookup: average -10m unaligned + units: % + every: 1m + warn: $this > 98 * (($status >= $WARNING) ? 
(0.7) : (1)) + delay: down 15m multiplier 1.2 max 1h + info: average percentage of time $family disk was busy over the last 10 minutes + to: silent + + +# raise an alarm if the disk backlog +# is above 1000ms (1s) per second +# for 10 minutes +# (i.e. the disk cannot catch up) + + template: 10min_disk_backlog + on: disk.backlog + class: Latency + type: System +component: Disk + os: linux + hosts: * + families: * + lookup: average -10m unaligned + units: ms + every: 1m + warn: $this > 5000 * (($status >= $WARNING) ? (0.7) : (1)) + delay: down 15m multiplier 1.2 max 1h + info: average backlog size of the $family disk over the last 10 minutes + to: silent diff --git a/health/health.d/dns_query.conf b/health/health.d/dns_query.conf new file mode 100644 index 0000000..b9d6c23 --- /dev/null +++ b/health/health.d/dns_query.conf @@ -0,0 +1,14 @@ +# detect dns query failure + + template: dns_query_query_status + on: dns_query.query_status + class: Errors + type: DNS +component: DNS + calc: $success + units: status + every: 10s + warn: $this != nan && $this != 1 + delay: up 30s down 5m multiplier 1.5 max 1h + info: DNS request type $label:record_type to server $label:server is unsuccessful + to: sysadmin diff --git a/health/health.d/dnsmasq_dhcp.conf b/health/health.d/dnsmasq_dhcp.conf new file mode 100644 index 0000000..010b945 --- /dev/null +++ b/health/health.d/dnsmasq_dhcp.conf @@ -0,0 +1,15 @@ +# dhcp-range utilization + + template: dnsmasq_dhcp_dhcp_range_utilization + on: dnsmasq_dhcp.dhcp_range_utilization + class: Utilization + type: DHCP +component: Dnsmasq + every: 10s + units: % + calc: $used + warn: $this > ( ($status >= $WARNING ) ? ( 80 ) : ( 90 ) ) + crit: $this > ( ($status == $CRITICAL) ? 
( 90 ) : ( 95 ) ) + delay: down 5m + info: DHCP range utilization + to: sysadmin diff --git a/health/health.d/dockerd.conf b/health/health.d/dockerd.conf new file mode 100644 index 0000000..220ddd6 --- /dev/null +++ b/health/health.d/dockerd.conf @@ -0,0 +1,11 @@ + template: docker_unhealthy_containers + on: docker.unhealthy_containers + class: Errors + type: Containers +component: Docker + units: unhealthy containers + every: 10s + lookup: average -10s + crit: $this > 0 + info: average number of unhealthy docker containers over the last 10 seconds + to: sysadmin diff --git a/health/health.d/entropy.conf b/health/health.d/entropy.conf new file mode 100644 index 0000000..13b0fcd --- /dev/null +++ b/health/health.d/entropy.conf @@ -0,0 +1,19 @@ + +# check if entropy is too low +# the alarm is checked every 1 minute +# and examines the last hour of data + + alarm: lowest_entropy + on: system.entropy + class: Utilization + type: System +component: Cryptography + os: linux + hosts: * + lookup: min -5m unaligned + units: entries + every: 5m + warn: $this < (($status >= $WARNING) ? (200) : (100)) + delay: down 1h multiplier 1.5 max 2h + info: minimum number of entries in the random numbers pool in the last 5 minutes + to: silent diff --git a/health/health.d/exporting.conf b/health/health.d/exporting.conf new file mode 100644 index 0000000..06f398c --- /dev/null +++ b/health/health.d/exporting.conf @@ -0,0 +1,29 @@ + + template: exporting_last_buffering + families: * + on: exporting_data_size + class: Latency + type: Netdata +component: Exporting engine + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? 
($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful buffering of exporting data + to: dba + + template: exporting_metrics_sent + families: * + on: exporting_data_size + class: Workload + type: Netdata +component: Exporting engine + units: % + calc: abs($sent) * 100 / abs($buffered) + every: 10s + warn: $this != 100 + delay: down 5m multiplier 1.5 max 1h + info: percentage of metrics sent to the external database server + to: dba diff --git a/health/health.d/fping.conf b/health/health.d/fping.conf new file mode 100644 index 0000000..bb22419 --- /dev/null +++ b/health/health.d/fping.conf @@ -0,0 +1,64 @@ + + template: fping_last_collected_secs + families: * + on: fping.latency + class: Latency + type: Other +component: Network + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? 
($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: sysadmin + + template: fping_host_reachable + families: * + on: fping.latency + class: Errors + type: Other +component: Network + calc: $average != nan + units: up/down + every: 10s + crit: $this == 0 + delay: down 30m multiplier 1.5 max 2h + info: reachability status of the network host (0: unreachable, 1: reachable) + to: sysadmin + + template: fping_host_latency + families: * + on: fping.latency + class: Latency + type: Other +component: Network + lookup: average -10s unaligned of average + units: ms + every: 10s + green: 500 + red: 1000 + warn: $this > $green OR $max > $red + crit: $this > $red + delay: down 30m multiplier 1.5 max 2h + info: average latency to the network host over the last 10 seconds + to: sysadmin + + template: fping_packet_loss + families: * + on: fping.quality + class: Errors + type: System +component: Network + lookup: average -10m unaligned of returned + calc: 100 - $this + green: 1 + red: 10 + units: % + every: 10s + warn: $this > $green + crit: $this > $red + delay: down 30m multiplier 1.5 max 2h + info: packet loss ratio to the network host over the last 10 minutes + to: sysadmin diff --git a/health/health.d/gearman.conf b/health/health.d/gearman.conf new file mode 100644 index 0000000..14010d4 --- /dev/null +++ b/health/health.d/gearman.conf @@ -0,0 +1,14 @@ + + template: gearman_workers_queued + on: gearman.single_job + class: Latency + type: Computing +component: Gearman + lookup: average -10m unaligned match-names of Pending + units: workers + every: 10s + warn: $this > 30000 + crit: $this > 100000 + delay: down 5m multiplier 1.5 max 1h + info: average number of queued jobs over the last 10 minutes + to: sysadmin diff --git a/health/health.d/geth.conf b/health/health.d/geth.conf new file mode 100644 index 0000000..dd1eb47 --- /dev/null +++ b/health/health.d/geth.conf @@ -0,0 +1,12 
@@ +#chainhead_header is expected momentarily to be ahead. If it's considerably ahead (e.g. more than 5 blocks), then the node is definitely out of sync. + template: geth_chainhead_diff_between_header_block + on: geth.chainhead + class: Workload + type: ethereum_node +component: geth + every: 10s + calc: $chain_head_block - $chain_head_header + units: blocks + warn: $this != 0 + crit: $this > 5 + delay: down 1m multiplier 1.5 max 1h diff --git a/health/health.d/go.d.plugin.conf b/health/health.d/go.d.plugin.conf new file mode 100644 index 0000000..cd87fe0 --- /dev/null +++ b/health/health.d/go.d.plugin.conf @@ -0,0 +1,17 @@ + +# make sure go.d.plugin data collection job is running + + template: go.d_job_last_collected_secs + on: netdata.go_plugin_execution_time + class: Errors + type: Netdata +component: go.d.plugin + module: !* * + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? 
($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: webmaster diff --git a/health/health.d/haproxy.conf b/health/health.d/haproxy.conf new file mode 100644 index 0000000..a0ab52b --- /dev/null +++ b/health/health.d/haproxy.conf @@ -0,0 +1,23 @@ + template: haproxy_backend_server_status + on: haproxy_hs.down + class: Errors + type: Web Proxy +component: HAProxy + units: failed servers + every: 10s + lookup: average -10s + crit: $this > 0 + info: average number of failed haproxy backend servers over the last 10 seconds + to: sysadmin + + template: haproxy_backend_status + on: haproxy_hb.down + class: Errors + type: Web Proxy +component: HAProxy + units: failed backend + every: 10s + lookup: average -10s + crit: $this > 0 + info: average number of failed haproxy backends over the last 10 seconds + to: sysadmin diff --git a/health/health.d/hdfs.conf b/health/health.d/hdfs.conf new file mode 100644 index 0000000..ca8df31 --- /dev/null +++ b/health/health.d/hdfs.conf @@ -0,0 +1,76 @@ + +# Common + + template: hdfs_capacity_usage + on: hdfs.capacity + class: Utilization + type: Storage +component: HDFS + calc: ($used) * 100 / ($used + $remaining) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (70) : (80)) + crit: $this > (($status == $CRITICAL) ? 
(80) : (98)) + delay: down 15m multiplier 1.5 max 1h + info: summary datanodes space capacity utilization + to: sysadmin + + +# NameNode + + template: hdfs_missing_blocks + on: hdfs.blocks + class: Errors + type: Storage +component: HDFS + calc: $missing + units: missing blocks + every: 10s + warn: $this > 0 + delay: down 15m multiplier 1.5 max 1h + info: number of missing blocks + to: sysadmin + + + template: hdfs_stale_nodes + on: hdfs.data_nodes + class: Errors + type: Storage +component: HDFS + calc: $stale + units: dead nodes + every: 10s + warn: $this > 0 + delay: down 15m multiplier 1.5 max 1h + info: number of datanodes marked stale due to delayed heartbeat + to: sysadmin + + + template: hdfs_dead_nodes + on: hdfs.data_nodes + class: Errors + type: Storage +component: HDFS + calc: $dead + units: dead nodes + every: 10s + crit: $this > 0 + delay: down 15m multiplier 1.5 max 1h + info: number of datanodes which are currently dead + to: sysadmin + + +# DataNode + + template: hdfs_num_failed_volumes + on: hdfs.num_failed_volumes + class: Errors + type: Storage +component: HDFS + calc: $fsds_num_failed_volumes + units: failed volumes + every: 10s + warn: $this > 0 + delay: down 15m multiplier 1.5 max 1h + info: number of failed volumes + to: sysadmin diff --git a/health/health.d/httpcheck.conf b/health/health.d/httpcheck.conf new file mode 100644 index 0000000..599c47a --- /dev/null +++ b/health/health.d/httpcheck.conf @@ -0,0 +1,112 @@ + +# This is a fast-reacting no-notification alarm ideal for custom dashboards or badges + template: httpcheck_web_service_up + families: * + on: httpcheck.status + class: Utilization + type: Web Server +component: HTTP endpoint + lookup: average -1m unaligned percentage of success + calc: ($this < 75) ? 
(0) : ($this) + every: 5s + units: up/down + info: average ratio of successful HTTP requests over the last minute (at least 75%) + to: silent + + template: httpcheck_web_service_bad_content + families: * + on: httpcheck.status + class: Workload + type: Web Server +component: HTTP endpoint + lookup: average -5m unaligned percentage of bad_content + every: 10s + units: % + warn: $this >= 10 AND $this < 40 + crit: $this >= 40 + delay: down 5m multiplier 1.5 max 1h + info: average ratio of HTTP responses with unexpected content over the last 5 minutes + options: no-clear-notification + to: webmaster + + template: httpcheck_web_service_bad_status + families: * + on: httpcheck.status + class: Workload + type: Web Server +component: HTTP endpoint + lookup: average -5m unaligned percentage of bad_status + every: 10s + units: % + warn: $this >= 10 AND $this < 40 + crit: $this >= 40 + delay: down 5m multiplier 1.5 max 1h + info: average ratio of HTTP responses with unexpected status over the last 5 minutes + options: no-clear-notification + to: webmaster + + template: httpcheck_web_service_timeouts + families: * + on: httpcheck.status + class: Latency + type: Web Server +component: HTTP endpoint + lookup: average -5m unaligned percentage of timeout + every: 10s + units: % + info: average ratio of HTTP request timeouts over the last 5 minutes + + template: httpcheck_no_web_service_connections + families: * + on: httpcheck.status + class: Errors + type: Other +component: HTTP endpoint + lookup: average -5m unaligned percentage of no_connection + every: 10s + units: % + info: average ratio of failed requests during the last 5 minutes + +# combined timeout & no connection alarm + template: httpcheck_web_service_unreachable + families: * + on: httpcheck.status + class: Errors + type: Web Server +component: HTTP endpoint + calc: ($httpcheck_no_web_service_connections >= $httpcheck_web_service_timeouts) ? 
($httpcheck_no_web_service_connections) : ($httpcheck_web_service_timeouts) + units: % + every: 10s + warn: ($httpcheck_no_web_service_connections >= 10 OR $httpcheck_web_service_timeouts >= 10) AND ($httpcheck_no_web_service_connections < 40 OR $httpcheck_web_service_timeouts < 40) + crit: $httpcheck_no_web_service_connections >= 40 OR $httpcheck_web_service_timeouts >= 40 + delay: down 5m multiplier 1.5 max 1h + info: ratio of failed requests either due to timeouts or no connection over the last 5 minutes + options: no-clear-notification + to: webmaster + + template: httpcheck_1h_web_service_response_time + families: * + on: httpcheck.responsetime + class: Latency + type: Other +component: HTTP endpoint + lookup: average -1h unaligned of time + every: 30s + units: ms + info: average HTTP response time over the last hour + + template: httpcheck_web_service_slow + families: * + on: httpcheck.responsetime + class: Latency + type: Web Server +component: HTTP endpoint + lookup: average -3m unaligned of time + units: ms + every: 10s + warn: ($this > ($httpcheck_1h_web_service_response_time * 2) ) + crit: ($this > ($httpcheck_1h_web_service_response_time * 3) ) + delay: down 5m multiplier 1.5 max 1h + info: average HTTP response time over the last 3 minutes, compared to the average over the last hour + options: no-clear-notification + to: webmaster diff --git a/health/health.d/ioping.conf b/health/health.d/ioping.conf new file mode 100644 index 0000000..8b498ad --- /dev/null +++ b/health/health.d/ioping.conf @@ -0,0 +1,16 @@ + template: ioping_disk_latency + families: * + on: ioping.latency + class: Latency + type: System +component: Disk + lookup: average -10s unaligned of latency + units: microseconds + every: 10s + green: 5000 + red: 10000 + warn: $this > $green + crit: $this > $red + delay: down 30m multiplier 1.5 max 2h + info: average I/O latency over the last 10 seconds + to: sysadmin diff --git a/health/health.d/ipc.conf b/health/health.d/ipc.conf new file mode 
100644 index 0000000..c178a41 --- /dev/null +++ b/health/health.d/ipc.conf @@ -0,0 +1,34 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + + alarm: semaphores_used + on: system.ipc_semaphores + class: Utilization + type: System +component: IPC + os: linux + hosts: * + calc: $semaphores * 100 / $ipc_semaphores_max + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (70) : (80)) + crit: $this > (($status == $CRITICAL) ? (70) : (90)) + delay: down 5m multiplier 1.5 max 1h + info: IPC semaphore utilization + to: sysadmin + + alarm: semaphore_arrays_used + on: system.ipc_semaphore_arrays + class: Utilization + type: System +component: IPC + os: linux + hosts: * + calc: $arrays * 100 / $ipc_semaphores_arrays_max + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (70) : (80)) + crit: $this > (($status == $CRITICAL) ? (70) : (90)) + delay: down 5m multiplier 1.5 max 1h + info: IPC semaphore arrays utilization + to: sysadmin diff --git a/health/health.d/ipfs.conf b/health/health.d/ipfs.conf new file mode 100644 index 0000000..a514ddf --- /dev/null +++ b/health/health.d/ipfs.conf @@ -0,0 +1,14 @@ + + template: ipfs_datastore_usage + on: ipfs.repo_size + class: Utilization + type: Data Sharing +component: IPFS + calc: $size * 100 / $avail + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (80) : (90)) + crit: $this > (($status == $CRITICAL) ? 
(90) : (98)) + delay: down 15m multiplier 1.5 max 1h + info: IPFS datastore utilization + to: sysadmin diff --git a/health/health.d/ipmi.conf b/health/health.d/ipmi.conf new file mode 100644 index 0000000..feadba1 --- /dev/null +++ b/health/health.d/ipmi.conf @@ -0,0 +1,26 @@ + alarm: ipmi_sensors_states + on: ipmi.sensors_states + class: Errors + type: System +component: IPMI + calc: $warning + $critical + units: sensors + every: 10s + warn: $this > 0 + crit: $critical > 0 + delay: up 5m down 15m multiplier 1.5 max 1h + info: number of IPMI sensors in non-nominal state + to: sysadmin + + alarm: ipmi_events + on: ipmi.events + class: Utilization + type: System +component: IPMI + calc: $events + units: events + every: 10s + warn: $this > 0 + delay: up 5m down 15m multiplier 1.5 max 1h + info: number of events in the IPMI System Event Log (SEL) + to: sysadmin diff --git a/health/health.d/isc_dhcpd.conf b/health/health.d/isc_dhcpd.conf new file mode 100644 index 0000000..d1f9396 --- /dev/null +++ b/health/health.d/isc_dhcpd.conf @@ -0,0 +1,10 @@ +# template: isc_dhcpd_leases_size +# on: isc_dhcpd.leases_total +# units: KB +# every: 60 +# calc: $leases_size +# warn: $this > 3072 +# crit: $this > 6144 +# delay: up 2m down 5m +# info: dhcpd.leases file too big! Module can slow down your server. +# to: sysadmin diff --git a/health/health.d/kubelet.conf b/health/health.d/kubelet.conf new file mode 100644 index 0000000..c2778cc --- /dev/null +++ b/health/health.d/kubelet.conf @@ -0,0 +1,145 @@ +# you can disable an alarm notification by setting the 'to' line to: silent + +# ----------------------------------------------------------------------------- + +# True (1) if the node is experiencing a configuration-related error, false (0) otherwise. 
+ + template: kubelet_node_config_error + on: k8s_kubelet.kubelet_node_config_error + class: Errors + type: Kubernetes +component: Kubelet + calc: $kubelet_node_config_error + units: bool + every: 10s + warn: $this == 1 + delay: down 1m multiplier 1.5 max 2h + info: the node is experiencing a configuration-related error (0: false, 1: true) + to: sysadmin + +# Failed Token() requests to the alternate token source + + template: kubelet_token_requests + lookup: sum -10s of token_fail_count + on: k8s_kubelet.kubelet_token_requests + class: Errors + type: Kubernetes +component: Kubelet + units: failed requests + every: 10s + warn: $this > 0 + delay: down 1m multiplier 1.5 max 2h + info: number of failed Token() requests to the alternate token source + to: sysadmin + +# Docker and runtime operation errors + + template: kubelet_operations_error + lookup: sum -1m + on: k8s_kubelet.kubelet_operations_errors + class: Errors + type: Kubernetes +component: Kubelet + units: errors + every: 10s + warn: $this > (($status >= $WARNING) ? (0) : (20)) + delay: up 30s down 1m multiplier 1.5 max 2h + info: number of Docker or runtime operation errors + to: sysadmin + +# ----------------------------------------------------------------------------- + +# Pod Lifecycle Event Generator Relisting Latency + +# 1. calculate the pleg relisting latency for 1m (quantile 0.5, quantile 0.9, quantile 0.99) +# 2. do the same for the last 10s +# 3. 
raise an alarm if the latter is: +# - 2x the first for quantile 0.5 +# - 4x the first for quantile 0.9 +# - 8x the first for quantile 0.99 +# +# we assume the minimum latency is 1000 microseconds + +# quantile 0.5 + + template: kubelet_1m_pleg_relist_latency_quantile_05 + on: k8s_kubelet.kubelet_pleg_relist_latency_microseconds + class: Latency + type: Kubernetes +component: Kubelet + lookup: average -1m unaligned of kubelet_pleg_relist_latency_05 + units: microseconds + every: 10s + info: average Pod Lifecycle Event Generator relisting latency over the last minute (quantile 0.5) + + template: kubelet_10s_pleg_relist_latency_quantile_05 + on: k8s_kubelet.kubelet_pleg_relist_latency_microseconds + class: Latency + type: Kubernetes +component: Kubelet + lookup: average -10s unaligned of kubelet_pleg_relist_latency_05 + calc: $this * 100 / (($kubelet_1m_pleg_relist_latency_quantile_05 < 1000)?(1000):($kubelet_1m_pleg_relist_latency_quantile_05)) + every: 10s + units: % + warn: $this > (($status >= $WARNING)?(100):(200)) + crit: $this > (($status >= $WARNING)?(200):(400)) + delay: down 1m multiplier 1.5 max 2h + info: ratio of average Pod Lifecycle Event Generator relisting latency over the last 10 seconds, \ + compared to the last minute (quantile 0.5) + to: sysadmin + +# quantile 0.9 + + template: kubelet_1m_pleg_relist_latency_quantile_09 + on: k8s_kubelet.kubelet_pleg_relist_latency_microseconds + class: Latency + type: Kubernetes +component: Kubelet + lookup: average -1m unaligned of kubelet_pleg_relist_latency_09 + units: microseconds + every: 10s + info: average Pod Lifecycle Event Generator relisting latency over the last minute (quantile 0.9) + + template: kubelet_10s_pleg_relist_latency_quantile_09 + on: k8s_kubelet.kubelet_pleg_relist_latency_microseconds + class: Latency + type: Kubernetes +component: Kubelet + lookup: average -10s unaligned of kubelet_pleg_relist_latency_09 + calc: $this * 100 / (($kubelet_1m_pleg_relist_latency_quantile_09 < 
1000)?(1000):($kubelet_1m_pleg_relist_latency_quantile_09)) + every: 10s + units: % + warn: $this > (($status >= $WARNING)?(200):(400)) + crit: $this > (($status >= $WARNING)?(400):(800)) + delay: down 1m multiplier 1.5 max 2h + info: ratio of average Pod Lifecycle Event Generator relisting latency over the last 10 seconds, \ + compared to the last minute (quantile 0.9) + to: sysadmin + +# quantile 0.99 + + template: kubelet_1m_pleg_relist_latency_quantile_099 + on: k8s_kubelet.kubelet_pleg_relist_latency_microseconds + class: Latency + type: Kubernetes +component: Kubelet + lookup: average -1m unaligned of kubelet_pleg_relist_latency_099 + units: microseconds + every: 10s + info: average Pod Lifecycle Event Generator relisting latency over the last minute (quantile 0.99) + + template: kubelet_10s_pleg_relist_latency_quantile_099 + on: k8s_kubelet.kubelet_pleg_relist_latency_microseconds + class: Latency + type: Kubernetes +component: Kubelet + lookup: average -10s unaligned of kubelet_pleg_relist_latency_099 + calc: $this * 100 / (($kubelet_1m_pleg_relist_latency_quantile_099 < 1000)?(1000):($kubelet_1m_pleg_relist_latency_quantile_099)) + every: 10s + units: % + warn: $this > (($status >= $WARNING)?(400):(800)) + crit: $this > (($status >= $WARNING)?(800):(1200)) + delay: down 1m multiplier 1.5 max 2h + info: ratio of average Pod Lifecycle Event Generator relisting latency over the last 10 seconds, \ + compared to the last minute (quantile 0.99) + to: sysadmin diff --git a/health/health.d/linux_power_supply.conf b/health/health.d/linux_power_supply.conf new file mode 100644 index 0000000..c0bc6de --- /dev/null +++ b/health/health.d/linux_power_supply.conf @@ -0,0 +1,15 @@ +# Alert on low battery capacity. 
+ + template: linux_power_supply_capacity + on: powersupply.capacity + class: Utilization + type: Power Supply +component: Battery + calc: $capacity + units: % + every: 10s + warn: $this < 10 + crit: $this < 5 + delay: up 30s down 5m multiplier 1.2 max 1h + info: percentage of remaining power supply capacity + to: sysadmin diff --git a/health/health.d/load.conf b/health/health.d/load.conf new file mode 100644 index 0000000..0bd872f --- /dev/null +++ b/health/health.d/load.conf @@ -0,0 +1,66 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + +# Calculate the base trigger point for the load average alarms. +# This is the maximum number of CPU's in the system over the past 1 +# minute, with a special case for a single CPU of setting the trigger at 2. + alarm: load_cpu_number + on: system.load + class: Utilization + type: System +component: Load + os: linux + hosts: * + calc: ($active_processors == nan or $active_processors == inf or $active_processors < 2) ? ( 2 ) : ( $active_processors ) + units: cpus + every: 1m + info: number of active CPU cores in the system + +# Send alarms if the load average is unusually high. +# These intentionally _do not_ calculate the average over the sampled +# time period because the values being checked already are averages. + + alarm: load_average_15 + on: system.load + class: Utilization + type: System +component: Load + os: linux + hosts: * + lookup: max -1m unaligned of load15 + units: load + every: 1m + warn: ($this * 100 / $load_cpu_number) > (($status >= $WARNING) ? 175 : 200) + delay: down 15m multiplier 1.5 max 1h + info: system fifteen-minute load average + to: sysadmin + + alarm: load_average_5 + on: system.load + class: Utilization + type: System +component: Load + os: linux + hosts: * + lookup: max -1m unaligned of load5 + units: load + every: 1m + warn: ($this * 100 / $load_cpu_number) > (($status >= $WARNING) ? 
350 : 400) + delay: down 15m multiplier 1.5 max 1h + info: system five-minute load average + to: sysadmin + + alarm: load_average_1 + on: system.load + class: Utilization + type: System +component: Load + os: linux + hosts: * + lookup: max -1m unaligned of load1 + units: load + every: 1m + warn: ($this * 100 / $load_cpu_number) > (($status >= $WARNING) ? 700 : 800) + delay: down 15m multiplier 1.5 max 1h + info: system one-minute load average + to: sysadmin diff --git a/health/health.d/mdstat.conf b/health/health.d/mdstat.conf new file mode 100644 index 0000000..cedaa00 --- /dev/null +++ b/health/health.d/mdstat.conf @@ -0,0 +1,52 @@ + template: mdstat_last_collected + on: md.disks + class: Latency + type: System +component: RAID + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + info: number of seconds since the last successful data collection + to: sysadmin + + template: mdstat_disks + on: md.disks + class: Errors + type: System +component: RAID + units: failed devices + every: 10s + calc: $down + crit: $this > 0 + info: number of devices in the down state for the $family array. \ + Any number > 0 indicates that the array is degraded. + to: sysadmin + + template: mdstat_mismatch_cnt + on: md.mismatch_cnt + class: Errors + type: System +component: RAID + families: !*(raid1) !*(raid10) * + units: unsynchronized blocks + calc: $count + every: 60s + warn: $this > 1024 + delay: up 30m + info: number of unsynchronized blocks for the $family array + to: sysadmin + + template: mdstat_nonredundant_last_collected + on: md.nonredundant + class: Latency + type: System +component: RAID + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? 
($update_every) : (60 * $update_every)) + info: number of seconds since the last successful data collection + to: sysadmin diff --git a/health/health.d/megacli.conf b/health/health.d/megacli.conf new file mode 100644 index 0000000..9fbcfdb --- /dev/null +++ b/health/health.d/megacli.conf @@ -0,0 +1,71 @@ + +## Adapters (controllers) + + template: megacli_adapter_state + on: megacli.adapter_degraded + class: Errors + type: System +component: RAID + lookup: max -10s foreach * + units: boolean + every: 10s + crit: $this > 0 + delay: down 5m multiplier 2 max 10m + info: adapter is in the degraded state (0: false, 1: true) + to: sysadmin + +## Physical Disks + + template: megacli_pd_predictive_failures + on: megacli.pd_predictive_failure + class: Errors + type: System +component: RAID + lookup: sum -10s foreach * + units: predictive failures + every: 10s + warn: $this > 0 + delay: up 1m down 5m multiplier 2 max 10m + info: number of physical drive predictive failures + to: sysadmin + + template: megacli_pd_media_errors + on: megacli.pd_media_error + class: Errors + type: System +component: RAID + lookup: sum -10s foreach * + units: media errors + every: 10s + warn: $this > 0 + delay: up 1m down 5m multiplier 2 max 10m + info: number of physical drive media errors + to: sysadmin + +## Battery Backup Units (BBU) + + template: megacli_bbu_relative_charge + on: megacli.bbu_relative_charge + class: Workload + type: System +component: RAID + lookup: average -10s + units: percent + every: 10s + warn: $this <= (($status >= $WARNING) ? (85) : (80)) + crit: $this <= (($status == $CRITICAL) ? 
(50) : (40)) + info: average battery backup unit (BBU) relative state of charge over the last 10 seconds + to: sysadmin + + template: megacli_bbu_cycle_count + on: megacli.bbu_cycle_count + class: Workload + type: System +component: RAID + lookup: average -10s + units: cycles + every: 10s + warn: $this >= 100 + crit: $this >= 500 + info: average battery backup unit (BBU) charge cycles count over the last 10 seconds + to: sysadmin diff --git a/health/health.d/memcached.conf b/health/health.d/memcached.conf new file mode 100644 index 0000000..2a2fe4b --- /dev/null +++ b/health/health.d/memcached.conf @@ -0,0 +1,48 @@ + +# detect if memcached cache is full + + template: memcached_cache_memory_usage + on: memcached.cache + class: Utilization + type: KV Storage +component: Memcached + calc: $used * 100 / ($used + $available) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (70) : (80)) + crit: $this > (($status == $CRITICAL) ? (80) : (90)) + delay: up 0 down 15m multiplier 1.5 max 1h + info: cache memory utilization + to: dba + + +# find the rate memcached cache is filling + + template: memcached_cache_fill_rate + on: memcached.cache + class: Utilization + type: KV Storage +component: Memcached + lookup: min -10m at -50m unaligned of available + calc: ($this - $available) / (($now - $after) / 3600) + units: KB/hour + every: 1m + info: average rate the cache fills up (positive), or frees up (negative) space over the last hour + + +# find the hours remaining until memcached cache is full + + template: memcached_out_of_cache_space_time + on: memcached.cache + class: Utilization + type: KV Storage +component: Memcached + calc: ($memcached_cache_fill_rate > 0) ? ($available / $memcached_cache_fill_rate) : (inf) + units: hours + every: 10s + warn: $this > 0 and $this < (($status >= $WARNING) ? (48) : (8)) + crit: $this > 0 and $this < (($status == $CRITICAL) ? 
(24) : (2)) + delay: down 15m multiplier 1.5 max 1h + info: estimated time the cache will run out of space \ + if the system continues to add data at the same rate as the past hour + to: dba diff --git a/health/health.d/memory.conf b/health/health.d/memory.conf new file mode 100644 index 0000000..010cbbd --- /dev/null +++ b/health/health.d/memory.conf @@ -0,0 +1,47 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + + alarm: 1hour_ecc_memory_correctable + on: mem.ecc_ce + class: Errors + type: System +component: Memory + os: linux + hosts: * + lookup: sum -10m unaligned + units: errors + every: 1m + warn: $this > 0 + delay: down 1h multiplier 1.5 max 1h + info: number of ECC correctable errors in the last 10 minutes + to: sysadmin + + alarm: 1hour_ecc_memory_uncorrectable + on: mem.ecc_ue + class: Errors + type: System +component: Memory + os: linux + hosts: * + lookup: sum -10m unaligned + units: errors + every: 1m + crit: $this > 0 + delay: down 1h multiplier 1.5 max 1h + info: number of ECC uncorrectable errors in the last 10 minutes + to: sysadmin + + alarm: 1hour_memory_hw_corrupted + on: mem.hwcorrupt + class: Errors + type: System +component: Memory + os: linux + hosts: * + calc: $HardwareCorrupted + units: MB + every: 10s + warn: $this > 0 + delay: down 1h multiplier 1.5 max 1h + info: amount of memory corrupted due to a hardware failure + to: sysadmin diff --git a/health/health.d/ml.conf b/health/health.d/ml.conf new file mode 100644 index 0000000..6836ce7 --- /dev/null +++ b/health/health.d/ml.conf @@ -0,0 +1,53 @@ +# below are some examples of using the `anomaly-bit` option to define alerts based on anomaly +# rates as opposed to raw metric values. You can read more about the anomaly-bit and Netdata's +# native anomaly detection here: +# https://learn.netdata.cloud/docs/agent/ml#anomaly-bit---100--anomalous-0--normal + +# examples below are commented, you would need to uncomment and adjust as desired to enable them. 
+ +# node level anomaly rate example +# https://learn.netdata.cloud/docs/agent/ml#node-anomaly-rate +# if node level anomaly rate is between 1-5% then warning (pick your own threshold that works best via trial and error). +# if node level anomaly rate is above 5% then critical (pick your own threshold that works best via trial and error). +# template: ml_1min_node_ar +# on: anomaly_detection.anomaly_rate +# os: linux +# hosts: * +# lookup: average -1m foreach anomaly_rate +# calc: $this +# units: % +# every: 30s +# warn: $this > (($status >= $WARNING) ? (1) : (5)) +# crit: $this > (($status == $CRITICAL) ? (5) : (100)) +# info: rolling 1min node level anomaly rate + +# alert per dimension example +# if anomaly rate is between 5-20% then warning (pick your own threshold that works best via trial and error). +# if anomaly rate is above 20% then critical (pick your own threshold that works best via trial and error). +# template: ml_5min_cpu_dims +# on: system.cpu +# os: linux +# hosts: * +# lookup: average -5m anomaly-bit foreach * +# calc: $this +# units: % +# every: 30s +# warn: $this > (($status >= $WARNING) ? (5) : (20)) +# crit: $this > (($status == $CRITICAL) ? (20) : (100)) +# info: rolling 5min anomaly rate for each system.cpu dimension + +# alert per chart example +# if anomaly rate is between 5-20% then warning (pick your own threshold that works best via trial and error). +# if anomaly rate is above 20% then critical (pick your own threshold that works best via trial and error). +# template: ml_5min_cpu_chart +# on: system.cpu +# os: linux +# hosts: * +# lookup: average -5m anomaly-bit of * +# calc: $this +# units: % +# every: 30s +# warn: $this > (($status >= $WARNING) ? (5) : (20)) +# crit: $this > (($status == $CRITICAL) ? 
(20) : (100)) +# info: rolling 5min anomaly rate for system.cpu chart + diff --git a/health/health.d/mysql.conf b/health/health.d/mysql.conf new file mode 100644 index 0000000..3941c71 --- /dev/null +++ b/health/health.d/mysql.conf @@ -0,0 +1,176 @@ + +# slow queries + + template: mysql_10s_slow_queries + on: mysql.queries + class: Latency + type: Database +component: MySQL + lookup: sum -10s of slow_queries + units: slow queries + every: 10s + warn: $this > (($status >= $WARNING) ? (5) : (10)) + crit: $this > (($status == $CRITICAL) ? (10) : (20)) + delay: down 5m multiplier 1.5 max 1h + info: number of slow queries in the last 10 seconds + to: dba + + +# ----------------------------------------------------------------------------- +# lock waits + + template: mysql_10s_table_locks_immediate + on: mysql.table_locks + class: Utilization + type: Database +component: MySQL + lookup: sum -10s absolute of immediate + units: immediate locks + every: 10s + info: number of table immediate locks in the last 10 seconds + to: dba + + template: mysql_10s_table_locks_waited + on: mysql.table_locks + class: Latency + type: Database +component: MySQL + lookup: sum -10s absolute of waited + units: waited locks + every: 10s + info: number of table waited locks in the last 10 seconds + to: dba + + template: mysql_10s_waited_locks_ratio + on: mysql.table_locks + class: Latency + type: Database +component: MySQL + calc: ( ($mysql_10s_table_locks_waited + $mysql_10s_table_locks_immediate) > 0 ) ? (($mysql_10s_table_locks_waited * 100) / ($mysql_10s_table_locks_waited + $mysql_10s_table_locks_immediate)) : 0 + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (10) : (25)) + crit: $this > (($status == $CRITICAL) ? 
(25) : (50)) + delay: down 30m multiplier 1.5 max 1h + info: ratio of waited table locks over the last 10 seconds + to: dba + + +# ----------------------------------------------------------------------------- +# connections + + template: mysql_connections + on: mysql.connections_active + class: Utilization + type: Database +component: MySQL + calc: $active * 100 / $limit + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (60) : (70)) + crit: $this > (($status == $CRITICAL) ? (80) : (90)) + delay: down 15m multiplier 1.5 max 1h + info: client connections utilization + to: dba + + +# ----------------------------------------------------------------------------- +# replication + + template: mysql_replication + on: mysql.slave_status + class: Errors + type: Database +component: MySQL + calc: ($sql_running <= 0 OR $io_running <= 0)?0:1 + units: ok/failed + every: 10s + crit: $this == 0 + delay: down 5m multiplier 1.5 max 1h + info: replication status (0: stopped, 1: working) + to: dba + + template: mysql_replication_lag + on: mysql.slave_behind + class: Latency + type: Database +component: MySQL + calc: $seconds + units: seconds + every: 10s + warn: $this > (($status >= $WARNING) ? (5) : (10)) + crit: $this > (($status == $CRITICAL) ? 
(10) : (30)) + delay: down 15m multiplier 1.5 max 1h + info: difference between the timestamp of the latest transaction processed by the SQL thread and \ + the timestamp of the same transaction when it was processed on the master + to: dba + + +# ----------------------------------------------------------------------------- +# galera cluster size + + template: mysql_galera_cluster_size_max_2m + on: mysql.galera_cluster_size + class: Utilization + type: Database +component: MySQL + lookup: max -2m at -1m unaligned + units: nodes + every: 10s + info: maximum galera cluster size in the last 2 minutes starting one minute ago + to: dba + + template: mysql_galera_cluster_size + on: mysql.galera_cluster_size + class: Utilization + type: Database +component: MySQL + calc: $nodes + units: nodes + every: 10s + warn: $this > $mysql_galera_cluster_size_max_2m + crit: $this < $mysql_galera_cluster_size_max_2m + delay: up 20s down 5m multiplier 1.5 max 1h + info: current galera cluster size, compared to the maximum size in the last 2 minutes + to: dba + +# galera node state + + template: mysql_galera_cluster_state_warn + on: mysql.galera_cluster_state + class: Errors + type: Database +component: MySQL + calc: $donor + $joined + every: 10s + warn: $this != nan AND $this != 0 + delay: up 30s down 5m multiplier 1.5 max 1h + info: galera node state is either Donor/Desynced or Joined. + to: dba + + template: mysql_galera_cluster_state_crit + on: mysql.galera_cluster_state + class: Errors + type: Database +component: MySQL + calc: $undefined + $joining + $error + every: 10s + crit: $this != nan AND $this != 0 + delay: up 30s down 5m multiplier 1.5 max 1h + info: galera node state is either Undefined or Joining or Error. 
+ to: dba + +# galera node status + + template: mysql_galera_cluster_status + on: mysql.galera_cluster_status + class: Errors + type: Database +component: MySQL + calc: $primary + every: 10s + crit: $this != nan AND $this != 1 + delay: up 30s down 5m multiplier 1.5 max 1h + info: galera node is part of a nonoperational component. \ + This occurs in cases of multiple membership changes that result in a loss of Quorum or in cases of split-brain situations. + to: dba diff --git a/health/health.d/net.conf b/health/health.d/net.conf new file mode 100644 index 0000000..9d5b3b8 --- /dev/null +++ b/health/health.d/net.conf @@ -0,0 +1,256 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + +# ----------------------------------------------------------------------------- +# net traffic overflow + + template: interface_speed + on: net.net + class: Latency + type: System +component: Network + os: * + hosts: * + families: * + calc: ( $nic_speed_max > 0 ) ? ( $nic_speed_max) : ( nan ) + units: Mbit + every: 10s + info: network interface $family current speed + + template: 1m_received_traffic_overflow + on: net.net + class: Workload + type: System +component: Network + os: linux + hosts: * + families: * + lookup: average -1m unaligned absolute of received + calc: ($interface_speed > 0) ? ($this * 100 / ($interface_speed)) : ( nan ) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (85) : (90)) + delay: up 1m down 1m multiplier 1.5 max 1h + info: average inbound utilization for the network interface $family over the last minute + to: sysadmin + + template: 1m_sent_traffic_overflow + on: net.net + class: Workload + type: System +component: Network + os: linux + hosts: * + families: * + lookup: average -1m unaligned absolute of sent + calc: ($interface_speed > 0) ? ($this * 100 / ($interface_speed)) : ( nan ) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? 
(85) : (90)) + delay: up 1m down 1m multiplier 1.5 max 1h + info: average outbound utilization for the network interface $family over the last minute + to: sysadmin + +# ----------------------------------------------------------------------------- +# dropped packets + +# check if an interface is dropping packets +# the alarm is checked every 1 minute +# and examines the last 10 minutes of data +# +# it is possible to have expected packet drops on an interface for some network configurations +# look at the Monitoring Network Interfaces section in the proc.plugin documentation for more information + + template: inbound_packets_dropped + on: net.drops + class: Errors + type: System +component: Network + os: linux + hosts: * + families: * + lookup: sum -10m unaligned absolute of inbound + units: packets + every: 1m + info: number of inbound dropped packets for the network interface $family in the last 10 minutes + + template: outbound_packets_dropped + on: net.drops + class: Errors + type: System +component: Network + os: linux + hosts: * + families: * + lookup: sum -10m unaligned absolute of outbound + units: packets + every: 1m + info: number of outbound dropped packets for the network interface $family in the last 10 minutes + + template: inbound_packets_dropped_ratio + on: net.packets + class: Errors + type: System +component: Network + os: linux + hosts: * + families: !wl* * + lookup: sum -10m unaligned absolute of received + calc: (($inbound_packets_dropped != nan AND $this > 10000) ? 
($inbound_packets_dropped * 100 / $this) : (0)) + units: % + every: 1m + warn: $this >= 2 + delay: up 1m down 1h multiplier 1.5 max 2h + info: ratio of inbound dropped packets for the network interface $family over the last 10 minutes + to: sysadmin + + template: outbound_packets_dropped_ratio + on: net.packets + class: Errors + type: System +component: Network + os: linux + hosts: * + families: !wl* * + lookup: sum -10m unaligned absolute of sent + calc: (($outbound_packets_dropped != nan AND $this > 1000) ? ($outbound_packets_dropped * 100 / $this) : (0)) + units: % + every: 1m + warn: $this >= 2 + delay: up 1m down 1h multiplier 1.5 max 2h + info: ratio of outbound dropped packets for the network interface $family over the last 10 minutes + to: sysadmin + + template: wifi_inbound_packets_dropped_ratio + on: net.packets + class: Errors + type: System +component: Network + os: linux + hosts: * + families: wl* + lookup: sum -10m unaligned absolute of received + calc: (($inbound_packets_dropped != nan AND $this > 10000) ? ($inbound_packets_dropped * 100 / $this) : (0)) + units: % + every: 1m + warn: $this >= 10 + delay: up 1m down 1h multiplier 1.5 max 2h + info: ratio of inbound dropped packets for the network interface $family over the last 10 minutes + to: sysadmin + + template: wifi_outbound_packets_dropped_ratio + on: net.packets + class: Errors + type: System +component: Network + os: linux + hosts: * + families: wl* + lookup: sum -10m unaligned absolute of sent + calc: (($outbound_packets_dropped != nan AND $this > 1000) ? 
($outbound_packets_dropped * 100 / $this) : (0)) + units: % + every: 1m + warn: $this >= 10 + delay: up 1m down 1h multiplier 1.5 max 2h + info: ratio of outbound dropped packets for the network interface $family over the last 10 minutes + to: sysadmin + +# ----------------------------------------------------------------------------- +# interface errors + + template: interface_inbound_errors + on: net.errors + class: Errors + type: System +component: Network + os: freebsd + hosts: * + families: * + lookup: sum -10m unaligned absolute of inbound + units: errors + every: 1m + warn: $this >= 5 + delay: down 1h multiplier 1.5 max 2h + info: number of inbound errors for the network interface $family in the last 10 minutes + to: sysadmin + + template: interface_outbound_errors + on: net.errors + class: Errors + type: System +component: Network + os: freebsd + hosts: * + families: * + lookup: sum -10m unaligned absolute of outbound + units: errors + every: 1m + warn: $this >= 5 + delay: down 1h multiplier 1.5 max 2h + info: number of outbound errors for the network interface $family in the last 10 minutes + to: sysadmin + +# ----------------------------------------------------------------------------- +# FIFO errors + +# check if an interface is having FIFO +# buffer errors +# the alarm is checked every 1 minute +# and examines the last 10 minutes of data + + template: 10min_fifo_errors + on: net.fifo + class: Errors + type: System +component: Network + os: linux + hosts: * + families: * + lookup: sum -10m unaligned absolute + units: errors + every: 1m + warn: $this > 0 + delay: down 1h multiplier 1.5 max 2h + info: number of FIFO errors for the network interface $family in the last 10 minutes + to: sysadmin + +# ----------------------------------------------------------------------------- +# check for packet storms + +# 1. calculate the rate packets are received in 1m: 1m_received_packets_rate +# 2. do the same for the last 10s +# 3. 
raise an alarm if the latter is 10x or 20x the first +# we assume the minimum packet storm should at least have +# 10000 packets/s, average of the last 10 seconds + + template: 1m_received_packets_rate + on: net.packets + class: Workload + type: System +component: Network + os: linux freebsd + hosts: * + families: * + lookup: average -1m unaligned of received + units: packets + every: 10s + info: average number of packets received by the network interface $family over the last minute + + template: 10s_received_packets_storm + on: net.packets + class: Workload + type: System +component: Network + os: linux freebsd + hosts: * + families: * + lookup: average -10s unaligned of received + calc: $this * 100 / (($1m_received_packets_rate < 1000)?(1000):($1m_received_packets_rate)) + every: 10s + units: % + warn: $this > (($status >= $WARNING)?(200):(5000)) + crit: $this > (($status == $CRITICAL)?(5000):(6000)) + options: no-clear-notification + info: ratio of average number of received packets for the network interface $family over the last 10 seconds, \ + compared to the rate over the last minute + to: sysadmin diff --git a/health/health.d/netfilter.conf b/health/health.d/netfilter.conf new file mode 100644 index 0000000..7de383f --- /dev/null +++ b/health/health.d/netfilter.conf @@ -0,0 +1,19 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + + alarm: netfilter_conntrack_full + on: netfilter.conntrack_sockets + class: Workload + type: System +component: Network + os: linux + hosts: * + lookup: max -10s unaligned of connections + calc: $this * 100 / $netfilter_conntrack_max + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (85) : (90)) + crit: $this > (($status == $CRITICAL) ? 
(90) : (95)) + delay: down 5m multiplier 1.5 max 1h + info: netfilter connection tracker table size utilization + to: sysadmin diff --git a/health/health.d/nut.conf b/health/health.d/nut.conf new file mode 100644 index 0000000..6231dd9 --- /dev/null +++ b/health/health.d/nut.conf @@ -0,0 +1,47 @@ +# you can disable an alarm notification by setting the 'to' line to: silent + + template: nut_10min_ups_load + on: nut.load + class: Utilization + type: Power Supply +component: UPS + os: * + hosts: * + lookup: average -10m unaligned of load + units: % + every: 1m + warn: $this > (($status >= $WARNING) ? (70) : (80)) + crit: $this > (($status == $CRITICAL) ? (85) : (95)) + delay: down 10m multiplier 1.5 max 1h + info: average UPS load over the last 10 minutes + to: sitemgr + + template: nut_ups_charge + on: nut.charge + class: Errors + type: Power Supply +component: UPS + os: * + hosts: * + lookup: average -60s unaligned of battery_charge + units: % + every: 60s + warn: $this < 100 + crit: $this < (($status == $CRITICAL) ? (60) : (50)) + delay: down 10m multiplier 1.5 max 1h + info: average UPS charge over the last minute + to: sitemgr + + template: nut_last_collected_secs + on: nut.load + class: Latency + type: Power Supply +component: UPS device + calc: $now - $last_collected_t + every: 10s + units: seconds ago + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? 
($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: sitemgr diff --git a/health/health.d/nvme.conf b/health/health.d/nvme.conf new file mode 100644 index 0000000..5f729d5 --- /dev/null +++ b/health/health.d/nvme.conf @@ -0,0 +1,15 @@ +# you can disable an alarm notification by setting the 'to' line to: silent + + template: nvme_device_critical_warnings_state + families: * + on: nvme.device_critical_warnings_state + class: Errors + type: System +component: Disk + lookup: max -30s unaligned + units: state + every: 10s + crit: $this != nan AND $this != 0 + delay: down 5m multiplier 1.5 max 2h + info: NVMe device $label:device has critical warnings + to: sysadmin diff --git a/health/health.d/pihole.conf b/health/health.d/pihole.conf new file mode 100644 index 0000000..ee6c57c --- /dev/null +++ b/health/health.d/pihole.conf @@ -0,0 +1,32 @@ + +# Blocklist last update time. +# Default update interval is a week. + + template: pihole_blocklist_last_update + on: pihole.blocklist_last_update + class: Errors + type: Ad Filtering +component: Pi-hole + every: 10s + units: seconds + calc: $ago + warn: $this > 60 * 60 * 24 * 8 + crit: $this > 60 * 60 * 24 * 8 * 2 + info: gravity.list (blocklist) file last update time + to: sysadmin + +# Pi-hole's ability to block unwanted domains. +# Should be enabled. The whole point of Pi-hole! 
+ + template: pihole_status + on: pihole.unwanted_domains_blocking_status + class: Errors + type: Ad Filtering +component: Pi-hole + every: 10s + units: status + calc: $disabled + warn: $this != nan AND $this == 1 + delay: up 2m down 5m + info: unwanted domains blocking is disabled + to: sysadmin diff --git a/health/health.d/ping.conf b/health/health.d/ping.conf new file mode 100644 index 0000000..cbe7c30 --- /dev/null +++ b/health/health.d/ping.conf @@ -0,0 +1,50 @@ +# you can disable an alarm notification by setting the 'to' line to: silent + + template: ping_host_reachable + families: * + on: ping.host_packet_loss + class: Errors + type: Other +component: Network + lookup: average -30s unaligned of loss + calc: $this != nan AND $this < 100 + units: up/down + every: 10s + crit: $this == 0 + delay: down 30m multiplier 1.5 max 2h + info: network host $label:host reachability status + to: sysadmin + + template: ping_packet_loss + families: * + on: ping.host_packet_loss + class: Errors + type: Other +component: Network + lookup: average -10m unaligned of loss + green: 5 + red: 10 + units: % + every: 10s + warn: $this > $green + crit: $this > $red + delay: down 30m multiplier 1.5 max 2h + info: packet loss percentage to the network host $label:host over the last 10 minutes + to: sysadmin + + template: ping_host_latency + families: * + on: ping.host_rtt + class: Latency + type: Other +component: Network + lookup: average -10s unaligned of avg + units: ms + every: 10s + green: 500 + red: 1000 + warn: $this > $green OR $max > $red + crit: $this > $red + delay: down 30m multiplier 1.5 max 2h + info: average latency to the network host $label:host over the last 10 seconds + to: sysadmin diff --git a/health/health.d/portcheck.conf b/health/health.d/portcheck.conf new file mode 100644 index 0000000..8cbd772 --- /dev/null +++ b/health/health.d/portcheck.conf @@ -0,0 +1,44 @@ + +# This is a fast-reacting no-notification alarm ideal for custom dashboards or badges + template: 
portcheck_service_reachable + families: * + on: portcheck.status + class: Workload + type: Other +component: TCP endpoint + lookup: average -1m unaligned percentage of success + calc: ($this < 75) ? (0) : ($this) + every: 5s + units: up/down + info: average ratio of successful connections over the last minute (at least 75%) + to: silent + + template: portcheck_connection_timeouts + families: * + on: portcheck.status + class: Errors + type: Other +component: TCP endpoint + lookup: average -5m unaligned percentage of timeout + every: 10s + units: % + warn: $this >= 10 AND $this < 40 + crit: $this >= 40 + delay: down 5m multiplier 1.5 max 1h + info: average ratio of timeouts over the last 5 minutes + to: sysadmin + + template: portcheck_connection_fails + families: * + on: portcheck.status + class: Errors + type: Other +component: TCP endpoint + lookup: average -5m unaligned percentage of no_connection,failed + every: 10s + units: % + warn: $this >= 10 AND $this < 40 + crit: $this >= 40 + delay: down 5m multiplier 1.5 max 1h + info: average ratio of failed connections over the last 5 minutes + to: sysadmin diff --git a/health/health.d/postgres.conf b/health/health.d/postgres.conf new file mode 100644 index 0000000..66d034c --- /dev/null +++ b/health/health.d/postgres.conf @@ -0,0 +1,214 @@ +# you can disable an alarm notification by setting the 'to' line to: silent + + template: postgres_total_connection_utilization + on: postgres.connections_utilization + class: Utilization + type: Database +component: PostgreSQL + hosts: * + lookup: average -1m unaligned of used + units: % + every: 1m + warn: $this > (($status >= $WARNING) ? (70) : (80)) + crit: $this > (($status == $CRITICAL) ? 
(80) : (90)) + delay: down 15m multiplier 1.5 max 1h + info: average total connection utilization over the last minute + to: dba + + template: postgres_acquired_locks_utilization + on: postgres.locks_utilization + class: Utilization + type: Database +component: PostgreSQL + hosts: * + lookup: average -1m unaligned of used + units: % + every: 1m + warn: $this > (($status >= $WARNING) ? (15) : (20)) + delay: down 15m multiplier 1.5 max 1h + info: average acquired locks utilization over the last minute + to: dba + + template: postgres_txid_exhaustion_perc + on: postgres.txid_exhaustion_perc + class: Utilization + type: Database +component: PostgreSQL + hosts: * + calc: $txid_exhaustion + units: % + every: 1m + warn: $this > 90 + delay: down 15m multiplier 1.5 max 1h + info: percent towards TXID wraparound + to: dba + +# Database alarms + + template: postgres_db_cache_io_ratio + on: postgres.db_cache_io_ratio + class: Workload + type: Database +component: PostgreSQL + hosts: * + lookup: average -1m unaligned of miss + calc: 100 - $this + units: % + every: 1m + warn: $this < (($status >= $WARNING) ? (70) : (60)) + crit: $this < (($status == $CRITICAL) ? (60) : (50)) + delay: down 15m multiplier 1.5 max 1h + info: average cache hit ratio in db $label:database over the last minute + to: dba + + template: postgres_db_transactions_rollback_ratio + on: postgres.db_transactions_ratio + class: Workload + type: Database +component: PostgreSQL + hosts: * + lookup: average -5m unaligned of rollback + units: % + every: 1m + warn: $this > (($status >= $WARNING) ? (0) : (2)) + delay: down 15m multiplier 1.5 max 1h + info: average aborted transactions percentage in db $label:database over the last five minutes + to: dba + + template: postgres_db_deadlocks_rate + on: postgres.db_deadlocks_rate + class: Errors + type: Database +component: PostgreSQL + hosts: * + lookup: sum -1m unaligned of deadlocks + units: deadlocks + every: 1m + warn: $this > (($status >= $WARNING) ? 
(0) : (10)) + delay: down 15m multiplier 1.5 max 1h + info: number of deadlocks detected in db $label:database in the last minute + to: dba + +# Table alarms + + template: postgres_table_cache_io_ratio + on: postgres.table_cache_io_ratio + class: Workload + type: Database +component: PostgreSQL + hosts: * + lookup: average -1m unaligned of miss + calc: 100 - $this + units: % + every: 1m + warn: $this < (($status >= $WARNING) ? (70) : (60)) + crit: $this < (($status == $CRITICAL) ? (60) : (50)) + delay: down 15m multiplier 1.5 max 1h + info: average cache hit ratio in db $label:database table $label:table over the last minute + to: dba + + template: postgres_table_index_cache_io_ratio + on: postgres.table_index_cache_io_ratio + class: Workload + type: Database +component: PostgreSQL + hosts: * + lookup: average -1m unaligned of miss + calc: 100 - $this + units: % + every: 1m + warn: $this < (($status >= $WARNING) ? (70) : (60)) + crit: $this < (($status == $CRITICAL) ? (60) : (50)) + delay: down 15m multiplier 1.5 max 1h + info: average index cache hit ratio in db $label:database table $label:table over the last minute + to: dba + + template: postgres_table_toast_cache_io_ratio + on: postgres.table_toast_cache_io_ratio + class: Workload + type: Database +component: PostgreSQL + hosts: * + lookup: average -1m unaligned of miss + calc: 100 - $this + units: % + every: 1m + warn: $this < (($status >= $WARNING) ? (70) : (60)) + crit: $this < (($status == $CRITICAL) ? (60) : (50)) + delay: down 15m multiplier 1.5 max 1h + info: average TOAST hit ratio in db $label:database table $label:table over the last minute + to: dba + + template: postgres_table_toast_index_cache_io_ratio + on: postgres.table_toast_index_cache_io_ratio + class: Workload + type: Database +component: PostgreSQL + hosts: * + lookup: average -1m unaligned of miss + calc: 100 - $this + units: % + every: 1m + warn: $this < (($status >= $WARNING) ? (70) : (60)) + crit: $this < (($status == $CRITICAL) ? 
(60) : (50)) + delay: down 15m multiplier 1.5 max 1h + info: average index TOAST hit ratio in db $label:database table $label:table over the last minute + to: dba + + template: postgres_table_bloat_size_perc + on: postgres.table_bloat_size_perc + class: Errors + type: Database +component: PostgreSQL + hosts: * + calc: $bloat + units: % + every: 1m + warn: $this > (($status >= $WARNING) ? (60) : (70)) + crit: $this > (($status == $CRITICAL) ? (70) : (80)) + delay: down 15m multiplier 1.5 max 1h + info: bloat size percentage in db $label:database table $label:table + to: dba + + template: postgres_table_last_autovacuum_time + on: postgres.table_autovacuum_since_time + class: Errors + type: Database +component: PostgreSQL + hosts: !* + calc: $time + units: seconds + every: 1m + warn: $this != nan AND $this > (60 * 60 * 24 * 7) + info: time elapsed since db $label:database table $label:table was vacuumed by the autovacuum daemon + to: dba + + template: postgres_table_last_autoanalyze_time + on: postgres.table_autoanalyze_since_time + class: Errors + type: Database +component: PostgreSQL + hosts: !* + calc: $time + units: seconds + every: 1m + warn: $this != nan AND $this > (60 * 60 * 24 * 7) + info: time elapsed since db $label:database table $label:table was analyzed by the autovacuum daemon + to: dba + +# Index alarms + + template: postgres_index_bloat_size_perc + on: postgres.index_bloat_size_perc + class: Errors + type: Database +component: PostgreSQL + hosts: * + calc: $bloat + units: % + every: 1m + warn: $this > (($status >= $WARNING) ? (60) : (70)) + crit: $this > (($status == $CRITICAL) ? 
(70) : (80)) + delay: down 15m multiplier 1.5 max 1h + info: bloat size percentage in db $label:database table $label:table index $label:index + to: dba diff --git a/health/health.d/processes.conf b/health/health.d/processes.conf new file mode 100644 index 0000000..2929ee3 --- /dev/null +++ b/health/health.d/processes.conf @@ -0,0 +1,16 @@ +# you can disable an alarm notification by setting the 'to' line to: silent + + alarm: active_processes + on: system.active_processes + class: Workload + type: System +component: Processes + hosts: * + calc: $active * 100 / $pidmax + units: % + every: 5s + warn: $this > (($status >= $WARNING) ? (85) : (90)) + crit: $this > (($status == $CRITICAL) ? (90) : (95)) + delay: down 5m multiplier 1.5 max 1h + info: system process IDs (PID) space utilization + to: sysadmin diff --git a/health/health.d/python.d.plugin.conf b/health/health.d/python.d.plugin.conf new file mode 100644 index 0000000..0e81a48 --- /dev/null +++ b/health/health.d/python.d.plugin.conf @@ -0,0 +1,17 @@ + +# make sure python.d.plugin data collection job is running + + template: python.d_job_last_collected_secs + on: netdata.pythond_runtime + class: Errors + type: Netdata +component: python.d.plugin + module: !* * + calc: $now - $last_collected_t + units: seconds ago + every: 10s + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? 
($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + info: number of seconds since the last successful data collection + to: webmaster diff --git a/health/health.d/qos.conf b/health/health.d/qos.conf new file mode 100644 index 0000000..7290d15 --- /dev/null +++ b/health/health.d/qos.conf @@ -0,0 +1,18 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + +# check if a QoS class is dropping packets +# the alarm is checked every 30 seconds +# and examines the last 10 minutes of data + +#template: 10min_qos_packet_drops +# on: tc.qos_dropped +# os: linux +# hosts: * +# lookup: sum -10m unaligned absolute +# every: 30s +# warn: $this > 0 +# delay: up 0 down 30m multiplier 1.5 max 1h +# units: packets +# info: dropped packets in the last 10 minutes +# to: sysadmin diff --git a/health/health.d/ram.conf b/health/health.d/ram.conf new file mode 100644 index 0000000..ab382c4 --- /dev/null +++ b/health/health.d/ram.conf @@ -0,0 +1,80 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + + alarm: ram_in_use + on: system.ram + class: Utilization + type: System +component: Memory + os: linux + hosts: * + calc: $used * 100 / ($used + $cached + $free + $buffers) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (80) : (90)) + crit: $this > (($status == $CRITICAL) ? (90) : (98)) + delay: down 15m multiplier 1.5 max 1h + info: system memory utilization + to: sysadmin + + alarm: ram_available + on: mem.available + class: Utilization + type: System +component: Memory + os: linux + hosts: * + calc: $avail * 100 / ($system.ram.used + $system.ram.cached + $system.ram.free + $system.ram.buffers) + units: % + every: 10s + warn: $this < (($status >= $WARNING) ? (15) : (10)) + crit: $this < (($status == $CRITICAL) ? 
(10) : ( 5)) + delay: down 15m multiplier 1.5 max 1h + info: percentage of estimated amount of RAM available for userspace processes, without causing swapping + to: sysadmin + + alarm: oom_kill + on: mem.oom_kill + os: linux + hosts: * + lookup: sum -30m unaligned + units: kills + every: 5m + warn: $this > 0 + delay: down 10m +host labels: _is_k8s_node = false + info: number of out of memory kills in the last 30 minutes + to: sysadmin + +## FreeBSD + alarm: ram_in_use + on: system.ram + class: Utilization + type: System +component: Memory + os: freebsd + hosts: * + calc: ($active + $wired + $laundry + $buffers) * 100 / ($active + $wired + $laundry + $buffers + $cache + $free + $inactive) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (80) : (90)) + crit: $this > (($status == $CRITICAL) ? (90) : (98)) + delay: down 15m multiplier 1.5 max 1h + info: system memory utilization + to: sysadmin + + alarm: ram_available + on: mem.available + class: Utilization + type: System +component: Memory + os: freebsd + hosts: * + calc: $avail * 100 / ($system.ram.free + $system.ram.active + $system.ram.inactive + $system.ram.wired + $system.ram.cache + $system.ram.laundry + $system.ram.buffers) + units: % + every: 10s + warn: $this < (($status >= $WARNING) ? (15) : (10)) + crit: $this < (($status == $CRITICAL) ? 
(10) : ( 5)) + delay: down 15m multiplier 1.5 max 1h + info: percentage of estimated amount of RAM available for userspace processes, without causing swapping + to: sysadmin diff --git a/health/health.d/redis.conf b/health/health.d/redis.conf new file mode 100644 index 0000000..34d00b5 --- /dev/null +++ b/health/health.d/redis.conf @@ -0,0 +1,57 @@ +# you can disable an alarm notification by setting the 'to' line to: silent + + template: redis_connections_rejected + families: * + on: redis.connections + class: Errors + type: KV Storage +component: Redis + lookup: sum -1m unaligned of rejected + every: 10s + units: connections + warn: $this > 0 + info: connections rejected because of maxclients limit in the last minute + delay: down 5m multiplier 1.5 max 1h + to: dba + + template: redis_bgsave_broken + families: * + on: redis.bgsave_health + class: Errors + type: KV Storage +component: Redis + every: 10s + crit: $last_bgsave != nan AND $last_bgsave != 0 + units: ok/failed + info: status of the last RDB save operation (0: ok, 1: error) + delay: down 5m multiplier 1.5 max 1h + to: dba + + template: redis_bgsave_slow + families: * + on: redis.bgsave_now + class: Latency + type: KV Storage +component: Redis + every: 10s + calc: $current_bgsave_time + warn: $this > 600 + crit: $this > 1200 + units: seconds + info: duration of the on-going RDB save operation + delay: down 5m multiplier 1.5 max 1h + to: dba + + template: redis_master_link_down + families: * + on: redis.master_link_down_since_time + class: Errors + type: KV Storage +component: Redis + every: 10s + calc: $time + units: seconds + crit: $this != nan AND $this > 0 + info: time elapsed since the link between master and slave is down + delay: down 5m multiplier 1.5 max 1h + to: dba diff --git a/health/health.d/retroshare.conf b/health/health.d/retroshare.conf new file mode 100644 index 0000000..14aa76b --- /dev/null +++ b/health/health.d/retroshare.conf @@ -0,0 +1,16 @@ + +# make sure the DHT is fine when active 
+ + template: retroshare_dht_working + on: retroshare.dht + class: Utilization + type: Data Sharing +component: Retroshare + calc: $dht_size_all + units: peers + every: 1m + warn: $this < (($status >= $WARNING) ? (120) : (100)) + crit: $this < (($status == $CRITICAL) ? (10) : (1)) + delay: up 0 down 15m multiplier 1.5 max 1h + info: number of DHT peers + to: sysadmin diff --git a/health/health.d/riakkv.conf b/health/health.d/riakkv.conf new file mode 100644 index 0000000..261fd48 --- /dev/null +++ b/health/health.d/riakkv.conf @@ -0,0 +1,93 @@ + +# Warn if a list keys operation is running. + template: riakkv_list_keys_active + on: riak.core.fsm_active + class: Utilization + type: Database +component: Riak KV + calc: $list_fsm_active + units: state machines + every: 10s + warn: $list_fsm_active > 0 + info: number of currently running list keys finite state machines + to: dba + + +## Timing healthchecks +# KV GET + template: riakkv_1h_kv_get_mean_latency + on: riak.kv.latency.get + class: Latency + type: Database +component: Riak KV + calc: $node_get_fsm_time_mean + lookup: average -1h unaligned of time + every: 30s + units: ms + info: average time between reception of client GET request and \ + subsequent response to client over the last hour + + template: riakkv_kv_get_slow + on: riak.kv.latency.get + class: Latency + type: Database +component: Riak KV + calc: $mean + lookup: average -3m unaligned of time + units: ms + every: 10s + warn: ($this > ($riakkv_1h_kv_get_mean_latency * 2) ) + crit: ($this > ($riakkv_1h_kv_get_mean_latency * 3) ) + info: average time between reception of client GET request and \ + subsequent response to the client over the last 3 minutes, \ + compared to the average over the last hour + delay: down 5m multiplier 1.5 max 1h + to: dba + +# KV PUT + template: riakkv_1h_kv_put_mean_latency + on: riak.kv.latency.put + class: Latency + type: Database +component: Riak KV + calc: $node_put_fsm_time_mean + lookup: average -1h unaligned of time + 
every: 30s + units: ms + info: average time between reception of client PUT request and \ + subsequent response to the client over the last hour + + template: riakkv_kv_put_slow + on: riak.kv.latency.put + class: Latency + type: Database +component: Riak KV + calc: $mean + lookup: average -3m unaligned of time + units: ms + every: 10s + warn: ($this > ($riakkv_1h_kv_put_mean_latency * 2) ) + crit: ($this > ($riakkv_1h_kv_put_mean_latency * 3) ) + info: average time between reception of client PUT request and \ + subsequent response to the client over the last 3 minutes, \ + compared to the average over the last hour + delay: down 5m multiplier 1.5 max 1h + to: dba + + +## VM healthchecks + +# Default Erlang VM process limit: 262144 +# On systems observed, this is < 2000, but may grow depending on load. + template: riakkv_vm_high_process_count + on: riak.vm + class: Utilization + type: Database +component: Riak KV + calc: $sys_process_count + units: processes + every: 10s + warn: $this > 10000 + crit: $this > 100000 + info: number of processes running in the Erlang VM + to: dba diff --git a/health/health.d/scaleio.conf b/health/health.d/scaleio.conf new file mode 100644 index 0000000..ab110bf --- /dev/null +++ b/health/health.d/scaleio.conf @@ -0,0 +1,31 @@ + +# make sure Storage Pool capacity utilization is under limit + + template: scaleio_storage_pool_capacity_utilization + on: scaleio.storage_pool_capacity_utilization + class: Utilization + type: Storage +component: ScaleIO + calc: $used + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (80) : (90)) + crit: $this > (($status == $CRITICAL) ? 
(90) : (98)) + delay: down 15m multiplier 1.5 max 1h + info: storage pool capacity utilization + to: sysadmin + + +# make sure Sdc is connected to MDM + + template: scaleio_sdc_mdm_connection_state + on: scaleio.sdc_mdm_connection_state + class: Utilization + type: Storage +component: ScaleIO + calc: $connected + every: 10s + warn: $this != 1 + delay: up 30s down 5m multiplier 1.5 max 1h + info: Data Client (SDC) to Metadata Manager (MDM) connection state (0: disconnected, 1: connected) + to: sysadmin diff --git a/health/health.d/softnet.conf b/health/health.d/softnet.conf new file mode 100644 index 0000000..345f875 --- /dev/null +++ b/health/health.d/softnet.conf @@ -0,0 +1,54 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + +# check for common /proc/net/softnet_stat errors + + alarm: 1min_netdev_backlog_exceeded + on: system.softnet_stat + class: Errors + type: System +component: Network + os: linux + hosts: * + lookup: average -1m unaligned absolute of dropped + units: packets + every: 10s + warn: $this > (($status >= $WARNING) ? (0) : (10)) + delay: down 1h multiplier 1.5 max 2h + info: average number of dropped packets in the last minute \ + due to exceeded net.core.netdev_max_backlog + to: sysadmin + + alarm: 1min_netdev_budget_ran_outs + on: system.softnet_stat + class: Errors + type: System +component: Network + os: linux + hosts: * + lookup: average -1m unaligned absolute of squeezed + units: events + every: 10s + warn: $this > (($status >= $WARNING) ? 
(0) : (10)) + delay: down 1h multiplier 1.5 max 2h + info: average number of times ksoftirq ran out of sysctl net.core.netdev_budget or \ + net.core.netdev_budget_usecs with work remaining over the last minute \ + (this can be a cause for dropped packets) + to: silent + + alarm: 10min_netisr_backlog_exceeded + on: system.softnet_stat + class: Errors + type: System +component: Network + os: freebsd + hosts: * + lookup: average -1m unaligned absolute of qdrops + units: packets + every: 10s + warn: $this > (($status >= $WARNING) ? (0) : (10)) + delay: down 1h multiplier 1.5 max 2h + info: average number of drops in the last minute \ + due to exceeded sysctl net.route.netisr_maxqlen \ + (this can be a cause for dropped packets) + to: sysadmin diff --git a/health/health.d/swap.conf b/health/health.d/swap.conf new file mode 100644 index 0000000..d30c74c --- /dev/null +++ b/health/health.d/swap.conf @@ -0,0 +1,35 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + + alarm: 30min_ram_swapped_out + on: system.swapio + class: Workload + type: System +component: Memory + os: linux freebsd + hosts: * + lookup: sum -30m unaligned absolute of out + # we have to convert KB to MB by dividing $this (i.e. the result of the lookup) with 1024 + calc: $this / 1024 * 100 / ( $system.ram.used + $system.ram.cached + $system.ram.free ) + units: % of RAM + every: 1m + warn: $this > (($status >= $WARNING) ? (20) : (30)) + delay: down 15m multiplier 1.5 max 1h + info: percentage of the system RAM swapped in the last 30 minutes + to: sysadmin + + alarm: used_swap + on: system.swap + class: Utilization + type: System +component: Memory + os: linux freebsd + hosts: * + calc: (($used + $free) > 0) ? ($used * 100 / ($used + $free)) : 0 + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (80) : (90)) + crit: $this > (($status == $CRITICAL) ? 
(90) : (98)) + delay: up 30s down 15m multiplier 1.5 max 1h + info: swap memory utilization + to: sysadmin diff --git a/health/health.d/synchronization.conf b/health/health.d/synchronization.conf new file mode 100644 index 0000000..417624a --- /dev/null +++ b/health/health.d/synchronization.conf @@ -0,0 +1,12 @@ + alarm: sync_freq + on: mem.sync + lookup: sum -1m of sync + units: calls + plugin: ebpf.plugin + every: 1m + warn: $this > 6 + delay: up 1m down 10m multiplier 1.5 max 1h + info: number of sync() system calls. \ + Every call causes all pending modifications to filesystem metadata and \ + cached file data to be written to the underlying filesystems. + to: sysadmin diff --git a/health/health.d/systemdunits.conf b/health/health.d/systemdunits.conf new file mode 100644 index 0000000..531d62f --- /dev/null +++ b/health/health.d/systemdunits.conf @@ -0,0 +1,141 @@ +# you can disable an alarm notification by setting the 'to' line to: silent + +## Service units + template: systemd_service_unit_failed_state + on: systemd.service_unit_state + class: Errors + type: Linux +component: Systemd units + calc: $failed + units: state + every: 10s + warn: $this != nan AND $this == 1 + delay: down 5m multiplier 1.5 max 1h + info: systemd service unit in the failed state + to: sysadmin + +## Socket units + template: systemd_socket_unit_failed_state + on: systemd.socket_unit_state + class: Errors + type: Linux +component: Systemd units + calc: $failed + units: state + every: 10s + warn: $this != nan AND $this == 1 + delay: down 5m multiplier 1.5 max 1h + info: systemd socket unit in the failed state + to: sysadmin + +## Target units + template: systemd_target_unit_failed_state + on: systemd.target_unit_state + class: Errors + type: Linux +component: Systemd units + calc: $failed + units: state + every: 10s + warn: $this != nan AND $this == 1 + delay: down 5m multiplier 1.5 max 1h + info: systemd target unit in the failed state + to: sysadmin + +## Path units + template: 
systemd_path_unit_failed_state + on: systemd.path_unit_state + class: Errors + type: Linux +component: Systemd units + calc: $failed + units: state + every: 10s + warn: $this != nan AND $this == 1 + delay: down 5m multiplier 1.5 max 1h + info: systemd path unit in the failed state + to: sysadmin + +## Device units + template: systemd_device_unit_failed_state + on: systemd.device_unit_state + class: Errors + type: Linux +component: Systemd units + calc: $failed + units: state + every: 10s + warn: $this != nan AND $this == 1 + delay: down 5m multiplier 1.5 max 1h + info: systemd device unit in the failed state + to: sysadmin + +## Mount units + template: systemd_mount_unit_failed_state + on: systemd.mount_unit_state + class: Errors + type: Linux +component: Systemd units + calc: $failed + units: state + every: 10s + warn: $this != nan AND $this == 1 + delay: down 5m multiplier 1.5 max 1h + info: systemd mount units in the failed state + to: sysadmin + +## Automount units + template: systemd_automount_unit_failed_state + on: systemd.automount_unit_state + class: Errors + type: Linux +component: Systemd units + calc: $failed + units: state + every: 10s + warn: $this != nan AND $this == 1 + delay: down 5m multiplier 1.5 max 1h + info: systemd automount unit in the failed state + to: sysadmin + +## Swap units + template: systemd_swap_unit_failed_state + on: systemd.swap_unit_state + class: Errors + type: Linux +component: Systemd units + calc: $failed + units: state + every: 10s + warn: $this != nan AND $this == 1 + delay: down 5m multiplier 1.5 max 1h + info: systemd swap units in the failed state + to: sysadmin + +## Scope units + template: systemd_scope_unit_failed_state + on: systemd.scope_unit_state + class: Errors + type: Linux +component: Systemd units + calc: $failed + units: state + every: 10s + warn: $this != nan AND $this == 1 + delay: down 5m multiplier 1.5 max 1h + info: systemd scope units in the failed state + to: sysadmin + +## Slice units + template: 
systemd_slice_unit_failed_state + on: systemd.slice_unit_state + class: Errors + type: Linux +component: Systemd units + calc: $failed + units: state + every: 10s + warn: $this != nan AND $this == 1 + delay: down 5m multiplier 1.5 max 1h + info: systemd slice units in the failed state + to: sysadmin diff --git a/health/health.d/tcp_conn.conf b/health/health.d/tcp_conn.conf new file mode 100644 index 0000000..67b3bee --- /dev/null +++ b/health/health.d/tcp_conn.conf @@ -0,0 +1,22 @@ + +# +# ${tcp_max_connections} may be nan or -1 if the system +# supports dynamic threshold for TCP connections. +# In this case, the alarm will always be zero. +# + + alarm: tcp_connections + on: ipv4.tcpsock + class: Workload + type: System +component: Network + os: linux + hosts: * + calc: (${tcp_max_connections} > 0) ? ( ${connections} * 100 / ${tcp_max_connections} ) : 0 + units: % + every: 10s + warn: $this > (($status >= $WARNING ) ? ( 60 ) : ( 80 )) + crit: $this > (($status == $CRITICAL) ? ( 80 ) : ( 90 )) + delay: up 0 down 5m multiplier 1.5 max 1h + info: IPv4 TCP connections utilization + to: sysadmin diff --git a/health/health.d/tcp_listen.conf b/health/health.d/tcp_listen.conf new file mode 100644 index 0000000..d4bcfa2 --- /dev/null +++ b/health/health.d/tcp_listen.conf @@ -0,0 +1,96 @@ +# +# There are two queues involved when incoming TCP connections are handled +# (both at the kernel): +# +# SYN queue +# The SYN queue tracks TCP handshakes until connections are fully established. +# It overflows when too many incoming TCP connection requests hang in the +# half-open state and the server is not configured to fall back to SYN cookies. +# Overflows are usually caused by SYN flood DoS attacks (i.e. someone sends +# lots of SYN packets and never completes the handshakes). +# +# Accept queue +# The accept queue holds fully established TCP connections waiting to be handled +# by the listening application. 
It overflows when the server application fails +# to accept new connections at the rate they are coming in. +# +# +# ----------------------------------------------------------------------------- +# tcp accept queue (at the kernel) + + alarm: 1m_tcp_accept_queue_overflows + on: ip.tcp_accept_queue + class: Workload + type: System +component: Network + os: linux + hosts: * + lookup: average -60s unaligned absolute of ListenOverflows + units: overflows + every: 10s + warn: $this > 1 + crit: $this > (($status == $CRITICAL) ? (1) : (5)) + delay: up 0 down 5m multiplier 1.5 max 1h + info: average number of overflows in the TCP accept queue over the last minute + to: sysadmin + +# THIS IS TOO GENERIC +# CHECK: https://github.com/netdata/netdata/issues/3234#issuecomment-423935842 + alarm: 1m_tcp_accept_queue_drops + on: ip.tcp_accept_queue + class: Workload + type: System +component: Network + os: linux + hosts: * + lookup: average -60s unaligned absolute of ListenDrops + units: drops + every: 10s + warn: $this > 1 + crit: $this > (($status == $CRITICAL) ? (1) : (5)) + delay: up 0 down 5m multiplier 1.5 max 1h + info: average number of dropped packets in the TCP accept queue over the last minute + to: sysadmin + + +# ----------------------------------------------------------------------------- +# tcp SYN queue (at the kernel) + +# When the SYN queue is full, either TcpExtTCPReqQFullDoCookies or +# TcpExtTCPReqQFullDrop is incremented, depending on whether SYN cookies are +# enabled or not. In both cases this probably indicates a SYN flood attack, +# so i guess a notification should be sent. + + alarm: 1m_tcp_syn_queue_drops + on: ip.tcp_syn_queue + class: Workload + type: System +component: Network + os: linux + hosts: * + lookup: average -60s unaligned absolute of TCPReqQFullDrop + units: drops + every: 10s + warn: $this > 1 + crit: $this > (($status == $CRITICAL) ? 
(0) : (5)) + delay: up 10 down 5m multiplier 1.5 max 1h + info: average number of SYN requests was dropped due to the full TCP SYN queue over the last minute \ + (SYN cookies were not enabled) + to: sysadmin + + alarm: 1m_tcp_syn_queue_cookies + on: ip.tcp_syn_queue + class: Workload + type: System +component: Network + os: linux + hosts: * + lookup: average -60s unaligned absolute of TCPReqQFullDoCookies + units: cookies + every: 10s + warn: $this > 1 + crit: $this > (($status == $CRITICAL) ? (0) : (5)) + delay: up 10 down 5m multiplier 1.5 max 1h + info: average number of sent SYN cookies due to the full TCP SYN queue over the last minute + to: sysadmin + diff --git a/health/health.d/tcp_mem.conf b/health/health.d/tcp_mem.conf new file mode 100644 index 0000000..318be20 --- /dev/null +++ b/health/health.d/tcp_mem.conf @@ -0,0 +1,23 @@ +# +# check +# http://blog.tsunanet.net/2011/03/out-of-socket-memory.html +# +# We give a warning when TCP is under memory pressure +# and a critical when TCP is 90% of its upper memory limit +# + + alarm: tcp_memory + on: ipv4.sockstat_tcp_mem + class: Utilization + type: System +component: Network + os: linux + hosts: * + calc: ${mem} * 100 / ${tcp_mem_high} + units: % + every: 10s + warn: ${mem} > (($status >= $WARNING ) ? ( ${tcp_mem_pressure} * 0.8 ) : ( ${tcp_mem_pressure} )) + crit: ${mem} > (($status == $CRITICAL ) ? 
( ${tcp_mem_pressure} ) : ( ${tcp_mem_high} * 0.9 )) + delay: up 0 down 5m multiplier 1.5 max 1h + info: TCP memory utilization + to: sysadmin diff --git a/health/health.d/tcp_orphans.conf b/health/health.d/tcp_orphans.conf new file mode 100644 index 0000000..cbd628d --- /dev/null +++ b/health/health.d/tcp_orphans.conf @@ -0,0 +1,24 @@ + +# +# check +# http://blog.tsunanet.net/2011/03/out-of-socket-memory.html +# +# The kernel may penalize orphans by 2x or even 4x +# so we alarm warning at 25% and critical at 50% +# + + alarm: tcp_orphans + on: ipv4.sockstat_tcp_sockets + class: Errors + type: System +component: Network + os: linux + hosts: * + calc: ${orphan} * 100 / ${tcp_max_orphans} + units: % + every: 10s + warn: $this > (($status >= $WARNING ) ? ( 20 ) : ( 25 )) + crit: $this > (($status == $CRITICAL) ? ( 25 ) : ( 50 )) + delay: up 0 down 5m multiplier 1.5 max 1h + info: orphan IPv4 TCP sockets utilization + to: sysadmin diff --git a/health/health.d/tcp_resets.conf b/health/health.d/tcp_resets.conf new file mode 100644 index 0000000..ff116db --- /dev/null +++ b/health/health.d/tcp_resets.conf @@ -0,0 +1,69 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + +# ----------------------------------------------------------------------------- +# tcp resets this host sends + + alarm: 1m_ipv4_tcp_resets_sent + on: ipv4.tcphandshake + class: Errors + type: System +component: Network + os: linux + hosts: * + lookup: average -1m at -10s unaligned absolute of OutRsts + units: tcp resets/s + every: 10s + info: average number of sent TCP RESETS over the last minute + + alarm: 10s_ipv4_tcp_resets_sent + on: ipv4.tcphandshake + class: Errors + type: System +component: Network + os: linux + hosts: * + lookup: average -10s unaligned absolute of OutRsts + units: tcp resets/s + every: 10s + warn: $netdata.uptime.uptime > (1 * 60) AND $this > ((($1m_ipv4_tcp_resets_sent < 5)?(5):($1m_ipv4_tcp_resets_sent)) * (($status >= $WARNING) ? 
(1) : (10))) + delay: up 20s down 60m multiplier 1.2 max 2h + options: no-clear-notification + info: average number of sent TCP RESETS over the last 10 seconds. \ + This can indicate a port scan, \ + or that a service running on this host has crashed. \ + Netdata will not send a clear notification for this alarm. + to: sysadmin + +# ----------------------------------------------------------------------------- +# tcp resets this host receives + + alarm: 1m_ipv4_tcp_resets_received + on: ipv4.tcphandshake + class: Errors + type: System +component: Network + os: linux freebsd + hosts: * + lookup: average -1m at -10s unaligned absolute of AttemptFails + units: tcp resets/s + every: 10s + info: average number of received TCP RESETS over the last minute + + alarm: 10s_ipv4_tcp_resets_received + on: ipv4.tcphandshake + class: Errors + type: System +component: Network + os: linux freebsd + hosts: * + lookup: average -10s unaligned absolute of AttemptFails + units: tcp resets/s + every: 10s + warn: $netdata.uptime.uptime > (1 * 60) AND $this > ((($1m_ipv4_tcp_resets_received < 5)?(5):($1m_ipv4_tcp_resets_received)) * (($status >= $WARNING) ? (1) : (10))) + delay: up 20s down 60m multiplier 1.2 max 2h + options: no-clear-notification + info: average number of received TCP RESETS over the last 10 seconds. \ + This can be an indication that a service this host needs has crashed. \ + Netdata will not send a clear notification for this alarm. + to: sysadmin diff --git a/health/health.d/timex.conf b/health/health.d/timex.conf new file mode 100644 index 0000000..2e9b1a3 --- /dev/null +++ b/health/health.d/timex.conf @@ -0,0 +1,17 @@ + +# It can take several minutes before ntpd selects a server to synchronize with; +# try checking after 17 minutes (1024 seconds). 
+ + alarm: system_clock_sync_state + on: system.clock_sync_state + os: linux + class: Errors + type: System +component: Clock + calc: $state + units: synchronization state + every: 10s + warn: $system.uptime.uptime > 17 * 60 AND $this == 0 + delay: down 5m + info: when set to 0, the system kernel believes the system clock is not properly synchronized to a reliable server + to: silent diff --git a/health/health.d/udp_errors.conf b/health/health.d/udp_errors.conf new file mode 100644 index 0000000..64f47df --- /dev/null +++ b/health/health.d/udp_errors.conf @@ -0,0 +1,38 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + +# ----------------------------------------------------------------------------- +# UDP receive buffer errors + + alarm: 1m_ipv4_udp_receive_buffer_errors + on: ipv4.udperrors + class: Errors + type: System +component: Network + os: linux freebsd + hosts: * + lookup: average -1m unaligned absolute of RcvbufErrors + units: errors + every: 10s + warn: $this > (($status >= $WARNING) ? (0) : (10)) + info: average number of UDP receive buffer errors over the last minute + delay: up 1m down 60m multiplier 1.2 max 2h + to: sysadmin + +# ----------------------------------------------------------------------------- +# UDP send buffer errors + + alarm: 1m_ipv4_udp_send_buffer_errors + on: ipv4.udperrors + class: Errors + type: System +component: Network + os: linux + hosts: * + lookup: average -1m unaligned absolute of SndbufErrors + units: errors + every: 10s + warn: $this > (($status >= $WARNING) ? 
(0) : (10)) + info: average number of UDP send buffer errors over the last minute + delay: up 1m down 60m multiplier 1.2 max 2h + to: sysadmin diff --git a/health/health.d/unbound.conf b/health/health.d/unbound.conf new file mode 100644 index 0000000..4e8d164 --- /dev/null +++ b/health/health.d/unbound.conf @@ -0,0 +1,28 @@ + +# make sure there is no overwritten/dropped queries in the request-list + + template: unbound_request_list_overwritten + on: unbound.request_list_jostle_list + class: Errors + type: DNS +component: Unbound + lookup: average -60s unaligned absolute match-names of overwritten + units: queries + every: 10s + warn: $this > 5 + delay: up 10 down 5m multiplier 1.5 max 1h + info: number of overwritten queries in the request-list + to: sysadmin + + template: unbound_request_list_dropped + on: unbound.request_list_jostle_list + class: Errors + type: DNS +component: Unbound + lookup: average -60s unaligned absolute match-names of dropped + units: queries + every: 10s + warn: $this > 0 + delay: up 10 down 5m multiplier 1.5 max 1h + info: number of dropped queries in the request-list + to: sysadmin diff --git a/health/health.d/vcsa.conf b/health/health.d/vcsa.conf new file mode 100644 index 0000000..a9cc7ce --- /dev/null +++ b/health/health.d/vcsa.conf @@ -0,0 +1,141 @@ + +# Overall system health: +# - 0: all components are healthy. +# - 1: one or more components might become overloaded soon. +# - 2: one or more components in the appliance might be degraded. +# - 3: one or more components might be in an unusable status and the appliance might become unresponsive soon. +# - 4: no health data is available. 
+ + template: vcsa_system_health + on: vcsa.system_health + class: Errors + type: Virtual Machine +component: VMware vCenter + lookup: max -10s unaligned of system + units: status + every: 10s + warn: ($this == 1) || ($this == 2) + crit: $this == 3 + delay: down 1m multiplier 1.5 max 1h + info: overall system health status \ + (-1: unknown, 0: green, 1: yellow, 2: orange, 3: red, 4: grey) + to: sysadmin + +# Components health: +# - 0: healthy. +# - 1: healthy, but may have some problems. +# - 2: degraded, and may have serious problems. +# - 3: unavailable, or will stop functioning soon. +# - 4: no health data is available. + + template: vcsa_swap_health + on: vcsa.components_health + class: Errors + type: Virtual Machine +component: VMware vCenter + lookup: max -10s unaligned of swap + units: status + every: 10s + warn: $this == 1 + crit: ($this == 2) || ($this == 3) + delay: down 1m multiplier 1.5 max 1h + info: swap health status \ + (-1: unknown, 0: green, 1: yellow, 2: orange, 3: red, 4: grey) + to: sysadmin + + template: vcsa_storage_health + on: vcsa.components_health + class: Errors + type: Virtual Machine +component: VMware vCenter + lookup: max -10s unaligned of storage + units: status + every: 10s + warn: $this == 1 + crit: ($this == 2) || ($this == 3) + delay: down 1m multiplier 1.5 max 1h + info: storage health status \ + (-1: unknown, 0: green, 1: yellow, 2: orange, 3: red, 4: grey) + to: sysadmin + + template: vcsa_mem_health + on: vcsa.components_health + class: Errors + type: Virtual Machine +component: VMware vCenter + lookup: max -10s unaligned of mem + units: status + every: 10s + warn: $this == 1 + crit: ($this == 2) || ($this == 3) + delay: down 1m multiplier 1.5 max 1h + info: memory health status \ + (-1: unknown, 0: green, 1: yellow, 2: orange, 3: red, 4: grey) + to: sysadmin + + template: vcsa_load_health + on: vcsa.components_health + class: Utilization + type: Virtual Machine +component: VMware vCenter + lookup: max -10s unaligned of load 
+ units: status + every: 10s + warn: $this == 1 + crit: ($this == 2) || ($this == 3) + delay: down 1m multiplier 1.5 max 1h + info: load health status \ + (-1: unknown, 0: green, 1: yellow, 2: orange, 3: red, 4: grey) + to: sysadmin + + template: vcsa_database_storage_health + on: vcsa.components_health + class: Errors + type: Virtual Machine +component: VMware vCenter + lookup: max -10s unaligned of database_storage + units: status + every: 10s + warn: $this == 1 + crit: ($this == 2) || ($this == 3) + delay: down 1m multiplier 1.5 max 1h + info: database storage health status \ + (-1: unknown, 0: green, 1: yellow, 2: orange, 3: red, 4: grey) + to: sysadmin + + template: vcsa_applmgmt_health + on: vcsa.components_health + class: Errors + type: Virtual Machine +component: VMware vCenter + lookup: max -10s unaligned of applmgmt + units: status + every: 10s + warn: $this == 1 + crit: ($this == 2) || ($this == 3) + delay: down 1m multiplier 1.5 max 1h + info: applmgmt health status \ + (-1: unknown, 0: green, 1: yellow, 2: orange, 3: red, 4: grey) + to: sysadmin + + +# Software updates health: +# - 0: no updates available. +# - 2: non-security updates are available. +# - 3: security updates are available. +# - 4: an error retrieving information on software updates. 
+ + template: vcsa_software_updates_health + on: vcsa.software_updates_health + class: Errors + type: Virtual Machine +component: VMware vCenter + lookup: max -10s unaligned of software_packages + units: status + every: 10s + warn: $this == 4 + crit: $this == 3 + delay: down 1m multiplier 1.5 max 1h + info: software updates availability status \ + (-1: unknown, 0: green, 2: orange, 3: red, 4: grey) + to: sysadmin diff --git a/health/health.d/vernemq.conf b/health/health.d/vernemq.conf new file mode 100644 index 0000000..cfbe2a5 --- /dev/null +++ b/health/health.d/vernemq.conf @@ -0,0 +1,365 @@ + +# Socket errors + + template: vernemq_socket_errors + on: vernemq.socket_errors + class: Errors + type: Messaging +component: VerneMQ + lookup: sum -1m unaligned absolute of socket_error + units: errors + every: 1m + warn: $this > (($status >= $WARNING) ? (0) : (5)) + delay: up 2m down 5m multiplier 1.5 max 2h + info: number of socket errors in the last minute + to: sysadmin + +# Queues dropped/expired/unhandled PUBLISH messages + + template: vernemq_queue_message_drop + on: vernemq.queue_undelivered_messages + class: Errors + type: Messaging +component: VerneMQ + lookup: average -1m unaligned absolute of queue_message_drop + units: dropped messages + every: 1m + warn: $this > (($status >= $WARNING) ? (0) : (5)) + delay: up 2m down 5m multiplier 1.5 max 2h + info: number of dropped messaged due to full queues in the last minute + to: sysadmin + + template: vernemq_queue_message_expired + on: vernemq.queue_undelivered_messages + class: Latency + type: Messaging +component: VerneMQ + lookup: average -1m unaligned absolute of queue_message_expired + units: expired messages + every: 1m + warn: $this > (($status >= $WARNING) ? 
(0) : (5)) + delay: up 2m down 5m multiplier 1.5 max 2h + info: number of messages which expired before delivery in the last minute + to: sysadmin + + template: vernemq_queue_message_unhandled + on: vernemq.queue_undelivered_messages + class: Latency + type: Messaging +component: VerneMQ + lookup: average -1m unaligned absolute of queue_message_unhandled + units: unhandled messages + every: 1m + warn: $this > (($status >= $WARNING) ? (0) : (5)) + delay: up 2m down 5m multiplier 1.5 max 2h + info: number of unhandled messages (connections with clean session=true) in the last minute + to: sysadmin + +# Erlang VM + + template: vernemq_average_scheduler_utilization + on: vernemq.average_scheduler_utilization + class: Utilization + type: Messaging +component: VerneMQ + lookup: average -10m unaligned + units: % + every: 1m + warn: $this > (($status >= $WARNING) ? (75) : (85)) + crit: $this > (($status == $CRITICAL) ? (85) : (95)) + delay: down 15m multiplier 1.5 max 1h + info: average scheduler utilization over the last 10 minutes + to: sysadmin + +# Cluster communication and netsplits + + template: vernemq_cluster_dropped + on: vernemq.cluster_dropped + class: Errors + type: Messaging +component: VerneMQ + lookup: sum -1m unaligned + units: KiB + every: 1m + warn: $this > 0 + delay: up 5m down 5m multiplier 1.5 max 1h + info: amount of traffic dropped during communication with the cluster nodes in the last minute + to: sysadmin + + template: vernemq_netsplits + on: vernemq.netsplits + class: Workload + type: Messaging +component: VerneMQ + lookup: sum -1m unaligned absolute of netsplit_detected + units: netsplits + every: 10s + warn: $this > 0 + delay: down 5m multiplier 1.5 max 2h + info: number of detected netsplits (split brain situation) in the last minute + to: sysadmin + +# Unsuccessful CONNACK + + template: vernemq_mqtt_connack_sent_reason_unsuccessful + on: vernemq.mqtt_connack_sent_reason + class: Errors + type: Messaging +component: VerneMQ + lookup: average 
-1m unaligned absolute match-names of !success,* + units: packets + every: 1m + warn: $this > (($status >= $WARNING) ? (0) : (5)) + delay: up 2m down 5m multiplier 1.5 max 2h + info: number of sent unsuccessful v3/v5 CONNACK packets in the last minute + to: sysadmin + +# Not normal DISCONNECT + + template: vernemq_mqtt_disconnect_received_reason_not_normal + on: vernemq.mqtt_disconnect_received_reason + class: Workload + type: Messaging +component: VerneMQ + lookup: average -1m unaligned absolute match-names of !normal_disconnect,* + units: packets + every: 1m + warn: $this > (($status >= $WARNING) ? (0) : (5)) + delay: up 2m down 5m multiplier 1.5 max 2h + info: number of received not normal v5 DISCONNECT packets in the last minute + to: sysadmin + + template: vernemq_mqtt_disconnect_sent_reason_not_normal + on: vernemq.mqtt_disconnect_sent_reason + class: Errors + type: Messaging +component: VerneMQ + lookup: average -1m unaligned absolute match-names of !normal_disconnect,* + units: packets + every: 1m + warn: $this > (($status >= $WARNING) ? (0) : (5)) + delay: up 2m down 5m multiplier 1.5 max 2h + info: number of sent not normal v5 DISCONNECT packets in the last minute + to: sysadmin + +# SUBSCRIBE errors and unauthorized attempts + + template: vernemq_mqtt_subscribe_error + on: vernemq.mqtt_subscribe_error + class: Errors + type: Messaging +component: VerneMQ + lookup: average -1m unaligned absolute + units: failed ops + every: 1m + warn: $this > (($status >= $WARNING) ? (0) : (5)) + delay: up 2m down 5m multiplier 1.5 max 2h + info: number of failed v3/v5 SUBSCRIBE operations in the last minute + to: sysadmin + + template: vernemq_mqtt_subscribe_auth_error + on: vernemq.mqtt_subscribe_auth_error + class: Workload + type: Messaging +component: VerneMQ + lookup: average -1m unaligned absolute + units: attempts + every: 1m + warn: $this > (($status >= $WARNING) ? 
(0) : (5)) + delay: up 2m down 5m multiplier 1.5 max 2h + info: number of unauthorized v3/v5 SUBSCRIBE attempts in the last minute + to: sysadmin + +# UNSUBSCRIBE errors + + template: vernemq_mqtt_unsubscribe_error + on: vernemq.mqtt_unsubscribe_error + class: Errors + type: Messaging +component: VerneMQ + lookup: average -1m unaligned absolute + units: failed ops + every: 1m + warn: $this > (($status >= $WARNING) ? (0) : (5)) + delay: up 2m down 5m multiplier 1.5 max 2h + info: number of failed v3/v5 UNSUBSCRIBE operations in the last minute + to: sysadmin + +# PUBLISH errors and unauthorized attempts + + template: vernemq_mqtt_publish_errors + on: vernemq.mqtt_publish_errors + class: Errors + type: Messaging +component: VerneMQ + lookup: average -1m unaligned absolute + units: failed ops + every: 1m + warn: $this > (($status >= $WARNING) ? (0) : (5)) + delay: up 2m down 5m multiplier 1.5 max 2h + info: number of failed v3/v5 PUBLISH operations in the last minute + to: sysadmin + + template: vernemq_mqtt_publish_auth_errors + on: vernemq.mqtt_publish_auth_errors + class: Workload + type: Messaging +component: VerneMQ + lookup: average -1m unaligned absolute + units: attempts + every: 1m + warn: $this > (($status >= $WARNING) ? (0) : (5)) + delay: up 2m down 5m multiplier 1.5 max 2h + info: number of unauthorized v3/v5 PUBLISH attempts in the last minute + to: sysadmin + +# Unsuccessful and unexpected PUBACK + + template: vernemq_mqtt_puback_received_reason_unsuccessful + on: vernemq.mqtt_puback_received_reason + class: Errors + type: Messaging +component: VerneMQ + lookup: average -1m unaligned absolute match-names of !success,* + units: packets + every: 1m + warn: $this > (($status >= $WARNING) ? 
(0) : (5)) + delay: up 2m down 5m multiplier 1.5 max 2h + info: number of received unsuccessful v5 PUBACK packets in the last minute + to: sysadmin + + template: vernemq_mqtt_puback_sent_reason_unsuccessful + on: vernemq.mqtt_puback_sent_reason + class: Errors + type: Messaging +component: VerneMQ + lookup: average -1m unaligned absolute match-names of !success,* + units: packets + every: 1m + warn: $this > (($status >= $WARNING) ? (0) : (5)) + delay: up 2m down 5m multiplier 1.5 max 2h + info: number of sent unsuccessful v5 PUBACK packets in the last minute + to: sysadmin + + template: vernemq_mqtt_puback_unexpected + on: vernemq.mqtt_puback_invalid_error + class: Workload + type: Messaging +component: VerneMQ + lookup: average -1m unaligned absolute + units: messages + every: 1m + warn: $this > (($status >= $WARNING) ? (0) : (5)) + delay: up 2m down 5m multiplier 1.5 max 2h + info: number of received unexpected v3/v5 PUBACK packets in the last minute + to: sysadmin + +# Unsuccessful and unexpected PUBREC + + template: vernemq_mqtt_pubrec_received_reason_unsuccessful + on: vernemq.mqtt_pubrec_received_reason + class: Errors + type: Messaging +component: VerneMQ + lookup: average -1m unaligned absolute match-names of !success,* + units: packets + every: 1m + warn: $this > (($status >= $WARNING) ? (0) : (5)) + delay: up 2m down 5m multiplier 1.5 max 2h + info: number of received unsuccessful v5 PUBREC packets in the last minute + to: sysadmin + + template: vernemq_mqtt_pubrec_sent_reason_unsuccessful + on: vernemq.mqtt_pubrec_sent_reason + class: Errors + type: Messaging +component: VerneMQ + lookup: average -1m unaligned absolute match-names of !success,* + units: packets + every: 1m + warn: $this > (($status >= $WARNING) ? 
(0) : (5)) + delay: up 2m down 5m multiplier 1.5 max 2h + info: number of sent unsuccessful v5 PUBREC packets in the last minute + to: sysadmin + + template: vernemq_mqtt_pubrec_invalid_error + on: vernemq.mqtt_pubrec_invalid_error + class: Workload + type: Messaging +component: VerneMQ + lookup: average -1m unaligned absolute + units: messages + every: 1m + warn: $this > (($status >= $WARNING) ? (0) : (5)) + delay: up 2m down 5m multiplier 1.5 max 2h + info: number of received unexpected v3 PUBREC packets in the last minute + to: sysadmin + +# Unsuccessful PUBREL + + template: vernemq_mqtt_pubrel_received_reason_unsuccessful + on: vernemq.mqtt_pubrel_received_reason + class: Errors + type: Messaging +component: VerneMQ + lookup: average -1m unaligned absolute match-names of !success,* + units: packets + every: 1m + warn: $this > (($status >= $WARNING) ? (0) : (5)) + delay: up 2m down 5m multiplier 1.5 max 2h + info: number of received unsuccessful v5 PUBREL packets in the last minute + to: sysadmin + + template: vernemq_mqtt_pubrel_sent_reason_unsuccessful + on: vernemq.mqtt_pubrel_sent_reason + class: Errors + type: Messaging +component: VerneMQ + lookup: average -1m unaligned absolute match-names of !success,* + units: packets + every: 1m + warn: $this > (($status >= $WARNING) ? (0) : (5)) + delay: up 2m down 5m multiplier 1.5 max 2h + info: number of sent unsuccessful v5 PUBREL packets in the last minute + to: sysadmin + +# Unsuccessful and unexpected PUBCOMP + + template: vernemq_mqtt_pubcomp_received_reason_unsuccessful + on: vernemq.mqtt_pubcomp_received_reason + class: Errors + type: Messaging +component: VerneMQ + lookup: average -1m unaligned absolute match-names of !success,* + units: packets + every: 1m + warn: $this > (($status >= $WARNING) ? 
(0) : (5)) + delay: up 2m down 5m multiplier 1.5 max 2h + info: number of received unsuccessful v5 PUBCOMP packets in the last minute + to: sysadmin + + template: vernemq_mqtt_pubcomp_sent_reason_unsuccessful + on: vernemq.mqtt_pubcomp_sent_reason + class: Errors + type: Messaging +component: VerneMQ + lookup: average -1m unaligned absolute match-names of !success,* + units: packets + every: 1m + warn: $this > (($status >= $WARNING) ? (0) : (5)) + delay: up 2m down 5m multiplier 1.5 max 2h + info: number of sent unsuccessful v5 PUBCOMP packets in the last minute + to: sysadmin + + template: vernemq_mqtt_pubcomp_unexpected + on: vernemq.mqtt_pubcomp_invalid_error + class: Workload + type: Messaging +component: VerneMQ + lookup: average -1m unaligned absolute + units: messages + every: 1m + warn: $this > (($status >= $WARNING) ? (0) : (5)) + delay: up 2m down 5m multiplier 1.5 max 2h + info: number of received unexpected v3/v5 PUBCOMP packets in the last minute + to: sysadmin diff --git a/health/health.d/vsphere.conf b/health/health.d/vsphere.conf new file mode 100644 index 0000000..d8fc899 --- /dev/null +++ b/health/health.d/vsphere.conf @@ -0,0 +1,174 @@ + +# you can disable an alarm notification by setting the 'to' line to: silent + +# -----------------------------------------------VM Specific------------------------------------------------------------ +# Memory + + template: vsphere_vm_mem_usage + on: vsphere.vm_mem_usage_percentage + class: Utilization + type: Virtual Machine +component: Memory + hosts: * + calc: $used + units: % + every: 20s + warn: $this > (($status >= $WARNING) ? (80) : (90)) + crit: $this > (($status == $CRITICAL) ? 
(90) : (98)) + delay: down 15m multiplier 1.5 max 1h + info: virtual machine memory utilization + +# -----------------------------------------------HOST Specific---------------------------------------------------------- +# Memory + + template: vsphere_host_mem_usage + on: vsphere.host_mem_usage_percentage + class: Utilization + type: Virtual Machine +component: Memory + hosts: * + calc: $used + units: % + every: 20s + warn: $this > (($status >= $WARNING) ? (80) : (90)) + crit: $this > (($status == $CRITICAL) ? (90) : (98)) + delay: down 15m multiplier 1.5 max 1h + info: host memory utilization + +# Network errors + + template: vsphere_inbound_packets_errors + on: vsphere.net_errors_total + class: Errors + type: Virtual Machine +component: Network + hosts: * + families: * + lookup: sum -10m unaligned absolute match-names of rx + units: packets + every: 1m + info: number of inbound errors for the network interface in the last 10 minutes + + template: vsphere_outbound_packets_errors + on: vsphere.net_errors_total + class: Errors + type: Virtual Machine +component: Network + hosts: * + families: * + lookup: sum -10m unaligned absolute match-names of tx + units: packets + every: 1m + info: number of outbound errors for the network interface in the last 10 minutes + +# Network errors ratio + + template: vsphere_inbound_packets_errors_ratio + on: vsphere.net_packets_total + class: Errors + type: Virtual Machine +component: Network + hosts: * + families: * + lookup: sum -10m unaligned absolute match-names of rx + calc: (($vsphere_inbound_packets_errors != nan AND $this > 1000) ? 
($vsphere_inbound_packets_errors * 100 / $this) : (0)) + units: % + every: 1m + warn: $this >= 2 + delay: up 1m down 1h multiplier 1.5 max 2h + info: ratio of inbound errors for the network interface over the last 10 minutes + to: sysadmin + + template: vsphere_outbound_packets_errors_ratio + on: vsphere.net_packets_total + class: Errors + type: Virtual Machine +component: Network + hosts: * + families: * + lookup: sum -10m unaligned absolute match-names of tx + calc: (($vsphere_outbound_packets_errors != nan AND $this > 1000) ? ($vsphere_outbound_packets_errors * 100 / $this) : (0)) + units: % + every: 1m + warn: $this >= 2 + delay: up 1m down 1h multiplier 1.5 max 2h + info: ratio of outbound errors for the network interface over the last 10 minutes + to: sysadmin + +# -----------------------------------------------Common------------------------------------------------------------------- +# CPU + + template: vsphere_cpu_usage + on: vsphere.cpu_usage_total + class: Utilization + type: Virtual Machine +component: CPU + hosts: * + lookup: average -10m unaligned match-names of used + units: % + every: 20s + warn: $this > (($status >= $WARNING) ? (75) : (85)) + crit: $this > (($status == $CRITICAL) ? 
(85) : (95)) + delay: down 15m multiplier 1.5 max 1h + info: average CPU utilization + to: sysadmin + +# Network drops + + template: vsphere_inbound_packets_dropped + on: vsphere.net_drops_total + class: Errors + type: Virtual Machine +component: Network + hosts: * + families: * + lookup: sum -10m unaligned absolute match-names of rx + units: packets + every: 1m + info: number of inbound dropped packets for the network interface in the last 10 minutes + + template: vsphere_outbound_packets_dropped + on: vsphere.net_drops_total + class: Errors + type: Virtual Machine +component: Network + hosts: * + families: * + lookup: sum -10m unaligned absolute match-names of tx + units: packets + every: 1m + info: number of outbound dropped packets for the network interface in the last 10 minutes + +# Network drops ratio + + template: vsphere_inbound_packets_dropped_ratio + on: vsphere.net_packets_total + class: Errors + type: Virtual Machine +component: Network + hosts: * + families: * + lookup: sum -10m unaligned absolute match-names of rx + calc: (($vsphere_inbound_packets_dropped != nan AND $this > 1000) ? ($vsphere_inbound_packets_dropped * 100 / $this) : (0)) + units: % + every: 1m + warn: $this >= 2 + delay: up 1m down 1h multiplier 1.5 max 2h + info: ratio of inbound dropped packets for the network interface over the last 10 minutes + to: sysadmin + + template: vsphere_outbound_packets_dropped_ratio + on: vsphere.net_packets_total + class: Errors + type: Virtual Machine +component: Network + hosts: * + families: * + lookup: sum -10m unaligned absolute match-names of tx + calc: (($vsphere_outbound_packets_dropped != nan AND $this > 1000) ? 
($vsphere_outbound_packets_dropped * 100 / $this) : (0)) + units: % + every: 1m + warn: $this >= 2 + delay: up 1m down 1h multiplier 1.5 max 2h + info: ratio of outbound dropped packets for the network interface over the last 10 minutes + to: sysadmin diff --git a/health/health.d/web_log.conf b/health/health.d/web_log.conf new file mode 100644 index 0000000..c33c466 --- /dev/null +++ b/health/health.d/web_log.conf @@ -0,0 +1,210 @@ + +# unmatched lines + +# the following alarms trigger only when there are enough data. +# we assume there are enough data when: +# +# $1m_total_requests > 120 +# +# i.e. when there are at least 120 requests during the last minute + + template: web_log_1m_total_requests + on: web_log.requests + class: Workload + type: Web Server +component: Web log + families: * + lookup: sum -1m unaligned + calc: ($this == 0)?(1):($this) + units: requests + every: 10s + info: number of HTTP requests in the last minute + + template: web_log_1m_unmatched + on: web_log.excluded_requests + class: Errors + type: Web Server +component: Web log + families: * + lookup: sum -1m unaligned of unmatched + calc: $this * 100 / $web_log_1m_total_requests + units: % + every: 10s + warn: ($web_log_1m_total_requests > 120) ? ($this > 1) : ( 0 ) + delay: up 1m down 5m multiplier 1.5 max 1h + info: percentage of unparsed log lines over the last minute + to: webmaster + +# ----------------------------------------------------------------------------- +# high level response code alarms + +# the following alarms trigger only when there are enough data. +# we assume there are enough data when: +# +# $1m_requests > 120 +# +# i.e. 
when there are at least 120 requests during the last minute + + template: web_log_1m_requests + on: web_log.type_requests + class: Workload + type: Web Server +component: Web log + families: * + lookup: sum -1m unaligned + calc: ($this == 0)?(1):($this) + units: requests + every: 10s + info: number of HTTP requests in the last minute + + template: web_log_1m_successful + on: web_log.type_requests + class: Workload + type: Web Server +component: Web log + families: * + lookup: sum -1m unaligned of success + calc: $this * 100 / $web_log_1m_requests + units: % + every: 10s + warn: ($web_log_1m_requests > 120) ? ($this < (($status >= $WARNING ) ? ( 95 ) : ( 85 )) ) : ( 0 ) + crit: ($web_log_1m_requests > 120) ? ($this < (($status == $CRITICAL) ? ( 85 ) : ( 75 )) ) : ( 0 ) + delay: up 2m down 15m multiplier 1.5 max 1h + info: ratio of successful HTTP requests over the last minute (1xx, 2xx, 304, 401) + to: webmaster + + template: web_log_1m_redirects + on: web_log.type_requests + class: Workload + type: Web Server +component: Web log + families: * + lookup: sum -1m unaligned of redirect + calc: $this * 100 / $web_log_1m_requests + units: % + every: 10s + warn: ($web_log_1m_requests > 120) ? ($this > (($status >= $WARNING ) ? ( 1 ) : ( 20 )) ) : ( 0 ) + delay: up 2m down 15m multiplier 1.5 max 1h + info: ratio of redirection HTTP requests over the last minute (3xx except 304) + to: webmaster + + template: web_log_1m_bad_requests + on: web_log.type_requests + class: Errors + type: Web Server +component: Web log + families: * + lookup: sum -1m unaligned of bad + calc: $this * 100 / $web_log_1m_requests + units: % + every: 10s + warn: ($web_log_1m_requests > 120) ? ($this > (($status >= $WARNING) ? 
( 10 ) : ( 30 )) ) : ( 0 ) + delay: up 2m down 15m multiplier 1.5 max 1h + info: ratio of client error HTTP requests over the last minute (4xx except 401) + to: webmaster + + template: web_log_1m_internal_errors + on: web_log.type_requests + class: Errors + type: Web Server +component: Web log + families: * + lookup: sum -1m unaligned of error + calc: $this * 100 / $web_log_1m_requests + units: % + every: 10s + warn: ($web_log_1m_requests > 120) ? ($this > (($status >= $WARNING) ? ( 1 ) : ( 2 )) ) : ( 0 ) + crit: ($web_log_1m_requests > 120) ? ($this > (($status == $CRITICAL) ? ( 2 ) : ( 5 )) ) : ( 0 ) + delay: up 2m down 15m multiplier 1.5 max 1h + info: ratio of server error HTTP requests over the last minute (5xx) + to: webmaster + +# ----------------------------------------------------------------------------- +# web slow + +# the following alarms trigger only when there are enough data. +# we assume there are enough data when: +# +# $1m_requests > 120 +# +# i.e. when there are at least 120 requests during the last minute + + template: web_log_10m_response_time + on: web_log.request_processing_time + class: Latency + type: System +component: Web log + families: * + lookup: average -10m unaligned of avg + units: ms + every: 30s + info: average HTTP response time over the last 10 minutes + + template: web_log_web_slow + on: web_log.request_processing_time + class: Latency + type: Web Server +component: Web log + families: * + lookup: average -1m unaligned of avg + units: ms + every: 10s + green: 500 + red: 1000 + warn: ($web_log_1m_requests > 120) ? ($this > $green && $this > ($web_log_10m_response_time * 2) ) : ( 0 ) + crit: ($web_log_1m_requests > 120) ? 
($this > $red && $this > ($web_log_10m_response_time * 4) ) : ( 0 ) + delay: down 15m multiplier 1.5 max 1h + info: average HTTP response time over the last 1 minute + options: no-clear-notification + to: webmaster + +# ----------------------------------------------------------------------------- +# web too many or too few requests + +# the following alarms trigger only when there are enough data. +# we assume there are enough data when: +# +# $5m_successful_old > 120 +# +# i.e. when there were at least 120 requests during the 5 minutes starting +# at -10m and ending at -5m + + template: web_log_5m_successful_old + on: web_log.type_requests + class: Workload + type: Web Server +component: Web log + families: * + lookup: average -5m at -5m unaligned of success + units: requests/s + every: 30s + info: average number of successful HTTP requests for the 5 minutes starting 10 minutes ago + + template: web_log_5m_successful + on: web_log.type_requests + class: Workload + type: Web Server +component: Web log + families: * + lookup: average -5m unaligned of success + units: requests/s + every: 30s + info: average number of successful HTTP requests over the last 5 minutes + + template: web_log_5m_requests_ratio + on: web_log.type_requests + class: Workload + type: Web Server +component: Web log + families: * + calc: ($web_log_5m_successful_old > 0)?($web_log_5m_successful * 100 / $web_log_5m_successful_old):(100) + units: % + every: 30s + warn: ($web_log_5m_successful_old > 120) ? ($this > 200 OR $this < 50) : (0) + crit: ($web_log_5m_successful_old > 120) ? 
($this > 400 OR $this < 25) : (0) + delay: down 15m multiplier 1.5 max 1h + options: no-clear-notification + info: ratio of successful HTTP requests over the last 5 minutes, \ + compared with the previous 5 minutes \ + (clear notification for this alarm will not be sent) + to: webmaster diff --git a/health/health.d/whoisquery.conf b/health/health.d/whoisquery.conf new file mode 100644 index 0000000..be5eb58 --- /dev/null +++ b/health/health.d/whoisquery.conf @@ -0,0 +1,13 @@ + + template: whoisquery_days_until_expiration + on: whoisquery.time_until_expiration + class: Utilization + type: Other +component: WHOIS + calc: $expiry + units: seconds + every: 60s + warn: $this < $days_until_expiration_warning*24*60*60 + crit: $this < $days_until_expiration_critical*24*60*60 + info: time until the domain name registration expires + to: webmaster diff --git a/health/health.d/wmi.conf b/health/health.d/wmi.conf new file mode 100644 index 0000000..90d39ce --- /dev/null +++ b/health/health.d/wmi.conf @@ -0,0 +1,139 @@ + +## CPU + + template: wmi_10min_cpu_usage + on: wmi.cpu_utilization_total + class: Utilization + type: Windows +component: CPU + os: linux + hosts: * + lookup: average -10m unaligned match-names of dpc,user,privileged,interrupt + units: % + every: 1m + warn: $this > (($status >= $WARNING) ? (75) : (85)) + crit: $this > (($status == $CRITICAL) ? (85) : (95)) + delay: down 15m multiplier 1.5 max 1h + info: average CPU utilization over the last 10 minutes + to: sysadmin + + +## Memory + + template: wmi_ram_in_use + on: wmi.memory_utilization + class: Utilization + type: Windows +component: Memory + os: linux + hosts: * + calc: ($used) * 100 / ($used + $available) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (80) : (90)) + crit: $this > (($status == $CRITICAL) ? 
(90) : (98)) + delay: down 15m multiplier 1.5 max 1h + info: memory utilization + to: sysadmin + + template: wmi_swap_in_use + on: wmi.memory_swap_utilization + class: Utilization + type: Windows +component: Memory + os: linux + hosts: * + calc: ($used) * 100 / ($used + $available) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (80) : (90)) + crit: $this > (($status == $CRITICAL) ? (90) : (98)) + delay: down 15m multiplier 1.5 max 1h + info: swap memory utilization + to: sysadmin + + +## Network + + template: wmi_inbound_packets_discarded + on: wmi.net_discarded + class: Errors + type: Windows +component: Network + os: linux + hosts: * + families: * + lookup: sum -10m unaligned absolute match-names of inbound + units: packets + every: 1m + warn: $this >= 5 + delay: down 1h multiplier 1.5 max 2h + info: number of inbound discarded packets for the network interface in the last 10 minutes + to: sysadmin + + template: wmi_outbound_packets_discarded + on: wmi.net_discarded + class: Errors + type: Windows +component: Network + os: linux + hosts: * + families: * + lookup: sum -10m unaligned absolute match-names of outbound + units: packets + every: 1m + warn: $this >= 5 + delay: down 1h multiplier 1.5 max 2h + info: number of outbound discarded packets for the network interface in the last 10 minutes + to: sysadmin + + template: wmi_inbound_packets_errors + on: wmi.net_errors + class: Errors + type: Windows +component: Network + os: linux + hosts: * + families: * + lookup: sum -10m unaligned absolute match-names of inbound + units: packets + every: 1m + warn: $this >= 5 + delay: down 1h multiplier 1.5 max 2h + info: number of inbound errors for the network interface in the last 10 minutes + to: sysadmin + + template: wmi_outbound_packets_errors + on: wmi.net_errors + class: Errors + type: Windows +component: Network + os: linux + hosts: * + families: * + lookup: sum -10m unaligned absolute match-names of outbound + units: packets + every: 1m + warn: 
$this >= 5 + delay: down 1h multiplier 1.5 max 2h + info: number of outbound errors for the network interface in the last 10 minutes + to: sysadmin + + +## Disk + + template: wmi_disk_in_use + on: wmi.logical_disk_utilization + class: Utilization + type: Windows +component: Disk + os: linux + hosts: * + calc: ($used) * 100 / ($used + $free) + units: % + every: 10s + warn: $this > (($status >= $WARNING) ? (80) : (90)) + crit: $this > (($status == $CRITICAL) ? (90) : (98)) + delay: down 15m multiplier 1.5 max 1h + info: disk space utilization + to: sysadmin diff --git a/health/health.d/x509check.conf b/health/health.d/x509check.conf new file mode 100644 index 0000000..fc69d02 --- /dev/null +++ b/health/health.d/x509check.conf @@ -0,0 +1,24 @@ + + template: x509check_days_until_expiration + on: x509check.time_until_expiration + class: Latency + type: Certificates +component: x509 certificates + calc: $expiry + units: seconds + every: 60s + warn: $this < $days_until_expiration_warning*24*60*60 + crit: $this < $days_until_expiration_critical*24*60*60 + info: time until x509 certificate expires + to: webmaster + + template: x509check_revocation_status + on: x509check.revocation_status + class: Errors + type: Certificates +component: x509 certificates + calc: $revoked + every: 60s + crit: $this != nan AND $this != 0 + info: x509 certificate revocation status (0: revoked, 1: valid) + to: webmaster diff --git a/health/health.d/zfs.conf b/health/health.d/zfs.conf new file mode 100644 index 0000000..785838d --- /dev/null +++ b/health/health.d/zfs.conf @@ -0,0 +1,41 @@ + + alarm: zfs_memory_throttle + on: zfs.memory_ops + class: Utilization + type: System +component: File system + lookup: sum -10m unaligned absolute of throttled + units: events + every: 1m + warn: $this > 0 + delay: down 1h multiplier 1.5 max 2h + info: number of times ZFS had to limit the ARC growth in the last 10 minutes + to: sysadmin + +# ZFS pool state + + template: zfs_pool_state_warn + on: zfspool.state 
+ class: Errors + type: System +component: File system + calc: $degraded + units: boolean + every: 10s + warn: $this > 0 + delay: down 1m multiplier 1.5 max 1h + info: ZFS pool $family state is degraded + to: sysadmin + + template: zfs_pool_state_crit + on: zfspool.state + class: Errors + type: System +component: File system + calc: $faulted + $unavail + units: boolean + every: 10s + crit: $this > 0 + delay: down 1m multiplier 1.5 max 1h + info: ZFS pool $family state is faulted or unavail + to: sysadmin diff --git a/health/health.h b/health/health.h new file mode 100644 index 0000000..15d8326 --- /dev/null +++ b/health/health.h @@ -0,0 +1,103 @@ +// SPDX-License-Identifier: GPL-3.0-or-later + +#ifndef NETDATA_HEALTH_H +#define NETDATA_HEALTH_H 1 + +#include "daemon/common.h" + +extern unsigned int default_health_enabled; + +#define HEALTH_ENTRY_FLAG_PROCESSED 0x00000001 +#define HEALTH_ENTRY_FLAG_UPDATED 0x00000002 +#define HEALTH_ENTRY_FLAG_EXEC_RUN 0x00000004 +#define HEALTH_ENTRY_FLAG_EXEC_FAILED 0x00000008 +#define HEALTH_ENTRY_FLAG_SILENCED 0x00000010 +#define HEALTH_ENTRY_RUN_ONCE 0x00000020 +#define HEALTH_ENTRY_FLAG_EXEC_IN_PROGRESS 0x00000040 +#define HEALTH_ENTRY_FLAG_IS_REPEATING 0x00000080 + +#define HEALTH_ENTRY_FLAG_SAVED 0x10000000 +#define HEALTH_ENTRY_FLAG_ACLK_QUEUED 0x20000000 +#define HEALTH_ENTRY_FLAG_NO_CLEAR_NOTIFICATION 0x80000000 + +#ifndef HEALTH_LISTEN_PORT +#define HEALTH_LISTEN_PORT 19998 +#endif + +#ifndef HEALTH_LISTEN_BACKLOG +#define HEALTH_LISTEN_BACKLOG 4096 +#endif + +#define HEALTH_SILENCERS_MAX_FILE_LEN 10000 + +extern char *silencers_filename; + +void health_init(void); + +void health_reload(void); + +void health_aggregate_alarms(RRDHOST *host, BUFFER *wb, BUFFER* context, RRDCALC_STATUS status); +void health_alarms2json(RRDHOST *host, BUFFER *wb, int all); +void health_alarms_values2json(RRDHOST *host, BUFFER *wb, int all); +void health_alarm_log2json(RRDHOST *host, BUFFER *wb, uint32_t after, char *chart); + +void 
health_api_v1_chart_variables2json(RRDSET *st, BUFFER *buf); +void health_api_v1_chart_custom_variables2json(RRDSET *st, BUFFER *buf); + +int health_alarm_log_open(RRDHOST *host); +void health_alarm_log_save(RRDHOST *host, ALARM_ENTRY *ae); +void health_alarm_log_load(RRDHOST *host); + +void health_thread_spawn(RRDHOST *host); +void health_thread_stop(RRDHOST *host); + +ALARM_ENTRY* health_create_alarm_entry( + RRDHOST *host, + uint32_t alarm_id, + uint32_t alarm_event_id, + const uuid_t config_hash_id, + time_t when, + STRING *name, + STRING *chart, + STRING *chart_context, + STRING *family, + STRING *classification, + STRING *component, + STRING *type, + STRING *exec, + STRING *recipient, + time_t duration, + NETDATA_DOUBLE old_value, + NETDATA_DOUBLE new_value, + RRDCALC_STATUS old_status, + RRDCALC_STATUS new_status, + STRING *source, + STRING *units, + STRING *info, + int delay, + uint32_t flags); + +void health_alarm_log_add_entry(RRDHOST *host, ALARM_ENTRY *ae); + +struct health_state { + RRDHOST *host; + netdata_thread_t thread; +}; + +void health_readdir(RRDHOST *host, const char *user_path, const char *stock_path, const char *subpath); +char *health_user_config_dir(void); +char *health_stock_config_dir(void); +void health_alarm_log_free(RRDHOST *host); + +void health_alarm_log_free_one_nochecks_nounlink(ALARM_ENTRY *ae); + +void *health_cmdapi_thread(void *ptr); + +void health_label_log_save(RRDHOST *host); + +char *health_edit_command_from_source(const char *source); +void sql_refresh_hashes(void); + +void health_add_host_labels(void); + +#endif //NETDATA_HEALTH_H diff --git a/health/health_config.c b/health/health_config.c new file mode 100644 index 0000000..f9decfa --- /dev/null +++ b/health/health_config.c @@ -0,0 +1,1182 @@ +// SPDX-License-Identifier: GPL-3.0-or-later + +#include "health.h" + +#define HEALTH_CONF_MAX_LINE 4096 + +#define HEALTH_ALARM_KEY "alarm" +#define HEALTH_TEMPLATE_KEY "template" +#define HEALTH_ON_KEY "on" +#define 
HEALTH_HOST_KEY "hosts" +#define HEALTH_OS_KEY "os" +#define HEALTH_FAMILIES_KEY "families" +#define HEALTH_PLUGIN_KEY "plugin" +#define HEALTH_MODULE_KEY "module" +#define HEALTH_CHARTS_KEY "charts" +#define HEALTH_LOOKUP_KEY "lookup" +#define HEALTH_CALC_KEY "calc" +#define HEALTH_EVERY_KEY "every" +#define HEALTH_GREEN_KEY "green" +#define HEALTH_RED_KEY "red" +#define HEALTH_WARN_KEY "warn" +#define HEALTH_CRIT_KEY "crit" +#define HEALTH_EXEC_KEY "exec" +#define HEALTH_RECIPIENT_KEY "to" +#define HEALTH_UNITS_KEY "units" +#define HEALTH_INFO_KEY "info" +#define HEALTH_CLASS_KEY "class" +#define HEALTH_COMPONENT_KEY "component" +#define HEALTH_TYPE_KEY "type" +#define HEALTH_DELAY_KEY "delay" +#define HEALTH_OPTIONS_KEY "options" +#define HEALTH_REPEAT_KEY "repeat" +#define HEALTH_HOST_LABEL_KEY "host labels" +#define HEALTH_FOREACH_KEY "foreach" + +static inline int health_parse_delay( + size_t line, const char *filename, char *string, + int *delay_up_duration, + int *delay_down_duration, + int *delay_max_duration, + float *delay_multiplier) { + + char given_up = 0; + char given_down = 0; + char given_max = 0; + char given_multiplier = 0; + + char *s = string; + while(*s) { + char *key = s; + + while(*s && !isspace(*s)) s++; + while(*s && isspace(*s)) *s++ = '\0'; + + if(!*key) break; + + char *value = s; + while(*s && !isspace(*s)) s++; + while(*s && isspace(*s)) *s++ = '\0'; + + if(!strcasecmp(key, "up")) { + if (!config_parse_duration(value, delay_up_duration)) { + error("Health configuration at line %zu of file '%s': invalid value '%s' for '%s' keyword", + line, filename, value, key); + } + else given_up = 1; + } + else if(!strcasecmp(key, "down")) { + if (!config_parse_duration(value, delay_down_duration)) { + error("Health configuration at line %zu of file '%s': invalid value '%s' for '%s' keyword", + line, filename, value, key); + } + else given_down = 1; + } + else if(!strcasecmp(key, "multiplier")) { + *delay_multiplier = strtof(value, NULL); + 
if(isnan(*delay_multiplier) || isinf(*delay_multiplier) || islessequal(*delay_multiplier, 0)) { + error("Health configuration at line %zu of file '%s': invalid value '%s' for '%s' keyword", + line, filename, value, key); + } + else given_multiplier = 1; + } + else if(!strcasecmp(key, "max")) { + if (!config_parse_duration(value, delay_max_duration)) { + error("Health configuration at line %zu of file '%s': invalid value '%s' for '%s' keyword", + line, filename, value, key); + } + else given_max = 1; + } + else { + error("Health configuration at line %zu of file '%s': unknown keyword '%s'", + line, filename, key); + } + } + + if(!given_up) + *delay_up_duration = 0; + + if(!given_down) + *delay_down_duration = 0; + + if(!given_multiplier) + *delay_multiplier = 1.0; + + if(!given_max) { + if((*delay_max_duration) < (*delay_up_duration) * (*delay_multiplier)) + *delay_max_duration = (int)((*delay_up_duration) * (*delay_multiplier)); + + if((*delay_max_duration) < (*delay_down_duration) * (*delay_multiplier)) + *delay_max_duration = (int)((*delay_down_duration) * (*delay_multiplier)); + } + + return 1; +} + +static inline uint32_t health_parse_options(const char *s) { + uint32_t options = 0; + char buf[100+1] = ""; + + while(*s) { + buf[0] = '\0'; + + // skip spaces + while(*s && isspace(*s)) + s++; + + // find the next space + size_t count = 0; + while(*s && count < 100 && !isspace(*s)) + buf[count++] = *s++; + + if(buf[0]) { + buf[count] = '\0'; + + if(!strcasecmp(buf, "no-clear-notification") || !strcasecmp(buf, "no-clear")) + options |= RRDCALC_OPTION_NO_CLEAR_NOTIFICATION; + else + error("Ignoring unknown alarm option '%s'", buf); + } + } + + return options; +} + +static inline int health_parse_repeat( + size_t line, + const char *file, + char *string, + uint32_t *warn_repeat_every, + uint32_t *crit_repeat_every +) { + + char *s = string; + while(*s) { + char *key = s; + + while(*s && !isspace(*s)) s++; + while(*s && isspace(*s)) *s++ = '\0'; + + if(!*key) break; + 
+ char *value = s; + while(*s && !isspace(*s)) s++; + while(*s && isspace(*s)) *s++ = '\0'; + + if(!strcasecmp(key, "off")) { + *warn_repeat_every = 0; + *crit_repeat_every = 0; + return 1; + } + if(!strcasecmp(key, "warning")) { + if (!config_parse_duration(value, (int*)warn_repeat_every)) { + error("Health configuration at line %zu of file '%s': invalid value '%s' for '%s' keyword", + line, file, value, key); + } + } + else if(!strcasecmp(key, "critical")) { + if (!config_parse_duration(value, (int*)crit_repeat_every)) { + error("Health configuration at line %zu of file '%s': invalid value '%s' for '%s' keyword", + line, file, value, key); + } + } + } + + return 1; +} + +/** + * Health pattern from Foreach + * + * Create a new simple pattern using the user input + * + * @param s the string that will be used to create the simple pattern. + */ + +static void dimension_remove_pipe_comma(char *str) { + while(*str) { + if(*str == '|' || *str == ',') *str = ' '; + str++; + } +} + +static SIMPLE_PATTERN *health_pattern_from_foreach(const char *s) { + char *convert= strdupz(s); + SIMPLE_PATTERN *val = NULL; + + if(convert) { + dimension_remove_pipe_comma(convert); + val = simple_pattern_create(convert, NULL, SIMPLE_PATTERN_EXACT); + freez(convert); + } + + return val; +} + +static inline int health_parse_db_lookup( + size_t line, const char *filename, char *string, + RRDR_GROUPING *group_method, int *after, int *before, int *every, + RRDCALC_OPTIONS *options, STRING **dimensions, STRING **foreachdim +) { + debug(D_HEALTH, "Health configuration parsing database lookup %zu@%s: %s", line, filename, string); + + if(*dimensions) string_freez(*dimensions); + if(*foreachdim) string_freez(*foreachdim); + *dimensions = NULL; + *foreachdim = NULL; + *after = 0; + *before = 0; + *every = 0; + *options = (*options) & RRDCALC_ALL_OPTIONS_EXCLUDING_THE_RRDR_ONES; // preserve rrdcalc options + + char *s = string, *key; + + // first is the group method + key = s; + while(*s && 
!isspace(*s)) s++; + while(*s && isspace(*s)) *s++ = '\0'; + if(!*s) { + error("Health configuration invalid chart calculation at line %zu of file '%s': expected group method followed by the 'after' time, but got '%s'", + line, filename, key); + return 0; + } + + if((*group_method = web_client_api_request_v1_data_group(key, RRDR_GROUPING_UNDEFINED)) == RRDR_GROUPING_UNDEFINED) { + error("Health configuration at line %zu of file '%s': invalid group method '%s'", + line, filename, key); + return 0; + } + + // then is the 'after' time + key = s; + while(*s && !isspace(*s)) s++; + while(*s && isspace(*s)) *s++ = '\0'; + + if(!config_parse_duration(key, after)) { + error("Health configuration at line %zu of file '%s': invalid duration '%s' after group method", + line, filename, key); + return 0; + } + + // sane defaults + *every = ABS(*after); + + // now we may have optional parameters + while(*s) { + key = s; + while(*s && !isspace(*s)) s++; + while(*s && isspace(*s)) *s++ = '\0'; + if(!*key) break; + + if(!strcasecmp(key, "at")) { + char *value = s; + while(*s && !isspace(*s)) s++; + while(*s && isspace(*s)) *s++ = '\0'; + + if (!config_parse_duration(value, before)) { + error("Health configuration at line %zu of file '%s': invalid duration '%s' for '%s' keyword", + line, filename, value, key); + } + } + else if(!strcasecmp(key, HEALTH_EVERY_KEY)) { + char *value = s; + while(*s && !isspace(*s)) s++; + while(*s && isspace(*s)) *s++ = '\0'; + + if (!config_parse_duration(value, every)) { + error("Health configuration at line %zu of file '%s': invalid duration '%s' for '%s' keyword", + line, filename, value, key); + } + } + else if(!strcasecmp(key, "absolute") || !strcasecmp(key, "abs") || !strcasecmp(key, "absolute_sum")) { + *options |= RRDR_OPTION_ABSOLUTE; + } + else if(!strcasecmp(key, "min2max")) { + *options |= RRDR_OPTION_MIN2MAX; + } + else if(!strcasecmp(key, "null2zero")) { + *options |= RRDR_OPTION_NULL2ZERO; + } + else if(!strcasecmp(key, "percentage")) { + 
*options |= RRDR_OPTION_PERCENTAGE; + } + else if(!strcasecmp(key, "unaligned")) { + *options |= RRDR_OPTION_NOT_ALIGNED; + } + else if(!strcasecmp(key, "anomaly-bit")) { + *options |= RRDR_OPTION_ANOMALY_BIT; + } + else if(!strcasecmp(key, "match-ids") || !strcasecmp(key, "match_ids")) { + *options |= RRDR_OPTION_MATCH_IDS; + } + else if(!strcasecmp(key, "match-names") || !strcasecmp(key, "match_names")) { + *options |= RRDR_OPTION_MATCH_NAMES; + } + else if(!strcasecmp(key, "of")) { + char *find = NULL; + if(*s && strcasecmp(s, "all") != 0) { + find = strcasestr(s, " foreach"); + if(find) { + *find = '\0'; + } + *dimensions = string_strdupz(s); + } + + if(!find) { + break; + } + s = ++find; + } + else if(!strcasecmp(key, HEALTH_FOREACH_KEY )) { + *foreachdim = string_strdupz(s); + break; + } + else { + error("Health configuration at line %zu of file '%s': unknown keyword '%s'", + line, filename, key); + } + } + + return 1; +} + +static inline STRING *health_source_file(size_t line, const char *file) { + char buffer[FILENAME_MAX + 1]; + snprintfz(buffer, FILENAME_MAX, "%zu@%s", line, file); + return string_strdupz(buffer); +} + +char *health_edit_command_from_source(const char *source) +{ + char buffer[FILENAME_MAX + 1]; + char *temp = strdupz(source); + char *line_num = strchr(temp, '@'); + char *file_no_path = strrchr(temp, '/'); + + if (likely(file_no_path && line_num)) { + *line_num = '\0'; + snprintfz( + buffer, + FILENAME_MAX, + "sudo %s/edit-config health.d/%s=%s=%s", + netdata_configured_user_config_dir, + file_no_path + 1, + temp, + rrdhost_registry_hostname(localhost)); + } else + buffer[0] = '\0'; + + freez(temp); + return strdupz(buffer); +} + +static inline void strip_quotes(char *s) { + while(*s) { + if(*s == '\'' || *s == '"') *s = ' '; + s++; + } +} + +static inline void alert_config_free(struct alert_config *cfg) +{ + string_freez(cfg->alarm); + string_freez(cfg->template_key); + string_freez(cfg->os); + string_freez(cfg->host); + 
string_freez(cfg->on); + string_freez(cfg->families); + string_freez(cfg->plugin); + string_freez(cfg->module); + string_freez(cfg->charts); + string_freez(cfg->lookup); + string_freez(cfg->calc); + string_freez(cfg->warn); + string_freez(cfg->crit); + string_freez(cfg->every); + string_freez(cfg->green); + string_freez(cfg->red); + string_freez(cfg->exec); + string_freez(cfg->to); + string_freez(cfg->units); + string_freez(cfg->info); + string_freez(cfg->classification); + string_freez(cfg->component); + string_freez(cfg->type); + string_freez(cfg->delay); + string_freez(cfg->options); + string_freez(cfg->repeat); + string_freez(cfg->host_labels); + string_freez(cfg->p_db_lookup_dimensions); + string_freez(cfg->p_db_lookup_method); + freez(cfg); +} + +int sql_store_hashes = 1; +static int health_readfile(const char *filename, void *data) { + RRDHOST *host = (RRDHOST *)data; + + debug(D_HEALTH, "Health configuration reading file '%s'", filename); + + static uint32_t + hash_alarm = 0, + hash_template = 0, + hash_os = 0, + hash_on = 0, + hash_host = 0, + hash_families = 0, + hash_plugin = 0, + hash_module = 0, + hash_charts = 0, + hash_calc = 0, + hash_green = 0, + hash_red = 0, + hash_warn = 0, + hash_crit = 0, + hash_exec = 0, + hash_every = 0, + hash_lookup = 0, + hash_units = 0, + hash_info = 0, + hash_class = 0, + hash_component = 0, + hash_type = 0, + hash_recipient = 0, + hash_delay = 0, + hash_options = 0, + hash_repeat = 0, + hash_host_label = 0; + + char buffer[HEALTH_CONF_MAX_LINE + 1]; + + if(unlikely(!hash_alarm)) { + hash_alarm = simple_uhash(HEALTH_ALARM_KEY); + hash_template = simple_uhash(HEALTH_TEMPLATE_KEY); + hash_on = simple_uhash(HEALTH_ON_KEY); + hash_os = simple_uhash(HEALTH_OS_KEY); + hash_host = simple_uhash(HEALTH_HOST_KEY); + hash_families = simple_uhash(HEALTH_FAMILIES_KEY); + hash_plugin = simple_uhash(HEALTH_PLUGIN_KEY); + hash_module = simple_uhash(HEALTH_MODULE_KEY); + hash_charts = simple_uhash(HEALTH_CHARTS_KEY); + hash_calc = 
simple_uhash(HEALTH_CALC_KEY); + hash_lookup = simple_uhash(HEALTH_LOOKUP_KEY); + hash_green = simple_uhash(HEALTH_GREEN_KEY); + hash_red = simple_uhash(HEALTH_RED_KEY); + hash_warn = simple_uhash(HEALTH_WARN_KEY); + hash_crit = simple_uhash(HEALTH_CRIT_KEY); + hash_exec = simple_uhash(HEALTH_EXEC_KEY); + hash_every = simple_uhash(HEALTH_EVERY_KEY); + hash_units = simple_hash(HEALTH_UNITS_KEY); + hash_info = simple_hash(HEALTH_INFO_KEY); + hash_class = simple_uhash(HEALTH_CLASS_KEY); + hash_component = simple_uhash(HEALTH_COMPONENT_KEY); + hash_type = simple_uhash(HEALTH_TYPE_KEY); + hash_recipient = simple_hash(HEALTH_RECIPIENT_KEY); + hash_delay = simple_uhash(HEALTH_DELAY_KEY); + hash_options = simple_uhash(HEALTH_OPTIONS_KEY); + hash_repeat = simple_uhash(HEALTH_REPEAT_KEY); + hash_host_label = simple_uhash(HEALTH_HOST_LABEL_KEY); + } + + FILE *fp = fopen(filename, "r"); + if(!fp) { + error("Health configuration cannot read file '%s'.", filename); + return 0; + } + + RRDCALC *rc = NULL; + RRDCALCTEMPLATE *rt = NULL; + struct alert_config *alert_cfg = NULL; + + int ignore_this = 0; + size_t line = 0, append = 0; + char *s; + while((s = fgets(&buffer[append], (int)(HEALTH_CONF_MAX_LINE - append), fp)) || append) { + int stop_appending = !s; + line++; + s = trim(buffer); + if(!s || *s == '#') continue; + + append = strlen(s); + if(!stop_appending && s[append - 1] == '\\') { + s[append - 1] = ' '; + append = &s[append] - buffer; + if(append < HEALTH_CONF_MAX_LINE) + continue; + else { + error("Health configuration has too long multi-line at line %zu of file '%s'.", line, filename); + } + } + append = 0; + + char *key = s; + while(*s && *s != ':') s++; + if(!*s) { + error("Health configuration has invalid line %zu of file '%s'. It does not contain a ':'. 
Ignoring it.", line, filename); + continue; + } + *s = '\0'; + s++; + + char *value = s; + key = trim_all(key); + value = trim_all(value); + + if(!key) { + error("Health configuration has invalid line %zu of file '%s'. Keyword is empty. Ignoring it.", line, filename); + continue; + } + + if(!value) { + error("Health configuration has invalid line %zu of file '%s'. value is empty. Ignoring it.", line, filename); + continue; + } + + uint32_t hash = simple_uhash(key); + + if(hash == hash_alarm && !strcasecmp(key, HEALTH_ALARM_KEY)) { + if(rc) { + if(!alert_hash_and_store_config(rc->config_hash_id, alert_cfg, sql_store_hashes) || ignore_this) + rrdcalc_free_unused_rrdcalc_loaded_from_config(rc); + else + rrdcalc_add_from_config(host, rc); + + // health_add_alarms_loop(host, rc, ignore_this) ; + } + + if(rt) { + if(!alert_hash_and_store_config(rt->config_hash_id, alert_cfg, sql_store_hashes) || ignore_this) + rrdcalctemplate_free_unused_rrdcalctemplate_loaded_from_config(rt); + else + rrdcalctemplate_add_from_config(host, rt); + + rt = NULL; + } + + rc = callocz(1, sizeof(RRDCALC)); + rc->next_event_id = 1; + + { + char *tmp = strdupz(value); + if(rrdvar_fix_name(tmp)) + error("Health configuration renamed alarm '%s' to '%s'", value, tmp); + + rc->name = string_strdupz(tmp); + freez(tmp); + } + + rc->source = health_source_file(line, filename); + rc->green = NAN; + rc->red = NAN; + rc->value = NAN; + rc->old_value = NAN; + rc->delay_multiplier = 1.0; + rc->old_status = RRDCALC_STATUS_UNINITIALIZED; + rc->warn_repeat_every = host->health_default_warn_repeat_every; + rc->crit_repeat_every = host->health_default_crit_repeat_every; + if (alert_cfg) + alert_config_free(alert_cfg); + alert_cfg = callocz(1, sizeof(struct alert_config)); + + alert_cfg->alarm = string_dup(rc->name); + ignore_this = 0; + } + else if(hash == hash_template && !strcasecmp(key, HEALTH_TEMPLATE_KEY)) { + if(rc) { +// health_add_alarms_loop(host, rc, ignore_this) ; + 
if(!alert_hash_and_store_config(rc->config_hash_id, alert_cfg, sql_store_hashes) || ignore_this) + rrdcalc_free_unused_rrdcalc_loaded_from_config(rc); + else + rrdcalc_add_from_config(host, rc); + + rc = NULL; + } + + if(rt) { + if(!alert_hash_and_store_config(rt->config_hash_id, alert_cfg, sql_store_hashes) || ignore_this) + rrdcalctemplate_free_unused_rrdcalctemplate_loaded_from_config(rt); + else + rrdcalctemplate_add_from_config(host, rt); + } + + rt = callocz(1, sizeof(RRDCALCTEMPLATE)); + + { + char *tmp = strdupz(value); + if(rrdvar_fix_name(tmp)) + error("Health configuration renamed template '%s' to '%s'", value, tmp); + + rt->name = string_strdupz(tmp); + freez(tmp); + } + + rt->source = health_source_file(line, filename); + rt->green = NAN; + rt->red = NAN; + rt->delay_multiplier = (float)1.0; + rt->warn_repeat_every = host->health_default_warn_repeat_every; + rt->crit_repeat_every = host->health_default_crit_repeat_every; + if (alert_cfg) + alert_config_free(alert_cfg); + alert_cfg = callocz(1, sizeof(struct alert_config)); + + alert_cfg->template_key = string_dup(rt->name); + ignore_this = 0; + } + else if(hash == hash_os && !strcasecmp(key, HEALTH_OS_KEY)) { + char *os_match = value; + if (alert_cfg) alert_cfg->os = string_strdupz(value); + SIMPLE_PATTERN *os_pattern = simple_pattern_create(os_match, NULL, SIMPLE_PATTERN_EXACT); + + if(!simple_pattern_matches(os_pattern, rrdhost_os(host))) { + if(rc) + debug(D_HEALTH, "HEALTH on '%s' ignoring alarm '%s' defined at %zu@%s: host O/S does not match '%s'", rrdhost_hostname(host), rrdcalc_name(rc), line, filename, os_match); + + if(rt) + debug(D_HEALTH, "HEALTH on '%s' ignoring template '%s' defined at %zu@%s: host O/S does not match '%s'", rrdhost_hostname(host), rrdcalctemplate_name(rt), line, filename, os_match); + + ignore_this = 1; + } + + simple_pattern_free(os_pattern); + } + else if(hash == hash_host && !strcasecmp(key, HEALTH_HOST_KEY)) { + char *host_match = value; + if (alert_cfg) 
alert_cfg->host = string_strdupz(value); + SIMPLE_PATTERN *host_pattern = simple_pattern_create(host_match, NULL, SIMPLE_PATTERN_EXACT); + + if(!simple_pattern_matches(host_pattern, rrdhost_hostname(host))) { + if(rc) + debug(D_HEALTH, "HEALTH on '%s' ignoring alarm '%s' defined at %zu@%s: hostname does not match '%s'", rrdhost_hostname(host), rrdcalc_name(rc), line, filename, host_match); + + if(rt) + debug(D_HEALTH, "HEALTH on '%s' ignoring template '%s' defined at %zu@%s: hostname does not match '%s'", rrdhost_hostname(host), rrdcalctemplate_name(rt), line, filename, host_match); + + ignore_this = 1; + } + + simple_pattern_free(host_pattern); + } + else if(rc) { + if(hash == hash_on && !strcasecmp(key, HEALTH_ON_KEY)) { + alert_cfg->on = string_strdupz(value); + if(rc->chart) { + if(strcmp(rrdcalc_chart_name(rc), value) != 0) + error("Health configuration at line %zu of file '%s' for alarm '%s' has key '%s' twice, once with value '%s' and later with value '%s'. Using ('%s').", + line, filename, rrdcalc_name(rc), key, rrdcalc_chart_name(rc), value, value); + + string_freez(rc->chart); + } + rc->chart = string_strdupz(value); + } + else if(hash == hash_class && !strcasecmp(key, HEALTH_CLASS_KEY)) { + strip_quotes(value); + + alert_cfg->classification = string_strdupz(value); + if(rc->classification) { + if(strcmp(rrdcalc_classification(rc), value) != 0) + error("Health configuration at line %zu of file '%s' for alarm '%s' has key '%s' twice, once with value '%s' and later with value '%s'. 
Using ('%s').", + line, filename, rrdcalc_name(rc), key, rrdcalc_classification(rc), value, value); + + string_freez(rc->classification); + } + rc->classification = string_strdupz(value); + } + else if(hash == hash_component && !strcasecmp(key, HEALTH_COMPONENT_KEY)) { + strip_quotes(value); + + alert_cfg->component = string_strdupz(value); + if(rc->component) { + if(strcmp(rrdcalc_component(rc), value) != 0) + error("Health configuration at line %zu of file '%s' for alarm '%s' has key '%s' twice, once with value '%s' and later with value '%s'. Using ('%s').", + line, filename, rrdcalc_name(rc), key, rrdcalc_component(rc), value, value); + + string_freez(rc->component); + } + rc->component = string_strdupz(value); + } + else if(hash == hash_type && !strcasecmp(key, HEALTH_TYPE_KEY)) { + strip_quotes(value); + + alert_cfg->type = string_strdupz(value); + if(rc->type) { + if(strcmp(rrdcalc_type(rc), value) != 0) + error("Health configuration at line %zu of file '%s' for alarm '%s' has key '%s' twice, once with value '%s' and later with value '%s'. 
Using ('%s').", + line, filename, rrdcalc_name(rc), key, rrdcalc_type(rc), value, value); + + string_freez(rc->type); + } + rc->type = string_strdupz(value); + } + else if(hash == hash_lookup && !strcasecmp(key, HEALTH_LOOKUP_KEY)) { + alert_cfg->lookup = string_strdupz(value); + health_parse_db_lookup(line, filename, value, &rc->group, &rc->after, &rc->before, + &rc->update_every, &rc->options, &rc->dimensions, &rc->foreach_dimension); + + if(rc->foreach_dimension) + rc->foreach_dimension_pattern = health_pattern_from_foreach(rrdcalc_foreachdim(rc)); + + if (rc->after) { + if (rc->dimensions) + alert_cfg->p_db_lookup_dimensions = string_dup(rc->dimensions); + if (rc->group) + alert_cfg->p_db_lookup_method = string_strdupz(group_method2string(rc->group)); + alert_cfg->p_db_lookup_options = rc->options; + alert_cfg->p_db_lookup_after = rc->after; + alert_cfg->p_db_lookup_before = rc->before; + alert_cfg->p_update_every = rc->update_every; + } + } + else if(hash == hash_every && !strcasecmp(key, HEALTH_EVERY_KEY)) { + alert_cfg->every = string_strdupz(value); + if(!config_parse_duration(value, &rc->update_every)) + error("Health configuration at line %zu of file '%s' for alarm '%s' at key '%s' cannot parse duration: '%s'.", + line, filename, rrdcalc_name(rc), key, value); + alert_cfg->p_update_every = rc->update_every; + } + else if(hash == hash_green && !strcasecmp(key, HEALTH_GREEN_KEY)) { + alert_cfg->green = string_strdupz(value); + char *e; + rc->green = str2ndd(value, &e); + if(e && *e) { + error("Health configuration at line %zu of file '%s' for alarm '%s' at key '%s' leaves this string unmatched: '%s'.", + line, filename, rrdcalc_name(rc), key, e); + } + } + else if(hash == hash_red && !strcasecmp(key, HEALTH_RED_KEY)) { + alert_cfg->red = string_strdupz(value); + char *e; + rc->red = str2ndd(value, &e); + if(e && *e) { + error("Health configuration at line %zu of file '%s' for alarm '%s' at key '%s' leaves this string unmatched: '%s'.", + line, filename, 
rrdcalc_name(rc), key, e); + } + } + else if(hash == hash_calc && !strcasecmp(key, HEALTH_CALC_KEY)) { + alert_cfg->calc = string_strdupz(value); + const char *failed_at = NULL; + int error = 0; + rc->calculation = expression_parse(value, &failed_at, &error); + if(!rc->calculation) { + error("Health configuration at line %zu of file '%s' for alarm '%s' at key '%s' has unparse-able expression '%s': %s at '%s'", + line, filename, rrdcalc_name(rc), key, value, expression_strerror(error), failed_at); + } + } + else if(hash == hash_warn && !strcasecmp(key, HEALTH_WARN_KEY)) { + alert_cfg->warn = string_strdupz(value); + const char *failed_at = NULL; + int error = 0; + rc->warning = expression_parse(value, &failed_at, &error); + if(!rc->warning) { + error("Health configuration at line %zu of file '%s' for alarm '%s' at key '%s' has unparse-able expression '%s': %s at '%s'", + line, filename, rrdcalc_name(rc), key, value, expression_strerror(error), failed_at); + } + } + else if(hash == hash_crit && !strcasecmp(key, HEALTH_CRIT_KEY)) { + alert_cfg->crit = string_strdupz(value); + const char *failed_at = NULL; + int error = 0; + rc->critical = expression_parse(value, &failed_at, &error); + if(!rc->critical) { + error("Health configuration at line %zu of file '%s' for alarm '%s' at key '%s' has unparse-able expression '%s': %s at '%s'", + line, filename, rrdcalc_name(rc), key, value, expression_strerror(error), failed_at); + } + } + else if(hash == hash_exec && !strcasecmp(key, HEALTH_EXEC_KEY)) { + alert_cfg->exec = string_strdupz(value); + if(rc->exec) { + if(strcmp(rrdcalc_exec(rc), value) != 0) + error("Health configuration at line %zu of file '%s' for alarm '%s' has key '%s' twice, once with value '%s' and later with value '%s'. 
Using ('%s').", + line, filename, rrdcalc_name(rc), key, rrdcalc_exec(rc), value, value); + + string_freez(rc->exec); + } + rc->exec = string_strdupz(value); + } + else if(hash == hash_recipient && !strcasecmp(key, HEALTH_RECIPIENT_KEY)) { + alert_cfg->to = string_strdupz(value); + if(rc->recipient) { + if(strcmp(rrdcalc_recipient(rc), value) != 0) + error("Health configuration at line %zu of file '%s' for alarm '%s' has key '%s' twice, once with value '%s' and later with value '%s'. Using ('%s').", + line, filename, rrdcalc_name(rc), key, rrdcalc_recipient(rc), value, value); + + string_freez(rc->recipient); + } + rc->recipient = string_strdupz(value); + } + else if(hash == hash_units && !strcasecmp(key, HEALTH_UNITS_KEY)) { + strip_quotes(value); + + alert_cfg->units = string_strdupz(value); + if(rc->units) { + if(strcmp(rrdcalc_units(rc), value) != 0) + error("Health configuration at line %zu of file '%s' for alarm '%s' has key '%s' twice, once with value '%s' and later with value '%s'. Using ('%s').", + line, filename, rrdcalc_name(rc), key, rrdcalc_units(rc), value, value); + + string_freez(rc->units); + } + rc->units = string_strdupz(value); + } + else if(hash == hash_info && !strcasecmp(key, HEALTH_INFO_KEY)) { + strip_quotes(value); + + alert_cfg->info = string_strdupz(value); + if(rc->info) { + if(strcmp(rrdcalc_info(rc), value) != 0) + error("Health configuration at line %zu of file '%s' for alarm '%s' has key '%s' twice, once with value '%s' and later with value '%s'. 
Using ('%s').", + line, filename, rrdcalc_name(rc), key, rrdcalc_info(rc), value, value); + + string_freez(rc->info); + string_freez(rc->original_info); + } + rc->info = string_strdupz(value); + rc->original_info = string_dup(rc->info); + } + else if(hash == hash_delay && !strcasecmp(key, HEALTH_DELAY_KEY)) { + alert_cfg->delay = string_strdupz(value); + health_parse_delay(line, filename, value, &rc->delay_up_duration, &rc->delay_down_duration, &rc->delay_max_duration, &rc->delay_multiplier); + } + else if(hash == hash_options && !strcasecmp(key, HEALTH_OPTIONS_KEY)) { + alert_cfg->options = string_strdupz(value); + rc->options |= health_parse_options(value); + } + else if(hash == hash_repeat && !strcasecmp(key, HEALTH_REPEAT_KEY)){ + alert_cfg->repeat = string_strdupz(value); + health_parse_repeat(line, filename, value, + &rc->warn_repeat_every, + &rc->crit_repeat_every); + } + else if(hash == hash_host_label && !strcasecmp(key, HEALTH_HOST_LABEL_KEY)) { + alert_cfg->host_labels = string_strdupz(value); + if(rc->host_labels) { + if(strcmp(rrdcalc_host_labels(rc), value) != 0) + error("Health configuration at line %zu of file '%s' for alarm '%s' has key '%s' twice, once with value '%s' and later with value '%s'.", + line, filename, rrdcalc_name(rc), key, value, value); + + string_freez(rc->host_labels); + simple_pattern_free(rc->host_labels_pattern); + } + + { + char *tmp = simple_pattern_trim_around_equal(value); + rc->host_labels = string_strdupz(tmp); + freez(tmp); + } + rc->host_labels_pattern = simple_pattern_create(rrdcalc_host_labels(rc), NULL, SIMPLE_PATTERN_EXACT); + } + else if(hash == hash_plugin && !strcasecmp(key, HEALTH_PLUGIN_KEY)) { + alert_cfg->plugin = string_strdupz(value); + string_freez(rc->plugin_match); + simple_pattern_free(rc->plugin_pattern); + + rc->plugin_match = string_strdupz(value); + rc->plugin_pattern = simple_pattern_create(rrdcalc_plugin_match(rc), NULL, SIMPLE_PATTERN_EXACT); + } + else if(hash == hash_module && !strcasecmp(key, 
HEALTH_MODULE_KEY)) { + alert_cfg->module = string_strdupz(value); + string_freez(rc->module_match); + simple_pattern_free(rc->module_pattern); + + rc->module_match = string_strdupz(value); + rc->module_pattern = simple_pattern_create(rrdcalc_module_match(rc), NULL, SIMPLE_PATTERN_EXACT); + } + else { + error("Health configuration at line %zu of file '%s' for alarm '%s' has unknown key '%s'.", + line, filename, rrdcalc_name(rc), key); + } + } + else if(rt) { + if(hash == hash_on && !strcasecmp(key, HEALTH_ON_KEY)) { + alert_cfg->on = string_strdupz(value); + if(rt->context) { + if(strcmp(string2str(rt->context), value) != 0) + error("Health configuration at line %zu of file '%s' for template '%s' has key '%s' twice, once with value '%s' and later with value '%s'. Using ('%s').", + line, filename, rrdcalctemplate_name(rt), key, string2str(rt->context), value, value); + + string_freez(rt->context); + } + rt->context = string_strdupz(value); + } + else if(hash == hash_class && !strcasecmp(key, HEALTH_CLASS_KEY)) { + strip_quotes(value); + + alert_cfg->classification = string_strdupz(value); + if(rt->classification) { + if(strcmp(rrdcalctemplate_classification(rt), value) != 0) + error("Health configuration at line %zu of file '%s' for alarm '%s' has key '%s' twice, once with value '%s' and later with value '%s'. Using ('%s').", + line, filename, rrdcalctemplate_name(rt), key, rrdcalctemplate_classification(rt), value, value); + + string_freez(rt->classification); + } + rt->classification = string_strdupz(value); + } + else if(hash == hash_component && !strcasecmp(key, HEALTH_COMPONENT_KEY)) { + strip_quotes(value); + + alert_cfg->component = string_strdupz(value); + if(rt->component) { + if(strcmp(rrdcalctemplate_component(rt), value) != 0) + error("Health configuration at line %zu of file '%s' for alarm '%s' has key '%s' twice, once with value '%s' and later with value '%s'. 
Using ('%s').", + line, filename, rrdcalctemplate_name(rt), key, rrdcalctemplate_component(rt), value, value); + + string_freez(rt->component); + } + rt->component = string_strdupz(value); + } + else if(hash == hash_type && !strcasecmp(key, HEALTH_TYPE_KEY)) { + strip_quotes(value); + + alert_cfg->type = string_strdupz(value); + if(rt->type) { + if(strcmp(rrdcalctemplate_type(rt), value) != 0) + error("Health configuration at line %zu of file '%s' for alarm '%s' has key '%s' twice, once with value '%s' and later with value '%s'. Using ('%s').", + line, filename, rrdcalctemplate_name(rt), key, rrdcalctemplate_type(rt), value, value); + + string_freez(rt->type); + } + rt->type = string_strdupz(value); + } + else if(hash == hash_families && !strcasecmp(key, HEALTH_FAMILIES_KEY)) { + alert_cfg->families = string_strdupz(value); + string_freez(rt->family_match); + simple_pattern_free(rt->family_pattern); + + rt->family_match = string_strdupz(value); + rt->family_pattern = simple_pattern_create(rrdcalctemplate_family_match(rt), NULL, SIMPLE_PATTERN_EXACT); + } + else if(hash == hash_plugin && !strcasecmp(key, HEALTH_PLUGIN_KEY)) { + alert_cfg->plugin = string_strdupz(value); + string_freez(rt->plugin_match); + simple_pattern_free(rt->plugin_pattern); + + rt->plugin_match = string_strdupz(value); + rt->plugin_pattern = simple_pattern_create(rrdcalctemplate_plugin_match(rt), NULL, SIMPLE_PATTERN_EXACT); + } + else if(hash == hash_module && !strcasecmp(key, HEALTH_MODULE_KEY)) { + alert_cfg->module = string_strdupz(value); + string_freez(rt->module_match); + simple_pattern_free(rt->module_pattern); + + rt->module_match = string_strdupz(value); + rt->module_pattern = simple_pattern_create(rrdcalctemplate_module_match(rt), NULL, SIMPLE_PATTERN_EXACT); + } + else if(hash == hash_charts && !strcasecmp(key, HEALTH_CHARTS_KEY)) { + alert_cfg->charts = string_strdupz(value); + string_freez(rt->charts_match); + simple_pattern_free(rt->charts_pattern); + + rt->charts_match = 
string_strdupz(value); + rt->charts_pattern = simple_pattern_create(rrdcalctemplate_charts_match(rt), NULL, SIMPLE_PATTERN_EXACT); + } + else if(hash == hash_lookup && !strcasecmp(key, HEALTH_LOOKUP_KEY)) { + alert_cfg->lookup = string_strdupz(value); + health_parse_db_lookup(line, filename, value, &rt->group, &rt->after, &rt->before, + &rt->update_every, &rt->options, &rt->dimensions, &rt->foreach_dimension); + + if(rt->foreach_dimension) + rt->foreach_dimension_pattern = health_pattern_from_foreach(rrdcalctemplate_foreachdim(rt)); + + if (rt->after) { + if (rt->dimensions) + alert_cfg->p_db_lookup_dimensions = string_dup(rt->dimensions); + + if (rt->group) + alert_cfg->p_db_lookup_method = string_strdupz(group_method2string(rt->group)); + + alert_cfg->p_db_lookup_options = rt->options; + alert_cfg->p_db_lookup_after = rt->after; + alert_cfg->p_db_lookup_before = rt->before; + alert_cfg->p_update_every = rt->update_every; + } + } + else if(hash == hash_every && !strcasecmp(key, HEALTH_EVERY_KEY)) { + alert_cfg->every = string_strdupz(value); + if(!config_parse_duration(value, &rt->update_every)) + error("Health configuration at line %zu of file '%s' for template '%s' at key '%s' cannot parse duration: '%s'.", + line, filename, rrdcalctemplate_name(rt), key, value); + alert_cfg->p_update_every = rt->update_every; + } + else if(hash == hash_green && !strcasecmp(key, HEALTH_GREEN_KEY)) { + alert_cfg->green = string_strdupz(value); + char *e; + rt->green = str2ndd(value, &e); + if(e && *e) { + error("Health configuration at line %zu of file '%s' for template '%s' at key '%s' leaves this string unmatched: '%s'.", + line, filename, rrdcalctemplate_name(rt), key, e); + } + } + else if(hash == hash_red && !strcasecmp(key, HEALTH_RED_KEY)) { + alert_cfg->red = string_strdupz(value); + char *e; + rt->red = str2ndd(value, &e); + if(e && *e) { + error("Health configuration at line %zu of file '%s' for template '%s' at key '%s' leaves this string unmatched: '%s'.", + line, 
filename, rrdcalctemplate_name(rt), key, e);
+                }
+            }
+            else if(hash == hash_calc && !strcasecmp(key, HEALTH_CALC_KEY)) {
+                alert_cfg->calc = string_strdupz(value);
+                const char *failed_at = NULL;
+                int error = 0;
+                rt->calculation = expression_parse(value, &failed_at, &error);
+                if(!rt->calculation) {
+                    error("Health configuration at line %zu of file '%s' for template '%s' at key '%s' has unparse-able expression '%s': %s at '%s'",
+                          line, filename, rrdcalctemplate_name(rt), key, value, expression_strerror(error), failed_at);
+                }
+            }
+            else if(hash == hash_warn && !strcasecmp(key, HEALTH_WARN_KEY)) {
+                alert_cfg->warn = string_strdupz(value);
+                const char *failed_at = NULL;
+                int error = 0;
+                rt->warning = expression_parse(value, &failed_at, &error);
+                if(!rt->warning) {
+                    error("Health configuration at line %zu of file '%s' for template '%s' at key '%s' has unparse-able expression '%s': %s at '%s'",
+                          line, filename, rrdcalctemplate_name(rt), key, value, expression_strerror(error), failed_at);
+                }
+            }
+            else if(hash == hash_crit && !strcasecmp(key, HEALTH_CRIT_KEY)) {
+                alert_cfg->crit = string_strdupz(value);
+                const char *failed_at = NULL;
+                int error = 0;
+                rt->critical = expression_parse(value, &failed_at, &error);
+                if(!rt->critical) {
+                    error("Health configuration at line %zu of file '%s' for template '%s' at key '%s' has unparse-able expression '%s': %s at '%s'",
+                          line, filename, rrdcalctemplate_name(rt), key, value, expression_strerror(error), failed_at);
+                }
+            }
+            else if(hash == hash_exec && !strcasecmp(key, HEALTH_EXEC_KEY)) {
+                alert_cfg->exec = string_strdupz(value);
+                if(rt->exec) {
+                    if(strcmp(rrdcalctemplate_exec(rt), value) != 0)
+                        error("Health configuration at line %zu of file '%s' for template '%s' has key '%s' twice, once with value '%s' and later with value '%s'. Using ('%s').",
+                              line, filename, rrdcalctemplate_name(rt), key, rrdcalctemplate_exec(rt), value, value);
+
+                    string_freez(rt->exec);
+                }
+                rt->exec = string_strdupz(value);
+            }
+            else if(hash == hash_recipient && !strcasecmp(key, HEALTH_RECIPIENT_KEY)) {
+                alert_cfg->to = string_strdupz(value);
+                if(rt->recipient) {
+                    if(strcmp(rrdcalctemplate_recipient(rt), value) != 0)
+                        error("Health configuration at line %zu of file '%s' for template '%s' has key '%s' twice, once with value '%s' and later with value '%s'. Using ('%s').",
+                              line, filename, rrdcalctemplate_name(rt), key, rrdcalctemplate_recipient(rt), value, value);
+
+                    string_freez(rt->recipient);
+                }
+                rt->recipient = string_strdupz(value);
+            }
+            else if(hash == hash_units && !strcasecmp(key, HEALTH_UNITS_KEY)) {
+                strip_quotes(value);
+
+                alert_cfg->units = string_strdupz(value);
+                if(rt->units) {
+                    if(strcmp(rrdcalctemplate_units(rt), value) != 0)
+                        error("Health configuration at line %zu of file '%s' for template '%s' has key '%s' twice, once with value '%s' and later with value '%s'. Using ('%s').",
+                              line, filename, rrdcalctemplate_name(rt), key, rrdcalctemplate_units(rt), value, value);
+
+                    string_freez(rt->units);
+                }
+                rt->units = string_strdupz(value);
+            }
+            else if(hash == hash_info && !strcasecmp(key, HEALTH_INFO_KEY)) {
+                strip_quotes(value);
+
+                alert_cfg->info = string_strdupz(value);
+                if(rt->info) {
+                    if(strcmp(rrdcalctemplate_info(rt), value) != 0)
+                        error("Health configuration at line %zu of file '%s' for template '%s' has key '%s' twice, once with value '%s' and later with value '%s'. Using ('%s').",
+                              line, filename, rrdcalctemplate_name(rt), key, rrdcalctemplate_info(rt), value, value);
+
+                    string_freez(rt->info);
+                }
+                rt->info = string_strdupz(value);
+            }
+            else if(hash == hash_delay && !strcasecmp(key, HEALTH_DELAY_KEY)) {
+                alert_cfg->delay = string_strdupz(value);
+                health_parse_delay(line, filename, value, &rt->delay_up_duration, &rt->delay_down_duration, &rt->delay_max_duration, &rt->delay_multiplier);
+            }
+            else if(hash == hash_options && !strcasecmp(key, HEALTH_OPTIONS_KEY)) {
+                alert_cfg->options = string_strdupz(value);
+                rt->options |= health_parse_options(value);
+            }
+            else if(hash == hash_repeat && !strcasecmp(key, HEALTH_REPEAT_KEY)){
+                alert_cfg->repeat = string_strdupz(value);
+                health_parse_repeat(line, filename, value,
+                                    &rt->warn_repeat_every,
+                                    &rt->crit_repeat_every);
+            }
+            else if(hash == hash_host_label && !strcasecmp(key, HEALTH_HOST_LABEL_KEY)) {
+                alert_cfg->host_labels = string_strdupz(value);
+                if(rt->host_labels) {
+                    if(strcmp(rrdcalctemplate_host_labels(rt), value) != 0)
+                        error("Health configuration at line %zu of file '%s' for template '%s' has key '%s' twice, once with value '%s' and later with value '%s'. Using ('%s').",
+                              line, filename, rrdcalctemplate_name(rt), key, rrdcalctemplate_host_labels(rt), value, value);
+
+                    string_freez(rt->host_labels);
+                    simple_pattern_free(rt->host_labels_pattern);
+                }
+
+                {
+                    char *tmp = simple_pattern_trim_around_equal(value);
+                    rt->host_labels = string_strdupz(tmp);
+                    freez(tmp);
+                }
+                rt->host_labels_pattern = simple_pattern_create(rrdcalctemplate_host_labels(rt), NULL, SIMPLE_PATTERN_EXACT);
+            }
+            else {
+                error("Health configuration at line %zu of file '%s' for template '%s' has unknown key '%s'.",
+                      line, filename, rrdcalctemplate_name(rt), key);
+            }
+        }
+        else {
+            error("Health configuration at line %zu of file '%s' has unknown key '%s'. Expected either '" HEALTH_ALARM_KEY "' or '" HEALTH_TEMPLATE_KEY "'.",
+                  line, filename, key);
+        }
+    }
+
+    if(rc) {
+        //health_add_alarms_loop(host, rc, ignore_this) ;
+        if(!alert_hash_and_store_config(rc->config_hash_id, alert_cfg, sql_store_hashes) || ignore_this)
+            rrdcalc_free_unused_rrdcalc_loaded_from_config(rc);
+        else
+            rrdcalc_add_from_config(host, rc);
+    }
+
+    if(rt) {
+        if(!alert_hash_and_store_config(rt->config_hash_id, alert_cfg, sql_store_hashes) || ignore_this)
+            rrdcalctemplate_free_unused_rrdcalctemplate_loaded_from_config(rt);
+        else
+            rrdcalctemplate_add_from_config(host, rt);
+    }
+
+    if (alert_cfg)
+        alert_config_free(alert_cfg);
+
+    fclose(fp);
+    return 1;
+}
+
+void sql_refresh_hashes(void)
+{
+    sql_store_hashes = 1;
+}
+
+void health_readdir(RRDHOST *host, const char *user_path, const char *stock_path, const char *subpath) {
+    if(unlikely(!host->health_enabled) && !rrdhost_flag_check(host, RRDHOST_FLAG_INITIALIZED_HEALTH)) {
+        debug(D_HEALTH, "CONFIG health is not enabled for host '%s'", rrdhost_hostname(host));
+        return;
+    }
+
+    int stock_enabled = (int)config_get_boolean(CONFIG_SECTION_HEALTH, "enable stock health configuration",
+                                                CONFIG_BOOLEAN_YES);
+
+    if (!stock_enabled) {
+        log_health("[%s]: Netdata will not load stock alarms.", rrdhost_hostname(host));
+        stock_path = user_path;
+    }
+
+    recursive_config_double_dir_load(user_path, stock_path, subpath, health_readfile, (void *) host, 0);
+    log_health("[%s]: Read health configuration.", rrdhost_hostname(host));
+    sql_store_hashes = 0;
+}
diff --git a/health/health_json.c b/health/health_json.c
new file mode 100644
index 0000000..2dd59fd
--- /dev/null
+++ b/health/health_json.c
@@ -0,0 +1,439 @@
+// SPDX-License-Identifier: GPL-3.0-or-later
+
+#include "health.h"
+
+void health_string2json(BUFFER *wb, const char *prefix, const char *label, const char *value, const char *suffix) {
+    if(value && *value) {
+        buffer_sprintf(wb, "%s\"%s\":\"", prefix, label);
+        buffer_strcat_htmlescape(wb, value);
+        buffer_strcat(wb, "\"");
+        buffer_strcat(wb, suffix);
+    }
+    else
+        buffer_sprintf(wb, "%s\"%s\":null%s", prefix, label, suffix);
+}
+
+void health_alarm_entry2json_nolock(BUFFER *wb, ALARM_ENTRY *ae, RRDHOST *host) {
+    char *edit_command = ae->source ? health_edit_command_from_source(ae_source(ae)) : strdupz("UNKNOWN=0=UNKNOWN");
+    char config_hash_id[GUID_LEN + 1];
+    uuid_unparse_lower(ae->config_hash_id, config_hash_id);
+
+    buffer_sprintf(wb,
+            "\n\t{\n"
+            "\t\t\"hostname\": \"%s\",\n"
+            "\t\t\"utc_offset\": %d,\n"
+            "\t\t\"timezone\": \"%s\",\n"
+            "\t\t\"unique_id\": %u,\n"
+            "\t\t\"alarm_id\": %u,\n"
+            "\t\t\"alarm_event_id\": %u,\n"
+            "\t\t\"config_hash_id\": \"%s\",\n"
+            "\t\t\"name\": \"%s\",\n"
+            "\t\t\"chart\": \"%s\",\n"
+            "\t\t\"context\": \"%s\",\n"
+            "\t\t\"family\": \"%s\",\n"
+            "\t\t\"class\": \"%s\",\n"
+            "\t\t\"component\": \"%s\",\n"
+            "\t\t\"type\": \"%s\",\n"
+            "\t\t\"processed\": %s,\n"
+            "\t\t\"updated\": %s,\n"
+            "\t\t\"exec_run\": %lu,\n"
+            "\t\t\"exec_failed\": %s,\n"
+            "\t\t\"exec\": \"%s\",\n"
+            "\t\t\"recipient\": \"%s\",\n"
+            "\t\t\"exec_code\": %d,\n"
+            "\t\t\"source\": \"%s\",\n"
+            "\t\t\"command\": \"%s\",\n"
+            "\t\t\"units\": \"%s\",\n"
+            "\t\t\"when\": %lu,\n"
+            "\t\t\"duration\": %lu,\n"
+            "\t\t\"non_clear_duration\": %lu,\n"
+            "\t\t\"status\": \"%s\",\n"
+            "\t\t\"old_status\": \"%s\",\n"
+            "\t\t\"delay\": %d,\n"
+            "\t\t\"delay_up_to_timestamp\": %lu,\n"
+            "\t\t\"updated_by_id\": %u,\n"
+            "\t\t\"updates_id\": %u,\n"
+            "\t\t\"value_string\": \"%s\",\n"
+            "\t\t\"old_value_string\": \"%s\",\n"
+            "\t\t\"last_repeat\": \"%lu\",\n"
+            "\t\t\"silenced\": \"%s\",\n"
+            , rrdhost_hostname(host)
+            , host->utc_offset
+            , rrdhost_abbrev_timezone(host)
+            , ae->unique_id
+            , ae->alarm_id
+            , ae->alarm_event_id
+            , config_hash_id
+            , ae_name(ae)
+            , ae_chart_name(ae)
+            , ae_chart_context(ae)
+            , ae_family(ae)
+            , ae->classification?ae_classification(ae):"Unknown"
+            , ae->component?ae_component(ae):"Unknown"
+            , ae->type?ae_type(ae):"Unknown"
+            , (ae->flags & HEALTH_ENTRY_FLAG_PROCESSED)?"true":"false"
+            , (ae->flags & HEALTH_ENTRY_FLAG_UPDATED)?"true":"false"
+            , (unsigned long)ae->exec_run_timestamp
+            , (ae->flags & HEALTH_ENTRY_FLAG_EXEC_FAILED)?"true":"false"
+            , ae->exec?ae_exec(ae):string2str(host->health_default_exec)
+            , ae->recipient?ae_recipient(ae):string2str(host->health_default_recipient)
+            , ae->exec_code
+            , ae_source(ae)
+            , edit_command
+            , ae_units(ae)
+            , (unsigned long)ae->when
+            , (unsigned long)ae->duration
+            , (unsigned long)ae->non_clear_duration
+            , rrdcalc_status2string(ae->new_status)
+            , rrdcalc_status2string(ae->old_status)
+            , ae->delay
+            , (unsigned long)ae->delay_up_to_timestamp
+            , ae->updated_by_id
+            , ae->updates_id
+            , ae_new_value_string(ae)
+            , ae_old_value_string(ae)
+            , (unsigned long)ae->last_repeat
+            , (ae->flags & HEALTH_ENTRY_FLAG_SILENCED)?"true":"false"
+    );
+
+    health_string2json(wb, "\t\t", "info", ae->info ? ae_info(ae) : "", ",\n");
+
+    if(unlikely(ae->flags & HEALTH_ENTRY_FLAG_NO_CLEAR_NOTIFICATION)) {
+        buffer_strcat(wb, "\t\t\"no_clear_notification\": true,\n");
+    }
+
+    buffer_strcat(wb, "\t\t\"value\":");
+    buffer_rrd_value(wb, ae->new_value);
+    buffer_strcat(wb, ",\n");
+
+    buffer_strcat(wb, "\t\t\"old_value\":");
+    buffer_rrd_value(wb, ae->old_value);
+    buffer_strcat(wb, "\n");
+
+    buffer_strcat(wb, "\t}");
+
+    freez(edit_command);
+}
+
+void health_alarm_log2json(RRDHOST *host, BUFFER *wb, uint32_t after, char *chart) {
+
+    buffer_strcat(wb, "[");
+
+    unsigned int max = host->health_log.max;
+    unsigned int count = 0;
+
+    STRING *chart_string = string_strdupz(chart);
+
+    netdata_rwlock_rdlock(&host->health_log.alarm_log_rwlock);
+
+    ALARM_ENTRY *ae;
+    for (ae = host->health_log.alarms; ae && count < max; ae = ae->next) {
+        if ((ae->unique_id > after) && (!chart || chart_string == ae->chart)) {
+            if (likely(count))
+                buffer_strcat(wb, ",");
+            health_alarm_entry2json_nolock(wb, ae, host);
+            count++;
+        }
+    }
+
+    netdata_rwlock_unlock(&host->health_log.alarm_log_rwlock);
+
+    string_freez(chart_string);
+
+    buffer_strcat(wb, "\n]\n");
+}
+
+static inline void health_rrdcalc_values2json_nolock(RRDHOST *host, BUFFER *wb, RRDCALC *rc) {
+    (void)host;
+    buffer_sprintf(wb,
+                   "\t\t\"%s.%s\": {\n"
+                   "\t\t\t\"id\": %lu,\n"
+                   , rrdcalc_chart_name(rc), rrdcalc_name(rc)
+                   , (unsigned long)rc->id);
+
+    buffer_strcat(wb, "\t\t\t\"value\":");
+    buffer_rrd_value(wb, rc->value);
+    buffer_strcat(wb, ",\n");
+
+    buffer_strcat(wb, "\t\t\t\"last_updated\":");
+    buffer_sprintf(wb, "%lu", (unsigned long)rc->last_updated);
+    buffer_strcat(wb, ",\n");
+
+    buffer_sprintf(wb,
+                   "\t\t\t\"status\": \"%s\"\n"
+                   , rrdcalc_status2string(rc->status));
+
+    buffer_strcat(wb, "\t\t}");
+}
+
+static inline void health_rrdcalc2json_nolock(RRDHOST *host, BUFFER *wb, RRDCALC *rc) {
+    char value_string[100 + 1];
+    format_value_and_unit(value_string, 100, rc->value, rrdcalc_units(rc), -1);
+
+    char hash_id[GUID_LEN + 1];
+    uuid_unparse_lower(rc->config_hash_id, hash_id);
+
+    buffer_sprintf(wb,
+            "\t\t\"%s.%s\": {\n"
+            "\t\t\t\"id\": %lu,\n"
+            "\t\t\t\"config_hash_id\": \"%s\",\n"
+            "\t\t\t\"name\": \"%s\",\n"
+            "\t\t\t\"chart\": \"%s\",\n"
+            "\t\t\t\"family\": \"%s\",\n"
+            "\t\t\t\"class\": \"%s\",\n"
+            "\t\t\t\"component\": \"%s\",\n"
+            "\t\t\t\"type\": \"%s\",\n"
+            "\t\t\t\"active\": %s,\n"
+            "\t\t\t\"disabled\": %s,\n"
+            "\t\t\t\"silenced\": %s,\n"
+            "\t\t\t\"exec\": \"%s\",\n"
+            "\t\t\t\"recipient\": \"%s\",\n"
+            "\t\t\t\"source\": \"%s\",\n"
+            "\t\t\t\"units\": \"%s\",\n"
+            "\t\t\t\"info\": \"%s\",\n"
+            "\t\t\t\"status\": \"%s\",\n"
+            "\t\t\t\"last_status_change\": %lu,\n"
+            "\t\t\t\"last_updated\": %lu,\n"
+            "\t\t\t\"next_update\": %lu,\n"
+            "\t\t\t\"update_every\": %d,\n"
+            "\t\t\t\"delay_up_duration\": %d,\n"
+            "\t\t\t\"delay_down_duration\": %d,\n"
+            "\t\t\t\"delay_max_duration\": %d,\n"
+            "\t\t\t\"delay_multiplier\": %f,\n"
+            "\t\t\t\"delay\": %d,\n"
+            "\t\t\t\"delay_up_to_timestamp\": %lu,\n"
+            "\t\t\t\"warn_repeat_every\": \"%u\",\n"
+            "\t\t\t\"crit_repeat_every\": \"%u\",\n"
+            "\t\t\t\"value_string\": \"%s\",\n"
+            "\t\t\t\"last_repeat\": \"%lu\",\n"
+            "\t\t\t\"times_repeat\": %lu,\n"
+            , rrdcalc_chart_name(rc), rrdcalc_name(rc)
+            , (unsigned long)rc->id
+            , hash_id
+            , rrdcalc_name(rc)
+            , rrdcalc_chart_name(rc)
+            , (rc->rrdset)?rrdset_family(rc->rrdset):""
+            , rc->classification?rrdcalc_classification(rc):"Unknown"
+            , rc->component?rrdcalc_component(rc):"Unknown"
+            , rc->type?rrdcalc_type(rc):"Unknown"
+            , (rc->rrdset)?"true":"false"
+            , (rc->run_flags & RRDCALC_FLAG_DISABLED)?"true":"false"
+            , (rc->run_flags & RRDCALC_FLAG_SILENCED)?"true":"false"
+            , rc->exec?rrdcalc_exec(rc):string2str(host->health_default_exec)
+            , rc->recipient?rrdcalc_recipient(rc):string2str(host->health_default_recipient)
+            , rrdcalc_source(rc)
+            , rrdcalc_units(rc)
+            , rrdcalc_info(rc)
+            , rrdcalc_status2string(rc->status)
+            , (unsigned long)rc->last_status_change
+            , (unsigned long)rc->last_updated
+            , (unsigned long)rc->next_update
+            , rc->update_every
+            , rc->delay_up_duration
+            , rc->delay_down_duration
+            , rc->delay_max_duration
+            , rc->delay_multiplier
+            , rc->delay_last
+            , (unsigned long)rc->delay_up_to_timestamp
+            , rc->warn_repeat_every
+            , rc->crit_repeat_every
+            , value_string
+            , (unsigned long)rc->last_repeat
+            , (unsigned long)rc->times_repeat
+    );
+
+    if(unlikely(rc->options & RRDCALC_OPTION_NO_CLEAR_NOTIFICATION)) {
+        buffer_strcat(wb, "\t\t\t\"no_clear_notification\": true,\n");
+    }
+
+    if(RRDCALC_HAS_DB_LOOKUP(rc)) {
+        if(rc->dimensions)
+            health_string2json(wb, "\t\t\t", "lookup_dimensions", rrdcalc_dimensions(rc), ",\n");
+
+        buffer_sprintf(wb,
+                "\t\t\t\"db_after\": %lu,\n"
+                "\t\t\t\"db_before\": %lu,\n"
+                "\t\t\t\"lookup_method\": \"%s\",\n"
+                "\t\t\t\"lookup_after\": %d,\n"
+                "\t\t\t\"lookup_before\": %d,\n"
+                "\t\t\t\"lookup_options\": \"",
+                (unsigned long) rc->db_after,
+                (unsigned long) rc->db_before,
+                group_method2string(rc->group),
+                rc->after,
+                rc->before
+        );
+        buffer_data_options2string(wb, rc->options);
+        buffer_strcat(wb, "\",\n");
+    }
+
+    if(rc->calculation) {
+        health_string2json(wb, "\t\t\t", "calc", rc->calculation->source, ",\n");
+        health_string2json(wb, "\t\t\t", "calc_parsed", rc->calculation->parsed_as, ",\n");
+    }
+
+    if(rc->warning) {
+        health_string2json(wb, "\t\t\t", "warn", rc->warning->source, ",\n");
+        health_string2json(wb, "\t\t\t", "warn_parsed", rc->warning->parsed_as, ",\n");
+    }
+
+    if(rc->critical) {
+        health_string2json(wb, "\t\t\t", "crit", rc->critical->source, ",\n");
+        health_string2json(wb, "\t\t\t", "crit_parsed", rc->critical->parsed_as, ",\n");
+    }
+
+    buffer_strcat(wb, "\t\t\t\"green\":");
+    buffer_rrd_value(wb, rc->green);
+    buffer_strcat(wb, ",\n");
+
+    buffer_strcat(wb, "\t\t\t\"red\":");
+    buffer_rrd_value(wb, rc->red);
+    buffer_strcat(wb, ",\n");
+
+    buffer_strcat(wb, "\t\t\t\"value\":");
+    buffer_rrd_value(wb, rc->value);
+    buffer_strcat(wb, "\n");
+
+    buffer_strcat(wb, "\t\t}");
+}
+
+//void health_rrdcalctemplate2json_nolock(BUFFER *wb, RRDCALCTEMPLATE *rt) {
+//
+//}
+
+void health_aggregate_alarms(RRDHOST *host, BUFFER *wb, BUFFER* contexts, RRDCALC_STATUS status) {
+    RRDCALC *rc;
+    int numberOfAlarms = 0;
+    char *tok = NULL;
+    char *p = NULL;
+
+    if (contexts) {
+        p = (char*)buffer_tostring(contexts);
+        while(p && *p && (tok = mystrsep(&p, ", |"))) {
+            if(!*tok) continue;
+
+            STRING *tok_string = string_strdupz(tok);
+
+            foreach_rrdcalc_in_rrdhost_read(host, rc) {
+                if(unlikely(!rc->rrdset || !rc->rrdset->last_collected_time.tv_sec))
+                    continue;
+                if (unlikely(!rrdset_is_available_for_exporting_and_alarms(rc->rrdset)))
+                    continue;
+                if(unlikely(rc->rrdset
+                             && rc->rrdset->context == tok_string
+                             && ((status==RRDCALC_STATUS_RAISED)?(rc->status >= RRDCALC_STATUS_WARNING):rc->status == status)))
+                    numberOfAlarms++;
+            }
+            foreach_rrdcalc_in_rrdhost_done(rc);
+
+            string_freez(tok_string);
+        }
+    }
+    else {
+        foreach_rrdcalc_in_rrdhost_read(host, rc) {
+            if(unlikely(!rc->rrdset || !rc->rrdset->last_collected_time.tv_sec))
+                continue;
+            if (unlikely(!rrdset_is_available_for_exporting_and_alarms(rc->rrdset)))
+                continue;
+            if(unlikely((status==RRDCALC_STATUS_RAISED)?(rc->status >= RRDCALC_STATUS_WARNING):rc->status == status))
+                numberOfAlarms++;
+        }
+        foreach_rrdcalc_in_rrdhost_done(rc);
+    }
+
+    buffer_sprintf(wb, "%d", numberOfAlarms);
+}
+
+static void health_alarms2json_fill_alarms(RRDHOST *host, BUFFER *wb, int all, void (*fp)(RRDHOST *, BUFFER *, RRDCALC *)) {
+    RRDCALC *rc;
+    int i = 0;
+    foreach_rrdcalc_in_rrdhost_read(host, rc) {
+        if(unlikely(!rc->rrdset || !rc->rrdset->last_collected_time.tv_sec))
+            continue;
+
+        if (unlikely(!rrdset_is_available_for_exporting_and_alarms(rc->rrdset)))
+            continue;
+
+        if(likely(!all && !(rc->status == RRDCALC_STATUS_WARNING || rc->status == RRDCALC_STATUS_CRITICAL)))
+            continue;
+
+        if(likely(i)) buffer_strcat(wb, ",\n");
+        fp(host, wb, rc);
+        i++;
+    }
+    foreach_rrdcalc_in_rrdhost_done(rc);
+}
+
+void health_alarms2json(RRDHOST *host, BUFFER *wb, int all) {
+    buffer_sprintf(wb, "{\n\t\"hostname\": \"%s\","
+                       "\n\t\"latest_alarm_log_unique_id\": %u,"
+                       "\n\t\"status\": %s,"
+                       "\n\t\"now\": %lu,"
+                       "\n\t\"alarms\": {\n",
+                   rrdhost_hostname(host),
+                   (host->health_log.next_log_id > 0)?(host->health_log.next_log_id - 1):0,
+                   host->health_enabled?"true":"false",
+                   (unsigned long)now_realtime_sec());
+
+    health_alarms2json_fill_alarms(host, wb, all, health_rrdcalc2json_nolock);
+
+//    rrdhost_rdlock(host);
+//    buffer_strcat(wb, "\n\t},\n\t\"templates\": {");
+//    RRDCALCTEMPLATE *rt;
+//    for(rt = host->templates; rt ; rt = rt->next)
+//        health_rrdcalctemplate2json_nolock(wb, rt);
+//    rrdhost_unlock(host);
+
+    buffer_strcat(wb, "\n\t}\n}\n");
+}
+
+void health_alarms_values2json(RRDHOST *host, BUFFER *wb, int all) {
+    buffer_sprintf(wb, "{\n\t\"hostname\": \"%s\","
+                       "\n\t\"alarms\": {\n",
+                   rrdhost_hostname(host));
+
+    health_alarms2json_fill_alarms(host, wb, all, health_rrdcalc_values2json_nolock);
+
+    buffer_strcat(wb, "\n\t}\n}\n");
+}
+
+static int have_recent_alarm(RRDHOST *host, uint32_t alarm_id, uint32_t mark)
+{
+    ALARM_ENTRY *ae = host->health_log.alarms;
+
+    while(ae) {
+        if (ae->alarm_id == alarm_id && ae->unique_id > mark &&
+            (ae->new_status != RRDCALC_STATUS_WARNING && ae->new_status != RRDCALC_STATUS_CRITICAL))
+            return 1;
+        ae = ae->next;
+    }
+    return 0;
+}
+
+void health_active_log_alarms_2json(RRDHOST *host, BUFFER *wb) {
+    netdata_rwlock_rdlock(&host->health_log.alarm_log_rwlock);
+
+    buffer_sprintf(wb, "[\n");
+
+    unsigned int max = host->health_log.max;
+    unsigned int count = 0;
+    ALARM_ENTRY *ae;
+    for(ae = host->health_log.alarms; ae && count < max ; ae = ae->next) {
+        if (!ae->updated_by_id &&
+            ((ae->new_status == RRDCALC_STATUS_WARNING || ae->new_status == RRDCALC_STATUS_CRITICAL) ||
+             ((ae->old_status == RRDCALC_STATUS_WARNING || ae->old_status == RRDCALC_STATUS_CRITICAL) &&
+              ae->new_status == RRDCALC_STATUS_REMOVED))) {
+
+            if (have_recent_alarm(host, ae->alarm_id, ae->unique_id))
+                continue;
+
+            if (likely(count))
+                buffer_strcat(wb, ",");
+            health_alarm_entry2json_nolock(wb, ae, host);
+            count++;
+        }
+    }
+    buffer_strcat(wb, "]");
+
+    netdata_rwlock_unlock(&host->health_log.alarm_log_rwlock);
+}
diff --git a/health/health_log.c b/health/health_log.c
new file mode 100644
index 0000000..8105e01
--- /dev/null
+++ b/health/health_log.c
@@ -0,0 +1,583 @@
+// SPDX-License-Identifier: GPL-3.0-or-later
+
+#include "health.h"
+
+// ----------------------------------------------------------------------------
+// health alarm log load/save
+// no need for locking - only one thread is reading / writing the alarms log
+
+inline int health_alarm_log_open(RRDHOST *host) {
+    if(host->health_log_fp)
+        fclose(host->health_log_fp);
+
+    host->health_log_fp = fopen(host->health_log_filename, "a");
+
+    if(host->health_log_fp) {
+        if (setvbuf(host->health_log_fp, NULL, _IOLBF, 0) != 0)
+            error("HEALTH [%s]: cannot set line buffering on health log file '%s'.", rrdhost_hostname(host), host->health_log_filename);
+        return 0;
+    }
+
+    error("HEALTH [%s]: cannot open health log file '%s'. Health data will be lost in case of netdata or server crash.", rrdhost_hostname(host), host->health_log_filename);
+    return -1;
+}
+
+static inline void health_alarm_log_close(RRDHOST *host) {
+    if(host->health_log_fp) {
+        fclose(host->health_log_fp);
+        host->health_log_fp = NULL;
+    }
+}
+
+static inline void health_log_rotate(RRDHOST *host) {
+    static size_t rotate_every = 0;
+
+    if(unlikely(rotate_every == 0)) {
+        rotate_every = (size_t)config_get_number(CONFIG_SECTION_HEALTH, "rotate log every lines", 2000);
+        if(rotate_every < 100) rotate_every = 100;
+    }
+
+    if(unlikely(host->health_log_entries_written > rotate_every)) {
+        if(unlikely(host->health_log_fp)) {
+            health_alarm_log_close(host);
+
+            char old_filename[FILENAME_MAX + 1];
+            snprintfz(old_filename, FILENAME_MAX, "%s.old", host->health_log_filename);
+
+            if(unlink(old_filename) == -1 && errno != ENOENT)
+                error("HEALTH [%s]: cannot remove old alarms log file '%s'", rrdhost_hostname(host), old_filename);
+
+            if(link(host->health_log_filename, old_filename) == -1 && errno != ENOENT)
+                error("HEALTH [%s]: cannot move file '%s' to '%s'.", rrdhost_hostname(host), host->health_log_filename, old_filename);
+
+            if(unlink(host->health_log_filename) == -1 && errno != ENOENT)
+                error("HEALTH [%s]: cannot remove old alarms log file '%s'", rrdhost_hostname(host), host->health_log_filename);
+
+            // open it with truncate
+            host->health_log_fp = fopen(host->health_log_filename, "w");
+
+            if(host->health_log_fp)
+                fclose(host->health_log_fp);
+            else
+                error("HEALTH [%s]: cannot truncate health log '%s'", rrdhost_hostname(host), host->health_log_filename);
+
+            host->health_log_fp = NULL;
+
+            host->health_log_entries_written = 0;
+            health_alarm_log_open(host);
+        }
+    }
+}
+
+inline void health_label_log_save(RRDHOST *host) {
+    health_log_rotate(host);
+
+    if(unlikely(host->health_log_fp)) {
+        BUFFER *wb = buffer_create(1024);
+
+        rrdlabels_to_buffer(localhost->rrdlabels, wb, "", "=", "", "\t ", NULL, NULL, NULL, NULL);
+        char *write = (char *) buffer_tostring(wb);
+
+        if (unlikely(fprintf(host->health_log_fp, "L\t%s", write) < 0))
+            error("HEALTH [%s]: failed to save alarm log entry to '%s'. Health data may be lost in case of abnormal restart.",
+                  rrdhost_hostname(host), host->health_log_filename);
+        else
+            host->health_log_entries_written++;
+
+        buffer_free(wb);
+    }
+}
+
+inline void health_alarm_log_save(RRDHOST *host, ALARM_ENTRY *ae) {
+    health_log_rotate(host);
+    if(unlikely(host->health_log_fp)) {
+        if(unlikely(fprintf(host->health_log_fp
+                            , "%c\t%s"
+                        "\t%08x\t%08x\t%08x\t%08x\t%08x"
+                        "\t%08x\t%08x\t%08x"
+                        "\t%08x\t%08x\t%08x"
+                        "\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s"
+                        "\t%d\t%d\t%d\t%d"
+                        "\t" NETDATA_DOUBLE_FORMAT_AUTO "\t" NETDATA_DOUBLE_FORMAT_AUTO
+                        "\t%016"PRIx64""
+                        "\t%s\t%s\t%s"
+                        "\n"
+                            , (ae->flags & HEALTH_ENTRY_FLAG_SAVED)?'U':'A'
+                            , rrdhost_hostname(host)
+
+                            , ae->unique_id
+                            , ae->alarm_id
+                            , ae->alarm_event_id
+                            , ae->updated_by_id
+                            , ae->updates_id
+
+                            , (uint32_t)ae->when
+                            , (uint32_t)ae->duration
+                            , (uint32_t)ae->non_clear_duration
+                            , (uint32_t)ae->flags
+                            , (uint32_t)ae->exec_run_timestamp
+                            , (uint32_t)ae->delay_up_to_timestamp
+
+                            , ae_name(ae)
+                            , ae_chart_name(ae)
+                            , ae_family(ae)
+                            , ae_exec(ae)
+                            , ae_recipient(ae)
+                            , ae_source(ae)
+                            , ae_units(ae)
+                            , ae_info(ae)
+
+                            , ae->exec_code
+                            , ae->new_status
+                            , ae->old_status
+                            , ae->delay
+
+                            , ae->new_value
+                            , ae->old_value
+                            , (uint64_t)ae->last_repeat
+                            , (ae->classification)?ae_classification(ae):"Unknown"
+                            , (ae->component)?ae_component(ae):"Unknown"
+                            , (ae->type)?ae_type(ae):"Unknown"
+        ) < 0))
+            error("HEALTH [%s]: failed to save alarm log entry to '%s'. Health data may be lost in case of abnormal restart.", rrdhost_hostname(host), host->health_log_filename);
+        else {
+            ae->flags |= HEALTH_ENTRY_FLAG_SAVED;
+            host->health_log_entries_written++;
+        }
+    }else
+        sql_health_alarm_log_save(host, ae);
+
+#ifdef ENABLE_ACLK
+    if (netdata_cloud_setting) {
+        sql_queue_alarm_to_aclk(host, ae, 0);
+    }
+#endif
+}
+
+static uint32_t is_valid_alarm_id(RRDHOST *host, const char *chart, const char *name, uint32_t alarm_id)
+{
+    STRING *chart_string = string_strdupz(chart);
+    STRING *name_string = string_strdupz(name);
+
+    uint32_t ret = 1;
+
+    ALARM_ENTRY *ae;
+    for(ae = host->health_log.alarms; ae ;ae = ae->next) {
+        if (unlikely(ae->alarm_id == alarm_id && (!(chart_string == ae->chart && name_string == ae->name)))) {
+            ret = 0;
+            break;
+        }
+    }
+
+    string_freez(chart_string);
+    string_freez(name_string);
+
+    return ret;
+}
+
+static inline ssize_t health_alarm_log_read(RRDHOST *host, FILE *fp, const char *filename) {
+    errno = 0;
+
+    char *s, *buf = mallocz(65536 + 1);
+    size_t line = 0, len = 0;
+    ssize_t loaded = 0, updated = 0, errored = 0, duplicate = 0;
+
+    DICTIONARY *all_rrdcalcs = dictionary_create(
+        DICT_OPTION_NAME_LINK_DONT_CLONE | DICT_OPTION_VALUE_LINK_DONT_CLONE | DICT_OPTION_DONT_OVERWRITE_VALUE);
+    RRDCALC *rc;
+    foreach_rrdcalc_in_rrdhost_read(host, rc) {
+        dictionary_set(all_rrdcalcs, rrdcalc_name(rc), rc, sizeof(*rc));
+    }
+    foreach_rrdcalc_in_rrdhost_done(rc);
+
+    netdata_rwlock_rdlock(&host->health_log.alarm_log_rwlock);
+
+    while((s = fgets_trim_len(buf, 65536, fp, &len))) {
+        host->health_log_entries_written++;
+        line++;
+
+        int max_entries = 33, entries = 0;
+        char *pointers[max_entries];
+
+        pointers[entries++] = s++;
+        while(*s) {
+            if(unlikely(*s == '\t')) {
+                *s = '\0';
+                pointers[entries++] = ++s;
+                if(entries >= max_entries) {
+                    error("HEALTH [%s]: line %zu of file '%s' has more than %d entries. Ignoring excessive entries.", rrdhost_hostname(host), line, filename, max_entries);
+                    break;
+                }
+            }
+            else s++;
+        }
+
+        if(likely(*pointers[0] == 'L'))
+            continue;
+
+        if(likely(*pointers[0] == 'U' || *pointers[0] == 'A')) {
+            ALARM_ENTRY *ae = NULL;
+
+            if(entries < 27) {
+                error("HEALTH [%s]: line %zu of file '%s' should have at least 27 entries, but it has %d. Ignoring it.", rrdhost_hostname(host), line, filename, entries);
+                errored++;
+                continue;
+            }
+
+            // check that we have valid ids
+            uint32_t unique_id = (uint32_t)strtoul(pointers[2], NULL, 16);
+            if(!unique_id) {
+                error("HEALTH [%s]: line %zu of file '%s' states alarm entry with invalid unique id %u (%s). Ignoring it.", rrdhost_hostname(host), line, filename, unique_id, pointers[2]);
+                errored++;
+                continue;
+            }
+
+            uint32_t alarm_id = (uint32_t)strtoul(pointers[3], NULL, 16);
+            if(!alarm_id) {
+                error("HEALTH [%s]: line %zu of file '%s' states alarm entry for invalid alarm id %u (%s). Ignoring it.", rrdhost_hostname(host), line, filename, alarm_id, pointers[3]);
+                errored++;
+                continue;
+            }
+
+            // Check if we got last_repeat field
+            time_t last_repeat = 0;
+            if(entries > 27) {
+                char* alarm_name = pointers[13];
+                last_repeat = (time_t)strtoul(pointers[27], NULL, 16);
+
+                rc = dictionary_get(all_rrdcalcs, alarm_name);
+                if(unlikely(rc)) {
+                    if (rrdcalc_isrepeating(rc)) {
+                        rc->last_repeat = last_repeat;
+                        // We iterate through repeating alarm entries only to
+                        // find the latest last_repeat timestamp. Otherwise,
+                        // there is no need to keep them in memory.
+                        continue;
+                    }
+                }
+            }
+
+            if(unlikely(*pointers[0] == 'A')) {
+                // make sure it is properly numbered
+                if(unlikely(host->health_log.alarms && unique_id < host->health_log.alarms->unique_id)) {
+                    error( "HEALTH [%s]: line %zu of file '%s' has alarm log entry %u in wrong order. Ignoring it."
+                           , rrdhost_hostname(host), line, filename, unique_id);
+                    errored++;
+                    continue;
+                }
+
+                ae = callocz(1, sizeof(ALARM_ENTRY));
+            }
+            else if(unlikely(*pointers[0] == 'U')) {
+                // find the original
+                for(ae = host->health_log.alarms; ae ; ae = ae->next) {
+                    if(unlikely(unique_id == ae->unique_id)) {
+                        if(unlikely(*pointers[0] == 'A')) {
+                            error("HEALTH [%s]: line %zu of file '%s' adds duplicate alarm log entry %u. Using the later."
+                                  , rrdhost_hostname(host), line, filename, unique_id);
+                            *pointers[0] = 'U';
+                            duplicate++;
+                        }
+                        break;
+                    }
+                    else if(unlikely(unique_id > ae->unique_id)) {
+                        // no need to continue
+                        // the linked list is sorted
+                        ae = NULL;
+                        break;
+                    }
+                }
+            }
+
+            // if not found, skip this line
+            if(unlikely(!ae)) {
+                // error("HEALTH [%s]: line %zu of file '%s' updates alarm log entry with unique id %u, but it is not found.", host->hostname, line, filename, unique_id);
+                continue;
+            }
+
+            // check for a possible host mismatch
+            //if(strcmp(pointers[1], host->hostname))
+            //    error("HEALTH [%s]: line %zu of file '%s' provides an alarm for host '%s' but this is named '%s'.", host->hostname, line, filename, pointers[1], host->hostname);
+
+            ae->unique_id = unique_id;
+            if (!is_valid_alarm_id(host, pointers[14], pointers[13], alarm_id)) {
+                STRING *chart = string_strdupz(pointers[14]);
+                STRING *name = string_strdupz(pointers[13]);
+                alarm_id = rrdcalc_get_unique_id(host, chart, name, NULL);
+                string_freez(chart);
+                string_freez(name);
+            }
+            ae->alarm_id = alarm_id;
+            ae->alarm_event_id = (uint32_t)strtoul(pointers[4], NULL, 16);
+            ae->updated_by_id = (uint32_t)strtoul(pointers[5], NULL, 16);
+            ae->updates_id = (uint32_t)strtoul(pointers[6], NULL, 16);
+
+            ae->when = (uint32_t)strtoul(pointers[7], NULL, 16);
+            ae->duration = (uint32_t)strtoul(pointers[8], NULL, 16);
+            ae->non_clear_duration = (uint32_t)strtoul(pointers[9], NULL, 16);
+
+            ae->flags = (uint32_t)strtoul(pointers[10], NULL, 16);
+            ae->flags |= HEALTH_ENTRY_FLAG_SAVED;
+
+            ae->exec_run_timestamp = (uint32_t)strtoul(pointers[11], NULL, 16);
+            ae->delay_up_to_timestamp = (uint32_t)strtoul(pointers[12], NULL, 16);
+
+            string_freez(ae->name);
+            ae->name = string_strdupz(pointers[13]);
+
+            string_freez(ae->chart);
+            ae->chart = string_strdupz(pointers[14]);
+
+            string_freez(ae->family);
+            ae->family = string_strdupz(pointers[15]);
+
+            string_freez(ae->exec);
+            ae->exec = string_strdupz(pointers[16]);
+
+            string_freez(ae->recipient);
+            ae->recipient = string_strdupz(pointers[17]);
+
+            string_freez(ae->source);
+            ae->source = string_strdupz(pointers[18]);
+
+            string_freez(ae->units);
+            ae->units = string_strdupz(pointers[19]);
+
+            string_freez(ae->info);
+            ae->info = string_strdupz(pointers[20]);
+
+            ae->exec_code = str2i(pointers[21]);
+            ae->new_status = str2i(pointers[22]);
+            ae->old_status = str2i(pointers[23]);
+            ae->delay = str2i(pointers[24]);
+
+            ae->new_value = str2l(pointers[25]);
+            ae->old_value = str2l(pointers[26]);
+
+            ae->last_repeat = last_repeat;
+
+            if (likely(entries > 30)) {
+                string_freez(ae->classification);
+                ae->classification = string_strdupz(pointers[28]);
+
+                string_freez(ae->component);
+                ae->component = string_strdupz(pointers[29]);
+
+                string_freez(ae->type);
+                ae->type = string_strdupz(pointers[30]);
+            }
+
+            char value_string[100 + 1];
+            string_freez(ae->old_value_string);
+            string_freez(ae->new_value_string);
+            ae->old_value_string = string_strdupz(format_value_and_unit(value_string, 100, ae->old_value, ae_units(ae), -1));
+            ae->new_value_string = string_strdupz(format_value_and_unit(value_string, 100, ae->new_value, ae_units(ae), -1));
+
+            // add it to host if not already there
+            if(unlikely(*pointers[0] == 'A')) {
+                ae->next = host->health_log.alarms;
+                host->health_log.alarms = ae;
+                sql_health_alarm_log_insert(host, ae);
+                loaded++;
+            }
+            else {
+                sql_health_alarm_log_update(host, ae);
+                updated++;
+            }
+
+            if(unlikely(ae->unique_id > host->health_max_unique_id))
+                host->health_max_unique_id = ae->unique_id;
+
+            if(unlikely(ae->alarm_id >= host->health_max_alarm_id))
+                host->health_max_alarm_id = ae->alarm_id;
+        }
+        else {
+            error("HEALTH [%s]: line %zu of file '%s' is invalid (unrecognized entry type '%s').", rrdhost_hostname(host), line, filename, pointers[0]);
+            errored++;
+        }
+    }
+
+    netdata_rwlock_unlock(&host->health_log.alarm_log_rwlock);
+
+    dictionary_destroy(all_rrdcalcs);
+    all_rrdcalcs = NULL;
+
+    freez(buf);
+
+    if(!host->health_max_unique_id) host->health_max_unique_id = (uint32_t)now_realtime_sec();
+    if(!host->health_max_alarm_id) host->health_max_alarm_id = (uint32_t)now_realtime_sec();
+
+    host->health_log.next_log_id = host->health_max_unique_id + 1;
+    if (unlikely(!host->health_log.next_alarm_id || host->health_log.next_alarm_id <= host->health_max_alarm_id))
+        host->health_log.next_alarm_id = host->health_max_alarm_id + 1;
+
+    debug(D_HEALTH, "HEALTH [%s]: loaded file '%s' with %zd new alarm entries, updated %zd alarms, errors %zd entries, duplicate %zd", rrdhost_hostname(host), filename, loaded, updated, errored, duplicate);
+    return loaded;
+}
+
+inline void health_alarm_log_load(RRDHOST *host) {
+    health_alarm_log_close(host);
+
+    char filename[FILENAME_MAX + 1];
+    snprintfz(filename, FILENAME_MAX, "%s.old", host->health_log_filename);
+    FILE *fp = fopen(filename, "r");
+    if(!fp)
+        error("HEALTH [%s]: cannot open health file: %s", rrdhost_hostname(host), filename);
+    else {
+        health_alarm_log_read(host, fp, filename);
+        fclose(fp);
+    }
+
+    host->health_log_entries_written = 0;
+    fp = fopen(host->health_log_filename, "r");
+    if(!fp)
+        error("HEALTH [%s]: cannot open health file: %s", rrdhost_hostname(host), host->health_log_filename);
+    else {
+        health_alarm_log_read(host, fp, host->health_log_filename);
+        fclose(fp);
+    }
+}
+
+
+// ----------------------------------------------------------------------------
+// health alarm log management
+
+inline ALARM_ENTRY* health_create_alarm_entry(
+    RRDHOST *host,
+    uint32_t alarm_id,
+    uint32_t alarm_event_id,
+    const uuid_t config_hash_id,
+    time_t when,
+    STRING *name,
+    STRING *chart,
+    STRING *chart_context,
+    STRING *family,
+    STRING *class,
+    STRING *component,
+    STRING *type,
+    STRING *exec,
+    STRING *recipient,
+    time_t duration,
+    NETDATA_DOUBLE old_value,
+    NETDATA_DOUBLE new_value,
+    RRDCALC_STATUS old_status,
+    RRDCALC_STATUS new_status,
+    STRING *source,
+    STRING *units,
+    STRING *info,
+    int delay,
+    uint32_t flags
+) {
+    debug(D_HEALTH, "Health adding alarm log entry with id: %u", host->health_log.next_log_id);
+
+    ALARM_ENTRY *ae = callocz(1, sizeof(ALARM_ENTRY));
+    ae->name = string_dup(name);
+    ae->chart = string_dup(chart);
+    ae->chart_context = string_dup(chart_context);
+
+    uuid_copy(ae->config_hash_id, *((uuid_t *) config_hash_id));
+
+    ae->family = string_dup(family);
+    ae->classification = string_dup(class);
+    ae->component = string_dup(component);
+    ae->type = string_dup(type);
+    ae->exec = string_dup(exec);
+    ae->recipient = string_dup(recipient);
+    ae->source = string_dup(source);
+    ae->units = string_dup(units);
+
+    ae->unique_id = host->health_log.next_log_id++;
+    ae->alarm_id = alarm_id;
+    ae->alarm_event_id = alarm_event_id;
+    ae->when = when;
+    ae->old_value = old_value;
+    ae->new_value = new_value;
+
+    char value_string[100 + 1];
+    ae->old_value_string = string_strdupz(format_value_and_unit(value_string, 100, ae->old_value, ae_units(ae), -1));
+    ae->new_value_string = string_strdupz(format_value_and_unit(value_string, 100, ae->new_value, ae_units(ae), -1));
+
+    ae->info = string_dup(info);
+    ae->old_status = old_status;
+    ae->new_status = new_status;
+    ae->duration = duration;
+    ae->delay = delay;
+    ae->delay_up_to_timestamp = when + delay;
+    ae->flags |= flags;
+
+    ae->last_repeat = 0;
+
+    if(ae->old_status == RRDCALC_STATUS_WARNING || ae->old_status == RRDCALC_STATUS_CRITICAL)
+        ae->non_clear_duration += ae->duration;
+
+    return ae;
+}
+
+inline void health_alarm_log_add_entry(
+    RRDHOST *host,
+    ALARM_ENTRY *ae
+) {
+    debug(D_HEALTH, "Health adding alarm log entry with id: %u", ae->unique_id);
+
+    // link it
+    netdata_rwlock_wrlock(&host->health_log.alarm_log_rwlock);
+    ae->next = host->health_log.alarms;
+    host->health_log.alarms = ae;
+    host->health_log.count++;
+    netdata_rwlock_unlock(&host->health_log.alarm_log_rwlock);
+
+    // match previous alarms
+    netdata_rwlock_rdlock(&host->health_log.alarm_log_rwlock);
+    ALARM_ENTRY *t;
+    for(t = host->health_log.alarms ; t ; t = t->next) {
+        if(t != ae && t->alarm_id == ae->alarm_id) {
+            if(!(t->flags & HEALTH_ENTRY_FLAG_UPDATED) && !t->updated_by_id) {
+                t->flags |= HEALTH_ENTRY_FLAG_UPDATED;
+                t->updated_by_id = ae->unique_id;
+                ae->updates_id = t->unique_id;
+
+                if((t->new_status == RRDCALC_STATUS_WARNING || t->new_status == RRDCALC_STATUS_CRITICAL) &&
+                   (t->old_status == RRDCALC_STATUS_WARNING || t->old_status == RRDCALC_STATUS_CRITICAL))
+                    ae->non_clear_duration += t->non_clear_duration;
+
+                health_alarm_log_save(host, t);
+            }
+
+            // no need to continue
+            break;
+        }
+    }
+    netdata_rwlock_unlock(&host->health_log.alarm_log_rwlock);
+
+    health_alarm_log_save(host, ae);
+}
+
+inline void health_alarm_log_free_one_nochecks_nounlink(ALARM_ENTRY *ae) {
+    string_freez(ae->name);
+    string_freez(ae->chart);
+    string_freez(ae->chart_context);
+    string_freez(ae->family);
+    string_freez(ae->classification);
+    string_freez(ae->component);
+    string_freez(ae->type);
+    string_freez(ae->exec);
+    string_freez(ae->recipient);
+    string_freez(ae->source);
+    string_freez(ae->units);
+    string_freez(ae->info);
+    string_freez(ae->old_value_string);
+    string_freez(ae->new_value_string);
+    freez(ae);
+}
+
+inline void health_alarm_log_free(RRDHOST *host) {
+    netdata_rwlock_wrlock(&host->health_log.alarm_log_rwlock);
+
+    ALARM_ENTRY *ae;
+    while((ae = host->health_log.alarms)) {
+        host->health_log.alarms = ae->next;
+        health_alarm_log_free_one_nochecks_nounlink(ae);
+    }
+
+    netdata_rwlock_unlock(&host->health_log.alarm_log_rwlock);
+}
diff --git a/health/notifications/Makefile.am b/health/notifications/Makefile.am
new file mode 100644
index 0000000..f026171
--- /dev/null
+++ b/health/notifications/Makefile.am
@@ -0,0 +1,53 @@
+# SPDX-License-Identifier: GPL-3.0-or-later
+
+AUTOMAKE_OPTIONS = subdir-objects
+MAINTAINERCLEANFILES = $(srcdir)/Makefile.in
+
+CLEANFILES = \
+    alarm-notify.sh \
+    $(NULL)
+
+include $(top_srcdir)/build/subst.inc
+SUFFIXES = .in
+
+dist_libconfig_DATA = \
+    health_alarm_notify.conf \
+    health_email_recipients.conf \
+    $(NULL)
+
+dist_plugins_SCRIPTS = \
+    alarm-notify.sh \
+    alarm-email.sh \
+    alarm-test.sh \
+    $(NULL)
+
+dist_noinst_DATA = \
+    alarm-notify.sh.in \
+    README.md \
+    $(NULL)
+
+include alerta/Makefile.inc
+include awssns/Makefile.inc
+include discord/Makefile.inc
+include email/Makefile.inc
+include flock/Makefile.inc
+include gotify/Makefile.inc
+include hangouts/Makefile.inc
+include irc/Makefile.inc
+include kavenegar/Makefile.inc
+include messagebird/Makefile.inc
+include msteams/Makefile.inc
+include opsgenie/Makefile.inc
+include pagerduty/Makefile.inc
+include pushbullet/Makefile.inc
+include pushover/Makefile.inc
+include rocketchat/Makefile.inc
+include slack/Makefile.inc
+include smstools3/Makefile.inc
+include stackpulse/Makefile.inc
+include syslog/Makefile.inc
+include telegram/Makefile.inc
+include twilio/Makefile.inc
+include web/Makefile.inc
+include matrix/Makefile.inc
+include custom/Makefile.inc
diff --git a/health/notifications/README.md b/health/notifications/README.md
new file mode 100644
index 0000000..0bd6c76
--- /dev/null
+++ b/health/notifications/README.md
@@ -0,0 +1,86 @@
+<!--
+title: "Alarm notifications"
+description: "Reference documentation for Netdata's alarm notification feature, which supports dozens of endpoints, user roles, and more."
+custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/README.md +--> + +# Alarm notifications + +The `exec` line in health configuration defines an external script that will be called once +the alarm is triggered. The default script is `alarm-notify.sh`. + +You can change the default script globally by editing `/etc/netdata/netdata.conf`. + +`alarm-notify.sh` is capable of sending notifications: + +- to multiple recipients +- using multiple notification methods +- filtering severity per recipient + +It uses **roles**. For example `sysadmin`, `webmaster`, `dba`, etc. + +Each alarm is assigned to one or more roles, using the `to` line of the alarm configuration. Then `alarm-notify.sh` uses +its own configuration file `/etc/netdata/health_alarm_notify.conf`. To edit it on your system, run +`/etc/netdata/edit-config health_alarm_notify.conf` and find the destination address of the notification for each +method. + +Each role may have one or more destinations. + +So, for example the `sysadmin` role may send: + +1. emails to admin1@example.com and admin2@example.com +2. pushover.net notifications to USERTOKENS `A`, `B` and `C`. +3. pushbullet.com push notifications to admin1@example.com and admin2@example.com +4. messages to slack.com channel `#alarms` and `#systems`. +5. messages to Discord channels `#alarms` and `#systems`. + +## Configuration + +Edit `/etc/netdata/health_alarm_notify.conf` by running `/etc/netdata/edit-config health_alarm_notify.conf`: + +- settings per notification method: + + all notification methods except email, require some configuration + (i.e. API keys, tokens, destination rooms, channels, etc). 
+ +- **recipients** per **role** per **notification method** + +```sh +grep sysadmin /etc/netdata/health_alarm_notify.conf + +role_recipients_email[sysadmin]="${DEFAULT_RECIPIENT_EMAIL}" +role_recipients_pushover[sysadmin]="${DEFAULT_RECIPIENT_PUSHOVER}" +role_recipients_pushbullet[sysadmin]="${DEFAULT_RECIPIENT_PUSHBULLET}" +role_recipients_telegram[sysadmin]="${DEFAULT_RECIPIENT_TELEGRAM}" +role_recipients_slack[sysadmin]="${DEFAULT_RECIPIENT_SLACK}" +... +``` + +## Testing Notifications + +You can run the following command by hand, to test alarms configuration: + +```sh +# become user netdata +su -s /bin/bash netdata + +# enable debugging info on the console +export NETDATA_ALARM_NOTIFY_DEBUG=1 + +# send test alarms to sysadmin +/usr/libexec/netdata/plugins.d/alarm-notify.sh test + +# send test alarms to any role +/usr/libexec/netdata/plugins.d/alarm-notify.sh test "ROLE" +``` + +Note that in versions before 1.16, the plugins.d directory may be installed in a different location in certain OSs (e.g. under `/usr/lib/netdata`). You can always find the location of the alarm-notify.sh script in `netdata.conf`. + +If you need to dig even deeper, you can trace the execution with `bash -x`. Note that in test mode, alarm-notify.sh calls itself with many more arguments. So first do + +```sh +bash -x /usr/libexec/netdata/plugins.d/alarm-notify.sh test +``` + + Then look in the output for the alarm-notify.sh calls and run the one you want to trace with `bash -x`. 
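The per-recipient severity filtering listed near the top of this README is driven by an optional `|severity` suffix on each recipient entry. The `filter_recipient_by_criticality` function in `alarm-notify.sh` splits each entry on `|` and accepts only `critical` as a severity. The following is a minimal sketch of that split, not code taken from the script: the helper name `split_recipient`, its `sr_` variables, and the `any` placeholder are assumptions made for illustration, and it uses POSIX prefix/suffix expansions rather than the script's bash substitutions.

```sh
# Hypothetical helper (not part of alarm-notify.sh): split one recipient entry
# of the form "recipient" or "recipient|severity" into its two parts.
split_recipient() {
    sr_recipient="${1%%|*}"   # text before the first '|'
    sr_severity="${1##*|}"    # text after the last '|'

    # With no '|' in the entry, both expansions yield the whole entry,
    # which means no severity filter was requested for this recipient.
    if [ "${sr_recipient}" = "${sr_severity}" ]; then
        sr_severity="any"
    fi

    echo "${sr_recipient} ${sr_severity}"
}

split_recipient "admin1@example.com"           # prints: admin1@example.com any
split_recipient "admin2@example.com|critical"  # prints: admin2@example.com critical
```

In a role definition this would correspond to an entry such as `admin2@example.com|critical`, so that recipient is first notified when the alarm becomes CRITICAL.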
+ diff --git a/health/notifications/alarm-email.sh b/health/notifications/alarm-email.sh new file mode 100755 index 0000000..69c4c3f --- /dev/null +++ b/health/notifications/alarm-email.sh @@ -0,0 +1,7 @@ +#!/usr/bin/env bash +# SPDX-License-Identifier: GPL-3.0-or-later + +# OBSOLETE - REPLACED WITH +# alarm-notify.sh + +${0/alarm-email.sh/alarm-notify.sh} "${@}" diff --git a/health/notifications/alarm-notify.sh.in b/health/notifications/alarm-notify.sh.in new file mode 100755 index 0000000..3edf3d0 --- /dev/null +++ b/health/notifications/alarm-notify.sh.in @@ -0,0 +1,3594 @@ +#!/usr/bin/env bash +#shellcheck source=/dev/null disable=SC2086,SC2154 + +# netdata +# real-time performance and health monitoring, done right! +# (C) 2017 Costa Tsaousis <costa@tsaousis.gr> +# SPDX-License-Identifier: GPL-3.0-or-later +# +# Script to send alarm notifications for netdata +# +# Features: +# - multiple notification methods +# - multiple roles per alarm +# - multiple recipients per role +# - severity filtering per recipient +# +# Supported notification methods: +# - emails by @ktsaou +# - slack.com notifications by @ktsaou +# - alerta.io notifications by @kattunga +# - discordapp.com notifications by @lowfive +# - pushover.net notifications by @ktsaou +# - pushbullet.com push notifications by Tiago Peralta @tperalta82 #1070 +# - telegram.org notifications by @hashworks #1002 +# - twilio.com notifications by Levi Blaney @shadycuz #1211 +# - kafka notifications by @ktsaou #1342 +# - pagerduty.com notifications by Jim Cooley @jimcooley #1373 +# - messagebird.com notifications by @tech_no_logical #1453 +# - hipchat notifications by @ktsaou #1561 +# - fleep notifications by @Ferroin +# - prowlapp.com notifications by @Ferroin +# - irc notifications by @manosf +# - custom notifications by @ktsaou +# - syslog messages by @Ferroin +# - Microsoft Team notification by @tioumen +# - RocketChat notifications by @Hermsi1337 #3777 +# - Google Hangouts Chat notifications by @EnzoAkira and 
@hendrikhofstadt +# - Dynatrace Event by @illumine +# - Stackpulse Event by @thiagoftsm +# - Opsgenie by @thiaoftsm #9858 +# - Gotify by @coffeegrind123 + +# ----------------------------------------------------------------------------- +# testing notifications + +if { [ "${1}" = "test" ] || [ "${2}" = "test" ]; } && [ "${#}" -le 2 ]; then + if [ "${2}" = "test" ]; then + recipient="${1}" + else + recipient="${2}" + fi + + [ -z "${recipient}" ] && recipient="sysadmin" + + id=1 + last="CLEAR" + test_res=0 + for x in "WARNING" "CRITICAL" "CLEAR"; do + echo >&2 + echo >&2 "# SENDING TEST ${x} ALARM TO ROLE: ${recipient}" + + "${0}" "${recipient}" "$(hostname)" 1 1 "${id}" "$(date +%s)" "test_alarm" "test.chart" "test.family" "${x}" "${last}" 100 90 "${0}" 1 $((0 + id)) "units" "this is a test alarm to verify notifications work" "new value" "old value" "evaluated expression" "expression variable values" 0 0 + #shellcheck disable=SC2181 + if [ $? -ne 0 ]; then + echo >&2 "# FAILED" + test_res=1 + else + echo >&2 "# OK" + fi + + last="${x}" + id=$((id + 1)) + done + + exit $test_res +fi + +export PATH="${PATH}:/sbin:/usr/sbin:/usr/local/sbin" +export LC_ALL=C + +# ----------------------------------------------------------------------------- + +PROGRAM_NAME="$(basename "${0}")" + +logdate() { + date "+%Y-%m-%d %H:%M:%S" +} + +log() { + local status="${1}" + shift + + echo >&2 "$(logdate): ${PROGRAM_NAME}: ${status}: ${*}" + +} + +warning() { + log WARNING "${@}" +} + +error() { + log ERROR "${@}" +} + +info() { + log INFO "${@}" +} + +fatal() { + log FATAL "${@}" + exit 1 +} + +debug=${NETDATA_ALARM_NOTIFY_DEBUG-0} +debug() { + [ "${debug}" = "1" ] && log DEBUG "${@}" +} + +docurl() { + if [ -z "${curl}" ]; then + error "\${curl} is unset." 
+ return 1 + fi + + if [ "${debug}" = "1" ]; then + echo >&2 "--- BEGIN curl command ---" + printf >&2 "%q " ${curl} "${@}" + echo >&2 + echo >&2 "--- END curl command ---" + + local out code ret + out=$(mktemp /tmp/netdata-health-alarm-notify-XXXXXXXX) + code=$(${curl} ${curl_options} --write-out "%{http_code}" --output "${out}" --silent --show-error "${@}") + ret=$? + echo >&2 "--- BEGIN received response ---" + cat >&2 "${out}" + echo >&2 + echo >&2 "--- END received response ---" + echo >&2 "RECEIVED HTTP RESPONSE CODE: ${code}" + rm "${out}" + echo "${code}" + return ${ret} + fi + + ${curl} ${curl_options} --write-out "%{http_code}" --output /dev/null --silent --show-error "${@}" + return $? +} + +# ----------------------------------------------------------------------------- +# List of all the notification mechanisms we support. +# Used in a couple of places to write more compact code. + +method_names=" +email +pushover +pushbullet +telegram +slack +alerta +flock +discord +hipchat +twilio +messagebird +pd +fleep +syslog +custom +msteams +kavenegar +prowl +irc +awssns +rocketchat +sms +hangouts +dynatrace +matrix +" + +# ----------------------------------------------------------------------------- +# this is to be overwritten by the config file + +custom_sender() { + info "not sending custom notification for ${status} of '${host}.${chart}.${name}'" +} + +# ----------------------------------------------------------------------------- + +# check for BASH v4+ (required for associative arrays) +if [ ${BASH_VERSINFO[0]} -lt 4 ]; then + fatal "BASH version 4 or later is required (this is ${BASH_VERSION})." 
+fi + +# ----------------------------------------------------------------------------- +# defaults to allow running this script by hand + +[ -z "${NETDATA_USER_CONFIG_DIR}" ] && NETDATA_USER_CONFIG_DIR="@configdir_POST@" +[ -z "${NETDATA_STOCK_CONFIG_DIR}" ] && NETDATA_STOCK_CONFIG_DIR="@libconfigdir_POST@" +[ -z "${NETDATA_CACHE_DIR}" ] && NETDATA_CACHE_DIR="@cachedir_POST@" +[ -z "${NETDATA_REGISTRY_URL}" ] && NETDATA_REGISTRY_URL="https://registry.my-netdata.io" +[ -z "${NETDATA_REGISTRY_CLOUD_BASE_URL}" ] && NETDATA_REGISTRY_CLOUD_BASE_URL="https://api.netdata.cloud" + +# ----------------------------------------------------------------------------- +# parse command line parameters + +if [[ ${1} = "unittest" ]]; then + unittest=1 # enable unit testing mode + roles="${2}" # the role that should be used for unit testing + cfgfile="${3}" # the location of the config file to use for unit testing + status="${4}" # the current status : REMOVED, UNINITIALIZED, UNDEFINED, CLEAR, WARNING, CRITICAL + old_status="${5}" # the previous status: REMOVED, UNINITIALIZED, UNDEFINED, CLEAR, WARNING, CRITICAL +elif [[ ${1} = "dump_methods" ]]; then + dump_methods=1 + status="WARNING" +else + roles="${1}" # the roles that should be notified for this event + args_host="${2}" # the host generated this event + unique_id="${3}" # the unique id of this event + alarm_id="${4}" # the unique id of the alarm that generated this event + event_id="${5}" # the incremental id of the event, for this alarm id + when="${6}" # the timestamp this event occurred + name="${7}" # the name of the alarm, as given in netdata health.d entries + chart="${8}" # the name of the chart (type.id) + family="${9}" # the family of the chart + status="${10}" # the current status : REMOVED, UNINITIALIZED, UNDEFINED, CLEAR, WARNING, CRITICAL + old_status="${11}" # the previous status: REMOVED, UNINITIALIZED, UNDEFINED, CLEAR, WARNING, CRITICAL + value="${12}" # the current value of the alarm + old_value="${13}" # the 
previous value of the alarm + src="${14}" # the line number and file the alarm has been configured + duration="${15}" # the duration in seconds of the previous alarm state + non_clear_duration="${16}" # the total duration in seconds this is/was non-clear + units="${17}" # the units of the value + info="${18}" # a short description of the alarm + value_string="${19}" # friendly value (with units) + # shellcheck disable=SC2034 + # variable is unused, but https://github.com/netdata/netdata/pull/5164#discussion_r255572947 + old_value_string="${20}" # friendly old value (with units), previously named "old_value_string" + calc_expression="${21}" # contains the expression that was evaluated to trigger the alarm + calc_param_values="${22}" # the values of the parameters in the expression, at the time of the evaluation + total_warnings="${23}" # Total number of alarms in WARNING state + total_critical="${24}" # Total number of alarms in CRITICAL state + total_warn_alarms="${25}" # List of alarms in warning state + total_crit_alarms="${26}" # List of alarms in critical state + classification="${27}" # The class field from .conf files + edit_command_line="${28}" # The command to edit the alarm, with the line number + child_machine_guid="${29}" # If populated, the notification is sent for a child +fi + +# ----------------------------------------------------------------------------- +# find a suitable hostname to use, if netdata did not supply a hostname + +if [ -z "${args_host}" ]; then + this_host=$(hostname -s 2>/dev/null) + host="${this_host}" + args_host="${this_host}" +else + host="${args_host}" +fi + +# ----------------------------------------------------------------------------- +# screen statuses we don't need to send a notification + +# don't do anything if this is not WARNING, CRITICAL or CLEAR +if [ "${status}" != "WARNING" ] && [ "${status}" != "CRITICAL" ] && [ "${status}" != "CLEAR" ]; then + info "not sending notification for ${status} of 
'${host}.${chart}.${name}'" + exit 1 +fi + +# don't do anything if this is CLEAR, but it was not WARNING or CRITICAL +if [ "${clear_alarm_always}" != "YES" ] && [ "${old_status}" != "WARNING" ] && [ "${old_status}" != "CRITICAL" ] && [ "${status}" = "CLEAR" ]; then + info "not sending notification for ${status} of '${host}.${chart}.${name}' (last status was ${old_status})" + exit 1 +fi + +# ----------------------------------------------------------------------------- +# load configuration + +# By default fetch images from the global public registry. +# This is required by default, since all notification methods need to download +# images via the Internet, and private registries might not be reachable. +# This can be overwritten at the configuration file. +images_base_url="https://registry.my-netdata.io" + +# curl options to use +curl_options="" + +# hostname handling +use_fqdn="NO" + +# needed commands +# if empty they will be searched in the system path +curl= +sendmail= + +# enable / disable features +for method_name in ${method_names^^}; do + declare SEND_${method_name}="YES" + declare DEFAULT_RECIPIENT_${method_name} +done + +for method_name in ${method_names}; do + declare -A role_recipients_${method_name} +done + +# slack configs +SLACK_WEBHOOK_URL= + +# Microsoft Teams configs +MSTEAMS_WEBHOOK_URL= + +# Legacy Microsoft Teams configs for backwards compatibility: +declare -A role_recipients_msteam + +# rocketchat configs +ROCKETCHAT_WEBHOOK_URL= + +# alerta configs +ALERTA_WEBHOOK_URL= +ALERTA_API_KEY= + +# flock configs +FLOCK_WEBHOOK_URL= + +# discord configs +DISCORD_WEBHOOK_URL= + +# pushover configs +PUSHOVER_APP_TOKEN= + +# pushbullet configs +PUSHBULLET_ACCESS_TOKEN= +PUSHBULLET_SOURCE_DEVICE= + +# twilio configs +TWILIO_ACCOUNT_SID= +TWILIO_ACCOUNT_TOKEN= +TWILIO_NUMBER= + +# hipchat configs +HIPCHAT_SERVER= +HIPCHAT_AUTH_TOKEN= + +# messagebird configs +MESSAGEBIRD_ACCESS_KEY= +MESSAGEBIRD_NUMBER= + +# kavenegar configs +KAVENEGAR_API_KEY= 
+KAVENEGAR_SENDER= + +# telegram configs +TELEGRAM_BOT_TOKEN= + +# kafka configs +SEND_KAFKA="YES" +KAFKA_URL= +KAFKA_SENDER_IP= + +# pagerduty.com configs +PD_SERVICE_KEY= +USE_PD_VERSION= + +# fleep.io configs +FLEEP_SENDER="${host}" + +# Amazon SNS configs +AWSSNS_MESSAGE_FORMAT= + +# Matrix configs +MATRIX_HOMESERVER= +MATRIX_ACCESSTOKEN= + +# syslog configs +SYSLOG_FACILITY= + +# email configs +EMAIL_SENDER= +EMAIL_CHARSET=$(locale charmap 2>/dev/null) +EMAIL_THREADING= +EMAIL_PLAINTEXT_ONLY= + +# irc configs +IRC_NICKNAME= +IRC_REALNAME= +IRC_NETWORK= +IRC_PORT=6667 + +# hangouts configs +declare -A HANGOUTS_WEBHOOK_URI +declare -A HANGOUTS_WEBHOOK_THREAD + +# dynatrace configs +DYNATRACE_SPACE= +DYNATRACE_SERVER= +DYNATRACE_TOKEN= +DYNATRACE_TAG_VALUE= +DYNATRACE_ANNOTATION_TYPE= +DYNATRACE_EVENT= +SEND_DYNATRACE= + +# stackpulse configs +STACKPULSE_WEBHOOK= + +# gotify configs +GOTIFY_APP_URL= +GOTIFY_APP_TOKEN= + +# opsgenie configs +OPSGENIE_API_KEY= + +# load the stock and user configuration files +# these will overwrite the variables above + +if [ ${unittest} ]; then + if ! source "${cfgfile}"; then + error "Failed to load requested config file." + exit 1 + fi +else + for CONFIG in "${NETDATA_STOCK_CONFIG_DIR}/health_alarm_notify.conf" "${NETDATA_USER_CONFIG_DIR}/health_alarm_notify.conf"; do + if [ -f "${CONFIG}" ]; then + debug "Loading config file '${CONFIG}'..." + source "${CONFIG}" || error "Failed to load config file '${CONFIG}'." + else + warning "Cannot find file '${CONFIG}'." + fi + done +fi + +if [[ ! $curl_options =~ .*\--connect-timeout ]]; then + curl_options+=" --connect-timeout 5" +fi + +OPSGENIE_API_URL=${OPSGENIE_API_URL:-"https://api.opsgenie.com"} + +# If we didn't autodetect the character set for e-mail and it wasn't +# set by the user, we need to set it to a reasonable default. UTF-8 +# should be correct for almost all modern UNIX systems. 
+if [ -z ${EMAIL_CHARSET} ]; then + EMAIL_CHARSET="UTF-8" +fi + +# If we've been asked to use FQDN's for the URL's in the alarm, do so, +# unless we're sending an alarm for a child system which we can't get the +# FQDN of easily. +if [ "${use_fqdn}" = "YES" ] && [ "${host}" = "$(hostname -s 2>/dev/null)" ]; then + host="$(hostname -f 2>/dev/null)" +fi + + +# ----------------------------------------------------------------------------- +# migrate old Microsoft Teams configuration keys after loading configuration + +msteams_migration() { + SEND_MSTEAMS=${SEND_MSTEAM:-$SEND_MSTEAMS} + unset -v SEND_MSTEAM + DEFAULT_RECIPIENT_MSTEAMS=${DEFAULT_RECIPIENT_MSTEAM:-$DEFAULT_RECIPIENT_MSTEAMS} + MSTEAMS_WEBHOOK_URL=${MSTEAM_WEBHOOK_URL:-$MSTEAMS_WEBHOOK_URL} + MSTEAMS_ICON_DEFAULT=${MSTEAM_ICON_DEFAULT:-$MSTEAMS_ICON_DEFAULT} + MSTEAMS_ICON_CLEAR=${MSTEAM_ICON_CLEAR:-$MSTEAMS_ICON_CLEAR} + MSTEAMS_ICON_WARNING=${MSTEAM_ICON_WARNING:-$MSTEAMS_ICON_WARNING} + MSTEAMS_ICON_CRITICAL=${MSTEAM_ICON_CRITICAL:-$MSTEAMS_ICON_CRITICAL} + MSTEAMS_COLOR_DEFAULT=${MSTEAM_COLOR_DEFAULT:-$MSTEAMS_COLOR_DEFAULT} + MSTEAMS_COLOR_CLEAR=${MSTEAM_COLOR_CLEAR:-$MSTEAMS_COLOR_CLEAR} + MSTEAMS_COLOR_WARNING=${MSTEAM_COLOR_WARNING:-$MSTEAMS_COLOR_WARNING} + MSTEAMS_COLOR_CRITICAL=${MSTEAM_COLOR_CRITICAL:-$MSTEAMS_COLOR_CRITICAL} + + # migrate role specific recipients: + for key in "${!role_recipients_msteam[@]}"; do + # Disable check, if role_recipients_msteams is ever used: + # The role_recipients_$method are created and used programmatically + # by iterating over $methods. shellcheck therefore doesn't realize + # that role_recipients_msteams is actually used in the block + # "find the recipients' addresses per method". 
+ # shellcheck disable=SC2034 + role_recipients_msteams["$key"]="${role_recipients_msteam["$key"]}" + done +} + +msteams_migration + +# ----------------------------------------------------------------------------- +# filter a recipient based on alarm event severity + +filter_recipient_by_criticality() { + local method="${1}" x="${2}" r s + shift + + r="${x/|*/}" # the recipient + s="${x/*|/}" # the severity required for notifying this recipient + + # no severity filtering for this person + [ "${r}" = "${s}" ] && return 0 + + # the severity is invalid + s="${s^^}" + if [ "${s}" != "CRITICAL" ]; then + error "SEVERITY FILTERING for ${x} VIA ${method}: invalid severity '${s,,}', only 'critical' is supported." + return 0 + fi + + # create the status tracking directory for this user + [ ! -d "${NETDATA_CACHE_DIR}/alarm-notify/${method}/${r}" ] && + mkdir -p "${NETDATA_CACHE_DIR}/alarm-notify/${method}/${r}" + + case "${status}" in + CRITICAL) + # make sure he will get future notifications for this alarm too + touch "${NETDATA_CACHE_DIR}/alarm-notify/${method}/${r}/${alarm_id}" + debug "SEVERITY FILTERING for ${x} VIA ${method}: ALLOW: the alarm is CRITICAL (will now receive next status change)" + return 0 + ;; + + WARNING) + if [ -f "${NETDATA_CACHE_DIR}/alarm-notify/${method}/${r}/${alarm_id}" ]; then + # we do not remove the file, so that he will get future notifications of this alarm + debug "SEVERITY FILTERING for ${x} VIA ${method}: ALLOW: recipient has been notified for this alarm in the past (will still receive next status change)" + return 0 + fi + ;; + + *) + if [ -f "${NETDATA_CACHE_DIR}/alarm-notify/${method}/${r}/${alarm_id}" ]; then + # remove the file, so that he will only receive notifications for CRITICAL states for this alarm + rm "${NETDATA_CACHE_DIR}/alarm-notify/${method}/${r}/${alarm_id}" + debug "SEVERITY FILTERING for ${x} VIA ${method}: ALLOW: recipient has been notified for this alarm (will only receive CRITICAL notifications from now on)" + 
return 0 + fi + ;; + esac + + debug "SEVERITY FILTERING for ${x} VIA ${method}: BLOCK: recipient should not receive this notification" + return 1 +} + +# ----------------------------------------------------------------------------- +# verify the delivery methods supported + +# check slack +[ -z "${SLACK_WEBHOOK_URL}" ] && SEND_SLACK="NO" + +# check rocketchat +[ -z "${ROCKETCHAT_WEBHOOK_URL}" ] && SEND_ROCKETCHAT="NO" + +# check alerta +[ -z "${ALERTA_WEBHOOK_URL}" ] && SEND_ALERTA="NO" + +# check flock +[ -z "${FLOCK_WEBHOOK_URL}" ] && SEND_FLOCK="NO" + +# check discord +[ -z "${DISCORD_WEBHOOK_URL}" ] && SEND_DISCORD="NO" + +# check pushover +[ -z "${PUSHOVER_APP_TOKEN}" ] && SEND_PUSHOVER="NO" + +# check pushbullet +[ -z "${PUSHBULLET_ACCESS_TOKEN}" ] && SEND_PUSHBULLET="NO" + +# check twilio +{ [ -z "${TWILIO_ACCOUNT_TOKEN}" ] || [ -z "${TWILIO_ACCOUNT_SID}" ] || [ -z "${TWILIO_NUMBER}" ]; } && SEND_TWILIO="NO" + +# check hipchat +[ -z "${HIPCHAT_AUTH_TOKEN}" ] && SEND_HIPCHAT="NO" + +# check messagebird +{ [ -z "${MESSAGEBIRD_ACCESS_KEY}" ] || [ -z "${MESSAGEBIRD_NUMBER}" ]; } && SEND_MESSAGEBIRD="NO" + +# check kavenegar +{ [ -z "${KAVENEGAR_API_KEY}" ] || [ -z "${KAVENEGAR_SENDER}" ]; } && SEND_KAVENEGAR="NO" + +# check telegram +[ -z "${TELEGRAM_BOT_TOKEN}" ] && SEND_TELEGRAM="NO" + +# check kafka +{ [ -z "${KAFKA_URL}" ] || [ -z "${KAFKA_SENDER_IP}" ]; } && SEND_KAFKA="NO" + +# check irc +[ -z "${IRC_NETWORK}" ] && SEND_IRC="NO" + +# check hangouts +[ ${#HANGOUTS_WEBHOOK_URI[@]} -eq 0 ] && SEND_HANGOUTS="NO" + +# check fleep +#shellcheck disable=SC2153 +{ [ -z "${FLEEP_SERVER}" ] || [ -z "${FLEEP_SENDER}" ]; } && SEND_FLEEP="NO" + +# check dynatrace +{ [ -z "${DYNATRACE_SPACE}" ] || + [ -z "${DYNATRACE_SERVER}" ] || + [ -z "${DYNATRACE_TOKEN}" ] || + [ -z "${DYNATRACE_TAG_VALUE}" ] || + [ -z "${DYNATRACE_EVENT}" ]; } && SEND_DYNATRACE="NO" + +# check opsgenie +[ -z "${OPSGENIE_API_KEY}" ] && SEND_OPSGENIE="NO" + +# check matrix +{ [ -z 
"${MATRIX_HOMESERVER}" ] || [ -z "${MATRIX_ACCESSTOKEN}" ]; } && SEND_MATRIX="NO" + +# check gotify +{ [ -z "${GOTIFY_APP_TOKEN}" ] || [ -z "${GOTIFY_APP_URL}" ]; } && SEND_GOTIFY="NO" + +# check stackpulse +[ -z "${STACKPULSE_WEBHOOK}" ] && SEND_STACKPULSE="NO" + +# check msteams +[ -z "${MSTEAMS_WEBHOOK_URL}" ] && SEND_MSTEAMS="NO" + +# check pd +[ -z "${DEFAULT_RECIPIENT_PD}" ] && SEND_PD="NO" + +# check prowl +[ -z "${DEFAULT_RECIPIENT_PROWL}" ] && SEND_PROWL="NO" + +# check custom +[ -z "${DEFAULT_RECIPIENT_CUSTOM}" ] && SEND_CUSTOM="NO" + +if [ "${SEND_PUSHOVER}" = "YES" ] || + [ "${SEND_SLACK}" = "YES" ] || + [ "${SEND_ROCKETCHAT}" = "YES" ] || + [ "${SEND_ALERTA}" = "YES" ] || + [ "${SEND_PD}" = "YES" ] || + [ "${SEND_FLOCK}" = "YES" ] || + [ "${SEND_DISCORD}" = "YES" ] || + [ "${SEND_HIPCHAT}" = "YES" ] || + [ "${SEND_TWILIO}" = "YES" ] || + [ "${SEND_MESSAGEBIRD}" = "YES" ] || + [ "${SEND_KAVENEGAR}" = "YES" ] || + [ "${SEND_TELEGRAM}" = "YES" ] || + [ "${SEND_PUSHBULLET}" = "YES" ] || + [ "${SEND_KAFKA}" = "YES" ] || + [ "${SEND_FLEEP}" = "YES" ] || + [ "${SEND_PROWL}" = "YES" ] || + [ "${SEND_HANGOUTS}" = "YES" ] || + [ "${SEND_MATRIX}" = "YES" ] || + [ "${SEND_CUSTOM}" = "YES" ] || + [ "${SEND_MSTEAMS}" = "YES" ] || + [ "${SEND_DYNATRACE}" = "YES" ] || + [ "${SEND_STACKPULSE}" = "YES" ] || + [ "${SEND_OPSGENIE}" = "YES" ] || + [ "${SEND_GOTIFY}" = "YES" ]; then + # if we need curl, check for the curl command + if [ -z "${curl}" ]; then + curl="$(command -v curl 2>/dev/null)" + fi + if [ -z "${curl}" ]; then + error "Cannot find curl command in the system path. Disabling all curl based notifications." 
+ SEND_PUSHOVER="NO" + SEND_PUSHBULLET="NO" + SEND_TELEGRAM="NO" + SEND_SLACK="NO" + SEND_MSTEAMS="NO" + SEND_ROCKETCHAT="NO" + SEND_ALERTA="NO" + SEND_PD="NO" + SEND_FLOCK="NO" + SEND_DISCORD="NO" + SEND_TWILIO="NO" + SEND_HIPCHAT="NO" + SEND_MESSAGEBIRD="NO" + SEND_KAVENEGAR="NO" + SEND_KAFKA="NO" + SEND_FLEEP="NO" + SEND_PROWL="NO" + SEND_HANGOUTS="NO" + SEND_MATRIX="NO" + SEND_CUSTOM="NO" + SEND_DYNATRACE="NO" + SEND_STACKPULSE="NO" + SEND_OPSGENIE="NO" + SEND_GOTIFY="NO" + fi +fi + +if [ "${SEND_SMS}" = "YES" ]; then + if [ -z "${sendsms}" ]; then + sendsms="$(command -v sendsms 2>/dev/null)" + fi + if [ -z "${sendsms}" ]; then + SEND_SMS="NO" + fi +fi +# if we need sendmail, check for the sendmail command +if [ "${SEND_EMAIL}" = "YES" ] && [ -z "${sendmail}" ]; then + sendmail="$(command -v sendmail 2>/dev/null)" + if [ -z "${sendmail}" ]; then + debug "Cannot find sendmail command in the system path. Disabling email notifications." + SEND_EMAIL="NO" + fi +fi + +# if we need logger, check for the logger command +if [ "${SEND_SYSLOG}" = "YES" ] && [ -z "${logger}" ]; then + logger="$(command -v logger 2>/dev/null)" + if [ -z "${logger}" ]; then + debug "Cannot find logger command in the system path. Disabling syslog notifications." + SEND_SYSLOG="NO" + fi +fi + +# if we need aws, check for the aws command +if [ "${SEND_AWSSNS}" = "YES" ] && [ -z "${aws}" ]; then + aws="$(command -v aws 2>/dev/null)" + if [ -z "${aws}" ]; then + debug "Cannot find aws command in the system path. Disabling Amazon SNS notifications." + SEND_AWSSNS="NO" + fi +fi + +if [ ${dump_methods} ]; then + for name in "${!SEND_@}"; do + if [ "${!name}" = "YES" ]; then + echo "$name" + fi + done + exit +fi + +# ----------------------------------------------------------------------------- +# find the recipients' addresses per method + +# netdata may call us with multiple roles, and roles may have multiple but +# overlapping recipients - so, here we find the unique recipients. 
+for method_name in ${method_names}; do + send_var="SEND_${method_name^^}" + if [ "${!send_var}" = "NO" ]; then + continue + fi + + declare -A arr_var=() + + for x in ${roles//,/ }; do + # the roles 'silent' and 'disabled' mean: + # don't send a notification for this role + if [ "${x}" = "silent" ] || [ "${x}" = "disabled" ]; then + continue + fi + + role_recipients="role_recipients_${method_name}[$x]" + default_recipient_var="DEFAULT_RECIPIENT_${method_name^^}" + + a="${!role_recipients}" + [ -z "${a}" ] && a="${!default_recipient_var}" + for r in ${a//,/ }; do + [ "${r}" != "disabled" ] && filter_recipient_by_criticality ${method_name} "${r}" && arr_var[${r/|*/}]="1" + done + done + + # build the list of recipients + to_var="to_${method_name}" + declare to_${method_name}="${!arr_var[*]}" + + [ -z "${!to_var}" ] && declare ${send_var}="NO" +done + +# ----------------------------------------------------------------------------- +# handle fixup of the email recipient list. + +fix_to_email() { + to_email= + while [ -n "${1}" ]; do + [ -n "${to_email}" ] && to_email="${to_email}, " + to_email="${to_email}${1}" + shift 1 + done +} + +# ${to_email} without quotes here +fix_to_email ${to_email} + +# ----------------------------------------------------------------------------- +# handle output if we're running in unit test mode +if [ ${unittest} ]; then + for method_name in ${method_names}; do + to_var="to_${method_name}" + echo "results: ${method_name}: ${!to_var}" + done + exit 0 +fi + +# ----------------------------------------------------------------------------- +# check that we have at least a method enabled +proceed=0 +for method in "${SEND_EMAIL}" \ + "${SEND_PUSHOVER}" \ + "${SEND_TELEGRAM}" \ + "${SEND_SLACK}" \ + "${SEND_ROCKETCHAT}" \ + "${SEND_ALERTA}" \ + "${SEND_FLOCK}" \ + "${SEND_DISCORD}" \ + "${SEND_TWILIO}" \ + "${SEND_HIPCHAT}" \ + "${SEND_MESSAGEBIRD}" \ + "${SEND_KAVENEGAR}" \ + "${SEND_PUSHBULLET}" \ + "${SEND_KAFKA}" \ + "${SEND_PD}" \ + 
"${SEND_FLEEP}" \ + "${SEND_PROWL}" \ + "${SEND_MATRIX}" \ + "${SEND_CUSTOM}" \ + "${SEND_IRC}" \ + "${SEND_HANGOUTS}" \ + "${SEND_AWSSNS}" \ + "${SEND_SYSLOG}" \ + "${SEND_SMS}" \ + "${SEND_MSTEAMS}" \ + "${SEND_DYNATRACE}" \ + "${SEND_STACKPULSE}" \ + "${SEND_OPSGENIE}" \ + "${SEND_GOTIFY}" ; do + + if [ "${method}" == "YES" ]; then + proceed=1 + break + fi +done +if [ "$proceed" -eq 0 ]; then + fatal "All notification methods are disabled. Not sending notification for host '${host}', chart '${chart}' to '${roles}' for '${name}' = '${value}' for status '${status}'." +fi + +# ----------------------------------------------------------------------------- +# get the date the alarm happened + +date=$(date --date=@${when} "${date_format}" 2>/dev/null) +[ -z "${date}" ] && date=$(date "${date_format}" 2>/dev/null) +[ -z "${date}" ] && date=$(date --date=@${when} 2>/dev/null) +[ -z "${date}" ] && date=$(date 2>/dev/null) + +# ----------------------------------------------------------------------------- +# get the date in utc the alarm happened + +date_utc=$(date --date=@${when} "${date_format}" -u 2>/dev/null) +[ -z "${date_utc}" ] && date_utc=$(date -u "${date_format}" 2>/dev/null) +[ -z "${date_utc}" ] && date_utc=$(date -u --date=@${when} 2>/dev/null) +[ -z "${date_utc}" ] && date_utc=$(date -u 2>/dev/null) + +# ---------------------------------------------------------------------------- +# prepare some extra headers if we've been asked to thread e-mails +if [ "${SEND_EMAIL}" == "YES" ] && [ "${EMAIL_THREADING}" != "NO" ]; then + email_thread_headers="In-Reply-To: <${chart}-${name}@${host}>\\r\\nReferences: <${chart}-${name}@${host}>" +else + email_thread_headers= +fi + +# ----------------------------------------------------------------------------- +# function to URL encode a string + +urlencode() { + local string="${1}" strlen encoded pos c o + + strlen=${#string} + for ((pos = 0; pos < strlen; pos++)); do + c=${string:pos:1} + case "${c}" in + [-_.~a-zA-Z0-9]) + 
o="${c}" + ;; + + *) + printf -v o '%%%02x' "'${c}" + ;; + esac + encoded+="${o}" + done + + REPLY="${encoded}" + echo "${REPLY}" +} + +# ----------------------------------------------------------------------------- +# function to convert a duration in seconds, to a human readable duration +# using DAYS, MINUTES, SECONDS + +duration4human() { + local s="${1}" d=0 h=0 m=0 ds="day" hs="hour" ms="minute" ss="second" ret + d=$((s / 86400)) + s=$((s - (d * 86400))) + h=$((s / 3600)) + s=$((s - (h * 3600))) + m=$((s / 60)) + s=$((s - (m * 60))) + + if [ ${d} -gt 0 ]; then + [ ${m} -ge 30 ] && h=$((h + 1)) + [ ${d} -gt 1 ] && ds="days" + [ ${h} -gt 1 ] && hs="hours" + if [ ${h} -gt 0 ]; then + ret="${d} ${ds} and ${h} ${hs}" + else + ret="${d} ${ds}" + fi + elif [ ${h} -gt 0 ]; then + [ ${s} -ge 30 ] && m=$((m + 1)) + [ ${h} -gt 1 ] && hs="hours" + [ ${m} -gt 1 ] && ms="minutes" + if [ ${m} -gt 0 ]; then + ret="${h} ${hs} and ${m} ${ms}" + else + ret="${h} ${hs}" + fi + elif [ ${m} -gt 0 ]; then + [ ${m} -gt 1 ] && ms="minutes" + [ ${s} -gt 1 ] && ss="seconds" + if [ ${s} -gt 0 ]; then + ret="${m} ${ms} and ${s} ${ss}" + else + ret="${m} ${ms}" + fi + else + [ ${s} -gt 1 ] && ss="seconds" + ret="${s} ${ss}" + fi + + REPLY="${ret}" + echo "${REPLY}" +} + +# ----------------------------------------------------------------------------- +# email sender + +send_email() { + local ret opts=() sender_email="${EMAIL_SENDER}" sender_name= + if [ "${SEND_EMAIL}" = "YES" ]; then + + if [ -n "${EMAIL_SENDER}" ]; then + if [[ ${EMAIL_SENDER} =~ ^\".*\"\ \<.*\>$ ]]; then + # the name includes double quotes + sender_email="$(echo "${EMAIL_SENDER}" | cut -d '<' -f 2 | cut -d '>' -f 1)" + sender_name="$(echo "${EMAIL_SENDER}" | cut -d '"' -f 2)" + elif [[ ${EMAIL_SENDER} =~ ^\'.*\'\ \<.*\>$ ]]; then + # the name includes single quotes + sender_email="$(echo "${EMAIL_SENDER}" | cut -d '<' -f 2 | cut -d '>' -f 1)" + sender_name="$(echo "${EMAIL_SENDER}" | cut -d "'" -f 2)" + elif [[ 
${EMAIL_SENDER} =~ ^.*\ \<.*\>$ ]]; then + # the name does not have any quotes + sender_email="$(echo "${EMAIL_SENDER}" | cut -d '<' -f 2 | cut -d '>' -f 1)" + sender_name="$(echo "${EMAIL_SENDER}" | cut -d '<' -f 1)" + fi + fi + + [ -n "${sender_email}" ] && opts+=(-f "${sender_email}") + [ -n "${sender_name}" ] && ${sendmail} -F 2>&1 | head -1 | grep -qv "sendmail: unrecognized option: F" && opts+=(-F "${sender_name}") + + if [ "${debug}" = "1" ]; then + echo >&2 "--- BEGIN sendmail command ---" + printf >&2 "%q " "${sendmail}" -t "${opts[@]}" + echo >&2 + echo >&2 "--- END sendmail command ---" + fi + + local cmd_output + cmd_output=$("${sendmail}" -t "${opts[@]}" 2>&1) + ret=$? + + if [ ${ret} -eq 0 ]; then + info "sent email notification for: ${host} ${chart}.${name} is ${status} to '${to_email}'" + return 0 + else + error "failed to send email notification for: ${host} ${chart}.${name} is ${status} to '${to_email}' with error code ${ret} (${cmd_output})." + return 1 + fi + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# pushover sender + +send_pushover() { + local apptoken="${1}" usertokens="${2}" when="${3}" url="${4}" status="${5}" title="${6}" message="${7}" httpcode sent=0 user priority + + if [ "${SEND_PUSHOVER}" = "YES" ] && [ -n "${apptoken}" ] && [ -n "${usertokens}" ] && [ -n "${title}" ] && [ -n "${message}" ]; then + + # https://pushover.net/api + priority=-2 + case "${status}" in + CLEAR) priority=-1 ;; # low priority: no sound or vibration + WARNING) priority=0 ;; # normal priority: respect quiet hours + CRITICAL) priority=1 ;; # high priority: bypass quiet hours + *) priority=-2 ;; # lowest priority: no notification at all + esac + + for user in ${usertokens}; do + httpcode=$(docurl \ + --form-string "token=${apptoken}" \ + --form-string "user=${user}" \ + --form-string "html=1" \ + --form-string "title=${title}" \ + --form-string "message=${message}" \ + --form-string "timestamp=${when}" 
\ + --form-string "url=${url}" \ + --form-string "url_title=Open netdata dashboard to view the alarm" \ + --form-string "priority=${priority}" \ + https://api.pushover.net/1/messages.json) + + if [ "${httpcode}" = "200" ]; then + info "sent pushover notification for: ${host} ${chart}.${name} is ${status} to '${user}'" + sent=$((sent + 1)) + else + error "failed to send pushover notification for: ${host} ${chart}.${name} is ${status} to '${user}' with HTTP response status code ${httpcode}." + fi + done + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# pushbullet sender + +send_pushbullet() { + local userapikey="${1}" source_device="${2}" recipients="${3}" url="${4}" title="${5}" message="${6}" httpcode sent=0 userOrChannelTag + if [ "${SEND_PUSHBULLET}" = "YES" ] && [ -n "${userapikey}" ] && [ -n "${recipients}" ] && [ -n "${message}" ] && [ -n "${title}" ]; then + + # https://docs.pushbullet.com/#create-push + # Accept specification of user(s) (PushBullet account email address) and/or channel tag(s), separated by spaces. + # If recipient begins with a "#" then send to channel tag, otherwise send to email recipient. 
+ + for userOrChannelTag in ${recipients}; do + if [ "${userOrChannelTag::1}" = "#" ]; then + userOrChannelTag_type="channel_tag" + userOrChannelTag="${userOrChannelTag:1}" # Remove hash from start of channel tag (required by pushbullet API) + else + userOrChannelTag_type="email" + fi + + httpcode=$(docurl \ + --header 'Access-Token: '${userapikey}'' \ + --header 'Content-Type: application/json' \ + --data-binary @<( + cat <<EOF + {"title": "${title}", + "type": "link", + "${userOrChannelTag_type}": "${userOrChannelTag}", + "body": "$(echo -n ${message})", + "url": "${url}", + "source_device_iden": "${source_device}"} +EOF + ) "https://api.pushbullet.com/v2/pushes" -X POST) + + if [ "${httpcode}" = "200" ]; then + info "sent pushbullet notification for: ${host} ${chart}.${name} is ${status} to '${userOrChannelTag}'" + sent=$((sent + 1)) + else + error "failed to send pushbullet notification for: ${host} ${chart}.${name} is ${status} to '${userOrChannelTag}' with HTTP response status code ${httpcode}." + fi + done + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# kafka sender + +send_kafka() { + local httpcode sent=0 + if [ "${SEND_KAFKA}" = "YES" ]; then + httpcode=$(docurl -X POST \ + --data "{host_ip:\"${KAFKA_SENDER_IP}\",when:${when},name:\"${name}\",chart:\"${chart}\",family:\"${family}\",status:\"${status}\",old_status:\"${old_status}\",value:${value},old_value:${old_value},duration:${duration},non_clear_duration:${non_clear_duration},units:\"${units}\",info:\"${info}\"}" \ + "${KAFKA_URL}") + + if [ "${httpcode}" = "204" ]; then + info "sent kafka data for: ${host} ${chart}.${name} is ${status} and ip '${KAFKA_SENDER_IP}'" + sent=$((sent + 1)) + else + error "failed to send kafka data for: ${host} ${chart}.${name} is ${status} and ip '${KAFKA_SENDER_IP}' with HTTP response status code ${httpcode}." 
+ fi + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# pagerduty.com sender + +send_pd() { + local recipients="${1}" sent=0 severity current_time payload url response_code + unset t + case ${status} in + CLEAR) t='resolve' ; severity='info' ;; + WARNING) t='trigger' ; severity='warning' ;; + CRITICAL) t='trigger' ; severity='critical' ;; + esac + + if [ ${SEND_PD} = "YES" ] && [ -n "${t}" ]; then + if [ "$(uname)" == "Linux" ]; then + current_time=$(date -d @${when} +'%Y-%m-%dT%H:%M:%S.000') + else + current_time=$(date -r ${when} +'%Y-%m-%dT%H:%M:%S.000') + fi + for PD_SERVICE_KEY in ${recipients}; do + d="${status} ${name} = ${value_string} - ${host}, ${family}" + if [ ${USE_PD_VERSION} = "2" ]; then + payload="$( + cat <<EOF + { + "payload" : { + "summary": "${info:0:1024}", + "source" : "${args_host}", + "severity" : "${severity}", + "timestamp" : "${current_time}", + "group" : "${family}", + "class" : "${chart}", + "custom_details": { + "value_w_units": "${value_string}", + "when": "${when}", + "duration" : "${duration}", + "roles": "${roles}", + "alarm_id" : "${alarm_id}", + "name" : "${name}", + "chart" : "${chart}", + "family" : "${family}", + "status" : "${status}", + "old_status" : "${old_status}", + "value" : "${value}", + "old_value" : "${old_value}", + "src" : "${src}", + "non_clear_duration" : "${non_clear_duration}", + "units" : "${units}", + "info" : "${info}" + } + }, + "routing_key": "${PD_SERVICE_KEY}", + "event_action": "${t}", + "dedup_key": "${unique_id}" + } +EOF + )" + url="https://events.pagerduty.com/v2/enqueue" + response_code="202" + else + payload="$( + cat <<EOF + { + "service_key": "${PD_SERVICE_KEY}", + "event_type": "${t}", + "incident_key" : "${alarm_id}", + "description": "${d}", + "details": { + "value_w_units": "${value_string}", + "when": "${when}", + "duration" : "${duration}", + "roles": "${roles}", + "alarm_id" : "${alarm_id}", + 
"name" : "${name}", + "chart" : "${chart}", + "family" : "${family}", + "status" : "${status}", + "old_status" : "${old_status}", + "value" : "${value}", + "old_value" : "${old_value}", + "src" : "${src}", + "non_clear_duration" : "${non_clear_duration}", + "units" : "${units}", + "info" : "${info}" + } + } +EOF + )" + url="https://events.pagerduty.com/generic/2010-04-15/create_event.json" + response_code="200" + fi + httpcode=$(docurl -X POST --data "${payload}" ${url}) + if [ "${httpcode}" = "${response_code}" ]; then + info "sent pagerduty notification for: ${host} ${chart}.${name} is ${status}'" + sent=$((sent + 1)) + else + error "failed to send pagerduty notification for: ${host} ${chart}.${name} is ${status}, with HTTP response status code ${httpcode}." + fi + done + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# twilio sender + +send_twilio() { + local accountsid="${1}" accounttoken="${2}" twilionumber="${3}" recipients="${4}" title="${5}" message="${6}" httpcode sent=0 user + if [ "${SEND_TWILIO}" = "YES" ] && [ -n "${accountsid}" ] && [ -n "${accounttoken}" ] && [ -n "${twilionumber}" ] && [ -n "${recipients}" ] && [ -n "${message}" ] && [ -n "${title}" ]; then + #https://www.twilio.com/packages/labs/code/bash/twilio-sms + for user in ${recipients}; do + httpcode=$(docurl -X POST \ + --data-urlencode "From=${twilionumber}" \ + --data-urlencode "To=${user}" \ + --data-urlencode "Body=${title} ${message}" \ + -u "${accountsid}:${accounttoken}" \ + "https://api.twilio.com/2010-04-01/Accounts/${accountsid}/Messages.json") + + if [ "${httpcode}" = "201" ]; then + info "sent Twilio SMS for: ${host} ${chart}.${name} is ${status} to '${user}'" + sent=$((sent + 1)) + else + error "failed to send Twilio SMS for: ${host} ${chart}.${name} is ${status} to '${user}' with HTTP response status code ${httpcode}." 
+ fi + done + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# hipchat sender + +send_hipchat() { + local authtoken="${1}" recipients="${2}" message="${3}" httpcode sent=0 room color msg_format notify + + # remove <small></small> from the message + message="${message//<small>/}" + message="${message//<\/small>/}" + + if [ "${SEND_HIPCHAT}" = "YES" ] && [ -n "${HIPCHAT_SERVER}" ] && [ -n "${authtoken}" ] && [ -n "${recipients}" ] && [ -n "${message}" ]; then + # Valid values: html, text. + # Defaults to 'html'. + msg_format="html" + + # Background color for message. Valid values: yellow, green, red, purple, gray, random. Defaults to 'yellow'. + case "${status}" in + WARNING) color="yellow" ;; + CRITICAL) color="red" ;; + CLEAR) color="green" ;; + *) color="gray" ;; + esac + + # Whether this message should trigger a user notification (change the tab color, play a sound, notify mobile phones, etc). + # Each recipient's notification preferences are taken into account. + # Defaults to false. + notify="true" + + for room in ${recipients}; do + httpcode=$(docurl -X POST \ + -H "Content-type: application/json" \ + -H "Authorization: Bearer ${authtoken}" \ + -d "{\"color\": \"${color}\", \"from\": \"${host}\", \"message_format\": \"${msg_format}\", \"message\": \"${message}\", \"notify\": \"${notify}\"}" \ + "https://${HIPCHAT_SERVER}/v2/room/${room}/notification") + + if [ "${httpcode}" = "204" ]; then + info "sent HipChat notification for: ${host} ${chart}.${name} is ${status} to '${room}'" + sent=$((sent + 1)) + else + error "failed to send HipChat notification for: ${host} ${chart}.${name} is ${status} to '${room}' with HTTP response status code ${httpcode}." 
+ fi + done + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# messagebird sender + +send_messagebird() { + local accesskey="${1}" messagebirdnumber="${2}" recipients="${3}" title="${4}" message="${5}" httpcode sent=0 user + if [ "${SEND_MESSAGEBIRD}" = "YES" ] && [ -n "${accesskey}" ] && [ -n "${messagebirdnumber}" ] && [ -n "${recipients}" ] && [ -n "${message}" ] && [ -n "${title}" ]; then + #https://developers.messagebird.com/docs/messaging + for user in ${recipients}; do + httpcode=$(docurl -X POST \ + --data-urlencode "originator=${messagebirdnumber}" \ + --data-urlencode "recipients=${user}" \ + --data-urlencode "body=${title} ${message}" \ + --data-urlencode "datacoding=auto" \ + -H "Authorization: AccessKey ${accesskey}" \ + "https://rest.messagebird.com/messages") + + if [ "${httpcode}" = "201" ]; then + info "sent Messagebird SMS for: ${host} ${chart}.${name} is ${status} to '${user}'" + sent=$((sent + 1)) + else + error "failed to send Messagebird SMS for: ${host} ${chart}.${name} is ${status} to '${user}' with HTTP response status code ${httpcode}." 
+ fi + done + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# kavenegar sender + +send_kavenegar() { + local API_KEY="${1}" kavenegarsender="${2}" recipients="${3}" title="${4}" message="${5}" httpcode sent=0 user + if [ "${SEND_KAVENEGAR}" = "YES" ] && [ -n "${API_KEY}" ] && [ -n "${kavenegarsender}" ] && [ -n "${recipients}" ] && [ -n "${message}" ] && [ -n "${title}" ]; then + # http://api.kavenegar.com/v1/{API-KEY}/sms/send.json + for user in ${recipients}; do + httpcode=$(docurl -X POST http://api.kavenegar.com/v1/${API_KEY}/sms/send.json \ + --data-urlencode "sender=${kavenegarsender}" \ + --data-urlencode "receptor=${user}" \ + --data-urlencode "message=${title} ${message}") + + if [ "${httpcode}" = "200" ]; then + info "sent Kavenegar SMS for: ${host} ${chart}.${name} is ${status} to '${user}'" + sent=$((sent + 1)) + else + error "failed to send Kavenegar SMS for: ${host} ${chart}.${name} is ${status} to '${user}' with HTTP response status code ${httpcode}." + fi + done + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# telegram sender + +send_telegram() { + local bottoken="${1}" chatids="${2}" message="${3}" httpcode sent=0 chatid emoji disableNotification="" + + if [ "${status}" = "CLEAR" ]; then disableNotification="--data-urlencode disable_notification=true"; fi + + case "${status}" in + WARNING) emoji="⚠️" ;; + CRITICAL) emoji="🔴" ;; + CLEAR) emoji="✅" ;; +
*) emoji="⚪️" ;; + esac + + if [ "${SEND_TELEGRAM}" = "YES" ] && [ -n "${bottoken}" ] && [ -n "${chatids}" ] && [ -n "${message}" ]; then + for chatid in ${chatids}; do + notify_telegram=1 + notify_retries=${TELEGRAM_RETRIES_ON_LIMIT:-0} + + while [ ${notify_telegram} -eq 1 ]; do + # https://core.telegram.org/bots/api#sendmessage + httpcode=$(docurl ${disableNotification} \ + --data-urlencode "parse_mode=HTML" \ + --data-urlencode "disable_web_page_preview=true" \ + --data-urlencode "text=${emoji} ${message}" \ + "https://api.telegram.org/bot${bottoken}/sendMessage?chat_id=${chatid}") + + notify_telegram=0 + + if [ "${httpcode}" = "200" ]; then + info "sent telegram notification for: ${host} ${chart}.${name} is ${status} to '${chatid}'" + sent=$((sent + 1)) + elif [ "${httpcode}" = "401" ]; then + error "failed to send telegram notification for: ${host} ${chart}.${name} is ${status} to '${chatid}': Wrong bot token." + elif [ "${httpcode}" = "429" ]; then + if [ "$notify_retries" -gt 0 ]; then + error "failed to send telegram notification for: ${host} ${chart}.${name} is ${status} to '${chatid}': rate limit exceeded, retrying after 1s." + notify_retries=$((notify_retries - 1)) + notify_telegram=1 + sleep 1 + else + error "failed to send telegram notification for: ${host} ${chart}.${name} is ${status} to '${chatid}': rate limit exceeded." + fi + else + error "failed to send telegram notification for: ${host} ${chart}.${name} is ${status} to '${chatid}' with HTTP response status code ${httpcode}." 
+ fi + done + done + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# Microsoft Team sender + +send_msteams() { + + local webhook="${1}" channels="${2}" httpcode sent=0 channel color payload + + [ "${SEND_MSTEAMS}" != "YES" ] && return 1 + + case "${status}" in + WARNING) icon="${MSTEAMS_ICON_WARNING}" && color="${MSTEAMS_COLOR_WARNING}" ;; + CRITICAL) icon="${MSTEAMS_ICON_CRITICAL}" && color="${MSTEAMS_COLOR_CRITICAL}" ;; + CLEAR) icon="${MSTEAMS_ICON_CLEAR}" && color="${MSTEAMS_COLOR_CLEAR}" ;; + *) icon="${MSTEAMS_ICON_DEFAULT}" && color="${MSTEAMS_COLOR_DEFAULT}" ;; + esac + + for channel in ${channels}; do + ## More details are available here regarding the payload syntax options : https://docs.microsoft.com/en-us/outlook/actionable-messages/message-card-reference + ## Online designer : https://adaptivecards.io/designer/ + payload="$( + cat <<EOF + { + "@context": "http://schema.org/extensions", + "@type": "MessageCard", + "themeColor": "${color}", + "title": "$icon Alert ${status} from netdata for ${host}", + "text": "${host} ${status_message}, ${chart} (_${family}_), *${alarm}*", + "potentialAction": [ + { + "@type": "OpenUri", + "name": "Netdata", + "targets": [ + { "os": "default", "uri": "${goto_url}" } + ] + } + ] + } +EOF + )" + + # Replacing in the webhook CHANNEL string by the MS Teams channel name from conf file. + cur_webhook="${webhook//CHANNEL/${channel}}" + + httpcode=$(docurl -H "Content-Type: application/json" -d "${payload}" "${cur_webhook}") + + if [ "${httpcode}" = "200" ]; then + info "sent Microsoft team notification for: ${host} ${chart}.${name} is ${status} to '${cur_webhook}'" + sent=$((sent + 1)) + else + error "failed to send Microsoft team notification for: ${host} ${chart}.${name} is ${status} to '${cur_webhook}', with HTTP response status code ${httpcode}." 
+ fi + done + + [ ${sent} -gt 0 ] && return 0 + + return 1 +} + +# slack sender + +send_slack() { + local webhook="${1}" channels="${2}" httpcode sent=0 channel color payload + + [ "${SEND_SLACK}" != "YES" ] && return 1 + + case "${status}" in + WARNING) color="warning" ;; + CRITICAL) color="danger" ;; + CLEAR) color="good" ;; + *) color="#777777" ;; + esac + + for channel in ${channels}; do + # Default entry in the recipient is without a hash in front (backwards-compatible). Accept specification of channel or user. + if [ "${channel::1}" != "#" ] && [ "${channel::1}" != "@" ]; then channel="#$channel"; fi + + # If channel is equal to "#" then do not send the channel attribute at all. Slack also defines channels and users in webhooks. + if [ "${channel}" = "#" ]; then + ch="" + chstr="without specifying a channel" + else + ch="\"channel\": \"${channel}\"," + chstr="to '${channel}'" + fi + + payload="$( + cat <<EOF + { + $ch + "username": "netdata on ${host}", + "icon_url": "${images_base_url}/images/banner-icon-144x144.png", + "text": "${host} ${status_message}, \`${chart}\` (_${family}_), *${alarm}*", + "attachments": [ + { + "fallback": "${alarm} - ${chart} (${family}) - ${info}", + "color": "${color}", + "title": "${alarm}", + "title_link": "${goto_url}", + "text": "${info}", + "fields": [ + { + "title": "${chart}", + "short": true + }, + { + "title": "${family}", + "short": true + } + ], + "thumb_url": "${image}", + "footer": "by ${host}", + "ts": ${when} + } + ] + } +EOF + )" + + httpcode=$(docurl -X POST --data-urlencode "payload=${payload}" "${webhook}") + if [ "${httpcode}" = "200" ]; then + info "sent slack notification for: ${host} ${chart}.${name} is ${status} ${chstr}" + sent=$((sent + 1)) + else + error "failed to send slack notification for: ${host} ${chart}.${name} is ${status} ${chstr}, with HTTP response status code ${httpcode}." 
+ fi + done + + [ ${sent} -gt 0 ] && return 0 + + return 1 +} + +# ----------------------------------------------------------------------------- +# rocketchat sender + +send_rocketchat() { + local webhook="${1}" channels="${2}" httpcode sent=0 channel color payload + + [ "${SEND_ROCKETCHAT}" != "YES" ] && return 1 + + case "${status}" in + WARNING) color="warning" ;; + CRITICAL) color="danger" ;; + CLEAR) color="good" ;; + *) color="#777777" ;; + esac + + for channel in ${channels}; do + payload="$( + cat <<EOF + { + "channel": "#${channel}", + "alias": "netdata on ${host}", + "avatar": "${images_base_url}/images/banner-icon-144x144.png", + "text": "${host} ${status_message}, \`${chart}\` (_${family}_), *${alarm}*", + "attachments": [ + { + "color": "${color}", + "title": "${alarm}", + "title_link": "${goto_url}", + "text": "${info}", + "fields": [ + { + "title": "${chart}", + "short": true, + "value": "chart" + }, + { + "title": "${family}", + "short": true, + "value": "family" + } + ], + "thumb_url": "${image}", + "ts": "${when}" + } + ] + } +EOF + )" + + httpcode=$(docurl -X POST --data-urlencode "payload=${payload}" "${webhook}") + if [ "${httpcode}" = "200" ]; then + info "sent rocketchat notification for: ${host} ${chart}.${name} is ${status} to '${channel}'" + sent=$((sent + 1)) + else + error "failed to send rocketchat notification for: ${host} ${chart}.${name} is ${status} to '${channel}', with HTTP response status code ${httpcode}." 
+ fi + done + + [ ${sent} -gt 0 ] && return 0 + + return 1 +} + +# ----------------------------------------------------------------------------- +# alerta sender + +send_alerta() { + local webhook="${1}" channels="${2}" httpcode sent=0 channel severity resource event payload auth + + [ "${SEND_ALERTA}" != "YES" ] && return 1 + + case "${status}" in + CRITICAL) severity="critical" ;; + WARNING) severity="warning" ;; + CLEAR) severity="cleared" ;; + *) severity="indeterminate" ;; + esac + + if [[ ${chart} == httpcheck* ]]; then + resource=$chart + event=$name + else + resource="${host}:${family}" + event="${chart}.${name}" + fi + + for channel in ${channels}; do + payload="$( + cat <<EOF + { + "resource": "${resource}", + "event": "${event}", + "environment": "${channel}", + "severity": "${severity}", + "service": ["Netdata"], + "group": "Performance", + "value": "${value_string}", + "text": "${info}", + "tags": ["alarm_id:${alarm_id}"], + "attributes": { + "roles": "${roles}", + "name": "${name}", + "chart": "${chart}", + "family": "${family}", + "source": "${src}", + "moreInfo": "<a href=\"${goto_url}\">View Netdata</a>" + }, + "origin": "netdata/${host}", + "type": "netdataAlarm", + "rawData": "${BASH_ARGV[@]}" + } +EOF + )" + + if [ -n "${ALERTA_API_KEY}" ]; then + auth="Key ${ALERTA_API_KEY}" + fi + + httpcode=$(docurl -X POST "${webhook}/alert" -H "Content-Type: application/json" -H "Authorization: $auth" --data "${payload}") + + if [ "${httpcode}" = "200" ] || [ "${httpcode}" = "201" ]; then + info "sent alerta notification for: ${host} ${chart}.${name} is ${status} to '${channel}'" + sent=$((sent + 1)) + elif [ "${httpcode}" = "202" ]; then + info "suppressed alerta notification for: ${host} ${chart}.${name} is ${status} to '${channel}'" + else + error "failed to send alerta notification for: ${host} ${chart}.${name} is ${status} to '${channel}', with HTTP response status code ${httpcode}." 
+ fi + done + + [ ${sent} -gt 0 ] && return 0 + + return 1 +} + +# ----------------------------------------------------------------------------- +# flock sender + +send_flock() { + local webhook="${1}" channels="${2}" httpcode sent=0 channel color payload + + [ "${SEND_FLOCK}" != "YES" ] && return 1 + + case "${status}" in + WARNING) color="warning" ;; + CRITICAL) color="danger" ;; + CLEAR) color="good" ;; + *) color="#777777" ;; + esac + + for channel in ${channels}; do + httpcode=$(docurl -X POST "${webhook}" -H "Content-Type: application/json" -d "{ + \"sendAs\": { + \"name\" : \"netdata on ${host}\", + \"profileImage\" : \"${images_base_url}/images/banner-icon-144x144.png\" + }, + \"text\": \"${host} *${status_message}*\", + \"timestamp\": \"${when}\", + \"attachments\": [ + { + \"description\": \"${chart} (${family}) - ${info}\", + \"color\": \"${color}\", + \"title\": \"${alarm}\", + \"url\": \"${goto_url}\", + \"text\": \"${info}\", + \"views\": { + \"image\": { + \"original\": { \"src\": \"${image}\", \"width\": 400, \"height\": 400 }, + \"thumbnail\": { \"src\": \"${image}\", \"width\": 50, \"height\": 50 }, + \"filename\": \"${image}\" + } + } + } + ] + }") + if [ "${httpcode}" = "200" ]; then + info "sent flock notification for: ${host} ${chart}.${name} is ${status} to '${channel}'" + sent=$((sent + 1)) + else + error "failed to send flock notification for: ${host} ${chart}.${name} is ${status} to '${channel}', with HTTP response status code ${httpcode}." 
+ fi + done + + [ ${sent} -gt 0 ] && return 0 + + return 1 +} + +# ----------------------------------------------------------------------------- +# discord sender + +send_discord() { + local webhook="${1}/slack" channels="${2}" httpcode sent=0 channel color payload username + + [ "${SEND_DISCORD}" != "YES" ] && return 1 + + case "${status}" in + WARNING) color="warning" ;; + CRITICAL) color="danger" ;; + CLEAR) color="good" ;; + *) color="#777777" ;; + esac + + for channel in ${channels}; do + username="netdata on ${host}" + [ ${#username} -gt 32 ] && username="${username:0:29}..." + + payload="$( + cat <<EOF + { + "channel": "#${channel}", + "username": "${username}", + "text": "${host} ${status_message}, \`${chart}\` (_${family}_), *${alarm}*", + "icon_url": "${images_base_url}/images/banner-icon-144x144.png", + "attachments": [ + { + "color": "${color}", + "title": "${alarm}", + "title_link": "${goto_url}", + "text": "${info}", + "fields": [ + { + "title": "${chart}", + "value": "${family}" + } + ], + "thumb_url": "${image}", + "footer_icon": "${images_base_url}/images/banner-icon-144x144.png", + "footer": "${host}", + "ts": ${when} + } + ] + } +EOF + )" + + httpcode=$(docurl -X POST --data-urlencode "payload=${payload}" "${webhook}") + if [ "${httpcode}" = "200" ]; then + info "sent discord notification for: ${host} ${chart}.${name} is ${status} to '${channel}'" + sent=$((sent + 1)) + else + error "failed to send discord notification for: ${host} ${chart}.${name} is ${status} to '${channel}', with HTTP response status code ${httpcode}." 
+ fi + done + + [ ${sent} -gt 0 ] && return 0 + + return 1 +} + +# ----------------------------------------------------------------------------- +# fleep sender + +send_fleep() { + local httpcode sent=0 webhooks="${1}" data message + if [ "${SEND_FLEEP}" = "YES" ]; then + message="${host} ${status_message}, \`${chart}\` (${family}), *${alarm}*\\n${info}" + + for hook in ${webhooks}; do + data="{ " + data="${data} 'message': '${message}', " + data="${data} 'user': '${FLEEP_SENDER}' " + data="${data} }" + + httpcode=$(docurl -X POST --data "${data}" "https://fleep.io/hook/${hook}") + + if [ "${httpcode}" = "200" ]; then + info "sent fleep data for: ${host} ${chart}.${name} is ${status} and user '${FLEEP_SENDER}'" + sent=$((sent + 1)) + else + error "failed to send fleep data for: ${host} ${chart}.${name} is ${status} and user '${FLEEP_SENDER}' with HTTP response status code ${httpcode}." + fi + done + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# Prowl sender + +send_prowl() { + local httpcode sent=0 data message keys prio=0 alarm_url event + if [ "${SEND_PROWL}" = "YES" ]; then + message="$(urlencode "${host} ${status_message}, \`${chart}\` (${family}), *${alarm}*\\n${info}")" + message="description=${message}" + keys="$(urlencode "$(echo "${1}" | tr ' ' ,)")" + keys="apikey=${keys}" + app="application=Netdata" + + case "${status}" in + CRITICAL) + prio=2 + ;; + WARNING) + prio=1 + ;; + esac + prio="priority=${prio}" + + alarm_url="$(urlencode ${goto_url})" + alarm_url="url=${alarm_url}" + event="$(urlencode "${host} ${status_message}")" + event="event=${event}" + + data="${keys}&${prio}&${alarm_url}&${app}&${event}&${message}" + + httpcode=$(docurl -X POST --data "${data}" "https://api.prowlapp.com/publicapi/add") + + if [ "${httpcode}" = "200" ]; then + info "sent prowl data for: ${host} ${chart}.${name} is ${status}" + sent=1 + else + error "failed to send prowl data for: 
${host} ${chart}.${name} is ${status} with error code ${httpcode}." + fi + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# irc sender + +send_irc() { + local NICKNAME="${1}" REALNAME="${2}" CHANNELS="${3}" NETWORK="${4}" PORT="${5}" SERVERNAME="${6}" MESSAGE="${7}" sent=0 channel color send_alarm reply_codes error + + if [ "${SEND_IRC}" = "YES" ] && [ -n "${NICKNAME}" ] && [ -n "${REALNAME}" ] && [ -n "${CHANNELS}" ] && [ -n "${NETWORK}" ] && [ -n "${SERVERNAME}" ] && [ -n "${PORT}" ]; then + case "${status}" in + WARNING) color="warning" ;; + CRITICAL) color="danger" ;; + CLEAR) color="good" ;; + *) color="#777777" ;; + esac + + SNDMESSAGE="${MESSAGE//$'\n'/", "}" + for CHANNEL in ${CHANNELS}; do + error=0 + send_alarm=$(echo -e "USER ${NICKNAME} guest ${REALNAME} ${SERVERNAME}\\nNICK ${NICKNAME}\\nJOIN ${CHANNEL}\\nPRIVMSG ${CHANNEL} :${SNDMESSAGE}\\nQUIT\\n" \ | nc "${NETWORK}" "${PORT}") + reply_codes=$(echo "${send_alarm}" | cut -d ' ' -f 2 | grep -o '[0-9]*') + for code in ${reply_codes}; do + if [ "${code}" -ge 400 ] && [ "${code}" -le 599 ]; then + error=1 + break + fi + done + + if [ "${error}" -eq 0 ]; then + info "sent irc notification for: ${host} ${chart}.${name} is ${status} to '${CHANNEL}'" + sent=$((sent + 1)) + else + error "failed to send irc notification for: ${host} ${chart}.${name} is ${status} to '${CHANNEL}', with error code ${code}." + fi + done + fi + + [ ${sent} -gt 0 ] && return 0 + + return 1 +} + +# ----------------------------------------------------------------------------- +# Amazon SNS sender + +send_awssns() { + local targets="${1}" message='' sent=0 region='' + local default_format="${status} on ${host} at ${date}: ${chart} ${value_string}" + + [ "${SEND_AWSSNS}" = "YES" ] || return 1 + + message=${AWSSNS_MESSAGE_FORMAT:-${default_format}} + + for target in ${targets}; do + # Extract the region from the target ARN. 
We need to explicitly specify the region so that it matches up correctly. + region="$(echo ${target} | cut -f 4 -d ':')" + if ${aws} sns publish --region "${region}" --subject "${host} ${status_message} - ${name//_/ } - ${chart}" --message "${message}" --target-arn ${target} &>/dev/null; then + info "sent Amazon SNS notification for: ${host} ${chart}.${name} is ${status} to '${target}'" + sent=$((sent + 1)) + else + error "failed to send Amazon SNS notification for: ${host} ${chart}.${name} is ${status} to '${target}'" + fi + done + + [ ${sent} -gt 0 ] && return 0 + + return 1 +} + +# ----------------------------------------------------------------------------- +# Matrix sender + +send_matrix() { + local homeserver="${1}" webhook accesstoken rooms="${2}" httpcode sent=0 payload + + [ "${SEND_MATRIX}" != "YES" ] && return 1 + [ -z "${MATRIX_ACCESSTOKEN}" ] && return 1 + + accesstoken="${MATRIX_ACCESSTOKEN}" + + case "${status}" in + WARNING) emoji="⚠️" ;; + CRITICAL) emoji="🔴" ;; + CLEAR) emoji="✅" ;; + *) emoji="⚪️" ;; + esac + + for room in ${rooms}; do + webhook="$homeserver/_matrix/client/r0/rooms/$(urlencode $room)/send/m.room.message?access_token=$accesstoken" + payload="$( + cat <<EOF + { + "msgtype": "m.notice", + "format": "org.matrix.custom.html", + "formatted_body": "${emoji} ${host} ${status_message} - <b>${name//_/ }</b><br>${chart} (${family})<br><a href=\"${goto_url}\">${alarm}</a><br><i>${info}</i>", + "body": "${emoji} ${host} ${status_message} - ${name//_/ } ${chart} (${family}) ${goto_url} ${alarm} ${info}" + } +EOF + )" + + httpcode=$(docurl -X POST --data "${payload}" "${webhook}") + if [ "${httpcode}" == "200" ]; then + info "sent Matrix notification for: ${host} ${chart}.${name} is ${status} to '${room}'" + sent=$((sent + 1)) + else + error "failed to send Matrix notification for: ${host} ${chart}.${name} is ${status} to '${room}', with HTTP response status code ${httpcode}." + fi + done + + [ ${sent} -gt 0 ] && return 0 + + return 1 +} + +# ----------------------------------------------------------------------------- +# syslog sender + +send_syslog() { + local facility=${SYSLOG_FACILITY:-"local6"} level='info' targets="${1}" + local priority='' message='' server='' port='' prefix='' + local temp1='' temp2='' + + [ "${SEND_SYSLOG}" = "YES" ] || return 1 + + if [ "${status}" = "CRITICAL" ]; then + level='crit' + elif [ "${status}" = "WARNING" ]; then + level='warning' + fi + + for target in ${targets}; do + priority="${facility}.${level}" + message='' + server='' + port='' + prefix='' + temp1='' + temp2='' + + prefix=$(echo ${target} | cut -d '/' -f 2) + temp1=$(echo ${target} | cut -d '/' -f 1) + + if [ ${prefix} != ${temp1} ]; then + if (echo ${temp1} | grep -q '@'); then + temp2=$(echo ${temp1} | cut -d '@' -f 1) + server=$(echo ${temp1} | cut -d '@' -f 2) + + if [ ${temp2} != ${server} ]; then + priority=${temp2} + fi + + port=$(echo ${server} | rev | cut -d ':' -f 1 | rev) + + if (echo ${server} | grep -E -q '\[.*\]'); then + if
(echo ${port} | grep -q ']'); then + port='' + else + server=$(echo ${server} | rev | cut -d ':' -f 2- | rev) + fi + else + if [ ${port} = ${server} ]; then + port='' + else + server=$(echo ${server} | cut -d ':' -f 1) + fi + fi + else + priority=${temp1} + fi + fi + + message="${prefix} ${status} on ${host} at ${date}: ${chart} ${value_string}" + + if [ ${server} ]; then + logger_options="${logger_options} -n ${server}" + if [ ${port} ]; then + logger_options="${logger_options} -P ${port}" + fi + fi + + ${logger} -p ${priority} ${logger_options} "${message}" + done + + return $? +} + +# ----------------------------------------------------------------------------- +# SMS sender + +send_sms() { + local recipients="${1}" errcode errmessage sent=0 + + # Human readable SMS + local msg="${host} ${status_message}: ${chart} (${family}), ${alarm}" + + # limit it to 160 characters + msg="${msg:0:160}" + + if [ "${SEND_SMS}" = "YES" ] && [ -n "${sendsms}" ] && [ -n "${recipients}" ] && [ -n "${msg}" ]; then + # http://api.kavenegar.com/v1/{API-KEY}/sms/send.json + for phone in ${recipients}; do + errmessage=$($sendsms $phone "$msg" 2>&1) + errcode=$? + if [ ${errcode} -eq 0 ]; then + info "sent smstools3 SMS for: ${host} ${chart}.${name} is ${status} to '${phone}'" + sent=$((sent + 1)) + else + error "failed to send smstools3 SMS for: ${host} ${chart}.${name} is ${status} to '${phone}' with error code ${errcode}: ${errmessage}."
+ fi + done + + [ ${sent} -gt 0 ] && return 0 + fi + + return 1 +} + +# ----------------------------------------------------------------------------- +# hangouts sender + +send_hangouts() { + local rooms="${1}" httpcode sent=0 room color payload webhook thread + + [ "${SEND_HANGOUTS}" != "YES" ] && return 1 + + case "${status}" in + WARNING) color="#ffa700" ;; + CRITICAL) color="#d62d20" ;; + CLEAR) color="#008744" ;; + *) color="#777777" ;; + esac + + for room in ${rooms}; do + if [ -z "${HANGOUTS_WEBHOOK_URI[$room]}" ] ; then + info "Can't send Hangouts notification for: ${host} ${chart}.${name} to room ${room}. HANGOUTS_WEBHOOK_URI[$room] not defined" + else + if [ -n "${HANGOUTS_WEBHOOK_THREAD[$room]}" ]; then + thread="\"name\" : \"${HANGOUTS_WEBHOOK_THREAD[$room]}\"" + fi + webhook="${HANGOUTS_WEBHOOK_URI[$room]}" + payload="$( + cat <<EOF + { + "cards": [ + { + "header": { + "title": "Netdata on ${host}", + "imageUrl": "${images_base_url}/images/banner-icon-144x144.png", + "imageStyle": "IMAGE" + }, + "sections": [ + { + "header": "<b>${host}</b>", + "widgets": [ + { + "keyValue": { + "topLabel": "Status Message", + "content": "<b>${status_message}</b>", + "contentMultiline": "true", + "iconUrl": "${image}", + "onClick": { + "openLink": { + "url": "${goto_url}" + } + } + } + }, + { + "keyValue": { + "topLabel": "${chart} | ${family}", + "content": "<font color=${color}>${alarm}</font>", + "contentMultiline": "true" + } + } + ] + }, + { + "widgets": [ + { + "textParagraph": { + "text": "<font color=\"#0057e7\">@ ${date}\n<b>${info}</b></font>" + } + } + ] + }, + { + "widgets": [ + { + "buttons": [ + { + "textButton": { + "text": "Go to ${host}", + "onClick": { + "openLink": { + "url": "${goto_url}" + } + } + } + } + ] + } + ] + } + ] + } + ], + "thread": { + $thread + } + } +EOF + )" + + httpcode=$(docurl -H "Content-Type: application/json" -X POST -d "${payload}" "${webhook}") + + if [ "${httpcode}" = "200" ]; then + info "sent hangouts notification for: 
${host} ${chart}.${name} is ${status} to '${room}'" + sent=$((sent + 1)) + else + error "failed to send hangouts notification for: ${host} ${chart}.${name} is ${status} to '${room}', with HTTP response status code ${httpcode}." + fi + fi + done + + [ ${sent} -gt 0 ] && return 0 + + return 1 +} + +# ----------------------------------------------------------------------------- +# Dynatrace sender + +send_dynatrace() { + [ "${SEND_DYNATRACE}" != "YES" ] && return 1 + + local dynatrace_url="${DYNATRACE_SERVER}/e/${DYNATRACE_SPACE}/api/v1/events" + local description="Netdata Notification for: ${host} ${chart}.${name} is ${status}" + local payload="" + + payload=$(cat <<EOF +{ + "title": "Netdata Alarm from ${host}", + "source" : "${DYNATRACE_ANNOTATION_TYPE}", + "description" : "${description}", + "eventType": "${DYNATRACE_EVENT}", + "attachRules":{ + "tagRule":[{ + "meTypes":["HOST"], + "tags":["${DYNATRACE_TAG_VALUE}"] + }] + }, + "customProperties":{ + "description": "${description}" + } +} +EOF +) + + # echo ${payload} + + httpcode=$(docurl -X POST -H "Authorization: Api-token ${DYNATRACE_TOKEN}" -H "Content-Type: application/json" -d "${payload}" ${dynatrace_url}) + ret=$? + + + if [ ${ret} -eq 0 ]; then + if [ "${httpcode}" = "200" ]; then + info "sent ${DYNATRACE_EVENT} to ${DYNATRACE_SERVER}" + return 0 + else + warning "Dynatrace ${DYNATRACE_SERVER} responded ${httpcode}; notification for: ${host} ${chart}.${name} is ${status} was not sent!" + return 1 + fi + else + error "failed to send ${DYNATRACE_EVENT} notification for: ${host} ${chart}.${name} is ${status} to ${DYNATRACE_SERVER} with error code ${ret}."
+ return 1 + fi +} + + +# ----------------------------------------------------------------------------- +# Stackpulse sender + +send_stackpulse() { + local payload httpcode oldv currv + [ "${SEND_STACKPULSE}" != "YES" ] && return 1 + + # We are sending null when values are nan to avoid errors while JSON message is parsed + [ "${old_value}" != "nan" ] && oldv="${old_value}" || oldv="null" + [ "${value}" != "nan" ] && currv="${value}" || currv="null" + + payload=$(cat <<EOF + { + "Node" : "${host}", + "Chart" : "${chart}", + "OldValue" : ${oldv}, + "Value" : ${currv}, + "Units" : "${units}", + "OldStatus" : "${old_status}", + "Status" : "${status}", + "Alarm" : "${name}", + "Date": ${when}, + "Duration": ${duration}, + "NonClearDuration": ${non_clear_duration}, + "Description" : "${status_message}, ${info}", + "CalcExpression" : "${calc_expression}", + "CalcParamValues" : "${calc_param_values}", + "TotalWarnings" : "${total_warnings}", + "TotalCritical" : "${total_critical}", + "ID" : ${alarm_id} + } +EOF +) + + httpcode=$(docurl -X POST -H "Content-Type: application/json" -d "${payload}" ${STACKPULSE_WEBHOOK}) + if [ "${httpcode}" = "200" ]; then + info "sent stackpulse notification for: ${host} ${chart}.${name} is ${status}" + else + error "failed to send stackpulse notification for: ${host} ${chart}.${name} is ${status}, with HTTP response status code ${httpcode}." 
+ return 1 + fi + + return 0 +} +# ----------------------------------------------------------------------------- +# Opsgenie sender + +send_opsgenie() { + local payload httpcode oldv currv + [ "${SEND_OPSGENIE}" != "YES" ] && return 1 + + if [ -z "${OPSGENIE_API_KEY}" ] ; then + info "Can't send Opsgenie notification, because OPSGENIE_API_KEY is not defined" + return 1 + fi + + # We are sending null when values are nan to avoid errors while JSON message is parsed + [ "${old_value}" != "nan" ] && oldv="${old_value}" || oldv="null" + [ "${value}" != "nan" ] && currv="${value}" || currv="null" + + payload=$(cat <<EOF + { + "host" : "${host}", + "unique_id" : "${unique_id}", + "alarmId" : ${alarm_id}, + "eventId" : ${event_id}, + "chart" : "${chart}", + "when": ${when}, + "name" : "${name}", + "family" : "${family}", + "status" : "${status}", + "old_status" : "${old_status}", + "value" : ${currv}, + "old_value" : ${oldv}, + "duration": ${duration}, + "non_clear_duration": ${non_clear_duration}, + "units" : "${units}", + "info" : "${status_message}, ${info}", + "calc_expression" : "${calc_expression}", + "total_warnings" : "${total_warnings}", + "total_critical" : "${total_critical}", + "src" : "${src}" + } +EOF +) + + httpcode=$(docurl -X POST -H "Content-Type: application/json" -d "${payload}" "${OPSGENIE_API_URL}/v1/json/integrations/webhooks/netdata?apiKey=${OPSGENIE_API_KEY}") + # https://docs.opsgenie.com/docs/alert-api#create-alert + if [ "${httpcode}" = "200" ]; then + info "sent opsgenie notification for: ${host} ${chart}.${name} is ${status}" + else + error "failed to send opsgenie notification for: ${host} ${chart}.${name} is ${status}, with HTTP error code ${httpcode}." 
+ return 1 + fi + + return 0 +} + +# ----------------------------------------------------------------------------- +# Gotify sender + +send_gotify() { + local payload httpcode priority + [ "${SEND_GOTIFY}" != "YES" ] && return 1 + + if [ -z "${GOTIFY_APP_TOKEN}" ] ; then + info "Can't send Gotify notification, because GOTIFY_APP_TOKEN is not defined" + return 1 + fi + + # priority for Gotify Android app + case "${status}" in + CRITICAL) priority=10 ;; # sound + vibration + WARNING) priority=4 ;; # sound + *) priority=1 ;; # notification only + esac + + payload=$(cat <<EOF + { + "title" : "${status}, ${name} = ${value_string}, on ${host}", + "message" : "${date}: ${chart} ${value_string}", + "priority" : ${priority} + } +EOF +) + + httpcode=$(docurl -X POST -H "Content-Type: application/json" -d "${payload}" "${GOTIFY_APP_URL}/message?token=${GOTIFY_APP_TOKEN}") + if [ "${httpcode}" = "200" ]; then + info "sent gotify notification for: ${host} ${chart}.${name} is ${status}" + else + error "failed to send gotify notification for: ${host} ${chart}.${name} is ${status}, with HTTP error code ${httpcode}." 
+ return 1 + fi + + return 0 +} + +# ----------------------------------------------------------------------------- +# prepare the content of the notification + +# the url to send the user on click +urlencode "${args_host}" >/dev/null +url_host="${REPLY}" +urlencode "${chart}" >/dev/null +url_chart="${REPLY}" +urlencode "${family}" >/dev/null +url_family="${REPLY}" +urlencode "${name}" >/dev/null +url_name="${REPLY}" +urlencode "${value_string}" >/dev/null +url_value_string="${REPLY}" + +redirect_params="host=${url_host}&chart=${url_chart}&family=${url_family}&alarm=${url_name}&alarm_unique_id=${unique_id}&alarm_id=${alarm_id}&alarm_event_id=${event_id}&alarm_when=${when}&alarm_status=${status}&alarm_chart=${chart}&alarm_value=${url_value_string}" +GOTOCLOUD=0 + +if [ "${NETDATA_REGISTRY_URL}" == "https://registry.my-netdata.io" ]; then + if [ -z "${NETDATA_REGISTRY_UNIQUE_ID}" ]; then + if [ -f "@registrydir_POST@/netdata.public.unique.id" ]; then + NETDATA_REGISTRY_UNIQUE_ID="$(cat "@registrydir_POST@/netdata.public.unique.id")" + fi + fi + if [ -n "${NETDATA_REGISTRY_UNIQUE_ID}" ]; then + GOTOCLOUD=1 + fi +fi + +if [ ${GOTOCLOUD} -eq 0 ]; then + goto_url="${NETDATA_REGISTRY_URL}/goto-host-from-alarm.html?${redirect_params}" +else + # Temporarily disable alarm redirection, as the cloud endpoint no longer exists. This functionality will be restored after discussion on #9487. 
For now, just lead to netdata.cloud + # Re-allow alarm redirection, for alarms 2.0, new template + if [ -z "${child_machine_guid}" ]; then + goto_url="${NETDATA_REGISTRY_CLOUD_BASE_URL}/alarms/redirect?agentId=${NETDATA_REGISTRY_UNIQUE_ID}&${redirect_params}" + else + goto_url="${NETDATA_REGISTRY_CLOUD_BASE_URL}/alarms/redirect?agentId=${NETDATA_REGISTRY_UNIQUE_ID}&childId=${child_machine_guid}&${redirect_params}" + fi +fi + +# the severity of the alarm +severity="${status}" + +# the time the alarm was raised +duration4human ${duration} >/dev/null +duration_txt="${REPLY}" +duration4human ${non_clear_duration} >/dev/null +non_clear_duration_txt="${REPLY}" +raised_for="(was ${old_status,,} for ${duration_txt})" + +# the key status message +status_message="status unknown" + +# the color of the alarm +color="grey" + +# the alarm value +alarm="${name//_/ } = ${value_string}" + +# the image of the alarm +image="${images_base_url}/images/banner-icon-144x144.png" + +# have a default email status, in case the following case does not catch it +status_email_subject="${status}" + +# prepare the title based on status +case "${status}" in +CRITICAL) + image="${images_base_url}/images/alert-128-red.png" + alarm_badge="https://app.netdata.cloud/static/email/img/label_critical.png" + status_message="is critical" + status_email_subject="Critical" + color="#ca414b" + rich_status_raised_for="Raised to critical, for ${non_clear_duration_txt}" + background_color="#FFEBEF" + border_color="#FF4136" + text_color="#FF4136" + action_text_color="#FFFFFF" + ;; + +WARNING) + image="${images_base_url}/images/alert-128-orange.png" + alarm_badge="https://app.netdata.cloud/static/email/img/label_warning.png" + status_message="needs attention" + status_email_subject="Warning" + color="#ffc107" + rich_status_raised_for="Raised to warning, for ${non_clear_duration_txt}" + background_color="#FFF8E1" + border_color="#FFC300" + text_color="#536775" + action_text_color="#35414A" + ;; + +CLEAR) + 
image="${images_base_url}/images/check-mark-2-128-green.png" + alarm_badge="https://app.netdata.cloud/static/email/img/label_recovered.png" + status_message="recovered" + status_email_subject="Clear" + color="#77ca6d" + rich_status_raised_for= + background_color="#E5F5E8" + border_color="#68C47D" + text_color="#00AB44" + action_text_color="#FFFFFF" + ;; +esac + +# the html email subject +html_email_subject="${status_email_subject}, ${name} = ${value_string}, on ${host}" + +if [ "${status}" = "CLEAR" ]; then + severity="Recovered from ${old_status}" + if [ ${non_clear_duration} -gt ${duration} ]; then + raised_for="(alarm was raised for ${non_clear_duration_txt})" + fi + rich_status_raised_for="Recovered from ${old_status,,}, ${raised_for}" + + # don't show the value when the status is CLEAR + # for certain alarms, this value might not have any meaning + alarm="${name//_/ } ${raised_for}" + html_email_subject="${status_email_subject}, ${name} ${raised_for}, on ${host}" + +elif { [ "${old_status}" = "WARNING" ] && [ "${status}" = "CRITICAL" ]; }; then + severity="Escalated to ${status}" + if [ ${non_clear_duration} -gt ${duration} ]; then + raised_for="(alarm is raised for ${non_clear_duration_txt})" + fi + rich_status_raised_for="Escalated to critical, ${raised_for}" + +elif { [ "${old_status}" = "CRITICAL" ] && [ "${status}" = "WARNING" ]; }; then + severity="Demoted to ${status}" + if [ ${non_clear_duration} -gt ${duration} ]; then + raised_for="(alarm is raised for ${non_clear_duration_txt})" + fi + rich_status_raised_for="Demoted to warning, ${raised_for}" + +else + raised_for= +fi + +# prepare HTML versions of elements +info_html= +[ -n "${info}" ] && info_html=" <small><br/>${info}</small>" + +raised_for_html= +[ -n "${raised_for}" ] && raised_for_html="<br/><small>${raised_for}</small>" + +# ----------------------------------------------------------------------------- +# send the slack notification + +# slack aggregates posts from the same username +# so we 
use "${host} ${status}" as the bot username, to make them diff + +send_slack "${SLACK_WEBHOOK_URL}" "${to_slack}" +SENT_SLACK=$? + +# ----------------------------------------------------------------------------- +# send the hangouts notification + +# hangouts aggregates posts from the same room +# so we use "${host} ${status}" as the room, to make them diff + +send_hangouts "${to_hangouts}" +SENT_HANGOUTS=$? + +# ----------------------------------------------------------------------------- +# send the Microsoft Teams notification + +# Microsoft teams aggregates posts from the same username +# so we use "${host} ${status}" as the bot username, to make them diff + +send_msteams "${MSTEAMS_WEBHOOK_URL}" "${to_msteams}" +SENT_MSTEAMS=$? + +# ----------------------------------------------------------------------------- +# send the rocketchat notification + +# rocketchat aggregates posts from the same username +# so we use "${host} ${status}" as the bot username, to make them diff + +send_rocketchat "${ROCKETCHAT_WEBHOOK_URL}" "${to_rocketchat}" +SENT_ROCKETCHAT=$? + +# ----------------------------------------------------------------------------- +# send the alerta notification + +# alerta aggregates posts from the same username +# so we use "${host} ${status}" as the bot username, to make them diff + +send_alerta "${ALERTA_WEBHOOK_URL}" "${to_alerta}" +SENT_ALERTA=$? + +# ----------------------------------------------------------------------------- +# send the flock notification + +# flock aggregates posts from the same username +# so we use "${host} ${status}" as the bot username, to make them diff + +send_flock "${FLOCK_WEBHOOK_URL}" "${to_flock}" +SENT_FLOCK=$? 
+ +# ----------------------------------------------------------------------------- +# send the discord notification + +# discord aggregates posts from the same username +# so we use "${host} ${status}" as the bot username, to make them diff + +send_discord "${DISCORD_WEBHOOK_URL}" "${to_discord}" +SENT_DISCORD=$? + +# ----------------------------------------------------------------------------- +# send the pushover notification + +send_pushover "${PUSHOVER_APP_TOKEN}" "${to_pushover}" "${when}" "${goto_url}" "${status}" "${host} ${status_message} - ${name//_/ } - ${chart}" " +<font color=\"${color}\"><b>${alarm}</b></font>${info_html}<br/> +<small><b>${chart}</b><br/>Chart<br/> </small> +<small><b>${family}</b><br/>Family<br/> </small> +<small><b>${severity}</b><br/>Severity<br/> </small> +<small><b>${date}${raised_for_html}</b><br/>Time<br/> </small> +<a href=\"${goto_url}\">View Netdata</a><br/> +<small><small>The source of this alarm is line ${src}</small></small> +" + +SENT_PUSHOVER=$? + +# ----------------------------------------------------------------------------- +# send the pushbullet notification + +send_pushbullet "${PUSHBULLET_ACCESS_TOKEN}" "${PUSHBULLET_SOURCE_DEVICE}" "${to_pushbullet}" "${goto_url}" "${host} ${status_message} - ${name//_/ } - ${chart}" "${alarm}\\n +Severity: ${severity}\\n +Chart: ${chart}\\n +Family: ${family}\\n +${date}\\n +The source of this alarm is line ${src}" + +SENT_PUSHBULLET=$? + +# ----------------------------------------------------------------------------- +# send the twilio SMS + +send_twilio "${TWILIO_ACCOUNT_SID}" "${TWILIO_ACCOUNT_TOKEN}" "${TWILIO_NUMBER}" "${to_twilio}" "${host} ${status_message} - ${name//_/ } - ${chart}" "${alarm} +Severity: ${severity} +Chart: ${chart} +Family: ${family} +${info}" + +SENT_TWILIO=$? 
+ +# ----------------------------------------------------------------------------- +# send the messagebird SMS + +send_messagebird "${MESSAGEBIRD_ACCESS_KEY}" "${MESSAGEBIRD_NUMBER}" "${to_messagebird}" "${host} ${status_message} - ${name//_/ } - ${chart}" "${alarm} +Severity: ${severity} +Chart: ${chart} +Family: ${family} +${info}" + +SENT_MESSAGEBIRD=$? + +# ----------------------------------------------------------------------------- +# send the kavenegar SMS + +send_kavenegar "${KAVENEGAR_API_KEY}" "${KAVENEGAR_SENDER}" "${to_kavenegar}" "${host} ${status_message} - ${name//_/ } - ${chart}" "${alarm} +Severity: ${severity} +Chart: ${chart} +Family: ${family} +${info}" + +SENT_KAVENEGAR=$? + +# ----------------------------------------------------------------------------- +# send the telegram.org message + +# https://core.telegram.org/bots/api#formatting-options +send_telegram "${TELEGRAM_BOT_TOKEN}" "${to_telegram}" "${host} ${status_message} - <b>${name//_/ }</b> +${chart} (${family}) +<a href=\"${goto_url}\">${alarm}</a> +<i>${info}</i>" + +SENT_TELEGRAM=$? + +# ----------------------------------------------------------------------------- +# send the kafka message + +send_kafka +SENT_KAFKA=$? + +# ----------------------------------------------------------------------------- +# send the pagerduty.com message + +send_pd "${to_pd}" +SENT_PD=$? + +# ----------------------------------------------------------------------------- +# send the fleep message + +send_fleep "${to_fleep}" +SENT_FLEEP=$? + +# ----------------------------------------------------------------------------- +# send the Prowl message + +send_prowl "${to_prowl}" +SENT_PROWL=$? 
+ +# ----------------------------------------------------------------------------- +# send the irc message + +send_irc "${IRC_NICKNAME}" "${IRC_REALNAME}" "${to_irc}" "${IRC_NETWORK}" "${IRC_PORT}" "${host}" "${host} ${status_message} - ${name//_/ } - ${chart} ----- ${alarm} +Severity: ${severity} +Chart: ${chart} +Family: ${family} +${info}" + +SENT_IRC=$? + +# ----------------------------------------------------------------------------- +# send the SMS message with smstools3 + +send_sms "${to_sms}" + +SENT_SMS=$? + +# ----------------------------------------------------------------------------- +# send the custom message + +send_custom() { + # is it enabled? + [ "${SEND_CUSTOM}" != "YES" ] && return 1 + + # do we have any sender? + [ -z "${1}" ] && return 1 + + # call the custom_sender function + custom_sender "${@}" +} + +send_custom "${to_custom}" +SENT_CUSTOM=$? + +# ----------------------------------------------------------------------------- +# send hipchat message + +send_hipchat "${HIPCHAT_AUTH_TOKEN}" "${to_hipchat}" " \ +${host} ${status_message}<br/> \ +<b>${alarm}</b> ${info_html}<br/> \ +<b>${chart}</b> (family <b>${family}</b>)<br/> \ +<b>${date}${raised_for_html}</b><br/> \ +<a href=\\\"${goto_url}\\\">View netdata dashboard</a> \ +(source of alarm ${src}) \ +" + +SENT_HIPCHAT=$? + +# ----------------------------------------------------------------------------- +# send the Amazon SNS message + +send_awssns "${to_awssns}" + +SENT_AWSSNS=$? + +# ----------------------------------------------------------------------------- +# send the Matrix message +send_matrix "${MATRIX_HOMESERVER}" "${to_matrix}" + +SENT_MATRIX=$? + + +# ----------------------------------------------------------------------------- +# send the syslog message + +send_syslog "${to_syslog}" + +SENT_SYSLOG=$? 
+ +# ----------------------------------------------------------------------------- +# send the email + +IFS='' read -r -d '' email_plaintext_part <<EOF +Content-Type: text/plain; encoding=${EMAIL_CHARSET} +Content-Disposition: inline +Content-Transfer-Encoding: 8bit + +${host} ${status_message} + +${alarm} ${info} +${raised_for} + +Chart : ${chart} +Family : ${family} +Severity: ${severity} +URL : ${goto_url} +Source : ${src} +Date : ${date} +Notification generated on ${host} + +Evaluated Expression : ${calc_expression} +Expression Variables : ${calc_param_values} + +The host has ${total_warnings} WARNING and ${total_critical} CRITICAL alarm(s) raised. +EOF + +if [[ "${EMAIL_PLAINTEXT_ONLY}" == "YES" ]]; then + +send_email <<EOF +To: ${to_email} +Subject: ${host} ${status_message} - ${name//_/ } - ${chart} +MIME-Version: 1.0 +Content-Type: multipart/alternative; boundary="multipart-boundary" +${email_thread_headers} +X-Netdata-Severity: ${status,,} +X-Netdata-Alert-Name: $name +X-Netdata-Chart: $chart +X-Netdata-Family: $family +X-Netdata-Classification: $classification +X-Netdata-Host: $host +X-Netdata-Role: $roles + +This is a MIME-encoded multipart message + +--multipart-boundary +${email_plaintext_part} +--multipart-boundary-- +EOF + +else + +now=$(date "+%s") + +if [ -n "$total_warn_alarms" ]; then + while read -d, -r pair; do + IFS='=' read -r key val <<<"$pair" + + date_w=$(date --date=@${val} "${date_format}" 2>/dev/null) + [ -z "${date_w}" ] && date_w=$(date "${date_format}" 2>/dev/null) + [ -z "${date_w}" ] && date_w=$(date --date=@${val} 2>/dev/null) + [ -z "${date_w}" ] && date_w=$(date 2>/dev/null) + + elapsed=$((now - val)) + + duration4human ${elapsed} >/dev/null + elapsed_txt="${REPLY}" + + WARN_ALARMS+=" + <div class=\"set-font\" style=\"font-family: 'IBM Plex Sans', sans-serif; background: #FFFFFF; background-color: #FFFFFF; margin: 0px auto; max-width: 600px;\"> + <table align=\"center\" border=\"0\" cellpadding=\"0\" cellspacing=\"0\" 
role=\"presentation\" style=\"background:#FFFFFF;background-color:#FFFFFF;width:100%;\"> + <tbody> + <tr> + <td style=\"border-top:8px solid #F7F8F8;direction:ltr;font-size:0px;padding:20px 0;text-align:center;\"> + <!--[if mso | IE]><table role=\"presentation\" border=\"0\" cellpadding=\"0\" cellspacing=\"0\"><tr><td class=\"\" style=\"vertical-align:top;width:300px;\" ><![endif]--> + <div class=\"mj-column-per-50 mj-outlook-group-fix\" style=\"font-size:0px;text-align:left;direction:ltr;display:inline-block;vertical-align:top;width:50%;\"> + <table border=\"0\" cellpadding=\"0\" cellspacing=\"0\" role=\"presentation\" style=\"vertical-align:top;\" width=\"100%\"> + <tbody> + <tr> + <td align=\"left\" style=\"font-size:0px;padding:10px 25px;word-break:break-word;\"> + <div style=\"font-family:Open Sans, sans-serif;font-size:14px;font-weight:600;line-height:1;text-align:left;color:#35414A;\">${key}</div> + </td> + </tr> + <tr> + <td align=\"left\" style=\"font-size:0px;padding:10px 25px;padding-top:2px;word-break:break-word;\"> + <div style=\"font-family:Open Sans, sans-serif;font-size:12px;line-height:1;text-align:left;color:#35414A;\">${date_w}</div> + </td> + </tr> + </tbody> + </table> + </div> + <!--[if mso | IE]></td><td class=\"\" style=\"vertical-align:top;width:300px;\" ><![endif]--> + <div class=\"mj-column-per-50 mj-outlook-group-fix\" style=\"font-size:0px;text-align:left;direction:ltr;display:inline-block;vertical-align:top;width:50%;\"> + <table border=\"0\" cellpadding=\"0\" cellspacing=\"0\" role=\"presentation\" width=\"100%\"> + <tbody> + <tr> + <td style=\"vertical-align:top;padding-top:13px;\"> + <table border=\"0\" cellpadding=\"0\" cellspacing=\"0\" role=\"presentation\" style width=\"100%\"> + <tbody> + <tr> + <td align=\"right\" style=\"font-size:0px;padding:10px 25px;word-break:break-word;\"> + <div style=\"font-family:Open Sans, sans-serif;font-size:13px;line-height:1;text-align:right;color:#555555;\"><span 
style=\"background-color:#FFF8E1; border: 1px solid #FFC300; border-radius:36px; padding: 2px 12px; margin-top: 20px; white-space: nowrap\"> + Warning for ${elapsed_txt} + </span></div> + </td> + </tr> + </tbody> + </table> + </td> + </tr> + </tbody> + </table> + </div> + <!--[if mso | IE]></td></tr></table><![endif]--> + </td> + </tr> + </tbody> + </table> + </div> + " + + done <<<"$total_warn_alarms," +fi + +if [ -n "$total_crit_alarms" ]; then + while read -d, -r pair; do + IFS='=' read -r key val <<<"$pair" + + date_c=$(date --date=@${val} "${date_format}" 2>/dev/null) + [ -z "${date_c}" ] && date_c=$(date "${date_format}" 2>/dev/null) + [ -z "${date_c}" ] && date_c=$(date --date=@${val} 2>/dev/null) + [ -z "${date_c}" ] && date_c=$(date 2>/dev/null) + + elapsed=$((now - val)) + + duration4human ${elapsed} >/dev/null + elapsed_txt="${REPLY}" + + CRIT_ALARMS+=" + <div class=\"set-font\" style=\"font-family: 'IBM Plex Sans', sans-serif; background: #FFFFFF; background-color: #FFFFFF; margin: 0px auto; max-width: 600px;\"> + <table align=\"center\" border=\"0\" cellpadding=\"0\" cellspacing=\"0\" role=\"presentation\" style=\"background:#FFFFFF;background-color:#FFFFFF;width:100%;\"> + <tbody> + <tr> + <td style=\"border-top:8px solid #F7F8F8;direction:ltr;font-size:0px;padding:20px 0;text-align:center;\"> + <!--[if mso | IE]><table role=\"presentation\" border=\"0\" cellpadding=\"0\" cellspacing=\"0\"><tr><td class=\"\" style=\"vertical-align:top;width:300px;\" ><![endif]--> + <div class=\"mj-column-per-50 mj-outlook-group-fix\" style=\"font-size:0px;text-align:left;direction:ltr;display:inline-block;vertical-align:top;width:50%;\"> + <table border=\"0\" cellpadding=\"0\" cellspacing=\"0\" role=\"presentation\" style=\"vertical-align:top;\" width=\"100%\"> + <tbody> + <tr> + <td align=\"left\" style=\"font-size:0px;padding:10px 25px;word-break:break-word;\"> + <div style=\"font-family:Open Sans, 
sans-serif;font-size:14px;font-weight:600;line-height:1;text-align:left;color:#35414A;\">${key}</div> + </td> + </tr> + <tr> + <td align=\"left\" style=\"font-size:0px;padding:10px 25px;padding-top:2px;word-break:break-word;\"> + <div style=\"font-family:Open Sans, sans-serif;font-size:12px;line-height:1;text-align:left;color:#35414A;\">${date_c}</div> + </td> + </tr> + </tbody> + </table> + </div> + <!--[if mso | IE]></td><td class=\"\" style=\"vertical-align:top;width:300px;\" ><![endif]--> + <div class=\"mj-column-per-50 mj-outlook-group-fix\" style=\"font-size:0px;text-align:left;direction:ltr;display:inline-block;vertical-align:top;width:50%;\"> + <table border=\"0\" cellpadding=\"0\" cellspacing=\"0\" role=\"presentation\" width=\"100%\"> + <tbody> + <tr> + <td style=\"vertical-align:top;padding-top:13px;\"> + <table border=\"0\" cellpadding=\"0\" cellspacing=\"0\" role=\"presentation\" style width=\"100%\"> + <tbody> + <tr> + <td align=\"right\" style=\"font-size:0px;padding:10px 25px;word-break:break-word;\"> + <div style=\"font-family:Open Sans, sans-serif;font-size:13px;line-height:1;text-align:right;color:#35414A;\"><span style=\"background-color:#FFEBEF; border: 1px solid #FF4136; border-radius:36px; padding: 2px 12px; margin-top: 20px; white-space: nowrap\"> + Critical for ${elapsed_txt} + </span></div> + </td> + </tr> + </tbody> + </table> + </td> + </tr> + </tbody> + </table> + </div> + <!--[if mso | IE]></td></tr></table><![endif]--> + </td> + </tr> + </tbody> + </table> + </div> + " + + done <<<"$total_crit_alarms," +fi + +if (( total_warnings + total_critical > 15 )); then + EXTRA_ALARMS_LIST_TEXT="(Showing latest 15 alerts)" +fi + +if [ -n "$edit_command_line" ]; then + IFS='=' read -r edit_command line s_host <<<"$edit_command_line" +fi + +IFS='' read -r -d '' email_html_part <<EOF +Content-Type: text/html; encoding=${EMAIL_CHARSET} +Content-Disposition: inline +Content-Transfer-Encoding: 8bit + +<!doctype html> +<html 
xmlns="http://www.w3.org/1999/xhtml" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office"> +<head> + <title> + </title> + <!--[if !mso]><!--> + <meta http-equiv="X-UA-Compatible" content="IE=edge"> + <!--<![endif]--> + <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> + <meta name="viewport" content="width=device-width, initial-scale=1"> + <style type="text/css"> + #outlook a { padding:0; } + body { margin:0;padding:0;-webkit-text-size-adjust:100%;-ms-text-size-adjust:100%; } + table, td { border-collapse:collapse;mso-table-lspace:0pt;mso-table-rspace:0pt; } + img { border:0;height:auto;line-height:100%; outline:none;text-decoration:none;-ms-interpolation-mode:bicubic; } + p { display:block;margin:13px 0; } + </style> + <!--[if mso]> + <xml> + <o:OfficeDocumentSettings> + <o:AllowPNG/> + <o:PixelsPerInch>96</o:PixelsPerInch> + </o:OfficeDocumentSettings> + </xml> + <![endif]--> + <!--[if lte mso 11]> + <style type="text/css"> + .mj-outlook-group-fix { width:100% !important; } + </style> + <![endif]--> + <!--[if !mso]><!--> + <link href="https://fonts.googleapis.com/css2?family=Open+Sans:wght@300;400;500;600;700&display=swap" rel="stylesheet" type="text/css"> + <link href="https://fonts.googleapis.com/css?family=Ubuntu:300,400,500,700" rel="stylesheet" type="text/css"> + <style type="text/css"> + @import url(https://fonts.googleapis.com/css2?family=Open+Sans:wght@300;400;500;600;700&display=swap); + @import url(https://fonts.googleapis.com/css?family=Ubuntu:300,400,500,700); + </style> + <!--<![endif]--> + <style type="text/css"> + @media only screen and (min-width:100px) { + .mj-column-px-130 { width:130px !important; max-width: 130px; } + .mj-column-per-50 { width:50% !important; max-width: 50%; } + .mj-column-per-70 { width:70% !important; max-width: 70%; } + .mj-column-per-30 { width:30% !important; max-width: 30%; } + .mj-column-per-100 { width:100% !important; max-width: 100%; } + .mj-column-px-66 { 
width:66px !important; max-width: 66px; } + .mj-column-px-400 { width:400px !important; max-width: 400px; } + } + </style> + <style type="text/css"> + @media only screen and (max-width:100px) { + table.mj-full-width-mobile { width: 100% !important; } + td.mj-full-width-mobile { width: auto !important; } + } + </style> +</head> +<body style="word-spacing:normal;"> +<div class="svgbg" style="background-image: url('https://staging.netdata.cloud/static/email/img/isotype_600.png'); background-repeat: no-repeat; background-position: top center; background-size: 600px 192px;"> + <!--[if mso | IE]><table align="center" border="0" cellpadding="0" cellspacing="0" class="" style="width:600px;" width="600" ><tr><td style="line-height:0px;font-size:0px;mso-line-height-rule:exactly;"><![endif]--> + <div style="margin:0px auto;max-width:600px;"> + <table align="center" border="0" cellpadding="0" cellspacing="0" role="presentation" style="width:100%;"> + <tbody> + <tr> + <td style="direction:ltr;font-size:0px;padding:20px 0;padding-bottom:50px;padding-left:0;text-align:left;"> + <!--[if mso | IE]><table role="presentation" border="0" cellpadding="0" cellspacing="0"><tr><td class="" style="vertical-align:top;width:130px;" ><![endif]--> + <div class="mj-column-px-130 mj-outlook-group-fix" style="font-size:0px;text-align:left;direction:ltr;display:inline-block;vertical-align:top;width:130px;"> + <table border="0" cellpadding="0" cellspacing="0" role="presentation" style="vertical-align:top;" width="100%"> + <tbody> + <tr> + <td align="center" style="font-size:0px;padding:10px 25px;padding-right:0;padding-left:0;word-break:break-word;"> + <table border="0" cellpadding="0" cellspacing="0" role="presentation" style="border-collapse:collapse;border-spacing:0px;"> + <tbody> + <tr> + <td style="width:130px;"> + <img alt="Netdata Logo" height="auto" src="https://app.netdata.cloud/static/email/img/full_logo.png" 
style="border:0;display:block;outline:none;text-decoration:none;height:auto;width:100%;font-size:13px;" width="130"> + </td> + </tr> + </tbody> + </table> + </td> + </tr> + </tbody> + </table> + </div> + <!--[if mso | IE]></td><td class="" style="vertical-align:top;width:300px;" ><![endif]--> + <div class="mj-column-per-50 mj-outlook-group-fix" style="font-size:0px;text-align:left;direction:ltr;display:inline-block;vertical-align:top;width:50%;"> + <table border="0" cellpadding="0" cellspacing="0" role="presentation" width="100%"> + <tbody> + <tr> + <td style="vertical-align:top;padding-top:4px;"> + <table border="0" cellpadding="0" cellspacing="0" role="presentation" style width="100%"> + <tbody> + <tr> + <td align="left" style="font-size:0px;padding:10px 25px;padding-left:10px;word-break:break-word;"> + <div style="font-family:Open Sans, sans-serif;font-size:16px;line-height:1;text-align:left;color:#35414A;">Notification</div> + </td> + </tr> + </tbody> + </table> + </td> + </tr> + </tbody> + </table> + </div> + <!--[if mso | IE]></td></tr></table><![endif]--> + </td> + </tr> + </tbody> + </table> + </div> + <!--[if mso | IE]></td></tr></table><table align="center" border="0" cellpadding="0" cellspacing="0" class="no-collapse-outlook" style="width:600px;" width="600" ><tr><td style="line-height:0px;font-size:0px;mso-line-height-rule:exactly;"><![endif]--> + <div class="no-collapse" style="border-collapse: initial; margin: 0px auto; border-radius: 4px; max-width: 600px;"> + <table align="center" border="0" cellpadding="0" cellspacing="0" role="presentation" style="width:100%;border-radius:4px;"> + <tbody> + <tr> + <td style="border:1px solid ${border_color};direction:ltr;font-size:0px;padding:20px 0;text-align:center;"> + <!--[if mso | IE]><table role="presentation" border="0" cellpadding="0" cellspacing="0"><tr><td class="set-font-outlook" width="600px" ><table align="center" border="0" cellpadding="0" cellspacing="0" class="set-font-outlook" style="width:598px;" 
width="598" ><tr><td style="line-height:0px;font-size:0px;mso-line-height-rule:exactly;"><![endif]--> + <div class="set-font" style="font-family: 'IBM Plex Sans', sans-serif; margin: 0px auto; max-width: 598px;"> + <table align="center" border="0" cellpadding="0" cellspacing="0" role="presentation" style="width:100%;"> + <tbody> + <tr> + <td style="direction:ltr;font-size:0px;padding:20px 0;padding-bottom:0;padding-top:0;text-align:center;"> + <!--[if mso | IE]><table role="presentation" border="0" cellpadding="0" cellspacing="0"><tr><td class="" style="vertical-align:top;width:418.6px;" ><![endif]--> + <div class="mj-column-per-70 mj-outlook-group-fix" style="font-size:0px;text-align:left;direction:ltr;display:inline-block;vertical-align:top;width:70%;"> + <table border="0" cellpadding="0" cellspacing="0" role="presentation" style="vertical-align:top;" width="100%"> + <tbody> + <tr> + <td align="left" style="font-size:0px;padding:10px 25px;padding-top:15px;word-break:break-word;"> + <div style="font-family:Open Sans, sans-serif;font-size:20px;font-weight:700;line-height:1;text-align:left;color:#35414A;">${name}</div> + </td> + </tr> + </tbody> + </table> + </div> + <!--[if mso | IE]></td><td class="" style="vertical-align:top;width:179.4px;" ><![endif]--> + <div class="mj-column-per-30 mj-outlook-group-fix" style="font-size:0px;text-align:left;direction:ltr;display:inline-block;vertical-align:top;width:30%;"> + <table border="0" cellpadding="0" cellspacing="0" role="presentation" style="vertical-align:top;" width="100%"> + <tbody> + <tr> + <td align="right" style="font-size:0px;padding:10px 25px;padding-right:25px;word-break:break-word;"> + <table border="0" cellpadding="0" cellspacing="0" role="presentation" style="border-collapse:collapse;border-spacing:0px;"> + <tbody> + <tr> + <td style="width:100px;"> + <img height="auto" src="${alarm_badge}" style="border:0;display:block;outline:none;text-decoration:none;height:auto;width:100%;font-size:13px;" width="100"/> 
+ </td> + </tr> + </tbody> + </table> + </td> + </tr> + </tbody> + </table> + </div> + <!--[if mso | IE]></td></tr></table><![endif]--> + </td> + </tr> + </tbody> + </table> + </div> + <!--[if mso | IE]></td></tr></table></td></tr><tr><td class="set-font-outlook" width="600px" ><table align="center" border="0" cellpadding="0" cellspacing="0" class="set-font-outlook" style="width:598px;" width="598" ><tr><td style="line-height:0px;font-size:0px;mso-line-height-rule:exactly;"><![endif]--> + <div class="set-font" style="font-family: 'IBM Plex Sans', sans-serif; margin: 0px auto; max-width: 598px;"> + <table align="center" border="0" cellpadding="0" cellspacing="0" role="presentation" style="width:100%;"> + <tbody> + <tr> + <td style="direction:ltr;font-size:0px;padding:0;text-align:center;"> + <!--[if mso | IE]><table role="presentation" border="0" cellpadding="0" cellspacing="0"><tr><td class="" style="vertical-align:top;width:598px;" ><![endif]--> + <div class="mj-column-per-100 mj-outlook-group-fix" style="font-size:0px;text-align:left;direction:ltr;display:inline-block;vertical-align:top;width:100%;"> + <table border="0" cellpadding="0" cellspacing="0" role="presentation" style="vertical-align:top;" width="100%"> + <tbody> + <tr> + <td align="left" style="font-size:0px;padding:10px 25px;padding-top:0;word-break:break-word;"> + <div style="font-family:IBM Plex Sans, sans-serif;font-size:16px;line-height:1;text-align:left;color:#35414A;">on ${host}</div> + </td> + </tr> + </tbody> + </table> + </div> + <!--[if mso | IE]></td></tr></table><![endif]--> + </td> + </tr> + </tbody> + </table> + </div> + <!--[if mso | IE]></td></tr></table></td></tr><tr><td class="set-font-outlook" width="600px" ><table align="center" border="0" cellpadding="0" cellspacing="0" class="set-font-outlook" style="width:598px;" width="598" ><tr><td style="line-height:0px;font-size:0px;mso-line-height-rule:exactly;"><![endif]--> + <div class="set-font" style="font-family: 'IBM Plex Sans', 
sans-serif; margin: 0px auto; max-width: 598px;"> + <table align="center" border="0" cellpadding="0" cellspacing="0" role="presentation" style="width:100%;"> + <tbody> + <tr> + <td style="direction:ltr;font-size:0px;padding:20px 0;text-align:center;"> + <!--[if mso | IE]><table role="presentation" border="0" cellpadding="0" cellspacing="0"><tr><td class="" style="vertical-align:top;width:598px;" ><![endif]--> + <div class="mj-column-per-100 mj-outlook-group-fix" style="font-size:0px;text-align:left;direction:ltr;display:inline-block;vertical-align:top;width:100%;"> + <table border="0" cellpadding="0" cellspacing="0" role="presentation" style="vertical-align:top;" width="100%"> + <tbody> + <tr> + <td align="left" style="font-size:0px;padding:10px 25px;padding-top:0;word-break:break-word;"> + <div style="font-family:Open Sans, sans-serif;font-size:26px;font-weight:700;line-height:1;text-align:left;color:#35414A;"><span style="color: ${text_color}; font-size:26px; background: ${background_color}; padding:4px 24px; border-radius: 36px">${value_string} + </span></div> + </td> + </tr> + </tbody> + </table> + </div> + <!--[if mso | IE]></td></tr></table><![endif]--> + </td> + </tr> + </tbody> + </table> + </div> + <!--[if mso | IE]></td></tr></table></td></tr><tr><td class="set-font-outlook" width="600px" ><table align="center" border="0" cellpadding="0" cellspacing="0" class="set-font-outlook" style="width:598px;" width="598" ><tr><td style="line-height:0px;font-size:0px;mso-line-height-rule:exactly;"><![endif]--> + <div class="set-font" style="font-family: 'IBM Plex Sans', sans-serif; margin: 0px auto; max-width: 598px;"> + <table align="center" border="0" cellpadding="0" cellspacing="0" role="presentation" style="width:100%;"> + <tbody> + <tr> + <td style="direction:ltr;font-size:0px;padding:20px 0;padding-bottom:0;padding-top:0;text-align:center;"> + <!--[if mso | IE]><table role="presentation" border="0" cellpadding="0" cellspacing="0"><tr><td class="" 
style="vertical-align:top;width:598px;" ><![endif]--> + <div class="mj-column-per-100 mj-outlook-group-fix" style="font-size:0px;text-align:left;direction:ltr;display:inline-block;vertical-align:top;width:100%;"> + <table border="0" cellpadding="0" cellspacing="0" role="presentation" style="vertical-align:top;" width="100%"> + <tbody> + <tr> + <td align="left" style="font-size:0px;padding:10px 25px;padding-top:0;word-break:break-word;"> + <div style="font-family:Open Sans, sans-serif;font-size:16px;line-height:21px;text-align:left;color:#35414A;">Details: ${info}</div> + </td> + </tr> + </tbody> + </table> + </div> + <!--[if mso | IE]></td></tr></table><![endif]--> + </td> + </tr> + </tbody> + </table> + </div> + <!--[if mso | IE]></td></tr></table></td></tr><tr><td class="set-font-outlook" width="600px" ><table align="center" border="0" cellpadding="0" cellspacing="0" class="set-font-outlook" style="width:598px;" width="598" ><tr><td style="line-height:0px;font-size:0px;mso-line-height-rule:exactly;"><![endif]--> + <div class="set-font" style="font-family: 'IBM Plex Sans', sans-serif; margin: 0px auto; max-width: 598px;"> + <table align="center" border="0" cellpadding="0" cellspacing="0" role="presentation" style="width:100%;"> + <tbody> + <tr> + <td style="direction:ltr;font-size:0px;padding:20px 0;padding-bottom:0;padding-top:0;text-align:center;"> + <!--[if mso | IE]><table role="presentation" border="0" cellpadding="0" cellspacing="0"><tr><td class="" style="vertical-align:top;width:598px;" ><![endif]--> + <div class="mj-column-per-100 mj-outlook-group-fix" style="font-size:0px;text-align:left;direction:ltr;display:inline-block;vertical-align:top;width:100%;"> + <table border="0" cellpadding="0" cellspacing="0" role="presentation" style="vertical-align:top;" width="100%"> + <tbody> + <tr> + <td align="center" vertical-align="middle" style="font-size:0px;padding:10px 25px;word-break:break-word;"> + <table border="0" cellpadding="0" cellspacing="0" 
role="presentation" style="border-collapse:separate;width:100%;line-height:100%;"> + <tr> + <td + align="center" bgcolor="${border_color}" role="presentation" style="border:none;border-radius:3px;cursor:auto;height:44px;background:${border_color};" valign="middle"> + <p style="display:block;background:${border_color};color:#ffffff;font-size:13px;font-weight:600;line-height:44px;margin:0;text-decoration:none;text-transform:none;mso-padding-alt:0px;border-radius:3px;"> + <a href="${goto_url}" style="color: ${action_text_color}; text-decoration: none; width: 100%; display: inline-block">GO TO CHART</a> + </p> + </td> + </tr> + </table> + </td> + </tr> + </tbody> + </table> + </div> + <!--[if mso | IE]></td></tr></table><![endif]--> + </td> + </tr> + </tbody> + </table> + </div> + <!--[if mso | IE]></td></tr></table></td></tr></table><![endif]--> + </td> + </tr> + </tbody> + </table> + </div> + <!--[if mso | IE]></td></tr></table><![endif]--> + <div style="height:32px;line-height:32px;"> </div> + <!--[if mso | IE]><table align="center" border="0" cellpadding="0" cellspacing="0" class="set-font-outlook" style="width:600px;" width="600" ><tr><td style="line-height:0px;font-size:0px;mso-line-height-rule:exactly;"><![endif]--> + <div class="set-font" style="font-family: 'IBM Plex Sans', sans-serif; background: ${background_color}; background-color: ${background_color}; margin: 0px auto; border-radius: 4px; max-width: 600px;"> + <table align="center" border="0" cellpadding="0" cellspacing="0" role="presentation" style="background:${background_color};background-color:${background_color};width:100%;border-radius:4px;"> + <tbody> + <tr> + <td style="direction:ltr;font-size:0px;padding:20px 0;text-align:center;"> + <!--[if mso | IE]><table role="presentation" border="0" cellpadding="0" cellspacing="0"><tr><td class="" style="vertical-align:top;width:600px;" ><![endif]--> + <div class="mj-column-per-100 mj-outlook-group-fix" 
style="font-size:0px;text-align:left;direction:ltr;display:inline-block;vertical-align:top;width:100%;"> + <table border="0" cellpadding="0" cellspacing="0" role="presentation" style="vertical-align:top;" width="100%"> + <tbody> + <tr> + <td align="left" style="font-size:0px;padding:10px 25px;padding-bottom:6px;word-break:break-word;"> + <div style="font-family:Open Sans, sans-serif;font-size:18px;line-height:1;text-align:left;color:#35414A;">Chart: + <span style="font-weight:700; font-size:20px">${chart}</span></div> + </td> + </tr> + <tr> + <td align="left" style="font-size:0px;padding:10px 25px;padding-top:0;word-break:break-word;"> + <div style="font-family:Open Sans, sans-serif;font-size:18px;line-height:1;text-align:left;color:#35414A;">Family: + <span style="font-weight:700; font-size:20px">${family}</span></div> + </td> + </tr> + <tr> + <td align="left" style="font-size:0px;padding:10px 25px;padding-top:4px;word-break:break-word;"> + <div style="font-family:Open Sans, sans-serif;font-size:14px;line-height:1;text-align:left;color:#35414A;">${rich_status_raised_for}</div> + </td> + </tr> + <tr> + <td align="center" style="font-size:0px;padding:10px 25px;word-break:break-word;"> + <p style="border-top:solid 1px lightgrey;font-size:1px;margin:0px auto;width:100%;"> + </p> + <!--[if mso | IE]><table align="center" border="0" cellpadding="0" cellspacing="0" style="border-top:solid 1px lightgrey;font-size:1px;margin:0px auto;width:550px;" role="presentation" width="550px" ><tr><td style="height:0;line-height:0;"> + </td></tr></table><![endif]--> + </td> + </tr> + <tr> + <td align="left" style="font-size:0px;padding:10px 25px;padding-bottom:6px;word-break:break-word;"> + <div style="font-family:Open Sans, sans-serif;font-size:16px;line-height:1;text-align:left;color:#35414A;">On + <span style="font-weight:600">${date}</span></div> + </td> + </tr> + <tr> + <td align="left" style="font-size:0px;padding:10px 25px;padding-top:0;word-break:break-word;"> + <div 
style="font-family:Open Sans, sans-serif;font-size:16px;line-height:1;text-align:left;color:#35414A;">By: + <span style="font-weight:600">${host}</span></div> + </td> + </tr> + <tr> + <td align="left" style="font-size:0px;padding:10px 25px;padding-top:4px;word-break:break-word;"> + <div style="font-family:Open Sans, sans-serif;font-size:14px;line-height:1;text-align:left;color:#35414A;">Global time: + <span style="font-weight:600">${date_utc}</span></div> + </td> + </tr> + <tr> + <td align="center" style="font-size:0px;padding:10px 25px;word-break:break-word;"> + <p style="border-top:solid 1px lightgrey;font-size:1px;margin:0px auto;width:100%;"> + </p> + <!--[if mso | IE]><table align="center" border="0" cellpadding="0" cellspacing="0" style="border-top:solid 1px lightgrey;font-size:1px;margin:0px auto;width:550px;" role="presentation" width="550px" ><tr><td style="height:0;line-height:0;"> + </td></tr></table><![endif]--> + </td> + </tr> + <tr> + <td align="left" style="font-size:0px;padding:10px 25px;padding-bottom:6px;word-break:break-word;"> + <div style="font-family:Open Sans, sans-serif;font-size:16px;line-height:1;text-align:left;color:#35414A;">Classification: + <span style="font-weight:600">${classification}</span></div> + </td> + </tr> + <tr> + <td align="left" style="font-size:0px;padding:10px 25px;padding-top:0;word-break:break-word;"> + <div style="font-family:Open Sans, sans-serif;font-size:16px;line-height:1;text-align:left;color:#35414A;">Role: + <span style="font-weight:600">${roles}</span></div> + </td> + </tr> + </tbody> + </table> + </div> + <!--[if mso | IE]></td></tr></table><![endif]--> + </td> + </tr> + </tbody> + </table> + </div> + <!--[if mso | IE]></td></tr></table><table align="center" border="0" cellpadding="0" cellspacing="0" class="set-font-outlook" style="width:600px;" width="600" ><tr><td style="line-height:0px;font-size:0px;mso-line-height-rule:exactly;"><![endif]--> + <div class="set-font" style="font-family: 'IBM Plex Sans', 
sans-serif; margin: 0px auto; max-width: 600px;"> + <table align="center" border="0" cellpadding="0" cellspacing="0" role="presentation" style="width:100%;"> + <tbody> + <tr> + <td style="direction:ltr;font-size:0px;padding:20px 0;padding-left:25px;text-align:left;"> + <!--[if mso | IE]><table role="presentation" border="0" cellpadding="0" cellspacing="0"><tr><td class="" style="vertical-align:top;width:66px;" ><![endif]--> + <div class="mj-column-px-66 mj-outlook-group-fix" style="font-size:0px;text-align:left;direction:ltr;display:inline-block;vertical-align:top;width:66px;"> + <table border="0" cellpadding="0" cellspacing="0" role="presentation" width="100%"> + <tbody> + <tr> + <td style="vertical-align:top;padding:0;"> + <table border="0" cellpadding="0" cellspacing="0" role="presentation" style width="100%"> + <tbody> + <tr> + <td align="left" style="font-size:0px;padding:10px 25px;padding-right:0;padding-left:0;word-break:break-word;"> + <table border="0" cellpadding="0" cellspacing="0" role="presentation" style="border-collapse:collapse;border-spacing:0px;"> + <tbody> + <tr> + <td style="width:48px;"> + <img height="auto" src="https://app.netdata.cloud/static/email/img/community_icon.png" style="border:0;display:block;outline:none;text-decoration:none;height:auto;width:100%;font-size:13px;" width="48"> + </td> + </tr> + </tbody> + </table> + </td> + </tr> + </tbody> + </table> + </td> + </tr> + </tbody> + </table> + </div> + <!--[if mso | IE]></td><td align="left" class="" style="vertical-align:top;width:400px;" ><![endif]--> + <div class="mj-column-px-400 mj-outlook-group-fix" style="font-size:0px;text-align:left;direction:ltr;display:inline-block;vertical-align:top;width:400px;"> + <table border="0" cellpadding="0" cellspacing="0" role="presentation" width="100%"> + <tbody> + <tr> + <td style="vertical-align:top;padding-left:0;"> + <table border="0" cellpadding="0" cellspacing="0" role="presentation" style width="100%"> + <tbody> + <tr> + <td align="left" 
style="font-size:0px;padding:10px 25px;padding-left:0;word-break:break-word;"> + <div style="font-family:Open Sans, sans-serif;font-size:16px;font-weight:700;line-height:1;text-align:left;color:#35414A;">Want to know more about this alert?</div> + </td> + </tr> + <tr> + <td align="left" style="font-size:0px;padding:10px 25px;padding-left:0;word-break:break-word;"> + <div style="font-family:Open Sans, sans-serif;font-size:14px;line-height:1.3;text-align:left;color:#35414A;">Join the troubleshooting discussion for this alert on our <a href="https://community.netdata.cloud/t/${name//[._]/-}" class="link" style="color: #00AB44; text-decoration: none;">community forums</a>.</div> + </td> + </tr> + </tbody> + </table> + </td> + </tr> + </tbody> + </table> + </div> + <!--[if mso | IE]></td></tr></table><![endif]--> + </td> + </tr> + </tbody> + </table> + </div> + <!--[if mso | IE]></td></tr></table><table align="center" border="0" cellpadding="0" cellspacing="0" class="set-font-outlook" style="width:600px;" width="600" ><tr><td style="line-height:0px;font-size:0px;mso-line-height-rule:exactly;"><![endif]--> + <div class="set-font" style="font-family: 'IBM Plex Sans', sans-serif; margin: 0px auto; max-width: 600px;"> + <table align="center" border="0" cellpadding="0" cellspacing="0" role="presentation" style="width:100%;"> + <tbody> + <tr> + <td style="direction:ltr;font-size:0px;padding:20px 0;padding-left:25px;text-align:left;"> + <!--[if mso | IE]><table role="presentation" border="0" cellpadding="0" cellspacing="0"><tr><td class="" style="vertical-align:top;width:66px;" ><![endif]--> + <div class="mj-column-px-66 mj-outlook-group-fix" style="font-size:0px;text-align:left;direction:ltr;display:inline-block;vertical-align:top;width:66px;"> + <table border="0" cellpadding="0" cellspacing="0" role="presentation" width="100%"> + <tbody> + <tr> + <td style="vertical-align:top;padding:0;"> + <table border="0" cellpadding="0" cellspacing="0" role="presentation" style 
width="100%"> + <tbody> + <tr> + <td align="left" style="font-size:0px;padding:10px 25px;padding-right:0;padding-left:0;word-break:break-word;"> + <table border="0" cellpadding="0" cellspacing="0" role="presentation" style="border-collapse:collapse;border-spacing:0px;"> + <tbody> + <tr> + <td style="width:48px;"> + <img height="auto" src="https://app.netdata.cloud/static/email/img/configure_icon.png" style="border:0;display:block;outline:none;text-decoration:none;height:auto;width:100%;font-size:13px;" width="48"> + </td> + </tr> + </tbody> + </table> + </td> + </tr> + </tbody> + </table> + </td> + </tr> + </tbody> + </table> + </div> + <!--[if mso | IE]></td><td align="left" class="" style="vertical-align:top;width:400px;" ><![endif]--> + <div class="mj-column-px-400 mj-outlook-group-fix" style="font-size:0px;text-align:left;direction:ltr;display:inline-block;vertical-align:top;width:400px;"> + <table border="0" cellpadding="0" cellspacing="0" role="presentation" width="100%"> + <tbody> + <tr> + <td style="vertical-align:top;padding-left:0;"> + <table border="0" cellpadding="0" cellspacing="0" role="presentation" style width="100%"> + <tbody> + <tr> + <td align="left" style="font-size:0px;padding:10px 25px;padding-left:0;word-break:break-word;"> + <div style="font-family:Open Sans, sans-serif;font-size:16px;font-weight:700;line-height:1;text-align:left;color:#35414A;">Need to configure this alert?</div> + </td> + </tr> + <tr> + <td align="left" style="font-size:0px;padding:10px 25px;padding-left:0;word-break:break-word;"> + <div style="font-family:Open Sans, sans-serif;font-size:14px;line-height:1.3;text-align:left;color:#35414A;"><span style="color: #00AB44"><a href="https://learn.netdata.cloud/docs/agent/health/notifications#:~:text=To%20edit%20it%20on%20your,have%20one%20or%20more%20destinations" class="link" style="color: #00AB44; text-decoration: none;">Edit</a></span> this alert's configuration file by logging into $s_host and running the following 
command:</div> + </td> + </tr> + <tr> + <td align="left" style="font-size:0px;padding:10px 25px;padding-top:8px;padding-left:0;word-break:break-word;"> + <div style="font-family:Open Sans, sans-serif;font-size:12px;line-height:1.3;text-align:left;color:#35414A;">${edit_command} <br> + <br>The alarm to edit is at line ${line}</div> + </td> + </tr> + </tbody> + </table> + </td> + </tr> + </tbody> + </table> + </div> + <!--[if mso | IE]></td></tr></table><![endif]--> + </td> + </tr> + </tbody> + </table> + </div> + <!--[if mso | IE]></td></tr></table><table align="center" border="0" cellpadding="0" cellspacing="0" class="history-wrapper-outlook" style="width:600px;" width="600" ><tr><td style="line-height:0px;font-size:0px;mso-line-height-rule:exactly;"><![endif]--> + <div class="history-wrapper" style="background: #F7F8F8; background-color: #F7F8F8; margin: 0px auto; max-width: 100%;"> + <table align="center" border="0" cellpadding="0" cellspacing="0" role="presentation" style="background:#F7F8F8;background-color:#F7F8F8;width:100%;"> + <tbody> + <tr> + <td style="direction:ltr;font-size:0px;padding:0;padding-bottom:24px;text-align:center;"> + <!--[if mso | IE]><table role="presentation" border="0" cellpadding="0" cellspacing="0"><tr><td class="set-font-outlook" width="600px" ><table align="center" border="0" cellpadding="0" cellspacing="0" class="set-font-outlook" style="width:600px;" width="600" ><tr><td style="line-height:0px;font-size:0px;mso-line-height-rule:exactly;"><![endif]--> + <div class="set-font" style="font-family: 'IBM Plex Sans', sans-serif; margin: 0px auto; max-width: 600px;"> + <table align="center" border="0" cellpadding="0" cellspacing="0" role="presentation" style="width:100%;"> + <tbody> + <tr> + <td style="direction:ltr;font-size:0px;padding:20px 0;padding-bottom:12px;text-align:center;"> + <!--[if mso | IE]><table role="presentation" border="0" cellpadding="0" cellspacing="0"><tr><td class="" style="vertical-align:top;width:600px;" 
><![endif]--> + <div class="mj-column-per-100 mj-outlook-group-fix" style="font-size:0px;text-align:left;direction:ltr;display:inline-block;vertical-align:top;width:100%;"> + <table border="0" cellpadding="0" cellspacing="0" role="presentation" style="vertical-align:top;" width="100%"> + <tbody> + <tr> + <td align="left" style="font-size:0px;padding:10px 25px;word-break:break-word;"> + <div style="font-family:Open Sans, sans-serif;font-size:16px;line-height:1;text-align:center;color:#35414A;">The node has + <span style="font-weight:600">${total_warnings} warning</span> + and + <span style="font-weight:600">${total_critical} critical</span> + additional active alert(s)</div> + </td> + </tr> + <td align="left" style="font-size:0px;padding:10px 25px;word-break:break-word;"> + <div style="font-family:Open Sans, sans-serif;font-size:12px;line-height:1;text-align:center;color:#35414A;">${EXTRA_ALARMS_LIST_TEXT}</div> + </td> + </tr> + </tbody> + </table> + </div> + <!--[if mso | IE]></td></tr></table><![endif]--> + </td> + </tr> + </tbody> + </table> + </div> + ${CRIT_ALARMS} + ${WARN_ALARMS} + <!--[if mso | IE]></td></tr></table></td></tr></table><![endif]--> + </td> + </tr> + </tbody> + </table> + </div> + <!--[if mso | IE]></td></tr></table><table align="center" border="0" cellpadding="0" cellspacing="0" class="set-font-outlook" style="width:600px;" width="600" ><tr><td style="line-height:0px;font-size:0px;mso-line-height-rule:exactly;"><![endif]--> + <div class="set-font" style="font-family: 'IBM Plex Sans', sans-serif; margin: 0px auto; max-width: 600px;"> + <table align="center" border="0" cellpadding="0" cellspacing="0" role="presentation" style="width:100%;"> + <tbody> + <tr> + <td style="direction:ltr;font-size:0px;padding:20px 0;text-align:center;"> + <!--[if mso | IE]><table role="presentation" border="0" cellpadding="0" cellspacing="0"><tr><td class="" style="vertical-align:top;width:600px;" ><![endif]--> + <div class="mj-column-per-100 mj-outlook-group-fix" 
style="font-size:0px;text-align:left;direction:ltr;display:inline-block;vertical-align:top;width:100%;"> + <table border="0" cellpadding="0" cellspacing="0" role="presentation" width="100%"> + <tbody> + <tr> + <td style="vertical-align:top;padding-top:44px;padding-bottom:12px;"> + <table border="0" cellpadding="0" cellspacing="0" role="presentation" style width="100%"> + <tbody> + <tr> + <td align="left" style="font-size:0px;padding:10px 25px;padding-top:0;padding-bottom:0;word-break:break-word;"> + <div style="font-family:Open Sans, sans-serif;font-size:13px;line-height:1;text-align:center;color:#35414A;">© Netdata 2021 - The real-time performance and health monitoring</div> + </td> + </tr> + </tbody> + </table> + </td> + </tr> + </tbody> + </table> + </div> + <!--[if mso | IE]></td></tr></table><![endif]--> + </td> + </tr> + </tbody> + </table> + </div> + <!--[if mso | IE]></td></tr></table><![endif]--> +</div> +</body> +</html> +EOF + +send_email <<EOF +To: ${to_email} +Subject: ${html_email_subject} +MIME-Version: 1.0 +Content-Type: multipart/alternative; boundary="multipart-boundary" +${email_thread_headers} +X-Netdata-Severity: ${status,,} +X-Netdata-Alert-Name: $name +X-Netdata-Chart: $chart +X-Netdata-Family: $family +X-Netdata-Classification: $classification +X-Netdata-Host: $host +X-Netdata-Role: $roles + +This is a MIME-encoded multipart message + +--multipart-boundary +${email_plaintext_part} +--multipart-boundary +${email_html_part} +--multipart-boundary-- +EOF + +fi + +SENT_EMAIL=$? + +# ----------------------------------------------------------------------------- +# send the EVENT to Dynatrace +send_dynatrace "${host}" "${chart}" "${name}" "${status}" +SENT_DYNATRACE=$? + +# ----------------------------------------------------------------------------- +# send the EVENT to Stackpulse +send_stackpulse +SENT_STACKPULSE=$?
+ +# ----------------------------------------------------------------------------- +# send messages to Opsgenie +send_opsgenie +SENT_OPSGENIE=$? + +# ----------------------------------------------------------------------------- +# send messages to Gotify +send_gotify +SENT_GOTIFY=$? + +# ----------------------------------------------------------------------------- +# let netdata know +for state in "${SENT_EMAIL}" \ + "${SENT_PUSHOVER}" \ + "${SENT_TELEGRAM}" \ + "${SENT_SLACK}" \ + "${SENT_HANGOUTS}" \ + "${SENT_ROCKETCHAT}" \ + "${SENT_ALERTA}" \ + "${SENT_FLOCK}" \ + "${SENT_DISCORD}" \ + "${SENT_TWILIO}" \ + "${SENT_HIPCHAT}" \ + "${SENT_MESSAGEBIRD}" \ + "${SENT_KAVENEGAR}" \ + "${SENT_PUSHBULLET}" \ + "${SENT_KAFKA}" \ + "${SENT_PD}" \ + "${SENT_FLEEP}" \ + "${SENT_PROWL}" \ + "${SENT_CUSTOM}" \ + "${SENT_IRC}" \ + "${SENT_AWSSNS}" \ + "${SENT_MATRIX}" \ + "${SENT_SYSLOG}" \ + "${SENT_SMS}" \ + "${SENT_MSTEAMS}" \ + "${SENT_DYNATRACE}" \ + "${SENT_STACKPULSE}" \ + "${SENT_OPSGENIE}" \ + "${SENT_GOTIFY}"; do + if [ "${state}" -eq 0 ]; then + # we sent something + exit 0 + fi +done +# we did not send anything +exit 1 diff --git a/health/notifications/alarm-test.sh b/health/notifications/alarm-test.sh new file mode 100755 index 0000000..828aa75 --- /dev/null +++ b/health/notifications/alarm-test.sh @@ -0,0 +1,12 @@ +#!/usr/bin/env bash + +# netdata +# real-time performance and health monitoring, done right! +# (C) 2017 Costa Tsaousis <costa@tsaousis.gr> +# SPDX-License-Identifier: GPL-3.0-or-later +# +# Script to test alarm notifications for netdata + +dir="$(dirname "${0}")" +"${dir}/alarm-notify.sh" test "${1}" +exit $? 
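The closing loop of `alarm-notify.sh` above exits with status 0 as soon as any notification method reports success, and with 1 otherwise. As a minimal sketch of that "any succeeded" pattern (the `any_sent` helper and the sample status values are hypothetical and are not part of the script), it could be factored out like this:

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of alarm-notify.sh: returns 0 if at least one
# of the given exit codes is 0, mirroring the loop over the SENT_* states.
any_sent() {
    local state
    for state in "$@"; do
        if [ "${state}" -eq 0 ]; then
            # at least one method sent something
            return 0
        fi
    done
    # no method reported success
    return 1
}

# Made-up statuses for illustration: first method failed, second succeeded.
if any_sent 1 0 1; then
    echo "at least one notification was sent"
else
    echo "nothing was sent"
fi
```

Under this rule a single working method is enough for the script as a whole to report success, so the exit status alone probably cannot tell you which individual method failed.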
diff --git a/health/notifications/alerta/Makefile.inc b/health/notifications/alerta/Makefile.inc new file mode 100644 index 0000000..10f26b0 --- /dev/null +++ b/health/notifications/alerta/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + alerta/README.md \ + alerta/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/alerta/README.md b/health/notifications/alerta/README.md new file mode 100644 index 0000000..9603aae --- /dev/null +++ b/health/notifications/alerta/README.md @@ -0,0 +1,81 @@ +<!-- +title: "alerta.io" +description: "Send alarm notifications to Alerta to see the latest health status updates from multiple nodes in a single interface." +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/alerta/README.md +--> + +# alerta.io + +The [Alerta](https://alerta.io) monitoring system is a tool used to +consolidate and de-duplicate alerts from multiple sources for quick +"at-a-glance" visualisation. With just one system you can monitor +alerts from many other monitoring tools on a single screen. + +![Alerta dashboard](https://docs.alerta.io/_images/alerta-screen-shot-3.png "Alerta dashboard showing several alerts.") + +Alerta's advantage is the main view, where you can see all active alarms with the most recent state. You can also view an alert history. You can send Netdata alerts to Alerta to see alerts coming from many Netdata hosts or also from a multi-host +Netdata configuration. + +## Deploying Alerta + +The recommended setup is using a dedicated server, VM or container. If you have other NGINX or Apache servers in your organization, +it is recommended to proxy to this new server.
+ +You can install Alerta in several ways: +- **Docker**: Alerta provides a [Docker image](https://hub.docker.com/r/alerta/alerta-web/) to get you started quickly. +- **Deployment on Ubuntu server**: Alerta's [getting started tutorial](https://docs.alerta.io/gettingstarted/tutorial-1-deploy-alerta.html) walks you through this process. +- **Advanced deployment scenarios**: More ways to install and deploy Alerta are documented on the [Alerta docs](http://docs.alerta.io/en/latest/deployment.html). + +## Sending alerts to Alerta + +### Step 1. Create an API key (if authentication in Alerta is enabled) + +You will need an API key to send messages from any source, if +Alerta is configured to use authentication (recommended). + +Create a new API key in Alerta: +1. Go to *Configuration* > *API Keys* +2. Create a new API key called "netdata" with `write:alerts` permission. + +### Step 2. Configure Netdata to send alerts to Alerta +1. Edit the `health_alarm_notify.conf` by running: +```sh +/etc/netdata/edit-config health_alarm_notify.conf +``` + +2. Modify the file as below: +``` +# enable/disable sending alerta notifications +SEND_ALERTA="YES" + +# here set your alerta server API url +# this is the API url you defined when you installed the Alerta server, +# it is the same for all users. Do not include a trailing slash. +ALERTA_WEBHOOK_URL="http://yourserver/alerta/api" + +# Log in with an administrative user to your Alerta server and create an API KEY +# with write permissions. 
+ALERTA_API_KEY="INSERT_YOUR_API_KEY_HERE" + +# you can define environments in /etc/alertad.conf option ALLOWED_ENVIRONMENTS +# standard environments are Production and Development +# if a role's recipients are not configured, a notification will be sent to +# this Environment (empty = do not send a notification for unconfigured roles): +DEFAULT_RECIPIENT_ALERTA="Production" +``` + +## Test alarms + +We can test alarms using the standard approach: + +```sh +/opt/netdata/netdata-plugins/plugins.d/alarm-notify.sh test +``` + +> **Note** This script will send 3 alarms. +> Alerta will not show the alerts in the main page, because the last alarm is "CLEAR". +> To see the test alarms, you need to select "closed" alarms in the top-right lookup. + +For more information see the [Alerta documentation](https://docs.alerta.io) + + diff --git a/health/notifications/awssns/Makefile.inc b/health/notifications/awssns/Makefile.inc new file mode 100644 index 0000000..ee86f4b --- /dev/null +++ b/health/notifications/awssns/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + awssns/README.md \ + awssns/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/awssns/README.md b/health/notifications/awssns/README.md new file mode 100644 index 0000000..fc4a665 --- /dev/null +++ b/health/notifications/awssns/README.md @@ -0,0 +1,53 @@ +<!-- +title: "Amazon SNS" +description: "hello" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/awssns/README.md +--> + +# Amazon SNS + +As part of its AWS suite, Amazon provides a notification broker service called 'Simple Notification Service' (SNS). Amazon SNS works similarly to Netdata's own notification system, allowing you to dispatch a single notification to multiple subscribers of different types. 
While Amazon SNS supports sending differently formatted messages for different delivery methods, Netdata does not currently support this functionality. +Among other things, SNS supports sending notifications to: + +- Email addresses. +- Mobile Phones via SMS. +- HTTP or HTTPS web hooks. +- AWS Lambda functions. +- AWS SQS queues. +- Mobile applications via push notifications. + +For email notification support, we recommend using Netdata's email notifications, as it has the following benefits: + +- In most cases, it requires less configuration. +- Netdata's emails are nicely pre-formatted and support features like threading, which requires a lot of manual effort in SNS. +- It is less resource intensive and more cost-efficient than SNS. + +Read on to learn how to set up Amazon SNS in Netdata. + +## Prerequisites + +Before you can enable SNS, you need: + +- The [Amazon Web Services CLI tools](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) (`awscli`). +- An actual home directory for the user you run Netdata as, instead of just using `/` as a home directory. The setup depends on the distribution, but `/var/lib/netdata` is the recommended directory. If you are using Netdata as a dedicated user, the permissions will already be correct. +- An Amazon SNS topic to send notifications to with one or more subscribers. The [Getting Started](https://docs.aws.amazon.com/sns/latest/dg/sns-getting-started.html) section of the Amazon SNS documentation covers the basics of how to set this up. Make note of the **Topic ARN** when you create the topic. +- While not mandatory, it is highly recommended to create a dedicated IAM user on your account for Netdata to send notifications. This user needs to have programmatic access, and should only allow access to SNS. For an additional layer of security, you can create one for each system or group of systems. + +## Enabling Amazon SNS + +To enable SNS: +1. 
Run the following command as the user Netdata runs under: + ``` + aws configure + ``` +2. Enter the access key and secret key for accessing Amazon SNS. The system also prompts you to enter the default region and output format, but you can leave those blank because Netdata doesn't use them. + +3. Specify the desired topic ARN as a recipient, see [SNS documentation](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/US_SetupSNS.html#set-up-sns-topic-cli). +4. Optional: To change the notification format for SNS notifications, change the `AWSSNS_MESSAGE_FORMAT` variable in `health_alarm_notify.conf`. +This variable supports all the same variables you can use in custom notifications. + + The default format looks like this: + ```bash + AWSSNS_MESSAGE_FORMAT="${status} on ${host} at ${date}: ${chart} ${value_string}" + ``` + diff --git a/health/notifications/custom/Makefile.inc b/health/notifications/custom/Makefile.inc new file mode 100644 index 0000000..c64ebda --- /dev/null +++ b/health/notifications/custom/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + custom/README.md \ + custom/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/custom/README.md b/health/notifications/custom/README.md new file mode 100644 index 0000000..edc4262 --- /dev/null +++ b/health/notifications/custom/README.md @@ -0,0 +1,92 @@ +<!-- +title: "Custom" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/custom/README.md +--> + +# Custom + +Netdata allows you to send custom notifications to any endpoint you choose. + +To configure custom notifications, you will need to customize `health_alarm_notify.conf`. 
Open the file for editing +using [`edit-config`](/docs/configure/nodes.md#use-edit-config-to-edit-configuration-files) from the [Netdata config +directory](/docs/configure/nodes.md#the-netdata-config-directory), which is typically at `/etc/netdata`. + +You can look at the other senders in `/usr/libexec/netdata/plugins.d/alarm-notify.sh` for examples of how to modify the `custom_sender()` function in `health_alarm_notify.conf`. + +As with other notifications, you will also need to define the recipient list in `DEFAULT_RECIPIENT_CUSTOM` and/or the `role_recipients_custom` array. + +The following is a sample `custom_sender` function in `health_alarm_notify.conf`, to send an SMS via an imaginary HTTPS endpoint to the SMS gateway: + +``` + custom_sender() { + # example human readable SMS + local msg="${host} ${status_message}: ${alarm} ${raised_for}" + + # limit it to 160 characters and encode it for use in a URL + urlencode "${msg:0:160}" >/dev/null; msg="${REPLY}" + + # a space separated list of the recipients to send alarms to + to="${1}" + + for phone in ${to}; do + httpcode=$(docurl -X POST \ + --data-urlencode "From=XXX" \ + --data-urlencode "To=${phone}" \ + --data-urlencode "Body=${msg}" \ + -u "${accountsid}:${accounttoken}" \ + https://domain.website.com/) + + if [ "${httpcode}" = "200" ]; then + info "sent custom notification ${msg} to ${phone}" + sent=$((sent + 1)) + else + error "failed to send custom notification ${msg} to ${phone} with HTTP error code ${httpcode}." 
+ fi + done +} +``` + +Variables available to the custom_sender: + +- `${to_custom}` the list of recipients for the alarm +- `${host}` the host generated this event +- `${url_host}` same as `${host}` but URL encoded +- `${unique_id}` the unique id of this event +- `${alarm_id}` the unique id of the alarm that generated this event +- `${event_id}` the incremental id of the event, for this alarm id +- `${when}` the timestamp this event occurred +- `${name}` the name of the alarm, as given in Netdata health.d entries +- `${url_name}` same as `${name}` but URL encoded +- `${chart}` the name of the chart (type.id) +- `${url_chart}` same as `${chart}` but URL encoded +- `${family}` the family of the chart +- `${url_family}` same as `${family}` but URL encoded +- `${status}` the current status : REMOVED, UNINITIALIZED, UNDEFINED, CLEAR, WARNING, CRITICAL +- `${old_status}` the previous status: REMOVED, UNINITIALIZED, UNDEFINED, CLEAR, WARNING, CRITICAL +- `${value}` the current value of the alarm +- `${old_value}` the previous value of the alarm +- `${src}` the line number and file the alarm has been configured +- `${duration}` the duration in seconds of the previous alarm state +- `${duration_txt}` same as `${duration}` for humans +- `${non_clear_duration}` the total duration in seconds this is/was non-clear +- `${non_clear_duration_txt}` same as `${non_clear_duration}` for humans +- `${units}` the units of the value +- `${info}` a short description of the alarm +- `${value_string}` friendly value (with units) +- `${old_value_string}` friendly old value (with units) +- `${image}` the URL of an image to represent the status of the alarm +- `${color}` a color in #AABBCC format for the alarm +- `${goto_url}` the URL the user can click to see the Netdata dashboard +- `${calc_expression}` the expression evaluated to provide the value for the alarm +- `${calc_param_values}` the value of the variables in the evaluated expression +- `${total_warnings}` the total number of alarms 
in WARNING state on the host +- `${total_critical}` the total number of alarms in CRITICAL state on the host + +The following are more human friendly: + +- `${alarm}` like "name = value units" +- `${status_message}` like "needs attention", "recovered", "is critical" +- `${severity}` like "Escalated to CRITICAL", "Recovered from WARNING" +- `${raised_for}` like "(alarm was raised for 10 minutes)" + + diff --git a/health/notifications/discord/Makefile.inc b/health/notifications/discord/Makefile.inc new file mode 100644 index 0000000..78de723 --- /dev/null +++ b/health/notifications/discord/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + discord/README.md \ + discord/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/discord/README.md b/health/notifications/discord/README.md new file mode 100644 index 0000000..568d03b --- /dev/null +++ b/health/notifications/discord/README.md @@ -0,0 +1,50 @@ +<!-- +title: "Discordapp.com" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/discord/README.md +--> + +# Discordapp.com + +This is what you will get: + +![image](https://cloud.githubusercontent.com/assets/7321975/22215935/b49ede7e-e162-11e6-98d0-ae8541e6b92e.png) + +You need: + +1. The **incoming webhook URL** as given by Discord. Create a webhook by following the official [Discord documentation](https://support.discordapp.com/hc/en-us/articles/228383668-Intro-to-Webhooks). You can use the same on all your Netdata servers (or you can have multiple if you like - your decision). +2. One or more Discord channels to post the messages to. 
+ +Set them in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: + +``` +############################################################################### +# sending discord notifications + +# note: multiple recipients can be given like this: +# "CHANNEL1 CHANNEL2 ..." + +# enable/disable sending discord notifications +SEND_DISCORD="YES" + +# Create a webhook by following the official documentation - +# https://support.discordapp.com/hc/en-us/articles/228383668-Intro-to-Webhooks +DISCORD_WEBHOOK_URL="https://discordapp.com/api/webhooks/XXXXXXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" + +# if a role's recipients are not configured, a notification will be sent to +# this discord channel (empty = do not send a notification for unconfigured +# roles): +DEFAULT_RECIPIENT_DISCORD="alarms" +``` + +You can define multiple channels like this: `alarms systems`. +You can give different channels per **role** using these (in the same file): + +``` +role_recipients_discord[sysadmin]="systems" +role_recipients_discord[dba]="databases systems" +role_recipients_discord[webmaster]="marketing development" +``` + +The keywords `systems`, `databases`, `marketing`, `development` are discordapp.com channels (they should already exist within your discord server). 
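The per-role settings above can be read as a lookup with a fallback: a role with no entry uses `DEFAULT_RECIPIENT_DISCORD`. The sketch below illustrates that reading. `resolve_discord_channels` is a hypothetical helper written for this example, not a function of `alarm-notify.sh`, and the real script's handling (for instance of the `disabled` recipient or the `|critical` modifier) is not reproduced here. It needs bash 4 or later for associative arrays.

```shell
#!/usr/bin/env bash
# Sketch only: mirrors the configuration variables shown in the README.
declare -A role_recipients_discord
DEFAULT_RECIPIENT_DISCORD="alarms"
role_recipients_discord[sysadmin]="systems"
role_recipients_discord[dba]="databases systems"

# Hypothetical helper: print the space-separated channel list for a role,
# falling back to the default recipient when the role has no entry.
resolve_discord_channels() {
    local role="${1}"
    local channels="${role_recipients_discord[${role}]:-}"
    if [ -z "${channels}" ]; then
        channels="${DEFAULT_RECIPIENT_DISCORD}"
    fi
    echo "${channels}"
}

resolve_discord_channels dba        # prints: databases systems
resolve_discord_channels webmaster  # prints: alarms (no entry defined above)
```

Leaving `DEFAULT_RECIPIENT_DISCORD` empty makes the helper print an empty line for unconfigured roles, which corresponds to the "empty = do not send" behaviour described in the config comments.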
+ + diff --git a/health/notifications/dynatrace/Makefile.inc b/health/notifications/dynatrace/Makefile.inc new file mode 100644 index 0000000..a2ae623 --- /dev/null +++ b/health/notifications/dynatrace/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + dynatrace/README.md \ + dynatrace/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/dynatrace/README.md b/health/notifications/dynatrace/README.md new file mode 100644 index 0000000..3f8ad85 --- /dev/null +++ b/health/notifications/dynatrace/README.md @@ -0,0 +1,34 @@ +<!-- +title: "Dynatrace" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/dynatrace/README.md +--> + +# Dynatrace + +Dynatrace allows you to receive notifications using their Events REST API. + +See [the Dynatrace documentation](https://www.dynatrace.com/support/help/extend-dynatrace/dynatrace-api/environment-api/events/post-event/) about POSTing an event in the Events API for more details. + + + +You need: + +1. Dynatrace Server. You can use the same on all your Netdata servers but make sure the server is network visible from your Netdata hosts. +The Dynatrace server should be with protocol prefixed (`http://` or `https://`). For example: `https://monitor.example.com` +This is a required parameter. +2. API Token. Generate a secure access API token that enables access to your Dynatrace monitoring data via the REST-based API. +Generate a Dynatrace API authentication token. On your Dynatrace server, go to **Settings** --> **Integration** --> **Dynatrace API** --> **Generate token**. +See [Dynatrace API - Authentication](https://www.dynatrace.com/support/help/extend-dynatrace/dynatrace-api/basics/dynatrace-api-authentication/) for more details. +This is a required parameter. +3. API Space. 
This is the URL part of the page you have access in order to generate the API Token. For example, the URL + for a generated API token might look like: + `https://monitor.illumineit.com/e/2a93fe0e-4cd5-469a-9d0d-1a064235cfce/#settings/integration/apikeys;gf=all` In that + case, my space is _2a93fe0e-4cd5-469a-9d0d-1a064235cfce_ This is a required parameter. +4. Generate a Server Tag. On your Dynatrace Server, go to **Settings** --> **Tags** --> **Manually applied tags** and create the Tag. +The Netdata alarm is sent as a Dynatrace Event to be correlated with all those hosts tagged with this Tag you have created. +This is a required parameter. +5. Specify the Dynatrace event. This can be one of `CUSTOM_INFO`, `CUSTOM_ANNOTATION`, `CUSTOM_CONFIGURATION`, and `CUSTOM_DEPLOYMENT`. +The default value is `CUSTOM_INFO`. +This is a required parameter. +6. Specify the annotation type. This is the source of the Dynatrace event. Put whatever it fits you, for example, +_Netdata Alarm_, which is also the default value. diff --git a/health/notifications/email/Makefile.inc b/health/notifications/email/Makefile.inc new file mode 100644 index 0000000..95dc7cf --- /dev/null +++ b/health/notifications/email/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + email/README.md \ + email/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/email/README.md b/health/notifications/email/README.md new file mode 100644 index 0000000..3dc84dd --- /dev/null +++ b/health/notifications/email/README.md @@ -0,0 +1,77 @@ +<!-- +title: "Email" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/email/README.md +--> + +# Email + +You need a working `sendmail` command for email alerts to work. Almost all MTAs provide a `sendmail` interface. 
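Before enabling email alerts, it can help to confirm which `sendmail`-compatible binary would be picked up. The sketch below assumes a hypothetical helper, `locate_command`, that prefers an explicitly configured path and otherwise searches `$PATH`; the exact lookup rules inside `alarm-notify.sh` may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Netdata: print the path of a command,
# preferring an explicitly configured executable over a $PATH search.
# Returns non-zero when nothing usable is found.
locate_command() {
    local configured="${1}"
    local name="${2}"
    if [ -n "${configured}" ] && [ -x "${configured}" ]; then
        echo "${configured}"
        return 0
    fi
    command -v "${name}"
}

# An empty first argument means "no path configured, search $PATH".
if path="$(locate_command "" sendmail)"; then
    echo "email notifications would use: ${path}"
else
    echo "no sendmail found; email notifications would be disabled"
fi
```

Run it as the `netdata` user, since that is the account the emails are sent from and its `$PATH` may differ from yours.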
+ +Netdata sends all emails as user `netdata`, so make sure your `sendmail` works for local users. + +Email notifications look like this: + +![image](https://user-images.githubusercontent.com/1905463/133216974-a2ca0e4f-787b-4dce-b1b2-9996a8c5f718.png) + +## Configuration + +To edit `health_alarm_notify.conf` on your system run `/etc/netdata/edit-config health_alarm_notify.conf`. + +You can configure recipients in [`/etc/netdata/health_alarm_notify.conf`](https://github.com/netdata/netdata/blob/99d44b7d0c4e006b11318a28ba4a7e7d3f9b3bae/conf.d/health_alarm_notify.conf#L101). + +You can also configure per role recipients [in the same file, a few lines below](https://github.com/netdata/netdata/blob/99d44b7d0c4e006b11318a28ba4a7e7d3f9b3bae/conf.d/health_alarm_notify.conf#L313). + +Changes to this file do not require a Netdata restart. + +You can test your configuration by issuing the commands: + +```sh +# become user netdata +sudo su -s /bin/bash netdata + +# send a test alarm +/usr/libexec/netdata/plugins.d/alarm-notify.sh test [ROLE] +``` + +Where `[ROLE]` is the role you want to test. The default (if you don't give a `[ROLE]`) is `sysadmin`. + +Note that in versions before 1.16, the plugins.d directory may be installed in a different location in certain OSs (e.g. under `/usr/lib/netdata`). +You can always find the location of the alarm-notify.sh script in `netdata.conf`. + +## Filtering + +Every notification email (both the plain text and the rich HTML versions) from the Netdata agent contains a set of custom email headers that can be used for filtering using an email client. 
Example: + +``` +X-Netdata-Severity: warning +X-Netdata-Alert-Name: inbound_packets_dropped_ratio +X-Netdata-Chart: net_packets.enp2s0 +X-Netdata-Family: enp2s0 +X-Netdata-Classification: System +X-Netdata-Host: winterland +X-Netdata-Role: sysadmin +``` + +## Simple SMTP transport configuration + +If you want an alternative to `sendmail` in order to have a simple MTA configuration for sending emails and auth to an existing SMTP server, you can do the following: + +- Install `msmtp`. +- Modify the `sendmail` path in `health_alarm_notify.conf` to point to the location of `msmtp`: +``` +# The full path to the sendmail command. +# If empty, the system $PATH will be searched for it. +# If not found, email notifications will be disabled (silently). +sendmail="/usr/bin/msmtp" +``` +- Login as netdata : +```sh +(sudo) su -s /bin/bash netdata +``` +- Configure `~/.msmtprc` as shown [in the documentation](https://marlam.de/msmtp/documentation/). +- Finally set the appropriate permissions on the `.msmtprc` file : +```sh +chmod 600 ~/.msmtprc +``` + + diff --git a/health/notifications/flock/Makefile.inc b/health/notifications/flock/Makefile.inc new file mode 100644 index 0000000..5bde161 --- /dev/null +++ b/health/notifications/flock/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + flock/README.md \ + flock/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/flock/README.md b/health/notifications/flock/README.md new file mode 100644 index 0000000..b9e0025 --- /dev/null +++ b/health/notifications/flock/README.md @@ -0,0 +1,37 @@ +<!-- +title: "Flock" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/flock/README.md +--> + +# Flock + +This is what you will get: + +![Flock](https://i.imgur.com/ok9bRzw.png) + +You 
need: + +The **incoming webhook URL** as given by flock.com. +You can use the same on all your Netdata servers (or you can have multiple if you like - your decision). + +Get them here: <https://admin.flock.com/webhooks> + +Set them in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: + +``` +############################################################################### +# sending flock notifications + +# enable/disable sending flock notifications +SEND_FLOCK="YES" + +# Login to flock.com and create an incoming webhook. +# You need only one for all your Netdata servers. +# Without it, Netdata cannot send flock notifications. +FLOCK_WEBHOOK_URL="https://api.flock.com/hooks/sendMessage/XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" + +# if a role recipient is not configured, no notification will be sent +DEFAULT_RECIPIENT_FLOCK="alarms" +``` + + diff --git a/health/notifications/gotify/Makefile.inc b/health/notifications/gotify/Makefile.inc new file mode 100644 index 0000000..7825591 --- /dev/null +++ b/health/notifications/gotify/Makefile.inc @@ -0,0 +1,11 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + gotify/README.md \ + gotify/Makefile.inc \ + $(NULL) diff --git a/health/notifications/gotify/README.md b/health/notifications/gotify/README.md new file mode 100644 index 0000000..c253c84 --- /dev/null +++ b/health/notifications/gotify/README.md @@ -0,0 +1,62 @@ +<!-- +title: "Send notifications to Gotify" +description: "Send alerts to your Gotify instance when an alert gets triggered in Netdata." 
+sidebar_label: "Gotify" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/gotify/README.md +--> + +# Send notifications to Gotify + +[Gotify](https://gotify.net/) is a self-hosted push notification service created for sending and receiving messages in real time. + +## Configuring Gotify + +### Prerequisites + +To use Gotify as your notification service, you need an application token. +You can generate a new token in the Gotify Web UI. + +### Configuration + +To set up Gotify in Netdata: + +1. Switch to your [config +directory](/docs/configure/nodes.md) and edit the file `health_alarm_notify.conf` using the edit config script. + + ```bash + ./edit-config health_alarm_notify.conf + ``` + +2. Change the variable `GOTIFY_APP_TOKEN` to the application token you generated in the Gotify Web UI. Change +`GOTIFY_APP_URL` to point to your Gotify instance. + + ```conf + SEND_GOTIFY="YES" + + # Application token + # Gotify instance url + GOTIFY_APP_TOKEN=XXXXXXXXXXXXXXX + GOTIFY_APP_URL=https://push.example.de/ + ``` + + Changes to `health_alarm_notify.conf` do not require a Netdata restart. + +3. Test your Gotify notifications configuration by running the following commands, replacing `ROLE` with your preferred role: + + ```sh + # become user netdata + sudo su -s /bin/bash netdata + + # send a test alarm + /usr/libexec/netdata/plugins.d/alarm-notify.sh test ROLE + ``` + + 🟢 If everything works, you'll see alarms in Gotify: + + ![Example alarm notifications in Gotify](https://user-images.githubusercontent.com/103264516/162509205-1e88e5d9-96b6-4f7f-9426-182776158128.png) + + 🔴 If sending the test notifications fails, check `/var/log/netdata/error.log` to find the relevant error message: + + ```log + 2020-09-03 23:07:00: alarm-notify.sh: ERROR: failed to send Gotify notification for: hades test.chart.test_alarm is CRITICAL, with HTTP error code 401. 
+ ``` diff --git a/health/notifications/hangouts/Makefile.inc b/health/notifications/hangouts/Makefile.inc new file mode 100644 index 0000000..6ff1dff --- /dev/null +++ b/health/notifications/hangouts/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + hangouts/README.md \ + hangouts/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/hangouts/README.md b/health/notifications/hangouts/README.md new file mode 100644 index 0000000..7554b39 --- /dev/null +++ b/health/notifications/hangouts/README.md @@ -0,0 +1,55 @@ +<!-- +title: "Send notifications to Google Hangouts" +description: "Send alerts to Send notifications to Google Hangouts any time an anomaly or performance issue strikes a node in your infrastructure." +sidebar_label: "Google Hangouts" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/hangouts/README.md +--> + +# Send notifications to Google Hangouts + +[Google Hangouts](https://hangouts.google.com/) is a cross-platform messaging app developed by Google. You can configure +Netdata to send alarm notifications to a Hangouts room in order to stay aware of possible health or performance issues +on your nodes. Here's an example of the notification in action: + +![Netdata on Hangouts](https://user-images.githubusercontent.com/1153921/66427166-47de6900-e9c8-11e9-8322-b4b03f084dc1.png) + +To receive notifications in Google Hangouts, you need the following in your Hangouts setup: + +1. One or more rooms. +2. An **incoming webhook** for each room. + +Follow [Google's documentation](https://developers.google.com/hangouts/chat/how-tos/webhooks) to create an incoming +webhook for each room you want to send Netdata notifications to. + +Set the webhook URIs and room names in `health_alarm_notify.conf`. 
To edit it on your system, run +`/etc/netdata/edit-config health_alarm_notify.conf`. + +## Threads (optional) + +Instead of receiving alarms on different threads, Netdata allows you to concentrate them inside a single thread when you +set the variable `HANGOUTS_WEBHOOK_THREAD[NAME]`. + +``` +#------------------------------------------------------------------------------ +# hangouts (google hangouts chat) global notification options +# enable/disable sending hangouts notifications +SEND_HANGOUTS="YES" +# On Hangouts, in the room you choose, create an incoming webhook, +# copy the link and paste it below and also identify the room name. +# Without it, netdata cannot send hangouts notifications to that room. +# HANGOUTS_WEBHOOK_URI[ROOM_NAME]="URLforroom1" +HANGOUTS_WEBHOOK_URI[systems]="https://chat.googleapis.com/v1/spaces/AAAAXXXXXXX/..." +HANGOUTS_WEBHOOK_URI[development]="https://chat.googleapis.com/v1/spaces/AAAAYYYYY/..." +# On Hangouts, copy a thread link and change the values for space and thread +# HANGOUTS_WEBHOOK_THREAD[systems]="spaces/AAAAXXXXXXX/threads/XXXXXXXXXXX" +# if DEFAULT_RECIPIENT_HANGOUTS is not configured, +# notifications will not be sent to hangouts rooms. +# DEFAULT_RECIPIENT_HANGOUTS="systems development|critical" +DEFAULT_RECIPIENT_HANGOUTS="sysadmin devops alarms|critical" +``` + +You can define multiple rooms like this: `sysadmin devops alarms|critical`. + +The keywords `sysadmin`, `devops`, and `alarms` are Hangouts rooms. + + diff --git a/health/notifications/health_alarm_notify.conf b/health/notifications/health_alarm_notify.conf new file mode 100755 index 0000000..52de866 --- /dev/null +++ b/health/notifications/health_alarm_notify.conf @@ -0,0 +1,1286 @@ +# Configuration for alarm notifications +# +# This configuration is used by: alarm-notify.sh +# changes take effect immediately (the next alarm will use them). 
+# +# alarm-notify.sh can send: +# - e-mails (using the sendmail command), +# - push notifications to your mobile phone (pushover.net), +# - messages to your slack team (slack.com), +# - messages to your alerta server (alerta.io), +# - messages to your flock team (flock.com), +# - messages to your discord guild (discordapp.com), +# - messages to your telegram chat / group chat (telegram.org) +# - sms messages to your cell phone or any sms enabled device (twilio.com) +# - sms messages to your cell phone or any sms enabled device (messagebird.com) +# - sms messages to your cell phone or any sms enabled device (smstools3) +# - notifications to users on pagerduty.com +# - push notifications to iOS devices (via prowlapp.com) +# - notifications to Amazon SNS topics (aws.amazon.com) +# - messages to your irc channel on your selected network +# - messages to a local or remote syslog daemon +# - message to Microsoft Teams (through webhook) +# - message to Rocket.Chat (through webhook) +# - message to Google Hangouts Chat (through webhook) +# +# The 'to' line given at netdata alarms defines a *role*, so that many +# people can be notified for each role. +# +# This file is a BASH script itself. +# +# +#------------------------------------------------------------------------------ +# proxy configuration +# +# If you need to send curl based notifications (pushover, pushbullet, slack, alerta, +# flock, discord, telegram) via a proxy, set these to your proxy address: +#export http_proxy="http://10.0.0.1:3128/" +#export https_proxy="http://10.0.0.1:3128/" + + +#------------------------------------------------------------------------------ +# notifications images +# +# Images in notifications need to be downloaded from an Internet facing site. +# To allow notification providers fetch the icons/images, by default we set +# the URL of the global public netdata registry. 
+# If you have an Internet facing netdata (or you have copied the images/ folder +# of netdata to your web server), set its URL here, to fetch the notification +# images from it. +#images_base_url="http://my.public.netdata.server:19999" + + +#------------------------------------------------------------------------------ +# date handling +# +# You can configure netdata alerts to send dates in any format you want. +# This uses standard `date` command format strings. See `man date` for +# more info on what you can put in here. Note that this has to start with a '+', otherwise it won't work. +# +# For ISO 8601 dates, use '+%FT%T%z' +# For RFC 5322 dates, use '+%a, %d %b %Y %H:%M:%S %z' +# For RFC 3339 dates, use '+%F %T%:z' +# For RFC 1123 dates, use '+%a, %d %b %Y %H:%M:%S %Z' +# For RFC 1036 dates, use '+%A, %d-%b-%y %H:%M:%S %Z' +# For a reasonably local date and time (in that order), use '+%x %X' +# For the old default behavior (compatible with ANSI C's asctime() function), leave this empty. +date_format='' + + +#------------------------------------------------------------------------------ +# hostname handling +# +# By default, Netdata will use the simple hostname for the system (the +# hostname with everything after the first `.` removed) when displaying +# the hostname in alert notifications. If you prefer, you can uncomment +# the line below to have Netdata instead use the host's fully qualified +# domain name. +# +# This does not report correct FQDN's for child systems for which this +# system is a parent. +# +# Additionally, if the system host name is overridden in /etc/netdata.conf +# with the `hostname` option, that name will be used unconditionally +# instead of this. +#use_fqdn='YES' + + +#------------------------------------------------------------------------------ +# external commands + +# The full path to the sendmail command. +# If empty, the system $PATH will be searched for it. +# If not found, email notifications will be disabled (silently). 
+sendmail="" + +# The full path of the curl command. +# If empty, the system $PATH will be searched for it. +# If not found, most notifications will be silently disabled. +curl="" + +# The full path of the nc command. +# If empty, the system $PATH will be searched for it. +# If not found, irc notifications will be silently disabled. +nc="" + +# The full path of the logger command. +# If empty, the system $PATH will be searched for it. +# If not found, syslog notifications will be silently disabled. +logger="" + +# The full path of the aws command. +# If empty, the system $PATH will be searched for it. +# If not found, Amazon SNS notifications will be silently disabled. +aws="" + +# The full path of the sendsms command (smstools3). +# If empty, the system $PATH will be searched for it. +# If not found, SMS notifications will be silently disabled. +sendsms="" + +#------------------------------------------------------------------------------ +# extra options for external commands +# +# In some cases, you may need to change what options get passed to an +# external command. Such cases are covered here. + +# Extra options to pass to curl. In most cases, you shouldn't need to add anything +# to this. If you're having issues with HTTPS connections, you might try adding +# '--insecure' here, but be warned that it will make it much easier for +# third-parties to block notification delivery, and may allow disclosure +# of potentially sensitive information. +#curl_options="--insecure" + +# Extra options to pass to logger. You shouldn't have to specify anything +# here in most cases. +#logger_options="" + +#------------------------------------------------------------------------------ +# extra options + +# By default don't do anything if this is CLEAR, but it was not WARNING or CRITICAL. +# You can send it always if your system makes deduplication for alarms. 
+#clear_alarm_always='YES' + +# +#------------------------------------------------------------------------------ +# NOTE ABOUT RECIPIENTS +# +# When you define recipients (all types): +# +# - emails addresses +# - pushover user tokens +# - telegram chat ids +# - slack channels +# - alerta environment +# - flock rooms +# - discord channels +# - hipchat rooms +# - sms phone numbers +# - pagerduty.com (pd) services +# - irc channels +# +# You can append |critical to limit the notifications to be sent. +# +# In these examples, the first recipient receives all the alarms +# while the second one receives only notifications for alarms that +# have at some point become critical. The second user may still receive +# warning and clear notifications, but only for the event that previously +# caused a critical alarm. +# +# email : "user1@example.com user2@example.com|critical" +# pushover : "2987343...9437837 8756278...2362736|critical" +# telegram : "111827421 112746832|critical" +# slack : "alarms disasters|critical" +# alerta : "alarms disasters|critical" +# flock : "alarms disasters|critical" +# discord : "alarms disasters|critical" +# twilio : "+15555555555 +17777777777|critical" +# messagebird: "+15555555555 +17777777777|critical" +# kavenegar : "09155555555 09177777777|critical" +# pd : "<pd_service_key_1> <pd_service_key_2>|critical" +# irc : "<irc_channel_1> <irc_channel_2>|critical" +# hangouts : "alarms disasters|critical" +# +# If a recipient is set to empty string, the default recipient of the given +# notification method (email, pushover, telegram, slack, alerta, etc) will be used. +# To disable a notification, use the recipient called: disabled +# This works for all notification methods (including the default recipients). + + +#------------------------------------------------------------------------------ +# email global notification options + +# multiple recipients can be given like this: +# "admin1@example.com admin2@example.com ..." 
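The recipient syntax described in the note above (a space separated list, with an optional `|critical` suffix per entry) can be illustrated with a minimal sketch. The helper name `describe_recipients` is hypothetical and is not part of alarm-notify.sh; it only shows how such a list splits into an address and a filter, not how netdata actually parses it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (NOT part of alarm-notify.sh): split a recipient list such as
# "user1@example.com user2@example.com|critical" into "address -> filter" pairs.
describe_recipients() {
    local list="${1}" entry address filter
    # Recipients are separated by spaces, so rely on word splitting here.
    for entry in ${list}; do
        address="${entry%%|*}"      # text before the first '|'
        if [ "${entry}" = "${address}" ]; then
            filter="all"            # no modifier: receives every alarm
        else
            filter="${entry#*|}"    # text after the first '|', e.g. "critical"
        fi
        printf '%s -> %s\n' "${address}" "${filter}"
    done
}

describe_recipients "user1@example.com user2@example.com|critical"
```

Under these assumptions, the first address is reported with the filter `all` and the second with `critical`, matching the behaviour the note describes for the two example recipients.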
+ +# the email address sending email notifications +# the default is the system user netdata runs as (usually: netdata) +# The following formats are supported: +# EMAIL_SENDER="user@domain" +# EMAIL_SENDER="User Name <user@domain>" +# EMAIL_SENDER="'User Name' <user@domain>" +# EMAIL_SENDER="\"User Name\" <user@domain>" +EMAIL_SENDER="" + +# enable/disable sending emails +SEND_EMAIL="YES" + +# if a role recipient is not configured, an email will be sent to: +DEFAULT_RECIPIENT_EMAIL="root" +# to receive only critical alarms, set it to "root|critical" + +# Optionally specify the encoding to list in the Content-Type header. +# This doesn't change what encoding the e-mail is sent with, just what +# the headers say it was encoded as. +# This shouldn't need to be changed as it will almost always be +# autodetected from the environment. +#EMAIL_CHARSET="UTF-8" + +# You can also have netdata add headers to the message that will +# cause most e-mail clients to treat all notifications for a given +# chart+alarm+host combination as a single thread. This can help +# simplify tracking of alarms, as it provides an easy way for scripts +# to correlate messages and also will cause most clients to group all the +# messages together. This is enabled by default; uncomment the line +# below if you want to disable it. +#EMAIL_THREADING="NO" + +# By default, netdata sends HTML and Plain Text emails. Some clients, +# such as command line clients, do not parse HTML emails. +# To make emails readable in these clients, you can configure netdata +# to not send HTML but Plain Text only emails. 
+#EMAIL_PLAINTEXT_ONLY="YES" + +#------------------------------------------------------------------------------ +# Dynatrace global notification options +#------------------------------------------------------------------------------ +# enable/disable sending Dynatrace notifications +SEND_DYNATRACE="YES" + +# The Dynatrace server with protocol prefix (http:// or https://), example https://monitor.illumineit.com +# Required +DYNATRACE_SERVER="" + +# Generate a Dynatrace API authentication token +# Read https://www.dynatrace.com/support/help/extend-dynatrace/dynatrace-api/basics/dynatrace-api-authentication/ +# On Dynatrace server goto Settings --> Integration --> Dynatrace API --> Generate token +# Required +DYNATRACE_TOKEN="" + +# Beware: Space is taken from dynatrace URL from browser when you create the TOKEN +# Required +DYNATRACE_SPACE="" + +# Generate a Server Tag. On the Dynatrace Server go to Settings --> Tags --> Manually applied tags create the Tag +# The Netdata alarm will be sent as a Dynatrace Event to be correlated with all those hosts tagged with this Tag +# you created. 
+# Required +DYNATRACE_TAG_VALUE="" + +# Change this to what you want +DYNATRACE_ANNOTATION_TYPE="Netdata Alarm" + +# This can be CUSTOM_INFO, CUSTOM_ANNOTATION, CUSTOM_CONFIGURATION, CUSTOM_DEPLOYMENT +# Applying default value +# Required +DYNATRACE_EVENT="CUSTOM_INFO" + + +DEFAULT_RECIPIENT_DYNATRACE="" + +#------------------------------------------------------------------------------ +# Stackpulse global notification options +SEND_STACKPULSE="YES" + +# Webhook +STACKPULSE_WEBHOOK="" + +DEFAULT_RECIPIENT_STACKPULSE="" + +#------------------------------------------------------------------------------ +# gotify global notification options +SEND_GOTIFY="YES" + +# App token and URL +GOTIFY_APP_TOKEN="" +GOTIFY_APP_URL="" + +DEFAULT_RECIPIENT_GOTIFY="" + +#------------------------------------------------------------------------------ +# opsgenie global notification options +SEND_OPSGENIE="YES" + +# API key +OPSGENIE_API_KEY="" +OPSGENIE_API_URL="" + +DEFAULT_RECIPIENT_OPSGENIE="" + +#------------------------------------------------------------------------------ +# hangouts (google hangouts chat) global notification options + +# enable/disable sending hangouts notifications +SEND_HANGOUTS="YES" + +# On Hangouts, in the room you choose, create an incoming webhook, +# copy the link and paste it below and also give it a room name. +# Without it, netdata cannot send hangouts notifications to that room. +# You will then use the same room name in your recipients list. For each URI, you need +# HANGOUTS_WEBHOOK_URI[room_name]="WEBHOOK_URI" +# e.g. to define systems and development rooms/recipients: +# HANGOUTS_WEBHOOK_URI[systems]="URLforroom1" +# HANGOUTS_WEBHOOK_URI[development]="URLforroom2" + +# if a DEFAULT_RECIPIENT_HANGOUTS is not configured, +# notifications won't be sent to hangouts rooms. 
For the example above, +# a valid recipients list is the following +# DEFAULT_RECIPIENT_HANGOUTS="systems development|critical" +DEFAULT_RECIPIENT_HANGOUTS="" + +#------------------------------------------------------------------------------ +# pushover (pushover.net) global notification options + +# multiple recipients can be given like this: +# "USERTOKEN1 USERTOKEN2 ..." + +# enable/disable sending pushover notifications +SEND_PUSHOVER="YES" + +# Login to pushover.net to get your pushover app token. +# You need only one for all your netdata servers (or you can have one for +# each of your netdata - your call). +# Without an app token, netdata cannot send pushover notifications. +PUSHOVER_APP_TOKEN="" + +# if a role's recipients are not configured, a notification will be sent to +# this pushover user token (empty = do not send a notification for unconfigured +# roles): +DEFAULT_RECIPIENT_PUSHOVER="" + + +#------------------------------------------------------------------------------ +# pushbullet (pushbullet.com) push notification options + +# multiple recipients can be given like this: +# "user1@email.com user2@mail.com" + +# enable/disable sending pushbullet notifications +SEND_PUSHBULLET="YES" + +# Signup and Login to pushbullet.com +# To get your Access Token, go to https://www.pushbullet.com/#settings/account +# Create a new access token and paste it below. +# Then just set the recipients' emails. +# Please note that if the email in the DEFAULT_RECIPIENT_PUSHBULLET does +# not have a pushbullet account, the pushbullet service will send an email +# to that address instead. + +# Without an access token, netdata cannot send pushbullet notifications. +PUSHBULLET_ACCESS_TOKEN="" +DEFAULT_RECIPIENT_PUSHBULLET="" + +# Device iden of the sending device. Optional. 
+PUSHBULLET_SOURCE_DEVICE="" + + +#------------------------------------------------------------------------------ +# Twilio (twilio.com) SMS options + +# multiple recipients can be given like this: +# "+15555555555 +17777777777" + +# enable/disable sending twilio SMS +SEND_TWILIO="YES" + +# Signup for free trial and select a SMS capable Twilio Number +# To get your Account SID and Token, go to https://www.twilio.com/console +# Place your sid, token and number below. +# Then just set the recipients' phone numbers. +# The trial account is only allowed to use the number specified when set up. + +# Without an account sid and token, netdata cannot send Twilio text messages. +TWILIO_ACCOUNT_SID="" +TWILIO_ACCOUNT_TOKEN="" +TWILIO_NUMBER="" +DEFAULT_RECIPIENT_TWILIO="" + + +#------------------------------------------------------------------------------ +# Messagebird (messagebird.com) SMS options + +# multiple recipients can be given like this: +# "+15555555555 +17777777777" + +# enable/disable sending messagebird SMS +SEND_MESSAGEBIRD="YES" + +# to get an access key, create a free account at https://www.messagebird.com +# verify and activate the account (no CC info needed) +# login to your account and enter your phonenumber to get some free credits +# to get the API key, click on 'API' in the sidebar, then 'API Access (REST)' +# click 'Add access key' and fill in data (you want a live key to send SMS) + +# Without an access key, netdata cannot send Messagebird text messages. 
+MESSAGEBIRD_ACCESS_KEY="" +MESSAGEBIRD_NUMBER="" +DEFAULT_RECIPIENT_MESSAGEBIRD="" + + +#------------------------------------------------------------------------------ +# Kavenegar (Kavenegar.com) SMS options + +# multiple recipients can be given like this: +# "09155555555 09177777777" + +# enable/disable sending kavenegar SMS +SEND_KAVENEGAR="YES" + +# to get an access key, after selecting and purchasing your desired service +# at http://kavenegar.com/pricing.html +# log in to your account, go to your dashboard and open the account settings at +# https://panel.kavenegar.com/Client/setting/account, then copy your API key +# from the API Key field. You can also generate a new API key there. +# You can find and select the kavenegar sender number in the same place. + +# Without an API key, netdata cannot send KAVENEGAR text messages. +KAVENEGAR_API_KEY="" +KAVENEGAR_SENDER="" +DEFAULT_RECIPIENT_KAVENEGAR="" + + +#------------------------------------------------------------------------------ +# telegram (telegram.org) global notification options + +# multiple recipients can be given like this: +# "CHAT_ID_1 CHAT_ID_2 ..." + +# enable/disable sending telegram messages +SEND_TELEGRAM="YES" + +# Contact the bot @BotFather to create a new bot and receive a bot token. +# Without it, netdata cannot send telegram messages. +TELEGRAM_BOT_TOKEN="" + +# If an API limit error is returned on sending a message, Netdata will retry this number of times before giving up. +# Setting the number to 0 makes Netdata do no retries (which is the default). +# See https://core.telegram.org/bots/faq#my-bot-is-hitting-limits-how-do-i-avoid-this +TELEGRAM_RETRIES_ON_LIMIT="0" + +# To get your chat ID send the command /getid to telegram bot @myidbot +# (https://t.me/myidbot). Each user also needs to open a conversation with the +# bot that will be sending notifications. 
+# If a role's recipients are not configured, a message will be sent to +# this chat id (empty = do not send a notification for unconfigured roles): +DEFAULT_RECIPIENT_TELEGRAM="" + + +#------------------------------------------------------------------------------ +# slack (slack.com) global notification options + +# multiple recipients can be given like this: +# "RECIPIENT1 RECIPIENT2 ..." + +# enable/disable sending slack notifications +SEND_SLACK="YES" + +# Login to your slack.com workspace and create an incoming webhook, using the "Incoming Webhooks" App: https://slack.com/apps/A0F7XDUAZ-incoming-webhooks +# Do not use the instructions in https://api.slack.com/incoming-webhooks#enable_webhooks, as those webhooks work only for a single channel. +# You need only one for all your netdata servers (or you can have one for each of your netdata). +# Without the app and a webhook, netdata cannot send slack notifications. +SLACK_WEBHOOK_URL="" + +# if a role's recipients are not configured, a notification will be sent to: +# - A slack channel (syntax: '#channel' or 'channel') +# - A slack user (syntax: '@user') +# - The channel or user defined in slack for the webhook (syntax: '#') +# empty = do not send a notification for unconfigured roles +DEFAULT_RECIPIENT_SLACK="" + +#------------------------------------------------------------------------------ +# Microsoft Teams (office.com) global notification options +# More details are available here regarding the payload syntax options: +# https://docs.microsoft.com/en-us/outlook/actionable-messages/message-card-reference +# Online designer : https://adaptivecards.io/designer/ +# multiple recipients can be given like this: +# "CHANNEL1 CHANNEL2 ..." + +# enable/disable sending teams notifications +SEND_MSTEAMS="YES" + +# In Microsoft Teams the channel name is encoded in the URI after +# .../IncomingWebhook/... +# You have to replace the encoded channel name by the placeholder `CHANNEL` +# in `MSTEAMS_WEBHOOK_URL`. 
The placeholder `CHANNEL` will be replaced by the +# actual encoded channel name before sending the notification. +MSTEAMS_WEBHOOK_URL="" + +# if a role's recipients are not configured, a notification will be sent to +# this Teams channel (empty = do not send a notification for unconfigured +# roles): +# Put the different encoded channel names here like : "CHANNEL1 CHANNEL2 ..." +# AT LEAST ONE CHANNEL IS MANDATORY +DEFAULT_RECIPIENT_MSTEAMS="" + +# Define the default color scheme for alert to MS Teams - icon and color +# Icons - go to https://emojipedia.org/bomb/ +MSTEAMS_ICON_DEFAULT="⚡" +MSTEAMS_ICON_CLEAR="💚" +MSTEAMS_ICON_WARNING="⚠️" +MSTEAMS_ICON_CRITICAL="🔥" + +# Colors +MSTEAMS_COLOR_DEFAULT="0076D7" +MSTEAMS_COLOR_CLEAR="65A677" +MSTEAMS_COLOR_WARNING="FFA500" +MSTEAMS_COLOR_CRITICAL="D93F3C" + + +#------------------------------------------------------------------------------ +# rocketchat (rocket.chat) global notification options + +# multiple recipients can be given like this: +# "CHANNEL1 CHANNEL2 ..." + +# enable/disable sending rocketchat notifications +SEND_ROCKETCHAT="YES" + +# Login to rocket.chat and create an incoming webhook. You need only one for all +# your netdata servers (or you can have one for each of your netdata). +# Without it, netdata cannot send rocketchat notifications. +ROCKETCHAT_WEBHOOK_URL="" + +# if a role's recipients are not configured, a notification will be sent to +# this rocketchat channel (empty = do not send a notification for unconfigured +# roles): +DEFAULT_RECIPIENT_ROCKETCHAT="" + + +#------------------------------------------------------------------------------ +# alerta (alerta.io) global notification options + +# multiple recipients (Environments) can be given like this: +# "Production Development ..." + +# enable/disable sending alerta notifications +SEND_ALERTA="YES" + +# here set your alerta server API url +# this is the API url you defined when you installed the Alerta server, +# it is the same for all users. 
Do not include a trailing slash. +# ALERTA_WEBHOOK_URL="https://<server>/alerta/api" +ALERTA_WEBHOOK_URL="" + +# Log in to your Alerta server with an administrative user and create an API KEY +# with write permissions. +ALERTA_API_KEY="" + +# you can define environments in /etc/alertad.conf option ALLOWED_ENVIRONMENTS +# standard environments are Production and Development +# if a role's recipients are not configured, a notification will be sent to +# this Environment (empty = do not send a notification for unconfigured roles): +DEFAULT_RECIPIENT_ALERTA="" + + +#------------------------------------------------------------------------------ +# flock (flock.com) global notification options + +# enable/disable sending flock notifications +SEND_FLOCK="YES" + +# Login to flock.com and create an incoming webhook. You need only one for all +# your netdata servers (or you can have one for each of your netdata). +# Without it, netdata cannot send flock notifications. +FLOCK_WEBHOOK_URL="" + +# if a role recipient is not configured, no notification will be sent +DEFAULT_RECIPIENT_FLOCK="" + + +#------------------------------------------------------------------------------ +# discord (discordapp.com) global notification options + +# multiple recipients can be given like this: +# "CHANNEL1 CHANNEL2 ..." + +# enable/disable sending discord notifications +SEND_DISCORD="YES" + +# Create a webhook by following the official documentation - +# https://support.discordapp.com/hc/en-us/articles/228383668-Intro-to-Webhooks +DISCORD_WEBHOOK_URL="" + +# if a role's recipients are not configured, a notification will be sent to +# this discord channel (empty = do not send a notification for unconfigured +# roles): +DEFAULT_RECIPIENT_DISCORD="" + + +#------------------------------------------------------------------------------ +# hipchat global notification options + +# multiple recipients can be given like this: +# "ROOM1 ROOM2 ..." 
+ +# enable/disable sending hipchat notifications +SEND_HIPCHAT="YES" + +# define hipchat server +HIPCHAT_SERVER="api.hipchat.com" + +# api.hipchat.com authorization token +# Without this, netdata cannot send hipchat notifications. +HIPCHAT_AUTH_TOKEN="" + +# if a role's recipients are not configured, a notification will be sent to +# this hipchat room (empty = do not send a notification for unconfigured +# roles): +DEFAULT_RECIPIENT_HIPCHAT="" + + +#------------------------------------------------------------------------------ +# kafka notification options + +# enable/disable sending kafka notifications +SEND_KAFKA="YES" + +# The URL to POST kafka alarm data to. It should be the full URL. +KAFKA_URL="" + +# The IP to be used in the kafka message as the sender. +KAFKA_SENDER_IP="" + + +#------------------------------------------------------------------------------ +# pagerduty.com notification options +# +# pagerduty.com notifications require a "Generic API" (Events v1) +# pagerduty service. +# https://support.pagerduty.com/docs/services-and-integrations + +# multiple recipients can be given like this: +# "<pd_service_key_1> <pd_service_key_2> ..." + +# enable/disable sending pagerduty notifications +SEND_PD="YES" + +# if a role's recipients are not configured, a notification will be sent to +# the "Generic API" pagerduty.com service that uses this service key. +# (empty = do not send a notification for unconfigured roles): +DEFAULT_RECIPIENT_PD="" + +# Which PD API are we going to use? For version 2 or newer, it is necessary to make a request to PagerDuty +# before setting the version (https://developer.pagerduty.com/docs/events-api-v2/overview/). +USE_PD_VERSION="1" + +#------------------------------------------------------------------------------ +# fleep notification options +# +# To send fleep.io notifications, you will need a webhook for the +# conversation you want to send to. + +# Fleep recipients are specified as the last part of the webhook URL. 
+# So, for a webhook URL of: https://fleep.io/hook/IJONmBuuSlWlkb_ttqyXJg, the +# recipient name would be: 'IJONmBuuSlWlkb_ttqyXJg'. + +# enable/disable sending fleep notifications +SEND_FLEEP="YES" + +# if a role's recipients are not configured, a notification will not be sent. +# (empty = do not send a notification for unconfigured roles): +DEFAULT_RECIPIENT_FLEEP="" + +# The user name to label the messages with. If this is unset, +# the hostname of the system the notification is for will be used. +FLEEP_SENDER="" + + +#------------------------------------------------------------------------------ +# irc notification options +# +# irc notifications require only the nc utility to be installed. + +# multiple recipients can be given like this: +# "<irc_channel_1> <irc_channel_2> ..." + +# enable/disable sending irc notifications +SEND_IRC="YES" + +# if a role's recipients are not configured, a notification will not be sent. +# (empty = do not send a notification for unconfigured roles): +DEFAULT_RECIPIENT_IRC="" + +# The irc network to which the recipients belong. It must be the full network. +# e.g. "irc.freenode.net" +IRC_NETWORK="" + +# The irc port to which a connection will occur. +# e.g. 6667 (the default one), 6697 (a TLS/SSL one) +IRC_PORT=6667 + +# The irc nickname which is required to send the notification. It must not be +# an already registered name as the connection's MODE is defined as a 'guest'. +IRC_NICKNAME="" + +# The irc realname which is required in order to make the connection and is an +# extra identifier. +IRC_REALNAME="" + + +#------------------------------------------------------------------------------ +# syslog notifications +# +# syslog notifications only need you to have a working logger command, which +# should be the case on pretty much any Linux system. + +# enable/disable sending syslog notifications +# NOTE: make sure you have everything else configured the way you want +# it _before_ turning this on. 
+SEND_SYSLOG="NO" + +# A note on log levels and facilities: +# +# The traditional UNIX syslog mechanism has the concept of both log +# levels and facilities. A log level indicates the relative severity of +# the message, while a facility specifies a generic source for the message +# (for example, the `mail` facility is where sendmail and postfix log +# their messages). All major syslog daemons have the ability to filter +# messages based on both log level and facility, and can often also make +# routing decisions for messages based on both factors. +# +# On Linux, the eight log levels in decreasing order of severity are: +# emerg, alert, crit, err, warning, notice, info, debug +# +# By default, warnings will be logged at the warning level, critical +# alerts at the crit level, and clear notifications at the info level. +# +# And the 19 facilities you can log to are: +# auth, authpriv, cron, daemon, ftp, lpr, mail, news, syslog, user, +# uucp, local0, local1, local2, local3, local4, local5, local6, and local7 +# +# By default, netdata alerts will be logged to the local6 facility. +# +# Depending on your distribution, this means that either all your +# netdata alerts will by default end up in the main system log (usually +# /var/log/messages), or they won't be logged to a file at all. +# Neither of these is likely to be what you actually want, but any +# configuration to change that needs to happen in the syslog daemon +# configuration, not here. + +# This controls which facility is used by default for logging. Defaults +# to local6. +SYSLOG_FACILITY='' + +# If a role's recipients are not configured, use the following. +# (empty = do not send a notification for unconfigured roles) +# +# The recipient format for syslog uses the following format: +# [[facility.level][@host[:port]]/]prefix +# +# `prefix` gets prepended to all log messages generated for +# that recipient. The prefix is mandatory. 
+# 'host' and 'port' can be used to specify a remote syslog server to +# send messages to. Leave these out if you want messages to be delivered +# locally. 'host' can be either a hostname or an IP address. +# IPv6 addresses must have square brackets around them. +# 'facility' and 'level' are used to override the default logging facility +# set above and the log level. If one is specified, both must be present. +# +# For example, to send messages with a 'netdata' prefix to a syslog +# daemon listening on port 514 on 'loghost' using the daemon facility and +# notice log level: +# DEFAULT_RECIPIENT_SYSLOG='daemon.notice@loghost:514/netdata' +# +DEFAULT_RECIPIENT_SYSLOG="netdata" + +#------------------------------------------------------------------------------ +# iOS Push Notifications + +# enable/disable sending iOS push notifications +SEND_PROWL="YES" + +# If a role's recipients are not configured, use the following, +# (empty = do not send a notification for unconfigured roles) +# +# Recipients for iOS push notifications are Prowl API keys. +# +# A recipient may also consist of multiple Prowl API keys separated by +# commas, in which case notifications will be simultaneously sent for all +# of those API keys. +DEFAULT_RECIPIENT_PROWL="" + +#------------------------------------------------------------------------------ +# Amazon SNS notifications +# +# This method requires potentially complex manual configuration. See the +# netdata wiki for information on what is needed. + +# enable/disable sending Amazon SNS notifications +SEND_AWSSNS="YES" + +# Specify a template for the Amazon SNS notifications. This supports +# the same set of variables that are usable in the `custom_sender()` +# function in the custom notification configuration below. +# +AWSSNS_MESSAGE_FORMAT="${status} on ${host} at ${date}: ${chart} ${value_string}" + +# If a role's recipients are not configured, use the following. 
+# (empty = do not send a notification for unconfigured roles) +# +# Recipients for AWS SNS notifications are specified as topic ARN's. +# +DEFAULT_RECIPIENT_AWSSNS="" + +#------------------------------------------------------------------------------ +# SMS Server Tools 3 (smstools3) global notification options + +# enable/disable sending SMS Server Tools 3 SMS notifications +SEND_SMS="YES" + +# if a role's recipients are not configured, a notification will be sent to +# this SMS channel (empty = do not send a notification for unconfigured +# roles). Multiple recipients can be given like this: "PHONE1 PHONE2 ..." + +DEFAULT_RECIPIENT_SMS="" + +# Matrix notifications +# + +# enable/disable Matrix notifications +SEND_MATRIX="YES" + +# The url of the Matrix homeserver +# e.g https://matrix.org:8448 +MATRIX_HOMESERVER= + +# An access token from a valid Matrix account. Tokens usually don't expire, +# can be controlled from a Matrix client. +# See https://matrix.org/docs/guides/client-server.html +MATRIX_ACCESSTOKEN= + +# Specify the default rooms to receive the notification if no rooms are provided +# in a role's recipients. +# The format is !roomid:homeservername +DEFAULT_RECIPIENT_MATRIX="" + +#------------------------------------------------------------------------------ +# custom notifications +# + +# enable/disable sending custom notifications +SEND_CUSTOM="YES" + +# if a role's recipients are not configured, use the following. 
+# (empty = do not send a notification for unconfigured roles) +DEFAULT_RECIPIENT_CUSTOM="" + +# The custom_sender() is a custom function to do whatever you need to do +custom_sender() { + # variables you can use: + # ${host} the host generated this event + # ${url_host} same as ${host} but URL encoded + # ${unique_id} the unique id of this event + # ${alarm_id} the unique id of the alarm that generated this event + # ${event_id} the incremental id of the event, for this alarm id + # ${when} the timestamp this event occurred + # ${name} the name of the alarm, as given in netdata health.d entries + # ${url_name} same as ${name} but URL encoded + # ${chart} the name of the chart (type.id) + # ${url_chart} same as ${chart} but URL encoded + # ${family} the family of the chart + # ${url_family} same as ${family} but URL encoded + # ${status} the current status : REMOVED, UNINITIALIZED, UNDEFINED, CLEAR, WARNING, CRITICAL + # ${old_status} the previous status: REMOVED, UNINITIALIZED, UNDEFINED, CLEAR, WARNING, CRITICAL + # ${value} the current value of the alarm + # ${old_value} the previous value of the alarm + # ${src} the line number and file the alarm has been configured + # ${duration} the duration in seconds of the previous alarm state + # ${duration_txt} same as ${duration} for humans + # ${non_clear_duration} the total duration in seconds this is/was non-clear + # ${non_clear_duration_txt} same as ${non_clear_duration} for humans + # ${units} the units of the value + # ${info} a short description of the alarm + # ${value_string} friendly value (with units) + # ${old_value_string} friendly old value (with units) + # ${image} the URL of an image to represent the status of the alarm + # ${color} a color in #AABBCC format for the alarm + # ${goto_url} the URL the user can click to see the netdata dashboard + # ${calc_expression} the expression evaluated to provide the value for the alarm + # ${calc_param_values} the value of the variables in the evaluated expression 
+ # ${total_warnings} the total number of alarms in WARNING state on the host + # ${total_critical} the total number of alarms in CRITICAL state on the host + + # these are more human friendly: + # ${alarm} like "name = value units" + # ${status_message} like "needs attention", "recovered", "is critical" + # ${severity} like "Escalated to CRITICAL", "Recovered from WARNING" + # ${raised_for} like "(alarm was raised for 10 minutes)" + + # example human readable SMS + local msg="${host} ${status_message}: ${alarm} ${raised_for}" + + # limit it to 160 characters and encode it for use in a URL + urlencode "${msg:0:160}" >/dev/null; msg="${REPLY}" + + # a space separated list of the recipients to send alarms to + to="${1}" + + # Sample send SMS to an imaginary SMS gateway accessible via HTTPS + #for phone in ${to}; do + # httpcode=$(docurl -X POST \ + # --data-urlencode "From=XXX" \ + # --data-urlencode "To=${phone}" \ + # --data-urlencode "Body=${msg}" \ + # -u "${accountsid}:${accounttoken}" \ + # https://domain.website.com/) + # + # if [ "${httpcode}" = "200" ]; then + # info "sent custom notification ${msg} to ${phone}" + # sent=$((sent + 1)) + # else + # error "failed to send custom notification ${msg} to ${phone} with HTTP error code ${httpcode}." + # fi + #done + + info "not sending custom notification to ${to}, for ${status} of '${host}.${chart}.${name}' - custom_sender() is not configured." 
+} + + +############################################################################### +# RECIPIENTS PER ROLE + +# ----------------------------------------------------------------------------- +# generic system alarms +# CPU, disks, network interfaces, entropy, etc + +role_recipients_email[sysadmin]="${DEFAULT_RECIPIENT_EMAIL}" + +role_recipients_hangouts[sysadmin]="${DEFAULT_RECIPIENT_HANGOUTS}" + +role_recipients_pushover[sysadmin]="${DEFAULT_RECIPIENT_PUSHOVER}" + +role_recipients_pushbullet[sysadmin]="${DEFAULT_RECIPIENT_PUSHBULLET}" + +role_recipients_telegram[sysadmin]="${DEFAULT_RECIPIENT_TELEGRAM}" + +role_recipients_slack[sysadmin]="${DEFAULT_RECIPIENT_SLACK}" + +role_recipients_alerta[sysadmin]="${DEFAULT_RECIPIENT_ALERTA}" + +role_recipients_flock[sysadmin]="${DEFAULT_RECIPIENT_FLOCK}" + +role_recipients_discord[sysadmin]="${DEFAULT_RECIPIENT_DISCORD}" + +role_recipients_hipchat[sysadmin]="${DEFAULT_RECIPIENT_HIPCHAT}" + +role_recipients_twilio[sysadmin]="${DEFAULT_RECIPIENT_TWILIO}" + +role_recipients_messagebird[sysadmin]="${DEFAULT_RECIPIENT_MESSAGEBIRD}" + +role_recipients_kavenegar[sysadmin]="${DEFAULT_RECIPIENT_KAVENEGAR}" + +role_recipients_pd[sysadmin]="${DEFAULT_RECIPIENT_PD}" + +role_recipients_fleep[sysadmin]="${DEFAULT_RECIPIENT_FLEEP}" + +role_recipients_irc[sysadmin]="${DEFAULT_RECIPIENT_IRC}" + +role_recipients_syslog[sysadmin]="${DEFAULT_RECIPIENT_SYSLOG}" + +role_recipients_prowl[sysadmin]="${DEFAULT_RECIPIENT_PROWL}" + +role_recipients_awssns[sysadmin]="${DEFAULT_RECIPIENT_AWSSNS}" + +role_recipients_custom[sysadmin]="${DEFAULT_RECIPIENT_CUSTOM}" + +role_recipients_msteams[sysadmin]="${DEFAULT_RECIPIENT_MSTEAMS}" + +role_recipients_rocketchat[sysadmin]="${DEFAULT_RECIPIENT_ROCKETCHAT}" + +role_recipients_dynatrace[sysadmin]="${DEFAULT_RECIPIENT_DYNATRACE}" + +role_recipients_opsgenie[sysadmin]="${DEFAULT_RECIPIENT_OPSGENIE}" + +role_recipients_matrix[sysadmin]="${DEFAULT_RECIPIENT_MATRIX}" + 
+role_recipients_stackpulse[sysadmin]="${DEFAULT_RECIPIENT_STACKPULSE}" + +role_recipients_gotify[sysadmin]="${DEFAULT_RECIPIENT_GOTIFY}" + +# ----------------------------------------------------------------------------- +# DNS related alarms + +role_recipients_email[domainadmin]="${DEFAULT_RECIPIENT_EMAIL}" + +role_recipients_hangouts[domainadmin]="${DEFAULT_RECIPIENT_HANGOUTS}" + +role_recipients_pushover[domainadmin]="${DEFAULT_RECIPIENT_PUSHOVER}" + +role_recipients_pushbullet[domainadmin]="${DEFAULT_RECIPIENT_PUSHBULLET}" + +role_recipients_telegram[domainadmin]="${DEFAULT_RECIPIENT_TELEGRAM}" + +role_recipients_slack[domainadmin]="${DEFAULT_RECIPIENT_SLACK}" + +role_recipients_alerta[domainadmin]="${DEFAULT_RECIPIENT_ALERTA}" + +role_recipients_flock[domainadmin]="${DEFAULT_RECIPIENT_FLOCK}" + +role_recipients_discord[domainadmin]="${DEFAULT_RECIPIENT_DISCORD}" + +role_recipients_hipchat[domainadmin]="${DEFAULT_RECIPIENT_HIPCHAT}" + +role_recipients_twilio[domainadmin]="${DEFAULT_RECIPIENT_TWILIO}" + +role_recipients_messagebird[domainadmin]="${DEFAULT_RECIPIENT_MESSAGEBIRD}" + +role_recipients_kavenegar[domainadmin]="${DEFAULT_RECIPIENT_KAVENEGAR}" + +role_recipients_pd[domainadmin]="${DEFAULT_RECIPIENT_PD}" + +role_recipients_fleep[domainadmin]="${DEFAULT_RECIPIENT_FLEEP}" + +role_recipients_irc[domainadmin]="${DEFAULT_RECIPIENT_IRC}" + +role_recipients_syslog[domainadmin]="${DEFAULT_RECIPIENT_SYSLOG}" + +role_recipients_prowl[domainadmin]="${DEFAULT_RECIPIENT_PROWL}" + +role_recipients_awssns[domainadmin]="${DEFAULT_RECIPIENT_AWSSNS}" + +role_recipients_custom[domainadmin]="${DEFAULT_RECIPIENT_CUSTOM}" + +role_recipients_msteams[domainadmin]="${DEFAULT_RECIPIENT_MSTEAMS}" + +role_recipients_rocketchat[domainadmin]="${DEFAULT_RECIPIENT_ROCKETCHAT}" + +role_recipients_sms[domainadmin]="${DEFAULT_RECIPIENT_SMS}" + +role_recipients_dynatrace[domainadmin]="${DEFAULT_RECIPIENT_DYNATRACE}" + +role_recipients_opsgenie[domainadmin]="${DEFAULT_RECIPIENT_OPSGENIE}" + 
+role_recipients_matrix[domainadmin]="${DEFAULT_RECIPIENT_MATRIX}" + +role_recipients_stackpulse[domainadmin]="${DEFAULT_RECIPIENT_STACKPULSE}" + +role_recipients_gotify[domainadmin]="${DEFAULT_RECIPIENT_GOTIFY}" + +# ----------------------------------------------------------------------------- +# database servers alarms +# mysql, redis, memcached, postgres, etc + +role_recipients_email[dba]="${DEFAULT_RECIPIENT_EMAIL}" + +role_recipients_hangouts[dba]="${DEFAULT_RECIPIENT_HANGOUTS}" + +role_recipients_pushover[dba]="${DEFAULT_RECIPIENT_PUSHOVER}" + +role_recipients_pushbullet[dba]="${DEFAULT_RECIPIENT_PUSHBULLET}" + +role_recipients_telegram[dba]="${DEFAULT_RECIPIENT_TELEGRAM}" + +role_recipients_slack[dba]="${DEFAULT_RECIPIENT_SLACK}" + +role_recipients_alerta[dba]="${DEFAULT_RECIPIENT_ALERTA}" + +role_recipients_flock[dba]="${DEFAULT_RECIPIENT_FLOCK}" + +role_recipients_discord[dba]="${DEFAULT_RECIPIENT_DISCORD}" + +role_recipients_hipchat[dba]="${DEFAULT_RECIPIENT_HIPCHAT}" + +role_recipients_twilio[dba]="${DEFAULT_RECIPIENT_TWILIO}" + +role_recipients_messagebird[dba]="${DEFAULT_RECIPIENT_MESSAGEBIRD}" + +role_recipients_kavenegar[dba]="${DEFAULT_RECIPIENT_KAVENEGAR}" + +role_recipients_pd[dba]="${DEFAULT_RECIPIENT_PD}" + +role_recipients_fleep[dba]="${DEFAULT_RECIPIENT_FLEEP}" + +role_recipients_irc[dba]="${DEFAULT_RECIPIENT_IRC}" + +role_recipients_syslog[dba]="${DEFAULT_RECIPIENT_SYSLOG}" + +role_recipients_prowl[dba]="${DEFAULT_RECIPIENT_PROWL}" + +role_recipients_awssns[dba]="${DEFAULT_RECIPIENT_AWSSNS}" + +role_recipients_custom[dba]="${DEFAULT_RECIPIENT_CUSTOM}" + +role_recipients_msteams[dba]="${DEFAULT_RECIPIENT_MSTEAMS}" + +role_recipients_rocketchat[dba]="${DEFAULT_RECIPIENT_ROCKETCHAT}" + +role_recipients_sms[dba]="${DEFAULT_RECIPIENT_SMS}" + +role_recipients_dynatrace[dba]="${DEFAULT_RECIPIENT_DYNATRACE}" + +role_recipients_opsgenie[dba]="${DEFAULT_RECIPIENT_OPSGENIE}" + +role_recipients_matrix[dba]="${DEFAULT_RECIPIENT_MATRIX}" + 
+role_recipients_stackpulse[dba]="${DEFAULT_RECIPIENT_STACKPULSE}" + +role_recipients_gotify[dba]="${DEFAULT_RECIPIENT_GOTIFY}" + +# ----------------------------------------------------------------------------- +# web servers alarms +# apache, nginx, lighttpd, etc + +role_recipients_email[webmaster]="${DEFAULT_RECIPIENT_EMAIL}" + +role_recipients_hangouts[webmaster]="${DEFAULT_RECIPIENT_HANGOUTS}" + +role_recipients_pushover[webmaster]="${DEFAULT_RECIPIENT_PUSHOVER}" + +role_recipients_pushbullet[webmaster]="${DEFAULT_RECIPIENT_PUSHBULLET}" + +role_recipients_telegram[webmaster]="${DEFAULT_RECIPIENT_TELEGRAM}" + +role_recipients_slack[webmaster]="${DEFAULT_RECIPIENT_SLACK}" + +role_recipients_alerta[webmaster]="${DEFAULT_RECIPIENT_ALERTA}" + +role_recipients_flock[webmaster]="${DEFAULT_RECIPIENT_FLOCK}" + +role_recipients_discord[webmaster]="${DEFAULT_RECIPIENT_DISCORD}" + +role_recipients_hipchat[webmaster]="${DEFAULT_RECIPIENT_HIPCHAT}" + +role_recipients_twilio[webmaster]="${DEFAULT_RECIPIENT_TWILIO}" + +role_recipients_messagebird[webmaster]="${DEFAULT_RECIPIENT_MESSAGEBIRD}" + +role_recipients_kavenegar[webmaster]="${DEFAULT_RECIPIENT_KAVENEGAR}" + +role_recipients_pd[webmaster]="${DEFAULT_RECIPIENT_PD}" + +role_recipients_fleep[webmaster]="${DEFAULT_RECIPIENT_FLEEP}" + +role_recipients_irc[webmaster]="${DEFAULT_RECIPIENT_IRC}" + +role_recipients_syslog[webmaster]="${DEFAULT_RECIPIENT_SYSLOG}" + +role_recipients_prowl[webmaster]="${DEFAULT_RECIPIENT_PROWL}" + +role_recipients_awssns[webmaster]="${DEFAULT_RECIPIENT_AWSSNS}" + +role_recipients_custom[webmaster]="${DEFAULT_RECIPIENT_CUSTOM}" + +role_recipients_msteams[webmaster]="${DEFAULT_RECIPIENT_MSTEAMS}" + +role_recipients_rocketchat[webmaster]="${DEFAULT_RECIPIENT_ROCKETCHAT}" + +role_recipients_sms[webmaster]="${DEFAULT_RECIPIENT_SMS}" + +role_recipients_dynatrace[webmaster]="${DEFAULT_RECIPIENT_DYNATRACE}" + +role_recipients_opsgenie[webmaster]="${DEFAULT_RECIPIENT_OPSGENIE}" + 
+role_recipients_matrix[webmaster]="${DEFAULT_RECIPIENT_MATRIX}" + +role_recipients_stackpulse[webmaster]="${DEFAULT_RECIPIENT_STACKPULSE}" + +role_recipients_gotify[webmaster]="${DEFAULT_RECIPIENT_GOTIFY}" + +# ----------------------------------------------------------------------------- +# proxy servers alarms +# squid, etc + +role_recipients_email[proxyadmin]="${DEFAULT_RECIPIENT_EMAIL}" + +role_recipients_hangouts[proxyadmin]="${DEFAULT_RECIPIENT_HANGOUTS}" + +role_recipients_pushover[proxyadmin]="${DEFAULT_RECIPIENT_PUSHOVER}" + +role_recipients_pushbullet[proxyadmin]="${DEFAULT_RECIPIENT_PUSHBULLET}" + +role_recipients_telegram[proxyadmin]="${DEFAULT_RECIPIENT_TELEGRAM}" + +role_recipients_slack[proxyadmin]="${DEFAULT_RECIPIENT_SLACK}" + +role_recipients_alerta[proxyadmin]="${DEFAULT_RECIPIENT_ALERTA}" + +role_recipients_flock[proxyadmin]="${DEFAULT_RECIPIENT_FLOCK}" + +role_recipients_discord[proxyadmin]="${DEFAULT_RECIPIENT_DISCORD}" + +role_recipients_hipchat[proxyadmin]="${DEFAULT_RECIPIENT_HIPCHAT}" + +role_recipients_twilio[proxyadmin]="${DEFAULT_RECIPIENT_TWILIO}" + +role_recipients_messagebird[proxyadmin]="${DEFAULT_RECIPIENT_MESSAGEBIRD}" + +role_recipients_kavenegar[proxyadmin]="${DEFAULT_RECIPIENT_KAVENEGAR}" + +role_recipients_pd[proxyadmin]="${DEFAULT_RECIPIENT_PD}" + +role_recipients_fleep[proxyadmin]="${DEFAULT_RECIPIENT_FLEEP}" + +role_recipients_irc[proxyadmin]="${DEFAULT_RECIPIENT_IRC}" + +role_recipients_syslog[proxyadmin]="${DEFAULT_RECIPIENT_SYSLOG}" + +role_recipients_prowl[proxyadmin]="${DEFAULT_RECIPIENT_PROWL}" + +role_recipients_awssns[proxyadmin]="${DEFAULT_RECIPIENT_AWSSNS}" + +role_recipients_custom[proxyadmin]="${DEFAULT_RECIPIENT_CUSTOM}" + +role_recipients_msteams[proxyadmin]="${DEFAULT_RECIPIENT_MSTEAMS}" + +role_recipients_rocketchat[proxyadmin]="${DEFAULT_RECIPIENT_ROCKETCHAT}" + +role_recipients_sms[proxyadmin]="${DEFAULT_RECIPIENT_SMS}" + +role_recipients_dynatrace[proxyadmin]="${DEFAULT_RECIPIENT_DYNATRACE}" + 
+role_recipients_opsgenie[proxyadmin]="${DEFAULT_RECIPIENT_OPSGENIE}" + +role_recipients_matrix[proxyadmin]="${DEFAULT_RECIPIENT_MATRIX}" + +role_recipients_stackpulse[proxyadmin]="${DEFAULT_RECIPIENT_STACKPULSE}" + +role_recipients_gotify[proxyadmin]="${DEFAULT_RECIPIENT_GOTIFY}" + +# ----------------------------------------------------------------------------- +# peripheral devices +# UPS, photovoltaics, etc + +role_recipients_email[sitemgr]="${DEFAULT_RECIPIENT_EMAIL}" + +role_recipients_hangouts[sitemgr]="${DEFAULT_RECIPIENT_HANGOUTS}" + +role_recipients_pushover[sitemgr]="${DEFAULT_RECIPIENT_PUSHOVER}" + +role_recipients_pushbullet[sitemgr]="${DEFAULT_RECIPIENT_PUSHBULLET}" + +role_recipients_telegram[sitemgr]="${DEFAULT_RECIPIENT_TELEGRAM}" + +role_recipients_slack[sitemgr]="${DEFAULT_RECIPIENT_SLACK}" + +role_recipients_alerta[sitemgr]="${DEFAULT_RECIPIENT_ALERTA}" + +role_recipients_flock[sitemgr]="${DEFAULT_RECIPIENT_FLOCK}" + +role_recipients_discord[sitemgr]="${DEFAULT_RECIPIENT_DISCORD}" + +role_recipients_hipchat[sitemgr]="${DEFAULT_RECIPIENT_HIPCHAT}" + +role_recipients_twilio[sitemgr]="${DEFAULT_RECIPIENT_TWILIO}" + +role_recipients_messagebird[sitemgr]="${DEFAULT_RECIPIENT_MESSAGEBIRD}" + +role_recipients_kavenegar[sitemgr]="${DEFAULT_RECIPIENT_KAVENEGAR}" + +role_recipients_pd[sitemgr]="${DEFAULT_RECIPIENT_PD}" + +role_recipients_fleep[sitemgr]="${DEFAULT_RECIPIENT_FLEEP}" + +role_recipients_syslog[sitemgr]="${DEFAULT_RECIPIENT_SYSLOG}" + +role_recipients_prowl[sitemgr]="${DEFAULT_RECIPIENT_PROWL}" + +role_recipients_awssns[sitemgr]="${DEFAULT_RECIPIENT_AWSSNS}" + +role_recipients_custom[sitemgr]="${DEFAULT_RECIPIENT_CUSTOM}" + +role_recipients_msteams[sitemgr]="${DEFAULT_RECIPIENT_MSTEAMS}" + +role_recipients_rocketchat[sitemgr]="${DEFAULT_RECIPIENT_ROCKETCHAT}" + +role_recipients_sms[sitemgr]="${DEFAULT_RECIPIENT_SMS}" + +role_recipients_dynatrace[sitemgr]="${DEFAULT_RECIPIENT_DYNATRACE}" + 
+role_recipients_opsgenie[sitemgr]="${DEFAULT_RECIPIENT_OPSGENIE}" + +role_recipients_matrix[sitemgr]="${DEFAULT_RECIPIENT_MATRIX}" + +role_recipients_stackpulse[sitemgr]="${DEFAULT_RECIPIENT_STACKPULSE}" + +role_recipients_gotify[sitemgr]="${DEFAULT_RECIPIENT_GOTIFY}" diff --git a/health/notifications/health_email_recipients.conf b/health/notifications/health_email_recipients.conf new file mode 100644 index 0000000..f56c6c6 --- /dev/null +++ b/health/notifications/health_email_recipients.conf @@ -0,0 +1,2 @@ +# OBSOLETE FILE +# REPLACED WITH health_alarm_notify.conf diff --git a/health/notifications/irc/Makefile.inc b/health/notifications/irc/Makefile.inc new file mode 100644 index 0000000..1a68f65 --- /dev/null +++ b/health/notifications/irc/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + irc/README.md \ + irc/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/irc/README.md b/health/notifications/irc/README.md new file mode 100644 index 0000000..21c998d --- /dev/null +++ b/health/notifications/irc/README.md @@ -0,0 +1,78 @@ +<!-- +title: "IRC" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/irc/README.md +--> + +# IRC + +This is what you will get: + +IRCCloud web client:\ +![image](https://user-images.githubusercontent.com/31221999/36793487-3735673e-1ca6-11e8-8880-d1d8b6cd3bc0.png) + +Irssi terminal client: +![image](https://user-images.githubusercontent.com/31221999/36793486-3713ada6-1ca6-11e8-8c12-70d956ad801e.png) + +You need: + +1. The `nc` utility. If you do not set the path, Netdata will search for it in your system `$PATH`. 
+ +Set the path for `nc` in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: + +``` +#------------------------------------------------------------------------------ +# external commands +# +# The full path of the nc command. +# If empty, the system $PATH will be searched for it. +# If not found, irc notifications will be silently disabled. +nc="/usr/bin/nc" +``` + +2. An `IRC_NETWORK` to which your preferred channels belong. +3. One or more channels (`DEFAULT_RECIPIENT_IRC`) to post the messages to. +4. An `IRC_NICKNAME` and an `IRC_REALNAME` to identify yourself on IRC. + +Set them in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: + +``` +#------------------------------------------------------------------------------ +# irc notification options +# +# irc notifications require only the nc utility to be installed. + +# multiple recipients can be given like this: +# "<irc_channel_1> <irc_channel_2> ..." + +# enable/disable sending irc notifications +SEND_IRC="YES" + +# if a role's recipients are not configured, a notification will not be sent. +# (empty = do not send a notification for unconfigured roles): +DEFAULT_RECIPIENT_IRC="#system-alarms" + +# The irc network to which the recipients belong. It must be the full network. +IRC_NETWORK="irc.freenode.net" + +# The irc nickname which is required to send the notification. It must not be +# an already registered name as the connection's MODE is defined as a 'guest'. +IRC_NICKNAME="netdata-alarm-user" + +# The irc realname which is required in order to make the connection and is an +# extra identifier. 
+IRC_REALNAME="netdata-user" +``` + +You can define multiple channels like this: `#system-alarms #networking-alarms`.\ +You can also filter the notifications like this: `#system-alarms|critical`.\ +You can give different channels per **role** using these (in the same file): + +``` +role_recipients_irc[sysadmin]="#user-alarms #networking-alarms #system-alarms" +role_recipients_irc[dba]="#databases-alarms" +role_recipients_irc[webmaster]="#networking-alarms" +``` + +The keywords `#user-alarms`, `#networking-alarms`, `#system-alarms`, `#databases-alarms` are IRC channels which belong to the specified IRC network. + + diff --git a/health/notifications/kavenegar/Makefile.inc b/health/notifications/kavenegar/Makefile.inc new file mode 100644 index 0000000..b98e794 --- /dev/null +++ b/health/notifications/kavenegar/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + kavenegar/README.md \ + kavenegar/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/kavenegar/README.md b/health/notifications/kavenegar/README.md new file mode 100644 index 0000000..6123eb9 --- /dev/null +++ b/health/notifications/kavenegar/README.md @@ -0,0 +1,46 @@ +<!-- +title: "Kavenegar" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/kavenegar/README.md +--> + +# Kavenegar + +[Kavenegar](https://kavenegar.com/) is a service for software developers, based in Iran, that provides SMS sending and receiving and voice calling through its APIs. + +The notifications will look like this on your Android device: + +![image](https://cloud.githubusercontent.com/assets/17090999/20034652/620b6100-a39b-11e6-96af-4f83b8e830e2.png) + +You will need: + +1. Sign up and log in to kavenegar.com +2. Get your API key and sender number from `http://panel.kavenegar.com/client/setting/account` +3. 
Fill in KAVENEGAR_API_KEY="" KAVENEGAR_SENDER="" +4. Add the recipient phone numbers to DEFAULT_RECIPIENT_KAVENEGAR="" + +Set them in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: + +``` +############################################################################### +# Kavenegar (kavenegar.com) SMS options + +# multiple recipients can be given like this: +# "09155555555 09177777777" + +# enable/disable sending kavenegar SMS +SEND_KAVENEGAR="YES" + +# to get an access key, after selecting and purchasing your desired service +# at http://kavenegar.com/pricing.html +# log in to your account, go to your dashboard and open the account area at +# https://panel.kavenegar.com/Client/setting/account. From the API Key field, +# copy your API key. You can also generate a new API key. +# You can find and select your Kavenegar sender number on the same page. + +# Without an API key, Netdata cannot send KAVENEGAR text messages. +KAVENEGAR_API_KEY="" +KAVENEGAR_SENDER="" +DEFAULT_RECIPIENT_KAVENEGAR="" +``` + + diff --git a/health/notifications/matrix/Makefile.inc b/health/notifications/matrix/Makefile.inc new file mode 100644 index 0000000..9937d80 --- /dev/null +++ b/health/notifications/matrix/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + matrix/README.md \ + matrix/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/matrix/README.md b/health/notifications/matrix/README.md new file mode 100644 index 0000000..8eeecf5 --- /dev/null +++ b/health/notifications/matrix/README.md @@ -0,0 +1,58 @@ +<!-- +title: "Send Netdata notifications to Matrix network rooms" +description: "Stay aware of warning or critical anomalies by sending health alarms to Matrix network rooms with Netdata's health 
monitoring watchdog." +sidebar_label: "Matrix" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/matrix/README.md +--> + +# Matrix + +Send notifications to [Matrix](https://matrix.org/) network rooms. + +The requirements for this notification method are: + +1. The URL of the homeserver (`https://homeserver:port`). +2. Credentials for connecting to the homeserver, in the form of a valid access token for your account (or for a + dedicated notification account). These tokens usually don't expire. +3. The room ids that you want to send the notification to. + +To obtain the access token, you can use the following `curl` command: + +```bash +curl -XPOST -d '{"type":"m.login.password", "user":"example", "password":"wordpass"}' "https://homeserver:8448/_matrix/client/r0/login" +``` + +The room ids are unique identifiers and can be obtained from the room settings in a Matrix client (e.g. Riot). Their +format is `!uniqueid:homeserver`. + +Multiple room ids can be defined by separating them with a space character. + +Detailed information about the Matrix client API is available at the [official +site](https://matrix.org/docs/guides/client-server.html). + +Your `health_alarm_notify.conf` should look like this: + +```conf +############################################################################### +# Matrix notifications +# + +# enable/disable Matrix notifications +SEND_MATRIX="YES" + +# The URL of the Matrix homeserver +# e.g. https://matrix.org:8448 +MATRIX_HOMESERVER="https://matrix.org:8448" + +# An access token from a valid Matrix account. Tokens usually don't expire +# and can be managed from a Matrix client. +# See https://matrix.org/docs/guides/client-server.html +MATRIX_ACCESSTOKEN="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" + +# Specify the default rooms to receive the notification if no rooms are provided +# in a role's recipients. 
+# The format is !roomid:homeservername +DEFAULT_RECIPIENT_MATRIX="!XXXXXXXXXXXX:matrix.org" +``` + + diff --git a/health/notifications/messagebird/Makefile.inc b/health/notifications/messagebird/Makefile.inc new file mode 100644 index 0000000..f8d2332 --- /dev/null +++ b/health/notifications/messagebird/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + messagebird/README.md \ + messagebird/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/messagebird/README.md b/health/notifications/messagebird/README.md new file mode 100644 index 0000000..f70e86c --- /dev/null +++ b/health/notifications/messagebird/README.md @@ -0,0 +1,45 @@ +<!-- +title: "Messagebird" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/messagebird/README.md +--> + +# Messagebird + +The Messagebird notifications will look like this on your Android device: + +![image](https://cloud.githubusercontent.com/assets/17090999/20034652/620b6100-a39b-11e6-96af-4f83b8e830e2.png) + +You will need: + +1. Sign up and log in to messagebird.com +2. Pick an SMS-capable number after signing up to get some free credits +3. Go to <https://www.messagebird.com/app/settings/developers/access> +4. Create a new access key under 'API ACCESS (REST)' (you will want a live key) +5. Fill in MESSAGEBIRD_ACCESS_KEY="XXXXXXXX" MESSAGEBIRD_NUMBER="+XXXXXXXXXXX" +6. 
Add the recipient phone numbers to DEFAULT_RECIPIENT_MESSAGEBIRD="+XXXXXXXXXXX" + +Set them in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: + +``` +#------------------------------------------------------------------------------ +# Messagebird (messagebird.com) SMS options + +# multiple recipients can be given like this: +# "+15555555555 +17777777777" + +# enable/disable sending messagebird SMS +SEND_MESSAGEBIRD="YES" + +# to get an access key, create a free account at https://www.messagebird.com +# verify and activate the account (no CC info needed) +# login to your account and enter your phonenumber to get some free credits +# to get the API key, click on 'API' in the sidebar, then 'API Access (REST)' +# click 'Add access key' and fill in data (you want a live key to send SMS) + +# Without an access key, Netdata cannot send Messagebird text messages. +MESSAGEBIRD_ACCESS_KEY="XXXXXXXX" +MESSAGEBIRD_NUMBER="XXXXXXX" +DEFAULT_RECIPIENT_MESSAGEBIRD="XXXXXXX" +``` + + diff --git a/health/notifications/msteams/Makefile.inc b/health/notifications/msteams/Makefile.inc new file mode 100644 index 0000000..f4c6995 --- /dev/null +++ b/health/notifications/msteams/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + msteams/README.md \ + msteams/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/msteams/README.md b/health/notifications/msteams/README.md new file mode 100644 index 0000000..c9a13ba --- /dev/null +++ b/health/notifications/msteams/README.md @@ -0,0 +1,43 @@ +<!-- +title: "Microsoft Teams" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/msteams/README.md +--> + +# Microsoft Teams + +This is what you will get: 
+![image](https://user-images.githubusercontent.com/1122372/92710359-0385e680-f358-11ea-8c52-f366a4fb57dd.png) + +You need: + +1. The **incoming webhook URL** as given by Microsoft Teams. You can use the same on all your Netdata servers (or you can have multiple if you like - your decision). +2. One or more channels to post the messages to. + +In Microsoft Teams the channel name is encoded in the URI after `/IncomingWebhook/` (for clarity it is marked with `[]` in the following example): `https://outlook.office.com/webhook/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX@XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/IncomingWebhook/[XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX]/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX` + +You have to replace the encoded channel name with the placeholder `CHANNEL` in `MSTEAMS_WEBHOOK_URL`. The placeholder `CHANNEL` will be replaced by the actual encoded channel name before sending the notification. This makes it possible to publish to several channels in the same team. + +The encoded channel name must then be added to `DEFAULT_RECIPIENT_MSTEAMS` or to one of the specific variables `role_recipients_msteams[]`. **At least one channel is mandatory for `DEFAULT_RECIPIENT_MSTEAMS`.** + +Set the webhook and the recipients in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: + +``` +SEND_MSTEAMS="YES" + +MSTEAMS_WEBHOOK_URL="https://outlook.office.com/webhook/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX@XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/IncomingWebhook/CHANNEL/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX" + +DEFAULT_RECIPIENT_MSTEAMS="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" +``` + +You can define multiple recipients by listing the encoded channel names like this: `XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY`. +This example will send the alarm to the two channels specified by their encoded channel names. 
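The `CHANNEL` substitution described above can be pictured with a small shell sketch. It is only an illustration under stated assumptions: `expand_msteams_webhook` is a hypothetical helper (not a function from Netdata's notification script), and the `TENANT`, `CONNECTOR`, `AAAA` and `BBBB` values are made-up stand-ins for the real identifiers in your webhook URL.

```bash
# Sketch only: expand_msteams_webhook is a hypothetical helper, not a Netdata function.
# $1: webhook URL template containing the literal placeholder CHANNEL
# $2: space-separated list of encoded channel names
# The body runs in a subshell "( ... )" so its variables stay local.
expand_msteams_webhook() (
    template="${1}"
    case "${template}" in
        *CHANNEL*) ;;
        *) echo "template has no CHANNEL placeholder" >&2; exit 1 ;;
    esac
    prefix="${template%%CHANNEL*}"   # everything before the first CHANNEL
    suffix="${template#*CHANNEL}"    # everything after the first CHANNEL
    # Unquoted on purpose: split the recipient list on whitespace.
    for channel in ${2}; do
        printf '%s\n' "${prefix}${channel}${suffix}"
    done
)

# Made-up values; prints one URL per recipient.
expand_msteams_webhook \
    "https://outlook.office.com/webhook/TENANT/IncomingWebhook/CHANNEL/CONNECTOR" \
    "AAAA BBBB"
```

Each recipient yields its own URL, which is why a single `MSTEAMS_WEBHOOK_URL` can serve several channels in the same team.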
+ +You can give different recipients per **role** using these (in the same file): + +``` +role_recipients_msteams[sysadmin]="XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" +role_recipients_msteams[dba]="YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY" +role_recipients_msteams[webmaster]="ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ" +``` + + diff --git a/health/notifications/opsgenie/Makefile.inc b/health/notifications/opsgenie/Makefile.inc new file mode 100644 index 0000000..c85bb7c --- /dev/null +++ b/health/notifications/opsgenie/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + opsgenie/README.md \ + opsgenie/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/opsgenie/README.md b/health/notifications/opsgenie/README.md new file mode 100644 index 0000000..640fcd4 --- /dev/null +++ b/health/notifications/opsgenie/README.md @@ -0,0 +1,62 @@ +<!-- +title: "Send notifications to Opsgenie" +description: "Send alerts to your Opsgenie incident response account any time an anomaly or performance issue strikes a node in your infrastructure." +sidebar_label: "Opsgenie" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/opsgenie/README.md +--> + +# Send notifications to Opsgenie + +[Opsgenie](https://www.atlassian.com/software/opsgenie) is an alerting and incident response tool. It is designed to +group and filter alarms, build custom routing rules for on-call teams, and correlate deployments and commits to +incidents. + +The first step is to create a [Netdata integration](https://docs.opsgenie.com/docs/api-integration) in the +[Opsgenie](https://www.atlassian.com/software/opsgenie) dashboard. 
After this, you need to edit +`health_alarm_notify.conf` on your system, by running the following from your [config +directory](/docs/configure/nodes.md): + +```bash +./edit-config health_alarm_notify.conf +``` + +Change the variable `OPSGENIE_API_KEY` with the API key you got from Opsgenie. `OPSGENIE_API_URL` defaults to +`https://api.opsgenie.com`, however there are region-specific API URLs such as `https://eu.api.opsgenie.com`, so set +this if required. + +```conf +SEND_OPSGENIE="YES" + +# Api key +# Default Opsgenie API +OPSGENIE_API_KEY="11111111-2222-3333-4444-555555555555" +OPSGENIE_API_URL="" +``` + +Changes to `health_alarm_notify.conf` do not require a Netdata restart. You can test your Opsgenie notifications +configuration by issuing the commands, replacing `ROLE` with your preferred role: + +```sh +# become user netdata +sudo su -s /bin/bash netdata + +# send a test alarm +/usr/libexec/netdata/plugins.d/alarm-notify.sh test ROLE +``` + +If everything works, you'll see alarms in your Opsgenie platform: + +![Example alarm notifications in +Opsgenie](https://user-images.githubusercontent.com/49162938/92184518-f725f900-ee40-11ea-9afa-e7c639c72206.png) + +If sending the test notifications fails, you can look in `/var/log/netdata/error.log` to find the relevant error +message: + +```log +2020-09-03 23:07:00: alarm-notify.sh: ERROR: failed to send opsgenie notification for: hades test.chart.test_alarm is CRITICAL, with HTTP error code 401. +``` + +You can find more details about the Opsgenie error codes in their [response +docs](https://docs.opsgenie.com/docs/response). 
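When scanning the error log for failed Opsgenie notifications, it can help to pull out just the HTTP status code. The following is a minimal sketch, assuming error lines keep the wording of the sample above; `opsgenie_http_code` is a hypothetical helper name, not part of Netdata.

```bash
# Sketch only: opsgenie_http_code is a hypothetical helper, not a Netdata function.
# Prints the HTTP status code found in one alarm-notify.sh error line,
# or nothing if the line is not an Opsgenie failure.
opsgenie_http_code() {
    printf '%s\n' "${1}" |
        sed -n 's/.*failed to send opsgenie notification.*HTTP error code \([0-9][0-9]*\).*/\1/p'
}

# Sample line copied from the log excerpt above.
line='2020-09-03 23:07:00: alarm-notify.sh: ERROR: failed to send opsgenie notification for: hades test.chart.test_alarm is CRITICAL, with HTTP error code 401.'
opsgenie_http_code "${line}"
```

On a live system the same pattern could be applied to the lines of `/var/log/netdata/error.log` that contain `failed to send opsgenie notification`.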
+ + diff --git a/health/notifications/pagerduty/Makefile.inc b/health/notifications/pagerduty/Makefile.inc new file mode 100644 index 0000000..ee9b091 --- /dev/null +++ b/health/notifications/pagerduty/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + pagerduty/README.md \ + pagerduty/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/pagerduty/README.md b/health/notifications/pagerduty/README.md new file mode 100644 index 0000000..30db637 --- /dev/null +++ b/health/notifications/pagerduty/README.md @@ -0,0 +1,63 @@ +<!-- +title: "Send alert notifications to PagerDuty" +description: "Send alerts to your PagerDuty dashboard any time an anomaly or performance issue strikes a node in your infrastructure." +sidebar_label: "PagerDuty" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/pagerduty/README.md +--> + +# Send alert notifications to PagerDuty + +[PagerDuty](https://www.pagerduty.com/company/) is an enterprise incident resolution service that integrates with ITOps +and DevOps monitoring stacks to improve operational reliability and agility. From enriching and aggregating events to +correlating them into incidents, PagerDuty streamlines the incident management process by reducing alert noise and +resolution times. + +## What you need to get started + +- An installation of the open-source [Netdata](/docs/get-started.mdx) monitoring agent. +- An installation of the [PagerDuty agent](https://www.pagerduty.com/docs/guides/agent-install-guide/) on the node + running Netdata. +- A PagerDuty `Generic API` service using either the `Events API v2` or `Events API v1`. 
+ +## Setup + +[Add a new service](https://support.pagerduty.com/docs/services-and-integrations#section-configuring-services-and-integrations) +to PagerDuty. Click **Use our API directly** and select either `Events API v2` or `Events API v1`. Once you finish +creating the service, click on the **Integrations** tab to find your **Integration Key**. + +Navigate to the [Netdata config directory](/docs/configure/nodes.md#the-netdata-config-directory) and use +[`edit-config`](/docs/configure/nodes.md#use-edit-config-to-edit-configuration-files) to open +`health_alarm_notify.conf`. + +```bash +cd /etc/netdata +sudo ./edit-config health_alarm_notify.conf +``` + +Scroll down to the `# pagerduty.com notification options` section. + +Ensure `SEND_PD` is set to `YES`, then copy your Integration Key into `DEFAULT_RECIPIENT_PD`. Change `USE_PD_VERSION` to +`2` if you chose `Events API v2` during service setup on PagerDuty. Minus comments, the section should look like this: + +```conf +SEND_PD="YES" +DEFAULT_RECIPIENT_PD="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" +USE_PD_VERSION="2" +``` + +## Testing + +To test alert notifications to PagerDuty, run the following: + +```bash +sudo su -s /bin/bash netdata +/usr/libexec/netdata/plugins.d/alarm-notify.sh test +``` + +## Configuration + +Aside from the three values set in `health_alarm_notify.conf`, there is no further configuration required to send alert +notifications to PagerDuty. + +To configure individual alarms, read our [alert configuration](/docs/monitor/configure-alarms.md) doc or +the [health entity reference](/health/REFERENCE.md) doc. 
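Before sending a test alert, it can help to confirm that the PagerDuty values were actually saved. The sketch below is illustrative only: `check_pd_config` is a hypothetical helper, not something Netdata ships, and it assumes the config file consists of plain shell variable assignments that are safe to source.

```shell
# Hypothetical helper (not shipped with Netdata): check that the PagerDuty
# settings described in the README are present in a config file.
# Assumption: the file contains shell variable assignments and can be sourced.
check_pd_config() {
    # Run in a subshell so the sourced variables do not leak into the caller.
    (
        SEND_PD=""
        DEFAULT_RECIPIENT_PD=""
        USE_PD_VERSION=""
        . "$1" || exit 2
        [ "$SEND_PD" = "YES" ] || { echo "SEND_PD is not YES"; exit 1; }
        [ -n "$DEFAULT_RECIPIENT_PD" ] || { echo "DEFAULT_RECIPIENT_PD is empty"; exit 1; }
        echo "ok: USE_PD_VERSION=${USE_PD_VERSION}"
    )
}
```

Called as `check_pd_config /etc/netdata/health_alarm_notify.conf`, it prints the configured API version on success and returns a non-zero status otherwise. Pass an absolute path, since `.` searches `$PATH` for names without a slash.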
diff --git a/health/notifications/prowl/Makefile.inc b/health/notifications/prowl/Makefile.inc new file mode 100644 index 0000000..64a1deb --- /dev/null +++ b/health/notifications/prowl/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + prowl/README.md \ + prowl/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/prowl/README.md b/health/notifications/prowl/README.md new file mode 100644 index 0000000..dc13682 --- /dev/null +++ b/health/notifications/prowl/README.md @@ -0,0 +1,27 @@ +<!-- +title: "Prowl" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/prowl/README.md +--> + +# Prowl + +[Prowl](https://www.prowlapp.com/) is a push notification service for iOS devices. Netdata +supports delivering notifications to iOS devices through Prowl. + +Because of how Netdata integrates with Prowl, there is a hard limit of +at most 1000 notifications per hour (starting from the first notification +sent). Any alerts beyond the first thousand in an hour will be dropped. + +Warning messages will be sent with the 'High' priority, critical messages +will be sent with the 'Emergency' priority, and all other messages will +be sent with the normal priority. Opening the notification's associated +URL will take you to the Netdata dashboard of the system that issued +the alert, directly to the chart that it triggered on. + +## configuration + +To use this, you will need a Prowl API key, which can be requested through +the Prowl website after registering. + +Once you have an API key, simply specify that as a recipient for Prowl +notifications. 
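The priority mapping described above can be sketched as a small shell function. This is an illustration, not Netdata's own code: `prowl_priority` is a hypothetical name, and the numeric values assume Prowl's public API priority scale, where `0` is normal, `1` is High and `2` is Emergency.

```shell
# Hypothetical illustration of the mapping described in the README:
# alarm status -> Prowl priority (assumed scale: 0 normal, 1 High, 2 Emergency).
prowl_priority() {
    case "$1" in
        CRITICAL) echo 2 ;;  # 'Emergency'
        WARNING)  echo 1 ;;  # 'High'
        *)        echo 0 ;;  # everything else: normal priority
    esac
}
```

For example, `prowl_priority WARNING` prints `1`, while a status such as `CLEAR` falls through to the normal priority.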
diff --git a/health/notifications/pushbullet/Makefile.inc b/health/notifications/pushbullet/Makefile.inc new file mode 100644 index 0000000..d3a9459 --- /dev/null +++ b/health/notifications/pushbullet/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + pushbullet/README.md \ + pushbullet/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/pushbullet/README.md b/health/notifications/pushbullet/README.md new file mode 100644 index 0000000..194050b --- /dev/null +++ b/health/notifications/pushbullet/README.md @@ -0,0 +1,50 @@ +<!-- +title: "PushBullet" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/pushbullet/README.md +--> + +# PushBullet + +Notifications will look like this in your browser: +![image](https://cloud.githubusercontent.com/assets/4300670/19109636/278b1c0c-8aee-11e6-8a09-7fc94fdbfec8.png) + +And like this on your Android device: + +![image](https://cloud.githubusercontent.com/assets/4300670/19109635/278a1dde-8aee-11e6-9984-0bc87a13312d.png) + +You will need to: + +1. Sign up and log in to [pushbullet.com](https://www.pushbullet.com/) +2. Create a new access token in your [account settings](https://www.pushbullet.com/#settings/account). +3. Fill in the `PUSHBULLET_ACCESS_TOKEN` with the newly generated access token. +4. Add the recipient emails or channel tags (each channel tag must be prefixed with #, e.g. #channeltag) to `DEFAULT_RECIPIENT_PUSHBULLET`. + > 🚨 The pushbullet notification service will send emails to the email recipient, regardless of whether they have a pushbullet account. 
+ +To add notification channels, run `/etc/netdata/edit-config health_alarm_notify.conf` + +You can change the configuration like this: + +``` +############################################################################### +# pushbullet (pushbullet.com) push notification options + +# multiple recipients (a combination of email addresses or channel tags) can be given like this: +# "user1@email.com user2@mail.com #channel1 #channel2" + +# enable/disable sending pushbullet notifications +SEND_PUSHBULLET="YES" + +# Signup and Login to pushbullet.com +# To get your Access Token, go to https://www.pushbullet.com/#settings/account +# And create a new access token +# Then just set the recipients emails and/or channel tags (channel tags must be prefixed with #) +# Please note that the if an email in the DEFAULT_RECIPIENT_PUSHBULLET does +# not have a pushbullet account, the pushbullet service will send an email +# to that address instead + +# Without an access token, Netdata cannot send pushbullet notifications. 
+PUSHBULLET_ACCESS_TOKEN="o.Sometokenhere" +DEFAULT_RECIPIENT_PUSHBULLET="admin1@example.com admin3@somemail.com #examplechanneltag #anotherchanneltag" +``` + + diff --git a/health/notifications/pushover/Makefile.inc b/health/notifications/pushover/Makefile.inc new file mode 100644 index 0000000..9b703a1 --- /dev/null +++ b/health/notifications/pushover/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + pushover/README.md \ + pushover/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/pushover/README.md b/health/notifications/pushover/README.md new file mode 100644 index 0000000..1e50f71 --- /dev/null +++ b/health/notifications/pushover/README.md @@ -0,0 +1,23 @@ +<!-- +title: "PushOver" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/pushover/README.md +--> + +# PushOver + +pushover.net allows you to receive push notifications on your mobile phone. The service seems free for up to 7,500 messages per month. + +Netdata will send warning messages with priority `0` and critical messages with priority `1`. pushover.net allows you to select do-not-disturb hours. The way this is configured, critical notifications will ring and vibrate your phone, even during the do-not-disturb hours. All other notifications will be delivered silently. + +You need: + +1. APP TOKEN. You can use the same on all your Netdata servers. +2. USER TOKEN for each user you are going to send notifications to. This is the actual recipient of the notification. + +The configuration follows the same pattern as the Slack notification settings in `health_alarm_notify.conf`. 
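As a rough sketch of what that configuration could look like in `health_alarm_notify.conf`: the variable names below follow the `SEND_*` / `DEFAULT_RECIPIENT_*` pattern used by the other notification methods and are stated here as an assumption, so check them against the comments in your own copy of the file. The token values are placeholders.

```shell
# Sketch only: verify the variable names against your health_alarm_notify.conf.
# enable/disable sending pushover notifications
SEND_PUSHOVER="YES"

# the APP TOKEN (placeholder value); the same one can be used on all servers
PUSHOVER_APP_TOKEN="YOUR_APP_TOKEN"

# one USER TOKEN per recipient (placeholder values), separated by spaces
DEFAULT_RECIPIENT_PUSHOVER="USER_TOKEN_1 USER_TOKEN_2"
```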
+ +pushover.net notifications look like this: + +![image](https://cloud.githubusercontent.com/assets/2662304/18407319/839c10c4-7715-11e6-92c0-12f8215128d3.png) + + diff --git a/health/notifications/rocketchat/Makefile.inc b/health/notifications/rocketchat/Makefile.inc new file mode 100644 index 0000000..58f210b --- /dev/null +++ b/health/notifications/rocketchat/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + rocketchat/README.md \ + rocketchat/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/rocketchat/README.md b/health/notifications/rocketchat/README.md new file mode 100644 index 0000000..96d6160 --- /dev/null +++ b/health/notifications/rocketchat/README.md @@ -0,0 +1,52 @@ +<!-- +title: "Rocket.Chat" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/rocketchat/README.md +--> + +# Rocket.Chat + +This is what you will get: +![Netdata on RocketChat](https://i.imgur.com/Zu4t3j3.png) +You need: + +1. The **incoming webhook URL** as given by RocketChat. You can use the same on all your Netdata servers (or you can have multiple if you like - your decision). +2. One or more channels to post the messages to. + +Get them here: <https://rocket.chat/docs/administrator-guides/integrations/index.html#how-to-create-a-new-incoming-webhook> + +Set them in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: + +``` +#------------------------------------------------------------------------------ +# rocketchat (rocket.chat) global notification options + +# multiple recipients can be given like this: +# "CHANNEL1 CHANNEL2 ..." 
+ +# enable/disable sending rocketchat notifications +SEND_ROCKETCHAT="YES" + +# Login to rocket.chat and create an incoming webhook. You need only one for all +# your Netdata servers (or you can have one for each of your Netdata). +# Without it, Netdata cannot send rocketchat notifications. +ROCKETCHAT_WEBHOOK_URL="<your_incoming_webhook_url>" + +# if a role's recipients are not configured, a notification will be send to +# this rocketchat channel (empty = do not send a notification for unconfigured +# roles). +DEFAULT_RECIPIENT_ROCKETCHAT="monitoring_alarms" +``` + +You can define multiple channels like this: `alarms systems`. +You can give different channels per **role** using these (at the same file): + +``` +role_recipients_rocketchat[sysadmin]="systems" +role_recipients_rocketchat[dba]="databases systems" +role_recipients_rocketchat[webmaster]="marketing development" +``` + +The keywords `systems`, `databases`, `marketing`, `development` are RocketChat channels (they should already exist). +Both public and private channels can be used, even if they differ from the channel configured in your RocketChat incoming webhook. 
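The relationship between per-role recipients and the default recipient can be sketched as follows. This is a simplified illustration of the fallback behaviour described above, not the logic Netdata actually runs: `recipients_for_role` is a hypothetical function, and the role and channel names are the ones from the examples.

```shell
# Channel used when a role has no recipients of its own (from the example above).
DEFAULT_RECIPIENT_ROCKETCHAT="monitoring_alarms"

# Hypothetical illustration: print the space-separated channels for a role,
# falling back to the default recipient for roles that are not configured.
recipients_for_role() {
    case "$1" in
        sysadmin)  echo "systems" ;;
        dba)       echo "databases systems" ;;
        webmaster) echo "marketing development" ;;
        *)         echo "$DEFAULT_RECIPIENT_ROCKETCHAT" ;;
    esac
}
```

With these values, `recipients_for_role dba` prints `databases systems`, while a role with no entry of its own resolves to `monitoring_alarms`.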
+ + diff --git a/health/notifications/slack/Makefile.inc b/health/notifications/slack/Makefile.inc new file mode 100644 index 0000000..043bfaf --- /dev/null +++ b/health/notifications/slack/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + slack/README.md \ + slack/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/slack/README.md b/health/notifications/slack/README.md new file mode 100644 index 0000000..ad36ce3 --- /dev/null +++ b/health/notifications/slack/README.md @@ -0,0 +1,50 @@ +<!-- +title: "Slack" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/slack/README.md +--> + +# Slack + +This is what you will get: +![image](https://cloud.githubusercontent.com/assets/2662304/18407116/bbd0fee6-7710-11e6-81cf-58c0defaee2b.png) + +You need: + +1. The **incoming webhook URL** as given by slack.com. You can use the same on all your Netdata servers (or you can have multiple if you like - your decision). +2. One or more channels to post the messages to. + +To get a webhook that works on multiple channels, you will need to login to your slack.com workspace and create an incoming webhook using the [Incoming Webhooks App](https://slack.com/apps/A0F7XDUAZ-incoming-webhooks). +Do NOT use the instructions in <https://api.slack.com/incoming-webhooks#enable_webhooks>, as the particular webhooks work only for a single channel. 
+ +Set the webhook and the recipients in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: + +``` +SEND_SLACK="YES" + +SLACK_WEBHOOK_URL="https://hooks.slack.com/services/XXXXXXXX/XXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" + +# if a role's recipients are not configured, a notification will be send to: +# - A slack channel (syntax: '#channel' or 'channel') +# - A slack user (syntax: '@user') +# - The channel or user defined in slack for the webhook (syntax: '#') +# empty = do not send a notification for unconfigured roles +DEFAULT_RECIPIENT_SLACK="alarms" +``` + +You can define multiple recipients like this: `# #alarms systems @myuser`. +This example will send the alarm to: + +- The recipient defined in slack for the webhook (not known to Netdata) +- The channel 'alarms' +- The channel 'systems' +- The user @myuser + +You can give different recipients per **role** using these (at the same file): + +``` +role_recipients_slack[sysadmin]="systems" +role_recipients_slack[dba]="databases systems" +role_recipients_slack[webmaster]="marketing development" +``` + + diff --git a/health/notifications/smstools3/Makefile.inc b/health/notifications/smstools3/Makefile.inc new file mode 100644 index 0000000..4764b9e --- /dev/null +++ b/health/notifications/smstools3/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + smstools3/README.md \ + smstools3/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/smstools3/README.md b/health/notifications/smstools3/README.md new file mode 100644 index 0000000..6618dfa --- /dev/null +++ b/health/notifications/smstools3/README.md @@ -0,0 +1,44 @@ +<!-- +title: "SMS Server Tools 3" +custom_edit_url: 
https://github.com/netdata/netdata/edit/master/health/notifications/smstools3/README.md +--> + +# SMS Server Tools 3 + +The [SMS Server Tools 3](http://smstools3.kekekasvi.com/) is a SMS Gateway software which can send and receive short messages through GSM modems and mobile phones. + +To have Netdata send notifications via SMS Server Tools 3, you'll first need to [install](http://smstools3.kekekasvi.com/index.php?p=compiling) and [configure](http://smstools3.kekekasvi.com/index.php?p=configure) smsd. + +Ensure that the user `netdata` can execute `sendsms`. Any user executing `sendsms` needs to: + +- Have write permissions to `/tmp` and `/var/spool/sms/outgoing` +- Be a member of group `smsd` + +To ensure that the steps above are successful, just `su netdata` and execute `sendsms phone message`. + +You then just need to configure the recipient phone numbers in `health_alarm_notify.conf`: + +```sh +#------------------------------------------------------------------------------ +# SMS Server Tools 3 (smstools3) global notification options + +# enable/disable sending SMS Server Tools 3 SMS notifications +SEND_SMS="YES" + +# if a role's recipients are not configured, a notification will be sent to +# this SMS channel (empty = do not send a notification for unconfigured +# roles). Multiple recipients can be given like this: "PHONE1 PHONE2 ..." + +DEFAULT_RECIPIENT_SMS="" +``` + +Netdata uses the script `sendsms` that is installed by `smstools3` and just passes a phone number and a message to it. If `sendsms` is not in `$PATH`, you can pass its location in `health_alarm_notify.conf`: + +```sh +# The full path of the sendsms command (smstools3). +# If empty, the system $PATH will be searched for it. +# If not found, SMS notifications will be silently disabled. 
+sendsms="" +``` + + diff --git a/health/notifications/stackpulse/Makefile.inc b/health/notifications/stackpulse/Makefile.inc new file mode 100644 index 0000000..eabcb4b --- /dev/null +++ b/health/notifications/stackpulse/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + stackpulse/README.md \ + stackpulse/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/stackpulse/README.md b/health/notifications/stackpulse/README.md new file mode 100644 index 0000000..c478fd5 --- /dev/null +++ b/health/notifications/stackpulse/README.md @@ -0,0 +1,81 @@ +<!-- +title: "Send notifications to StackPulse" +description: "Send alerts to your StackPulse Netdata integration any time an anomaly or performance issue strikes a node in your infrastructure." +sidebar_label: "StackPulse" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/stackpulse/README.md +--> + +# Send notifications to StackPulse + +[StackPulse](https://stackpulse.com/) is a software-as-a-service platform for site reliability engineering. +It helps SREs, DevOps Engineers and Software Developers reduce toil and alert fatigue while improving reliability of +software services by managing, analyzing and automating incident response activities. 
+ +Sending Netdata alarm notifications to StackPulse allows you to create smart automated response workflows +(StackPulse playbooks) that will help you drive down your MTTD and MTTR by performing any of the following: + +- Enriching the incident with data from multiple sources +- Performing triage actions and analyzing their results +- Orchestrating incident management and notification flows +- Performing automatic and semi-automatic remediation actions +- Analyzing incident data and remediation patterns to improve reliability of your services + +To send notifications, you need to: + +1. Create a Netdata integration in the `StackPulse Administration Portal`, and copy the `Endpoint` URL. + +![Creating a Netdata integration in StackPulse](https://user-images.githubusercontent.com/49162938/93023348-d9455a80-f5dd-11ea-8e05-67d07dce93e4.png) + +2. On your node, navigate to `/etc/netdata/` and run the following command: + +```sh +./edit-config health_alarm_notify.conf +``` + +3. Set the `STACKPULSE_WEBHOOK` variable to the `Endpoint` URL you copied earlier: + +``` +SEND_STACKPULSE="YES" +STACKPULSE_WEBHOOK="https://hooks.stackpulse.io/v1/webhooks/YOUR_UNIQUE_ID" +``` + +4. Now restart Netdata using `sudo systemctl restart netdata`, or the [appropriate + method](/docs/configure/start-stop-restart.md) for your system. When your node creates an alarm, you can see the + associated notification on your StackPulse Administration Portal. + +## React to alarms with playbooks + +StackPulse allows users to create `Playbooks` giving additional information about events that happen in specific +scenarios. For example, you could create a Playbook that responds to a "low disk space" alarm by compressing and +cleaning up storage partitions with dynamic data. 
+ +![image](https://user-images.githubusercontent.com/49162938/93207961-4c201400-f74b-11ea-94d1-42a29d007b62.png) + +![The StackPulse Administration Portal with a Netdata +alarm](https://user-images.githubusercontent.com/49162938/93208199-bfc22100-f74b-11ea-83c4-728be23dcf4d.png) +### Create Playbooks for Netdata alarms + +To create a Playbook, you need to access the StackPulse Administration Portal. After the initial setup, you need to +access the **TRIGGER** tab to define the scenarios used to trigger the event. The following variables are available: + +- `Hostname`: The host that generated the event. +- `Chart`: The name of the chart. +- `OldValue`: The previous value of the alarm. +- `Value`: The current value of the alarm. +- `Units`: The units of the value. +- `OldStatus`: The previous status: REMOVED, UNINITIALIZED, UNDEFINED, CLEAR, WARNING, CRITICAL. +- `State`: The current alarm status. The acceptable values are the same as for `OldStatus`. +- `Alarm`: The name of the alarm, as given in Netdata's health.d entries. +- `Date`: The timestamp this event occurred. +- `Duration`: The duration in seconds of the previous alarm state. +- `NonClearDuration`: The total duration in seconds this is/was non-clear. +- `Description`: A short description of the alarm copied from the alarm definition. +- `CalcExpression`: The expression that was evaluated to trigger the alarm. +- `CalcParamValues`: The values of the parameters in the expression, at the time of the evaluation. +- `TotalWarnings`: Total number of alarms in WARNING state. +- `TotalCritical`: Total number of alarms in CRITICAL state. +- `ID`: The unique id of the alarm that generated this event. + +For more details on how to create a scenario, take a look at the [StackPulse documentation](https://docs.stackpulse.io). 
+ + diff --git a/health/notifications/syslog/Makefile.inc b/health/notifications/syslog/Makefile.inc new file mode 100644 index 0000000..94a8acc --- /dev/null +++ b/health/notifications/syslog/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + syslog/README.md \ + syslog/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/syslog/README.md b/health/notifications/syslog/README.md new file mode 100644 index 0000000..8b7863a --- /dev/null +++ b/health/notifications/syslog/README.md @@ -0,0 +1,34 @@ +<!-- +title: "Syslog" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/syslog/README.md +--> + +# Syslog + +You need a working `logger` command for this to work. This is the case on pretty much every Linux system in existence, and most BSD systems. + +Logged messages will look like this: + +``` +netdata WARNING on hostname at Tue Apr 3 09:00:00 EDT 2018: disk_space._ out of disk space time = 5h +``` + +## configuration + +System log targets are configured as recipients in [`/etc/netdata/health_alarm_notify.conf`](https://github.com/netdata/netdata/blob/36bedc044584dea791fd29455bdcd287c3306cb2/conf.d/health_alarm_notify.conf#L534) (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`). + +You can also configure per-role targets in the same file a bit further down. + +Targets are defined as follows: + +``` +[[facility.level][@host[:port]]/]prefix +``` + +`prefix` defines what the log messages are prefixed with. By default, all lines are prefixed with 'netdata'. + +The `facility` and `level` are the standard syslog facility and level options, for more info on them see your local `logger` and `syslog` documentation. 
By default, Netdata will log to the `local6` facility, with a log level dependent on the type of message (`crit` for CRITICAL, `warning` for WARNING, and `info` for everything else). + +You can configure sending directly to remote log servers by specifying a host (and optionally a port). However, this has a somewhat high overhead, so it is much preferred to use your local syslog daemon to handle the forwarding of messages to remote systems (pretty much all of them allow at least simple forwarding, and most of the really popular ones support complex queueing and routing of messages to remote log servers). + + diff --git a/health/notifications/telegram/Makefile.inc b/health/notifications/telegram/Makefile.inc new file mode 100644 index 0000000..ffca071 --- /dev/null +++ b/health/notifications/telegram/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + telegram/README.md \ + telegram/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/telegram/README.md b/health/notifications/telegram/README.md new file mode 100644 index 0000000..2a2ed56 --- /dev/null +++ b/health/notifications/telegram/README.md @@ -0,0 +1,45 @@ +<!-- +title: "Telegram" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/telegram/README.md +--> + +# Telegram + +[Telegram](https://telegram.org/) is a messaging app with a focus on speed and security, it's super-fast, simple and free. You can use Telegram on all your devices at the same time; your messages sync seamlessly across any number of your phones, tablets or computers. + +With Telegram, you can send messages, photos, videos and files of any type (doc, zip, mp3, etc), as well as create groups for up to 100,000 people or channels for broadcasting to unlimited audiences. 
You can write to your phone contacts and find people by their usernames. As a result, Telegram is like SMS and email combined, and can take care of all your personal or business messaging needs. + +Netdata will send warning messages without vibration. + +You need to: + +1. Get a bot token. To get one, contact the [@BotFather](https://t.me/BotFather) bot and send the command `/newbot`. Follow the instructions. +2. Start a conversation with your bot or invite it into a group where you want it to send messages. +3. Find the chat ID for every chat you want to send messages to. Contact the [@myidbot](https://t.me/myidbot) bot and send the `/getid` command to get your personal chat ID or invite it into a group and use the `/getgroupid` command to get the group chat ID. Group IDs start with a hyphen, supergroup IDs start with `-100`. + Alternatively, you can get the chat ID directly from the bot API. Send *your* bot a command in the chat you want to use, then check `https://api.telegram.org/bot{YourBotToken}/getUpdates`, e.g. `https://api.telegram.org/bot111122223:7OpFlFFRzRBbrUUmIjj5HF9Ox2pYJZy5/getUpdates` +4. Set the bot token and the chat ID of the recipient in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: +``` +SEND_TELEGRAM="YES" +TELEGRAM_BOT_TOKEN="111122223:7OpFlFFRzRBbrUUmIjj5HF9Ox2pYJZy5" +DEFAULT_RECIPIENT_TELEGRAM="-100233335555" +``` + +You can define multiple recipients like this: `"-100311112222 212341234|critical"`. 
+This example will send: + +- All alerts to the group with ID -100311112222 +- Critical alerts to the user with ID 212341234 + +You can give different recipients per **role** using these (in the same file): + +``` +role_recipients_telegram[sysadmin]="212341234" +role_recipients_telegram[dba]="-1004444333321" +role_recipients_telegram[webmaster]="49999333322 -1009999222255" +``` + +Telegram messages look like this: + +![Netdata notifications via Telegram](https://user-images.githubusercontent.com/1153921/66612223-f07dfb80-eb75-11e9-976f-5734ffd93ecd.png) + + diff --git a/health/notifications/twilio/Makefile.inc b/health/notifications/twilio/Makefile.inc new file mode 100644 index 0000000..0f2d8d8 --- /dev/null +++ b/health/notifications/twilio/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + twilio/README.md \ + twilio/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/twilio/README.md b/health/notifications/twilio/README.md new file mode 100644 index 0000000..b563c66 --- /dev/null +++ b/health/notifications/twilio/README.md @@ -0,0 +1,47 @@ +<!-- +title: "Twilio" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/twilio/README.md +--> + +# Twilio + +Will look like this on your Android device: + +![image](https://cloud.githubusercontent.com/assets/17090999/20034652/620b6100-a39b-11e6-96af-4f83b8e830e2.png) + +You will need: + +1. Signup and Login to twilio.com +2. Pick an SMS capable number during sign up. +3. Get your SID, and Token from <https://www.twilio.com/console> +4. Fill in TWILIO_ACCOUNT_SID="XXXXXXXX" TWILIO_ACCOUNT_TOKEN="XXXXXXXXX" TWILIO_NUMBER="+XXXXXXXXXXX" +5. 
Add the recipient phone numbers to DEFAULT_RECIPIENT_TWILIO="+XXXXXXXXXXX" + +!!PLEASE NOTE THAT IF YOUR ACCOUNT IS A TRIAL ACCOUNT YOU WILL ONLY BE ABLE TO SEND NOTIFICATIONS TO THE NUMBER YOU SIGNED UP WITH + +Set them in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: + +``` +############################################################################### +# Twilio (twilio.com) SMS options + +# multiple recipients can be given like this: +# "+15555555555 +17777777777" + +# enable/disable sending twilio SMS +SEND_TWILIO="YES" + +# Signup for free trial and select a SMS capable Twilio Number +# To get your Account SID and Token, go to https://www.twilio.com/console +# Place your sid, token and number below. +# Then just set the recipients' phone numbers. +# The trial account is only allowed to use the number specified when set up. + +# Without an account sid and token, Netdata cannot send Twilio text messages. 
+TWILIO_ACCOUNT_SID="xxxxxxxxx" +TWILIO_ACCOUNT_TOKEN="xxxxxxxxxx" +TWILIO_NUMBER="xxxxxxxxxxx" +DEFAULT_RECIPIENT_TWILIO="+15555555555" +``` + + diff --git a/health/notifications/web/Makefile.inc b/health/notifications/web/Makefile.inc new file mode 100644 index 0000000..b564d83 --- /dev/null +++ b/health/notifications/web/Makefile.inc @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-3.0-or-later + +# THIS IS NOT A COMPLETE Makefile +# IT IS INCLUDED BY ITS PARENT'S Makefile.am +# IT IS REQUIRED TO REFERENCE ALL FILES RELATIVE TO THE PARENT + +# install these files +dist_noinst_DATA += \ + web/README.md \ + web/Makefile.inc \ + $(NULL) + diff --git a/health/notifications/web/README.md b/health/notifications/web/README.md new file mode 100644 index 0000000..185843a --- /dev/null +++ b/health/notifications/web/README.md @@ -0,0 +1,13 @@ +<!-- +title: "Dashboard" +custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/web/README.md +--> + +# Dashboard + +The Netdata dashboard shows HTML notifications, when it is open. + +Such web notifications look like this: +![image](https://cloud.githubusercontent.com/assets/2662304/18407279/82bac6a6-7714-11e6-847e-c2e84eeacbfb.png) + + |