author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2023-02-06 16:11:30 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2023-02-06 16:11:30 +0000 |
commit | aa2fe8ccbfcb117efa207d10229eeeac5d0f97c7 (patch) | |
tree | 941cbdd387b41c1a81587c20a6df9f0e5e0ff7ab /health | |
parent | Adding upstream version 1.37.1. (diff) | |
Adding upstream version 1.38.0. (tag: upstream/1.38.0)
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'health')
55 files changed, 1168 insertions, 1407 deletions
diff --git a/health/Makefile.am b/health/Makefile.am index 7c8d7f9d2..f0cbb7715 100644 --- a/health/Makefile.am +++ b/health/Makefile.am @@ -36,13 +36,14 @@ dist_healthconfig_DATA = \ health.d/cgroups.conf \ health.d/cpu.conf \ health.d/cockroachdb.conf \ + health.d/consul.conf \ health.d/disks.conf \ health.d/dnsmasq_dhcp.conf \ health.d/dns_query.conf \ health.d/dockerd.conf \ + health.d/elasticsearch.conf \ health.d/entropy.conf \ health.d/exporting.conf \ - health.d/fping.conf \ health.d/geth.conf \ health.d/ioping.conf \ health.d/gearman.conf \ diff --git a/health/QUICKSTART.md b/health/QUICKSTART.md deleted file mode 100644 index bc2da2df1..000000000 --- a/health/QUICKSTART.md +++ /dev/null @@ -1,143 +0,0 @@ -<!-- -title: "Health quickstart" -custom_edit_url: https://github.com/netdata/netdata/edit/master/health/QUICKSTART.md ---> - -# Health quickstart - -In this quickstart guide, you'll learn the basics of editing health configuration files. With this knowledge, you -will be able to customize how and when Netdata triggers alarms based on the health and performance of your system or -infrastructure. - -To learn about more advanced health configurations, visit the [health reference guide](/health/REFERENCE.md). - -## Edit health configuration files - -You should [use `edit-config`](/docs/configure/nodes.md) to edit Netdata's health configuration files. `edit-config` -will open your system's default terminal editor for you to make your changes. Once you've saved and closed the editor, -`edit-config` will copy your edited file into `/etc/netdata/health.d/`, which will override the stock file in -`/usr/lib/netdata/conf.d/health.d/` and ensure your customizations are persistent between updates. 
- -For example, to edit the `cpu.conf` health configuration file, you would run: - -```bash -cd /etc/netdata/ # Replace with your Netdata configuration directory, if not /etc/netdata/ -./edit-config health.d/cpu.conf -``` - -Each health configuration file contains one or more health entities, which always begin with an `alarm:` or `template:` -line. You can edit these entities based on your needs. To make any changes live, be sure to [reload your health -configuration](#reload-health-configuration). - -## Reference Netdata's stock health configuration files - -While you should always [use `edit-config`](#edit-health-configuration-files), you might also want to view the stock -health configuration files Netdata ships with. Stock files can be useful as reference material, or to determine which -file you should edit with `edit-config`. - -By default, Netdata will put health configuration files in `/usr/lib/netdata/conf.d/health.d`. However, you can -double-check the location of these files by navigating to `http://NODE:19999/netdata.conf`, replacing `NODE` with the IP -address or hostname for your Agent dashboard, looking for the `stock health configuration directory` option. The value -here will show the correct path for your installation. - -```conf -[directories] - ... - # stock health config = /usr/lib/netdata/conf.d/health.d -``` - -Navigate to the health configuration directory to see all the available files and open them for reading. - -```bash -cd /usr/lib/netdata/conf.d/health.d/ -ls -adaptec_raid.conf entropy.conf memory.conf squid.conf -am2320.conf fping.conf mongodb.conf -apache.conf mysql.conf swap.conf -... -``` - -> ⚠️ If you edit configuration files in your stock health configuration directory, Netdata will overwrite them during -> any updates. Please use `edit-config` as described in the [section above](#edit-health-configuration-files). 
- -## Write a new health entity - -While tuning existing alarms may work in some cases, you may need to write entirely new health entities based on how -your systems and applications work. - -To write a new health entity, let's create a new file inside of the `health.d/` directory. We'll name our file -`example.conf` for now. - -```bash -./edit-config health.d/example.conf -``` - -As an example, let's build a health entity that triggers an alarm your system's RAM usage goes above 80%. Copy and paste -the following into the editor: - -```yaml - alarm: ram_usage - on: system.ram -lookup: average -1m percentage of used - units: % - every: 1m - warn: $this > 80 - crit: $this > 90 - info: The percentage of RAM used by the system. -``` - -Let's look into each of the lines to see how they create a working health entity. - -- `alarm`: The name for your new entity. The name needs to follow these requirements: - - Any alphabet letter or number. - - The symbols `.` and `_`. - - Cannot be `chart name`, `dimension name`, `family name`, or `chart variable names`. -- `on`: Which chart the entity listens to. -- `lookup`: Which metrics the alarm monitors, the duration of time to monitor, and how to process the metrics into a - usable format. - - `average`: Calculate the average of all the metrics collected. - - `-1m`: Use metrics from 1 minute ago until now to calculate that average. - - `percentage`: Clarify that we're calculating a percentage of RAM usage. - - `of used`: Specify which dimension (`used`) on the `system.ram` chart you want to monitor with this entity. -- `units`: Use percentages rather than absolute units. -- `every`: How often to perform the `lookup` calculation to decide whether or not to trigger this alarm. -- `warn`/`crit`: The value at which Netdata should trigger a warning or critical alarm. -- `info`: A description of the alarm, which will appear in the dashboard and notifications. - -Let's put all these lines into a human-readable format. 
- -This health entity, named **ram_usage**, watches at the **system.ram** chart. It looks up the last **1 minute** of -metrics from the **used** dimension and calculates the **average** of all those metrics in a **percentage** format, -using a **% unit**. The entity performs this lookup **every minute**. If the average RAM usage percentage over the last -1 minute is **more than 80%**, the entity triggers a warning alarm. If the usage is **more than 90%**, the entity -triggers a critical alarm. - -Now that you've written a new health entity, you need to reload it to see it live on the dashboard. - -## Reload health configuration - -To make any changes to your health configuration live, you must reload Netdata's health monitoring system. To do that -without restarting all of Netdata, run the following: - -```bash -netdatacli reload-health -``` - -If you receive an error like `command not found`, this means that `netdatacli` is not installed in your `$PATH`. In that - case, you can reload only the health component by sending a `SIGUSR2` to Netdata: - -```bash -killall -USR2 netdata -``` -## What's next? - -To learn about all of Netdata's health configuration options, view the [reference guide](/health/REFERENCE.md) and -[daemon configuration](/daemon/config/README.md#health-section-options) for additional options available in the -`[health]` section of `netdata.conf`. - -Or, get guided insights into specific health configurations with our [health guides](/health/README.md#guides). - -Finally, move on to Netdata's [notification system](/health/notifications/README.md) to learn more about how Netdata can -let you know when the health of your systems or apps goes awry. 
- - diff --git a/health/README.md b/health/README.md index 2b1caf548..460f65680 100644 --- a/health/README.md +++ b/health/README.md @@ -1,6 +1,10 @@ <!-- title: "Health monitoring" custom_edit_url: https://github.com/netdata/netdata/edit/master/health/README.md +sidebar_label: "Health monitoring" +learn_status: "Published" +learn_topic_type: "Concepts" +learn_rel_path: "Concepts" --> # Health monitoring @@ -10,15 +14,13 @@ worked closely with our community of DevOps engineers, SREs, and developers to d alarms that work without any configuration. The Agent's health monitoring system is also dynamic and fully customizable. You can write entirely new alarms, tune the -community-configured alarms for every app/service [the Agent collects metrics from](/collectors/COLLECTORS.md), or +community-configured alarms for every app/service [the Agent collects metrics from](https://github.com/netdata/netdata/blob/master/collectors/COLLECTORS.md), or silence anything you're not interested in. You can even power complex lookups by running statistical algorithms against your metrics. Ready to take the next steps with health monitoring? -[Quickstart](/health/QUICKSTART.md) - -[Configuration reference](/health/REFERENCE.md) +[Configuration reference](https://github.com/netdata/netdata/blob/master/health/REFERENCE.md) ## Guides @@ -26,13 +28,13 @@ Every infrastructure is different, so we're not interested in mandating how you monitoring features. Instead, these guides should give you the details you need to tweak alarms to your heart's content. 
-[Stopping notifications for individual alarms](/docs/guides/monitor/stop-notifications-alarms.md) +[Stopping notifications for individual alarms](https://github.com/netdata/netdata/blob/master/docs/guides/monitor/stop-notifications-alarms.md) -[Use dimension templates to create dynamic alarms](/docs/guides/monitor/dimension-templates.md) +[Use dimension templates to create dynamic alarms](https://github.com/netdata/netdata/blob/master/docs/guides/monitor/dimension-templates.md) ## Related features -**[Notifications](/health/notifications/README.md)**: Get notified about ongoing alarms from your Agents via your +**[Notifications](https://github.com/netdata/netdata/blob/master/health/notifications/README.md)**: Get notified about ongoing alarms from your Agents via your favorite platform(s), such as Slack, Discord, PagerDuty, email, and much more. diff --git a/health/REFERENCE.md b/health/REFERENCE.md index 90da4102a..27031cd19 100644 --- a/health/REFERENCE.md +++ b/health/REFERENCE.md @@ -1,6 +1,10 @@ <!-- title: "Health configuration reference" -custom_edit_url: https://github.com/netdata/netdata/edit/master/health/REFERENCE.md +sidebar_label: "Health" +custom_edit_url: "https://github.com/netdata/netdata/edit/master/health/REFERENCE.md" +learn_status: "Published" +learn_topic_type: "Tasks" +learn_rel_path: "Operations/Alerts" --> # Health configuration reference @@ -11,7 +15,7 @@ This guide contains information about editing health configuration files to twea entities that are customized to the needs of your infrastructure. To learn the basics of locating and editing health configuration files, see the [health -quickstart](/health/QUICKSTART.md). +quickstart](https://github.com/netdata/netdata/blob/master/health/QUICKSTART.md). ## Health configuration files @@ -19,7 +23,7 @@ You can configure the Agent's health watchdog service by editing files in two lo - The `[health]` section in `netdata.conf`. 
By editing the daemon's behavior, you can disable health monitoring altogether, run health checks more or less often, and more. See [daemon - configuration](/daemon/config/README.md#health-section-options) for a table of all the available settings, their + configuration](https://github.com/netdata/netdata/blob/master/daemon/config/README.md#health-section-options) for a table of all the available settings, their default values, and what they control. - The individual `.conf` files in `health.d/`. These health entity files are organized by the type of metric they are performing calculations on or their associated collector. You should edit these files using the `edit-config` @@ -52,7 +56,7 @@ Netdata parses the following lines. Beneath the table is an in-depth explanation - The `every` line is **required** if not using `lookup`. - Each entity **must** have at least one of the following lines: `lookup`, `calc`, `warn`, or `crit`. - A few lines use space-separated lists to define how the entity behaves. You can use `*` as a wildcard or prefix with - `!` for a negative match. Order is important, too! See our [simple patterns docs](/libnetdata/simple_pattern/README.md) for + `!` for a negative match. Order is important, too! See our [simple patterns docs](https://github.com/netdata/netdata/blob/master/libnetdata/simple_pattern/README.md) for more examples. - Lines terminated by a `\` are spliced together with the next line. The backslash is removed and the following line is joined with the current one. No space is inserted, so you may split a line anywhere, even in the middle of a word. @@ -236,7 +240,7 @@ hosts: server1 server2 database* !redis3 redis* #### Alarm line `plugin` The `plugin` line filters which plugin within the context this alarm should apply to. The value is a space-separated -list of [simple patterns](/libnetdata/simple_pattern/README.md). 
For example, +list of [simple patterns](https://github.com/netdata/netdata/blob/master/libnetdata/simple_pattern/README.md). For example, you can create a filter for an alarm that applies specifically to `python.d.plugin`: ```yaml @@ -250,7 +254,7 @@ comprehensive example using both. #### Alarm line `module` The `module` line filters which module within the context this alarm should apply to. The value is a space-separated -list of [simple patterns](/libnetdata/simple_pattern/README.md). For +list of [simple patterns](https://github.com/netdata/netdata/blob/master/libnetdata/simple_pattern/README.md). For example, you can create an alarm that applies only on the `isc_dhcpd` module started by `python.d.plugin`: ```yaml @@ -262,7 +266,7 @@ module: isc_dhcpd The `charts` line filters which chart this alarm should apply to. It is only available on entities using the [`template`](#alarm-line-alarm-or-template) line. -The value is a space-separated list of [simple patterns](/libnetdata/simple_pattern/README.md). For +The value is a space-separated list of [simple patterns](https://github.com/netdata/netdata/blob/master/libnetdata/simple_pattern/README.md). For example, a template that applies to `disk.svctm` (Average Service Time) context, but excludes the disk `sdb` from alarms: ```yaml @@ -276,7 +280,7 @@ template: disk_svctm_alarm The `families` line, used only alongside templates, filters which families within the context this alarm should apply to. The value is a space-separated list. -The value is a space-separate list of simple patterns. See our [simple patterns docs](/libnetdata/simple_pattern/README.md) for +The value is a space-separate list of simple patterns. See our [simple patterns docs](https://github.com/netdata/netdata/blob/master/libnetdata/simple_pattern/README.md) for some examples. 
For example, you can create a template on the `disk.io` context, but filter it to only the `sda` and `sdb` families: @@ -295,7 +299,7 @@ The format is: lookup: METHOD AFTER [at BEFORE] [every DURATION] [OPTIONS] [of DIMENSIONS] [foreach DIMENSIONS] ``` -Everything is the same with [badges](/web/api/badges/README.md). In short: +Everything is the same with [badges](https://github.com/netdata/netdata/blob/master/web/api/badges/README.md). In short: - `METHOD` is one of `average`, `min`, `max`, `sum`, `incremental-sum`. This is required. @@ -312,7 +316,7 @@ Everything is the same with [badges](/web/api/badges/README.md). In short: above too). - `OPTIONS` is a space separated list of `percentage`, `absolute`, `min2max`, `unaligned`, - `match-ids`, `match-names`. Check the [badges](/web/api/badges/README.md) documentation for more info. + `match-ids`, `match-names`. Check the [badges](https://github.com/netdata/netdata/blob/master/web/api/badges/README.md) documentation for more info. - `of DIMENSIONS` is optional and has to be the last parameter. Dimensions have to be separated by `,` or `|`. The space characters found in dimensions will be kept as-is (a few dimensions @@ -499,7 +503,7 @@ good idea to tell Netdata to not clear the notification, by using the `no-clear- #### Alarm line `host labels` -Defines the list of labels present on a host. See our [host labels guide](/docs/guides/using-host-labels.md) for +Defines the list of labels present on a host. See our [host labels guide](https://github.com/netdata/netdata/blob/master/docs/guides/using-host-labels.md) for an explanation of host labels and how to implement them. For example, let's suppose that `netdata.conf` is configured with the following labels: @@ -532,7 +536,7 @@ that will be applied to all hosts installed in the last decade with the followin host labels: installed = 201* ``` -See our [simple patterns docs](/libnetdata/simple_pattern/README.md) for more examples. 
+See our [simple patterns docs](https://github.com/netdata/netdata/blob/master/libnetdata/simple_pattern/README.md) for more examples. #### Alarm line `info` @@ -548,13 +552,13 @@ alert information. Current variables supported are: | variable | description | | ---------| ----------- | -| $family | Will be replaced by the family instance for the alert (e.g. eth0) | -| $label: | Followed by a chart label name, this will replace the variable with the chart label's value | +| ${family} | Will be replaced by the family instance for the alert (e.g. eth0) | +| ${label:LABEL_NAME} | The variable will be replaced with the value of the label | For example, an info field like the following: ```yaml -info: average inbound utilization for the network interface $family over the last minute +info: average inbound utilization for the network interface ${family} over the last minute ``` Will be rendered on the alert acting on interface `eth0` as: @@ -567,7 +571,7 @@ An alert acting on a chart that has a chart label named e.g. `target`, with a va can be enriched as follows: ```yaml -info: average ratio of HTTP responses with unexpected status over the last 5 minutes for the site $label:target +info: average ratio of HTTP responses with unexpected status over the last 5 minutes for the site ${label:target} ``` Will become: @@ -647,15 +651,15 @@ You can find all the variables that can be used for a given chart, using Agent dashboard. For example, [variables for the `system.cpu` chart of the registry](https://registry.my-netdata.io/api/v1/alarm_variables?chart=system.cpu). -> If you don't know how to find the CHART_NAME, you can read about it [here](/web/README.md#charts). +> If you don't know how to find the CHART_NAME, you can read about it [here](https://github.com/netdata/netdata/blob/master/web/README.md#charts). Netdata supports 3 internal indexes for variables that will be used in health monitoring. 
<details markdown="1"><summary>The variables below can be used in both chart alarms and context templates.</summary> Although the `alarm_variables` link shows you variables for a particular chart, the same variables can also be used in -templates for charts belonging to a given [context](/web/README.md#contexts). The reason is that all charts of a given -context are essentially identical, with the only difference being the [family](/web/README.md#families) that +templates for charts belonging to a given [context](https://github.com/netdata/netdata/blob/master/web/README.md#contexts). The reason is that all charts of a given +context are essentially identical, with the only difference being the [family](https://github.com/netdata/netdata/blob/master/web/README.md#families) that identifies a particular hardware or software instance. Charts and templates do not apply to specific families anyway, unless if you explicitly limit an alarm with the [alarm line `families`](#alarm-line-families). @@ -995,7 +999,7 @@ The `lookup` line will use the `anomaly_rate` dimension of the `anomaly_detectio ## Troubleshooting -You can compile Netdata with [debugging](/daemon/README.md#debugging) and then set in `netdata.conf`: +You can compile Netdata with [debugging](https://github.com/netdata/netdata/blob/master/daemon/README.md#debugging) and then set in `netdata.conf`: ```yaml [global] @@ -1018,6 +1022,6 @@ expression. It's currently not possible to schedule notifications from within the alarm template. For those scenarios where you need to temporary disable notifications (for instance when running backups triggers a disk alert) you can disable or silence notifications are runtime. The health checks can be controlled at runtime via the [health management -api](/web/api/health/README.md). +api](https://github.com/netdata/netdata/blob/master/web/api/health/README.md). 
diff --git a/health/health.c b/health/health.c index 3784e0f31..b34f54ab5 100644 --- a/health/health.c +++ b/health/health.c @@ -159,9 +159,10 @@ static bool prepare_command(BUFFER *wb, unsigned int default_health_enabled = 1; char *silencers_filename; +SIMPLE_PATTERN *conf_enabled_alarms = NULL; // the queue of executed alarm notifications that haven't been waited for yet -static __thread struct { +static struct { ALARM_ENTRY *head; // oldest ALARM_ENTRY *tail; // latest } alarm_notifications_in_progress = {NULL, NULL}; @@ -301,7 +302,7 @@ void health_init(void) { * @param host the structure of the host that the function will reload the configuration. */ static void health_reload_host(RRDHOST *host) { - if(unlikely(!host->health_enabled) && !rrdhost_flag_check(host, RRDHOST_FLAG_INITIALIZED_HEALTH)) + if(unlikely(!host->health.health_enabled) && !rrdhost_flag_check(host, RRDHOST_FLAG_INITIALIZED_HEALTH)) return; log_health("[%s]: Reloading health.", rrdhost_hostname(host)); @@ -345,7 +346,6 @@ static void health_reload_host(RRDHOST *host) { rrdcalctemplate_link_matching_templates_to_rrdset(st); } rrdset_foreach_done(st); - host->aclk_alert_reloaded = 1; } /** @@ -363,6 +363,12 @@ void health_reload(void) { health_reload_host(host); rrd_unlock(); + +#ifdef ENABLE_ACLK + if (netdata_cloud_setting) { + aclk_alert_reloaded = 1; + } +#endif } // ---------------------------------------------------------------------------- @@ -444,8 +450,8 @@ static inline void health_alarm_execute(RRDHOST *host, ALARM_ENTRY *ae) { log_health("[%s]: Sending notification for alarm '%s.%s' status %s.", rrdhost_hostname(host), ae_chart_name(ae), ae_name(ae), rrdcalc_status2string(ae->new_status)); - const char *exec = (ae->exec) ? ae_exec(ae) : string2str(host->health_default_exec); - const char *recipient = (ae->recipient) ? ae_recipient(ae) : string2str(host->health_default_recipient); + const char *exec = (ae->exec) ? 
ae_exec(ae) : string2str(host->health.health_default_exec); + const char *recipient = (ae->recipient) ? ae_recipient(ae) : string2str(host->health.health_default_recipient); int n_warn=0, n_crit=0; RRDCALC *rc; @@ -453,8 +459,8 @@ static inline void health_alarm_execute(RRDHOST *host, ALARM_ENTRY *ae) { BUFFER *warn_alarms, *crit_alarms; active_alerts_t *active_alerts = callocz(ACTIVE_ALARMS_LIST_EXAMINE, sizeof(active_alerts_t)); - warn_alarms = buffer_create(NETDATA_WEB_RESPONSE_INITIAL_SIZE); - crit_alarms = buffer_create(NETDATA_WEB_RESPONSE_INITIAL_SIZE); + warn_alarms = buffer_create(NETDATA_WEB_RESPONSE_INITIAL_SIZE, &netdata_buffers_statistics.buffers_health); + crit_alarms = buffer_create(NETDATA_WEB_RESPONSE_INITIAL_SIZE, &netdata_buffers_statistics.buffers_health); foreach_rrdcalc_in_rrdhost_read(host, rc) { if(unlikely(!rc->rrdset || !rc->rrdset->last_collected_time.tv_sec)) @@ -511,7 +517,7 @@ static inline void health_alarm_execute(RRDHOST *host, ALARM_ENTRY *ae) { char *edit_command = ae->source ? 
health_edit_command_from_source(ae_source(ae)) : strdupz("UNKNOWN=0=UNKNOWN"); - BUFFER *wb = buffer_create(8192); + BUFFER *wb = buffer_create(8192, &netdata_buffers_statistics.buffers_health); bool ok = prepare_command(wb, exec, recipient, @@ -692,8 +698,8 @@ static inline int rrdcalc_isrunnable(RRDCALC *rc, time_t now, time_t *next_run) } int update_every = rc->rrdset->update_every; - time_t first = rrdset_first_entry_t(rc->rrdset); - time_t last = rrdset_last_entry_t(rc->rrdset); + time_t first = rrdset_first_entry_s(rc->rrdset); + time_t last = rrdset_last_entry_s(rc->rrdset); if(unlikely(now + update_every < first /* || now - update_every > last */)) { debug(D_HEALTH @@ -719,7 +725,7 @@ static inline int rrdcalc_isrunnable(RRDCALC *rc, time_t now, time_t *next_run) } static inline int check_if_resumed_from_suspension(void) { - static __thread usec_t last_realtime = 0, last_monotonic = 0; + static usec_t last_realtime = 0, last_monotonic = 0; usec_t realtime = now_realtime_usec(), monotonic = now_monotonic_usec(); int ret = 0; @@ -735,25 +741,29 @@ static inline int check_if_resumed_from_suspension(void) { return ret; } -static void health_thread_cleanup(void *ptr) { +static void health_main_cleanup(void *ptr) { worker_unregister(); - struct health_state *h = ptr; - h->host->health_spawn = 0; + struct netdata_static_thread *static_thread = (struct netdata_static_thread *)ptr; + static_thread->enabled = NETDATA_MAIN_THREAD_EXITING; + info("cleaning up..."); + static_thread->enabled = NETDATA_MAIN_THREAD_EXITED; - netdata_thread_cancel(netdata_thread_self()); - log_health("[%s]: Health thread ended.", rrdhost_hostname(h->host)); - debug(D_HEALTH, "HEALTH %s: Health thread ended.", rrdhost_hostname(h->host)); + log_health("Health thread ended."); } static void initialize_health(RRDHOST *host, int is_localhost) { - if(!host->health_enabled || rrdhost_flag_check(host, RRDHOST_FLAG_INITIALIZED_HEALTH)) return; + if(!host->health.health_enabled || + 
rrdhost_flag_check(host, RRDHOST_FLAG_INITIALIZED_HEALTH) || + !service_running(SERVICE_HEALTH)) + return; + rrdhost_flag_set(host, RRDHOST_FLAG_INITIALIZED_HEALTH); log_health("[%s]: Initializing health.", rrdhost_hostname(host)); - host->health_default_warn_repeat_every = config_get_duration(CONFIG_SECTION_HEALTH, "default repeat warning", "never"); - host->health_default_crit_repeat_every = config_get_duration(CONFIG_SECTION_HEALTH, "default repeat critical", "never"); + host->health.health_default_warn_repeat_every = config_get_duration(CONFIG_SECTION_HEALTH, "default repeat warning", "never"); + host->health.health_default_crit_repeat_every = config_get_duration(CONFIG_SECTION_HEALTH, "default repeat critical", "never"); host->health_log.next_log_id = 1; host->health_log.next_alarm_id = 1; @@ -769,6 +779,8 @@ static void initialize_health(RRDHOST *host, int is_localhost) { else host->health_log.max = (unsigned int)n; + conf_enabled_alarms = simple_pattern_create(config_get(CONFIG_SECTION_HEALTH, "enabled alarms", "*"), NULL, SIMPLE_PATTERN_EXACT); + netdata_rwlock_init(&host->health_log.alarm_log_rwlock); char filename[FILENAME_MAX + 1]; @@ -785,30 +797,15 @@ static void initialize_health(RRDHOST *host, int is_localhost) { if(r != 0 && errno != EEXIST) error("Host '%s': cannot create directory '%s'", rrdhost_hostname(host), filename); } - snprintfz(filename, FILENAME_MAX, "%s/health/health-log.db", host->varlib_dir); - host->health_log_filename = strdupz(filename); snprintfz(filename, FILENAME_MAX, "%s/alarm-notify.sh", netdata_configured_primary_plugins_dir); - host->health_default_exec = string_strdupz(config_get(CONFIG_SECTION_HEALTH, "script to execute on alarm", filename)); - host->health_default_recipient = string_strdupz("root"); - - if (!file_is_migrated(host->health_log_filename)) { - int rc = sql_create_health_log_table(host); - if (unlikely(rc)) { - log_health("[%s]: Failed to create health log table in the database", rrdhost_hostname(host)); - 
health_alarm_log_load(host); - health_alarm_log_open(host); - } - else { - health_alarm_log_load(host); - add_migrated_file(host->health_log_filename, 0); - } - } else { - // TODO: This needs to go to the metadata thread - // Health should wait before accessing the table (needs to be created by the metadata thread) - sql_create_health_log_table(host); - sql_health_alarm_log_load(host); - } + host->health.health_default_exec = string_strdupz(config_get(CONFIG_SECTION_HEALTH, "script to execute on alarm", filename)); + host->health.health_default_recipient = string_strdupz("root"); + + // TODO: This needs to go to the metadata thread + // Health should wait before accessing the table (needs to be created by the metadata thread) + sql_create_health_log_table(host); + sql_health_alarm_log_load(host); // ------------------------------------------------------------------------ // load health configuration @@ -828,16 +825,14 @@ static void initialize_health(RRDHOST *host, int is_localhost) { //Discard alarms with labels that do not apply to host rrdcalc_delete_alerts_not_matching_host_labels_from_this_host(host); - - health_silencers_init(); } -static void health_sleep(time_t next_run, unsigned int loop __maybe_unused, RRDHOST *host) { +static void health_sleep(time_t next_run, unsigned int loop __maybe_unused) { time_t now = now_realtime_sec(); if(now < next_run) { worker_is_idle(); debug(D_HEALTH, "Health monitoring iteration no %u done. 
Next iteration in %d secs", loop, (int) (next_run - now)); - while (now < next_run && host->health_enabled && !netdata_exit) { + while (now < next_run && service_running(SERVICE_HEALTH)) { sleep_usec(USEC_PER_SEC); now = now_realtime_sec(); } @@ -995,555 +990,567 @@ void *health_main(void *ptr) { worker_register_job_name(WORKER_HEALTH_JOB_DELAYED_INIT_RRDSET, "rrdset init"); worker_register_job_name(WORKER_HEALTH_JOB_DELAYED_INIT_RRDDIM, "rrddim init"); - struct health_state *h = ptr; - netdata_thread_cleanup_push(health_thread_cleanup, ptr); - - RRDHOST *host = h->host; - initialize_health(host, host == localhost); + netdata_thread_cleanup_push(health_main_cleanup, ptr); int min_run_every = (int)config_get_number(CONFIG_SECTION_HEALTH, "run at least every seconds", 10); if(min_run_every < 1) min_run_every = 1; - int cleanup_sql_every_loop = 7200 / min_run_every; - - time_t now = now_realtime_sec(); time_t hibernation_delay = config_get_number(CONFIG_SECTION_HEALTH, "postpone alarms during hibernation for seconds", 60); bool health_running_logged = false; - rrdhost_rdlock(host); //CHECK - rrdcalc_delete_alerts_not_matching_host_labels_from_this_host(host); - rrdhost_unlock(host); + rrdcalc_delete_alerts_not_matching_host_labels_from_all_hosts(); unsigned int loop = 0; #ifdef ENABLE_ACLK unsigned int marked_aclk_reload_loop = 0; #endif - while(!netdata_exit && host->health_enabled) { + while(service_running(SERVICE_HEALTH)) { loop++; debug(D_HEALTH, "Health monitoring iteration no %u started", loop); - now = now_realtime_sec(); + time_t now = now_realtime_sec(); int runnable = 0, apply_hibernation_delay = 0; time_t next_run = now + min_run_every; RRDCALC *rc; + RRDHOST *host; if (unlikely(check_if_resumed_from_suspension())) { apply_hibernation_delay = 1; log_health( - "[%s]: Postponing alarm checks for %"PRId64" seconds, " + "Postponing alarm checks for %"PRId64" seconds, " "because it seems that the system was just resumed from suspension.", - 
rrdhost_hostname(host), (int64_t)hibernation_delay); } if (unlikely(silencers->all_alarms && silencers->stype == STYPE_DISABLE_ALARMS)) { - static __thread int logged=0; + static int logged=0; if (!logged) { - log_health("[%s]: Skipping health checks, because all alarms are disabled via a %s command.", - rrdhost_hostname(host), + log_health("Skipping health checks, because all alarms are disabled via a %s command.", HEALTH_CMDAPI_CMD_DISABLEALL); logged = 1; } } #ifdef ENABLE_ACLK - if (host->aclk_alert_reloaded && !marked_aclk_reload_loop) + if (aclk_alert_reloaded && !marked_aclk_reload_loop) marked_aclk_reload_loop = loop; #endif - if (unlikely(apply_hibernation_delay)) { - log_health( - "[%s]: Postponing health checks for %"PRId64" seconds.", - rrdhost_hostname(host), - (int64_t)hibernation_delay); - - host->health_delay_up_to = now + hibernation_delay; - next_run = now + hibernation_delay; - health_sleep(next_run, loop, host); - } + worker_is_busy(WORKER_HEALTH_JOB_RRD_LOCK); + rrd_rdlock(); - if (unlikely(host->health_delay_up_to)) { - if (unlikely(now < host->health_delay_up_to)) { - next_run = host->health_delay_up_to; - health_sleep(next_run, loop, host); - continue; - } + rrdhost_foreach_read(host) { - log_health("[%s]: Resuming health checks after delay.", rrdhost_hostname(host)); - host->health_delay_up_to = 0; - } + if(unlikely(!service_running(SERVICE_HEALTH))) + break; - // wait until cleanup of obsolete charts on children is complete - if (host != localhost) { - if (unlikely(host->trigger_chart_obsoletion_check == 1)) { - log_health("[%s]: Waiting for chart obsoletion check.", rrdhost_hostname(host)); - health_sleep(next_run, loop, host); + if (unlikely(!host->health.health_enabled)) continue; - } - } - if (!health_running_logged) { - log_health("[%s]: Health is running.", rrdhost_hostname(host)); - health_running_logged = true; - } - - if(likely(!host->health_log_fp) && (loop == 1 || loop % cleanup_sql_every_loop == 0)) - 
sql_health_alarm_log_cleanup(host); + if (unlikely(!rrdhost_flag_check(host, RRDHOST_FLAG_INITIALIZED_HEALTH))) { + rrd_unlock(); + initialize_health(host, host == localhost); + rrd_rdlock(); + } - health_execute_delayed_initializations(host); + health_execute_delayed_initializations(host); - worker_is_busy(WORKER_HEALTH_JOB_HOST_LOCK); + rrdcalc_delete_alerts_not_matching_host_labels_from_this_host(host); - // the first loop is to lookup values from the db - foreach_rrdcalc_in_rrdhost_read(host, rc) { + if (unlikely(apply_hibernation_delay)) { + log_health( + "[%s]: Postponing health checks for %"PRId64" seconds.", + rrdhost_hostname(host), + (int64_t)hibernation_delay); - rrdcalc_update_info_using_rrdset_labels(rc); + host->health.health_delay_up_to = now + hibernation_delay; + } - if (update_disabled_silenced(host, rc)) - continue; + if (unlikely(host->health.health_delay_up_to)) { + if (unlikely(now < host->health.health_delay_up_to)) { + continue; + } - // create an alert removed event if the chart is obsolete and - // has stopped being collected for 60 seconds - if (unlikely(rc->rrdset && rc->status != RRDCALC_STATUS_REMOVED && - rrdset_flag_check(rc->rrdset, RRDSET_FLAG_OBSOLETE) && - now > (rc->rrdset->last_collected_time.tv_sec + 60))) { - if (!rrdcalc_isrepeating(rc)) { - worker_is_busy(WORKER_HEALTH_JOB_ALARM_LOG_ENTRY); - time_t now = now_realtime_sec(); - - ALARM_ENTRY *ae = health_create_alarm_entry( - host, - rc->id, - rc->next_event_id++, - rc->config_hash_id, - now, - rc->name, - rc->rrdset->id, - rc->rrdset->context, - rc->rrdset->family, - rc->classification, - rc->component, - rc->type, - rc->exec, - rc->recipient, - now - rc->last_status_change, - rc->value, - NAN, - rc->status, - RRDCALC_STATUS_REMOVED, - rc->source, - rc->units, - rc->info, - 0, - rrdcalc_isrepeating(rc)?HEALTH_ENTRY_FLAG_IS_REPEATING:0); - - if (ae) { - health_alarm_log_add_entry(host, ae); - rc->old_status = rc->status; - rc->status = RRDCALC_STATUS_REMOVED; - 
rc->last_status_change = now; - rc->last_updated = now; - rc->value = NAN; + log_health("[%s]: Resuming health checks after delay.", rrdhost_hostname(host)); + host->health.health_delay_up_to = 0; + } -#ifdef ENABLE_ACLK - if (netdata_cloud_setting && likely(!host->aclk_alert_reloaded)) - sql_queue_alarm_to_aclk(host, ae, 1); -#endif - } + // wait until cleanup of obsolete charts on children is complete + if (host != localhost) { + if (unlikely(host->trigger_chart_obsoletion_check == 1)) { + log_health("[%s]: Waiting for chart obsoletion check.", rrdhost_hostname(host)); + continue; } } - if (unlikely(!rrdcalc_isrunnable(rc, now, &next_run))) { - if (unlikely(rc->run_flags & RRDCALC_FLAG_RUNNABLE)) - rc->run_flags &= ~RRDCALC_FLAG_RUNNABLE; - continue; + if (!health_running_logged) { + log_health("[%s]: Health is running.", rrdhost_hostname(host)); + health_running_logged = true; } - runnable++; - rc->old_value = rc->value; - rc->run_flags |= RRDCALC_FLAG_RUNNABLE; + worker_is_busy(WORKER_HEALTH_JOB_HOST_LOCK); - // ------------------------------------------------------------ - // if there is database lookup, do it + // the first loop is to lookup values from the db + foreach_rrdcalc_in_rrdhost_read(host, rc) { - if (unlikely(RRDCALC_HAS_DB_LOOKUP(rc))) { - worker_is_busy(WORKER_HEALTH_JOB_DB_QUERY); + if(unlikely(!service_running(SERVICE_HEALTH))) + break; - /* time_t old_db_timestamp = rc->db_before; */ - int value_is_null = 0; + rrdcalc_update_info_using_rrdset_labels(rc); - int ret = rrdset2value_api_v1(rc->rrdset, NULL, &rc->value, rrdcalc_dimensions(rc), 1, - rc->after, rc->before, rc->group, NULL, - 0, rc->options, - &rc->db_after,&rc->db_before, - NULL, NULL, NULL, - &value_is_null, NULL, 0, 0, - QUERY_SOURCE_HEALTH); + if (update_disabled_silenced(host, rc)) + continue; - if (unlikely(ret != 200)) { - // database lookup failed - rc->value = NAN; - rc->run_flags |= RRDCALC_FLAG_DB_ERROR; + // create an alert removed event if the chart is obsolete and + // 
has stopped being collected for 60 seconds + if (unlikely(rc->rrdset && rc->status != RRDCALC_STATUS_REMOVED && + rrdset_flag_check(rc->rrdset, RRDSET_FLAG_OBSOLETE) && + now > (rc->rrdset->last_collected_time.tv_sec + 60))) { + if (!rrdcalc_isrepeating(rc)) { + worker_is_busy(WORKER_HEALTH_JOB_ALARM_LOG_ENTRY); + time_t now = now_realtime_sec(); + + ALARM_ENTRY *ae = health_create_alarm_entry( + host, + rc->id, + rc->next_event_id++, + rc->config_hash_id, + now, + rc->name, + rc->rrdset->id, + rc->rrdset->context, + rc->rrdset->family, + rc->classification, + rc->component, + rc->type, + rc->exec, + rc->recipient, + now - rc->last_status_change, + rc->value, + NAN, + rc->status, + RRDCALC_STATUS_REMOVED, + rc->source, + rc->units, + rc->info, + 0, + rrdcalc_isrepeating(rc)?HEALTH_ENTRY_FLAG_IS_REPEATING:0); + + if (ae) { + health_alarm_log_add_entry(host, ae); + rc->old_status = rc->status; + rc->status = RRDCALC_STATUS_REMOVED; + rc->last_status_change = now; + rc->last_updated = now; + rc->value = NAN; - debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': database lookup returned error %d", - rrdhost_hostname(host), rrdcalc_chart_name(rc), rrdcalc_name(rc), ret - ); - } else - rc->run_flags &= ~RRDCALC_FLAG_DB_ERROR; - - /* - RRDCALC_FLAG_DB_STALE not currently used - if (unlikely(old_db_timestamp == rc->db_before)) { - // database is stale - - debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': database is stale", host->hostname, rc->chart?rc->chart:"NOCHART", rc->name); - - if (unlikely(!(rc->rrdcalc_flags & RRDCALC_FLAG_DB_STALE))) { - rc->rrdcalc_flags |= RRDCALC_FLAG_DB_STALE; - error("Health on host '%s', alarm '%s.%s': database is stale", host->hostname, rc->chart?rc->chart:"NOCHART", rc->name); - } - } - else if (unlikely(rc->rrdcalc_flags & RRDCALC_FLAG_DB_STALE)) - rc->rrdcalc_flags &= ~RRDCALC_FLAG_DB_STALE; - */ - - if (unlikely(value_is_null)) { - // collected value is null - rc->value = NAN; - rc->run_flags |= RRDCALC_FLAG_DB_NAN; - - 
debug(D_HEALTH, - "Health on host '%s', alarm '%s.%s': database lookup returned empty value (possibly value is not collected yet)", - rrdhost_hostname(host), rrdcalc_chart_name(rc), rrdcalc_name(rc) - ); - } else - rc->run_flags &= ~RRDCALC_FLAG_DB_NAN; +#ifdef ENABLE_ACLK + if (netdata_cloud_setting && likely(!aclk_alert_reloaded)) + sql_queue_alarm_to_aclk(host, ae, 1); +#endif + } + } + } - debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': database lookup gave value " NETDATA_DOUBLE_FORMAT, - rrdhost_hostname(host), rrdcalc_chart_name(rc), rrdcalc_name(rc), rc->value - ); - } + if (unlikely(!rrdcalc_isrunnable(rc, now, &next_run))) { + if (unlikely(rc->run_flags & RRDCALC_FLAG_RUNNABLE)) + rc->run_flags &= ~RRDCALC_FLAG_RUNNABLE; + continue; + } - // ------------------------------------------------------------ - // if there is calculation expression, run it + runnable++; + rc->old_value = rc->value; + rc->run_flags |= RRDCALC_FLAG_RUNNABLE; - if (unlikely(rc->calculation)) { - worker_is_busy(WORKER_HEALTH_JOB_CALC_EVAL); + // ------------------------------------------------------------ + // if there is database lookup, do it - if (unlikely(!expression_evaluate(rc->calculation))) { - // calculation failed - rc->value = NAN; - rc->run_flags |= RRDCALC_FLAG_CALC_ERROR; + if (unlikely(RRDCALC_HAS_DB_LOOKUP(rc))) { + worker_is_busy(WORKER_HEALTH_JOB_DB_QUERY); - debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': expression '%s' failed: %s", - rrdhost_hostname(host), rrdcalc_chart_name(rc), rrdcalc_name(rc), - rc->calculation->parsed_as, buffer_tostring(rc->calculation->error_msg) - ); - } else { - rc->run_flags &= ~RRDCALC_FLAG_CALC_ERROR; + /* time_t old_db_timestamp = rc->db_before; */ + int value_is_null = 0; - debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': expression '%s' gave value " - NETDATA_DOUBLE_FORMAT - ": %s (source: %s)", rrdhost_hostname(host), rrdcalc_chart_name(rc), rrdcalc_name(rc), - rc->calculation->parsed_as, rc->calculation->result, - 
buffer_tostring(rc->calculation->error_msg), rrdcalc_source(rc) - ); + int ret = rrdset2value_api_v1(rc->rrdset, NULL, &rc->value, rrdcalc_dimensions(rc), 1, + rc->after, rc->before, rc->group, NULL, + 0, rc->options, + &rc->db_after,&rc->db_before, + NULL, NULL, NULL, + &value_is_null, NULL, 0, 0, + QUERY_SOURCE_HEALTH, STORAGE_PRIORITY_LOW); - rc->value = rc->calculation->result; - } - } - } - foreach_rrdcalc_in_rrdhost_done(rc); + if (unlikely(ret != 200)) { + // database lookup failed + rc->value = NAN; + rc->run_flags |= RRDCALC_FLAG_DB_ERROR; - if (unlikely(runnable && !netdata_exit)) { - foreach_rrdcalc_in_rrdhost_read(host, rc) { - if (unlikely(!(rc->run_flags & RRDCALC_FLAG_RUNNABLE))) - continue; + debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': database lookup returned error %d", + rrdhost_hostname(host), rrdcalc_chart_name(rc), rrdcalc_name(rc), ret + ); + } else + rc->run_flags &= ~RRDCALC_FLAG_DB_ERROR; - if (rc->run_flags & RRDCALC_FLAG_DISABLED) { - continue; + if (unlikely(value_is_null)) { + // collected value is null + rc->value = NAN; + rc->run_flags |= RRDCALC_FLAG_DB_NAN; + + debug(D_HEALTH, + "Health on host '%s', alarm '%s.%s': database lookup returned empty value (possibly value is not collected yet)", + rrdhost_hostname(host), rrdcalc_chart_name(rc), rrdcalc_name(rc) + ); + } else + rc->run_flags &= ~RRDCALC_FLAG_DB_NAN; + + debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': database lookup gave value " NETDATA_DOUBLE_FORMAT, + rrdhost_hostname(host), rrdcalc_chart_name(rc), rrdcalc_name(rc), rc->value + ); } - RRDCALC_STATUS warning_status = RRDCALC_STATUS_UNDEFINED; - RRDCALC_STATUS critical_status = RRDCALC_STATUS_UNDEFINED; - // -------------------------------------------------------- - // check the warning expression + // ------------------------------------------------------------ + // if there is calculation expression, run it - if (likely(rc->warning)) { - worker_is_busy(WORKER_HEALTH_JOB_WARNING_EVAL); + if 
(unlikely(rc->calculation)) { + worker_is_busy(WORKER_HEALTH_JOB_CALC_EVAL); - if (unlikely(!expression_evaluate(rc->warning))) { + if (unlikely(!expression_evaluate(rc->calculation))) { // calculation failed - rc->run_flags |= RRDCALC_FLAG_WARN_ERROR; + rc->value = NAN; + rc->run_flags |= RRDCALC_FLAG_CALC_ERROR; - debug(D_HEALTH, - "Health on host '%s', alarm '%s.%s': warning expression failed with error: %s", + debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': expression '%s' failed: %s", rrdhost_hostname(host), rrdcalc_chart_name(rc), rrdcalc_name(rc), - buffer_tostring(rc->warning->error_msg) + rc->calculation->parsed_as, buffer_tostring(rc->calculation->error_msg) ); } else { - rc->run_flags &= ~RRDCALC_FLAG_WARN_ERROR; - debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': warning expression gave value " + rc->run_flags &= ~RRDCALC_FLAG_CALC_ERROR; + + debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': expression '%s' gave value " NETDATA_DOUBLE_FORMAT - ": %s (source: %s)", rrdhost_hostname(host), rrdcalc_chart_name(rc), - rrdcalc_name(rc), rc->warning->result, buffer_tostring(rc->warning->error_msg), rrdcalc_source(rc) + ": %s (source: %s)", rrdhost_hostname(host), rrdcalc_chart_name(rc), rrdcalc_name(rc), + rc->calculation->parsed_as, rc->calculation->result, + buffer_tostring(rc->calculation->error_msg), rrdcalc_source(rc) ); - warning_status = rrdcalc_value2status(rc->warning->result); + + rc->value = rc->calculation->result; } } + } + foreach_rrdcalc_in_rrdhost_done(rc); - // -------------------------------------------------------- - // check the critical expression + if (unlikely(runnable && service_running(SERVICE_HEALTH))) { + foreach_rrdcalc_in_rrdhost_read(host, rc) { + if(unlikely(!service_running(SERVICE_HEALTH))) + break; - if (likely(rc->critical)) { - worker_is_busy(WORKER_HEALTH_JOB_CRITICAL_EVAL); + if (unlikely(!(rc->run_flags & RRDCALC_FLAG_RUNNABLE))) + continue; - if (unlikely(!expression_evaluate(rc->critical))) { - // calculation 
failed - rc->run_flags |= RRDCALC_FLAG_CRIT_ERROR; + if (rc->run_flags & RRDCALC_FLAG_DISABLED) { + continue; + } + RRDCALC_STATUS warning_status = RRDCALC_STATUS_UNDEFINED; + RRDCALC_STATUS critical_status = RRDCALC_STATUS_UNDEFINED; + + // -------------------------------------------------------- + // check the warning expression + + if (likely(rc->warning)) { + worker_is_busy(WORKER_HEALTH_JOB_WARNING_EVAL); + + if (unlikely(!expression_evaluate(rc->warning))) { + // calculation failed + rc->run_flags |= RRDCALC_FLAG_WARN_ERROR; + + debug(D_HEALTH, + "Health on host '%s', alarm '%s.%s': warning expression failed with error: %s", + rrdhost_hostname(host), rrdcalc_chart_name(rc), rrdcalc_name(rc), + buffer_tostring(rc->warning->error_msg) + ); + } else { + rc->run_flags &= ~RRDCALC_FLAG_WARN_ERROR; + debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': warning expression gave value " + NETDATA_DOUBLE_FORMAT + ": %s (source: %s)", rrdhost_hostname(host), rrdcalc_chart_name(rc), + rrdcalc_name(rc), rc->warning->result, buffer_tostring(rc->warning->error_msg), rrdcalc_source(rc) + ); + warning_status = rrdcalc_value2status(rc->warning->result); + } + } - debug(D_HEALTH, - "Health on host '%s', alarm '%s.%s': critical expression failed with error: %s", - rrdhost_hostname(host), rrdcalc_chart_name(rc), rrdcalc_name(rc), - buffer_tostring(rc->critical->error_msg) - ); - } else { - rc->run_flags &= ~RRDCALC_FLAG_CRIT_ERROR; - debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': critical expression gave value " - NETDATA_DOUBLE_FORMAT - ": %s (source: %s)", rrdhost_hostname(host), rrdcalc_chart_name(rc), - rrdcalc_name(rc), rc->critical->result, buffer_tostring(rc->critical->error_msg), - rrdcalc_source(rc) - ); - critical_status = rrdcalc_value2status(rc->critical->result); + // -------------------------------------------------------- + // check the critical expression + + if (likely(rc->critical)) { + worker_is_busy(WORKER_HEALTH_JOB_CRITICAL_EVAL); + + if 
(unlikely(!expression_evaluate(rc->critical))) { + // calculation failed + rc->run_flags |= RRDCALC_FLAG_CRIT_ERROR; + + debug(D_HEALTH, + "Health on host '%s', alarm '%s.%s': critical expression failed with error: %s", + rrdhost_hostname(host), rrdcalc_chart_name(rc), rrdcalc_name(rc), + buffer_tostring(rc->critical->error_msg) + ); + } else { + rc->run_flags &= ~RRDCALC_FLAG_CRIT_ERROR; + debug(D_HEALTH, "Health on host '%s', alarm '%s.%s': critical expression gave value " + NETDATA_DOUBLE_FORMAT + ": %s (source: %s)", rrdhost_hostname(host), rrdcalc_chart_name(rc), + rrdcalc_name(rc), rc->critical->result, buffer_tostring(rc->critical->error_msg), + rrdcalc_source(rc) + ); + critical_status = rrdcalc_value2status(rc->critical->result); + } } - } - // -------------------------------------------------------- - // decide the final alarm status + // -------------------------------------------------------- + // decide the final alarm status - RRDCALC_STATUS status = RRDCALC_STATUS_UNDEFINED; + RRDCALC_STATUS status = RRDCALC_STATUS_UNDEFINED; - switch (warning_status) { - case RRDCALC_STATUS_CLEAR: - status = RRDCALC_STATUS_CLEAR; - break; + switch (warning_status) { + case RRDCALC_STATUS_CLEAR: + status = RRDCALC_STATUS_CLEAR; + break; - case RRDCALC_STATUS_RAISED: - status = RRDCALC_STATUS_WARNING; - break; + case RRDCALC_STATUS_RAISED: + status = RRDCALC_STATUS_WARNING; + break; - default: - break; - } + default: + break; + } - switch (critical_status) { - case RRDCALC_STATUS_CLEAR: - if (status == RRDCALC_STATUS_UNDEFINED) - status = RRDCALC_STATUS_CLEAR; - break; + switch (critical_status) { + case RRDCALC_STATUS_CLEAR: + if (status == RRDCALC_STATUS_UNDEFINED) + status = RRDCALC_STATUS_CLEAR; + break; - case RRDCALC_STATUS_RAISED: - status = RRDCALC_STATUS_CRITICAL; - break; + case RRDCALC_STATUS_RAISED: + status = RRDCALC_STATUS_CRITICAL; + break; - default: - break; - } + default: + break; + } - // -------------------------------------------------------- - // 
check if the new status and the old differ + // -------------------------------------------------------- + // check if the new status and the old differ - if (status != rc->status) { - worker_is_busy(WORKER_HEALTH_JOB_ALARM_LOG_ENTRY); - int delay = 0; + if (status != rc->status) { + worker_is_busy(WORKER_HEALTH_JOB_ALARM_LOG_ENTRY); + int delay = 0; - // apply trigger hysteresis + // apply trigger hysteresis - if (now > rc->delay_up_to_timestamp) { - rc->delay_up_current = rc->delay_up_duration; - rc->delay_down_current = rc->delay_down_duration; - rc->delay_last = 0; - rc->delay_up_to_timestamp = 0; - } else { - rc->delay_up_current = (int) (rc->delay_up_current * rc->delay_multiplier); - if (rc->delay_up_current > rc->delay_max_duration) - rc->delay_up_current = rc->delay_max_duration; + if (now > rc->delay_up_to_timestamp) { + rc->delay_up_current = rc->delay_up_duration; + rc->delay_down_current = rc->delay_down_duration; + rc->delay_last = 0; + rc->delay_up_to_timestamp = 0; + } else { + rc->delay_up_current = (int) (rc->delay_up_current * rc->delay_multiplier); + if (rc->delay_up_current > rc->delay_max_duration) + rc->delay_up_current = rc->delay_max_duration; - rc->delay_down_current = (int) (rc->delay_down_current * rc->delay_multiplier); - if (rc->delay_down_current > rc->delay_max_duration) - rc->delay_down_current = rc->delay_max_duration; - } + rc->delay_down_current = (int) (rc->delay_down_current * rc->delay_multiplier); + if (rc->delay_down_current > rc->delay_max_duration) + rc->delay_down_current = rc->delay_max_duration; + } - if (status > rc->status) - delay = rc->delay_up_current; - else - delay = rc->delay_down_current; - - // COMMENTED: because we do need to send raising alarms - // if(now + delay < rc->delay_up_to_timestamp) - // delay = (int)(rc->delay_up_to_timestamp - now); - - rc->delay_last = delay; - rc->delay_up_to_timestamp = now + delay; - - ALARM_ENTRY *ae = health_create_alarm_entry( - host, - rc->id, - rc->next_event_id++, - 
rc->config_hash_id, - now, - rc->name, - rc->rrdset->id, - rc->rrdset->context, - rc->rrdset->family, - rc->classification, - rc->component, - rc->type, - rc->exec, - rc->recipient, - now - rc->last_status_change, - rc->old_value, - rc->value, - rc->status, - status, - rc->source, - rc->units, - rc->info, - rc->delay_last, - ( - ((rc->options & RRDCALC_OPTION_NO_CLEAR_NOTIFICATION)? HEALTH_ENTRY_FLAG_NO_CLEAR_NOTIFICATION : 0) | - ((rc->run_flags & RRDCALC_FLAG_SILENCED)? HEALTH_ENTRY_FLAG_SILENCED : 0) | - (rrdcalc_isrepeating(rc)?HEALTH_ENTRY_FLAG_IS_REPEATING:0) - ) - ); - - health_alarm_log_add_entry(host, ae); - - log_health("[%s]: Alert event for [%s.%s], value [%s], status [%s].", rrdhost_hostname(host), ae_chart_name(ae), ae_name(ae), ae_new_value_string(ae), rrdcalc_status2string(ae->new_status)); - - rc->last_status_change = now; - rc->old_status = rc->status; - rc->status = status; - } + if (status > rc->status) + delay = rc->delay_up_current; + else + delay = rc->delay_down_current; + + // COMMENTED: because we do need to send raising alarms + // if(now + delay < rc->delay_up_to_timestamp) + // delay = (int)(rc->delay_up_to_timestamp - now); + + rc->delay_last = delay; + rc->delay_up_to_timestamp = now + delay; + + ALARM_ENTRY *ae = health_create_alarm_entry( + host, + rc->id, + rc->next_event_id++, + rc->config_hash_id, + now, + rc->name, + rc->rrdset->id, + rc->rrdset->context, + rc->rrdset->family, + rc->classification, + rc->component, + rc->type, + rc->exec, + rc->recipient, + now - rc->last_status_change, + rc->old_value, + rc->value, + rc->status, + status, + rc->source, + rc->units, + rc->info, + rc->delay_last, + ( + ((rc->options & RRDCALC_OPTION_NO_CLEAR_NOTIFICATION)? HEALTH_ENTRY_FLAG_NO_CLEAR_NOTIFICATION : 0) | + ((rc->run_flags & RRDCALC_FLAG_SILENCED)? 
HEALTH_ENTRY_FLAG_SILENCED : 0) | + (rrdcalc_isrepeating(rc)?HEALTH_ENTRY_FLAG_IS_REPEATING:0) + ) + ); - rc->last_updated = now; - rc->next_update = now + rc->update_every; + health_alarm_log_add_entry(host, ae); - if (next_run > rc->next_update) - next_run = rc->next_update; - } - foreach_rrdcalc_in_rrdhost_done(rc); + log_health("[%s]: Alert event for [%s.%s], value [%s], status [%s].", rrdhost_hostname(host), ae_chart_name(ae), ae_name(ae), ae_new_value_string(ae), rrdcalc_status2string(ae->new_status)); - // process repeating alarms - foreach_rrdcalc_in_rrdhost_read(host, rc) { - int repeat_every = 0; - if(unlikely(rrdcalc_isrepeating(rc) && rc->delay_up_to_timestamp <= now)) { - if(unlikely(rc->status == RRDCALC_STATUS_WARNING)) { - rc->run_flags &= ~RRDCALC_FLAG_RUN_ONCE; - repeat_every = rc->warn_repeat_every; - } else if(unlikely(rc->status == RRDCALC_STATUS_CRITICAL)) { - rc->run_flags &= ~RRDCALC_FLAG_RUN_ONCE; - repeat_every = rc->crit_repeat_every; - } else if(unlikely(rc->status == RRDCALC_STATUS_CLEAR)) { - if(!(rc->run_flags & RRDCALC_FLAG_RUN_ONCE)) { - if(rc->old_status == RRDCALC_STATUS_CRITICAL) { - repeat_every = 1; - } else if (rc->old_status == RRDCALC_STATUS_WARNING) { - repeat_every = 1; + rc->last_status_change = now; + rc->old_status = rc->status; + rc->status = status; + } + + rc->last_updated = now; + rc->next_update = now + rc->update_every; + + if (next_run > rc->next_update) + next_run = rc->next_update; + } + foreach_rrdcalc_in_rrdhost_done(rc); + + // process repeating alarms + foreach_rrdcalc_in_rrdhost_read(host, rc) { + if(unlikely(!service_running(SERVICE_HEALTH))) + break; + + int repeat_every = 0; + if(unlikely(rrdcalc_isrepeating(rc) && rc->delay_up_to_timestamp <= now)) { + if(unlikely(rc->status == RRDCALC_STATUS_WARNING)) { + rc->run_flags &= ~RRDCALC_FLAG_RUN_ONCE; + repeat_every = rc->warn_repeat_every; + } else if(unlikely(rc->status == RRDCALC_STATUS_CRITICAL)) { + rc->run_flags &= ~RRDCALC_FLAG_RUN_ONCE; + 
repeat_every = rc->crit_repeat_every; + } else if(unlikely(rc->status == RRDCALC_STATUS_CLEAR)) { + if(!(rc->run_flags & RRDCALC_FLAG_RUN_ONCE)) { + if(rc->old_status == RRDCALC_STATUS_CRITICAL) { + repeat_every = 1; + } else if (rc->old_status == RRDCALC_STATUS_WARNING) { + repeat_every = 1; + } } } + } else { + continue; } - } else { - continue; - } - if(unlikely(repeat_every > 0 && (rc->last_repeat + repeat_every) <= now)) { - worker_is_busy(WORKER_HEALTH_JOB_ALARM_LOG_ENTRY); - rc->last_repeat = now; - if (likely(rc->times_repeat < UINT32_MAX)) rc->times_repeat++; - - ALARM_ENTRY *ae = health_create_alarm_entry( - host, - rc->id, - rc->next_event_id++, - rc->config_hash_id, - now, - rc->name, - rc->rrdset->id, - rc->rrdset->context, - rc->rrdset->family, - rc->classification, - rc->component, - rc->type, - rc->exec, - rc->recipient, - now - rc->last_status_change, - rc->old_value, - rc->value, - rc->old_status, - rc->status, - rc->source, - rc->units, - rc->info, - rc->delay_last, - ( - ((rc->options & RRDCALC_OPTION_NO_CLEAR_NOTIFICATION)? HEALTH_ENTRY_FLAG_NO_CLEAR_NOTIFICATION : 0) | - ((rc->run_flags & RRDCALC_FLAG_SILENCED)? 
HEALTH_ENTRY_FLAG_SILENCED : 0) | - (rrdcalc_isrepeating(rc)?HEALTH_ENTRY_FLAG_IS_REPEATING:0) - ) - ); - - ae->last_repeat = rc->last_repeat; - if (!(rc->run_flags & RRDCALC_FLAG_RUN_ONCE) && rc->status == RRDCALC_STATUS_CLEAR) { - ae->flags |= HEALTH_ENTRY_RUN_ONCE; + if(unlikely(repeat_every > 0 && (rc->last_repeat + repeat_every) <= now)) { + worker_is_busy(WORKER_HEALTH_JOB_ALARM_LOG_ENTRY); + rc->last_repeat = now; + if (likely(rc->times_repeat < UINT32_MAX)) rc->times_repeat++; + + ALARM_ENTRY *ae = health_create_alarm_entry( + host, + rc->id, + rc->next_event_id++, + rc->config_hash_id, + now, + rc->name, + rc->rrdset->id, + rc->rrdset->context, + rc->rrdset->family, + rc->classification, + rc->component, + rc->type, + rc->exec, + rc->recipient, + now - rc->last_status_change, + rc->old_value, + rc->value, + rc->old_status, + rc->status, + rc->source, + rc->units, + rc->info, + rc->delay_last, + ( + ((rc->options & RRDCALC_OPTION_NO_CLEAR_NOTIFICATION)? HEALTH_ENTRY_FLAG_NO_CLEAR_NOTIFICATION : 0) | + ((rc->run_flags & RRDCALC_FLAG_SILENCED)? 
HEALTH_ENTRY_FLAG_SILENCED : 0) | + (rrdcalc_isrepeating(rc)?HEALTH_ENTRY_FLAG_IS_REPEATING:0) + ) + ); + + ae->last_repeat = rc->last_repeat; + if (!(rc->run_flags & RRDCALC_FLAG_RUN_ONCE) && rc->status == RRDCALC_STATUS_CLEAR) { + ae->flags |= HEALTH_ENTRY_RUN_ONCE; + } + rc->run_flags |= RRDCALC_FLAG_RUN_ONCE; + health_process_notifications(host, ae); + debug(D_HEALTH, "Notification sent for the repeating alarm %u.", ae->alarm_id); + health_alarm_wait_for_execution(ae); + health_alarm_log_free_one_nochecks_nounlink(ae); } - rc->run_flags |= RRDCALC_FLAG_RUN_ONCE; - health_process_notifications(host, ae); - debug(D_HEALTH, "Notification sent for the repeating alarm %u.", ae->alarm_id); - health_alarm_wait_for_execution(ae); - health_alarm_log_free_one_nochecks_nounlink(ae); } + foreach_rrdcalc_in_rrdhost_done(rc); } - foreach_rrdcalc_in_rrdhost_done(rc); - } - if (unlikely(netdata_exit)) - break; + if (unlikely(!service_running(SERVICE_HEALTH))) + break; - // execute notifications - // and cleanup - worker_is_busy(WORKER_HEALTH_JOB_ALARM_LOG_PROCESS); - health_alarm_log_process(host); + // execute notifications + // and cleanup + worker_is_busy(WORKER_HEALTH_JOB_ALARM_LOG_PROCESS); + health_alarm_log_process(host); - if (unlikely(netdata_exit)) { - // wait for all notifications to finish before allowing health to be cleaned up - ALARM_ENTRY *ae; - while (NULL != (ae = alarm_notifications_in_progress.head)) { - health_alarm_wait_for_execution(ae); + if (unlikely(!service_running(SERVICE_HEALTH))) { + // wait for all notifications to finish before allowing health to be cleaned up + ALARM_ENTRY *ae; + while (NULL != (ae = alarm_notifications_in_progress.head)) { + if(unlikely(!service_running(SERVICE_HEALTH))) + break; + + health_alarm_wait_for_execution(ae); + } + break; } - break; - } + } //for each host + + rrd_unlock(); // wait for all notifications to finish before allowing health to be cleaned up ALARM_ENTRY *ae; while (NULL != (ae = 
alarm_notifications_in_progress.head)) { + if(unlikely(!service_running(SERVICE_HEALTH))) + break; + health_alarm_wait_for_execution(ae); } #ifdef ENABLE_ACLK - if (netdata_cloud_setting && unlikely(host->aclk_alert_reloaded) && loop > (marked_aclk_reload_loop + 2)) { - sql_queue_removed_alerts_to_aclk(host); - host->aclk_alert_reloaded = 0; + if (netdata_cloud_setting && unlikely(aclk_alert_reloaded) && loop > (marked_aclk_reload_loop + 2)) { + rrdhost_foreach_read(host) { + if(unlikely(!service_running(SERVICE_HEALTH))) + break; + + if (unlikely(!host->health.health_enabled)) + continue; + + sql_queue_removed_alerts_to_aclk(host); + } + aclk_alert_reloaded = 0; marked_aclk_reload_loop = 0; } #endif - if(unlikely(netdata_exit)) + if(unlikely(!service_running(SERVICE_HEALTH))) break; - health_sleep(next_run, loop, host); + health_sleep(next_run, loop); } // forever @@ -1554,28 +1561,13 @@ void *health_main(void *ptr) { void health_add_host_labels(void) { DICTIONARY *labels = localhost->rrdlabels; + // The source should be CONF, but when it is set, these labels are exported by default ('send configured labels' in exporting.conf). + // Their export seems to break exporting to Graphite, see https://github.com/netdata/netdata/issues/14084. + int is_ephemeral = appconfig_get_boolean(&netdata_config, CONFIG_SECTION_HEALTH, "is ephemeral", CONFIG_BOOLEAN_NO); - rrdlabels_add(labels, "_is_ephemeral", is_ephemeral ? "true" : "false", RRDLABEL_SRC_CONFIG); + rrdlabels_add(labels, "_is_ephemeral", is_ephemeral ? "true" : "false", RRDLABEL_SRC_AUTO); int has_unstable_connection = appconfig_get_boolean(&netdata_config, CONFIG_SECTION_HEALTH, "has unstable connection", CONFIG_BOOLEAN_NO); - rrdlabels_add(labels, "_has_unstable_connection", has_unstable_connection ? "true" : "false", RRDLABEL_SRC_CONFIG); + rrdlabels_add(labels, "_has_unstable_connection", has_unstable_connection ? 
"true" : "false", RRDLABEL_SRC_AUTO);
 }

-void health_thread_spawn(RRDHOST * host) {
-    if(!host->health_spawn) {
-        char tag[NETDATA_THREAD_TAG_MAX + 1];
-        snprintfz(tag, NETDATA_THREAD_TAG_MAX, "HEALTH[%s]", rrdhost_hostname(host));
-        struct health_state *health = callocz(1, sizeof(*health));
-        health->host = host;
-
-        if(netdata_thread_create(&host->health_thread, tag, NETDATA_THREAD_OPTION_JOINABLE, health_main, (void *) health)) {
-            log_health("[%s]: Failed to create new thread for client.", rrdhost_hostname(host));
-            error("HEALTH [%s]: Failed to create new thread for client.", rrdhost_hostname(host));
-        }
-        else {
-            log_health("[%s]: Created new thread for client.", rrdhost_hostname(host));
-            host->health_spawn = 1;
-            host->aclk_alert_reloaded = 1;
-        }
-    }
-}
diff --git a/health/health.d/cgroups.conf b/health/health.d/cgroups.conf
index 4bfe38b65..08260ff6d 100644
--- a/health/health.d/cgroups.conf
+++ b/health/health.d/cgroups.conf
@@ -51,7 +51,7 @@ component: Network
    lookup: average -1m unaligned of received
     units: packets
     every: 10s
-     info: average number of packets received by the network interface $family over the last minute
+     info: average number of packets received by the network interface ${label:device} over the last minute

  template: cgroup_10s_received_packets_storm
        on: cgroup.net_packets
@@ -66,7 +66,7 @@ component: Network
      warn: $this > (($status >= $WARNING)?(200):(5000))
      crit: $this > (($status == $CRITICAL)?(5000):(6000))
   options: no-clear-notification
-     info: ratio of average number of received packets for the network interface $family over the last 10 seconds, \
+     info: ratio of average number of received packets for the network interface ${label:device} over the last 10 seconds, \
            compared to the rate over the last minute
        to: sysadmin

@@ -121,7 +121,7 @@ component: Network
    lookup: average -1m unaligned of received
     units: packets
     every: 10s
-     info: average number of packets received by the network interface $family over the last minute
+     info: average number of packets received by the network interface ${label:device} over the last minute

  template: k8s_cgroup_10s_received_packets_storm
        on: k8s.cgroup.net_packets
@@ -136,6 +136,6 @@ component: Network
      warn: $this > (($status >= $WARNING)?(200):(5000))
      crit: $this > (($status == $CRITICAL)?(5000):(6000))
   options: no-clear-notification
-     info: ratio of average number of received packets for the network interface $family over the last 10 seconds, \
+     info: ratio of average number of received packets for the network interface ${label:device} over the last 10 seconds, \
            compared to the rate over the last minute
        to: sysadmin
diff --git a/health/health.d/consul.conf b/health/health.d/consul.conf
new file mode 100644
index 000000000..dff6d2df3
--- /dev/null
+++ b/health/health.d/consul.conf
@@ -0,0 +1,159 @@
+# you can disable an alarm notification by setting the 'to' line to: silent
+
+ template: consul_license_expiration_time
+       on: consul.license_expiration_time
+    class: Errors
+     type: ServiceMesh
+component: Consul
+     calc: $license_expiration
+    every: 60m
+    units: seconds
+     warn: $this < 14*24*60*60
+     crit: $this < 7*24*60*60
+     info: Consul Enterprise licence expiration time on node ${label:node_name} datacenter ${label:datacenter}
+       to: sysadmin
+
+ template: consul_autopilot_health_status
+       on: consul.autopilot_health_status
+    class: Errors
+     type: ServiceMesh
+component: Consul
+     calc: $unhealthy
+    every: 10s
+    units: status
+     warn: $this == 1
+    delay: down 5m multiplier 1.5 max 1h
+     info: datacenter ${label:datacenter} cluster is unhealthy as reported by server ${label:node_name}
+       to: sysadmin
+
+ template: consul_autopilot_server_health_status
+       on: consul.autopilot_server_health_status
+    class: Errors
+     type: ServiceMesh
+component: Consul
+     calc: $unhealthy
+    every: 10s
+    units: status
+     warn: $this == 1
+    delay: down 5m multiplier 1.5 max 1h
+     info: server ${label:node_name} from datacenter ${label:datacenter} is unhealthy
+       to: sysadmin
+
+ template: consul_raft_leader_last_contact_time
+       on: consul.raft_leader_last_contact_time
+    class: Errors
+     type: ServiceMesh
+component: Consul
+   lookup: average -1m unaligned of quantile_0.5
+    every: 10s
+    units: milliseconds
+     warn: $this > (($status >= $WARNING) ? (150) : (200))
+     crit: $this > (($status == $CRITICAL) ? (200) : (500))
+    delay: down 5m multiplier 1.5 max 1h
+     info: median time elapsed since leader server ${label:node_name} datacenter ${label:datacenter} was last able to contact the follower nodes
+       to: sysadmin
+
+ template: consul_raft_leadership_transitions
+       on: consul.raft_leadership_transitions_rate
+    class: Errors
+     type: ServiceMesh
+component: Consul
+   lookup: sum -1m unaligned
+    every: 10s
+    units: transitions
+     warn: $this > 0
+    delay: down 5m multiplier 1.5 max 1h
+     info: there has been a leadership change and server ${label:node_name} datacenter ${label:datacenter} has become the leader
+       to: sysadmin
+
+ template: consul_raft_thread_main_saturation
+       on: consul.raft_thread_main_saturation_perc
+    class: Utilization
+     type: ServiceMesh
+component: Consul
+   lookup: average -1m unaligned of quantile_0.9
+    every: 10s
+    units: percentage
+     warn: $this > (($status >= $WARNING) ? (40) : (50))
+    delay: down 5m multiplier 1.5 max 1h
+     info: average saturation of the main Raft goroutine on server ${label:node_name} datacenter ${label:datacenter}
+       to: sysadmin
+
+ template: consul_raft_thread_fsm_saturation
+       on: consul.raft_thread_fsm_saturation_perc
+    class: Utilization
+     type: ServiceMesh
+component: Consul
+   lookup: average -1m unaligned of quantile_0.9
+    every: 10s
+    units: milliseconds
+     warn: $this > (($status >= $WARNING) ? (40) : (50))
+    delay: down 5m multiplier 1.5 max 1h
+     info: average saturation of the FSM Raft goroutine on server ${label:node_name} datacenter ${label:datacenter}
+       to: sysadmin
+
+ template: consul_client_rpc_requests_exceeded
+       on: consul.client_rpc_requests_exceeded_rate
+    class: Errors
+     type: ServiceMesh
+component: Consul
+   lookup: sum -1m unaligned
+    every: 10s
+    units: requests
+     warn: $this > (($status >= $WARNING) ? (0) : (5))
+    delay: down 5m multiplier 1.5 max 1h
+     info: number of rate-limited RPC requests made by server ${label:node_name} datacenter ${label:datacenter}
+       to: sysadmin
+
+ template: consul_client_rpc_requests_failed
+       on: consul.client_rpc_requests_failed_rate
+    class: Errors
+     type: ServiceMesh
+component: Consul
+   lookup: sum -1m unaligned
+    every: 10s
+    units: requests
+     warn: $this > (($status >= $WARNING) ? (0) : (5))
+    delay: down 5m multiplier 1.5 max 1h
+     info: number of failed RPC requests made by server ${label:node_name} datacenter ${label:datacenter}
+       to: sysadmin
+
+ template: consul_node_health_check_status
+       on: consul.node_health_check_status
+    class: Errors
+     type: ServiceMesh
+component: Consul
+     calc: $warning + $critical
+    every: 10s
+    units: status
+     warn: $this != nan AND $this != 0
+    delay: down 5m multiplier 1.5 max 1h
+     info: node health check ${label:check_name} has failed on server ${label:node_name} datacenter ${label:datacenter}
+       to: sysadmin
+
+ template: consul_service_health_check_status
+       on: consul.service_health_check_status
+    class: Errors
+     type: ServiceMesh
+component: Consul
+     calc: $warning + $critical
+    every: 10s
+    units: status
+     warn: $this == 1
+    delay: down 5m multiplier 1.5 max 1h
+     info: service health check ${label:check_name} for service ${label:service_name} has failed on server ${label:node_name} datacenter ${label:datacenter}
+       to: sysadmin
+
+ template: consul_gc_pause_time
+       on: consul.gc_pause_time
+    class: Errors
+     type: ServiceMesh
+component: Consul
+   lookup: sum -1m unaligned
+    every: 10s
+    units: seconds
+     warn: $this > (($status >= $WARNING) ? (1) : (2))
+     crit: $this > (($status >= $WARNING) ? (2) : (5))
+    delay: down 5m multiplier 1.5 max 1h
+     info: time spent in stop-the-world garbage collection pauses on server ${label:node_name} datacenter ${label:datacenter}
+       to: sysadmin
diff --git a/health/health.d/disks.conf b/health/health.d/disks.conf
index 5daff61a1..fd207fbc1 100644
--- a/health/health.d/disks.conf
+++ b/health/health.d/disks.conf
@@ -23,7 +23,7 @@ component: Disk
      warn: $this > (($status >= $WARNING ) ? (80) : (90))
      crit: $this > (($status == $CRITICAL) ? (90) : (98))
     delay: up 1m down 15m multiplier 1.5 max 1h
-     info: disk $family space utilization
+     info: disk ${label:mount_point} space utilization
        to: sysadmin

  template: disk_inode_usage
@@ -40,7 +40,7 @@ component: Disk
      warn: $this > (($status >= $WARNING) ? (80) : (90))
      crit: $this > (($status == $CRITICAL) ? (90) : (98))
     delay: up 1m down 15m multiplier 1.5 max 1h
-     info: disk $family inode utilization
+     info: disk ${label:mount_point} inode utilization
        to: sysadmin


@@ -147,7 +147,7 @@ component: Disk
     every: 1m
      warn: $this > 98 * (($status >= $WARNING) ? (0.7) : (1))
     delay: down 15m multiplier 1.2 max 1h
-     info: average percentage of time $family disk was busy over the last 10 minutes
+     info: average percentage of time ${label:device} disk was busy over the last 10 minutes
        to: silent


@@ -169,5 +169,5 @@ component: Disk
     every: 1m
      warn: $this > 5000 * (($status >= $WARNING) ?
(0.7) : (1)) delay: down 15m multiplier 1.2 max 1h - info: average backlog size of the $family disk over the last 10 minutes + info: average backlog size of the ${label:device} disk over the last 10 minutes to: silent diff --git a/health/health.d/dns_query.conf b/health/health.d/dns_query.conf index b9d6c2374..bf9397d85 100644 --- a/health/health.d/dns_query.conf +++ b/health/health.d/dns_query.conf @@ -10,5 +10,5 @@ component: DNS every: 10s warn: $this != nan && $this != 1 delay: up 30s down 5m multiplier 1.5 max 1h - info: DNS request type $label:record_type to server $label:server is unsuccessful + info: DNS request type ${label:record_type} to server ${label:server} is unsuccessful to: sysadmin diff --git a/health/health.d/elasticsearch.conf b/health/health.d/elasticsearch.conf new file mode 100644 index 000000000..47f8e1eb9 --- /dev/null +++ b/health/health.d/elasticsearch.conf @@ -0,0 +1,73 @@ +# you can disable an alarm notification by setting the 'to' line to: silent + +# 'red' is a threshold, can't lookup the 'red' dimension - using simple pattern is a workaround. + + template: elasticsearch_cluster_health_status_red + on: elasticsearch.cluster_health_status + class: Errors + type: SearchEngine +component: Elasticsearch + lookup: average -5s unaligned of *ed + every: 10s + units: status + warn: $this == 1 + delay: down 5m multiplier 1.5 max 1h + info: cluster health status is red. + to: sysadmin + +# the idea of '-10m' is to handle yellow status after node restart, +# (usually) no action is required because Elasticsearch will automatically restore the green status. + template: elasticsearch_cluster_health_status_yellow + on: elasticsearch.cluster_health_status + class: Errors + type: SearchEngine +component: Elasticsearch + lookup: average -10m unaligned of yellow + every: 1m + units: status + warn: $this == 1 + delay: down 5m multiplier 1.5 max 1h + info: cluster health status is yellow. 
+ to: sysadmin + + template: elasticsearch_node_index_health_red + on: elasticsearch.node_index_health + class: Errors + type: SearchEngine +component: Elasticsearch + lookup: average -5s unaligned of *ed + every: 10s + units: status + warn: $this == 1 + delay: down 5m multiplier 1.5 max 1h + info: node index $label:index health status is red. + to: sysadmin + +# don't convert 'lookup' value to seconds in 'calc' due to UI showing seconds as hh:mm:ss (0 as now). + + template: elasticsearch_node_indices_search_time_query + on: elasticsearch.node_indices_search_time + class: Workload + type: SearchEngine +component: Elasticsearch + lookup: average -10m unaligned of query + every: 10s + units: milliseconds + warn: $this > (($status >= $WARNING) ? (20 * 1000) : (30 * 1000)) + delay: down 5m multiplier 1.5 max 1h + info: search performance is degraded, queries run slowly. + to: sysadmin + + template: elasticsearch_node_indices_search_time_fetch + on: elasticsearch.node_indices_search_time + class: Workload + type: SearchEngine +component: Elasticsearch + lookup: average -10m unaligned of fetch + every: 10s + units: milliseconds + warn: $this > (($status >= $WARNING) ? (3 * 1000) : (5 * 1000)) + crit: $this > (($status == $CRITICAL) ? (5 * 1000) : (30 * 1000)) + delay: down 5m multiplier 1.5 max 1h + info: search performance is degraded, fetches run slowly. + to: sysadmin diff --git a/health/health.d/fping.conf b/health/health.d/fping.conf deleted file mode 100644 index bb22419fa..000000000 --- a/health/health.d/fping.conf +++ /dev/null @@ -1,64 +0,0 @@ - - template: fping_last_collected_secs - families: * - on: fping.latency - class: Latency - type: Other -component: Network - calc: $now - $last_collected_t - units: seconds ago - every: 10s - warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) - crit: $this > (($status == $CRITICAL) ? 
($update_every) : (60 * $update_every)) - delay: down 5m multiplier 1.5 max 1h - info: number of seconds since the last successful data collection - to: sysadmin - - template: fping_host_reachable - families: * - on: fping.latency - class: Errors - type: Other -component: Network - calc: $average != nan - units: up/down - every: 10s - crit: $this == 0 - delay: down 30m multiplier 1.5 max 2h - info: reachability status of the network host (0: unreachable, 1: reachable) - to: sysadmin - - template: fping_host_latency - families: * - on: fping.latency - class: Latency - type: Other -component: Network - lookup: average -10s unaligned of average - units: ms - every: 10s - green: 500 - red: 1000 - warn: $this > $green OR $max > $red - crit: $this > $red - delay: down 30m multiplier 1.5 max 2h - info: average latency to the network host over the last 10 seconds - to: sysadmin - - template: fping_packet_loss - families: * - on: fping.quality - class: Errors - type: System -component: Network - lookup: average -10m unaligned of returned - calc: 100 - $this - green: 1 - red: 10 - units: % - every: 10s - warn: $this > $green - crit: $this > $red - delay: down 30m multiplier 1.5 max 2h - info: packet loss ratio to the network host over the last 10 minutes - to: sysadmin diff --git a/health/health.d/httpcheck.conf b/health/health.d/httpcheck.conf index 599c47acc..2008b000d 100644 --- a/health/health.d/httpcheck.conf +++ b/health/health.d/httpcheck.conf @@ -10,7 +10,7 @@ component: HTTP endpoint calc: ($this < 75) ? 
(0) : ($this) every: 5s units: up/down - info: average ratio of successful HTTP requests over the last minute (at least 75%) + info: HTTP endpoint ${label:url} liveness status to: silent template: httpcheck_web_service_bad_content @@ -25,8 +25,7 @@ component: HTTP endpoint warn: $this >= 10 AND $this < 40 crit: $this >= 40 delay: down 5m multiplier 1.5 max 1h - info: average ratio of HTTP responses with unexpected content over the last 5 minutes - options: no-clear-notification + info: percentage of HTTP responses from ${label:url} with unexpected content in the last 5 minutes to: webmaster template: httpcheck_web_service_bad_status @@ -41,8 +40,7 @@ component: HTTP endpoint warn: $this >= 10 AND $this < 40 crit: $this >= 40 delay: down 5m multiplier 1.5 max 1h - info: average ratio of HTTP responses with unexpected status over the last 5 minutes - options: no-clear-notification + info: percentage of HTTP responses from ${label:url} with unexpected status in the last 5 minutes to: webmaster template: httpcheck_web_service_timeouts @@ -54,9 +52,13 @@ component: HTTP endpoint lookup: average -5m unaligned percentage of timeout every: 10s units: % - info: average ratio of HTTP request timeouts over the last 5 minutes + warn: $this >= 10 AND $this < 40 + crit: $this >= 40 + delay: down 5m multiplier 1.5 max 1h + info: percentage of timed-out HTTP requests to ${label:url} in the last 5 minutes + to: webmaster - template: httpcheck_no_web_service_connections + template: httpcheck_web_service_no_connection families: * on: httpcheck.status class: Errors @@ -65,48 +67,8 @@ component: HTTP endpoint lookup: average -5m unaligned percentage of no_connection every: 10s units: % - info: average ratio of failed requests during the last 5 minutes - -# combined timeout & no connection alarm - template: httpcheck_web_service_unreachable - families: * - on: httpcheck.status - class: Errors - type: Web Server -component: HTTP endpoint - calc: ($httpcheck_no_web_service_connections >= 
$httpcheck_web_service_timeouts) ? ($httpcheck_no_web_service_connections) : ($httpcheck_web_service_timeouts) - units: % - every: 10s - warn: ($httpcheck_no_web_service_connections >= 10 OR $httpcheck_web_service_timeouts >= 10) AND ($httpcheck_no_web_service_connections < 40 OR $httpcheck_web_service_timeouts < 40) - crit: $httpcheck_no_web_service_connections >= 40 OR $httpcheck_web_service_timeouts >= 40 - delay: down 5m multiplier 1.5 max 1h - info: ratio of failed requests either due to timeouts or no connection over the last 5 minutes - options: no-clear-notification - to: webmaster - - template: httpcheck_1h_web_service_response_time - families: * - on: httpcheck.responsetime - class: Latency - type: Other -component: HTTP endpoint - lookup: average -1h unaligned of time - every: 30s - units: ms - info: average HTTP response time over the last hour - - template: httpcheck_web_service_slow - families: * - on: httpcheck.responsetime - class: Latency - type: Web Server -component: HTTP endpoint - lookup: average -3m unaligned of time - units: ms - every: 10s - warn: ($this > ($httpcheck_1h_web_service_response_time * 2) ) - crit: ($this > ($httpcheck_1h_web_service_response_time * 3) ) + warn: $this >= 10 AND $this < 40 + crit: $this >= 40 delay: down 5m multiplier 1.5 max 1h - info: average HTTP response time over the last 3 minutes, compared to the average over the last hour - options: no-clear-notification + info: percentage of failed HTTP requests to ${label:url} in the last 5 minutes to: webmaster diff --git a/health/health.d/kubelet.conf b/health/health.d/kubelet.conf index c2778cc5e..428b6ee91 100644 --- a/health/health.d/kubelet.conf +++ b/health/health.d/kubelet.conf @@ -9,7 +9,7 @@ class: Errors type: Kubernetes component: Kubelet - calc: $kubelet_node_config_error + calc: $experiencing_error units: bool every: 10s warn: $this == 1 @@ -20,12 +20,12 @@ component: Kubelet # Failed Token() requests to the alternate token source template: 
kubelet_token_requests - lookup: sum -10s of token_fail_count on: k8s_kubelet.kubelet_token_requests class: Errors type: Kubernetes component: Kubelet - units: failed requests + lookup: sum -10s of failed + units: requests every: 10s warn: $this > 0 delay: down 1m multiplier 1.5 max 2h @@ -35,11 +35,11 @@ component: Kubelet # Docker and runtime operation errors template: kubelet_operations_error - lookup: sum -1m on: k8s_kubelet.kubelet_operations_errors class: Errors type: Kubernetes component: Kubelet + lookup: sum -1m units: errors every: 10s warn: $this > (($status >= $WARNING) ? (0) : (20)) @@ -67,7 +67,7 @@ component: Kubelet class: Latency type: Kubernetes component: Kubelet - lookup: average -1m unaligned of kubelet_pleg_relist_latency_05 + lookup: average -1m unaligned of 0.5 units: microseconds every: 10s info: average Pod Lifecycle Event Generator relisting latency over the last minute (quantile 0.5) @@ -77,7 +77,7 @@ component: Kubelet class: Latency type: Kubernetes component: Kubelet - lookup: average -10s unaligned of kubelet_pleg_relist_latency_05 + lookup: average -10s unaligned of 0.5 calc: $this * 100 / (($kubelet_1m_pleg_relist_latency_quantile_05 < 1000)?(1000):($kubelet_1m_pleg_relist_latency_quantile_05)) every: 10s units: % @@ -95,7 +95,7 @@ component: Kubelet class: Latency type: Kubernetes component: Kubelet - lookup: average -1m unaligned of kubelet_pleg_relist_latency_09 + lookup: average -1m unaligned of 0.9 units: microseconds every: 10s info: average Pod Lifecycle Event Generator relisting latency over the last minute (quantile 0.9) @@ -105,7 +105,7 @@ component: Kubelet class: Latency type: Kubernetes component: Kubelet - lookup: average -10s unaligned of kubelet_pleg_relist_latency_09 + lookup: average -10s unaligned of 0.9 calc: $this * 100 / (($kubelet_1m_pleg_relist_latency_quantile_09 < 1000)?(1000):($kubelet_1m_pleg_relist_latency_quantile_09)) every: 10s units: % @@ -123,7 +123,7 @@ component: Kubelet class: Latency type: 
Kubernetes component: Kubelet - lookup: average -1m unaligned of kubelet_pleg_relist_latency_099 + lookup: average -1m unaligned of 0.99 units: microseconds every: 10s info: average Pod Lifecycle Event Generator relisting latency over the last minute (quantile 0.99) @@ -133,7 +133,7 @@ component: Kubelet class: Latency type: Kubernetes component: Kubelet - lookup: average -10s unaligned of kubelet_pleg_relist_latency_099 + lookup: average -10s unaligned of 0.99 calc: $this * 100 / (($kubelet_1m_pleg_relist_latency_quantile_099 < 1000)?(1000):($kubelet_1m_pleg_relist_latency_quantile_099)) every: 10s units: % diff --git a/health/health.d/load.conf b/health/health.d/load.conf index 0bd872f85..75989c57f 100644 --- a/health/health.d/load.conf +++ b/health/health.d/load.conf @@ -11,7 +11,7 @@ component: Load os: linux hosts: * - calc: ($active_processors == nan or $active_processors == inf or $active_processors < 2) ? ( 2 ) : ( $active_processors ) + calc: ($active_processors == nan or $active_processors == 0) ? (nan) : ( ($active_processors < 2) ? ( 2 ) : ( $active_processors ) ) units: cpus every: 1m info: number of active CPU cores in the system @@ -28,6 +28,7 @@ component: Load os: linux hosts: * lookup: max -1m unaligned of load15 + calc: ($load_cpu_number == nan) ? (nan) : ($this) units: load every: 1m warn: ($this * 100 / $load_cpu_number) > (($status >= $WARNING) ? 175 : 200) @@ -43,6 +44,7 @@ component: Load os: linux hosts: * lookup: max -1m unaligned of load5 + calc: ($load_cpu_number == nan) ? (nan) : ($this) units: load every: 1m warn: ($this * 100 / $load_cpu_number) > (($status >= $WARNING) ? 350 : 400) @@ -58,6 +60,7 @@ component: Load os: linux hosts: * lookup: max -1m unaligned of load1 + calc: ($load_cpu_number == nan) ? (nan) : ($this) units: load every: 1m warn: ($this * 100 / $load_cpu_number) > (($status >= $WARNING) ? 
700 : 800) diff --git a/health/health.d/mdstat.conf b/health/health.d/mdstat.conf index cedaa000e..ed980a26a 100644 --- a/health/health.d/mdstat.conf +++ b/health/health.d/mdstat.conf @@ -20,7 +20,7 @@ component: RAID every: 10s calc: $down crit: $this > 0 - info: number of devices in the down state for the $family array. \ + info: number of devices in the down state for the ${label:device} ${label:raid_level} array. \ Any number > 0 indicates that the array is degraded. to: sysadmin @@ -35,7 +35,7 @@ component: RAID every: 60s warn: $this > 1024 delay: up 30m - info: number of unsynchronized blocks for the $family array + info: number of unsynchronized blocks for the ${label:device} ${label:raid_level} array to: sysadmin template: mdstat_nonredundant_last_collected diff --git a/health/health.d/net.conf b/health/health.d/net.conf index 9d5b3b8d3..a0723f303 100644 --- a/health/health.d/net.conf +++ b/health/health.d/net.conf @@ -15,7 +15,7 @@ component: Network calc: ( $nic_speed_max > 0 ) ? ( $nic_speed_max) : ( nan ) units: Mbit every: 10s - info: network interface $family current speed + info: network interface ${label:device} current speed template: 1m_received_traffic_overflow on: net.net @@ -31,7 +31,7 @@ component: Network every: 10s warn: $this > (($status >= $WARNING) ? (85) : (90)) delay: up 1m down 1m multiplier 1.5 max 1h - info: average inbound utilization for the network interface $family over the last minute + info: average inbound utilization for the network interface ${label:device} over the last minute to: sysadmin template: 1m_sent_traffic_overflow @@ -48,7 +48,7 @@ component: Network every: 10s warn: $this > (($status >= $WARNING) ? 
(85) : (90)) delay: up 1m down 1m multiplier 1.5 max 1h - info: average outbound utilization for the network interface $family over the last minute + info: average outbound utilization for the network interface ${label:device} over the last minute to: sysadmin # ----------------------------------------------------------------------------- @@ -72,7 +72,7 @@ component: Network lookup: sum -10m unaligned absolute of inbound units: packets every: 1m - info: number of inbound dropped packets for the network interface $family in the last 10 minutes + info: number of inbound dropped packets for the network interface ${label:device} in the last 10 minutes template: outbound_packets_dropped on: net.drops @@ -85,7 +85,7 @@ component: Network lookup: sum -10m unaligned absolute of outbound units: packets every: 1m - info: number of outbound dropped packets for the network interface $family in the last 10 minutes + info: number of outbound dropped packets for the network interface ${label:device} in the last 10 minutes template: inbound_packets_dropped_ratio on: net.packets @@ -101,7 +101,7 @@ component: Network every: 1m warn: $this >= 2 delay: up 1m down 1h multiplier 1.5 max 2h - info: ratio of inbound dropped packets for the network interface $family over the last 10 minutes + info: ratio of inbound dropped packets for the network interface ${label:device} over the last 10 minutes to: sysadmin template: outbound_packets_dropped_ratio @@ -118,7 +118,7 @@ component: Network every: 1m warn: $this >= 2 delay: up 1m down 1h multiplier 1.5 max 2h - info: ratio of outbound dropped packets for the network interface $family over the last 10 minutes + info: ratio of outbound dropped packets for the network interface ${label:device} over the last 10 minutes to: sysadmin template: wifi_inbound_packets_dropped_ratio @@ -135,7 +135,7 @@ component: Network every: 1m warn: $this >= 10 delay: up 1m down 1h multiplier 1.5 max 2h - info: ratio of inbound dropped packets for the network 
interface $family over the last 10 minutes + info: ratio of inbound dropped packets for the network interface ${label:device} over the last 10 minutes to: sysadmin template: wifi_outbound_packets_dropped_ratio @@ -152,7 +152,7 @@ component: Network every: 1m warn: $this >= 10 delay: up 1m down 1h multiplier 1.5 max 2h - info: ratio of outbound dropped packets for the network interface $family over the last 10 minutes + info: ratio of outbound dropped packets for the network interface ${label:device} over the last 10 minutes to: sysadmin # ----------------------------------------------------------------------------- @@ -171,7 +171,7 @@ component: Network every: 1m warn: $this >= 5 delay: down 1h multiplier 1.5 max 2h - info: number of inbound errors for the network interface $family in the last 10 minutes + info: number of inbound errors for the network interface ${label:device} in the last 10 minutes to: sysadmin template: interface_outbound_errors @@ -187,7 +187,7 @@ component: Network every: 1m warn: $this >= 5 delay: down 1h multiplier 1.5 max 2h - info: number of outbound errors for the network interface $family in the last 10 minutes + info: number of outbound errors for the network interface ${label:device} in the last 10 minutes to: sysadmin # ----------------------------------------------------------------------------- @@ -211,7 +211,7 @@ component: Network every: 1m warn: $this > 0 delay: down 1h multiplier 1.5 max 2h - info: number of FIFO errors for the network interface $family in the last 10 minutes + info: number of FIFO errors for the network interface ${label:device} in the last 10 minutes to: sysadmin # ----------------------------------------------------------------------------- @@ -234,7 +234,7 @@ component: Network lookup: average -1m unaligned of received units: packets every: 10s - info: average number of packets received by the network interface $family over the last minute + info: average number of packets received by the network interface 
${label:device} over the last minute template: 10s_received_packets_storm on: net.packets @@ -251,6 +251,6 @@ component: Network warn: $this > (($status >= $WARNING)?(200):(5000)) crit: $this > (($status == $CRITICAL)?(5000):(6000)) options: no-clear-notification - info: ratio of average number of received packets for the network interface $family over the last 10 seconds, \ + info: ratio of average number of received packets for the network interface ${label:device} over the last 10 seconds, \ compared to the rate over the last minute to: sysadmin diff --git a/health/health.d/nvme.conf b/health/health.d/nvme.conf index 5f729d52b..b7c0e6fd4 100644 --- a/health/health.d/nvme.conf +++ b/health/health.d/nvme.conf @@ -11,5 +11,5 @@ component: Disk every: 10s crit: $this != nan AND $this != 0 delay: down 5m multiplier 1.5 max 2h - info: NVMe device $label:device has critical warnings + info: NVMe device ${label:device} has critical warnings to: sysadmin diff --git a/health/health.d/ping.conf b/health/health.d/ping.conf index cbe7c30c9..fa8213ad3 100644 --- a/health/health.d/ping.conf +++ b/health/health.d/ping.conf @@ -12,7 +12,7 @@ component: Network every: 10s crit: $this == 0 delay: down 30m multiplier 1.5 max 2h - info: network host $label:host reachability status + info: network host ${label:host} reachability status to: sysadmin template: ping_packet_loss @@ -29,7 +29,7 @@ component: Network warn: $this > $green crit: $this > $red delay: down 30m multiplier 1.5 max 2h - info: packet loss percentage to the network host $label:host over the last 10 minutes + info: packet loss percentage to the network host ${label:host} over the last 10 minutes to: sysadmin template: ping_host_latency @@ -46,5 +46,5 @@ component: Network warn: $this > $green OR $max > $red crit: $this > $red delay: down 30m multiplier 1.5 max 2h - info: average latency to the network host $label:host over the last 10 seconds + info: average latency to the network host ${label:host} over the last 10 
seconds to: sysadmin diff --git a/health/health.d/portcheck.conf b/health/health.d/portcheck.conf index 8cbd7729c..e8908404c 100644 --- a/health/health.d/portcheck.conf +++ b/health/health.d/portcheck.conf @@ -10,7 +10,7 @@ component: TCP endpoint calc: ($this < 75) ? (0) : ($this) every: 5s units: up/down - info: average ratio of successful connections over the last minute (at least 75%) + info: TCP host ${label:host} port ${label:port} liveness status to: silent template: portcheck_connection_timeouts @@ -25,7 +25,7 @@ component: TCP endpoint warn: $this >= 10 AND $this < 40 crit: $this >= 40 delay: down 5m multiplier 1.5 max 1h - info: average ratio of timeouts over the last 5 minutes + info: percentage of timed-out TCP connections to host ${label:host} port ${label:port} in the last 5 minutes to: sysadmin template: portcheck_connection_fails @@ -40,5 +40,5 @@ component: TCP endpoint warn: $this >= 10 AND $this < 40 crit: $this >= 40 delay: down 5m multiplier 1.5 max 1h - info: average ratio of failed connections over the last 5 minutes + info: percentage of failed TCP connections to host ${label:host} port ${label:port} in the last 5 minutes to: sysadmin diff --git a/health/health.d/postgres.conf b/health/health.d/postgres.conf index 66d034cfe..67b25673b 100644 --- a/health/health.d/postgres.conf +++ b/health/health.d/postgres.conf @@ -58,7 +58,7 @@ component: PostgreSQL warn: $this < (($status >= $WARNING) ? (70) : (60)) crit: $this < (($status == $CRITICAL) ? (60) : (50)) delay: down 15m multiplier 1.5 max 1h - info: average cache hit ratio in db $label:database over the last minute + info: average cache hit ratio in db ${label:database} over the last minute to: dba template: postgres_db_transactions_rollback_ratio @@ -72,7 +72,7 @@ component: PostgreSQL every: 1m warn: $this > (($status >= $WARNING) ? 
(0) : (2)) delay: down 15m multiplier 1.5 max 1h - info: average aborted transactions percentage in db $label:database over the last five minutes + info: average aborted transactions percentage in db ${label:database} over the last five minutes to: dba template: postgres_db_deadlocks_rate @@ -86,7 +86,7 @@ component: PostgreSQL every: 1m warn: $this > (($status >= $WARNING) ? (0) : (10)) delay: down 15m multiplier 1.5 max 1h - info: number of deadlocks detected in db $label:database in the last minute + info: number of deadlocks detected in db ${label:database} in the last minute to: dba # Table alarms @@ -104,7 +104,7 @@ component: PostgreSQL warn: $this < (($status >= $WARNING) ? (70) : (60)) crit: $this < (($status == $CRITICAL) ? (60) : (50)) delay: down 15m multiplier 1.5 max 1h - info: average cache hit ratio in db $label:database table $label:table over the last minute + info: average cache hit ratio in db ${label:database} table ${label:table} over the last minute to: dba template: postgres_table_index_cache_io_ratio @@ -120,7 +120,7 @@ component: PostgreSQL warn: $this < (($status >= $WARNING) ? (70) : (60)) crit: $this < (($status == $CRITICAL) ? (60) : (50)) delay: down 15m multiplier 1.5 max 1h - info: average index cache hit ratio in db $label:database table $label:table over the last minute + info: average index cache hit ratio in db ${label:database} table ${label:table} over the last minute to: dba template: postgres_table_toast_cache_io_ratio @@ -136,7 +136,7 @@ component: PostgreSQL warn: $this < (($status >= $WARNING) ? (70) : (60)) crit: $this < (($status == $CRITICAL) ? 
(60) : (50)) delay: down 15m multiplier 1.5 max 1h - info: average TOAST hit ratio in db $label:database table $label:table over the last minute + info: average TOAST hit ratio in db ${label:database} table ${label:table} over the last minute to: dba template: postgres_table_toast_index_cache_io_ratio @@ -152,7 +152,7 @@ component: PostgreSQL warn: $this < (($status >= $WARNING) ? (70) : (60)) crit: $this < (($status == $CRITICAL) ? (60) : (50)) delay: down 15m multiplier 1.5 max 1h - info: average index TOAST hit ratio in db $label:database table $label:table over the last minute + info: average index TOAST hit ratio in db ${label:database} table ${label:table} over the last minute to: dba template: postgres_table_bloat_size_perc @@ -161,13 +161,13 @@ component: PostgreSQL type: Database component: PostgreSQL hosts: * - calc: $bloat + calc: ($table_size > (1024 * 1024 * 100)) ? ($bloat) : (0) units: % every: 1m warn: $this > (($status >= $WARNING) ? (60) : (70)) crit: $this > (($status == $CRITICAL) ? 
(70) : (80)) delay: down 15m multiplier 1.5 max 1h - info: bloat size percentage in db $label:database table $label:table + info: bloat size percentage in db ${label:database} table ${label:table} to: dba template: postgres_table_last_autovacuum_time @@ -180,7 +180,7 @@ component: PostgreSQL units: seconds every: 1m warn: $this != nan AND $this > (60 * 60 * 24 * 7) - info: time elapsed since db $label:database table $label:table was vacuumed by the autovacuum daemon + info: time elapsed since db ${label:database} table ${label:table} was vacuumed by the autovacuum daemon to: dba template: postgres_table_last_autoanalyze_time @@ -193,7 +193,7 @@ component: PostgreSQL units: seconds every: 1m warn: $this != nan AND $this > (60 * 60 * 24 * 7) - info: time elapsed since db $label:database table $label:table was analyzed by the autovacuum daemon + info: time elapsed since db ${label:database} table ${label:table} was analyzed by the autovacuum daemon to: dba # Index alarms @@ -204,11 +204,11 @@ component: PostgreSQL type: Database component: PostgreSQL hosts: * - calc: $bloat + calc: ($index_size > (1024 * 1024 * 10)) ? ($bloat) : (0) units: % every: 1m warn: $this > (($status >= $WARNING) ? (60) : (70)) crit: $this > (($status == $CRITICAL) ? 
(70) : (80))
     delay: down 15m multiplier 1.5 max 1h
-     info: bloat size percentage in db $label:database table $label:table index $label:index
+     info: bloat size percentage in db ${label:database} table ${label:table} index ${label:index}
        to: dba
diff --git a/health/health.d/zfs.conf b/health/health.d/zfs.conf
index 785838d47..7f8ea2793 100644
--- a/health/health.d/zfs.conf
+++ b/health/health.d/zfs.conf
@@ -24,7 +24,7 @@ component: File system
     every: 10s
      warn: $this > 0
     delay: down 1m multiplier 1.5 max 1h
-     info: ZFS pool $family state is degraded
+     info: ZFS pool ${label:pool} state is degraded
        to: sysadmin
 
  template: zfs_pool_state_crit
@@ -37,5 +37,5 @@ component: File system
     every: 10s
      crit: $this > 0
     delay: down 1m multiplier 1.5 max 1h
-     info: ZFS pool $family state is faulted or unavail
+     info: ZFS pool ${label:pool} state is faulted or unavail
        to: sysadmin
diff --git a/health/health.h b/health/health.h
index 15d8326ee..50c3e3452 100644
--- a/health/health.h
+++ b/health/health.h
@@ -31,6 +31,7 @@ extern unsigned int default_health_enabled;
 
 #define HEALTH_SILENCERS_MAX_FILE_LEN 10000
 extern char *silencers_filename;
+extern SIMPLE_PATTERN *conf_enabled_alarms;
 
 void health_init(void);
 
@@ -48,9 +49,6 @@ int health_alarm_log_open(RRDHOST *host);
 void health_alarm_log_save(RRDHOST *host, ALARM_ENTRY *ae);
 void health_alarm_log_load(RRDHOST *host);
 
-void health_thread_spawn(RRDHOST *host);
-void health_thread_stop(RRDHOST *host);
-
 ALARM_ENTRY* health_create_alarm_entry(
     RRDHOST *host,
     uint32_t alarm_id,
@@ -79,11 +77,6 @@ ALARM_ENTRY* health_create_alarm_entry(
 
 void health_alarm_log_add_entry(RRDHOST *host, ALARM_ENTRY *ae);
 
-struct health_state {
-    RRDHOST *host;
-    netdata_thread_t thread;
-};
-
 void health_readdir(RRDHOST *host, const char *user_path, const char *stock_path, const char *subpath);
 char *health_user_config_dir(void);
 char *health_stock_config_dir(void);
diff --git a/health/health_config.c b/health/health_config.c
index f9decfad5..55d5e10eb 100644
--- a/health/health_config.c +++ b/health/health_config.c @@ -553,33 +553,37 @@ static int health_readfile(const char *filename, void *data) { rt = NULL; } - rc = callocz(1, sizeof(RRDCALC)); - rc->next_event_id = 1; - - { - char *tmp = strdupz(value); - if(rrdvar_fix_name(tmp)) - error("Health configuration renamed alarm '%s' to '%s'", value, tmp); - - rc->name = string_strdupz(tmp); - freez(tmp); - } - - rc->source = health_source_file(line, filename); - rc->green = NAN; - rc->red = NAN; - rc->value = NAN; - rc->old_value = NAN; - rc->delay_multiplier = 1.0; - rc->old_status = RRDCALC_STATUS_UNINITIALIZED; - rc->warn_repeat_every = host->health_default_warn_repeat_every; - rc->crit_repeat_every = host->health_default_crit_repeat_every; - if (alert_cfg) - alert_config_free(alert_cfg); - alert_cfg = callocz(1, sizeof(struct alert_config)); - - alert_cfg->alarm = string_dup(rc->name); - ignore_this = 0; + if (simple_pattern_matches(conf_enabled_alarms, value)) { + rc = callocz(1, sizeof(RRDCALC)); + rc->next_event_id = 1; + + { + char *tmp = strdupz(value); + if(rrdvar_fix_name(tmp)) + error("Health configuration renamed alarm '%s' to '%s'", value, tmp); + + rc->name = string_strdupz(tmp); + freez(tmp); + } + + rc->source = health_source_file(line, filename); + rc->green = NAN; + rc->red = NAN; + rc->value = NAN; + rc->old_value = NAN; + rc->delay_multiplier = 1.0; + rc->old_status = RRDCALC_STATUS_UNINITIALIZED; + rc->warn_repeat_every = host->health.health_default_warn_repeat_every; + rc->crit_repeat_every = host->health.health_default_crit_repeat_every; + if (alert_cfg) + alert_config_free(alert_cfg); + alert_cfg = callocz(1, sizeof(struct alert_config)); + + alert_cfg->alarm = string_dup(rc->name); + ignore_this = 0; + } else { + rc = NULL; + } } else if(hash == hash_template && !strcasecmp(key, HEALTH_TEMPLATE_KEY)) { if(rc) { @@ -599,29 +603,33 @@ static int health_readfile(const char *filename, void *data) { rrdcalctemplate_add_from_config(host, rt); } - rt = 
callocz(1, sizeof(RRDCALCTEMPLATE)); + if (simple_pattern_matches(conf_enabled_alarms, value)) { + rt = callocz(1, sizeof(RRDCALCTEMPLATE)); - { - char *tmp = strdupz(value); - if(rrdvar_fix_name(tmp)) - error("Health configuration renamed template '%s' to '%s'", value, tmp); - - rt->name = string_strdupz(tmp); - freez(tmp); - } + { + char *tmp = strdupz(value); + if(rrdvar_fix_name(tmp)) + error("Health configuration renamed template '%s' to '%s'", value, tmp); - rt->source = health_source_file(line, filename); - rt->green = NAN; - rt->red = NAN; - rt->delay_multiplier = (float)1.0; - rt->warn_repeat_every = host->health_default_warn_repeat_every; - rt->crit_repeat_every = host->health_default_crit_repeat_every; - if (alert_cfg) - alert_config_free(alert_cfg); - alert_cfg = callocz(1, sizeof(struct alert_config)); + rt->name = string_strdupz(tmp); + freez(tmp); + } - alert_cfg->template_key = string_dup(rt->name); - ignore_this = 0; + rt->source = health_source_file(line, filename); + rt->green = NAN; + rt->red = NAN; + rt->delay_multiplier = (float)1.0; + rt->warn_repeat_every = host->health.health_default_warn_repeat_every; + rt->crit_repeat_every = host->health.health_default_crit_repeat_every; + if (alert_cfg) + alert_config_free(alert_cfg); + alert_cfg = callocz(1, sizeof(struct alert_config)); + + alert_cfg->template_key = string_dup(rt->name); + ignore_this = 0; + } else { + rt = NULL; + } } else if(hash == hash_os && !strcasecmp(key, HEALTH_OS_KEY)) { char *os_match = value; @@ -1163,7 +1171,8 @@ void sql_refresh_hashes(void) } void health_readdir(RRDHOST *host, const char *user_path, const char *stock_path, const char *subpath) { - if(unlikely(!host->health_enabled) && !rrdhost_flag_check(host, RRDHOST_FLAG_INITIALIZED_HEALTH)) { + if(unlikely((!host->health.health_enabled) && !rrdhost_flag_check(host, RRDHOST_FLAG_INITIALIZED_HEALTH)) || + !service_running(SERVICE_HEALTH)) { debug(D_HEALTH, "CONFIG health is not enabled for host '%s'", 
rrdhost_hostname(host)); return; } diff --git a/health/health_json.c b/health/health_json.c index 2dd59fd46..8cabaa0bf 100644 --- a/health/health_json.c +++ b/health/health_json.c @@ -75,8 +75,8 @@ void health_alarm_entry2json_nolock(BUFFER *wb, ALARM_ENTRY *ae, RRDHOST *host) , (ae->flags & HEALTH_ENTRY_FLAG_UPDATED)?"true":"false" , (unsigned long)ae->exec_run_timestamp , (ae->flags & HEALTH_ENTRY_FLAG_EXEC_FAILED)?"true":"false" - , ae->exec?ae_exec(ae):string2str(host->health_default_exec) - , ae->recipient?ae_recipient(ae):string2str(host->health_default_recipient) + , ae->exec?ae_exec(ae):string2str(host->health.health_default_exec) + , ae->recipient?ae_recipient(ae):string2str(host->health.health_default_recipient) , ae->exec_code , ae_source(ae) , edit_command @@ -219,8 +219,8 @@ static inline void health_rrdcalc2json_nolock(RRDHOST *host, BUFFER *wb, RRDCALC , (rc->rrdset)?"true":"false" , (rc->run_flags & RRDCALC_FLAG_DISABLED)?"true":"false" , (rc->run_flags & RRDCALC_FLAG_SILENCED)?"true":"false" - , rc->exec?rrdcalc_exec(rc):string2str(host->health_default_exec) - , rc->recipient?rrdcalc_recipient(rc):string2str(host->health_default_recipient) + , rc->exec?rrdcalc_exec(rc):string2str(host->health.health_default_exec) + , rc->recipient?rrdcalc_recipient(rc):string2str(host->health.health_default_recipient) , rrdcalc_source(rc) , rrdcalc_units(rc) , rrdcalc_info(rc) @@ -372,7 +372,7 @@ void health_alarms2json(RRDHOST *host, BUFFER *wb, int all) { "\n\t\"alarms\": {\n", rrdhost_hostname(host), (host->health_log.next_log_id > 0)?(host->health_log.next_log_id - 1):0, - host->health_enabled?"true":"false", + host->health.health_enabled?"true":"false", (unsigned long)now_realtime_sec()); health_alarms2json_fill_alarms(host, wb, all, health_rrdcalc2json_nolock); diff --git a/health/health_log.c b/health/health_log.c index 8105e01ae..d3417493b 100644 --- a/health/health_log.c +++ b/health/health_log.c @@ -3,149 +3,10 @@ #include "health.h" // 
---------------------------------------------------------------------------- -// health alarm log load/save -// no need for locking - only one thread is reading / writing the alarms log - -inline int health_alarm_log_open(RRDHOST *host) { - if(host->health_log_fp) - fclose(host->health_log_fp); - - host->health_log_fp = fopen(host->health_log_filename, "a"); - - if(host->health_log_fp) { - if (setvbuf(host->health_log_fp, NULL, _IOLBF, 0) != 0) - error("HEALTH [%s]: cannot set line buffering on health log file '%s'.", rrdhost_hostname(host), host->health_log_filename); - return 0; - } - - error("HEALTH [%s]: cannot open health log file '%s'. Health data will be lost in case of netdata or server crash.", rrdhost_hostname(host), host->health_log_filename); - return -1; -} - -static inline void health_alarm_log_close(RRDHOST *host) { - if(host->health_log_fp) { - fclose(host->health_log_fp); - host->health_log_fp = NULL; - } -} - -static inline void health_log_rotate(RRDHOST *host) { - static size_t rotate_every = 0; - - if(unlikely(rotate_every == 0)) { - rotate_every = (size_t)config_get_number(CONFIG_SECTION_HEALTH, "rotate log every lines", 2000); - if(rotate_every < 100) rotate_every = 100; - } - - if(unlikely(host->health_log_entries_written > rotate_every)) { - if(unlikely(host->health_log_fp)) { - health_alarm_log_close(host); - - char old_filename[FILENAME_MAX + 1]; - snprintfz(old_filename, FILENAME_MAX, "%s.old", host->health_log_filename); - - if(unlink(old_filename) == -1 && errno != ENOENT) - error("HEALTH [%s]: cannot remove old alarms log file '%s'", rrdhost_hostname(host), old_filename); - - if(link(host->health_log_filename, old_filename) == -1 && errno != ENOENT) - error("HEALTH [%s]: cannot move file '%s' to '%s'.", rrdhost_hostname(host), host->health_log_filename, old_filename); - - if(unlink(host->health_log_filename) == -1 && errno != ENOENT) - error("HEALTH [%s]: cannot remove old alarms log file '%s'", rrdhost_hostname(host), 
host->health_log_filename); - - // open it with truncate - host->health_log_fp = fopen(host->health_log_filename, "w"); - - if(host->health_log_fp) - fclose(host->health_log_fp); - else - error("HEALTH [%s]: cannot truncate health log '%s'", rrdhost_hostname(host), host->health_log_filename); - - host->health_log_fp = NULL; - - host->health_log_entries_written = 0; - health_alarm_log_open(host); - } - } -} - -inline void health_label_log_save(RRDHOST *host) { - health_log_rotate(host); - - if(unlikely(host->health_log_fp)) { - BUFFER *wb = buffer_create(1024); - - rrdlabels_to_buffer(localhost->rrdlabels, wb, "", "=", "", "\t ", NULL, NULL, NULL, NULL); - char *write = (char *) buffer_tostring(wb); - - if (unlikely(fprintf(host->health_log_fp, "L\t%s", write) < 0)) - error("HEALTH [%s]: failed to save alarm log entry to '%s'. Health data may be lost in case of abnormal restart.", - rrdhost_hostname(host), host->health_log_filename); - else - host->health_log_entries_written++; - - buffer_free(wb); - } -} inline void health_alarm_log_save(RRDHOST *host, ALARM_ENTRY *ae) { - health_log_rotate(host); - if(unlikely(host->health_log_fp)) { - if(unlikely(fprintf(host->health_log_fp - , "%c\t%s" - "\t%08x\t%08x\t%08x\t%08x\t%08x" - "\t%08x\t%08x\t%08x" - "\t%08x\t%08x\t%08x" - "\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s" - "\t%d\t%d\t%d\t%d" - "\t" NETDATA_DOUBLE_FORMAT_AUTO "\t" NETDATA_DOUBLE_FORMAT_AUTO - "\t%016"PRIx64"" - "\t%s\t%s\t%s" - "\n" - , (ae->flags & HEALTH_ENTRY_FLAG_SAVED)?'U':'A' - , rrdhost_hostname(host) - - , ae->unique_id - , ae->alarm_id - , ae->alarm_event_id - , ae->updated_by_id - , ae->updates_id - - , (uint32_t)ae->when - , (uint32_t)ae->duration - , (uint32_t)ae->non_clear_duration - , (uint32_t)ae->flags - , (uint32_t)ae->exec_run_timestamp - , (uint32_t)ae->delay_up_to_timestamp - - , ae_name(ae) - , ae_chart_name(ae) - , ae_family(ae) - , ae_exec(ae) - , ae_recipient(ae) - , ae_source(ae) - , ae_units(ae) - , ae_info(ae) - , ae->exec_code - , 
ae->new_status - , ae->old_status - , ae->delay - - , ae->new_value - , ae->old_value - , (uint64_t)ae->last_repeat - , (ae->classification)?ae_classification(ae):"Unknown" - , (ae->component)?ae_component(ae):"Unknown" - , (ae->type)?ae_type(ae):"Unknown" - ) < 0)) - error("HEALTH [%s]: failed to save alarm log entry to '%s'. Health data may be lost in case of abnormal restart.", rrdhost_hostname(host), host->health_log_filename); - else { - ae->flags |= HEALTH_ENTRY_FLAG_SAVED; - host->health_log_entries_written++; - } - }else - sql_health_alarm_log_save(host, ae); + sql_health_alarm_log_save(host, ae); #ifdef ENABLE_ACLK if (netdata_cloud_setting) { @@ -154,291 +15,6 @@ inline void health_alarm_log_save(RRDHOST *host, ALARM_ENTRY *ae) { #endif } -static uint32_t is_valid_alarm_id(RRDHOST *host, const char *chart, const char *name, uint32_t alarm_id) -{ - STRING *chart_string = string_strdupz(chart); - STRING *name_string = string_strdupz(name); - - uint32_t ret = 1; - - ALARM_ENTRY *ae; - for(ae = host->health_log.alarms; ae ;ae = ae->next) { - if (unlikely(ae->alarm_id == alarm_id && (!(chart_string == ae->chart && name_string == ae->name)))) { - ret = 0; - break; - } - } - - string_freez(chart_string); - string_freez(name_string); - - return ret; -} - -static inline ssize_t health_alarm_log_read(RRDHOST *host, FILE *fp, const char *filename) { - errno = 0; - - char *s, *buf = mallocz(65536 + 1); - size_t line = 0, len = 0; - ssize_t loaded = 0, updated = 0, errored = 0, duplicate = 0; - - DICTIONARY *all_rrdcalcs = dictionary_create( - DICT_OPTION_NAME_LINK_DONT_CLONE | DICT_OPTION_VALUE_LINK_DONT_CLONE | DICT_OPTION_DONT_OVERWRITE_VALUE); - RRDCALC *rc; - foreach_rrdcalc_in_rrdhost_read(host, rc) { - dictionary_set(all_rrdcalcs, rrdcalc_name(rc), rc, sizeof(*rc)); - } - foreach_rrdcalc_in_rrdhost_done(rc); - - netdata_rwlock_rdlock(&host->health_log.alarm_log_rwlock); - - while((s = fgets_trim_len(buf, 65536, fp, &len))) { - 
host->health_log_entries_written++; - line++; - - int max_entries = 33, entries = 0; - char *pointers[max_entries]; - - pointers[entries++] = s++; - while(*s) { - if(unlikely(*s == '\t')) { - *s = '\0'; - pointers[entries++] = ++s; - if(entries >= max_entries) { - error("HEALTH [%s]: line %zu of file '%s' has more than %d entries. Ignoring excessive entries.", rrdhost_hostname(host), line, filename, max_entries); - break; - } - } - else s++; - } - - if(likely(*pointers[0] == 'L')) - continue; - - if(likely(*pointers[0] == 'U' || *pointers[0] == 'A')) { - ALARM_ENTRY *ae = NULL; - - if(entries < 27) { - error("HEALTH [%s]: line %zu of file '%s' should have at least 27 entries, but it has %d. Ignoring it.", rrdhost_hostname(host), line, filename, entries); - errored++; - continue; - } - - // check that we have valid ids - uint32_t unique_id = (uint32_t)strtoul(pointers[2], NULL, 16); - if(!unique_id) { - error("HEALTH [%s]: line %zu of file '%s' states alarm entry with invalid unique id %u (%s). Ignoring it.", rrdhost_hostname(host), line, filename, unique_id, pointers[2]); - errored++; - continue; - } - - uint32_t alarm_id = (uint32_t)strtoul(pointers[3], NULL, 16); - if(!alarm_id) { - error("HEALTH [%s]: line %zu of file '%s' states alarm entry for invalid alarm id %u (%s). Ignoring it.", rrdhost_hostname(host), line, filename, alarm_id, pointers[3]); - errored++; - continue; - } - - // Check if we got last_repeat field - time_t last_repeat = 0; - if(entries > 27) { - char* alarm_name = pointers[13]; - last_repeat = (time_t)strtoul(pointers[27], NULL, 16); - - rc = dictionary_get(all_rrdcalcs, alarm_name); - if(unlikely(rc)) { - if (rrdcalc_isrepeating(rc)) { - rc->last_repeat = last_repeat; - // We iterate through repeating alarm entries only to - // find the latest last_repeat timestamp. Otherwise, - // there is no need to keep them in memory. 
- continue; - } - } - } - - if(unlikely(*pointers[0] == 'A')) { - // make sure it is properly numbered - if(unlikely(host->health_log.alarms && unique_id < host->health_log.alarms->unique_id)) { - error( "HEALTH [%s]: line %zu of file '%s' has alarm log entry %u in wrong order. Ignoring it." - , rrdhost_hostname(host), line, filename, unique_id); - errored++; - continue; - } - - ae = callocz(1, sizeof(ALARM_ENTRY)); - } - else if(unlikely(*pointers[0] == 'U')) { - // find the original - for(ae = host->health_log.alarms; ae ; ae = ae->next) { - if(unlikely(unique_id == ae->unique_id)) { - if(unlikely(*pointers[0] == 'A')) { - error("HEALTH [%s]: line %zu of file '%s' adds duplicate alarm log entry %u. Using the later." - , rrdhost_hostname(host), line, filename, unique_id); - *pointers[0] = 'U'; - duplicate++; - } - break; - } - else if(unlikely(unique_id > ae->unique_id)) { - // no need to continue - // the linked list is sorted - ae = NULL; - break; - } - } - } - - // if not found, skip this line - if(unlikely(!ae)) { - // error("HEALTH [%s]: line %zu of file '%s' updates alarm log entry with unique id %u, but it is not found.", host->hostname, line, filename, unique_id); - continue; - } - - // check for a possible host mismatch - //if(strcmp(pointers[1], host->hostname)) - // error("HEALTH [%s]: line %zu of file '%s' provides an alarm for host '%s' but this is named '%s'.", host->hostname, line, filename, pointers[1], host->hostname); - - ae->unique_id = unique_id; - if (!is_valid_alarm_id(host, pointers[14], pointers[13], alarm_id)) { - STRING *chart = string_strdupz(pointers[14]); - STRING *name = string_strdupz(pointers[13]); - alarm_id = rrdcalc_get_unique_id(host, chart, name, NULL); - string_freez(chart); - string_freez(name); - } - ae->alarm_id = alarm_id; - ae->alarm_event_id = (uint32_t)strtoul(pointers[4], NULL, 16); - ae->updated_by_id = (uint32_t)strtoul(pointers[5], NULL, 16); - ae->updates_id = (uint32_t)strtoul(pointers[6], NULL, 16); - - ae->when 
= (uint32_t)strtoul(pointers[7], NULL, 16); - ae->duration = (uint32_t)strtoul(pointers[8], NULL, 16); - ae->non_clear_duration = (uint32_t)strtoul(pointers[9], NULL, 16); - - ae->flags = (uint32_t)strtoul(pointers[10], NULL, 16); - ae->flags |= HEALTH_ENTRY_FLAG_SAVED; - - ae->exec_run_timestamp = (uint32_t)strtoul(pointers[11], NULL, 16); - ae->delay_up_to_timestamp = (uint32_t)strtoul(pointers[12], NULL, 16); - - string_freez(ae->name); - ae->name = string_strdupz(pointers[13]); - - string_freez(ae->chart); - ae->chart = string_strdupz(pointers[14]); - - string_freez(ae->family); - ae->family = string_strdupz(pointers[15]); - - string_freez(ae->exec); - ae->exec = string_strdupz(pointers[16]); - - string_freez(ae->recipient); - ae->recipient = string_strdupz(pointers[17]); - - string_freez(ae->source); - ae->source = string_strdupz(pointers[18]); - - string_freez(ae->units); - ae->units = string_strdupz(pointers[19]); - - string_freez(ae->info); - ae->info = string_strdupz(pointers[20]); - - ae->exec_code = str2i(pointers[21]); - ae->new_status = str2i(pointers[22]); - ae->old_status = str2i(pointers[23]); - ae->delay = str2i(pointers[24]); - - ae->new_value = str2l(pointers[25]); - ae->old_value = str2l(pointers[26]); - - ae->last_repeat = last_repeat; - - if (likely(entries > 30)) { - string_freez(ae->classification); - ae->classification = string_strdupz(pointers[28]); - - string_freez(ae->component); - ae->component = string_strdupz(pointers[29]); - - string_freez(ae->type); - ae->type = string_strdupz(pointers[30]); - } - - char value_string[100 + 1]; - string_freez(ae->old_value_string); - string_freez(ae->new_value_string); - ae->old_value_string = string_strdupz(format_value_and_unit(value_string, 100, ae->old_value, ae_units(ae), -1)); - ae->new_value_string = string_strdupz(format_value_and_unit(value_string, 100, ae->new_value, ae_units(ae), -1)); - - // add it to host if not already there - if(unlikely(*pointers[0] == 'A')) { - ae->next = 
host->health_log.alarms; - host->health_log.alarms = ae; - sql_health_alarm_log_insert(host, ae); - loaded++; - } - else { - sql_health_alarm_log_update(host, ae); - updated++; - } - - if(unlikely(ae->unique_id > host->health_max_unique_id)) - host->health_max_unique_id = ae->unique_id; - - if(unlikely(ae->alarm_id >= host->health_max_alarm_id)) - host->health_max_alarm_id = ae->alarm_id; - } - else { - error("HEALTH [%s]: line %zu of file '%s' is invalid (unrecognized entry type '%s').", rrdhost_hostname(host), line, filename, pointers[0]); - errored++; - } - } - - netdata_rwlock_unlock(&host->health_log.alarm_log_rwlock); - - dictionary_destroy(all_rrdcalcs); - all_rrdcalcs = NULL; - - freez(buf); - - if(!host->health_max_unique_id) host->health_max_unique_id = (uint32_t)now_realtime_sec(); - if(!host->health_max_alarm_id) host->health_max_alarm_id = (uint32_t)now_realtime_sec(); - - host->health_log.next_log_id = host->health_max_unique_id + 1; - if (unlikely(!host->health_log.next_alarm_id || host->health_log.next_alarm_id <= host->health_max_alarm_id)) - host->health_log.next_alarm_id = host->health_max_alarm_id + 1; - - debug(D_HEALTH, "HEALTH [%s]: loaded file '%s' with %zd new alarm entries, updated %zd alarms, errors %zd entries, duplicate %zd", rrdhost_hostname(host), filename, loaded, updated, errored, duplicate); - return loaded; -} - -inline void health_alarm_log_load(RRDHOST *host) { - health_alarm_log_close(host); - - char filename[FILENAME_MAX + 1]; - snprintfz(filename, FILENAME_MAX, "%s.old", host->health_log_filename); - FILE *fp = fopen(filename, "r"); - if(!fp) - error("HEALTH [%s]: cannot open health file: %s", rrdhost_hostname(host), filename); - else { - health_alarm_log_read(host, fp, filename); - fclose(fp); - } - - host->health_log_entries_written = 0; - fp = fopen(host->health_log_filename, "r"); - if(!fp) - error("HEALTH [%s]: cannot open health file: %s", rrdhost_hostname(host), host->health_log_filename); - else { - 
health_alarm_log_read(host, fp, host->health_log_filename); - fclose(fp); - } -} - - // ---------------------------------------------------------------------------- // health alarm log management diff --git a/health/notifications/README.md b/health/notifications/README.md index 0bd6c7649..c59fecced 100644 --- a/health/notifications/README.md +++ b/health/notifications/README.md @@ -1,7 +1,11 @@ <!-- title: "Alarm notifications" description: "Reference documentation for Netdata's alarm notification feature, which supports dozens of endpoints, user roles, and more." -custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/README.md +custom_edit_url: "https://github.com/netdata/netdata/edit/master/health/notifications/README.md" +sidebar_label: "Notifications Reference" +learn_status: "Published" +learn_topic_type: "Tasks" +learn_rel_path: "Operations/Alerts" --> # Alarm notifications diff --git a/health/notifications/alarm-notify.sh.in b/health/notifications/alarm-notify.sh.in index 3edf3d083..0090427a0 100755 --- a/health/notifications/alarm-notify.sh.in +++ b/health/notifications/alarm-notify.sh.in @@ -18,7 +18,7 @@ # - emails by @ktsaou # - slack.com notifications by @ktsaou # - alerta.io notifications by @kattunga -# - discordapp.com notifications by @lowfive +# - discord.com notifications by @lowfive # - pushover.net notifications by @ktsaou # - pushbullet.com push notifications by Tiago Peralta @tperalta82 #1070 # - telegram.org notifications by @hashworks #1002 @@ -484,53 +484,105 @@ msteams_migration # filter a recipient based on alarm event severity filter_recipient_by_criticality() { - local method="${1}" x="${2}" r s - shift - - r="${x/|*/}" # the recipient - s="${x/*|/}" # the severity required for notifying this recipient + local method="${1}" recipient_arg="${2}" + local tracking_dir tracking_file modifier modifiers recipient="${recipient_arg/|*/}" + local mod_critical=0 mod_noclear=0 mod_nowarn=0 # no severity filtering for 
this person - [ "${r}" = "${s}" ] && return 0 + [ "${recipient}" = "${recipient_arg}" ] && return 0 + + # find out which modifiers are set + modifiers="${recipient_arg#*|}" + modifiers="${modifiers//|/ }" # replace pipes with spaces + modifiers="${modifiers,,}" # lowercase + for modifier in ${modifiers}; do + case "${modifier}" in + critical) mod_critical=1 ;; + noclear) mod_noclear=1 ;; + nowarn) mod_nowarn=1 ;; + + *) + error "SEVERITY FILTERING for ${recipient_arg} VIA ${method}: invalid modifier '${modifier}'." + # invalid modifier, always send notification + return 0 + ;; + esac + done - # the severity is invalid - s="${s^^}" - if [ "${s}" != "CRITICAL" ]; then - error "SEVERITY FILTERING for ${x} VIA ${method}: invalid severity '${s,,}', only 'critical' is supported." - return 0 - fi + # set status tracking directory/file var + tracking_dir="${NETDATA_CACHE_DIR}/alarm-notify/${method}/${recipient}" + tracking_file="${tracking_dir}/${alarm_id}" - # create the status tracking directory for this user - [ ! -d "${NETDATA_CACHE_DIR}/alarm-notify/${method}/${r}" ] && - mkdir -p "${NETDATA_CACHE_DIR}/alarm-notify/${method}/${r}" + # create the status tracking directory for this user if "critical" modifier is set + [ "${mod_critical}" == "1" ] && [ ! 
-d "${tracking_dir}" ] && mkdir -p "${tracking_dir}" case "${status}" in - CRITICAL) - # make sure he will get future notifications for this alarm too - touch "${NETDATA_CACHE_DIR}/alarm-notify/${method}/${r}/${alarm_id}" - debug "SEVERITY FILTERING for ${x} VIA ${method}: ALLOW: the alarm is CRITICAL (will now receive next status change)" - return 0 - ;; - - WARNING) - if [ -f "${NETDATA_CACHE_DIR}/alarm-notify/${method}/${r}/${alarm_id}" ]; then - # we do not remove the file, so that he will get future notifications of this alarm - debug "SEVERITY FILTERING for ${x} VIA ${method}: ALLOW: recipient has been notified for this alarm in the past (will still receive next status change)" - return 0 - fi - ;; + CRITICAL) + # "critical" modifier set, create tracking file for future status changes + if [ "${mod_critical}" == "1" ]; then + touch "${tracking_file}" + debug "SEVERITY FILTERING for ${recipient_arg} VIA ${method}: ALLOW: the alarm is CRITICAL (will now receive next status change)" + return 0 + fi - *) - if [ -f "${NETDATA_CACHE_DIR}/alarm-notify/${method}/${r}/${alarm_id}" ]; then - # remove the file, so that he will only receive notifications for CRITICAL states for this alarm - rm "${NETDATA_CACHE_DIR}/alarm-notify/${method}/${r}/${alarm_id}" - debug "SEVERITY FILTERING for ${x} VIA ${method}: ALLOW: recipient has been notified for this alarm (will only receive CRITICAL notifications from now on)" + # always send CRITICAL notification + debug "SEVERITY FILTERING for ${recipient_arg} VIA ${method}: ALLOW: the alarm is CRITICAL" return 0 - fi - ;; + ;; + + WARNING) + # "nowarn" modifier set, block notification + if [ "${mod_nowarn}" == "1" ]; then + debug "SEVERITY FILTERING for ${recipient_arg} VIA ${method}: BLOCK: recipient should not receive this notification (nowarn modifier set)" + return 1 + fi + + # "critical" modifier not set, send notification + if [ "${mod_critical}" == "0" ]; then + debug "SEVERITY FILTERING for ${recipient_arg} VIA ${method}: 
ALLOW: the alarm is WARNING" + return 0 + fi + + # "critical" modifier set, send notification if tracking file exists + if [ "${mod_critical}" == "1" ] && [ -f "${tracking_file}" ]; then + debug "SEVERITY FILTERING for ${recipient_arg} VIA ${method}: ALLOW: recipient has been notified for this alarm in the past (will still receive next status change)" + return 0 + fi + ;; + + CLEAR) + # remove tracking file + [ -f "${tracking_file}" ] && rm "${tracking_file}" + + # "noclear" modifier set, block notification + if [ "${mod_noclear}" == "1" ]; then + debug "SEVERITY FILTERING for ${recipient_arg} VIA ${method}: BLOCK: recipient should not receive this notification (noclear modifier set)" + return 1 + fi + + # "critical" modifier not set, send notification + if [ "${mod_critical}" == "0" ]; then + debug "SEVERITY FILTERING for ${recipient_arg} VIA ${method}: ALLOW: the alarm is CLEAR" + return 0 + fi + + # "critical" modifier set, send notification if tracking file exists + if [ "${mod_critical}" == "1" ] && [ -f "${tracking_file}" ]; then + debug "SEVERITY FILTERING for ${recipient_arg} VIA ${method}: ALLOW: recipient has been notified for this alarm in the past (no status change will be sent from now)" + return 0 + fi + ;; + + *) + # "critical" modifier set, send notification if tracking file exists + if [ "${mod_critical}" == "1" ] && [ -f "${tracking_file}" ]; then + debug "SEVERITY FILTERING for ${recipient_arg} VIA ${method}: ALLOW: recipient has been notified for this alarm in the past (will still receive next status change)" + return 0 + fi + ;; esac - debug "SEVERITY FILTERING for ${x} VIA ${method}: BLOCK: recipient should not receive this notification" + debug "SEVERITY FILTERING for ${recipient_arg} VIA ${method}: BLOCK: recipient should not receive this notification" return 1 } @@ -1480,10 +1532,12 @@ send_slack() { "fields": [ { "title": "${chart}", + "value": "chart", "short": true }, { "title": "${family}", + "value": "family", "short": true } ], diff 
--git a/health/notifications/alerta/README.md b/health/notifications/alerta/README.md index 9603aae01..5ecf55eea 100644 --- a/health/notifications/alerta/README.md +++ b/health/notifications/alerta/README.md @@ -1,7 +1,12 @@ <!-- title: "alerta.io" +sidebar_label: "Alerta" description: "Send alarm notifications to Alerta to see the latest health status updates from multiple nodes in a single interface." -custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/alerta/README.md +custom_edit_url: "https://github.com/netdata/netdata/edit/master/health/notifications/alerta/README.md" +learn_status: "Published" +learn_topic_type: "Tasks" +learn_rel_path: "Setup/Notification/Agent" +learn_autogeneration_metadata: "{'part_of_cloud': False, 'part_of_agent': True}" --> # alerta.io diff --git a/health/notifications/awssns/README.md b/health/notifications/awssns/README.md index fc4a665e9..97768399e 100644 --- a/health/notifications/awssns/README.md +++ b/health/notifications/awssns/README.md @@ -1,7 +1,12 @@ <!-- title: "Amazon SNS" +sidebar_label: "Amazon SNS" description: "hello" -custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/awssns/README.md +custom_edit_url: "https://github.com/netdata/netdata/edit/master/health/notifications/awssns/README.md" +learn_status: "Published" +learn_topic_type: "Tasks" +learn_rel_path: "Setup/Notification/Agent" +learn_autogeneration_metadata: "{'part_of_cloud': False, 'part_of_agent': True}" --> # Amazon SNS diff --git a/health/notifications/custom/README.md b/health/notifications/custom/README.md index edc42623d..df8f88e40 100644 --- a/health/notifications/custom/README.md +++ b/health/notifications/custom/README.md @@ -1,6 +1,11 @@ <!-- title: "Custom" -custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/custom/README.md +sidebar_label: "Custom endpoint" +custom_edit_url: 
"https://github.com/netdata/netdata/edit/master/health/notifications/custom/README.md" +learn_status: "Published" +learn_topic_type: "Tasks" +learn_rel_path: "Setup/Notification/Agent" +learn_autogeneration_metadata: "{'part_of_cloud': False, 'part_of_agent': True}" --> # Custom @@ -8,8 +13,8 @@ custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notificat Netdata allows you to send custom notifications to any endpoint you choose. To configure custom notifications, you will need to customize `health_alarm_notify.conf`. Open the file for editing -using [`edit-config`](/docs/configure/nodes.md#use-edit-config-to-edit-configuration-files) from the [Netdata config -directory](/docs/configure/nodes.md#the-netdata-config-directory), which is typically at `/etc/netdata`. +using [`edit-config`](https://github.com/netdata/netdata/blob/master/docs/configure/nodes.md#use-edit-config-to-edit-configuration-files) from the [Netdata config +directory](https://github.com/netdata/netdata/blob/master/docs/configure/nodes.md#the-netdata-config-directory), which is typically at `/etc/netdata`. You can look at the other senders in `/usr/libexec/netdata/plugins.d/alarm-notify.sh` for examples of how to modify the `custom_sender()` function in `health_alarm_notify.conf`. 
diff --git a/health/notifications/discord/README.md b/health/notifications/discord/README.md index 568d03bc3..b4cbce533 100644 --- a/health/notifications/discord/README.md +++ b/health/notifications/discord/README.md @@ -1,9 +1,14 @@ <!-- -title: "Discordapp.com" -custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/discord/README.md +title: "Discord.com" +sidebar_label: "Discord" +custom_edit_url: "https://github.com/netdata/netdata/edit/master/health/notifications/discord/README.md" +learn_status: "Published" +learn_topic_type: "Tasks" +learn_rel_path: "Setup/Notification/Agent" +learn_autogeneration_metadata: "{'part_of_cloud': False, 'part_of_agent': True}" --> -# Discordapp.com +# Discord.com This is what you will get: @@ -11,7 +16,7 @@ This is what you will get: You need: -1. The **incoming webhook URL** as given by Discord. Create a webhook by following the official [Discord documentation](https://support.discordapp.com/hc/en-us/articles/228383668-Intro-to-Webhooks). You can use the same on all your Netdata servers (or you can have multiple if you like - your decision). +1. The **incoming webhook URL** as given by Discord. Create a webhook by following the official [Discord documentation](https://support.discord.com/hc/en-us/articles/228383668-Intro-to-Webhooks). You can use the same on all your Netdata servers (or you can have multiple if you like - your decision). 2. One or more Discord channels to post the messages to. 
Set them in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system run `/etc/netdata/edit-config health_alarm_notify.conf`), like this: @@ -27,8 +32,8 @@ Set them in `/etc/netdata/health_alarm_notify.conf` (to edit it on your system r SEND_DISCORD="YES" # Create a webhook by following the official documentation - -# https://support.discordapp.com/hc/en-us/articles/228383668-Intro-to-Webhooks -DISCORD_WEBHOOK_URL="https://discordapp.com/api/webhooks/XXXXXXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" +# https://support.discord.com/hc/en-us/articles/228383668-Intro-to-Webhooks +DISCORD_WEBHOOK_URL="https://discord.com/api/webhooks/XXXXXXXXXXXXX/XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" # if a role's recipients are not configured, a notification will be send to # this discord channel (empty = do not send a notification for unconfigured @@ -45,6 +50,4 @@ role_recipients_discord[dba]="databases systems" role_recipients_discord[webmaster]="marketing development" ``` -The keywords `systems`, `databases`, `marketing`, `development` are discordapp.com channels (they should already exist within your discord server). - - +The keywords `systems`, `databases`, `marketing`, `development` are discord.com channels (they should already exist within your discord server). 
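Note on the recipient modifiers: the `alarm-notify.sh.in` hunk earlier in this diff rewrites `filter_recipient_by_criticality()` so that a recipient string can carry several pipe-separated modifiers (`critical`, `noclear`, `nowarn`) instead of only `critical`. The parsing step can be sketched in isolation as below. This is a minimal sketch, not the shipped function: `parse_recipient_modifiers` is a hypothetical name, the printed summary line is an assumption made for illustration, and the tracking-file logic is left out. It needs bash 4 or later for the `${var,,}` lowercase expansion, which the patched script also uses.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of alarm-notify.sh): split a recipient
# argument of the form "recipient|mod1|mod2" using the same parameter
# expansions as the patched filter_recipient_by_criticality().
parse_recipient_modifiers() {
    local recipient_arg="${1}"
    local recipient="${recipient_arg/|*/}"   # text before the first pipe
    local modifier modifiers
    local mod_critical=0 mod_noclear=0 mod_nowarn=0

    # Only parse modifiers when the argument actually contains a pipe.
    if [ "${recipient}" != "${recipient_arg}" ]; then
        modifiers="${recipient_arg#*|}"      # everything after the first pipe
        modifiers="${modifiers//|/ }"        # replace pipes with spaces
        modifiers="${modifiers,,}"           # lowercase
        for modifier in ${modifiers}; do
            case "${modifier}" in
                critical) mod_critical=1 ;;
                noclear)  mod_noclear=1 ;;
                nowarn)   mod_nowarn=1 ;;
                *)
                    # The real function logs an error here and allows the
                    # notification; this sketch only reports the problem.
                    echo "invalid modifier '${modifier}'" >&2
                    ;;
            esac
        done
    fi

    # Assumed output format, for illustration only.
    echo "${recipient} critical=${mod_critical} noclear=${mod_noclear} nowarn=${mod_nowarn}"
}

parse_recipient_modifiers "systems|critical|noclear"
```

A recipient without a pipe yields all three flags at 0, which corresponds to the "no severity filtering for this person" early return in the patched function.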
diff --git a/health/notifications/dynatrace/README.md b/health/notifications/dynatrace/README.md index 3f8ad85b6..a36683933 100644 --- a/health/notifications/dynatrace/README.md +++ b/health/notifications/dynatrace/README.md @@ -1,6 +1,11 @@ <!-- title: "Dynatrace" -custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/dynatrace/README.md +sidebar_label: "Dynatrace Events" +custom_edit_url: "https://github.com/netdata/netdata/edit/master/health/notifications/dynatrace/README.md" +learn_status: "Published" +learn_topic_type: "Tasks" +learn_rel_path: "Setup/Notification/Agent" +learn_autogeneration_metadata: "{'part_of_cloud': False, 'part_of_agent': True}" --> # Dynatrace diff --git a/health/notifications/email/README.md b/health/notifications/email/README.md index 3dc84dd40..01dfd0e6f 100644 --- a/health/notifications/email/README.md +++ b/health/notifications/email/README.md @@ -1,6 +1,11 @@ <!-- title: "Email" -custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/email/README.md +sidebar_label: "Email" +custom_edit_url: "https://github.com/netdata/netdata/edit/master/health/notifications/email/README.md" +learn_status: "Published" +learn_topic_type: "Tasks" +learn_rel_path: "Setup/Notification/Agent" +learn_autogeneration_metadata: "{'part_of_cloud': True, 'part_of_agent': True}" --> # Email diff --git a/health/notifications/flock/README.md b/health/notifications/flock/README.md index b9e0025b3..175f8a466 100644 --- a/health/notifications/flock/README.md +++ b/health/notifications/flock/README.md @@ -1,6 +1,11 @@ <!-- title: "Flock" -custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/flock/README.md +sidebar_label: "Flock" +custom_edit_url: "https://github.com/netdata/netdata/edit/master/health/notifications/flock/README.md" +learn_status: "Published" +learn_topic_type: "Tasks" +learn_rel_path: "Setup/Notification/Agent" +learn_autogeneration_metadata: 
"{'part_of_cloud': False, 'part_of_agent': True}" --> # Flock diff --git a/health/notifications/gotify/README.md b/health/notifications/gotify/README.md index c253c845c..d01502b65 100644 --- a/health/notifications/gotify/README.md +++ b/health/notifications/gotify/README.md @@ -3,6 +3,10 @@ title: "Send notifications to Gotify" description: "Send alerts to your Gotify instance when an alert gets triggered in Netdata." sidebar_label: "Gotify" custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/gotify/README.md +learn_status: "Published" +learn_topic_type: "Tasks" +learn_rel_path: "Setup/Notification/Agent" +learn_autogeneration_metadata: "{'part_of_cloud': False, 'part_of_agent': True}" --> # Send notifications to Gotify @@ -21,7 +25,7 @@ You can generate a new token in the Gotify Web UI. To set up Gotify in Netdata: 1. Switch to your [config -directory](/docs/configure/nodes.md) and edit the file `health_alarm_notify.conf` using the edit config script. +directory](https://github.com/netdata/netdata/blob/master/docs/configure/nodes.md) and edit the file `health_alarm_notify.conf` using the edit config script. ```bash ./edit-config health_alarm_notify.conf diff --git a/health/notifications/hangouts/README.md b/health/notifications/hangouts/README.md index 7554b39cd..45da1bfa0 100644 --- a/health/notifications/hangouts/README.md +++ b/health/notifications/hangouts/README.md @@ -2,7 +2,11 @@ title: "Send notifications to Google Hangouts" description: "Send alerts to Send notifications to Google Hangouts any time an anomaly or performance issue strikes a node in your infrastructure." 
sidebar_label: "Google Hangouts" -custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/hangouts/README.md +custom_edit_url: "https://github.com/netdata/netdata/edit/master/health/notifications/hangouts/README.md" +learn_status: "Published" +learn_topic_type: "Tasks" +learn_rel_path: "Setup/Notification/Agent" +learn_autogeneration_metadata: "{'part_of_cloud': False, 'part_of_agent': True}" --> # Send notifications to Google Hangouts diff --git a/health/notifications/health_alarm_notify.conf b/health/notifications/health_alarm_notify.conf index 52de86645..4878661aa 100755 --- a/health/notifications/health_alarm_notify.conf +++ b/health/notifications/health_alarm_notify.conf @@ -9,7 +9,7 @@ # - messages to your slack team (slack.com), # - messages to your alerta server (alerta.io), # - messages to your flock team (flock.com), -# - messages to your discord guild (discordapp.com), +# - messages to your discord guild (discord.com), # - messages to your telegram chat / group chat (telegram.org) # - sms messages to your cell phone or any sms enabled device (twilio.com) # - sms messages to your cell phone or any sms enabled device (messagebird.com) @@ -160,7 +160,11 @@ sendsms="" # - pagerduty.com (pd) services # - irc channels # -# You can append |critical to limit the notifications to be sent. +# You can append modifiers to limit the notifications to be sent: +# |critical - Send critical notifications and following status changes until +# the alarm is cleared. +# |nowarn - Do not send warning notifications. +# |noclear - Do not send clear notifications. # # In these examples, the first recipient receives all the alarms # while the second one receives only notifications for alarms that @@ -182,6 +186,11 @@ sendsms="" # irc : "<irc_channel_1> <irc_channel_2>|critical" # hangouts : "alarms disasters|critical" # +# You can append multiple modifiers. 
In this example, recipient receives +# notifications for critical alarms and following status changes except clear +# notifications. +# email : "user1@example.com|critical|noclear" +# # If a recipient is set to empty string, the default recipient of the given # notification method (email, pushover, telegram, slack, alerta, etc) will be used. # To disable a notification, use the recipient called: disabled @@ -579,7 +588,7 @@ DEFAULT_RECIPIENT_FLOCK="" #------------------------------------------------------------------------------ -# discord (discordapp.com) global notification options +# discord (discord.com) global notification options # multiple recipients can be given like this: # "CHANNEL1 CHANNEL2 ..." @@ -588,7 +597,7 @@ DEFAULT_RECIPIENT_FLOCK="" SEND_DISCORD="YES" # Create a webhook by following the official documentation - -# https://support.discordapp.com/hc/en-us/articles/228383668-Intro-to-Webhooks +# https://support.discord.com/hc/en-us/articles/228383668-Intro-to-Webhooks DISCORD_WEBHOOK_URL="" # if a role's recipients are not configured, a notification will be send to diff --git a/health/notifications/irc/README.md b/health/notifications/irc/README.md index 21c998d11..a4877f48a 100644 --- a/health/notifications/irc/README.md +++ b/health/notifications/irc/README.md @@ -1,6 +1,11 @@ <!-- title: "IRC" -custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/irc/README.md +sidebar_label: "IRC" +custom_edit_url: "https://github.com/netdata/netdata/edit/master/health/notifications/irc/README.md" +learn_status: "Published" +learn_topic_type: "Tasks" +learn_rel_path: "Setup/Notification/Agent" +learn_autogeneration_metadata: "{'part_of_cloud': False, 'part_of_agent': True}" --> # IRC diff --git a/health/notifications/kavenegar/README.md b/health/notifications/kavenegar/README.md index 6123eb901..443fcdba4 100644 --- a/health/notifications/kavenegar/README.md +++ b/health/notifications/kavenegar/README.md @@ -1,6 +1,11 @@ <!-- 
title: "Kavenegar" -custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/kavenegar/README.md +sidebar_label: "Kavenegar" +custom_edit_url: "https://github.com/netdata/netdata/edit/master/health/notifications/kavenegar/README.md" +learn_status: "Published" +learn_topic_type: "Tasks" +learn_rel_path: "Setup/Notification/Agent" +learn_autogeneration_metadata: "{'part_of_cloud': False, 'part_of_agent': True}" --> # Kavenegar diff --git a/health/notifications/matrix/README.md b/health/notifications/matrix/README.md index 8eeecf55d..80e22da37 100644 --- a/health/notifications/matrix/README.md +++ b/health/notifications/matrix/README.md @@ -2,7 +2,11 @@ title: "Send Netdata notifications to Matrix network rooms" description: "Stay aware of warning or critical anomalies by sending health alarms to Matrix network rooms with Netdata's health monitoring watchdog." sidebar_label: "Matrix" -custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/matrix/README.md +custom_edit_url: "https://github.com/netdata/netdata/edit/master/health/notifications/matrix/README.md" +learn_status: "Published" +learn_topic_type: "Tasks" +learn_rel_path: "Setup/Notification/Agent" +learn_autogeneration_metadata: "{'part_of_cloud': False, 'part_of_agent': True}" --> # Matrix diff --git a/health/notifications/messagebird/README.md b/health/notifications/messagebird/README.md index f70e86c68..014301985 100644 --- a/health/notifications/messagebird/README.md +++ b/health/notifications/messagebird/README.md @@ -1,6 +1,11 @@ <!-- title: "Messagebird" -custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/messagebird/README.md +sidebar_label: "Messagebird" +custom_edit_url: "https://github.com/netdata/netdata/edit/master/health/notifications/messagebird/README.md" +learn_status: "Published" +learn_topic_type: "Tasks" +learn_rel_path: "Setup/Notification/Agent" +learn_autogeneration_metadata: "{'part_of_cloud': 
False, 'part_of_agent': True}" --> # Messagebird diff --git a/health/notifications/msteams/README.md b/health/notifications/msteams/README.md index c9a13bac9..75e652a72 100644 --- a/health/notifications/msteams/README.md +++ b/health/notifications/msteams/README.md @@ -1,6 +1,11 @@ <!-- title: "Microsoft Teams" -custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/msteams/README.md +sidebar_label: "Microsoft Teams" +custom_edit_url: "https://github.com/netdata/netdata/edit/master/health/notifications/msteams/README.md" +learn_status: "Published" +learn_topic_type: "Tasks" +learn_rel_path: "Setup/Notification/Agent" +learn_autogeneration_metadata: "{'part_of_cloud': False, 'part_of_agent': True}" --> # Microsoft Teams diff --git a/health/notifications/opsgenie/README.md b/health/notifications/opsgenie/README.md index 640fcd42a..20f14b396 100644 --- a/health/notifications/opsgenie/README.md +++ b/health/notifications/opsgenie/README.md @@ -2,7 +2,11 @@ title: "Send notifications to Opsgenie" description: "Send alerts to your Opsgenie incident response account any time an anomaly or performance issue strikes a node in your infrastructure." sidebar_label: "Opsgenie" -custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/opsgenie/README.md +custom_edit_url: "https://github.com/netdata/netdata/edit/master/health/notifications/opsgenie/README.md" +learn_status: "Published" +learn_topic_type: "Tasks" +learn_rel_path: "Setup/Notification/Agent" +learn_autogeneration_metadata: "{'part_of_cloud': False, 'part_of_agent': True}" --> # Send notifications to Opsgenie @@ -13,9 +17,9 @@ incidents. The first step is to create a [Netdata integration](https://docs.opsgenie.com/docs/api-integration) in the [Opsgenie](https://www.atlassian.com/software/opsgenie) dashboard. 
After this, you need to edit -`health_alarm_notify.conf` on your system, by running the following from your [config -directory](/docs/configure/nodes.md): - +`health_alarm_notify.conf` on your system, by running the following from +your [config directory](https://github.com/netdata/netdata/blob/master/docs/configure/nodes.md): + ```bash ./edit-config health_alarm_notify.conf ``` @@ -56,7 +60,7 @@ message: 2020-09-03 23:07:00: alarm-notify.sh: ERROR: failed to send opsgenie notification for: hades test.chart.test_alarm is CRITICAL, with HTTP error code 401. ``` -You can find more details about the Opsgenie error codes in their [response -docs](https://docs.opsgenie.com/docs/response). +You can find more details about the Opsgenie error codes in +their [response docs](https://docs.opsgenie.com/docs/response). diff --git a/health/notifications/pagerduty/README.md b/health/notifications/pagerduty/README.md index 30db6379c..c6190e83f 100644 --- a/health/notifications/pagerduty/README.md +++ b/health/notifications/pagerduty/README.md @@ -2,7 +2,11 @@ title: "Send alert notifications to PagerDuty" description: "Send alerts to your PagerDuty dashboard any time an anomaly or performance issue strikes a node in your infrastructure." sidebar_label: "PagerDuty" -custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/pagerduty/README.md +custom_edit_url: "https://github.com/netdata/netdata/edit/master/health/notifications/pagerduty/README.md" +learn_status: "Published" +learn_topic_type: "Tasks" +learn_rel_path: "Setup/Notification/Agent" +learn_autogeneration_metadata: "{'part_of_cloud': False, 'part_of_agent': True}" --> # Send alert notifications to PagerDuty @@ -14,7 +18,7 @@ resolution times. ## What you need to get started -- An installation of the open-source [Netdata](/docs/get-started.mdx) monitoring agent. +- An installation of the open-source [Netdata](https://github.com/netdata/netdata/blob/master/docs/get-started.mdx) monitoring agent. 
- An installation of the [PagerDuty agent](https://www.pagerduty.com/docs/guides/agent-install-guide/) on the node running Netdata. - A PagerDuty `Generic API` service using either the `Events API v2` or `Events API v1`. @@ -25,8 +29,8 @@ resolution times. to PagerDuty. Click **Use our API directly** and select either `Events API v2` or `Events API v1`. Once you finish creating the service, click on the **Integrations** tab to find your **Integration Key**. -Navigate to the [Netdata config directory](/docs/configure/nodes.md#the-netdata-config-directory) and use -[`edit-config`](/docs/configure/nodes.md#use-edit-config-to-edit-configuration-files) to open +Navigate to the [Netdata config directory](https://github.com/netdata/netdata/blob/master/docs/configure/nodes.md#the-netdata-config-directory) and use +[`edit-config`](https://github.com/netdata/netdata/blob/master/docs/configure/nodes.md#use-edit-config-to-edit-configuration-files) to open `health_alarm_notify.conf`. ```bash @@ -59,5 +63,5 @@ sudo su -s /bin/bash netdata Aside from the three values set in `health_alarm_notify.conf`, there is no further configuration required to send alert notifications to PagerDuty. -To configure individual alarms, read our [alert configuration](/docs/monitor/configure-alarms.md) doc or -the [health entity reference](/health/REFERENCE.md) doc. +To configure individual alarms, read our [alert configuration](https://github.com/netdata/netdata/blob/master/docs/monitor/configure-alarms.md) doc or +the [health entity reference](https://github.com/netdata/netdata/blob/master/health/REFERENCE.md) doc. 
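Reviewer note on the `health_alarm_notify.conf` hunk earlier in this diff: the new recipient modifiers (`|critical`, `|nowarn`, `|noclear`) and their combination can be read as the filter sketched below. This is an illustrative model under stated assumptions, not the logic in `alarm-notify.sh`: the names `parse_recipient` and `should_notify` are hypothetical, and the `went_critical` flag stands in for whatever state the real script keeps to know that an alarm has already been CRITICAL for that recipient.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipient:
    address: str
    modifiers: frozenset


def parse_recipient(spec):
    """Split 'address|mod1|mod2' into the address and its set of modifiers."""
    address, *mods = spec.split("|")
    return Recipient(address, frozenset(m.lower() for m in mods))


def should_notify(recipient, status, went_critical):
    """Decide whether a status change is sent to this recipient.

    status        -- 'WARNING', 'CRITICAL' or 'CLEAR'
    went_critical -- True if the alarm already reached CRITICAL and has not
                     been cleared since (assumed to be tracked elsewhere)
    """
    status = status.upper()
    mods = recipient.modifiers
    # |critical: only critical notifications, plus the status changes that
    # follow them until the alarm is cleared.
    if "critical" in mods and status != "CRITICAL" and not went_critical:
        return False
    # |nowarn: never send warning notifications.
    if "nowarn" in mods and status == "WARNING":
        return False
    # |noclear: never send clear notifications.
    if "noclear" in mods and status == "CLEAR":
        return False
    return True


# The combined example from the config comments.
user = parse_recipient("user1@example.com|critical|noclear")
for status, went_critical in [("WARNING", False), ("CRITICAL", False),
                              ("WARNING", True), ("CLEAR", True)]:
    print(status, went_critical, should_notify(user, status, went_critical))
```

Under this reading, `user1@example.com|critical|noclear` is skipped for a first WARNING, notified on CRITICAL and on a later drop back to WARNING, and not notified when the alarm clears, which matches the wording of the new comment block.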
diff --git a/health/notifications/prowl/README.md b/health/notifications/prowl/README.md index dc136820c..8656c1314 100644 --- a/health/notifications/prowl/README.md +++ b/health/notifications/prowl/README.md @@ -1,6 +1,11 @@ <!-- title: "Prowl" -custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/prowl/README.md +sidebar_label: "Prowl" +custom_edit_url: "https://github.com/netdata/netdata/edit/master/health/notifications/prowl/README.md" +learn_status: "Published" +learn_topic_type: "Tasks" +learn_rel_path: "Setup/Notification/Agent" +learn_autogeneration_metadata: "{'part_of_cloud': False, 'part_of_agent': True}" --> # Prowl diff --git a/health/notifications/pushbullet/README.md b/health/notifications/pushbullet/README.md index 194050bc1..17ed93646 100644 --- a/health/notifications/pushbullet/README.md +++ b/health/notifications/pushbullet/README.md @@ -1,6 +1,11 @@ <!-- title: "PushBullet" -custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/pushbullet/README.md +sidebar_label: "PushBullet" +custom_edit_url: "https://github.com/netdata/netdata/edit/master/health/notifications/pushbullet/README.md" +learn_status: "Published" +learn_topic_type: "Tasks" +learn_rel_path: "Setup/Notification/Agent" +learn_autogeneration_metadata: "{'part_of_cloud': False, 'part_of_agent': True}" --> # PushBullet diff --git a/health/notifications/pushover/README.md b/health/notifications/pushover/README.md index 1e50f7140..4d5ea5a96 100644 --- a/health/notifications/pushover/README.md +++ b/health/notifications/pushover/README.md @@ -1,6 +1,11 @@ <!-- title: "PushOver" -custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/pushover/README.md +sidebar_label: "PushOver" +custom_edit_url: "https://github.com/netdata/netdata/edit/master/health/notifications/pushover/README.md" +learn_status: "Published" +learn_topic_type: "Tasks" +learn_rel_path: "Setup/Notification/Agent" 
+learn_autogeneration_metadata: "{'part_of_cloud': False, 'part_of_agent': True}" --> # PushOver diff --git a/health/notifications/rocketchat/README.md b/health/notifications/rocketchat/README.md index 96d6160b2..0f7867d0f 100644 --- a/health/notifications/rocketchat/README.md +++ b/health/notifications/rocketchat/README.md @@ -1,6 +1,11 @@ <!-- title: "Rocket.Chat" -custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/rocketchat/README.md +sidebar_label: "Rocket Chat" +custom_edit_url: "https://github.com/netdata/netdata/edit/master/health/notifications/rocketchat/README.md" +learn_status: "Published" +learn_topic_type: "Tasks" +learn_rel_path: "Setup/Notification/Agent" +learn_autogeneration_metadata: "{'part_of_cloud': False, 'part_of_agent': True}" --> # Rocket.Chat diff --git a/health/notifications/slack/README.md b/health/notifications/slack/README.md index ad36ce34a..ad9a21346 100644 --- a/health/notifications/slack/README.md +++ b/health/notifications/slack/README.md @@ -1,6 +1,11 @@ <!-- title: "Slack" -custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/slack/README.md +sidebar_label: "Slack" +custom_edit_url: "https://github.com/netdata/netdata/edit/master/health/notifications/slack/README.md" +learn_status: "Published" +learn_topic_type: "Tasks" +learn_rel_path: "Setup/Notification/Agent" +learn_autogeneration_metadata: "{'part_of_cloud': False, 'part_of_agent': True}" --> # Slack diff --git a/health/notifications/smstools3/README.md b/health/notifications/smstools3/README.md index 6618dfa18..9535c9549 100644 --- a/health/notifications/smstools3/README.md +++ b/health/notifications/smstools3/README.md @@ -1,6 +1,11 @@ <!-- title: "SMS Server Tools 3" -custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/smstools3/README.md +sidebar_label: "SMS server" +custom_edit_url: "https://github.com/netdata/netdata/edit/master/health/notifications/smstools3/README.md" 
+learn_status: "Published" +learn_topic_type: "Tasks" +learn_rel_path: "Setup/Notification/Agent" +learn_autogeneration_metadata: "{'part_of_cloud': False, 'part_of_agent': True}" --> # SMS Server Tools 3 diff --git a/health/notifications/stackpulse/README.md b/health/notifications/stackpulse/README.md index c478fd584..25266e822 100644 --- a/health/notifications/stackpulse/README.md +++ b/health/notifications/stackpulse/README.md @@ -2,7 +2,11 @@ title: "Send notifications to StackPulse" description: "Send alerts to your StackPulse Netdata integration any time an anomaly or performance issue strikes a node in your infrastructure." sidebar_label: "StackPulse" -custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/stackpulse/README.md +custom_edit_url: "https://github.com/netdata/netdata/edit/master/health/notifications/stackpulse/README.md" +learn_status: "Published" +learn_topic_type: "Tasks" +learn_rel_path: "Setup/Notification/Agent" +learn_autogeneration_metadata: "{'part_of_cloud': False, 'part_of_agent': True}" --> # Send notifications to StackPulse @@ -40,7 +44,7 @@ STACKPULSE_WEBHOOK="https://hooks.stackpulse.io/v1/webhooks/YOUR_UNIQUE_ID" ``` 4. Now restart Netdata using `sudo systemctl restart netdata`, or the [appropriate - method](/docs/configure/start-stop-restart.md) for your system. When your node creates an alarm, you can see the + method](https://github.com/netdata/netdata/blob/master/docs/configure/start-stop-restart.md) for your system. 
When your node creates an alarm, you can see the associated notification on your StackPulse Administration Portal ## React to alarms with playbooks diff --git a/health/notifications/syslog/README.md b/health/notifications/syslog/README.md index 8b7863a1a..3527decc4 100644 --- a/health/notifications/syslog/README.md +++ b/health/notifications/syslog/README.md @@ -1,6 +1,11 @@ <!-- title: "Syslog" -custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/syslog/README.md +sidebar_label: "Syslog" +custom_edit_url: "https://github.com/netdata/netdata/edit/master/health/notifications/syslog/README.md" +learn_status: "Published" +learn_topic_type: "Tasks" +learn_rel_path: "Setup/Notification/Agent" +learn_autogeneration_metadata: "{'part_of_cloud': False, 'part_of_agent': True}" --> # Syslog diff --git a/health/notifications/telegram/README.md b/health/notifications/telegram/README.md index 2a2ed5623..f80a2838d 100644 --- a/health/notifications/telegram/README.md +++ b/health/notifications/telegram/README.md @@ -1,6 +1,11 @@ <!-- title: "Telegram" -custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/telegram/README.md +sidebar_label: "Telegram" +custom_edit_url: "https://github.com/netdata/netdata/edit/master/health/notifications/telegram/README.md" +learn_status: "Published" +learn_topic_type: "Tasks" +learn_rel_path: "Setup/Notification/Agent" +learn_autogeneration_metadata: "{'part_of_cloud': False, 'part_of_agent': True}" --> # Telegram diff --git a/health/notifications/twilio/README.md b/health/notifications/twilio/README.md index b563c66c1..470b2413b 100644 --- a/health/notifications/twilio/README.md +++ b/health/notifications/twilio/README.md @@ -1,6 +1,11 @@ <!-- title: "Twilio" -custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/twilio/README.md +sidebar_label: "Twilio" +custom_edit_url: 
"https://github.com/netdata/netdata/edit/master/health/notifications/twilio/README.md" +learn_status: "Published" +learn_topic_type: "Tasks" +learn_rel_path: "Setup/Notification/Agent" +learn_autogeneration_metadata: "{'part_of_cloud': False, 'part_of_agent': True}" --> # Twilio diff --git a/health/notifications/web/README.md b/health/notifications/web/README.md index 185843af5..b4afd9ea7 100644 --- a/health/notifications/web/README.md +++ b/health/notifications/web/README.md @@ -1,9 +1,14 @@ <!-- -title: "Dashboard" -custom_edit_url: https://github.com/netdata/netdata/edit/master/health/notifications/web/README.md +title: "Pop up" +sidebar_label: "Pop up notifications" +custom_edit_url: "https://github.com/netdata/netdata/edit/master/health/notifications/web/README.md" +learn_status: "Published" +learn_topic_type: "Tasks" +learn_rel_path: "Setup/Notification/Agent" +learn_autogeneration_metadata: "{'part_of_cloud': False, 'part_of_agent': True}" --> -# Dashboard +# Pop up notifications The Netdata dashboard shows HTML notifications, when it is open. |