author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2023-10-17 09:30:20 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2023-10-17 09:30:20 +0000 |
commit | 386ccdd61e8256c8b21ee27ee2fc12438fc5ca98 | |
tree | c9fbcacdb01f029f46133a5ba7ecd610c2bcb041 /health/health.d | |
parent | Adding upstream version 1.42.4. | |
download | netdata-386ccdd61e8256c8b21ee27ee2fc12438fc5ca98.tar.xz netdata-386ccdd61e8256c8b21ee27ee2fc12438fc5ca98.zip |
Adding upstream version 1.43.0. upstream/1.43.0
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to '')
77 files changed, 768 insertions, 532 deletions
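The hunks below mostly add a `summary:` line next to each alert's `info:` line. As a rough aid for reviewing a change like this, the following sketch lists the alert stanzas in a health config text that still lack a `summary:`. It is not Netdata's own parser: `parse_stanzas`, `missing_summary`, and the `example_without_summary` stanza are hypothetical names used only for illustration, and the parsing assumes simple `key: value` lines with a trailing backslash for continuations, as seen in the diff.

```python
def parse_stanzas(text):
    """Split health config text into stanzas (one dict of key -> value each).

    Assumption: a stanza starts at a `template:` or `alarm:` line, and a
    value ending in a backslash continues on the next line.
    """
    stanzas = []
    current = None
    pending_key = None  # key whose value continues on the next line
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            pending_key = None
            continue
        if pending_key is not None:
            # Continuation line: append it to the previous value.
            current[pending_key] += " " + line.rstrip("\\").strip()
            if not line.endswith("\\"):
                pending_key = None
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key in ("template", "alarm"):
            current = {}
            stanzas.append(current)
        if current is None:
            continue  # ignore keys that appear before the first stanza
        current[key] = value.rstrip("\\").strip()
        pending_key = key if value.endswith("\\") else None
    return stanzas


def missing_summary(stanzas):
    """Return the names of stanzas that have no `summary` key."""
    return [s.get("template") or s.get("alarm")
            for s in stanzas if "summary" not in s]


# The first stanza uses lines from this commit; the second is a made-up
# stanza without a summary, included only to exercise the check.
SAMPLE = r"""
 template: apcupsd_last_collected_secs
    units: seconds ago
    delay: down 5m multiplier 1.5 max 1h
  summary: APC UPS last collection
     info: APC UPS number of seconds since the last successful data collection
       to: sitemgr

 template: example_without_summary
     warn: $this > 0
     info: Number of times data was read from the cache, \
           the bucket was reused and invalidated in the last 10 minutes
       to: silent
"""

print(missing_summary(parse_stanzas(SAMPLE)))
```

Run against a real `health.d` file, a check like this would only flag stanzas by name; whether a missing `summary:` is actually a problem depends on the Netdata version in use.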
diff --git a/health/health.d/adaptec_raid.conf b/health/health.d/adaptec_raid.conf index 1d823addd..1f1840491 100644 --- a/health/health.d/adaptec_raid.conf +++ b/health/health.d/adaptec_raid.conf @@ -11,7 +11,8 @@ component: RAID every: 10s crit: $this > 0 delay: down 5m multiplier 1.5 max 1h - info: logical device status is failed or degraded + summary: Adaptec raid logical device status + info: Logical device status is failed or degraded to: sysadmin # physical device state check @@ -26,5 +27,6 @@ component: RAID every: 10s crit: $this > 0 delay: down 5m multiplier 1.5 max 1h - info: physical device state is not online + summary: Adaptec raid physical device state + info: Physical device state is not online to: sysadmin diff --git a/health/health.d/apcupsd.conf b/health/health.d/apcupsd.conf index 7a0afcd18..fc8f2cd0f 100644 --- a/health/health.d/apcupsd.conf +++ b/health/health.d/apcupsd.conf @@ -12,7 +12,8 @@ component: UPS every: 1m warn: $this > (($status >= $WARNING) ? (70) : (80)) delay: down 10m multiplier 1.5 max 1h - info: average UPS load over the last 10 minutes + summary: APC UPS load + info: APC UPS average load over the last 10 minutes to: sitemgr # Discussion in https://github.com/netdata/netdata/pull/3928: @@ -30,7 +31,8 @@ component: UPS warn: $this < 100 crit: $this < 40 delay: down 10m multiplier 1.5 max 1h - info: average UPS charge over the last minute + summary: APC UPS battery charge + info: APC UPS average battery charge over the last minute to: sitemgr template: apcupsd_last_collected_secs @@ -43,5 +45,6 @@ component: UPS device units: seconds ago warn: $this > (($status >= $WARNING) ? 
($update_every) : ( 5 * $update_every)) delay: down 5m multiplier 1.5 max 1h - info: number of seconds since the last successful data collection + summary: APC UPS last collection + info: APC UPS number of seconds since the last successful data collection to: sitemgr diff --git a/health/health.d/bcache.conf b/health/health.d/bcache.conf index 8492bb6c7..446173428 100644 --- a/health/health.d/bcache.conf +++ b/health/health.d/bcache.conf @@ -9,7 +9,8 @@ component: Disk every: 1m warn: $this > 0 delay: up 2m down 1h multiplier 1.5 max 2h - info: number of times data was read from the cache, \ + summary: Bcache cache read race errors + info: Number of times data was read from the cache, \ the bucket was reused and invalidated in the last 10 minutes \ (when this occurs the data is reread from the backing device) to: silent @@ -24,6 +25,7 @@ component: Disk every: 1m warn: $this > 75 delay: up 1m down 1h multiplier 1.5 max 2h - info: percentage of cache space used for dirty data and metadata \ + summary: Bcache cache used space + info: Percentage of cache space used for dirty data and metadata \ (this usually means your SSD cache is too small) to: silent diff --git a/health/health.d/beanstalkd.conf b/health/health.d/beanstalkd.conf index 4ee8bc0bd..0d37f28e0 100644 --- a/health/health.d/beanstalkd.conf +++ b/health/health.d/beanstalkd.conf @@ -10,7 +10,8 @@ component: Beanstalk every: 10s warn: $this > 3 delay: up 0 down 5m multiplier 1.2 max 1h - info: number of buried jobs across all tubes. \ + summary: Beanstalk buried jobs + info: Number of buried jobs across all tubes. \ You need to manually kick them so they can be processed. \ Presence of buried jobs in a tube does not affect new jobs. 
to: sysadmin diff --git a/health/health.d/bind_rndc.conf b/health/health.d/bind_rndc.conf index b3e75a239..b1c271df9 100644 --- a/health/health.d/bind_rndc.conf +++ b/health/health.d/bind_rndc.conf @@ -7,5 +7,6 @@ component: BIND every: 60 calc: $stats_size warn: $this > 512 + summary: BIND statistics file size info: BIND statistics-file size to: sysadmin diff --git a/health/health.d/boinc.conf b/health/health.d/boinc.conf index b7dcbe316..092a56845 100644 --- a/health/health.d/boinc.conf +++ b/health/health.d/boinc.conf @@ -13,7 +13,8 @@ component: BOINC every: 1m warn: $this > 0 delay: up 1m down 5m multiplier 1.5 max 1h - info: average number of compute errors over the last 10 minutes + summary: BOINC compute errors + info: Average number of compute errors over the last 10 minutes to: sysadmin # Warn on lots of upload errors @@ -29,7 +30,8 @@ component: BOINC every: 1m warn: $this > 0 delay: up 1m down 5m multiplier 1.5 max 1h - info: average number of failed uploads over the last 10 minutes + summary: BOINC failed uploads + info: Average number of failed uploads over the last 10 minutes to: sysadmin # Warn on the task queue being empty @@ -45,7 +47,8 @@ component: BOINC every: 1m warn: $this < 1 delay: up 5m down 10m multiplier 1.5 max 1h - info: average number of total tasks over the last 10 minutes + summary: BOINC total tasks + info: Average number of total tasks over the last 10 minutes to: sysadmin # Warn on no active tasks with a non-empty queue @@ -62,5 +65,6 @@ component: BOINC every: 1m warn: $this < 1 delay: up 5m down 10m multiplier 1.5 max 1h - info: average number of active tasks over the last 10 minutes + summary: BOINC active tasks + info: Average number of active tasks over the last 10 minutes to: sysadmin diff --git a/health/health.d/btrfs.conf b/health/health.d/btrfs.conf index b2a50682b..1557a5941 100644 --- a/health/health.d/btrfs.conf +++ b/health/health.d/btrfs.conf @@ -11,7 +11,8 @@ component: File system every: 10s warn: $this > 
(($status == $CRITICAL) ? (95) : (98)) delay: up 1m down 15m multiplier 1.5 max 1h - info: percentage of allocated BTRFS physical disk space + summary: BTRFS allocated space utilization + info: Percentage of allocated BTRFS physical disk space to: silent template: btrfs_data @@ -27,7 +28,8 @@ component: File system warn: $this > (($status >= $WARNING) ? (90) : (95)) && $btrfs_allocated > 98 crit: $this > (($status == $CRITICAL) ? (95) : (98)) && $btrfs_allocated > 98 delay: up 1m down 15m multiplier 1.5 max 1h - info: utilization of BTRFS data space + summary: BTRFS data space utilization + info: Utilization of BTRFS data space to: sysadmin template: btrfs_metadata @@ -43,7 +45,8 @@ component: File system warn: $this > (($status >= $WARNING) ? (90) : (95)) && $btrfs_allocated > 98 crit: $this > (($status == $CRITICAL) ? (95) : (98)) && $btrfs_allocated > 98 delay: up 1m down 15m multiplier 1.5 max 1h - info: utilization of BTRFS metadata space + summary: BTRFS metadata space utilization + info: Utilization of BTRFS metadata space to: sysadmin template: btrfs_system @@ -59,7 +62,8 @@ component: File system warn: $this > (($status >= $WARNING) ? (90) : (95)) && $btrfs_allocated > 98 crit: $this > (($status == $CRITICAL) ? 
(95) : (98)) && $btrfs_allocated > 98 delay: up 1m down 15m multiplier 1.5 max 1h - info: utilization of BTRFS system space + summary: BTRFS system space utilization + info: Utilization of BTRFS system space to: sysadmin template: btrfs_device_read_errors @@ -73,7 +77,8 @@ component: File system lookup: max -10m every 1m of read_errs warn: $this > 0 delay: up 1m down 15m multiplier 1.5 max 1h - info: number of encountered BTRFS read errors + summary: BTRFS device read errors + info: Number of encountered BTRFS read errors to: sysadmin template: btrfs_device_write_errors @@ -87,7 +92,8 @@ component: File system lookup: max -10m every 1m of write_errs crit: $this > 0 delay: up 1m down 15m multiplier 1.5 max 1h - info: number of encountered BTRFS write errors + summary: BTRFS device write errors + info: Number of encountered BTRFS write errors to: sysadmin template: btrfs_device_flush_errors @@ -101,7 +107,8 @@ component: File system lookup: max -10m every 1m of flush_errs crit: $this > 0 delay: up 1m down 15m multiplier 1.5 max 1h - info: number of encountered BTRFS flush errors + summary: BTRFS device flush errors + info: Number of encountered BTRFS flush errors to: sysadmin template: btrfs_device_corruption_errors @@ -115,7 +122,8 @@ component: File system lookup: max -10m every 1m of corruption_errs warn: $this > 0 delay: up 1m down 15m multiplier 1.5 max 1h - info: number of encountered BTRFS corruption errors + summary: BTRFS device corruption errors + info: Number of encountered BTRFS corruption errors to: sysadmin template: btrfs_device_generation_errors @@ -129,5 +137,6 @@ component: File system lookup: max -10m every 1m of generation_errs warn: $this > 0 delay: up 1m down 15m multiplier 1.5 max 1h - info: number of encountered BTRFS generation errors + summary: BTRFS device generation errors + info: Number of encountered BTRFS generation errors to: sysadmin diff --git a/health/health.d/ceph.conf b/health/health.d/ceph.conf index 1f9da25c7..44d351338 100644 
--- a/health/health.d/ceph.conf +++ b/health/health.d/ceph.conf @@ -11,5 +11,6 @@ component: Ceph warn: $this > (($status >= $WARNING ) ? (85) : (90)) crit: $this > (($status == $CRITICAL) ? (90) : (98)) delay: down 5m multiplier 1.2 max 1h - info: cluster disk space utilization + summary: Ceph cluster disk space utilization + info: Ceph cluster disk space utilization to: sysadmin diff --git a/health/health.d/cgroups.conf b/health/health.d/cgroups.conf index 53a6ea00f..9c55633ef 100644 --- a/health/health.d/cgroups.conf +++ b/health/health.d/cgroups.conf @@ -13,7 +13,8 @@ component: CPU every: 1m warn: $this > (($status == $CRITICAL) ? (85) : (95)) delay: down 15m multiplier 1.5 max 1h - info: average cgroup CPU utilization over the last 10 minutes + summary: Cgroup ${label:cgroup_name} CPU utilization + info: Cgroup ${label:cgroup_name} average CPU utilization over the last 10 minutes to: silent template: cgroup_ram_in_use @@ -29,46 +30,10 @@ component: Memory warn: $this > (($status >= $WARNING) ? (80) : (90)) crit: $this > (($status == $CRITICAL) ? (90) : (98)) delay: down 15m multiplier 1.5 max 1h - info: cgroup memory utilization + summary: Cgroup ${label:cgroup_name} memory utilization + info: Cgroup ${label:cgroup_name} memory utilization to: silent -# FIXME COMMENTED DUE TO A BUG IN NETDATA -## ----------------------------------------------------------------------------- -## check for packet storms -# -## 1. calculate the rate packets are received in 1m: 1m_received_packets_rate -## 2. do the same for the last 10s -## 3. 
raise an alarm if the later is 10x or 20x the first -## we assume the minimum packet storm should at least have -## 10000 packets/s, average of the last 10 seconds -# -# template: cgroup_1m_received_packets_rate -# on: cgroup.net_packets -# class: Workload -# type: Cgroups -#component: Network -# hosts: * -# lookup: average -1m unaligned of received -# units: packets -# every: 10s -# info: average number of packets received by the network interface ${label:device} over the last minute -# -# template: cgroup_10s_received_packets_storm -# on: cgroup.net_packets -# class: Workload -# type: Cgroups -#component: Network -# hosts: * -# lookup: average -10s unaligned of received -# calc: $this * 100 / (($1m_received_packets_rate < 1000)?(1000):($1m_received_packets_rate)) -# every: 10s -# units: % -# warn: $this > (($status >= $WARNING)?(200):(5000)) -# options: no-clear-notification -# info: ratio of average number of received packets for the network interface ${label:device} over the last 10 seconds, \ -# compared to the rate over the last minute -# to: sysadmin -# # ---------------------------------K8s containers-------------------------------------------- template: k8s_cgroup_10min_cpu_usage @@ -83,7 +48,8 @@ component: CPU every: 1m warn: $this > (($status >= $WARNING) ? (75) : (85)) delay: down 15m multiplier 1.5 max 1h - info: container ${label:k8s_container_name} of pod ${label:k8s_pod_name} of namespace ${label:k8s_namespace}, \ + summary: Container ${label:k8s_container_name} pod ${label:k8s_pod_name} CPU utilization + info: Container ${label:k8s_container_name} of pod ${label:k8s_pod_name} of namespace ${label:k8s_namespace}, \ average CPU utilization over the last 10 minutes to: silent @@ -100,42 +66,7 @@ component: Memory warn: $this > (($status >= $WARNING) ? (80) : (90)) crit: $this > (($status == $CRITICAL) ? 
(90) : (98)) delay: down 15m multiplier 1.5 max 1h + summary: Container ${label:k8s_container_name} pod ${label:k8s_pod_name} memory utilization info: container ${label:k8s_container_name} of pod ${label:k8s_pod_name} of namespace ${label:k8s_namespace}, \ memory utilization to: silent - -# check for packet storms - -# FIXME COMMENTED DUE TO A BUG IN NETDATA -## 1. calculate the rate packets are received in 1m: 1m_received_packets_rate -## 2. do the same for the last 10s -## 3. raise an alarm if the later is 10x or 20x the first -## we assume the minimum packet storm should at least have -## 10000 packets/s, average of the last 10 seconds -# -# template: k8s_cgroup_1m_received_packets_rate -# on: k8s.cgroup.net_packets -# class: Workload -# type: Cgroups -#component: Network -# hosts: * -# lookup: average -1m unaligned of received -# units: packets -# every: 10s -# info: average number of packets received by the network interface ${label:device} over the last minute -# -# template: k8s_cgroup_10s_received_packets_storm -# on: k8s.cgroup.net_packets -# class: Workload -# type: Cgroups -#component: Network -# hosts: * -# lookup: average -10s unaligned of received -# calc: $this * 100 / (($k8s_cgroup_10s_received_packets_storm < 1000)?(1000):($k8s_cgroup_10s_received_packets_storm)) -# every: 10s -# units: % -# warn: $this > (($status >= $WARNING)?(200):(5000)) -# options: no-clear-notification -# info: ratio of average number of received packets for the network interface ${label:device} over the last 10 seconds, \ -# compared to the rate over the last minute -# to: sysadmin diff --git a/health/health.d/cockroachdb.conf b/health/health.d/cockroachdb.conf index 09e4f9d40..60f178354 100644 --- a/health/health.d/cockroachdb.conf +++ b/health/health.d/cockroachdb.conf @@ -12,7 +12,8 @@ component: CockroachDB warn: $this > (($status >= $WARNING) ? (80) : (85)) crit: $this > (($status == $CRITICAL) ? 
(85) : (95)) delay: down 15m multiplier 1.5 max 1h - info: storage capacity utilization + summary: CockroachDB storage space utilization + info: Storage capacity utilization to: dba template: cockroachdb_used_usable_storage_capacity @@ -26,7 +27,8 @@ component: CockroachDB warn: $this > (($status >= $WARNING) ? (80) : (85)) crit: $this > (($status == $CRITICAL) ? (85) : (95)) delay: down 15m multiplier 1.5 max 1h - info: storage usable space utilization + summary: CockroachDB usable storage space utilization + info: Storage usable space utilization to: dba # Replication @@ -41,7 +43,8 @@ component: CockroachDB every: 10s warn: $this > 0 delay: down 15m multiplier 1.5 max 1h - info: number of ranges with fewer live replicas than needed for quorum + summary: CockroachDB unavailable replication + info: Number of ranges with fewer live replicas than needed for quorum to: dba template: cockroachdb_underreplicated_ranges @@ -54,7 +57,8 @@ component: CockroachDB every: 10s warn: $this > 0 delay: down 15m multiplier 1.5 max 1h - info: number of ranges with fewer live replicas than the replication target + summary: CockroachDB under-replicated + info: Number of ranges with fewer live replicas than the replication target to: dba # FD @@ -69,5 +73,6 @@ component: CockroachDB every: 10s warn: $this > 80 delay: down 15m multiplier 1.5 max 1h - info: open file descriptors utilization (against softlimit) + summary: CockroachDB file descriptors utilization + info: Open file descriptors utilization (against softlimit) to: dba diff --git a/health/health.d/consul.conf b/health/health.d/consul.conf index 7edca6563..8b414a26d 100644 --- a/health/health.d/consul.conf +++ b/health/health.d/consul.conf @@ -10,6 +10,7 @@ component: Consul units: seconds warn: $this < 14*24*60*60 crit: $this < 7*24*60*60 + summary: Consul license expiration on ${label:node_name} info: Consul Enterprise license expiration time on node ${label:node_name} datacenter ${label:datacenter} to: sysadmin @@ -23,7 
+24,8 @@ component: Consul units: status warn: $this == 1 delay: down 5m multiplier 1.5 max 1h - info: datacenter ${label:datacenter} cluster is unhealthy as reported by server ${label:node_name} + summary: Consul datacenter ${label:datacenter} health + info: Datacenter ${label:datacenter} cluster is unhealthy as reported by server ${label:node_name} to: sysadmin template: consul_autopilot_server_health_status @@ -36,7 +38,8 @@ component: Consul units: status warn: $this == 1 delay: down 5m multiplier 1.5 max 1h - info: server ${label:node_name} from datacenter ${label:datacenter} is unhealthy + summary: Consul server ${label:node_name} health + info: Server ${label:node_name} from datacenter ${label:datacenter} is unhealthy to: sysadmin template: consul_raft_leader_last_contact_time @@ -50,7 +53,8 @@ component: Consul warn: $this > (($status >= $WARNING) ? (150) : (200)) crit: $this > (($status == $CRITICAL) ? (200) : (500)) delay: down 5m multiplier 1.5 max 1h - info: median time elapsed since leader server ${label:node_name} datacenter ${label:datacenter} was last able to contact the follower nodes + summary: Consul leader server ${label:node_name} last contact time + info: Median time elapsed since leader server ${label:node_name} datacenter ${label:datacenter} was last able to contact the follower nodes to: sysadmin template: consul_raft_leadership_transitions @@ -63,7 +67,8 @@ component: Consul units: transitions warn: $this > 0 delay: down 5m multiplier 1.5 max 1h - info: there has been a leadership change and server ${label:node_name} datacenter ${label:datacenter} has become the leader + summary: Consul server ${label:node_name} leadership transitions + info: There has been a leadership change and server ${label:node_name} datacenter ${label:datacenter} has become the leader to: sysadmin template: consul_raft_thread_main_saturation @@ -76,7 +81,8 @@ component: Consul units: percentage warn: $this > (($status >= $WARNING) ? 
(40) : (50)) delay: down 5m multiplier 1.5 max 1h - info: average saturation of the main Raft goroutine on server ${label:node_name} datacenter ${label:datacenter} + summary: Consul server ${label:node_name} main Raft saturation + info: Average saturation of the main Raft goroutine on server ${label:node_name} datacenter ${label:datacenter} to: sysadmin template: consul_raft_thread_fsm_saturation @@ -89,7 +95,8 @@ component: Consul units: milliseconds warn: $this > (($status >= $WARNING) ? (40) : (50)) delay: down 5m multiplier 1.5 max 1h - info: average saturation of the FSM Raft goroutine on server ${label:node_name} datacenter ${label:datacenter} + summary: Consul server ${label:node_name} FSM Raft saturation + info: Average saturation of the FSM Raft goroutine on server ${label:node_name} datacenter ${label:datacenter} to: sysadmin template: consul_client_rpc_requests_exceeded @@ -102,7 +109,8 @@ component: Consul units: requests warn: $this > (($status >= $WARNING) ? (0) : (5)) delay: down 5m multiplier 1.5 max 1h - info: number of rate-limited RPC requests made by server ${label:node_name} datacenter ${label:datacenter} + summary: Consul server ${label:node_name} RPC requests rate + info: Number of rate-limited RPC requests made by server ${label:node_name} datacenter ${label:datacenter} to: sysadmin template: consul_client_rpc_requests_failed @@ -115,6 +123,7 @@ component: Consul units: requests warn: $this > (($status >= $WARNING) ? 
(0) : (5)) delay: down 5m multiplier 1.5 max 1h + summary: Consul server ${label:node_name} failed RPC requests info: number of failed RPC requests made by server ${label:node_name} datacenter ${label:datacenter} to: sysadmin @@ -128,7 +137,8 @@ component: Consul units: status warn: $this != nan AND $this != 0 delay: down 5m multiplier 1.5 max 1h - info: node health check ${label:check_name} has failed on server ${label:node_name} datacenter ${label:datacenter} + summary: Consul node health check ${label:check_name} on ${label:node_name} + info: Node health check ${label:check_name} has failed on server ${label:node_name} datacenter ${label:datacenter} to: sysadmin template: consul_service_health_check_status @@ -141,7 +151,8 @@ component: Consul units: status warn: $this == 1 delay: down 5m multiplier 1.5 max 1h - info: service health check ${label:check_name} for service ${label:service_name} has failed on server ${label:node_name} datacenter ${label:datacenter} + summary: Consul service health check ${label:check_name} service ${label:service_name} node ${label:node_name} + info: Service health check ${label:check_name} for service ${label:service_name} has failed on server ${label:node_name} datacenter ${label:datacenter} to: sysadmin template: consul_gc_pause_time @@ -155,5 +166,6 @@ component: Consul warn: $this > (($status >= $WARNING) ? (1) : (2)) crit: $this > (($status >= $WARNING) ? 
(2) : (5)) delay: down 5m multiplier 1.5 max 1h - info: time spent in stop-the-world garbage collection pauses on server ${label:node_name} datacenter ${label:datacenter} + summary: Consul server ${label:node_name} garbage collection pauses + info: Time spent in stop-the-world garbage collection pauses on server ${label:node_name} datacenter ${label:datacenter} to: sysadmin diff --git a/health/health.d/cpu.conf b/health/health.d/cpu.conf index 4de5edd75..0b007d6b4 100644 --- a/health/health.d/cpu.conf +++ b/health/health.d/cpu.conf @@ -14,7 +14,8 @@ component: CPU warn: $this > (($status >= $WARNING) ? (75) : (85)) crit: $this > (($status == $CRITICAL) ? (85) : (95)) delay: down 15m multiplier 1.5 max 1h - info: average CPU utilization over the last 10 minutes (excluding iowait, nice and steal) + summary: System CPU utilization + info: Average CPU utilization over the last 10 minutes (excluding iowait, nice and steal) to: silent template: 10min_cpu_iowait @@ -29,7 +30,8 @@ component: CPU every: 1m warn: $this > (($status >= $WARNING) ? (20) : (40)) delay: up 30m down 30m multiplier 1.5 max 2h - info: average CPU iowait time over the last 10 minutes + summary: System CPU iowait time + info: Average CPU iowait time over the last 10 minutes to: silent template: 20min_steal_cpu @@ -44,7 +46,8 @@ component: CPU every: 5m warn: $this > (($status >= $WARNING) ? (5) : (10)) delay: down 1h multiplier 1.5 max 2h - info: average CPU steal time over the last 20 minutes + summary: System CPU steal time + info: Average CPU steal time over the last 20 minutes to: silent ## FreeBSD @@ -61,5 +64,6 @@ component: CPU warn: $this > (($status >= $WARNING) ? (75) : (85)) crit: $this > (($status == $CRITICAL) ? 
(85) : (95)) delay: down 15m multiplier 1.5 max 1h - info: average CPU utilization over the last 10 minutes (excluding nice) + summary: System CPU utilization + info: Average CPU utilization over the last 10 minutes (excluding nice) to: silent diff --git a/health/health.d/dbengine.conf b/health/health.d/dbengine.conf index 65c41b846..0a70d2e8f 100644 --- a/health/health.d/dbengine.conf +++ b/health/health.d/dbengine.conf @@ -13,7 +13,8 @@ component: DB engine every: 10s crit: $this > 0 delay: down 15m multiplier 1.5 max 1h - info: number of filesystem errors in the last 10 minutes (too many open files, wrong permissions, etc) + summary: Netdata DBengine filesystem errors + info: Number of filesystem errors in the last 10 minutes (too many open files, wrong permissions, etc) to: sysadmin alarm: 10min_dbengine_global_io_errors @@ -28,7 +29,8 @@ component: DB engine every: 10s crit: $this > 0 delay: down 1h multiplier 1.5 max 3h - info: number of IO errors in the last 10 minutes (CRC errors, out of space, bad disk, etc) + summary: Netdata DBengine IO errors + info: Number of IO errors in the last 10 minutes (CRC errors, out of space, bad disk, etc) to: sysadmin alarm: 10min_dbengine_global_flushing_warnings @@ -43,6 +45,7 @@ component: DB engine every: 10s warn: $this > 0 delay: down 1h multiplier 1.5 max 3h + summary: Netdata DBengine global flushing warnings info: number of times when dbengine dirty pages were over 50% of the instance's page cache in the last 10 minutes. \ Metric data are at risk of not being stored in the database. To remedy, reduce disk load or use faster disks. to: sysadmin @@ -59,6 +62,7 @@ component: DB engine every: 10s crit: $this != 0 delay: down 1h multiplier 1.5 max 3h - info: number of pages deleted due to failure to flush data to disk in the last 10 minutes. \ + summary: Netdata DBengine global flushing errors + info: Number of pages deleted due to failure to flush data to disk in the last 10 minutes. 
\ Metric data were lost to unblock data collection. To fix, reduce disk load or use faster disks. to: sysadmin diff --git a/health/health.d/disks.conf b/health/health.d/disks.conf index 27f5d6691..2e417fd4a 100644 --- a/health/health.d/disks.conf +++ b/health/health.d/disks.conf @@ -23,7 +23,8 @@ chart labels: mount_point=!/dev !/dev/* !/run !/run/* * warn: $this > (($status >= $WARNING ) ? (80) : (90)) crit: ($this > (($status == $CRITICAL) ? (90) : (98))) && $avail < 5 delay: up 1m down 15m multiplier 1.5 max 1h - info: disk ${label:mount_point} space utilization + summary: Disk ${label:mount_point} space usage + info: Total space utilization of disk ${label:mount_point} to: sysadmin template: disk_inode_usage @@ -40,7 +41,8 @@ chart labels: mount_point=!/dev !/dev/* !/run !/run/* * warn: $this > (($status >= $WARNING) ? (80) : (90)) crit: $this > (($status == $CRITICAL) ? (90) : (98)) delay: up 1m down 15m multiplier 1.5 max 1h - info: disk ${label:mount_point} inode utilization + summary: Disk ${label:mount_point} inode usage + info: Total inode utilization of disk ${label:mount_point} to: sysadmin @@ -79,7 +81,8 @@ template: out_of_disk_space_time warn: $this > 0 and $this < (($status >= $WARNING) ? (48) : (8)) crit: $this > 0 and $this < (($status == $CRITICAL) ? (24) : (2)) delay: down 15m multiplier 1.2 max 1h - info: estimated time the disk will run out of space, if the system continues to add data with the rate of the last hour + summary: Disk ${label:mount_point} estimation of lack of space + info: Estimated time the disk ${label:mount_point} will run out of space, if the system continues to add data with the rate of the last hour to: silent @@ -118,7 +121,8 @@ template: out_of_disk_inodes_time warn: $this > 0 and $this < (($status >= $WARNING) ? (48) : (8)) crit: $this > 0 and $this < (($status == $CRITICAL) ? 
(24) : (2)) delay: down 15m multiplier 1.2 max 1h - info: estimated time the disk will run out of inodes, if the system continues to allocate inodes with the rate of the last hour + summary: Disk ${label:mount_point} estimation of lack of inodes + info: Estimated time the disk ${label:mount_point} will run out of inodes, if the system continues to allocate inodes with the rate of the last hour to: silent @@ -141,7 +145,8 @@ component: Disk every: 1m warn: $this > 98 * (($status >= $WARNING) ? (0.7) : (1)) delay: down 15m multiplier 1.2 max 1h - info: average percentage of time ${label:device} disk was busy over the last 10 minutes + summary: Disk ${label:device} utilization + info: Average percentage of time ${label:device} disk was busy over the last 10 minutes to: silent @@ -162,5 +167,6 @@ component: Disk every: 1m warn: $this > 5000 * (($status >= $WARNING) ? (0.7) : (1)) delay: down 15m multiplier 1.2 max 1h - info: average backlog size of the ${label:device} disk over the last 10 minutes + summary: Disk ${label:device} backlog + info: Average backlog size of the ${label:device} disk over the last 10 minutes to: silent diff --git a/health/health.d/dns_query.conf b/health/health.d/dns_query.conf index bf9397d85..756c6a1b6 100644 --- a/health/health.d/dns_query.conf +++ b/health/health.d/dns_query.conf @@ -10,5 +10,6 @@ component: DNS every: 10s warn: $this != nan && $this != 1 delay: up 30s down 5m multiplier 1.5 max 1h + summary: DNS query unsuccessful requests to ${label:server} info: DNS request type ${label:record_type} to server ${label:server} is unsuccessful to: sysadmin diff --git a/health/health.d/dnsmasq_dhcp.conf b/health/health.d/dnsmasq_dhcp.conf index 81d37df64..f6ef01940 100644 --- a/health/health.d/dnsmasq_dhcp.conf +++ b/health/health.d/dnsmasq_dhcp.conf @@ -10,5 +10,6 @@ component: Dnsmasq calc: $used warn: $this > ( ($status >= $WARNING ) ? 
( 80 ) : ( 90 ) ) delay: down 5m - info: DHCP range utilization + summary: Dnsmasq DHCP range ${label:dhcp_range} utilization + info: DHCP range ${label:dhcp_range} utilization to: sysadmin diff --git a/health/health.d/docker.conf b/health/health.d/docker.conf index 01919dc0d..668614d4d 100644 --- a/health/health.d/docker.conf +++ b/health/health.d/docker.conf @@ -7,5 +7,6 @@ component: Docker every: 10s lookup: average -10s of unhealthy warn: $this > 0 + summary: Docker container ${label:container_name} health info: ${label:container_name} docker container health status is unhealthy to: sysadmin diff --git a/health/health.d/elasticsearch.conf b/health/health.d/elasticsearch.conf index 29f1e9b27..600840c58 100644 --- a/health/health.d/elasticsearch.conf +++ b/health/health.d/elasticsearch.conf @@ -12,7 +12,8 @@ component: Elasticsearch units: status crit: $this == 1 delay: down 5m multiplier 1.5 max 1h - info: cluster health status is red. + summary: Elasticsearch cluster ${label:cluster_name} status + info: Elasticsearch cluster ${label:cluster_name} health status is red. to: sysadmin # the idea of '-10m' is to handle yellow status after node restart, @@ -27,7 +28,8 @@ component: Elasticsearch units: status warn: $this == 1 delay: down 5m multiplier 1.5 max 1h - info: cluster health status is yellow. + summary: Elasticsearch cluster ${label:cluster_name} status + info: Elasticsearch cluster ${label:cluster_name} health status is yellow. to: sysadmin template: elasticsearch_node_index_health_red @@ -40,7 +42,8 @@ component: Elasticsearch units: status warn: $this == 1 delay: down 5m multiplier 1.5 max 1h - info: node index $label:index health status is red. + summary: Elasticsearch cluster ${label:cluster_name} index ${label:index} status + info: Elasticsearch cluster ${label:cluster_name} index ${label:index} health status is red. to: sysadmin # don't convert 'lookup' value to seconds in 'calc' due to UI showing seconds as hh:mm:ss (0 as now). 
@@ -55,7 +58,8 @@ component: Elasticsearch
  units: milliseconds
  warn: $this > (($status >= $WARNING) ? (20 * 1000) : (30 * 1000))
  delay: down 5m multiplier 1.5 max 1h
- info: search performance is degraded, queries run slowly.
+ summary: Elasticsearch cluster ${label:cluster_name} node ${label:node_name} query performance
+ info: Elasticsearch cluster ${label:cluster_name} node ${label:node_name} search performance is degraded, queries run slowly.
  to: sysadmin
  template: elasticsearch_node_indices_search_time_fetch
@@ -69,5 +73,6 @@ component: Elasticsearch
  warn: $this > (($status >= $WARNING) ? (3 * 1000) : (5 * 1000))
  crit: $this > (($status == $CRITICAL) ? (5 * 1000) : (30 * 1000))
  delay: down 5m multiplier 1.5 max 1h
- info: search performance is degraded, fetches run slowly.
+ summary: Elasticsearch cluster ${label:cluster_name} node ${label:node_name} fetch performance
+ info: Elasticsearch cluster ${label:cluster_name} node ${label:node_name} search performance is degraded, fetches run slowly.
  to: sysadmin
diff --git a/health/health.d/entropy.conf b/health/health.d/entropy.conf
index 13b0fcde4..be8b1fe4f 100644
--- a/health/health.d/entropy.conf
+++ b/health/health.d/entropy.conf
@@ -15,5 +15,6 @@ component: Cryptography
  every: 5m
  warn: $this < (($status >= $WARNING) ? (200) : (100))
  delay: down 1h multiplier 1.5 max 2h
- info: minimum number of entries in the random numbers pool in the last 5 minutes
+ summary: System entropy pool number of entries
+ info: Minimum number of entries in the random numbers pool in the last 5 minutes
  to: silent
diff --git a/health/health.d/exporting.conf b/health/health.d/exporting.conf
index f1030a317..37d4fd648 100644
--- a/health/health.d/exporting.conf
+++ b/health/health.d/exporting.conf
@@ -10,7 +10,8 @@ component: Exporting engine
  warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every))
  crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every))
  delay: down 5m multiplier 1.5 max 1h
- info: number of seconds since the last successful buffering of exporting data
+ summary: Netdata exporting data last successful buffering
+ info: Number of seconds since the last successful buffering of exporting data
  to: dba
  template: exporting_metrics_sent
@@ -23,5 +24,6 @@ component: Exporting engine
  every: 10s
  warn: $this != 100
  delay: down 5m multiplier 1.5 max 1h
- info: percentage of metrics sent to the external database server
+ summary: Netdata exporting metrics sent
+ info: Percentage of metrics sent to the external database server
  to: dba
diff --git a/health/health.d/file_descriptors.conf b/health/health.d/file_descriptors.conf
index 60bb8d384..20a592d6b 100644
--- a/health/health.d/file_descriptors.conf
+++ b/health/health.d/file_descriptors.conf
@@ -11,11 +11,12 @@
  every: 1m
  crit: $this > 90
  delay: down 15m multiplier 1.5 max 1h
- info: system-wide utilization of open files
+ summary: System open file descriptors utilization
+ info: System-wide utilization of open files
  to: sysadmin
  template: apps_group_file_descriptors_utilization
- on: apps.fd_limit
+ on: app.fds_open_limit
  class: Utilization
  type: System
  component: Process
@@ -27,5 +28,6 @@ component: Process
  every: 10s
  warn: $this > (($status >= $WARNING) ? (85) : (95))
  delay: down 15m multiplier 1.5 max 1h
- info: open files percentage against the processes limits, among all PIDs in application group
+ summary: App group ${label:app_group} file descriptors utilization
+ info: Open files percentage against the processes limits, among all PIDs in application group
  to: sysadmin
diff --git a/health/health.d/gearman.conf b/health/health.d/gearman.conf
index 580d114f8..78e1165d1 100644
--- a/health/health.d/gearman.conf
+++ b/health/health.d/gearman.conf
@@ -9,5 +9,6 @@ component: Gearman
  every: 10s
  warn: $this > 30000
  delay: down 5m multiplier 1.5 max 1h
- info: average number of queued jobs over the last 10 minutes
+ summary: Gearman queued jobs
+ info: Average number of queued jobs over the last 10 minutes
  to: sysadmin
diff --git a/health/health.d/go.d.plugin.conf b/health/health.d/go.d.plugin.conf
index cd87fe0e7..7796a1bc8 100644
--- a/health/health.d/go.d.plugin.conf
+++ b/health/health.d/go.d.plugin.conf
@@ -13,5 +13,6 @@ component: go.d.plugin
  warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every))
  crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every))
  delay: down 5m multiplier 1.5 max 1h
- info: number of seconds since the last successful data collection
+ summary: Go.d plugin last collection
+ info: Number of seconds since the last successful data collection
  to: webmaster
diff --git a/health/health.d/haproxy.conf b/health/health.d/haproxy.conf
index a0ab52bca..66a488fa4 100644
--- a/health/health.d/haproxy.conf
+++ b/health/health.d/haproxy.conf
@@ -7,7 +7,8 @@ component: HAProxy
  every: 10s
  lookup: average -10s
  crit: $this > 0
- info: average number of failed haproxy backend servers over the last 10 seconds
+ summary: HAProxy server status
+ info: Average number of failed haproxy backend servers over the last 10 seconds
  to: sysadmin
  template: haproxy_backend_status
@@ -19,5 +20,6 @@ component: HAProxy
  every: 10s
  lookup: average -10s
  crit: $this > 0
- info: average number of failed haproxy backends over the last 10 seconds
+ summary: HAProxy backend status
+ info: Average number of failed haproxy backends over the last 10 seconds
  to: sysadmin
diff --git a/health/health.d/hdfs.conf b/health/health.d/hdfs.conf
index ca8df31b9..566e815aa 100644
--- a/health/health.d/hdfs.conf
+++ b/health/health.d/hdfs.conf
@@ -12,6 +12,7 @@ component: HDFS
  warn: $this > (($status >= $WARNING) ? (70) : (80))
  crit: $this > (($status == $CRITICAL) ? (80) : (98))
  delay: down 15m multiplier 1.5 max 1h
+ summary: HDFS datanodes space utilization
  info: summary datanodes space capacity utilization
  to: sysadmin
@@ -28,6 +29,7 @@ component: HDFS
  every: 10s
  warn: $this > 0
  delay: down 15m multiplier 1.5 max 1h
+ summary: HDFS missing blocks
  info: number of missing blocks
  to: sysadmin
@@ -42,6 +44,7 @@ component: HDFS
  every: 10s
  warn: $this > 0
  delay: down 15m multiplier 1.5 max 1h
+ summary: HDFS stale datanodes
  info: number of datanodes marked stale due to delayed heartbeat
  to: sysadmin
@@ -56,6 +59,7 @@ component: HDFS
  every: 10s
  crit: $this > 0
  delay: down 15m multiplier 1.5 max 1h
+ summary: HDFS dead datanodes
  info: number of datanodes which are currently dead
  to: sysadmin
@@ -72,5 +76,6 @@ component: HDFS
  every: 10s
  warn: $this > 0
  delay: down 15m multiplier 1.5 max 1h
+ summary: HDFS failed volumes
  info: number of failed volumes
  to: sysadmin
diff --git a/health/health.d/httpcheck.conf b/health/health.d/httpcheck.conf
index 81748b9e0..da5dec797 100644
--- a/health/health.d/httpcheck.conf
+++ b/health/health.d/httpcheck.conf
@@ -9,7 +9,7 @@ component: HTTP endpoint
  calc: ($this < 75) ? (0) : ($this)
  every: 5s
  units: up/down
- info: HTTP endpoint ${label:url} liveness status
+ info: HTTP check endpoint ${label:url} liveness status
  to: silent
  template: httpcheck_web_service_bad_content
@@ -23,7 +23,8 @@ component: HTTP endpoint
  warn: $this >= 10 AND $this < 40
  crit: $this >= 40
  delay: down 5m multiplier 1.5 max 1h
- info: percentage of HTTP responses from ${label:url} with unexpected content in the last 5 minutes
+ summary: HTTP check for ${label:url} unexpected content
+ info: Percentage of HTTP responses from ${label:url} with unexpected content in the last 5 minutes
  to: webmaster
  template: httpcheck_web_service_bad_status
@@ -37,7 +38,8 @@ component: HTTP endpoint
  warn: $this >= 10 AND $this < 40
  crit: $this >= 40
  delay: down 5m multiplier 1.5 max 1h
- info: percentage of HTTP responses from ${label:url} with unexpected status in the last 5 minutes
+ summary: HTTP check for ${label:url} unexpected status
+ info: Percentage of HTTP responses from ${label:url} with unexpected status in the last 5 minutes
  to: webmaster
  template: httpcheck_web_service_timeouts
@@ -51,7 +53,8 @@ component: HTTP endpoint
  warn: $this >= 10 AND $this < 40
  crit: $this >= 40
  delay: down 5m multiplier 1.5 max 1h
- info: percentage of timed-out HTTP requests to ${label:url} in the last 5 minutes
+ summary: HTTP check for ${label:url} timeouts
+ info: Percentage of timed-out HTTP requests to ${label:url} in the last 5 minutes
  to: webmaster
  template: httpcheck_web_service_no_connection
@@ -65,5 +68,6 @@ component: HTTP endpoint
  warn: $this >= 10 AND $this < 40
  crit: $this >= 40
  delay: down 5m multiplier 1.5 max 1h
- info: percentage of failed HTTP requests to ${label:url} in the last 5 minutes
+ summary: HTTP check for ${label:url} failed requests
+ info: Percentage of failed HTTP requests to ${label:url} in the last 5 minutes
  to: webmaster
diff --git a/health/health.d/ioping.conf b/health/health.d/ioping.conf
index 5fd785b84..6d832bf00 100644
--- a/health/health.d/ioping.conf
+++ b/health/health.d/ioping.conf
@@ -9,5 +9,6 @@ component: Disk
  green: 10000
  warn: $this > $green
  delay: down 30m multiplier 1.5 max 2h
- info: average I/O latency over the last 10 seconds
+ summary: IO ping latency
+ info: Average I/O latency over the last 10 seconds
  to: silent
diff --git a/health/health.d/ipc.conf b/health/health.d/ipc.conf
index 3d1b46c02..f77f56065 100644
--- a/health/health.d/ipc.conf
+++ b/health/health.d/ipc.conf
@@ -13,6 +13,7 @@ component: IPC
  every: 10s
  warn: $this > (($status >= $WARNING) ? (70) : (80))
  delay: down 5m multiplier 1.5 max 1h
+ summary: IPC semaphores used
  info: IPC semaphore utilization
  to: sysadmin
@@ -28,5 +29,6 @@ component: IPC
  every: 10s
  warn: $this > (($status >= $WARNING) ? (70) : (80))
  delay: down 5m multiplier 1.5 max 1h
+ summary: IPC semaphore arrays used
  info: IPC semaphore arrays utilization
  to: sysadmin
diff --git a/health/health.d/ipfs.conf b/health/health.d/ipfs.conf
index a514ddfd0..4dfee3c7f 100644
--- a/health/health.d/ipfs.conf
+++ b/health/health.d/ipfs.conf
@@ -10,5 +10,6 @@ component: IPFS
  warn: $this > (($status >= $WARNING) ? (80) : (90))
  crit: $this > (($status == $CRITICAL) ? (90) : (98))
  delay: down 15m multiplier 1.5 max 1h
+ summary: IPFS datastore utilization
  info: IPFS datastore utilization
  to: sysadmin
diff --git a/health/health.d/ipmi.conf b/health/health.d/ipmi.conf
index 1775783df..942dc070b 100644
--- a/health/health.d/ipmi.conf
+++ b/health/health.d/ipmi.conf
@@ -9,6 +9,7 @@ component: IPMI
  warn: $warning > 0
  crit: $critical > 0
  delay: up 5m down 15m multiplier 1.5 max 1h
+ summary: IPMI sensor ${label:sensor} state
  info: IPMI sensor ${label:sensor} (${label:component}) state
  to: sysadmin
@@ -22,5 +23,6 @@ component: IPMI
  every: 10s
  warn: $this > 0
  delay: up 5m down 15m multiplier 1.5 max 1h
+ summary: IPMI entries in System Event Log
  info: number of events in the IPMI System Event Log (SEL)
  to: silent
diff --git a/health/health.d/kubelet.conf b/health/health.d/kubelet.conf
index 428b6ee91..8adf5f7d4 100644
--- a/health/health.d/kubelet.conf
+++ b/health/health.d/kubelet.conf
@@ -14,7 +14,8 @@ component: Kubelet
  every: 10s
  warn: $this == 1
  delay: down 1m multiplier 1.5 max 2h
- info: the node is experiencing a configuration-related error (0: false, 1: true)
+ summary: Kubelet node config error
+ info: The node is experiencing a configuration-related error (0: false, 1: true)
  to: sysadmin
  # Failed Token() requests to the alternate token source
@@ -29,7 +30,8 @@ component: Kubelet
  every: 10s
  warn: $this > 0
  delay: down 1m multiplier 1.5 max 2h
- info: number of failed Token() requests to the alternate token source
+ summary: Kubelet failed token requests
+ info: Number of failed Token() requests to the alternate token source
  to: sysadmin
  # Docker and runtime operation errors
@@ -44,7 +46,8 @@ component: Kubelet
  every: 10s
  warn: $this > (($status >= $WARNING) ? (0) : (20))
  delay: up 30s down 1m multiplier 1.5 max 2h
- info: number of Docker or runtime operation errors
+ summary: Kubelet runtime errors
+ info: Number of Docker or runtime operation errors
  to: sysadmin
  # -----------------------------------------------------------------------------
@@ -84,7 +87,8 @@ component: Kubelet
  warn: $this > (($status >= $WARNING)?(100):(200))
  crit: $this > (($status >= $WARNING)?(200):(400))
  delay: down 1m multiplier 1.5 max 2h
- info: ratio of average Pod Lifecycle Event Generator relisting latency over the last 10 seconds, \
+ summary: Kubelet relisting latency (quantile 0.5)
+ info: Ratio of average Pod Lifecycle Event Generator relisting latency over the last 10 seconds, \
  compared to the last minute (quantile 0.5)
  to: sysadmin
@@ -112,7 +116,8 @@ component: Kubelet
  warn: $this > (($status >= $WARNING)?(200):(400))
  crit: $this > (($status >= $WARNING)?(400):(800))
  delay: down 1m multiplier 1.5 max 2h
- info: ratio of average Pod Lifecycle Event Generator relisting latency over the last 10 seconds, \
+ summary: Kubelet relisting latency (quantile 0.9)
+ info: Ratio of average Pod Lifecycle Event Generator relisting latency over the last 10 seconds, \
  compared to the last minute (quantile 0.9)
  to: sysadmin
@@ -140,6 +145,7 @@ component: Kubelet
  warn: $this > (($status >= $WARNING)?(400):(800))
  crit: $this > (($status >= $WARNING)?(800):(1200))
  delay: down 1m multiplier 1.5 max 2h
- info: ratio of average Pod Lifecycle Event Generator relisting latency over the last 10 seconds, \
+ summary: Kubelet relisting latency (quantile 0.99)
+ info: Ratio of average Pod Lifecycle Event Generator relisting latency over the last 10 seconds, \
  compared to the last minute (quantile 0.99)
  to: sysadmin
diff --git a/health/health.d/linux_power_supply.conf b/health/health.d/linux_power_supply.conf
index 71a5be284..b0d35e752 100644
--- a/health/health.d/linux_power_supply.conf
+++ b/health/health.d/linux_power_supply.conf
@@ -10,5 +10,6 @@ component: Battery
  every: 10s
  warn: $this < 10
  delay: up 30s down 5m multiplier 1.2 max 1h
- info: percentage of remaining power supply capacity
+ summary: Power supply capacity
+ info: Percentage of remaining power supply capacity
  to: silent
diff --git a/health/health.d/load.conf b/health/health.d/load.conf
index 20f6781c8..fd8bf9396 100644
--- a/health/health.d/load.conf
+++ b/health/health.d/load.conf
@@ -14,7 +14,7 @@ component: Load
  calc: ($active_processors == nan or $active_processors == 0) ? (nan) : ( ($active_processors < 2) ? ( 2 ) : ( $active_processors ) )
  units: cpus
  every: 1m
- info: number of active CPU cores in the system
+ info: Number of active CPU cores in the system
  # Send alarms if the load average is unusually high.
  # These intentionally _do not_ calculate the average over the sampled
@@ -33,7 +33,8 @@ component: Load
  every: 1m
  warn: ($this * 100 / $load_cpu_number) > (($status >= $WARNING) ? 175 : 200)
  delay: down 15m multiplier 1.5 max 1h
- info: system fifteen-minute load average
+ summary: Host load average (15 minutes)
+ info: System load average for the past 15 minutes
  to: silent
  alarm: load_average_5
@@ -49,7 +50,8 @@ component: Load
  every: 1m
  warn: ($this * 100 / $load_cpu_number) > (($status >= $WARNING) ? 350 : 400)
  delay: down 15m multiplier 1.5 max 1h
- info: system five-minute load average
+ summary: System load average (5 minutes)
+ info: System load average for the past 5 minutes
  to: silent
  alarm: load_average_1
@@ -65,5 +67,6 @@ component: Load
  every: 1m
  warn: ($this * 100 / $load_cpu_number) > (($status >= $WARNING) ? 700 : 800)
  delay: down 15m multiplier 1.5 max 1h
- info: system one-minute load average
+ summary: System load average (1 minute)
+ info: System load average for the past 1 minute
  to: silent
diff --git a/health/health.d/mdstat.conf b/health/health.d/mdstat.conf
index 4dc0bf207..90f97d851 100644
--- a/health/health.d/mdstat.conf
+++ b/health/health.d/mdstat.conf
@@ -8,7 +8,8 @@ component: RAID
  every: 10s
  calc: $down
  warn: $this > 0
- info: number of devices in the down state for the ${label:device} ${label:raid_level} array. \
+ summary: MD array device ${label:device} down
+ info: Number of devices in the down state for the ${label:device} ${label:raid_level} array. \
  Any number > 0 indicates that the array is degraded.
  to: sysadmin
@@ -23,7 +24,8 @@ chart labels: raid_level=!raid1 !raid10 *
  every: 60s
  warn: $this > 1024
  delay: up 30m
- info: number of unsynchronized blocks for the ${label:device} ${label:raid_level} array
+ summary: MD array device ${label:device} unsynchronized blocks
+ info: Number of unsynchronized blocks for the ${label:device} ${label:raid_level} array
  to: silent
  template: mdstat_nonredundant_last_collected
@@ -36,5 +38,6 @@ component: RAID
  every: 10s
  warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every))
  crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every))
- info: number of seconds since the last successful data collection
+ summary: MD array last collected
+ info: Number of seconds since the last successful data collection
  to: sysadmin
diff --git a/health/health.d/megacli.conf b/health/health.d/megacli.conf
index 9fbcfdb92..118997a59 100644
--- a/health/health.d/megacli.conf
+++ b/health/health.d/megacli.conf
@@ -11,7 +11,8 @@ component: RAID
  every: 10s
  crit: $this > 0
  delay: down 5m multiplier 2 max 10m
- info: adapter is in the degraded state (0: false, 1: true)
+ summary: MegaCLI adapter state
+ info: Adapter is in the degraded state (0: false, 1: true)
  to: sysadmin
  ## Physical Disks
@@ -26,7 +27,8 @@ component: RAID
  every: 10s
  warn: $this > 0
  delay: up 1m down 5m multiplier 2 max 10m
- info: number of physical drive predictive failures
+ summary: MegaCLI physical drive predictive failures
+ info: Number of physical drive predictive failures
  to: sysadmin
  template: megacli_pd_media_errors
@@ -39,7 +41,8 @@ component: RAID
  every: 10s
  warn: $this > 0
  delay: up 1m down 5m multiplier 2 max 10m
- info: number of physical drive media errors
+ summary: MegaCLI physical drive errors
+ info: Number of physical drive media errors
  to: sysadmin
  ## Battery Backup Units (BBU)
@@ -54,7 +57,8 @@ component: RAID
  every: 10s
  warn: $this <= (($status >= $WARNING) ? (85) : (80))
  crit: $this <= (($status == $CRITICAL) ? (50) : (40))
- info: average battery backup unit (BBU) relative state of charge over the last 10 seconds
+ summary: MegaCLI BBU charge state
+ info: Average battery backup unit (BBU) relative state of charge over the last 10 seconds
  to: sysadmin
  template: megacli_bbu_cycle_count
@@ -67,5 +71,6 @@ component: RAID
  every: 10s
  warn: $this >= 100
  crit: $this >= 500
- info: average battery backup unit (BBU) charge cycles count over the last 10 seconds
+ summary: MegaCLI BBU cycles count
+ info: Average battery backup unit (BBU) charge cycles count over the last 10 seconds
  to: sysadmin
diff --git a/health/health.d/memcached.conf b/health/health.d/memcached.conf
index 2a2fe4b82..77ca0afa9 100644
--- a/health/health.d/memcached.conf
+++ b/health/health.d/memcached.conf
@@ -12,7 +12,8 @@ component: Memcached
  warn: $this > (($status >= $WARNING) ? (70) : (80))
  crit: $this > (($status == $CRITICAL) ? (80) : (90))
  delay: up 0 down 15m multiplier 1.5 max 1h
- info: cache memory utilization
+ summary: Memcached memory utilization
+ info: Cache memory utilization
  to: dba
@@ -27,7 +28,7 @@ component: Memcached
  calc: ($this - $available) / (($now - $after) / 3600)
  units: KB/hour
  every: 1m
- info: average rate the cache fills up (positive), or frees up (negative) space over the last hour
+ info: Average rate the cache fills up (positive), or frees up (negative) space over the last hour
  # find the hours remaining until memcached cache is full
@@ -43,6 +44,7 @@ component: Memcached
  warn: $this > 0 and $this < (($status >= $WARNING) ? (48) : (8))
  crit: $this > 0 and $this < (($status == $CRITICAL) ? (24) : (2))
  delay: down 15m multiplier 1.5 max 1h
- info: estimated time the cache will run out of space \
+ summary: Memcached estimation of lack of cache space
+ info: Estimated time the cache will run out of space \
  if the system continues to add data at the same rate as the past hour
  to: dba
diff --git a/health/health.d/memory.conf b/health/health.d/memory.conf
index 8badf09c4..5ab3d2d92 100644
--- a/health/health.d/memory.conf
+++ b/health/health.d/memory.conf
@@ -12,7 +12,8 @@ component: Memory
  every: 10s
  warn: $this > 0
  delay: down 1h multiplier 1.5 max 1h
- info: amount of memory corrupted due to a hardware failure
+ summary: System corrupted memory
+ info: Amount of memory corrupted due to a hardware failure
  to: sysadmin
  ## ECC Controller
@@ -29,7 +30,8 @@ component: Memory
  every: 1m
  warn: $this > 0
  delay: down 1h multiplier 1.5 max 1h
- info: memory controller ${label:controller} ECC correctable errors in the last 10 minutes
+ summary: System ECC memory ${label:controller} correctable errors
+ info: Memory controller ${label:controller} ECC correctable errors in the last 10 minutes
  to: sysadmin
  template: ecc_memory_mc_uncorrectable
@@ -44,7 +46,8 @@ component: Memory
  every: 1m
  crit: $this > 0
  delay: down 1h multiplier 1.5 max 1h
- info: memory controller ${label:controller} ECC uncorrectable errors in the last 10 minutes
+ summary: System ECC memory ${label:controller} uncorrectable errors
+ info: Memory controller ${label:controller} ECC uncorrectable errors in the last 10 minutes
  to: sysadmin
  ## ECC DIMM
@@ -61,6 +64,7 @@ component: Memory
  every: 1m
  warn: $this > 0
  delay: down 1h multiplier 1.5 max 1h
+ summary: System ECC memory DIMM ${label:dimm} correctable errors
  info: DIMM ${label:dimm} controller ${label:controller} (location ${label:dimm_location}) ECC correctable errors in the last 10 minutes
  to: sysadmin
@@ -76,5 +80,6 @@ component: Memory
  every: 1m
  crit: $this > 0
  delay: down 1h multiplier 1.5 max 1h
+ summary: System ECC memory DIMM ${label:dimm} uncorrectable errors
  info: DIMM ${label:dimm} controller ${label:controller} (location ${label:dimm_location}) ECC uncorrectable errors in the last 10 minutes
  to: sysadmin
diff --git a/health/health.d/ml.conf b/health/health.d/ml.conf
index 6836ce7b1..aef9b0368 100644
--- a/health/health.d/ml.conf
+++ b/health/health.d/ml.conf
@@ -3,23 +3,26 @@
  # native anomaly detection here:
  # https://learn.netdata.cloud/docs/agent/ml#anomaly-bit---100--anomalous-0--normal
-# examples below are commented, you would need to uncomment and adjust as desired to enable them.
+# some examples below are commented, you would need to uncomment and adjust as desired to enable them.
-# node level anomaly rate example
+# node level anomaly rate
  # https://learn.netdata.cloud/docs/agent/ml#node-anomaly-rate
-# if node level anomaly rate is between 1-5% then warning (pick your own threshold that works best via tial and error).
-# if node level anomaly rate is above 5% then critical (pick your own threshold that works best via tial and error).
-# template: ml_1min_node_ar
-# on: anomaly_detection.anomaly_rate
-# os: linux
-# hosts: *
-# lookup: average -1m foreach anomaly_rate
-# calc: $this
-# units: %
-# every: 30s
-# warn: $this > (($status >= $WARNING) ? (1) : (5))
-# crit: $this > (($status == $CRITICAL) ? (5) : (100))
-# info: rolling 1min node level anomaly rate
+# if node level anomaly rate is above 1% then warning (pick your own threshold that works best via trial and error).
+ template: ml_1min_node_ar
+ on: anomaly_detection.anomaly_rate
+ class: Workload
+ type: System
+component: ML
+ os: *
+ hosts: *
+ lookup: average -1m of anomaly_rate
+ calc: $this
+ units: %
+ every: 30s
+ warn: $this > 1
+ summary: ML node anomaly rate
+ info: Rolling 1min node level anomaly rate
+ to: silent
  # alert per dimension example
  # if anomaly rate is between 5-20% then warning (pick your own threshold that works best via tial and error).
diff --git a/health/health.d/mysql.conf b/health/health.d/mysql.conf
index 3941c71cc..572560b4e 100644
--- a/health/health.d/mysql.conf
+++ b/health/health.d/mysql.conf
@@ -12,7 +12,8 @@ component: MySQL
  warn: $this > (($status >= $WARNING) ? (5) : (10))
  crit: $this > (($status == $CRITICAL) ? (10) : (20))
  delay: down 5m multiplier 1.5 max 1h
- info: number of slow queries in the last 10 seconds
+ summary: MySQL slow queries
+ info: Number of slow queries in the last 10 seconds
  to: dba
@@ -27,7 +28,8 @@ component: MySQL
  lookup: sum -10s absolute of immediate
  units: immediate locks
  every: 10s
- info: number of table immediate locks in the last 10 seconds
+ summary: MySQL table immediate locks
+ info: Number of table immediate locks in the last 10 seconds
  to: dba
  template: mysql_10s_table_locks_waited
@@ -38,7 +40,8 @@ component: MySQL
  lookup: sum -10s absolute of waited
  units: waited locks
  every: 10s
- info: number of table waited locks in the last 10 seconds
+ summary: MySQL table waited locks
+ info: Number of table waited locks in the last 10 seconds
  to: dba
  template: mysql_10s_waited_locks_ratio
@@ -52,7 +55,8 @@ component: MySQL
  warn: $this > (($status >= $WARNING) ? (10) : (25))
  crit: $this > (($status == $CRITICAL) ? (25) : (50))
  delay: down 30m multiplier 1.5 max 1h
- info: ratio of waited table locks over the last 10 seconds
+ summary: MySQL waited table locks ratio
+ info: Ratio of waited table locks over the last 10 seconds
  to: dba
@@ -70,7 +74,8 @@ component: MySQL
  warn: $this > (($status >= $WARNING) ? (60) : (70))
  crit: $this > (($status == $CRITICAL) ? (80) : (90))
  delay: down 15m multiplier 1.5 max 1h
- info: client connections utilization
+ summary: MySQL connections utilization
+ info: Client connections utilization
  to: dba
@@ -87,7 +92,8 @@ component: MySQL
  every: 10s
  crit: $this == 0
  delay: down 5m multiplier 1.5 max 1h
- info: replication status (0: stopped, 1: working)
+ summary: MySQL replication status
+ info: Replication status (0: stopped, 1: working)
  to: dba
  template: mysql_replication_lag
@@ -101,7 +107,8 @@ component: MySQL
  warn: $this > (($status >= $WARNING) ? (5) : (10))
  crit: $this > (($status == $CRITICAL) ? (10) : (30))
  delay: down 15m multiplier 1.5 max 1h
- info: difference between the timestamp of the latest transaction processed by the SQL thread and \
+ summary: MySQL replication lag
+ info: Difference between the timestamp of the latest transaction processed by the SQL thread and \
  the timestamp of the same transaction when it was processed on the master
  to: dba
@@ -131,7 +138,8 @@ component: MySQL
  warn: $this > $mysql_galera_cluster_size_max_2m
  crit: $this < $mysql_galera_cluster_size_max_2m
  delay: up 20s down 5m multiplier 1.5 max 1h
- info: current galera cluster size, compared to the maximum size in the last 2 minutes
+ summary: MySQL galera cluster size
+ info: Current galera cluster size, compared to the maximum size in the last 2 minutes
  to: dba
  # galera node state
@@ -145,7 +153,8 @@ component: MySQL
  every: 10s
  warn: $this != nan AND $this != 0
  delay: up 30s down 5m multiplier 1.5 max 1h
- info: galera node state is either Donor/Desynced or Joined.
+ summary: MySQL galera node state
+ info: Galera node state is either Donor/Desynced or Joined.
  to: dba
  template: mysql_galera_cluster_state_crit
@@ -157,7 +166,8 @@ component: MySQL
  every: 10s
  crit: $this != nan AND $this != 0
  delay: up 30s down 5m multiplier 1.5 max 1h
- info: galera node state is either Undefined or Joining or Error.
+ summary: MySQL galera node state
+ info: Galera node state is either Undefined or Joining or Error.
  to: dba
  # galera node status
@@ -171,6 +181,7 @@ component: MySQL
  every: 10s
  crit: $this != nan AND $this != 1
  delay: up 30s down 5m multiplier 1.5 max 1h
- info: galera node is part of a nonoperational component. \
+ summary: MySQL galera cluster status
+ info: Galera node is part of a nonoperational component. \
  This occurs in cases of multiple membership changes that result in a loss of Quorum or in cases of split-brain situations.
  to: dba
diff --git a/health/health.d/net.conf b/health/health.d/net.conf
index 095d488da..ea4954187 100644
--- a/health/health.d/net.conf
+++ b/health/health.d/net.conf
@@ -14,7 +14,7 @@ component: Network
  calc: ( $nic_speed_max > 0 ) ? ( $nic_speed_max) : ( nan )
  units: Mbit
  every: 10s
- info: network interface ${label:device} current speed
+ info: Network interface ${label:device} current speed
  template: 1m_received_traffic_overflow
  on: net.net
@@ -29,7 +29,8 @@ component: Network
  every: 10s
  warn: $this > (($status >= $WARNING) ? (85) : (90))
  delay: up 1m down 1m multiplier 1.5 max 1h
- info: average inbound utilization for the network interface ${label:device} over the last minute
+ summary: System network interface ${label:device} inbound utilization
+ info: Average inbound utilization for the network interface ${label:device} over the last minute
  to: silent
  template: 1m_sent_traffic_overflow
@@ -45,7 +46,8 @@ component: Network
  every: 10s
  warn: $this > (($status >= $WARNING) ? (85) : (90))
  delay: up 1m down 1m multiplier 1.5 max 1h
- info: average outbound utilization for the network interface ${label:device} over the last minute
+ summary: System network interface ${label:device} outbound utilization
+ info: Average outbound utilization for the network interface ${label:device} over the last minute
  to: silent
  # -----------------------------------------------------------------------------
@@ -58,66 +60,70 @@ component: Network
  # it is possible to have expected packet drops on an interface for some network configurations
  # look at the Monitoring Network Interfaces section in the proc.plugin documentation for more information
- template: inbound_packets_dropped
- on: net.drops
- class: Errors
+ template: net_interface_inbound_packets
+ on: net.packets
+ class: Workload
  type: System
  component: Network
- os: linux
+ os: *
  hosts: *
- lookup: sum -10m unaligned absolute of inbound
+ lookup: sum -10m unaligned absolute of received
  units: packets
  every: 1m
- info: number of inbound dropped packets for the network interface ${label:device} in the last 10 minutes
+ summary: Network interface ${label:device} received packets
+ info: Received packets for the network interface ${label:device} in the last 10 minutes
- template: outbound_packets_dropped
- on: net.drops
- class: Errors
+ template: net_interface_outbound_packets
+ on: net.packets
+ class: Workload
  type: System
  component: Network
- os: linux
+ os: *
  hosts: *
- lookup: sum -10m unaligned absolute of outbound
+ lookup: sum -10m unaligned absolute of sent
  units: packets
  every: 1m
- info: number of outbound dropped packets for the network interface ${label:device} in the last 10 minutes
+ summary: Network interface ${label:device} sent packets
+ info: Sent packets for the network interface ${label:device} in the last 10 minutes
  template: inbound_packets_dropped_ratio
- on: net.packets
+ on: net.drops
  class: Errors
  type: System
  component: Network
- os: linux
+ os: *
  hosts: *
  chart labels: device=!wl* *
- lookup: sum -10m unaligned absolute of received
- calc: (($inbound_packets_dropped != nan AND $this > 10000) ? ($inbound_packets_dropped * 100 / $this) : (0))
+ lookup: sum -10m unaligned absolute of inbound
+ calc: (($net_interface_inbound_packets > 10000) ? ($this * 100 / $net_interface_inbound_packets) : (0))
  units: %
  every: 1m
  warn: $this >= 2
  delay: up 1m down 1h multiplier 1.5 max 2h
- info: ratio of inbound dropped packets for the network interface ${label:device} over the last 10 minutes
+ summary: System network interface ${label:device} inbound drops
+ info: Ratio of inbound dropped packets for the network interface ${label:device} over the last 10 minutes
  to: silent
  template: outbound_packets_dropped_ratio
- on: net.packets
+ on: net.drops
  class: Errors
  type: System
  component: Network
- os: linux
+ os: *
  hosts: *
  chart labels: device=!wl* *
- lookup: sum -10m unaligned absolute of sent
- calc: (($outbound_packets_dropped != nan AND $this > 1000) ? ($outbound_packets_dropped * 100 / $this) : (0))
+ lookup: sum -10m unaligned absolute of outbound
+ calc: (($net_interface_outbound_packets > 1000) ? ($this * 100 / $net_interface_outbound_packets) : (0))
  units: %
  every: 1m
  warn: $this >= 2
  delay: up 1m down 1h multiplier 1.5 max 2h
- info: ratio of outbound dropped packets for the network interface ${label:device} over the last 10 minutes
+ summary: System network interface ${label:device} outbound drops
+ info: Ratio of outbound dropped packets for the network interface ${label:device} over the last 10 minutes
  to: silent
  template: wifi_inbound_packets_dropped_ratio
- on: net.packets
+ on: net.drops
  class: Errors
  type: System
  component: Network
@@ -125,16 +131,17 @@ component: Network
  hosts: *
  chart labels: device=wl*
  lookup: sum -10m unaligned absolute of received
- calc: (($inbound_packets_dropped != nan AND $this > 10000) ? ($inbound_packets_dropped * 100 / $this) : (0))
+ calc: (($net_interface_inbound_packets > 10000) ? ($this * 100 / $net_interface_inbound_packets) : (0))
  units: %
  every: 1m
  warn: $this >= 10
  delay: up 1m down 1h multiplier 1.5 max 2h
- info: ratio of inbound dropped packets for the network interface ${label:device} over the last 10 minutes
+ summary: System network interface ${label:device} inbound drops ratio
+ info: Ratio of inbound dropped packets for the network interface ${label:device} over the last 10 minutes
  to: silent
  template: wifi_outbound_packets_dropped_ratio
- on: net.packets
+ on: net.drops
  class: Errors
  type: System
  component: Network
@@ -142,12 +149,13 @@ component: Network
  hosts: *
  chart labels: device=wl*
  lookup: sum -10m unaligned absolute of sent
- calc: (($outbound_packets_dropped != nan AND $this > 1000) ? ($outbound_packets_dropped * 100 / $this) : (0))
+ calc: (($net_interface_outbound_packets > 1000) ? ($this * 100 / $net_interface_outbound_packets) : (0))
  units: %
  every: 1m
  warn: $this >= 10
  delay: up 1m down 1h multiplier 1.5 max 2h
- info: ratio of outbound dropped packets for the network interface ${label:device} over the last 10 minutes
+ summary: System network interface ${label:device} outbound drops ratio
+ info: Ratio of outbound dropped packets for the network interface ${label:device} over the last 10 minutes
  to: silent
  # -----------------------------------------------------------------------------
@@ -165,7 +173,8 @@ component: Network
  every: 1m
  warn: $this >= 5
  delay: down 1h multiplier 1.5 max 2h
- info: number of inbound errors for the network interface ${label:device} in the last 10 minutes
+ summary: System network interface ${label:device} inbound errors
+ info: Number of inbound errors for the network interface ${label:device} in the last 10 minutes
  to: silent
  template: interface_outbound_errors
@@ -180,7 +189,8 @@ component: Network
  every: 1m
  warn: $this >= 5
  delay: down 1h multiplier 1.5 max 2h
- info: number of outbound errors for the network interface ${label:device} in the last 10 minutes
+ summary: System network interface ${label:device} outbound errors
+ info: Number of outbound errors for the network interface ${label:device} in the last 10 minutes
  to: silent
  # -----------------------------------------------------------------------------
@@ -203,7 +213,8 @@ component: Network
  every: 1m
  warn: $this > 0
  delay: down 1h multiplier 1.5 max 2h
- info: number of FIFO errors for the network interface ${label:device} in the last 10 minutes
+ summary: System network interface ${label:device} FIFO errors
+ info: Number of FIFO errors for the network interface ${label:device} in the last 10 minutes
  to: silent
  # -----------------------------------------------------------------------------
@@ -225,7 +236,7 @@ component: Network
  lookup: average -1m unaligned of received
  units: packets
  every: 10s
- info: average number of packets received by the network interface ${label:device} over the last minute
+ info: Average number of packets received by the network interface ${label:device} over the last minute
  template: 10s_received_packets_storm
  on: net.packets
@@ -241,6 +252,7 @@ component: Network
  warn: $this > (($status >= $WARNING)?(200):(5000))
  crit: $this > (($status == $CRITICAL)?(5000):(6000))
  options: no-clear-notification
- info: ratio of average number of received packets for the network interface ${label:device} over the last 10 seconds, \
+ summary: System network interface ${label:device} inbound packet storm
+ info: Ratio of average number of received packets for the network interface ${label:device} over the last 10 seconds, \
  compared to the rate over the last minute
  to: silent
diff --git a/health/health.d/netfilter.conf b/health/health.d/netfilter.conf
index 7de383fa2..417105d43 100644
--- a/health/health.d/netfilter.conf
+++ b/health/health.d/netfilter.conf
@@ -15,5 +15,6 @@ component: Network
  warn: $this > (($status >= $WARNING) ? (85) : (90))
  crit: $this > (($status == $CRITICAL) ? (90) : (95))
  delay: down 5m multiplier 1.5 max 1h
- info: netfilter connection tracker table size utilization
+ summary: System Netfilter connection tracker utilization
+ info: Netfilter connection tracker table size utilization
  to: sysadmin
diff --git a/health/health.d/nut.conf b/health/health.d/nut.conf
index 67843205c..7a74653e9 100644
--- a/health/health.d/nut.conf
+++ b/health/health.d/nut.conf
@@ -13,7 +13,8 @@ component: UPS
  warn: $this > (($status >= $WARNING) ? (70) : (80))
  crit: $this > (($status == $CRITICAL) ? (85) : (95))
  delay: down 10m multiplier 1.5 max 1h
- info: average UPS load over the last 10 minutes
+ summary: UPS load
+ info: UPS average load over the last 10 minutes
  to: sitemgr
  template: nut_ups_charge
@@ -29,7 +30,8 @@ component: UPS
  warn: $this < 75
  crit: $this < 40
  delay: down 10m multiplier 1.5 max 1h
- info: average UPS charge over the last minute
+ summary: UPS battery charge
+ info: UPS average battery charge over the last minute
  to: sitemgr
  template: nut_last_collected_secs
@@ -43,5 +45,6 @@ component: UPS device
  warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every))
  crit: $this > (($status == $CRITICAL) ?
($update_every) : (60 * $update_every)) delay: down 5m multiplier 1.5 max 1h - info: number of seconds since the last successful data collection + summary: NUT last collected + info: Number of seconds since the last successful data collection to: sitemgr diff --git a/health/health.d/nvme.conf b/health/health.d/nvme.conf index 742ffbc93..aea402e88 100644 --- a/health/health.d/nvme.conf +++ b/health/health.d/nvme.conf @@ -10,5 +10,6 @@ component: Disk every: 10s crit: $this != nan AND $this != 0 delay: down 5m multiplier 1.5 max 2h + summary: NVMe device ${label:device} state info: NVMe device ${label:device} has critical warnings to: sysadmin diff --git a/health/health.d/pihole.conf b/health/health.d/pihole.conf index 045930ae5..c4db835ce 100644 --- a/health/health.d/pihole.conf +++ b/health/health.d/pihole.conf @@ -11,6 +11,7 @@ component: Pi-hole units: seconds calc: $ago warn: $this > 60 * 60 * 24 * 30 + summary: Pi-hole blocklist last update info: gravity.list (blocklist) file last update time to: sysadmin @@ -27,5 +28,6 @@ component: Pi-hole calc: $disabled warn: $this != nan AND $this == 1 delay: up 2m down 5m - info: unwanted domains blocking is disabled + summary: Pi-hole domains blocking status + info: Unwanted domains blocking is disabled to: sysadmin diff --git a/health/health.d/ping.conf b/health/health.d/ping.conf index b8d39bbad..0e434420d 100644 --- a/health/health.d/ping.conf +++ b/health/health.d/ping.conf @@ -11,7 +11,8 @@ component: Network every: 10s crit: $this == 0 delay: down 30m multiplier 1.5 max 2h - info: network host ${label:host} reachability status + summary: Host ${label:host} ping status + info: Network host ${label:host} reachability status to: sysadmin template: ping_packet_loss @@ -27,7 +28,8 @@ component: Network warn: $this > $green crit: $this > $red delay: down 30m multiplier 1.5 max 2h - info: packet loss percentage to the network host ${label:host} over the last 10 minutes + summary: Host ${label:host} ping packet loss + 
info: Packet loss percentage to the network host ${label:host} over the last 10 minutes to: sysadmin template: ping_host_latency @@ -43,5 +45,6 @@ component: Network warn: $this > $green OR $max > $red crit: $this > $red delay: down 30m multiplier 1.5 max 2h - info: average latency to the network host ${label:host} over the last 10 seconds + summary: Host ${label:host} ping latency + info: Average latency to the network host ${label:host} over the last 10 seconds to: sysadmin diff --git a/health/health.d/plugin.conf b/health/health.d/plugin.conf index 0a891db79..8615a0213 100644 --- a/health/health.d/plugin.conf +++ b/health/health.d/plugin.conf @@ -7,5 +7,6 @@ every: 10s warn: $this > (($status >= $WARNING) ? ($update_every) : (20 * $update_every)) delay: down 5m multiplier 1.5 max 1h + summary: Plugin ${label:_collect_plugin} availability status info: the amount of time that ${label:_collect_plugin} did not report its availability status to: sysadmin diff --git a/health/health.d/portcheck.conf b/health/health.d/portcheck.conf index 34550ea02..281731c86 100644 --- a/health/health.d/portcheck.conf +++ b/health/health.d/portcheck.conf @@ -9,6 +9,7 @@ component: TCP endpoint calc: ($this < 75) ? 
(0) : ($this) every: 5s units: up/down + summary: Portcheck status for ${label:host}:${label:port} info: TCP host ${label:host} port ${label:port} liveness status to: silent @@ -23,7 +24,8 @@ component: TCP endpoint warn: $this >= 10 AND $this < 40 crit: $this >= 40 delay: down 5m multiplier 1.5 max 1h - info: percentage of timed-out TCP connections to host ${label:host} port ${label:port} in the last 5 minutes + summary: Portcheck timeouts for ${label:host}:${label:port} + info: Percentage of timed-out TCP connections to host ${label:host} port ${label:port} in the last 5 minutes to: sysadmin template: portcheck_connection_fails @@ -37,5 +39,6 @@ component: TCP endpoint warn: $this >= 10 AND $this < 40 crit: $this >= 40 delay: down 5m multiplier 1.5 max 1h - info: percentage of failed TCP connections to host ${label:host} port ${label:port} in the last 5 minutes + summary: Portcheck fails for ${label:host}:${label:port} + info: Percentage of failed TCP connections to host ${label:host} port ${label:port} in the last 5 minutes to: sysadmin diff --git a/health/health.d/postgres.conf b/health/health.d/postgres.conf index 67b25673b..de4c0078e 100644 --- a/health/health.d/postgres.conf +++ b/health/health.d/postgres.conf @@ -12,7 +12,8 @@ component: PostgreSQL warn: $this > (($status >= $WARNING) ? (70) : (80)) crit: $this > (($status == $CRITICAL) ? (80) : (90)) delay: down 15m multiplier 1.5 max 1h - info: average total connection utilization over the last minute + summary: PostgreSQL connection utilization + info: Average total connection utilization over the last minute to: dba template: postgres_acquired_locks_utilization @@ -26,7 +27,8 @@ component: PostgreSQL every: 1m warn: $this > (($status >= $WARNING) ? 
(15) : (20)) delay: down 15m multiplier 1.5 max 1h - info: average acquired locks utilization over the last minute + summary: PostgreSQL acquired locks utilization + info: Average acquired locks utilization over the last minute to: dba template: postgres_txid_exhaustion_perc @@ -40,7 +42,8 @@ component: PostgreSQL every: 1m warn: $this > 90 delay: down 15m multiplier 1.5 max 1h - info: percent towards TXID wraparound + summary: PostgreSQL TXID exhaustion + info: Percent towards TXID wraparound to: dba # Database alarms @@ -58,7 +61,8 @@ component: PostgreSQL warn: $this < (($status >= $WARNING) ? (70) : (60)) crit: $this < (($status == $CRITICAL) ? (60) : (50)) delay: down 15m multiplier 1.5 max 1h - info: average cache hit ratio in db ${label:database} over the last minute + summary: PostgreSQL DB ${label:database} cache hit ratio + info: Average cache hit ratio in db ${label:database} over the last minute to: dba template: postgres_db_transactions_rollback_ratio @@ -72,7 +76,8 @@ component: PostgreSQL every: 1m warn: $this > (($status >= $WARNING) ? (0) : (2)) delay: down 15m multiplier 1.5 max 1h - info: average aborted transactions percentage in db ${label:database} over the last five minutes + summary: PostgreSQL DB ${label:database} aborted transactions + info: Average aborted transactions percentage in db ${label:database} over the last five minutes to: dba template: postgres_db_deadlocks_rate @@ -86,7 +91,8 @@ component: PostgreSQL every: 1m warn: $this > (($status >= $WARNING) ? (0) : (10)) delay: down 15m multiplier 1.5 max 1h - info: number of deadlocks detected in db ${label:database} in the last minute + summary: PostgreSQL DB ${label:database} deadlocks rate + info: Number of deadlocks detected in db ${label:database} in the last minute to: dba # Table alarms @@ -104,7 +110,8 @@ component: PostgreSQL warn: $this < (($status >= $WARNING) ? (70) : (60)) crit: $this < (($status == $CRITICAL) ? 
(60) : (50)) delay: down 15m multiplier 1.5 max 1h - info: average cache hit ratio in db ${label:database} table ${label:table} over the last minute + summary: PostgreSQL table ${label:table} db ${label:database} cache hit ratio + info: Average cache hit ratio in db ${label:database} table ${label:table} over the last minute to: dba template: postgres_table_index_cache_io_ratio @@ -120,7 +127,8 @@ component: PostgreSQL warn: $this < (($status >= $WARNING) ? (70) : (60)) crit: $this < (($status == $CRITICAL) ? (60) : (50)) delay: down 15m multiplier 1.5 max 1h - info: average index cache hit ratio in db ${label:database} table ${label:table} over the last minute + summary: PostgreSQL table ${label:table} db ${label:database} index cache hit ratio + info: Average index cache hit ratio in db ${label:database} table ${label:table} over the last minute to: dba template: postgres_table_toast_cache_io_ratio @@ -136,7 +144,8 @@ component: PostgreSQL warn: $this < (($status >= $WARNING) ? (70) : (60)) crit: $this < (($status == $CRITICAL) ? (60) : (50)) delay: down 15m multiplier 1.5 max 1h - info: average TOAST hit ratio in db ${label:database} table ${label:table} over the last minute + summary: PostgreSQL table ${label:table} db ${label:database} toast cache hit ratio + info: Average TOAST hit ratio in db ${label:database} table ${label:table} over the last minute to: dba template: postgres_table_toast_index_cache_io_ratio @@ -152,6 +161,7 @@ component: PostgreSQL warn: $this < (($status >= $WARNING) ? (70) : (60)) crit: $this < (($status == $CRITICAL) ? (60) : (50)) delay: down 15m multiplier 1.5 max 1h + summary: PostgreSQL table ${label:table} db ${label:database} index toast hit ratio info: average index TOAST hit ratio in db ${label:database} table ${label:table} over the last minute to: dba @@ -167,7 +177,8 @@ component: PostgreSQL warn: $this > (($status >= $WARNING) ? (60) : (70)) crit: $this > (($status == $CRITICAL) ? 
(70) : (80)) delay: down 15m multiplier 1.5 max 1h - info: bloat size percentage in db ${label:database} table ${label:table} + summary: PostgreSQL table ${label:table} db ${label:database} bloat size + info: Bloat size percentage in db ${label:database} table ${label:table} to: dba template: postgres_table_last_autovacuum_time @@ -180,7 +191,8 @@ component: PostgreSQL units: seconds every: 1m warn: $this != nan AND $this > (60 * 60 * 24 * 7) - info: time elapsed since db ${label:database} table ${label:table} was vacuumed by the autovacuum daemon + summary: PostgreSQL table ${label:table} db ${label:database} last autovacuum + info: Time elapsed since db ${label:database} table ${label:table} was vacuumed by the autovacuum daemon to: dba template: postgres_table_last_autoanalyze_time @@ -193,7 +205,8 @@ component: PostgreSQL units: seconds every: 1m warn: $this != nan AND $this > (60 * 60 * 24 * 7) - info: time elapsed since db ${label:database} table ${label:table} was analyzed by the autovacuum daemon + summary: PostgreSQL table ${label:table} db ${label:database} last autoanalyze + info: Time elapsed since db ${label:database} table ${label:table} was analyzed by the autovacuum daemon to: dba # Index alarms @@ -210,5 +223,6 @@ component: PostgreSQL warn: $this > (($status >= $WARNING) ? (60) : (70)) crit: $this > (($status == $CRITICAL) ? (70) : (80)) delay: down 15m multiplier 1.5 max 1h - info: bloat size percentage in db ${label:database} table ${label:table} index ${label:index} + summary: PostgreSQL table ${label:table} db ${label:database} index bloat size + info: Bloat size percentage in db ${label:database} table ${label:table} index ${label:index} to: dba diff --git a/health/health.d/processes.conf b/health/health.d/processes.conf index 2929ee3d4..8f2e0fda5 100644 --- a/health/health.d/processes.conf +++ b/health/health.d/processes.conf @@ -12,5 +12,6 @@ component: Processes warn: $this > (($status >= $WARNING) ? 
(85) : (90)) crit: $this > (($status == $CRITICAL) ? (90) : (95)) delay: down 5m multiplier 1.5 max 1h - info: system process IDs (PID) space utilization + summary: System PIDs utilization + info: System process IDs (PID) space utilization to: sysadmin diff --git a/health/health.d/python.d.plugin.conf b/health/health.d/python.d.plugin.conf index 0e81a482f..da27ad5b7 100644 --- a/health/health.d/python.d.plugin.conf +++ b/health/health.d/python.d.plugin.conf @@ -13,5 +13,6 @@ component: python.d.plugin warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) delay: down 5m multiplier 1.5 max 1h - info: number of seconds since the last successful data collection + summary: Python.d plugin last collection + info: Number of seconds since the last successful data collection to: webmaster diff --git a/health/health.d/qos.conf b/health/health.d/qos.conf index 4b0a5cb96..970ea6363 100644 --- a/health/health.d/qos.conf +++ b/health/health.d/qos.conf @@ -13,5 +13,6 @@ template: 10min_qos_packet_drops every: 30s warn: $this > 0 units: packets - info: dropped packets in the last 5 minutes + summary: QOS packet drops + info: Dropped packets in the last 5 minutes to: silent diff --git a/health/health.d/ram.conf b/health/health.d/ram.conf index c121264f7..51f307ca6 100644 --- a/health/health.d/ram.conf +++ b/health/health.d/ram.conf @@ -14,7 +14,8 @@ component: Memory warn: $this > (($status >= $WARNING) ? (80) : (90)) crit: $this > (($status == $CRITICAL) ? (90) : (98)) delay: down 15m multiplier 1.5 max 1h - info: system memory utilization + summary: System memory utilization + info: System memory utilization to: sysadmin alarm: ram_available @@ -29,20 +30,22 @@ component: Memory every: 10s warn: $this < (($status >= $WARNING) ? 
(15) : (10)) delay: down 15m multiplier 1.5 max 1h - info: percentage of estimated amount of RAM available for userspace processes, without causing swapping + summary: System available memory + info: Percentage of estimated amount of RAM available for userspace processes, without causing swapping to: silent - alarm: oom_kill - on: mem.oom_kill - os: linux - hosts: * - lookup: sum -30m unaligned - units: kills - every: 5m - warn: $this > 0 - delay: down 10m - info: number of out of memory kills in the last 30 minutes - to: silent + alarm: oom_kill + on: mem.oom_kill + os: linux + hosts: * + lookup: sum -30m unaligned + units: kills + every: 5m + warn: $this > 0 + delay: down 10m + summary: System OOM kills + info: Number of out of memory kills in the last 30 minutes + to: silent ## FreeBSD alarm: ram_in_use @@ -58,7 +61,8 @@ component: Memory warn: $this > (($status >= $WARNING) ? (80) : (90)) crit: $this > (($status == $CRITICAL) ? (90) : (98)) delay: down 15m multiplier 1.5 max 1h - info: system memory utilization + summary: System memory utilization + info: System memory utilization to: sysadmin alarm: ram_available @@ -73,5 +77,6 @@ component: Memory every: 10s warn: $this < (($status >= $WARNING) ? 
(15) : (10)) delay: down 15m multiplier 1.5 max 1h - info: percentage of estimated amount of RAM available for userspace processes, without causing swapping + summary: System available memory + info: Percentage of estimated amount of RAM available for userspace processes, without causing swapping to: silent diff --git a/health/health.d/redis.conf b/health/health.d/redis.conf index a58fa34d1..7c2945e68 100644 --- a/health/health.d/redis.conf +++ b/health/health.d/redis.conf @@ -9,7 +9,8 @@ component: Redis every: 10s units: connections warn: $this > 0 - info: connections rejected because of maxclients limit in the last minute + summary: Redis rejected connections + info: Connections rejected because of maxclients limit in the last minute delay: down 5m multiplier 1.5 max 1h to: dba @@ -21,7 +22,8 @@ component: Redis every: 10s crit: $last_bgsave != nan AND $last_bgsave != 0 units: ok/failed - info: status of the last RDB save operation (0: ok, 1: error) + summary: Redis background save + info: Status of the last RDB save operation (0: ok, 1: error) delay: down 5m multiplier 1.5 max 1h to: dba @@ -35,7 +37,8 @@ component: Redis warn: $this > 600 crit: $this > 1200 units: seconds - info: duration of the on-going RDB save operation + summary: Redis slow background save + info: Duration of the on-going RDB save operation delay: down 5m multiplier 1.5 max 1h to: dba @@ -48,6 +51,7 @@ component: Redis calc: $time units: seconds crit: $this != nan AND $this > 0 - info: time elapsed since the link between master and slave is down + summary: Redis master link down + info: Time elapsed since the link between master and slave is down delay: down 5m multiplier 1.5 max 1h to: dba diff --git a/health/health.d/retroshare.conf b/health/health.d/retroshare.conf index 14aa76b4c..c665430fa 100644 --- a/health/health.d/retroshare.conf +++ b/health/health.d/retroshare.conf @@ -12,5 +12,6 @@ component: Retroshare warn: $this < (($status >= $WARNING) ? 
(120) : (100)) crit: $this < (($status == $CRITICAL) ? (10) : (1)) delay: up 0 down 15m multiplier 1.5 max 1h - info: number of DHT peers + summary: Retroshare DHT peers + info: Number of DHT peers to: sysadmin diff --git a/health/health.d/riakkv.conf b/health/health.d/riakkv.conf index 261fd48c6..677e3cb4f 100644 --- a/health/health.d/riakkv.conf +++ b/health/health.d/riakkv.conf @@ -9,7 +9,8 @@ component: Riak KV units: state machines every: 10s warn: $list_fsm_active > 0 - info: number of currently running list keys finite state machines + summary: Riak KV active list keys + info: Number of currently running list keys finite state machines to: dba @@ -38,7 +39,8 @@ component: Riak KV every: 10s warn: ($this > ($riakkv_1h_kv_get_mean_latency * 2) ) crit: ($this > ($riakkv_1h_kv_get_mean_latency * 3) ) - info: average time between reception of client GET request and \ + summary: Riak KV GET latency + info: Average time between reception of client GET request and \ subsequent response to the client over the last 3 minutes, \ compared to the average over the last hour delay: down 5m multiplier 1.5 max 1h @@ -54,7 +56,8 @@ component: Riak KV lookup: average -1h unaligned of time every: 30s units: ms - info: average time between reception of client PUT request and \ + summary: Riak KV PUT mean latency + info: Average time between reception of client PUT request and \ subsequent response to the client over the last hour template: riakkv_kv_put_slow @@ -68,7 +71,8 @@ component: Riak KV every: 10s warn: ($this > ($riakkv_1h_kv_put_mean_latency * 2) ) crit: ($this > ($riakkv_1h_kv_put_mean_latency * 3) ) - info: average time between reception of client PUT request and \ + summary: Riak KV PUT latency + info: Average time between reception of client PUT request and \ subsequent response to the client over the last 3 minutes, \ compared to the average over the last hour delay: down 5m multiplier 1.5 max 1h @@ -89,5 +93,6 @@ component: Riak KV every: 10s warn: $this > 10000 
 crit: $this > 100000
- info: number of processes running in the Erlang VM
+ summary: Riak KV number of processes
+ info: Number of processes running in the Erlang VM
 to: dba
diff --git a/health/health.d/scaleio.conf b/health/health.d/scaleio.conf
index 27a857fcd..b089cb85e 100644
--- a/health/health.d/scaleio.conf
+++ b/health/health.d/scaleio.conf
@@ -12,7 +12,8 @@ component: ScaleIO
 warn: $this > (($status >= $WARNING) ? (80) : (85))
 crit: $this > (($status == $CRITICAL) ? (85) : (90))
 delay: down 15m multiplier 1.5 max 1h
- info: storage pool capacity utilization
+ summary: ScaleIO storage pool capacity utilization
+ info: Storage pool capacity utilization
 to: sysadmin
@@ -27,5 +28,6 @@ component: ScaleIO
 every: 10s
 warn: $this != 1
 delay: up 30s down 5m multiplier 1.5 max 1h
+ summary: ScaleIO SDC-MDM connection state
 info: Data Client (SDC) to Metadata Manager (MDM) connection state (0: disconnected, 1: connected)
 to: sysadmin
diff --git a/health/health.d/softnet.conf b/health/health.d/softnet.conf
index b621d969d..8d7ba5661 100644
--- a/health/health.d/softnet.conf
+++ b/health/health.d/softnet.conf
@@ -15,7 +15,8 @@ component: Network
 every: 10s
 warn: $this > (($status >= $WARNING) ? (0) : (10))
 delay: down 1h multiplier 1.5 max 2h
- info: average number of dropped packets in the last minute \
+ summary: System netdev dropped packets
+ info: Average number of dropped packets in the last minute \
 due to exceeded net.core.netdev_max_backlog
 to: silent
@@ -31,7 +32,8 @@ component: Network
 every: 10s
 warn: $this > (($status >= $WARNING) ? (0) : (10))
 delay: down 1h multiplier 1.5 max 2h
- info: average number of times ksoftirq ran out of sysctl net.core.netdev_budget or \
+ summary: System netdev budget run outs
+ info: Average number of times ksoftirq ran out of sysctl net.core.netdev_budget or \
 net.core.netdev_budget_usecs with work remaining over the last minute \
 (this can be a cause for dropped packets)
 to: silent
@@ -48,7 +50,8 @@ component: Network
 every: 10s
 warn: $this > (($status >= $WARNING) ? (0) : (10))
 delay: down 1h multiplier 1.5 max 2h
- info: average number of drops in the last minute \
+ summary: System netisr drops
+ info: Average number of drops in the last minute \
 due to exceeded sysctl net.route.netisr_maxqlen \
 (this can be a cause for dropped packets)
 to: silent
diff --git a/health/health.d/swap.conf b/health/health.d/swap.conf
index 3adcae9db..e39733996 100644
--- a/health/health.d/swap.conf
+++ b/health/health.d/swap.conf
@@ -15,7 +15,8 @@ component: Memory
 every: 1m
 warn: $this > (($status >= $WARNING) ? (20) : (30))
 delay: down 15m multiplier 1.5 max 1h
- info: percentage of the system RAM swapped in the last 30 minutes
+ summary: System memory swapped out
+ info: Percentage of the system RAM swapped in the last 30 minutes
 to: silent
 alarm: used_swap
@@ -31,5 +32,6 @@ component: Memory
 warn: $this > (($status >= $WARNING) ? (80) : (90))
 crit: $this > (($status == $CRITICAL) ? (90) : (98))
 delay: up 30s down 15m multiplier 1.5 max 1h
- info: swap memory utilization
+ summary: System swap memory utilization
+ info: Swap memory utilization
 to: sysadmin
diff --git a/health/health.d/synchronization.conf b/health/health.d/synchronization.conf
index 837bb1b32..6c947d90b 100644
--- a/health/health.d/synchronization.conf
+++ b/health/health.d/synchronization.conf
@@ -6,7 +6,8 @@
 every: 1m
 warn: $this > 6
 delay: up 1m down 10m multiplier 1.5 max 1h
- info: number of sync() system calls. \
+ summary: Sync system call frequency
+ info: Number of sync() system calls. \
 Every call causes all pending modifications to filesystem metadata and \
 cached file data to be written to the underlying filesystems.
 to: silent
diff --git a/health/health.d/systemdunits.conf b/health/health.d/systemdunits.conf
index aadf8452b..ad53a0e1c 100644
--- a/health/health.d/systemdunits.conf
+++ b/health/health.d/systemdunits.conf
@@ -12,6 +12,7 @@ component: Systemd units
 every: 10s
 warn: $this != nan AND $this == 1
 delay: down 5m multiplier 1.5 max 1h
+ summary: systemd unit ${label:unit_name} state
 info: systemd service unit in the failed state
 to: sysadmin
@@ -27,6 +28,7 @@ component: Systemd units
 every: 10s
 warn: $this != nan AND $this == 1
 delay: down 5m multiplier 1.5 max 1h
+ summary: systemd unit ${label:unit_name} state
 info: systemd socket unit in the failed state
 to: sysadmin
@@ -42,6 +44,7 @@ component: Systemd units
 every: 10s
 warn: $this != nan AND $this == 1
 delay: down 5m multiplier 1.5 max 1h
+ summary: systemd unit ${label:unit_name} state
 info: systemd target unit in the failed state
 to: sysadmin
@@ -57,6 +60,7 @@ component: Systemd units
 every: 10s
 warn: $this != nan AND $this == 1
 delay: down 5m multiplier 1.5 max 1h
+ summary: systemd unit ${label:unit_name} state
 info: systemd path unit in the failed state
 to: sysadmin
@@ -72,6 +76,7 @@ component: Systemd units
 every: 10s
 warn: $this != nan AND $this == 1
 delay: down 5m multiplier 1.5 max 1h
+ summary: systemd unit ${label:unit_name} state
 info: systemd device unit in the failed state
 to: sysadmin
@@ -87,6 +92,7 @@ component: Systemd units
 every: 10s
 warn: $this != nan AND $this == 1
 delay: down 5m multiplier 1.5 max 1h
+ summary: systemd unit ${label:unit_name} state
 info: systemd mount units in the failed state
 to: sysadmin
@@ -102,6 +108,7 @@ component: Systemd units
 every: 10s
 warn: $this != nan AND $this == 1
 delay: down 5m multiplier 1.5 max 1h
+ summary: systemd unit ${label:unit_name} state
 info: systemd automount unit in the failed state
 to: sysadmin
@@ -117,6 +124,7 @@ component: Systemd units
 every: 10s
 warn: $this != nan AND $this == 1
 delay: down 5m multiplier 1.5 max 1h
+ summary: systemd unit ${label:unit_name} state
 info: systemd swap units in the failed state
 to: sysadmin
@@ -132,6 +140,7 @@ component: Systemd units
 every: 10s
 warn: $this != nan AND $this == 1
 delay: down 5m multiplier 1.5 max 1h
+ summary: systemd unit ${label:unit_name} state
 info: systemd scope units in the failed state
 to: sysadmin
@@ -147,5 +156,6 @@ component: Systemd units
 every: 10s
 warn: $this != nan AND $this == 1
 delay: down 5m multiplier 1.5 max 1h
+ summary: systemd unit ${label:unit_name} state
 info: systemd slice units in the failed state
 to: sysadmin
diff --git a/health/health.d/tcp_conn.conf b/health/health.d/tcp_conn.conf
index 67b3bee53..2b2f97406 100644
--- a/health/health.d/tcp_conn.conf
+++ b/health/health.d/tcp_conn.conf
@@ -6,7 +6,7 @@
 #
 alarm: tcp_connections
- on: ipv4.tcpsock
+ on: ip.tcpsock
 class: Workload
 type: System
 component: Network
@@ -18,5 +18,6 @@ component: Network
 warn: $this > (($status >= $WARNING ) ? ( 60 ) : ( 80 ))
 crit: $this > (($status == $CRITICAL) ? ( 80 ) : ( 90 ))
 delay: up 0 down 5m multiplier 1.5 max 1h
+ summary: System TCP connections utilization
 info: IPv4 TCP connections utilization
 to: sysadmin
diff --git a/health/health.d/tcp_listen.conf b/health/health.d/tcp_listen.conf
index 00ee055d0..9d1104a51 100644
--- a/health/health.d/tcp_listen.conf
+++ b/health/health.d/tcp_listen.conf
@@ -31,7 +31,8 @@ component: Network
 warn: $this > 1
 crit: $this > (($status == $CRITICAL) ? (1) : (5))
 delay: up 0 down 5m multiplier 1.5 max 1h
- info: average number of overflows in the TCP accept queue over the last minute
+ summary: System TCP accept queue overflows
+ info: Average number of overflows in the TCP accept queue over the last minute
 to: silent
 # THIS IS TOO GENERIC
@@ -49,7 +50,8 @@ component: Network
 warn: $this > 1
 crit: $this > (($status == $CRITICAL) ? (1) : (5))
 delay: up 0 down 5m multiplier 1.5 max 1h
- info: average number of dropped packets in the TCP accept queue over the last minute
+ summary: System TCP accept queue dropped packets
+ info: Average number of dropped packets in the TCP accept queue over the last minute
 to: silent
@@ -74,7 +76,8 @@ component: Network
 warn: $this > 1
 crit: $this > (($status == $CRITICAL) ? (0) : (5))
 delay: up 10 down 5m multiplier 1.5 max 1h
- info: average number of SYN requests was dropped due to the full TCP SYN queue over the last minute \
+ summary: System TCP SYN queue drops
+ info: Average number of SYN requests was dropped due to the full TCP SYN queue over the last minute \
 (SYN cookies were not enabled)
 to: silent
@@ -91,6 +94,7 @@ component: Network
 warn: $this > 1
 crit: $this > (($status == $CRITICAL) ? (0) : (5))
 delay: up 10 down 5m multiplier 1.5 max 1h
- info: average number of sent SYN cookies due to the full TCP SYN queue over the last minute
+ summary: System TCP SYN queue cookies
+ info: Average number of sent SYN cookies due to the full TCP SYN queue over the last minute
 to: silent
diff --git a/health/health.d/tcp_mem.conf b/health/health.d/tcp_mem.conf
index f472d9533..4e422ec1c 100644
--- a/health/health.d/tcp_mem.conf
+++ b/health/health.d/tcp_mem.conf
@@ -19,5 +19,6 @@ component: Network
 warn: ${mem} > (($status >= $WARNING ) ? ( ${tcp_mem_pressure} * 0.8 ) : ( ${tcp_mem_pressure} ))
 crit: ${mem} > (($status == $CRITICAL ) ? ( ${tcp_mem_pressure} ) : ( ${tcp_mem_high} * 0.9 ))
 delay: up 0 down 5m multiplier 1.5 max 1h
+ summary: System TCP memory utilization
 info: TCP memory utilization
 to: silent
diff --git a/health/health.d/tcp_orphans.conf b/health/health.d/tcp_orphans.conf
index 07022af30..8f665d50e 100644
--- a/health/health.d/tcp_orphans.conf
+++ b/health/health.d/tcp_orphans.conf
@@ -20,5 +20,6 @@ component: Network
 warn: $this > (($status >= $WARNING ) ? ( 20 ) : ( 25 ))
 crit: $this > (($status == $CRITICAL) ? ( 25 ) : ( 50 ))
 delay: up 0 down 5m multiplier 1.5 max 1h
- info: orphan IPv4 TCP sockets utilization
+ summary: System TCP orphan sockets utilization
+ info: Orphan IPv4 TCP sockets utilization
 to: silent
diff --git a/health/health.d/tcp_resets.conf b/health/health.d/tcp_resets.conf
index 089ac988d..7c39db2db 100644
--- a/health/health.d/tcp_resets.conf
+++ b/health/health.d/tcp_resets.conf
@@ -4,8 +4,8 @@
 # -----------------------------------------------------------------------------
 # tcp resets this host sends
- alarm: 1m_ipv4_tcp_resets_sent
- on: ipv4.tcphandshake
+ alarm: 1m_ip_tcp_resets_sent
+ on: ip.tcphandshake
 class: Errors
 type: System
 component: Network
@@ -16,8 +16,8 @@ component: Network
 every: 10s
 info: average number of sent TCP RESETS over the last minute
- alarm: 10s_ipv4_tcp_resets_sent
- on: ipv4.tcphandshake
+ alarm: 10s_ip_tcp_resets_sent
+ on: ip.tcphandshake
 class: Errors
 type: System
 component: Network
@@ -26,10 +26,11 @@ component: Network
 lookup: average -10s unaligned absolute of OutRsts
 units: tcp resets/s
 every: 10s
- warn: $netdata.uptime.uptime > (1 * 60) AND $this > ((($1m_ipv4_tcp_resets_sent < 5)?(5):($1m_ipv4_tcp_resets_sent)) * (($status >= $WARNING) ? (1) : (10)))
+ warn: $netdata.uptime.uptime > (1 * 60) AND $this > ((($1m_ip_tcp_resets_sent < 5)?(5):($1m_ip_tcp_resets_sent)) * (($status >= $WARNING) ? (1) : (10)))
 delay: up 20s down 60m multiplier 1.2 max 2h
 options: no-clear-notification
- info: average number of sent TCP RESETS over the last 10 seconds. \
+ summary: System TCP outbound resets
+ info: Average number of sent TCP RESETS over the last 10 seconds. \
 This can indicate a port scan, \
 or that a service running on this host has crashed. \
 Netdata will not send a clear notification for this alarm.
@@ -38,8 +39,8 @@ component: Network # ----------------------------------------------------------------------------- # tcp resets this host receives - alarm: 1m_ipv4_tcp_resets_received - on: ipv4.tcphandshake + alarm: 1m_ip_tcp_resets_received + on: ip.tcphandshake class: Errors type: System component: Network @@ -50,8 +51,8 @@ component: Network every: 10s info: average number of received TCP RESETS over the last minute - alarm: 10s_ipv4_tcp_resets_received - on: ipv4.tcphandshake + alarm: 10s_ip_tcp_resets_received + on: ip.tcphandshake class: Errors type: System component: Network @@ -60,9 +61,10 @@ component: Network lookup: average -10s unaligned absolute of AttemptFails units: tcp resets/s every: 10s - warn: $netdata.uptime.uptime > (1 * 60) AND $this > ((($1m_ipv4_tcp_resets_received < 5)?(5):($1m_ipv4_tcp_resets_received)) * (($status >= $WARNING) ? (1) : (10))) + warn: $netdata.uptime.uptime > (1 * 60) AND $this > ((($1m_ip_tcp_resets_received < 5)?(5):($1m_ip_tcp_resets_received)) * (($status >= $WARNING) ? (1) : (10))) delay: up 20s down 60m multiplier 1.2 max 2h options: no-clear-notification + summary: System TCP inbound resets info: average number of received TCP RESETS over the last 10 seconds. \ This can be an indication that a service this host needs has crashed. \ Netdata will not send a clear notification for this alarm. 
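Note on the tcp_resets.conf hunks above: the 10-second alarms compare the short-term average against the matching 1-minute alarm, which this commit renames from `1m_ipv4_tcp_resets_*` to `1m_ip_tcp_resets_*`. The sketch below models only the `warn` expression of `10s_ip_tcp_resets_sent` as it reads in the diff. It is an illustration under assumptions, not Netdata code: the function name, parameter names, and the numeric status constants are hypothetical, and only the ordering CLEAR < WARNING matters here.

```python
# Illustrative model of the "warn" line of 10s_ip_tcp_resets_sent shown in the
# hunk above. All names and the status constants are hypothetical.

CLEAR, WARNING, CRITICAL = 0, 1, 2  # assumed ordering, not Netdata's real values


def resets_warn(uptime_secs, avg_10s, avg_1m, status):
    """Return True when the warn expression would evaluate to true."""
    # The baseline is the 1-minute average, floored at 5 resets/s.
    baseline = 5 if avg_1m < 5 else avg_1m
    # Hysteresis: 10x the baseline to raise the alarm, 1x to keep it raised.
    multiplier = 1 if status >= WARNING else 10
    # The alarm is suppressed during the first minute of agent uptime.
    return uptime_secs > 1 * 60 and avg_10s > baseline * multiplier


if __name__ == "__main__":
    print(resets_warn(120, 60, 2, CLEAR))    # 60 > 5 * 10
    print(resets_warn(120, 40, 2, CLEAR))    # 40 is not > 50
    print(resets_warn(120, 40, 2, WARNING))  # already raised: 40 > 5 * 1
```

The asymmetric multiplier is why a burst must be ten times the baseline to trigger a warning, while dropping back to the baseline is enough to clear it. Local overrides that still reference `$1m_ipv4_tcp_resets_sent` or the `ipv4.tcphandshake` chart would presumably need the new names after this change.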
diff --git a/health/health.d/timex.conf b/health/health.d/timex.conf index 2e9b1a3cf..65c9628b5 100644 --- a/health/health.d/timex.conf +++ b/health/health.d/timex.conf @@ -13,5 +13,6 @@ component: Clock every: 10s warn: $system.uptime.uptime > 17 * 60 AND $this == 0 delay: down 5m - info: when set to 0, the system kernel believes the system clock is not properly synchronized to a reliable server + summary: System clock sync state + info: When set to 0, the system kernel believes the system clock is not properly synchronized to a reliable server to: silent diff --git a/health/health.d/udp_errors.conf b/health/health.d/udp_errors.conf index 00593c583..dc0948403 100644 --- a/health/health.d/udp_errors.conf +++ b/health/health.d/udp_errors.conf @@ -15,7 +15,8 @@ component: Network units: errors every: 10s warn: $this > (($status >= $WARNING) ? (0) : (10)) - info: average number of UDP receive buffer errors over the last minute + summary: System UDP receive buffer errors + info: Average number of UDP receive buffer errors over the last minute delay: up 1m down 60m multiplier 1.2 max 2h to: silent @@ -33,6 +34,7 @@ component: Network units: errors every: 10s warn: $this > (($status >= $WARNING) ? 
(0) : (10)) - info: average number of UDP send buffer errors over the last minute + summary: System UDP send buffer errors + info: Average number of UDP send buffer errors over the last minute delay: up 1m down 60m multiplier 1.2 max 2h to: silent diff --git a/health/health.d/unbound.conf b/health/health.d/unbound.conf index 4e8d164d2..3c898f1d5 100644 --- a/health/health.d/unbound.conf +++ b/health/health.d/unbound.conf @@ -11,7 +11,8 @@ component: Unbound every: 10s warn: $this > 5 delay: up 10 down 5m multiplier 1.5 max 1h - info: number of overwritten queries in the request-list + summary: Unbound overwritten queries + info: Number of overwritten queries in the request-list to: sysadmin template: unbound_request_list_dropped @@ -24,5 +25,6 @@ component: Unbound every: 10s warn: $this > 0 delay: up 10 down 5m multiplier 1.5 max 1h - info: number of dropped queries in the request-list + summary: Unbound dropped queries + info: Number of dropped queries in the request-list to: sysadmin diff --git a/health/health.d/upsd.conf b/health/health.d/upsd.conf new file mode 100644 index 000000000..703a64881 --- /dev/null +++ b/health/health.d/upsd.conf @@ -0,0 +1,50 @@ +# you can disable an alarm notification by setting the 'to' line to: silent + + template: upsd_10min_ups_load + on: upsd.ups_load + class: Utilization + type: Power Supply +component: UPS + os: * + hosts: * + lookup: average -10m unaligned of load + units: % + every: 1m + warn: $this > (($status >= $WARNING) ? (70) : (80)) + crit: $this > (($status == $CRITICAL) ? 
(85) : (95)) + delay: down 10m multiplier 1.5 max 1h + summary: UPS ${label:ups_name} load + info: UPS ${label:ups_name} average load over the last 10 minutes + to: sitemgr + + template: upsd_ups_battery_charge + on: upsd.ups_battery_charge + class: Errors + type: Power Supply +component: UPS + os: * + hosts: * + lookup: average -60s unaligned of charge + units: % + every: 60s + warn: $this < 75 + crit: $this < 40 + delay: down 10m multiplier 1.5 max 1h + summary: UPS ${label:ups_name} battery charge + info: UPS ${label:ups_name} average battery charge over the last minute + to: sitemgr + + template: upsd_ups_last_collected_secs + on: upsd.ups_load + class: Latency + type: Power Supply +component: UPS device + calc: $now - $last_collected_t + every: 10s + units: seconds ago + warn: $this > (($status >= $WARNING) ? ($update_every) : ( 5 * $update_every)) + crit: $this > (($status == $CRITICAL) ? ($update_every) : (60 * $update_every)) + delay: down 5m multiplier 1.5 max 1h + summary: UPS ${label:ups_name} last collected + info: UPS ${label:ups_name} number of seconds since the last successful data collection + to: sitemgr diff --git a/health/health.d/vcsa.conf b/health/health.d/vcsa.conf index bff34cd39..3e20bfd1e 100644 --- a/health/health.d/vcsa.conf +++ b/health/health.d/vcsa.conf @@ -6,19 +6,32 @@ # - 3: one or more components might be in an unusable status and the appliance might become unresponsive soon. # - 4: no health data is available. 
- template: vcsa_system_health - on: vcsa.system_health + template: vcsa_system_health_warn + on: vcsa.system_health_status class: Errors type: Virtual Machine component: VMware vCenter - lookup: max -10s unaligned of system + calc: $orange units: status every: 10s - warn: ($this == 1) || ($this == 2) - crit: $this == 3 + warn: $this == 1 delay: down 1m multiplier 1.5 max 1h - info: overall system health status \ - (-1: unknown, 0: green, 1: yellow, 2: orange, 3: red, 4: grey) + summary: VCSA system status + info: VCSA overall system status is orange. One or more components are degraded. + to: sysadmin + + template: vcsa_system_health_crit + on: vcsa.system_health_status + class: Errors + type: Virtual Machine +component: VMware vCenter + calc: $red + units: status + every: 10s + crit: $this == 1 + delay: down 1m multiplier 1.5 max 1h + summary: VCSA system status + info: VCSA overall system status is red. One or more components are unavailable or will stop functioning soon. to: sysadmin # Components health: @@ -28,96 +41,173 @@ component: VMware vCenter # - 3: unavailable, or will stop functioning soon. # - 4: no health data is available. - template: vcsa_swap_health - on: vcsa.components_health + template: vcsa_applmgmt_health_warn + on: vcsa.applmgmt_health_status class: Errors type: Virtual Machine component: VMware vCenter - lookup: max -10s unaligned of swap + calc: $orange units: status every: 10s warn: $this == 1 - crit: ($this == 2) || ($this == 3) delay: down 1m multiplier 1.5 max 1h - info: swap health status \ - (-1: unknown, 0: green, 1: yellow, 2: orange, 3: red, 4: grey) - to: sysadmin + summary: VCSA ApplMgmt service status + info: VCSA ApplMgmt component status is orange. It is degraded, and may have serious problems. 
+ to: silent - template: vcsa_storage_health - on: vcsa.components_health + template: vcsa_applmgmt_health_crit + on: vcsa.applmgmt_health_status class: Errors type: Virtual Machine component: VMware vCenter - lookup: max -10s unaligned of storage + calc: $red units: status every: 10s warn: $this == 1 - crit: ($this == 2) || ($this == 3) delay: down 1m multiplier 1.5 max 1h - info: storage health status \ - (-1: unknown, 0: green, 1: yellow, 2: orange, 3: red, 4: grey) + summary: VCSA ApplMgmt service status + info: VCSA ApplMgmt component status is red. It is unavailable, or will stop functioning soon. to: sysadmin + + template: vcsa_load_health_warn + on: vcsa.load_health_status + class: Errors + type: Virtual Machine +component: VMware vCenter + calc: $orange + units: status + every: 10s + warn: $this == 1 + delay: down 1m multiplier 1.5 max 1h + summary: VCSA Load status + info: VCSA Load component status is orange. It is degraded, and may have serious problems. + to: silent - template: vcsa_mem_health - on: vcsa.components_health + template: vcsa_load_health_crit + on: vcsa.load_health_status class: Errors type: Virtual Machine component: VMware vCenter - lookup: max -10s unaligned of mem + calc: $red units: status every: 10s warn: $this == 1 - crit: ($this == 2) || ($this == 3) delay: down 1m multiplier 1.5 max 1h - info: memory health status \ - (-1: unknown, 0: green, 1: yellow, 2: orange, 3: red, 4: grey) + summary: VCSA Load status + info: VCSA Load component status is red. It is unavailable, or will stop functioning soon. 
to: sysadmin - template: vcsa_load_health - on: vcsa.components_health - class: Utilization + template: vcsa_mem_health_warn + on: vcsa.mem_health_status + class: Errors type: Virtual Machine component: VMware vCenter - lookup: max -10s unaligned of load + calc: $orange units: status every: 10s warn: $this == 1 - crit: ($this == 2) || ($this == 3) delay: down 1m multiplier 1.5 max 1h - info: load health status \ - (-1: unknown, 0: green, 1: yellow, 2: orange, 3: red, 4: grey) + summary: VCSA Memory status + info: VCSA Memory component status is orange. It is degraded, and may have serious problems. + to: silent + + template: vcsa_mem_health_crit + on: vcsa.mem_health_status + class: Errors + type: Virtual Machine +component: VMware vCenter + calc: $red + units: status + every: 10s + warn: $this == 1 + delay: down 1m multiplier 1.5 max 1h + summary: VCSA Memory status + info: VCSA Memory component status is red. It is unavailable, or will stop functioning soon. to: sysadmin - template: vcsa_database_storage_health - on: vcsa.components_health + template: vcsa_swap_health_warn + on: vcsa.swap_health_status + class: Errors + type: Virtual Machine +component: VMware vCenter + calc: $orange + units: status + every: 10s + warn: $this == 1 + delay: down 1m multiplier 1.5 max 1h + summary: VCSA Swap status + info: VCSA Swap component status is orange. It is degraded, and may have serious problems. + to: silent + + template: vcsa_swap_health_crit + on: vcsa.swap_health_status class: Errors type: Virtual Machine component: VMware vCenter - lookup: max -10s unaligned of database_storage + calc: $red units: status every: 10s warn: $this == 1 - crit: ($this == 2) || ($this == 3) delay: down 1m multiplier 1.5 max 1h - info: database storage health status \ - (-1: unknown, 0: green, 1: yellow, 2: orange, 3: red, 4: grey) + summary: VCSA Swap status + info: VCSA Swap component status is red. It is unavailable, or will stop functioning soon. 
to: sysadmin - template: vcsa_applmgmt_health - on: vcsa.components_health + template: vcsa_database_storage_health_warn + on: vcsa.database_storage_health_status class: Errors type: Virtual Machine component: VMware vCenter - lookup: max -10s unaligned of applmgmt + calc: $orange units: status every: 10s warn: $this == 1 - crit: ($this == 2) || ($this == 3) delay: down 1m multiplier 1.5 max 1h - info: applmgmt health status \ - (-1: unknown, 0: green, 1: yellow, 2: orange, 3: red, 4: grey) + summary: VCSA Database status + info: VCSA Database Storage component status is orange. It is degraded, and may have serious problems. + to: silent + + template: vcsa_database_storage_health_crit + on: vcsa.database_storage_health_status + class: Errors + type: Virtual Machine +component: VMware vCenter + calc: $red + units: status + every: 10s + warn: $this == 1 + delay: down 1m multiplier 1.5 max 1h + summary: VCSA Database status + info: VCSA Database Storage component status is red. It is unavailable, or will stop functioning soon. to: sysadmin + template: vcsa_storage_health_warn + on: vcsa.storage_health_status + class: Errors + type: Virtual Machine +component: VMware vCenter + calc: $orange + units: status + every: 10s + warn: $this == 1 + delay: down 1m multiplier 1.5 max 1h + summary: VCSA Storage status + info: VCSA Storage component status is orange. It is degraded, and may have serious problems. + to: silent + + template: vcsa_storage_health_crit + on: vcsa.storage_health_status + class: Errors + type: Virtual Machine +component: VMware vCenter + calc: $red + units: status + every: 10s + warn: $this == 1 + delay: down 1m multiplier 1.5 max 1h + summary: VCSA Storage status + info: VCSA Storage component status is red. It is unavailable, or will stop functioning soon. + to: sysadmin # Software updates health: # - 0: no updates available. @@ -125,16 +215,16 @@ component: VMware vCenter # - 3: security updates are available. 
# - 4: an error retrieving information on software updates. - template: vcsa_software_updates_health - on: vcsa.software_updates_health + template: vcsa_software_packages_health_warn + on: vcsa.software_packages_health_status class: Errors type: Virtual Machine component: VMware vCenter - lookup: max -10s unaligned of software_packages + calc: $orange units: status every: 10s - warn: ($this == 3) || ($this == 4) + warn: $this == 1 delay: down 1m multiplier 1.5 max 1h - info: software updates availability status \ - (-1: unknown, 0: green, 2: orange, 3: red, 4: grey) - to: sysadmin + summary: VCSA software status + info: VCSA software packages security updates are available. + to: silent diff --git a/health/health.d/vernemq.conf b/health/health.d/vernemq.conf index cfbe2a524..6ea9f99dc 100644 --- a/health/health.d/vernemq.conf +++ b/health/health.d/vernemq.conf @@ -11,7 +11,8 @@ component: VerneMQ every: 1m warn: $this > (($status >= $WARNING) ? (0) : (5)) delay: up 2m down 5m multiplier 1.5 max 2h - info: number of socket errors in the last minute + summary: VerneMQ socket errors + info: Number of socket errors in the last minute to: sysadmin # Queues dropped/expired/unhandled PUBLISH messages @@ -26,7 +27,8 @@ component: VerneMQ every: 1m warn: $this > (($status >= $WARNING) ? (0) : (5)) delay: up 2m down 5m multiplier 1.5 max 2h - info: number of dropped messaged due to full queues in the last minute + summary: VerneMQ dropped messages + info: Number of dropped messages due to full queues in the last minute to: sysadmin template: vernemq_queue_message_expired @@ -39,6 +41,7 @@ component: VerneMQ every: 1m warn: $this > (($status >= $WARNING) ? (0) : (5)) delay: up 2m down 5m multiplier 1.5 max 2h + summary: VerneMQ expired messages info: number of messages which expired before delivery in the last minute to: sysadmin @@ -52,7 +55,8 @@ component: VerneMQ every: 1m warn: $this > (($status >= $WARNING) ? 
(0) : (5)) delay: up 2m down 5m multiplier 1.5 max 2h - info: number of unhandled messages (connections with clean session=true) in the last minute + summary: VerneMQ unhandled messages + info: Number of unhandled messages (connections with clean session=true) in the last minute to: sysadmin # Erlang VM @@ -68,7 +72,8 @@ component: VerneMQ warn: $this > (($status >= $WARNING) ? (75) : (85)) crit: $this > (($status == $CRITICAL) ? (85) : (95)) delay: down 15m multiplier 1.5 max 1h - info: average scheduler utilization over the last 10 minutes + summary: VerneMQ scheduler utilization + info: Average scheduler utilization over the last 10 minutes to: sysadmin # Cluster communication and netsplits @@ -83,7 +88,8 @@ component: VerneMQ every: 1m warn: $this > 0 delay: up 5m down 5m multiplier 1.5 max 1h - info: amount of traffic dropped during communication with the cluster nodes in the last minute + summary: VerneMQ dropped traffic + info: Amount of traffic dropped during communication with the cluster nodes in the last minute to: sysadmin template: vernemq_netsplits @@ -96,7 +102,8 @@ component: VerneMQ every: 10s warn: $this > 0 delay: down 5m multiplier 1.5 max 2h - info: number of detected netsplits (split brain situation) in the last minute + summary: VerneMQ netsplits + info: Number of detected netsplits (split brain situation) in the last minute to: sysadmin # Unsuccessful CONNACK @@ -111,7 +118,8 @@ component: VerneMQ every: 1m warn: $this > (($status >= $WARNING) ? (0) : (5)) delay: up 2m down 5m multiplier 1.5 max 2h - info: number of sent unsuccessful v3/v5 CONNACK packets in the last minute + summary: VerneMQ unsuccessful CONNACK + info: Number of sent unsuccessful v3/v5 CONNACK packets in the last minute to: sysadmin # Not normal DISCONNECT @@ -126,7 +134,8 @@ component: VerneMQ every: 1m warn: $this > (($status >= $WARNING) ? 
(0) : (5)) delay: up 2m down 5m multiplier 1.5 max 2h - info: number of received not normal v5 DISCONNECT packets in the last minute + summary: VerneMQ received not normal DISCONNECT + info: Number of received not normal v5 DISCONNECT packets in the last minute to: sysadmin template: vernemq_mqtt_disconnect_sent_reason_not_normal @@ -139,7 +148,8 @@ component: VerneMQ every: 1m warn: $this > (($status >= $WARNING) ? (0) : (5)) delay: up 2m down 5m multiplier 1.5 max 2h - info: number of sent not normal v5 DISCONNECT packets in the last minute + summary: VerneMQ sent not normal DISCONNECT + info: Number of sent not normal v5 DISCONNECT packets in the last minute to: sysadmin # SUBSCRIBE errors and unauthorized attempts @@ -154,7 +164,8 @@ component: VerneMQ every: 1m warn: $this > (($status >= $WARNING) ? (0) : (5)) delay: up 2m down 5m multiplier 1.5 max 2h - info: number of failed v3/v5 SUBSCRIBE operations in the last minute + summary: VerneMQ failed SUBSCRIBE + info: Number of failed v3/v5 SUBSCRIBE operations in the last minute to: sysadmin template: vernemq_mqtt_subscribe_auth_error @@ -167,6 +178,7 @@ component: VerneMQ every: 1m warn: $this > (($status >= $WARNING) ? (0) : (5)) delay: up 2m down 5m multiplier 1.5 max 2h + summary: VerneMQ unauthorized SUBSCRIBE info: number of unauthorized v3/v5 SUBSCRIBE attempts in the last minute to: sysadmin @@ -182,7 +194,8 @@ component: VerneMQ every: 1m warn: $this > (($status >= $WARNING) ? (0) : (5)) delay: up 2m down 5m multiplier 1.5 max 2h - info: number of failed v3/v5 UNSUBSCRIBE operations in the last minute + summary: VerneMQ failed UNSUBSCRIBE + info: Number of failed v3/v5 UNSUBSCRIBE operations in the last minute to: sysadmin # PUBLISH errors and unauthorized attempts @@ -197,7 +210,8 @@ component: VerneMQ every: 1m warn: $this > (($status >= $WARNING) ? 
(0) : (5)) delay: up 2m down 5m multiplier 1.5 max 2h - info: number of failed v3/v5 PUBLISH operations in the last minute + summary: VerneMQ failed PUBLISH + info: Number of failed v3/v5 PUBLISH operations in the last minute to: sysadmin template: vernemq_mqtt_publish_auth_errors @@ -210,7 +224,8 @@ component: VerneMQ every: 1m warn: $this > (($status >= $WARNING) ? (0) : (5)) delay: up 2m down 5m multiplier 1.5 max 2h - info: number of unauthorized v3/v5 PUBLISH attempts in the last minute + summary: VerneMQ unauthorized PUBLISH + info: Number of unauthorized v3/v5 PUBLISH attempts in the last minute to: sysadmin # Unsuccessful and unexpected PUBACK @@ -225,7 +240,8 @@ component: VerneMQ every: 1m warn: $this > (($status >= $WARNING) ? (0) : (5)) delay: up 2m down 5m multiplier 1.5 max 2h - info: number of received unsuccessful v5 PUBACK packets in the last minute + summary: VerneMQ unsuccessful received PUBACK + info: Number of received unsuccessful v5 PUBACK packets in the last minute to: sysadmin template: vernemq_mqtt_puback_sent_reason_unsuccessful @@ -238,7 +254,8 @@ component: VerneMQ every: 1m warn: $this > (($status >= $WARNING) ? (0) : (5)) delay: up 2m down 5m multiplier 1.5 max 2h - info: number of sent unsuccessful v5 PUBACK packets in the last minute + summary: VerneMQ unsuccessful sent PUBACK + info: Number of sent unsuccessful v5 PUBACK packets in the last minute to: sysadmin template: vernemq_mqtt_puback_unexpected @@ -251,7 +268,8 @@ component: VerneMQ every: 1m warn: $this > (($status >= $WARNING) ? (0) : (5)) delay: up 2m down 5m multiplier 1.5 max 2h - info: number of received unexpected v3/v5 PUBACK packets in the last minute + summary: VerneMQ unnexpected recieved PUBACK + info: Number of received unexpected v3/v5 PUBACK packets in the last minute to: sysadmin # Unsuccessful and unexpected PUBREC @@ -266,7 +284,8 @@ component: VerneMQ every: 1m warn: $this > (($status >= $WARNING) ? 
(0) : (5)) delay: up 2m down 5m multiplier 1.5 max 2h - info: number of received unsuccessful v5 PUBREC packets in the last minute + summary: VerneMQ unsuccessful received PUBREC + info: Number of received unsuccessful v5 PUBREC packets in the last minute to: sysadmin template: vernemq_mqtt_pubrec_sent_reason_unsuccessful @@ -279,7 +298,8 @@ component: VerneMQ every: 1m warn: $this > (($status >= $WARNING) ? (0) : (5)) delay: up 2m down 5m multiplier 1.5 max 2h - info: number of sent unsuccessful v5 PUBREC packets in the last minute + summary: VerneMQ unsuccessful sent PUBREC + info: Number of sent unsuccessful v5 PUBREC packets in the last minute to: sysadmin template: vernemq_mqtt_pubrec_invalid_error @@ -292,7 +312,8 @@ component: VerneMQ every: 1m warn: $this > (($status >= $WARNING) ? (0) : (5)) delay: up 2m down 5m multiplier 1.5 max 2h - info: number of received unexpected v3 PUBREC packets in the last minute + summary: VerneMQ invalid received PUBREC + info: Number of received invalid v3 PUBREC packets in the last minute to: sysadmin # Unsuccessful PUBREL @@ -307,7 +328,8 @@ component: VerneMQ every: 1m warn: $this > (($status >= $WARNING) ? (0) : (5)) delay: up 2m down 5m multiplier 1.5 max 2h - info: number of received unsuccessful v5 PUBREL packets in the last minute + summary: VerneMQ unsuccessful received PUBREL + info: Number of received unsuccessful v5 PUBREL packets in the last minute to: sysadmin template: vernemq_mqtt_pubrel_sent_reason_unsuccessful @@ -320,6 +342,7 @@ component: VerneMQ every: 1m warn: $this > (($status >= $WARNING) ? (0) : (5)) delay: up 2m down 5m multiplier 1.5 max 2h + summary: VerneMQ unsuccessful sent PUBREL info: number of sent unsuccessful v5 PUBREL packets in the last minute to: sysadmin @@ -335,7 +358,8 @@ component: VerneMQ every: 1m warn: $this > (($status >= $WARNING) ? 
(0) : (5)) delay: up 2m down 5m multiplier 1.5 max 2h - info: number of received unsuccessful v5 PUBCOMP packets in the last minute + summary: VerneMQ unsuccessful received PUBCOMP + info: Number of received unsuccessful v5 PUBCOMP packets in the last minute to: sysadmin template: vernemq_mqtt_pubcomp_sent_reason_unsuccessful @@ -348,6 +372,7 @@ component: VerneMQ every: 1m warn: $this > (($status >= $WARNING) ? (0) : (5)) delay: up 2m down 5m multiplier 1.5 max 2h + summary: VerneMQ unsuccessful sent PUBCOMP info: number of sent unsuccessful v5 PUBCOMP packets in the last minute to: sysadmin @@ -361,5 +386,6 @@ component: VerneMQ every: 1m warn: $this > (($status >= $WARNING) ? (0) : (5)) delay: up 2m down 5m multiplier 1.5 max 2h + summary: VerneMQ unexpected received PUBCOMP info: number of received unexpected v3/v5 PUBCOMP packets in the last minute to: sysadmin diff --git a/health/health.d/vsphere.conf b/health/health.d/vsphere.conf index 1d8be6cb5..b8ad9aee4 100644 --- a/health/health.d/vsphere.conf +++ b/health/health.d/vsphere.conf @@ -1,28 +1,26 @@ # you can disable an alarm notification by setting the 'to' line to: silent -# -----------------------------------------------VM Specific------------------------------------------------------------ -# Memory +# -----------------------------------------------Virtual Machine-------------------------------------------------------- - template: vsphere_vm_mem_usage - on: vsphere.vm_mem_usage_percentage + template: vsphere_vm_cpu_utilization + on: vsphere.vm_cpu_utilization class: Utilization type: Virtual Machine -component: Memory +component: CPU hosts: * - calc: $used + lookup: average -10m unaligned match-names of used units: % every: 20s - warn: $this > (($status >= $WARNING) ? (80) : (90)) - crit: $this > (($status == $CRITICAL) ? (90) : (98)) + warn: $this > (($status >= $WARNING) ? (75) : (85)) + crit: $this > (($status == $CRITICAL) ? 
(85) : (95)) delay: down 15m multiplier 1.5 max 1h - info: virtual machine memory utilization - -# -----------------------------------------------HOST Specific---------------------------------------------------------- -# Memory + summary: vSphere CPU utilization for VM ${label:vm} + info: CPU utilization VM ${label:vm} host ${label:host} cluster ${label:cluster} datacenter ${label:datacenter} + to: silent - template: vsphere_host_mem_usage - on: vsphere.host_mem_usage_percentage + template: vsphere_vm_mem_utilization + on: vsphere.vm_mem_utilization class: Utilization type: Virtual Machine component: Memory @@ -33,69 +31,14 @@ component: Memory warn: $this > (($status >= $WARNING) ? (80) : (90)) crit: $this > (($status == $CRITICAL) ? (90) : (98)) delay: down 15m multiplier 1.5 max 1h - info: host memory utilization - -# Network errors - - template: vsphere_inbound_packets_errors - on: vsphere.net_errors_total - class: Errors - type: Virtual Machine -component: Network - hosts: * - lookup: sum -10m unaligned absolute match-names of rx - units: packets - every: 1m - info: number of inbound errors for the network interface in the last 10 minutes - - template: vsphere_outbound_packets_errors - on: vsphere.net_errors_total - class: Errors - type: Virtual Machine -component: Network - hosts: * - lookup: sum -10m unaligned absolute match-names of tx - units: packets - every: 1m - info: number of outbound errors for the network interface in the last 10 minutes - -# Network errors ratio + summary: vSphere memory utilization for VM ${label:vm} + info: Memory utilization VM ${label:vm} host ${label:host} cluster ${label:cluster} datacenter ${label:datacenter} + to: silent - template: vsphere_inbound_packets_errors_ratio - on: vsphere.net_packets_total - class: Errors - type: Virtual Machine -component: Network - hosts: * - lookup: sum -10m unaligned absolute match-names of rx - calc: (($vsphere_inbound_packets_errors != nan AND $this > 1000) ? 
($vsphere_inbound_packets_errors * 100 / $this) : (0)) - units: % - every: 1m - warn: $this >= 2 - delay: up 1m down 1h multiplier 1.5 max 2h - info: ratio of inbound errors for the network interface over the last 10 minutes - to: sysadmin +# -----------------------------------------------ESXI host-------------------------------------------------------------- - template: vsphere_outbound_packets_errors_ratio - on: vsphere.net_packets_total - class: Errors - type: Virtual Machine -component: Network - hosts: * - lookup: sum -10m unaligned absolute match-names of tx - calc: (($vsphere_outbound_packets_errors != nan AND $this > 1000) ? ($vsphere_outbound_packets_errors * 100 / $this) : (0)) - units: % - every: 1m - warn: $this >= 2 - delay: up 1m down 1h multiplier 1.5 max 2h - info: ratio of outbound errors for the network interface over the last 10 minutes - to: sysadmin - -# -----------------------------------------------Common------------------------------------------------------------------- -# CPU - - template: vsphere_cpu_usage - on: vsphere.cpu_usage_total + template: vsphere_host_cpu_utilization + on: vsphere.host_cpu_utilization class: Utilization type: Virtual Machine component: CPU @@ -106,61 +49,22 @@ component: CPU warn: $this > (($status >= $WARNING) ? (75) : (85)) crit: $this > (($status == $CRITICAL) ? 
(85) : (95)) delay: down 15m multiplier 1.5 max 1h - info: average CPU utilization + summary: vSphere ESXi CPU utilization for host ${label:host} + info: CPU utilization ESXi host ${label:host} cluster ${label:cluster} datacenter ${label:datacenter} to: sysadmin -# Network drops - - template: vsphere_inbound_packets_dropped - on: vsphere.net_drops_total - class: Errors - type: Virtual Machine -component: Network - hosts: * - lookup: sum -10m unaligned absolute match-names of rx - units: packets - every: 1m - info: number of inbound dropped packets for the network interface in the last 10 minutes - - template: vsphere_outbound_packets_dropped - on: vsphere.net_drops_total - class: Errors - type: Virtual Machine -component: Network - hosts: * - lookup: sum -10m unaligned absolute match-names of tx - units: packets - every: 1m - info: number of outbound dropped packets for the network interface in the last 10 minutes - -# Network drops ratio - - template: vsphere_inbound_packets_dropped_ratio - on: vsphere.net_packets_total - class: Errors - type: Virtual Machine -component: Network - hosts: * - lookup: sum -10m unaligned absolute match-names of rx - calc: (($vsphere_inbound_packets_dropped != nan AND $this > 1000) ? ($vsphere_inbound_packets_dropped * 100 / $this) : (0)) - units: % - every: 1m - warn: $this >= 2 - delay: up 1m down 1h multiplier 1.5 max 2h - info: ratio of inbound dropped packets for the network interface over the last 10 minutes - to: sysadmin - - template: vsphere_outbound_packets_dropped_ratio - on: vsphere.net_packets_total - class: Errors + template: vsphere_host_mem_utilization + on: vsphere.host_mem_utilization + class: Utilization type: Virtual Machine -component: Network +component: Memory hosts: * - lookup: sum -10m unaligned absolute match-names of tx - calc: (($vsphere_outbound_packets_dropped != nan AND $this > 1000) ? 
($vsphere_outbound_packets_dropped * 100 / $this) : (0)) + calc: $used units: % - every: 1m - warn: $this >= 2 - delay: up 1m down 1h multiplier 1.5 max 2h - info: ratio of outbound dropped packets for the network interface over the last 10 minutes + every: 20s + warn: $this > (($status >= $WARNING) ? (80) : (90)) + crit: $this > (($status == $CRITICAL) ? (90) : (98)) + delay: down 15m multiplier 1.5 max 1h + summary: vSphere ESXi Ram utilization for host ${label:host} + info: Memory utilization ESXi host ${label:host} cluster ${label:cluster} datacenter ${label:datacenter} to: sysadmin diff --git a/health/health.d/web_log.conf b/health/health.d/web_log.conf index 3fd01831b..78f1cc7f5 100644 --- a/health/health.d/web_log.conf +++ b/health/health.d/web_log.conf @@ -30,7 +30,8 @@ component: Web log every: 10s warn: ($web_log_1m_total_requests > 120) ? ($this > 1) : ( 0 ) delay: up 1m down 5m multiplier 1.5 max 1h - info: percentage of unparsed log lines over the last minute + summary: Web log unparsed + info: Percentage of unparsed log lines over the last minute to: webmaster # ----------------------------------------------------------------------------- @@ -66,7 +67,8 @@ component: Web log warn: ($web_log_1m_requests > 120) ? ($this < (($status >= $WARNING ) ? ( 95 ) : ( 85 )) ) : ( 0 ) crit: ($web_log_1m_requests > 120) ? ($this < (($status == $CRITICAL) ? ( 85 ) : ( 75 )) ) : ( 0 ) delay: up 2m down 15m multiplier 1.5 max 1h - info: ratio of successful HTTP requests over the last minute (1xx, 2xx, 304, 401) + summary: Web log successful + info: Ratio of successful HTTP requests over the last minute (1xx, 2xx, 304, 401) to: webmaster template: web_log_1m_redirects @@ -80,7 +82,8 @@ component: Web log every: 10s warn: ($web_log_1m_requests > 120) ? ($this > (($status >= $WARNING ) ? 
( 1 ) : ( 20 )) ) : ( 0 ) delay: up 2m down 15m multiplier 1.5 max 1h - info: ratio of redirection HTTP requests over the last minute (3xx except 304) + summary: Web log redirects + info: Ratio of redirection HTTP requests over the last minute (3xx except 304) to: webmaster template: web_log_1m_bad_requests @@ -94,7 +97,8 @@ component: Web log every: 10s warn: ($web_log_1m_requests > 120) ? ($this > (($status >= $WARNING) ? ( 10 ) : ( 30 )) ) : ( 0 ) delay: up 2m down 15m multiplier 1.5 max 1h - info: ratio of client error HTTP requests over the last minute (4xx except 401) + summary: Web log bad requests + info: Ratio of client error HTTP requests over the last minute (4xx except 401) to: webmaster template: web_log_1m_internal_errors @@ -109,7 +113,8 @@ component: Web log warn: ($web_log_1m_requests > 120) ? ($this > (($status >= $WARNING) ? ( 1 ) : ( 2 )) ) : ( 0 ) crit: ($web_log_1m_requests > 120) ? ($this > (($status == $CRITICAL) ? ( 2 ) : ( 5 )) ) : ( 0 ) delay: up 2m down 15m multiplier 1.5 max 1h - info: ratio of server error HTTP requests over the last minute (5xx) + summary: Web log server errors + info: Ratio of server error HTTP requests over the last minute (5xx) to: webmaster # ----------------------------------------------------------------------------- @@ -145,7 +150,8 @@ component: Web log warn: ($web_log_1m_requests > 120) ? ($this > $green && $this > ($web_log_10m_response_time * 2) ) : ( 0 ) crit: ($web_log_1m_requests > 120) ? ($this > $red && $this > ($web_log_10m_response_time * 4) ) : ( 0 ) delay: down 15m multiplier 1.5 max 1h - info: average HTTP response time over the last 1 minute + summary: Web log processing time + info: Average HTTP response time over the last 1 minute options: no-clear-notification to: webmaster @@ -192,7 +198,8 @@ component: Web log crit: ($web_log_5m_successful_old > 120) ? 
($this > 400 OR $this < 25) : (0)
  delay: down 15m multiplier 1.5 max 1h
  options: no-clear-notification
- info: ratio of successful HTTP requests over over the last 5 minutes, \
+ summary: Web log 5 minutes requests ratio
+ info: Ratio of successful HTTP requests over over the last 5 minutes, \
  compared with the previous 5 minutes \
  (clear notification for this alarm will not be sent)
  to: webmaster
diff --git a/health/health.d/whoisquery.conf b/health/health.d/whoisquery.conf
index be5eb58f9..0a328b592 100644
--- a/health/health.d/whoisquery.conf
+++ b/health/health.d/whoisquery.conf
@@ -9,5 +9,6 @@ component: WHOIS
  every: 60s
  warn: $this < $days_until_expiration_warning*24*60*60
  crit: $this < $days_until_expiration_critical*24*60*60
- info: time until the domain name registration expires
+ summary: Whois expiration time for domain ${label:domain}
+ info: Time until the domain name registration for ${label:domain} expires
  to: webmaster
diff --git a/health/health.d/windows.conf b/health/health.d/windows.conf
index 9ef4c202f..706fcbf22 100644
--- a/health/health.d/windows.conf
+++ b/health/health.d/windows.conf
@@ -14,7 +14,8 @@ component: CPU
  warn: $this > (($status >= $WARNING) ? (75) : (85))
  crit: $this > (($status == $CRITICAL) ? (85) : (95))
  delay: down 15m multiplier 1.5 max 1h
- info: average CPU utilization over the last 10 minutes
+ summary: CPU utilization
+ info: Average CPU utilization over the last 10 minutes
  to: silent
@@ -33,7 +34,8 @@ component: Memory
  warn: $this > (($status >= $WARNING) ? (80) : (90))
  crit: $this > (($status == $CRITICAL) ?
(90) : (98))
  delay: down 15m multiplier 1.5 max 1h
- info: memory utilization
+ summary: Ram utilization
+ info: Memory utilization
  to: sysadmin
@@ -51,7 +53,8 @@ component: Network
  every: 1m
  warn: $this >= 5
  delay: down 1h multiplier 1.5 max 2h
- info: number of inbound discarded packets for the network interface in the last 10 minutes
+ summary: Inbound network packets discarded
+ info: Number of inbound discarded packets for the network interface in the last 10 minutes
  to: silent
  template: windows_outbound_packets_discarded
@@ -66,7 +69,8 @@ component: Network
  every: 1m
  warn: $this >= 5
  delay: down 1h multiplier 1.5 max 2h
- info: number of outbound discarded packets for the network interface in the last 10 minutes
+ summary: Outbound network packets discarded
+ info: Number of outbound discarded packets for the network interface in the last 10 minutes
  to: silent
  template: windows_inbound_packets_errors
@@ -81,7 +85,8 @@ component: Network
  every: 1m
  warn: $this >= 5
  delay: down 1h multiplier 1.5 max 2h
- info: number of inbound errors for the network interface in the last 10 minutes
+ summary: Inbound network errors
+ info: Number of inbound errors for the network interface in the last 10 minutes
  to: silent
  template: windows_outbound_packets_errors
@@ -96,7 +101,8 @@ component: Network
  every: 1m
  warn: $this >= 5
  delay: down 1h multiplier 1.5 max 2h
- info: number of outbound errors for the network interface in the last 10 minutes
+ summary: Outbound network errors
+ info: Number of outbound errors for the network interface in the last 10 minutes
  to: silent
@@ -115,5 +121,6 @@ component: Disk
  warn: $this > (($status >= $WARNING) ? (80) : (90))
  crit: $this > (($status == $CRITICAL) ?
(90) : (98))
  delay: down 15m multiplier 1.5 max 1h
- info: disk space utilization
+ summary: Disk space usage
+ info: Disk space utilization
  to: sysadmin
diff --git a/health/health.d/x509check.conf b/health/health.d/x509check.conf
index fc69d0288..d05f3ef0f 100644
--- a/health/health.d/x509check.conf
+++ b/health/health.d/x509check.conf
@@ -9,7 +9,8 @@ component: x509 certificates
  every: 60s
  warn: $this < $days_until_expiration_warning*24*60*60
  crit: $this < $days_until_expiration_critical*24*60*60
- info: time until x509 certificate expires
+ summary: x509 certificate expiration for ${label:source}
+ info: Time until x509 certificate expires for ${label:source}
  to: webmaster
  template: x509check_revocation_status
@@ -20,5 +21,6 @@ component: x509 certificates
  calc: $revoked
  every: 60s
  crit: $this != nan AND $this != 0
- info: x509 certificate revocation status (0: revoked, 1: valid)
+ summary: x509 certificate revocation status for ${label:source}
+ info: x509 certificate revocation status (0: revoked, 1: valid) for ${label:source}
  to: webmaster
diff --git a/health/health.d/zfs.conf b/health/health.d/zfs.conf
index 40ec4ce8a..d2a561000 100644
--- a/health/health.d/zfs.conf
+++ b/health/health.d/zfs.conf
@@ -9,6 +9,7 @@ component: File system
  every: 1m
  warn: $this > 0
  delay: down 1h multiplier 1.5 max 2h
+ summary: ZFS ARC growth throttling
  info: number of times ZFS had to limit the ARC growth in the last 10 minutes
  to: silent
@@ -24,6 +25,7 @@ component: File system
  every: 10s
  warn: $this > 0
  delay: down 1m multiplier 1.5 max 1h
+ summary: ZFS pool ${label:pool} state
  info: ZFS pool ${label:pool} state is degraded
  to: sysadmin
@@ -37,5 +39,6 @@ component: File system
  every: 10s
  crit: $this > 0
  delay: down 1m multiplier 1.5 max 1h
+ summary: Critical ZFS pool ${label:pool} state
  info: ZFS pool ${label:pool} state is faulted or unavail
  to: sysadmin