Diffstat (limited to 'monitoring/ceph-mixin/tests_alerts/README.md')
-rw-r--r-- monitoring/ceph-mixin/tests_alerts/README.md | 92
1 files changed, 92 insertions, 0 deletions
diff --git a/monitoring/ceph-mixin/tests_alerts/README.md b/monitoring/ceph-mixin/tests_alerts/README.md
new file mode 100644
index 000000000..cf95fa636
--- /dev/null
+++ b/monitoring/ceph-mixin/tests_alerts/README.md
@@ -0,0 +1,92 @@
+
+## Alert Rule Standards
+
+The alert rules should adhere to the following principles:
+- each alert must have a unique name
+- each alert should follow a common structure
+  - labels : must contain severity and type
+  - annotations : must provide a description
+  - expr : must define the PromQL expression
+  - alert : defines the alert name
+- alerts that have a corresponding section within docs.ceph.com must include a
+ documentation field in the annotations section
+- critical alerts should declare an oid in the labels section
+- critical alerts should have a corresponding entry in the Ceph MIB
+
+
+## Testing Prometheus Rules
+Once you have updated the `ceph_default_alerts.yml` file, run the
+`validate_rules.py` script, either directly or via `tox`, to confirm that your
+changes conform to the rule structure guidelines above. The script processes
+the rules, looks for configuration anomalies, and prints a report if it detects
+any problems.
+
+Here's an example run that illustrates the report format and the kinds of issues detected.
+
+```
+[paul@myhost tests]$ ./validate_rules.py
+
+Checking rule groups
+ cluster health : ..
+ mon : E.W..
+ osd : E...W......W.E..
+ mds : WW
+ mgr : WW
+ pgs : ..WWWW..
+ nodes : .EEEE
+ pools : EEEW.
+ healthchecks : .
+ cephadm : WW.
+ prometheus : W
+ rados : W
+
+Summary
+
+Rule file : ../alerts/ceph_default_alerts.yml
+Unit Test file : test_alerts.yml
+
+Rule groups processed : 12
+Rules processed : 51
+Rule errors : 10
+Rule warnings : 16
+Rule name duplicates : 0
+Unit tests missing : 4
+
+Problem Report
+
+  Group       Severity  Alert Name                                          Problem Description
+  -----       --------  ----------                                          -------------------
+  cephadm     Warning   Cluster upgrade has failed                          critical level alert is missing an SNMP oid entry
+  cephadm     Warning   A daemon managed by cephadm is down                 critical level alert is missing an SNMP oid entry
+  mds         Warning   Ceph Filesystem damage detected                     critical level alert is missing an SNMP oid entry
+  mds         Warning   Ceph Filesystem switched to READ ONLY               critical level alert is missing an SNMP oid entry
+  mgr         Warning   mgr module failure                                  critical level alert is missing an SNMP oid entry
+  mgr         Warning   mgr prometheus module is not active                 critical level alert is missing an SNMP oid entry
+  mon         Error     Monitor down, quorum is at risk                     documentation link error: #mon-downwah not found on the page
+  mon         Warning   Ceph mon disk space critically low                  critical level alert is missing an SNMP oid entry
+  nodes       Error     network packets dropped                             invalid alert structure. Missing field: for
+  nodes       Error     network packet errors                               invalid alert structure. Missing field: for
+  nodes       Error     storage filling up                                  invalid alert structure. Missing field: for
+  nodes       Error     MTU Mismatch                                        invalid alert structure. Missing field: for
+  osd         Error     10% OSDs down                                       invalid alert structure. Missing field: for
+  osd         Error     Flapping OSD                                        invalid alert structure. Missing field: for
+  osd         Warning   OSD Full                                            critical level alert is missing an SNMP oid entry
+  osd         Warning   Too many devices predicted to fail                  critical level alert is missing an SNMP oid entry
+  pgs         Warning   Placement Group (PG) damaged                        critical level alert is missing an SNMP oid entry
+  pgs         Warning   Recovery at risk, cluster too full                  critical level alert is missing an SNMP oid entry
+  pgs         Warning   I/O blocked to some data                            critical level alert is missing an SNMP oid entry
+  pgs         Warning   Cluster too full, automatic data recovery impaired  critical level alert is missing an SNMP oid entry
+  pools       Error     pool full                                           invalid alert structure. Missing field: for
+  pools       Error     pool filling up (growth forecast)                   invalid alert structure. Missing field: for
+  pools       Error     Ceph pool is too full for recovery/rebalance        invalid alert structure. Missing field: for
+  pools       Warning   Ceph pool is full - writes blocked                  critical level alert is missing an SNMP oid entry
+  prometheus  Warning   Scrape job is missing                               critical level alert is missing an SNMP oid entry
+  rados       Warning   Data not found/missing                              critical level alert is missing an SNMP oid entry
+
+Unit tests are incomplete. Tests missing for the following alerts;
+ - Placement Group (PG) damaged
+ - OSD Full
+ - storage filling up
+ - pool filling up (growth forecast)
+
+```
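
Note on the README added above: the "Alert Rule Standards" list can be read as a small set of per-rule checks. The following is a minimal Python sketch of that idea, not the actual `validate_rules.py` implementation. The function name `check_rule`, the constants, the placeholder expression `example_metric > 0`, and the `type` value are all assumptions made for illustration. The warning text is copied from the sample report. The documentation-link and Ceph MIB checks are not covered, and the `for` field that the sample report flags is left out because the standards list does not mention it.

```python
# Hypothetical sketch of the "Alert Rule Standards" checks.
# This is NOT the real validate_rules.py; names and messages are illustrative.
REQUIRED_FIELDS = ("alert", "expr", "labels", "annotations")
REQUIRED_LABELS = ("severity", "type")


def check_rule(rule, seen_names):
    """Return a list of (severity, message) problems for one alert rule dict."""
    problems = []
    name = rule.get("alert", "<unnamed>")

    # Each alert must have a unique name.
    if name in seen_names:
        problems.append(("Error", f"duplicate alert name: {name}"))
    seen_names.add(name)

    # Each alert should follow the common structure.
    for field in REQUIRED_FIELDS:
        if field not in rule:
            problems.append(
                ("Error", f"invalid alert structure. Missing field: {field}")
            )

    # Labels must contain severity and type.
    labels = rule.get("labels", {})
    for label in REQUIRED_LABELS:
        if label not in labels:
            problems.append(("Error", f"missing label: {label}"))

    # Annotations must provide a description.
    if "description" not in rule.get("annotations", {}):
        problems.append(("Error", "annotations must provide a description"))

    # Critical alerts should declare an oid in the labels section.
    if labels.get("severity") == "critical" and "oid" not in labels:
        problems.append(
            ("Warning", "critical level alert is missing an SNMP oid entry")
        )

    return problems


# Example rule, written as the dict a YAML loader would produce.
rule = {
    "alert": "OSD Full",
    "expr": "example_metric > 0",  # placeholder, not a real Ceph expression
    "labels": {"severity": "critical", "type": "ceph_default"},  # assumed type
    "annotations": {"description": "An OSD has reached its full threshold."},
}

seen = set()
problems = check_rule(rule, seen)
for severity, message in problems:
    print(f"{severity}: {message}")
```

Run as-is, the sketch reports a single warning for the example rule, because the rule is critical but declares no `oid` label. Passing the same `seen` set to every call is what lets the duplicate-name check work across a whole rule file.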