1 files changed, 388 insertions, 0 deletions
diff --git a/doc/mgr/prometheus.rst b/doc/mgr/prometheus.rst
new file mode 100644
index 000000000..d4a9fd911
--- /dev/null
+++ b/doc/mgr/prometheus.rst
@@ -0,0 +1,388 @@
+.. _mgr-prometheus:
+
+=================
+Prometheus Module
+=================
+
+Provides a Prometheus exporter to pass on Ceph performance counters
+from the collection point in ceph-mgr.  Ceph-mgr receives MMgrReport
+messages from all MgrClient processes (mons and OSDs, for instance)
+with performance counter schema data and actual counter data, and keeps
+a circular buffer of the last N samples.  This module creates an HTTP
+endpoint (like all Prometheus exporters) and retrieves the latest sample
+of every counter when polled (or "scraped" in Prometheus terminology).
+The HTTP path and query parameters are ignored; all extant counters
+for all reporting entities are returned in text exposition format.
+(See the Prometheus `documentation <https://prometheus.io/docs/instrumenting/exposition_formats/#text-format-details>`_.)
+
+Enabling prometheus output
+==========================
+
+The *prometheus* module is enabled with::
+
+  ceph mgr module enable prometheus
+
+Configuration
+-------------
+
+.. note::
+
+    The Prometheus manager module needs to be restarted for configuration changes to be applied.
+
+By default the module will accept HTTP requests on port ``9283`` on all IPv4
+and IPv6 addresses on the host.  The port and listen address are both
+configurable with ``ceph config set``, with keys
+``mgr/prometheus/server_addr`` and ``mgr/prometheus/server_port``.  This port
+is registered with Prometheus's `registry
+<https://github.com/prometheus/prometheus/wiki/Default-port-allocations>`_.
+
+::
+
+    ceph config set mgr mgr/prometheus/server_addr 0.0.0.0
+    ceph config set mgr mgr/prometheus/server_port 9283
+
+.. warning::
+
+    The ``scrape_interval`` of this module should always be set to match
+    Prometheus' scrape interval to work properly and not cause any issues.
+
+The scrape interval in the module is used for caching purposes
+and to determine when a cache is stale.
+
+It is not recommended to use a scrape interval below 10 seconds.  It is
+recommended to use 15 seconds as scrape interval, though, in some cases it
+might be useful to increase the scrape interval.
+
+To set a different scrape interval in the Prometheus module, set
+``scrape_interval`` to the desired value::
+
+    ceph config set mgr mgr/prometheus/scrape_interval 20
+
+On large clusters (>1000 OSDs), the time to fetch the metrics may become
+significant.  Without the cache, the Prometheus manager module could, especially
+in conjunction with multiple Prometheus instances, overload the manager and lead
+to unresponsive or crashing Ceph manager instances.  Hence, the cache is enabled
+by default.  This means that there is a possibility that the cache becomes
+stale.  The cache is considered stale when the time to fetch the metrics from
+Ceph exceeds the configured :confval:``mgr/prometheus/scrape_interval``.
+
+If that is the case, **a warning will be logged** and the module will either
+
+* respond with a 503 HTTP status code (service unavailable) or,
+* it will return the content of the cache, even though it might be stale.
+
+This behavior can be configured. By default, it will return a 503 HTTP status
+code (service unavailable). You can set other options using the ``ceph config
+set`` commands.
+
+To tell the module to respond with possibly stale data, set it to ``return``::
+
+    ceph config set mgr mgr/prometheus/stale_cache_strategy return
+
+To tell the module to respond with "service unavailable", set it to ``fail``::
+
+    ceph config set mgr mgr/prometheus/stale_cache_strategy fail
+
+If you are confident that you don't require the cache, you can disable it::
+
+    ceph config set mgr mgr/prometheus/cache false
+
+If you are using the prometheus module behind some kind of reverse proxy or
+loadbalancer, you can simplify discovering the active instance by switching
+to ``error``-mode::
+
+    ceph config set mgr mgr/prometheus/standby_behaviour error
+
+If set, the prometheus module will repond with a HTTP error when requesting ``/``
+from the standby instance. The default error code is 500, but you can configure
+the HTTP response code with::
+
+    ceph config set mgr mgr/prometheus/standby_error_status_code 503
+
+Valid error codes are between 400-599.
+
+To switch back to the default behaviour, simply set the config key to ``default``::
+
+    ceph config set mgr mgr/prometheus/standby_behaviour default
+
+.. _prometheus-rbd-io-statistics:
+
+Ceph Health Checks
+------------------
+
+The mgr/prometheus module also tracks and maintains a history of Ceph health checks,
+exposing them to the Prometheus server as discrete metrics. This allows Prometheus
+alert rules to be configured for specific health check events.
+
+The metrics take the following form;
+
+::
+
+    # HELP ceph_health_detail healthcheck status by type (0=inactive, 1=active)
+    # TYPE ceph_health_detail gauge
+    ceph_health_detail{name="OSDMAP_FLAGS",severity="HEALTH_WARN"} 0.0
+    ceph_health_detail{name="OSD_DOWN",severity="HEALTH_WARN"} 1.0
+    ceph_health_detail{name="PG_DEGRADED",severity="HEALTH_WARN"} 1.0
+
+The health check history is made available through the following commands;
+
+::
+
+    healthcheck history ls [--format {plain|json|json-pretty}]
+    healthcheck history clear
+
+The ``ls`` command provides an overview of the health checks that the cluster has
+encountered, or since the last ``clear`` command was issued. The example below;
+
+::
+
+    [ceph: root@c8-node1 /]# ceph healthcheck history ls
+    Healthcheck Name          First Seen (UTC)      Last seen (UTC)       Count  Active
+    OSDMAP_FLAGS              2021/09/16 03:17:47   2021/09/16 22:07:40       2    No
+    OSD_DOWN                  2021/09/17 00:11:59   2021/09/17 00:11:59       1   Yes
+    PG_DEGRADED               2021/09/17 00:11:59   2021/09/17 00:11:59       1   Yes
+    3 health check(s) listed
+
+
+RBD IO statistics
+-----------------
+
+The module can optionally collect RBD per-image IO statistics by enabling
+dynamic OSD performance counters. The statistics are gathered for all images
+in the pools that are specified in the ``mgr/prometheus/rbd_stats_pools``
+configuration parameter. The parameter is a comma or space separated list
+of ``pool[/namespace]`` entries. If the namespace is not specified the
+statistics are collected for all namespaces in the pool.
+
+Example to activate the RBD-enabled pools ``pool1``, ``pool2`` and ``poolN``::
+
+  ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1,pool2,poolN"
+
+The module makes the list of all available images scanning the specified
+pools and namespaces and refreshes it periodically. The period is
+configurable via the ``mgr/prometheus/rbd_stats_pools_refresh_interval``
+parameter (in sec) and is 300 sec (5 minutes) by default. The module will
+force refresh earlier if it detects statistics from a previously unknown
+RBD image.
+
+Example to turn up the sync interval to 10 minutes::
+
+  ceph config set mgr mgr/prometheus/rbd_stats_pools_refresh_interval 600
+
+Statistic names and labels
+==========================
+
+The names of the stats are exactly as Ceph names them, with
+illegal characters ``.``, ``-`` and ``::`` translated to ``_``,
+and ``ceph_`` prefixed to all names.
+
+
+All *daemon* statistics have a ``ceph_daemon`` label such as "osd.123"
+that identifies the type and ID of the daemon they come from.  Some
+statistics can come from different types of daemon, so when querying
+e.g. an OSD's RocksDB stats, you would probably want to filter
+on ceph_daemon starting with "osd" to avoid mixing in the monitor
+rocksdb stats.
+
+
+The *cluster* statistics (i.e. those global to the Ceph cluster)
+have labels appropriate to what they report on.  For example,
+metrics relating to pools have a ``pool_id`` label.
+
+
+The long running averages that represent the histograms from core Ceph
+are represented by a pair of ``<name>_sum`` and ``<name>_count`` metrics.
+This is similar to how histograms are represented in `Prometheus <https://prometheus.io/docs/concepts/metric_types/#histogram>`_
+and they can also be treated `similarly <https://prometheus.io/docs/practices/histograms/>`_.
+
+Pool and OSD metadata series
+----------------------------
+
+Special series are output to enable displaying and querying on
+certain metadata fields.
+
+Pools have a ``ceph_pool_metadata`` field like this:
+
+::
+
+    ceph_pool_metadata{pool_id="2",name="cephfs_metadata_a"} 1.0
+
+OSDs have a ``ceph_osd_metadata`` field like this:
+
+::
+
+    ceph_osd_metadata{cluster_addr="172.21.9.34:6802/19096",device_class="ssd",ceph_daemon="osd.0",public_addr="172.21.9.34:6801/19096",weight="1.0"} 1.0
+
+
+Correlating drive statistics with node_exporter
+-----------------------------------------------
+
+The prometheus output from Ceph is designed to be used in conjunction
+with the generic host monitoring from the Prometheus node_exporter.
+
+To enable correlation of Ceph OSD statistics with node_exporter's
+drive statistics, special series are output like this:
+
+::
+
+    ceph_disk_occupation_human{ceph_daemon="osd.0", device="sdd", exported_instance="myhost"}
+
+To use this to get disk statistics by OSD ID, use either the ``and`` operator or
+the ``*`` operator in your prometheus query. All metadata metrics (like ``
+ceph_disk_occupation_human`` have the value 1 so they act neutral with ``*``. Using ``*``
+allows to use ``group_left`` and ``group_right`` grouping modifiers, so that
+the resulting metric has additional labels from one side of the query.
+
+See the
+`prometheus documentation`__ for more information about constructing queries.
+
+__ https://prometheus.io/docs/prometheus/latest/querying/basics
+
+The goal is to run a query like
+
+::
+
+    rate(node_disk_bytes_written[30s]) and
+    on (device,instance) ceph_disk_occupation_human{ceph_daemon="osd.0"}
+
+Out of the box the above query will not return any metrics since the ``instance`` labels of
+both metrics don't match. The ``instance`` label of ``ceph_disk_occupation_human``
+will be the currently active MGR node.
+
+The following two section outline two approaches to remedy this.
+
+.. note::
+
+    If you need to group on the `ceph_daemon` label instead of `device` and
+    `instance` labels, using `ceph_disk_occupation_human` may not work reliably.
+    It is advised that you use `ceph_disk_occupation` instead.
+
+    The difference is that `ceph_disk_occupation_human` may group several OSDs
+    into the value of a single `ceph_daemon` label in cases where multiple OSDs
+    share a disk.
+
+Use label_replace
+=================
+
+The ``label_replace`` function (cp.
+`label_replace documentation <https://prometheus.io/docs/prometheus/latest/querying/functions/#label_replace>`_)
+can add a label to, or alter a label of, a metric within a query.
+
+To correlate an OSD and its disks write rate, the following query can be used:
+
+::
+
+    label_replace(
+        rate(node_disk_bytes_written[30s]),
+        "exported_instance",
+        "$1",
+        "instance",
+        "(.*):.*"
+    ) and on (device, exported_instance) ceph_disk_occupation_human{ceph_daemon="osd.0"}
+
+Configuring Prometheus server
+=============================
+
+honor_labels
+------------
+
+To enable Ceph to output properly-labeled data relating to any host,
+use the ``honor_labels`` setting when adding the ceph-mgr endpoints
+to your prometheus configuration.
+
+This allows Ceph to export the proper ``instance`` label without prometheus
+overwriting it. Without this setting, Prometheus applies an ``instance`` label
+that includes the hostname and port of the endpoint that the series came from.
+Because Ceph clusters have multiple manager daemons, this results in an
+``instance`` label that changes spuriously when the active manager daemon
+changes.
+
+If this is undesirable a custom ``instance`` label can be set in the
+Prometheus target configuration: you might wish to set it to the hostname
+of your first mgr daemon, or something completely arbitrary like "ceph_cluster".
+
+node_exporter hostname labels
+-----------------------------
+
+Set your ``instance`` labels to match what appears in Ceph's OSD metadata
+in the ``instance`` field.  This is generally the short hostname of the node.
+
+This is only necessary if you want to correlate Ceph stats with host stats,
+but you may find it useful to do it in all cases in case you want to do
+the correlation in the future.
+
+Example configuration
+---------------------
+
+This example shows a single node configuration running ceph-mgr and
+node_exporter on a server called ``senta04``. Note that this requires one
+to add an appropriate and unique ``instance`` label to each ``node_exporter`` target.
+
+This is just an example: there are other ways to configure prometheus
+scrape targets and label rewrite rules.
+
+prometheus.yml
+~~~~~~~~~~~~~~
+
+::
+
+    global:
+      scrape_interval:     15s
+      evaluation_interval: 15s
+
+    scrape_configs:
+      - job_name: 'node'
+        file_sd_configs:
+          - files:
+            - node_targets.yml
+      - job_name: 'ceph'
+        honor_labels: true
+        file_sd_configs:
+          - files:
+            - ceph_targets.yml
+
+
+ceph_targets.yml
+~~~~~~~~~~~~~~~~
+
+
+::
+
+    [
+        {
+            "targets": [ "senta04.mydomain.com:9283" ],
+            "labels": {}
+        }
+    ]
+
+
+node_targets.yml
+~~~~~~~~~~~~~~~~
+
+::
+
+    [
+        {
+            "targets": [ "senta04.mydomain.com:9100" ],
+            "labels": {
+                "instance": "senta04"
+            }
+        }
+    ]
+
+
+Notes
+=====
+
+Counters and gauges are exported; currently histograms and long-running
+averages are not.  It's possible that Ceph's 2-D histograms could be
+reduced to two separate 1-D histograms, and that long-running averages
+could be exported as Prometheus' Summary type.
+
+Timestamps, as with many Prometheus exporters, are established by
+the server's scrape time (Prometheus expects that it is polling the
+actual counter process synchronously).  It is possible to supply a
+timestamp along with the stat report, but the Prometheus team strongly
+advises against this.  This means that timestamps will be delayed by
+an unpredictable amount; it's not clear if this will be problematic,
+but it's worth knowing about.