.. _mgr-cephadm-monitoring:

Monitoring Services
===================

Ceph Dashboard uses `Prometheus <https://prometheus.io/>`_, `Grafana
<https://grafana.com/>`_, and related tools to store and visualize detailed
metrics on cluster utilization and performance.  Ceph users have three options:

#. Have cephadm deploy and configure these services.  This is the default
   when bootstrapping a new cluster unless the ``--skip-monitoring-stack``
   option is used.
#. Deploy and configure these services manually.  This is recommended for users
   with existing Prometheus services in their environment (and in cases where
   Ceph is running in Kubernetes with Rook).
#. Skip the monitoring stack completely.  Some Ceph dashboard graphs will
   not be available.

The monitoring stack consists of `Prometheus <https://prometheus.io/>`_,
Prometheus exporters (:ref:`mgr-prometheus`, `Node exporter
<https://prometheus.io/docs/guides/node-exporter/>`_), `Prometheus Alert
Manager <https://prometheus.io/docs/alerting/alertmanager/>`_ and `Grafana
<https://grafana.com/>`_.

.. note::

  Prometheus' security model presumes that untrusted users have access to the
  Prometheus HTTP endpoint and logs. Untrusted users have access to all the
  (meta)data Prometheus collects that is contained in the database, plus a
  variety of operational and debugging information.

  However, Prometheus' HTTP API is limited to read-only operations.
  Configurations can *not* be changed using the API and secrets are not
  exposed. Moreover, Prometheus has some built-in measures to mitigate the
  impact of denial of service attacks.

  Please see `Prometheus' Security model
  <https://prometheus.io/docs/operating/security/>`_ for more detailed
  information.

Deploying monitoring with cephadm
---------------------------------

The default behavior of ``cephadm`` is to deploy a basic monitoring stack.
However, you might have a Ceph cluster without a monitoring stack, for
example because you passed the ``--skip-monitoring-stack`` option to
``cephadm`` when the cluster was bootstrapped, or because you converted an
existing cluster (which had no monitoring stack) to cephadm management.

To set up monitoring on a Ceph cluster that has no monitoring, follow the
steps below:

#. Deploy a node-exporter service on every node of the cluster. The
   node-exporter provides host-level metrics like CPU and memory utilization:

   .. prompt:: bash #

     ceph orch apply node-exporter

#. Deploy alertmanager:

   .. prompt:: bash #

     ceph orch apply alertmanager

#. Deploy Prometheus. A single Prometheus instance is sufficient, but
   for high availability (HA) you might want to deploy two:

   .. prompt:: bash #

     ceph orch apply prometheus

   or 

   .. prompt:: bash #
     
     ceph orch apply prometheus --placement 'count:2'

#. Deploy grafana:

   .. prompt:: bash #

     ceph orch apply grafana
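
Once these specs have been applied, you can check that the monitoring
daemons have been scheduled and started:

.. prompt:: bash #

  ceph orch ls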

.. _cephadm-monitoring-networks-ports:

Networks and Ports
~~~~~~~~~~~~~~~~~~

All monitoring services can have the network and port they bind to configured
with a yaml service specification.

Example spec file:

.. code-block:: yaml

    service_type: grafana
    service_name: grafana
    placement:
      count: 1
    networks:
    - 192.169.142.0/24
    spec:
      port: 4200
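
To put a spec like this into effect, save it to a file and apply it with
``ceph orch apply`` (the file name here is just an example):

.. prompt:: bash #

  ceph orch apply -i grafana-spec.yaml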

.. _cephadm_monitoring-images:

Using custom images
~~~~~~~~~~~~~~~~~~~

It is possible to install or upgrade monitoring components based on other
images.  To do so, the name of the image to be used needs to be stored in the
configuration first.  The following configuration options are available.

- ``container_image_prometheus``
- ``container_image_grafana``
- ``container_image_alertmanager``
- ``container_image_node_exporter``

Custom images can be set with the ``ceph config`` command:

.. code-block:: bash

     ceph config set mgr mgr/cephadm/<option_name> <value>

For example:

.. code-block:: bash

     ceph config set mgr mgr/cephadm/container_image_prometheus prom/prometheus:v1.4.1
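
To check which image is currently configured, the option can be read back
with ``ceph config get``:

.. code-block:: bash

     ceph config get mgr mgr/cephadm/container_image_prometheus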

If monitoring stack daemon(s) of the type whose image you've changed are
already running, you must redeploy them in order for them to actually use
the new image.

For example, if you changed the Prometheus image:

.. prompt:: bash #

     ceph orch redeploy prometheus


.. note::

     Setting a custom image overrides the default value (but does not
     overwrite it).  The default value changes when updates become available.
     With a custom image set, the component you have set it for will no
     longer be updated automatically: you will need to update the
     configuration (image name and tag) manually in order to install updates.

     If you prefer to return to the recommended defaults, you can reset the
     custom image.  After that, the default value is used again.  Use
     ``ceph config rm`` to reset the configuration option:

     .. code-block:: bash

          ceph config rm mgr mgr/cephadm/<option_name>

     For example:

     .. code-block:: bash

          ceph config rm mgr mgr/cephadm/container_image_prometheus

See also :ref:`cephadm-airgap`.

.. _cephadm-overwrite-jinja2-templates:

Using custom configuration files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By overriding cephadm templates, it is possible to completely customize the
configuration files for monitoring services.

Internally, cephadm already uses `Jinja2
<https://jinja.palletsprojects.com/en/2.11.x/>`_ templates to generate the
configuration files for all monitoring components. To be able to customize the
configuration of Prometheus, Grafana or the Alertmanager it is possible to store
a Jinja2 template for each service that will be used for configuration
generation instead. This template will be evaluated every time a service of that
kind is deployed or reconfigured. That way, the custom configuration is
preserved and automatically applied on future deployments of these services.

.. note::

  The configuration of the custom template is also preserved when the default
  configuration of cephadm changes. If the updated configuration is to be used,
  the custom template needs to be migrated *manually* after each upgrade of Ceph.

Option names
""""""""""""

The following templates for files that will be generated by cephadm can be
overridden. These are the names to be used when storing with ``ceph config-key
set``:

- ``services/alertmanager/alertmanager.yml``
- ``services/grafana/ceph-dashboard.yml``
- ``services/grafana/grafana.ini``
- ``services/prometheus/prometheus.yml``
- ``services/prometheus/alerting/custom_alerts.yml``

You can look up the file templates that are currently used by cephadm in
``src/pybind/mgr/cephadm/templates``:

- ``services/alertmanager/alertmanager.yml.j2``
- ``services/grafana/ceph-dashboard.yml.j2``
- ``services/grafana/grafana.ini.j2``
- ``services/prometheus/prometheus.yml.j2``

Usage
"""""

The following command applies a single-line value:

.. code-block:: bash

  ceph config-key set mgr/cephadm/<option_name> <value>

To set the contents of a file as a template, use the ``-i`` argument:

.. code-block:: bash

  ceph config-key set mgr/cephadm/<option_name> -i $PWD/<filename>

.. note::

  When using files as input to ``config-key``, an absolute path to the file
  must be used.
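
A stored template can be inspected afterwards with ``ceph config-key get``:

.. code-block:: bash

  ceph config-key get mgr/cephadm/<option_name>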


Then the configuration file for the service needs to be recreated.
This is done using ``reconfig``. For more details, see the following example.

Example
"""""""

.. code-block:: bash

  # set the contents of ./prometheus.yml.j2 as template
  ceph config-key set mgr/cephadm/services/prometheus/prometheus.yml \
    -i $PWD/prometheus.yml.j2

  # reconfig the prometheus service
  ceph orch reconfig prometheus

.. code-block:: bash

  # set additional custom alerting rules for Prometheus
  ceph config-key set mgr/cephadm/services/prometheus/alerting/custom_alerts.yml \
    -i $PWD/custom_alerts.yml

  # Note that custom alerting rules are not parsed by Jinja and hence escaping
  # will not be an issue.
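
For reference, a minimal ``custom_alerts.yml`` might contain a single extra
rule such as the following (the rule itself is only a hypothetical example):

.. code-block:: yaml

  # custom_alerts.yml -- one additional alerting rule (hypothetical example)
  groups:
    - name: custom.rules
      rules:
        - alert: HostHighLoad
          expr: node_load1 > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "1-minute load average has been above 10 for 5 minutes"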

Deploying monitoring without cephadm
------------------------------------

If you have an existing Prometheus monitoring infrastructure, or would like
to manage it yourself, you need to configure it to integrate with your Ceph
cluster.

* Enable the prometheus module in the ceph-mgr daemon:

  .. code-block:: bash

     ceph mgr module enable prometheus

  By default, ceph-mgr presents prometheus metrics on port 9283 on each host
  running a ceph-mgr daemon.  Configure Prometheus to scrape these endpoints;
  a minimal scrape configuration is sketched after this list.

* To enable the dashboard's prometheus-based alerting, see :ref:`dashboard-alerting`.

* To enable dashboard integration with Grafana, see :ref:`dashboard-grafana`.
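
As a sketch, a scrape job for the ceph-mgr metrics endpoints might look like
this in ``prometheus.yml`` (the host names below are placeholders; list every
host that may run an active ceph-mgr daemon):

.. code-block:: yaml

  scrape_configs:
    - job_name: 'ceph'
      # keep the labels exported by ceph-mgr instead of overwriting them
      honor_labels: true
      static_configs:
        - targets: ['mgr-host-1:9283', 'mgr-host-2:9283']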

Disabling monitoring
--------------------

To disable monitoring and remove the software that supports it, run the following commands:

.. code-block:: console

  $ ceph orch rm grafana
  $ ceph orch rm prometheus --force   # this will delete metrics data collected so far
  $ ceph orch rm node-exporter
  $ ceph orch rm alertmanager
  $ ceph mgr module disable prometheus

See also :ref:`orch-rm`.

Setting up RBD-Image monitoring
-------------------------------

For performance reasons, monitoring of RBD images is disabled by default. For
more information, please see :ref:`prometheus-rbd-io-statistics`. While
disabled, the overview and details dashboards will stay empty in Grafana and
the metrics will not be visible in Prometheus.
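
If you do want these metrics, per-image IO statistics can be enabled by
naming the pools they should be collected for (the pool names below are
placeholders; see the linked reference for details):

.. prompt:: bash #

  ceph config set mgr mgr/prometheus/rbd_stats_pools "pool1,pool2"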

Setting up Prometheus
---------------------

Setting Prometheus Retention Time
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Cephadm provides the option to set the Prometheus TSDB retention time using
a ``retention_time`` field in the Prometheus service spec. The value defaults
to 15 days (15d). If you would like a different value, such as 1 year (1y),
you can apply a service spec similar to:

.. code-block:: yaml

    service_type: prometheus
    placement:
      count: 1
    spec:
      retention_time: "1y"
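
Save the spec to a file and apply it (the file name is arbitrary):

.. prompt:: bash #

  ceph orch apply -i prometheus.yaml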

.. note::

  If you already had Prometheus daemon(s) deployed and are updating an
  existing spec, as opposed to doing a fresh Prometheus deployment, you must
  also tell cephadm to redeploy the Prometheus daemon(s) to put this change
  into effect. This can be done with a ``ceph orch redeploy prometheus``
  command.

Setting up Grafana
------------------

Manually setting the Grafana URL
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Cephadm automatically configures Prometheus, Grafana, and Alertmanager in
all cases except one.

In some setups, the Dashboard user's browser might not be able to access the
Grafana URL that is configured in Ceph Dashboard. This can happen when the
cluster and the accessing user are in different DNS zones.

If this is the case, you can use a configuration option for Ceph Dashboard
to set the URL that the user's browser will use to access Grafana. This
value will never be altered by cephadm. To set this configuration option,
issue the following command:

   .. prompt:: bash $

     ceph dashboard set-grafana-frontend-api-url <grafana-server-api>
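
For example, with a hypothetical externally reachable Grafana URL:

   .. prompt:: bash $

     ceph dashboard set-grafana-frontend-api-url https://grafana.example.com:3000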

It might take a minute or two for services to be deployed. After the
services have been deployed, you should see something like this when you
issue the command ``ceph orch ls``:

.. code-block:: console

  $ ceph orch ls
  NAME           RUNNING  REFRESHED  IMAGE NAME                                      IMAGE ID        SPEC
  alertmanager       1/1  6s ago     docker.io/prom/alertmanager:latest              0881eb8f169f  present
  crash              2/2  6s ago     docker.io/ceph/daemon-base:latest-master-devel  mix           present
  grafana            1/1  0s ago     docker.io/pcuzner/ceph-grafana-el8:latest       f77afcf0bcf6   absent
  node-exporter      2/2  6s ago     docker.io/prom/node-exporter:latest             e5a616e4b9cf  present
  prometheus         1/1  6s ago     docker.io/prom/prometheus:latest                e935122ab143  present

Configuring SSL/TLS for Grafana
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``cephadm`` deploys Grafana using the certificate defined in the ceph
key/value store. If no certificate is specified, ``cephadm`` generates a
self-signed certificate during the deployment of the Grafana service.

A custom certificate can be configured using the following commands:

.. prompt:: bash #

  ceph config-key set mgr/cephadm/grafana_key -i $PWD/key.pem
  ceph config-key set mgr/cephadm/grafana_crt -i $PWD/certificate.pem
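
If you just need a certificate for testing, a self-signed pair can be
generated with ``openssl`` (a sketch; adjust the subject and validity to
your environment):

.. prompt:: bash #

  openssl req -x509 -newkey rsa:4096 -sha256 -days 365 -nodes \
    -keyout key.pem -out certificate.pem -subj '/CN=grafana.example.com'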

If you have already deployed Grafana, run ``reconfig`` on the service to
update its configuration:

.. prompt:: bash #

  ceph orch reconfig grafana

The ``reconfig`` command also sets the proper URL for Ceph Dashboard.

Setting the initial admin password
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, Grafana will not create an initial admin user. To create the
admin user, create a file ``grafana.yaml`` with this content:

.. code-block:: yaml

  service_type: grafana
  spec:
    initial_admin_password: mypassword

Then apply this specification:

.. code-block:: bash

  ceph orch apply -i grafana.yaml
  ceph orch redeploy grafana

Grafana will now create an admin user called ``admin`` with the
given password.


Setting up Alertmanager
-----------------------

Adding Alertmanager webhooks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To add new webhooks to the Alertmanager configuration, add additional
webhook URLs like so:

.. code-block:: yaml

    service_type: alertmanager
    spec:
      user_data:
        default_webhook_urls:
        - "https://foo"
        - "https://bar"

Where ``default_webhook_urls`` is a list of additional URLs that are
added to the default receivers' ``<webhook_configs>`` configuration.

Run ``reconfig`` on the service to update its configuration:

.. prompt:: bash #

  ceph orch reconfig alertmanager

Turn on Certificate Validation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you are using certificates for Alertmanager and want to make sure
these certs are verified, you should set the ``secure`` option to
``true`` in your Alertmanager spec (this defaults to ``false``).

.. code-block:: yaml

    service_type: alertmanager
    spec:
      secure: true

If you already had Alertmanager daemons running before applying the spec,
you must reconfigure them to update their configuration:

.. prompt:: bash #

  ceph orch reconfig alertmanager

Further Reading
---------------

* :ref:`mgr-prometheus`