summaryrefslogtreecommitdiffstats
path: root/doc/ceph-volume/systemd.rst
blob: 5b5273c9cfd8b20e4d268ab027af3a7210840c1a (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
.. _ceph-volume-systemd:

systemd
=======
As part of the activation process (either with :ref:`ceph-volume-lvm-activate`
or :ref:`ceph-volume-simple-activate`), systemd units will get enabled that
will use the OSD id and uuid as part of their name. These units will be run
when the system boots, and will proceed to activate their corresponding
volumes via their sub-command implementation.

The API for activation is a bit loose, it only requires two parts: the
subcommand to use and any extra meta information separated by a dash. This
convention makes the units look like::

    ceph-volume@{command}-{extra metadata}

The *extra metadata* can be anything needed that the subcommand implementing
the processing might need. In the case of :ref:`ceph-volume-lvm` and
:ref:`ceph-volume-simple`, both look to consume the :term:`OSD id` and :term:`OSD uuid`,
but this is not a hard requirement, it is just how the sub-commands are
implemented.

Both the command and extra metadata gets persisted by systemd as part of the
*"instance name"* of the unit.  For example an OSD with an ID of 0, for the
``lvm`` sub-command would look like::

    systemctl enable ceph-volume@lvm-0-0A3E1ED2-DA8A-4F0E-AA95-61DEC71768D6

The enabled unit is a :term:`systemd oneshot` service, meant to start at boot
after the local file system is ready to be used.


Failure and Retries
-------------------
It is common to have failures when a system is coming up online. The devices
are sometimes not fully available and this unpredictable behavior may cause an
OSD to not be ready to be used.

There are two configurable environment variables used to set the retry
behavior:

* ``CEPH_VOLUME_SYSTEMD_TRIES``: Defaults to 30
* ``CEPH_VOLUME_SYSTEMD_INTERVAL``: Defaults to 5

The *"tries"* is a number that sets the maximum number of times the unit will
attempt to activate an OSD before giving up.

The *"interval"* is a value in seconds that determines the waiting time before
initiating another try at activating the OSD.