summaryrefslogtreecommitdiffstats
path: root/doc/ceph-volume/intro.rst
diff options
context:
space:
mode:
authorDaniel Baumann <daniel.baumann@progress-linux.org>2024-04-21 11:54:28 +0000
committerDaniel Baumann <daniel.baumann@progress-linux.org>2024-04-21 11:54:28 +0000
commite6918187568dbd01842d8d1d2c808ce16a894239 (patch)
tree64f88b554b444a49f656b6c656111a145cbbaa28 /doc/ceph-volume/intro.rst
parentInitial commit. (diff)
downloadceph-e6918187568dbd01842d8d1d2c808ce16a894239.tar.xz
ceph-e6918187568dbd01842d8d1d2c808ce16a894239.zip
Adding upstream version 18.2.2.upstream/18.2.2
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'doc/ceph-volume/intro.rst')
-rw-r--r--doc/ceph-volume/intro.rst84
1 files changed, 84 insertions, 0 deletions
diff --git a/doc/ceph-volume/intro.rst b/doc/ceph-volume/intro.rst
new file mode 100644
index 000000000..c36f12a77
--- /dev/null
+++ b/doc/ceph-volume/intro.rst
@@ -0,0 +1,84 @@
+.. _ceph-volume-overview:
+
+Overview
+--------
+The ``ceph-volume`` tool aims to be a single purpose command line tool to deploy
+logical volumes as OSDs, trying to maintain a similar API to ``ceph-disk`` when
+preparing, activating, and creating OSDs.
+
+It deviates from ``ceph-disk`` by not interacting or relying on the udev rules
+that come installed for Ceph. These rules allow automatic detection of
+previously setup devices that are in turn fed into ``ceph-disk`` to activate
+them.
+
+.. _ceph-disk-replaced:
+
+Replacing ``ceph-disk``
+-----------------------
+The ``ceph-disk`` tool was created at a time when the project was required to
+support many different types of init systems (upstart, sysvinit, etc...) while
+being able to discover devices. This caused the tool to concentrate initially
+(and exclusively afterwards) on GPT partitions. Specifically on GPT GUIDs,
+which were used to label devices in a unique way to answer questions like:
+
+* is this device a Journal?
+* an encrypted data partition?
+* was the device left partially prepared?
+
+To solve these, it used ``UDEV`` rules to match the GUIDs, that would call
+``ceph-disk``, and end up in a back and forth between the ``ceph-disk`` systemd
+unit and the ``ceph-disk`` executable. The process was very unreliable and time
+consuming (a timeout of close to three hours **per OSD** had to be put in
+place), and would cause OSDs to not come up at all during the boot process of
+a node.
+
+It was hard to debug, or even replicate these problems given the asynchronous
+behavior of ``UDEV``.
+
+Since the world-view of ``ceph-disk`` had to be GPT partitions exclusively, it meant
+that it couldn't work with other technologies like LVM, or similar device
+mapper devices. It was ultimately decided to create something modular, starting
+with LVM support, and the ability to expand on other technologies as needed.
+
+
+GPT partitions are simple?
+--------------------------
+Although partitions in general are simple to reason about, ``ceph-disk``
+partitions were not simple by any means. It required a tremendous amount of
+special flags in order to get them to work correctly with the device discovery
+workflow. Here is an example call to create a data partition::
+
+ /sbin/sgdisk --largest-new=1 --change-name=1:ceph data --partition-guid=1:f0fc39fd-eeb2-49f1-b922-a11939cf8a0f --typecode=1:89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be --mbrtogpt -- /dev/sdb
+
+Not only creating these was hard, but these partitions required devices to be
+exclusively owned by Ceph. For example, in some cases a special partition would
+be created when devices were encrypted, which would contain unencrypted keys.
+This was ``ceph-disk`` domain knowledge, which would not translate to a "GPT
+partitions are simple" understanding. Here is an example of that special
+partition being created::
+
+ /sbin/sgdisk --new=5:0:+10M --change-name=5:ceph lockbox --partition-guid=5:None --typecode=5:fb3aabf9-d25f-47cc-bf5e-721d181642be --mbrtogpt -- /dev/sdad
+
+
+Modularity
+----------
+``ceph-volume`` was designed to be a modular tool because we anticipate that
+there are going to be lots of ways that people provision the hardware devices
+that we need to consider. There are already two: legacy ceph-disk devices that
+are still in use and have GPT partitions (handled by :ref:`ceph-volume-simple`),
+and lvm. SPDK devices where we manage NVMe devices directly from userspace are
+on the immediate horizon, where LVM won't work there since the kernel isn't
+involved at all.
+
+``ceph-volume lvm``
+-------------------
+By making use of :term:`LVM tags`, the :ref:`ceph-volume-lvm` sub-command is
+able to store and later re-discover and query devices associated with OSDs so
+that they can later be activated.
+
+LVM performance penalty
+-----------------------
+In short: we haven't been able to notice any significant performance penalties
+associated with the change to LVM. By being able to work closely with LVM, the
+ability to work with other device mapper technologies was a given: there is no
+technical difficulty in working with anything that can sit below a Logical Volume.