Diffstat (limited to 'doc/dev')
-rw-r--r--  doc/dev/cephadm/developing-cephadm.rst                                                        |  61
-rw-r--r--  doc/dev/crimson/crimson.rst                                                                   |   5
-rw-r--r--  doc/dev/crimson/index.rst                                                                     |   2
-rw-r--r--  doc/dev/developer_guide/essentials.rst                                                        |  26
-rw-r--r--  doc/dev/developer_guide/testing_integration_tests/tests-integration-testing-teuthology-workflow.rst |   9
-rw-r--r--  doc/dev/internals.rst                                                                         |  10
-rw-r--r--  doc/dev/osd_internals/manifest.rst                                                            |   2
-rw-r--r--  doc/dev/osd_internals/snaps.rst                                                               | 121
-rw-r--r--  doc/dev/peering.rst                                                                           | 120
9 files changed, 267 insertions, 89 deletions
diff --git a/doc/dev/cephadm/developing-cephadm.rst b/doc/dev/cephadm/developing-cephadm.rst
index 49b771caa..a213b5f1e 100644
--- a/doc/dev/cephadm/developing-cephadm.rst
+++ b/doc/dev/cephadm/developing-cephadm.rst
@@ -401,3 +401,64 @@ own copy of the cephadm "binary" use the script located at
``./src/cephadm/build.py [output]``.
.. _Python Zip Application: https://peps.python.org/pep-0441/
+
+You can pass a limited set of version metadata values to be stored in the
+compiled cephadm. These values can be passed to the build script with
+the ``--set-version-var`` or ``-S`` option. Each value should take the form
+``KEY=VALUE``. Valid keys include:
+
+* ``CEPH_GIT_VER``
+* ``CEPH_GIT_NICE_VER``
+* ``CEPH_RELEASE``
+* ``CEPH_RELEASE_NAME``
+* ``CEPH_RELEASE_TYPE``
+
+Example: ``./src/cephadm/build.py -SCEPH_GIT_VER=$(git rev-parse HEAD) -SCEPH_GIT_NICE_VER=$(git describe) /tmp/cephadm``
+
+Typically these values will be passed to ``build.py`` by other, higher-level
+build tools such as cmake.
+
+The compiled version of the binary may include a curated set of dependencies
+within the zipapp. The bundled dependencies can be fetched with Python's
+``pip``, sourced from locally installed RPMs, or disabled entirely. To select
+the mode for bundled dependencies, use the ``--bundled-dependencies`` or
+``-B`` option with a value of ``pip``, ``rpm``, or ``none``.
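+
+For example, to build without any bundled dependencies (a sketch; the output
+path is arbitrary)::
+
+  $ ./src/cephadm/build.py -B none /tmp/cephadm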
+
+The compiled cephadm zipapp file retains metadata about how it was built. This
+can be displayed by running ``cephadm version --verbose``. The command will
+emit a JSON formatted object showing version metadata (if available), a list of
+the bundled dependencies generated by the build script (if bundled dependencies
+were enabled), and a summary of the top-level contents of the zipapp. Example::
+
+ $ ./cephadm version --verbose
+ {
+ "name": "cephadm",
+ "ceph_git_nice_ver": "18.0.0-6867-g6a1df2d0b01",
+ "ceph_git_ver": "6a1df2d0b01da581bfef3357940e1e88d5ce70ce",
+ "ceph_release_name": "reef",
+ "ceph_release_type": "dev",
+ "bundled_packages": [
+ {
+ "name": "Jinja2",
+ "version": "3.1.2",
+ "package_source": "pip",
+ "requirements_entry": "Jinja2 == 3.1.2"
+ },
+ {
+ "name": "MarkupSafe",
+ "version": "2.1.3",
+ "package_source": "pip",
+ "requirements_entry": "MarkupSafe == 2.1.3"
+ }
+ ],
+ "zip_root_entries": [
+ "Jinja2-3.1.2-py3.9.egg-info",
+ "MarkupSafe-2.1.3-py3.9.egg-info",
+ "__main__.py",
+ "__main__.pyc",
+ "_cephadmmeta",
+ "cephadmlib",
+ "jinja2",
+ "markupsafe"
+ ]
+ }
diff --git a/doc/dev/crimson/crimson.rst b/doc/dev/crimson/crimson.rst
index cbc20b773..ea00ceebf 100644
--- a/doc/dev/crimson/crimson.rst
+++ b/doc/dev/crimson/crimson.rst
@@ -148,7 +148,7 @@ options. By default, ``log-to-stdout`` is enabled, and ``--log-to-syslog`` is di
vstart.sh
---------
-The following options aree handy when using ``vstart.sh``,
+The following options can be used with ``vstart.sh``.
``--crimson``
Start ``crimson-osd`` instead of ``ceph-osd``.
@@ -195,9 +195,6 @@ The following options aree handy when using ``vstart.sh``,
Valid types include ``HDD``, ``SSD``(default), ``ZNS``, and ``RANDOM_BLOCK_SSD``
Note secondary devices should not be faster than the main device.
-``--seastore``
- Use SeaStore as the object store backend.
-
To start a cluster with a single Crimson node, run::
$ MGR=1 MON=1 OSD=1 MDS=0 RGW=0 ../src/vstart.sh -n -x \
diff --git a/doc/dev/crimson/index.rst b/doc/dev/crimson/index.rst
index 55f071825..9790a9640 100644
--- a/doc/dev/crimson/index.rst
+++ b/doc/dev/crimson/index.rst
@@ -1,3 +1,5 @@
+.. _crimson_dev_doc:
+
===============================
Crimson developer documentation
===============================
diff --git a/doc/dev/developer_guide/essentials.rst b/doc/dev/developer_guide/essentials.rst
index 5a31e430b..90201f7c5 100644
--- a/doc/dev/developer_guide/essentials.rst
+++ b/doc/dev/developer_guide/essentials.rst
@@ -13,20 +13,18 @@ following table shows all the leads and their nicks on `GitHub`_:
.. _github: https://github.com/
-========= ================ =============
-Scope Lead GitHub nick
-========= ================ =============
-Ceph Sage Weil liewegas
-RADOS Neha Ojha neha-ojha
-RGW Yehuda Sadeh yehudasa
-RGW Matt Benjamin mattbenjamin
-RBD Ilya Dryomov dis
-CephFS Venky Shankar vshankar
-Dashboard Ernesto Puerta epuertat
-MON Joao Luis jecluis
-Build/Ops Ken Dreyer ktdreyer
-Docs Zac Dover zdover23
-========= ================ =============
+========= ================== =============
+Scope Lead GitHub nick
+========= ================== =============
+RADOS Radoslaw Zarzynski rzarzynski
+RGW Casey Bodley cbodley
+RGW Matt Benjamin mattbenjamin
+RBD Ilya Dryomov dis
+CephFS Venky Shankar vshankar
+Dashboard Nizamudeen A nizamial09
+Build/Ops Ken Dreyer ktdreyer
+Docs Zac Dover zdover23
+========= ================== =============
The Ceph-specific acronyms in the table are explained in
:doc:`/architecture`.
diff --git a/doc/dev/developer_guide/testing_integration_tests/tests-integration-testing-teuthology-workflow.rst b/doc/dev/developer_guide/testing_integration_tests/tests-integration-testing-teuthology-workflow.rst
index 64b006c57..427c84bd0 100644
--- a/doc/dev/developer_guide/testing_integration_tests/tests-integration-testing-teuthology-workflow.rst
+++ b/doc/dev/developer_guide/testing_integration_tests/tests-integration-testing-teuthology-workflow.rst
@@ -209,6 +209,15 @@ For example: for the above test ID, the path is::
This method can be used to view the log more quickly than would be possible through a browser.
+In addition to ``teuthology.log``, some other files are included for debugging
+purposes:
+
+* ``unit_test_summary.yaml``: Provides a summary of all unit test failures.
+  Generated (optionally) when the ``unit_test_scan`` configuration option is
+  used in the job's YAML file (see the sketch after this list).
+
+* ``valgrind.yaml``: Summarizes any Valgrind errors that may occur.
+
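+A minimal sketch of enabling the scan in a job's YAML file (a hypothetical
+fragment; treat the exact key and value as assumptions, not a reference)::
+
+  unit_test_scan: true
+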
.. note:: To access archives more conveniently, ``/a/`` has been symbolically
linked to ``/ceph/teuthology-archive/``. For instance, to access the previous
example, we can use something like::
diff --git a/doc/dev/internals.rst b/doc/dev/internals.rst
index a894394c9..e72d2738b 100644
--- a/doc/dev/internals.rst
+++ b/doc/dev/internals.rst
@@ -2,10 +2,14 @@
Ceph Internals
================
-.. note:: If you're looking for how to use Ceph as a library from your
- own software, please see :doc:`/api/index`.
+.. note:: For information on how to use Ceph as a library (from your own
+ software), see :doc:`/api/index`.
-You can start a development mode Ceph cluster, after compiling the source, with::
+Starting a Development-mode Ceph Cluster
+----------------------------------------
+
+Compile the source and then run the following commands to start a
+development-mode Ceph cluster::
cd build
OSD=3 MON=3 MGR=3 ../src/vstart.sh -n -x
diff --git a/doc/dev/osd_internals/manifest.rst b/doc/dev/osd_internals/manifest.rst
index 7be4350ea..43c23fa71 100644
--- a/doc/dev/osd_internals/manifest.rst
+++ b/doc/dev/osd_internals/manifest.rst
@@ -218,6 +218,8 @@ we may want to exploit.
The dedup-tool needs to be updated to use ``LIST_SNAPS`` to discover
clones as part of leak detection.
+.. _osd-make-writeable:
+
An important question is how we deal with the fact that many clones
will frequently have references to the same backing chunks at the same
offset. In particular, ``make_writeable`` will generally create a clone
diff --git a/doc/dev/osd_internals/snaps.rst b/doc/dev/osd_internals/snaps.rst
index 5ebd0884a..736d0add5 100644
--- a/doc/dev/osd_internals/snaps.rst
+++ b/doc/dev/osd_internals/snaps.rst
@@ -23,12 +23,11 @@ The difference between *pool snaps* and *self managed snaps* from the
OSD's point of view lies in whether the *SnapContext* comes to the OSD
via the client's MOSDOp or via the most recent OSDMap.
-See OSD::make_writeable
+See :ref:`manifest.rst <osd-make-writeable>` for more information.
Ondisk Structures
-----------------
-Each object has in the PG collection a *head* object (or *snapdir*, which we
-will come to shortly) and possibly a set of *clone* objects.
+For each object, the PG collection contains a *head* object and possibly a set
+of *clone* objects.
Each hobject_t has a snap field. For the *head* (the only writeable version
of an object), the snap field is set to CEPH_NOSNAP. For the *clones*, the
snap field is set to the *seq* of the *SnapContext* at their creation.
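+
+A simplified sketch of the distinction (``hobject_t``, ``CEPH_NOSNAP``, and
+``SnapContext::seq`` are the real identifiers; the assignments here are purely
+illustrative)::
+
+  hobject_t head;            // the writeable version of the object
+  head.snap = CEPH_NOSNAP;   // heads carry the CEPH_NOSNAP sentinel
+
+  hobject_t clone;           // a clone frozen at snapshot time
+  clone.snap = snapc.seq;    // the SnapContext's seq at clone creation
+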
@@ -47,8 +46,12 @@ The *head* object contains a *SnapSet* encoded in an attribute, which tracks
3. Overlapping intervals between clones for tracking space usage
4. Clone size
-If the *head* is deleted while there are still clones, a *snapdir* object
-is created instead to house the *SnapSet*.
+The *head* can't be deleted while clones still exist. Instead, it is marked
+with the whiteout flag (``object_info_t::FLAG_WHITEOUT``) so that it can
+continue to house the *SnapSet*. In that case, the *head* object no longer
+logically exists.
+
+See ``should_whiteout()``.
Additionally, the *object_info_t* on each clone includes a vector of snaps
for which clone is defined.
@@ -126,3 +129,111 @@ up to 8 prefixes need to be checked to determine all hobjects in a particular
snap for a particular PG. Upon split, the prefixes to check on the parent
are adjusted such that only the objects remaining in the PG will be visible.
The children will immediately have the correct mapping.
+
+clone_overlap
+-------------
+Each SnapSet attached to the *head* object contains the overlapping intervals
+between clone objects for optimizing space.
+The overlapping intervals are stored in the ``clone_overlap`` map. Each
+element in the map stores a snap ID and the corresponding overlap with the
+next newest clone, expressed as ``[offset~length]`` intervals.
+
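+As a rough sketch, the map looks like this (simplified from the ``SnapSet``
+definition in ``src/osd/osd_types.h``)::
+
+  // snap ID of a clone -> extents of that clone that overlap
+  // with the next newest clone
+  std::map<snapid_t, interval_set<uint64_t>> clone_overlap;
+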
+See the following example using a 4-byte object:
+
++--------+---------+
+| object | content |
++========+=========+
+| head | [AAAA] |
++--------+---------+
+
+The ``listsnaps`` output is as follows:
+
++---------+-------+------+---------+
+| cloneid | snaps | size | overlap |
++=========+=======+======+=========+
+| head | - | 4 | |
++---------+-------+------+---------+
+
+After taking a snapshot (ID 1) and re-writing the first 2 bytes of the object,
+the newly created clone will overlap with the new *head* object in its last 2
+bytes.
+
++------------+---------+
+| object | content |
++============+=========+
+| head | [BBAA] |
++------------+---------+
+| clone ID 1 | [AAAA] |
++------------+---------+
+
++---------+-------+------+---------+
+| cloneid | snaps | size | overlap |
++=========+=======+======+=========+
+| 1 | 1 | 4 | [2~2] |
++---------+-------+------+---------+
+| head | - | 4 | |
++---------+-------+------+---------+
+
+After taking another snapshot (ID 2) and this time re-writing only the first
+byte of the object, the newly created clone (ID 2) will overlap with the new
+*head* object in its last 3 bytes, while the oldest clone (ID 1) will still
+overlap with the newest clone in its last 2 bytes.
+
++------------+---------+
+| object | content |
++============+=========+
+| head | [CBAA] |
++------------+---------+
+| clone ID 2 | [BBAA] |
++------------+---------+
+| clone ID 1 | [AAAA] |
++------------+---------+
+
++---------+-------+------+---------+
+| cloneid | snaps | size | overlap |
++=========+=======+======+=========+
+| 1 | 1 | 4 | [2~2] |
++---------+-------+------+---------+
+| 2 | 2 | 4 | [1~3] |
++---------+-------+------+---------+
+| head | - | 4 | |
++---------+-------+------+---------+
+
+If the *head* object is then completely re-written (all 4 bytes), the only
+remaining overlap will be the one between the two clones.
+
++------------+---------+
+| object | content |
++============+=========+
+| head | [DDDD] |
++------------+---------+
+| clone ID 2 | [BBAA] |
++------------+---------+
+| clone ID 1 | [AAAA] |
++------------+---------+
+
++---------+-------+------+---------+
+| cloneid | snaps | size | overlap |
++=========+=======+======+=========+
+| 1 | 1 | 4 | [2~2] |
++---------+-------+------+---------+
+| 2 | 2 | 4 | |
++---------+-------+------+---------+
+| head | - | 4 | |
++---------+-------+------+---------+
+
+Lastly, after the most recent snap (ID 2) is removed and snaptrim kicks in,
+no overlapping intervals will remain:
+
++------------+---------+
+| object | content |
++============+=========+
+| head | [DDDD] |
++------------+---------+
+| clone ID 1 | [AAAA] |
++------------+---------+
+
++---------+-------+------+---------+
+| cloneid | snaps | size | overlap |
++=========+=======+======+=========+
+| 1 | 1 | 4 | |
++---------+-------+------+---------+
+| head | - | 4 | |
++---------+-------+------+---------+
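+
+The tables above mirror what ``rados listsnaps`` reports for an object. For
+example (pool and object names are illustrative)::
+
+  $ rados -p test-pool listsnaps obj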
diff --git a/doc/dev/peering.rst b/doc/dev/peering.rst
index 3960e14ca..97a319129 100644
--- a/doc/dev/peering.rst
+++ b/doc/dev/peering.rst
@@ -6,98 +6,93 @@ Concepts
--------
*Peering*
- the process of bringing all of the OSDs that store
- a Placement Group (PG) into agreement about the state
- of all of the objects (and their metadata) in that PG.
- Note that agreeing on the state does not mean that
- they all have the latest contents.
+ the process of bringing all of the OSDs that store a Placement Group (PG)
+ into agreement about the state of all of the objects in that PG and all of
+ the metadata associated with those objects. Two OSDs can agree on the state
+ of the objects in the placement group and yet not have the latest contents.
*Acting set*
- the ordered list of OSDs who are (or were as of some epoch)
- responsible for a particular PG.
+ the ordered list of OSDs that are (or were as of some epoch) responsible for
+ a particular PG.
*Up set*
- the ordered list of OSDs responsible for a particular PG for
- a particular epoch according to CRUSH. Normally this
- is the same as the *acting set*, except when the *acting set* has been
- explicitly overridden via *PG temp* in the OSDMap.
+ the ordered list of OSDs responsible for a particular PG for a particular
+ epoch, according to CRUSH. This is the same as the *acting set* except when
+ the *acting set* has been explicitly overridden via *PG temp* in the OSDMap.
*PG temp*
- a temporary placement group acting set used while backfilling the
- primary osd. Let say acting is [0,1,2] and we are
- active+clean. Something happens and acting is now [3,1,2]. osd 3 is
- empty and can't serve reads although it is the primary. osd.3 will
- see that and request a *PG temp* of [1,2,3] to the monitors using a
- MOSDPGTemp message so that osd.1 temporarily becomes the
- primary. It will select osd.3 as a backfill peer and continue to
- serve reads and writes while osd.3 is backfilled. When backfilling
- is complete, *PG temp* is discarded and the acting set changes back
- to [3,1,2] and osd.3 becomes the primary.
+ a temporary placement group acting set that is used while backfilling the
+ primary OSD. Assume that the acting set is ``[0,1,2]`` and we are
+ ``active+clean``. Now assume that something happens and the acting set
+ becomes ``[3,1,2]``. Under these circumstances, OSD ``3`` is empty and can't
+ serve reads even though it is the primary. ``osd.3`` will respond by
+ requesting a *PG temp* of ``[1,2,3]`` to the monitors using a ``MOSDPGTemp``
+ message, and ``osd.1`` will become the primary temporarily. ``osd.1`` will
+ select ``osd.3`` as a backfill peer and will continue to serve reads and
+ writes while ``osd.3`` is backfilled. When backfilling is complete, *PG
+ temp* is discarded. The acting set changes back to ``[3,1,2]`` and ``osd.3``
+ becomes the primary.
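+
+ The *PG temp* entries currently in effect, if any, can be seen in the OSDMap
+ dump; for example (a sketch; the output is omitted here)::
+
+   $ ceph osd dump | grep pg_temp
+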
*current interval* or *past interval*
- a sequence of OSD map epochs during which the *acting set* and *up
- set* for particular PG do not change
+ a sequence of OSD map epochs during which the *acting set* and the *up
+ set* for a particular PG do not change.
*primary*
- the (by convention first) member of the *acting set*,
- who is responsible for coordination peering, and is
- the only OSD that will accept client initiated
- writes to objects in a placement group.
+ the member of the *acting set* that is responsible for coordinating peering
+ and the only OSD that accepts client-initiated writes to the objects in a
+ placement group. By convention, the primary is the first member of the
+ *acting set*.
*replica*
- a non-primary OSD in the *acting set* for a placement group
- (and who has been recognized as such and *activated* by the primary).
+ a non-primary OSD in the *acting set* of a placement group. A replica has
+ been recognized as a non-primary OSD and has been *activated* by the
+ primary.
*stray*
- an OSD who is not a member of the current *acting set*, but
- has not yet been told that it can delete its copies of a
- particular placement group.
+ an OSD that is not a member of the current *acting set* and has not yet been
+ told that it can delete its copies of a particular placement group.
*recovery*
- ensuring that copies of all of the objects in a PG
- are on all of the OSDs in the *acting set*. Once
- *peering* has been performed, the primary can start
- accepting write operations, and *recovery* can proceed
- in the background.
+ the process of ensuring that copies of all of the objects in a PG are on all
+ of the OSDs in the *acting set*. After *peering* has been performed, the
+ primary can begin accepting write operations and *recovery* can proceed in
+ the background.
*PG info*
- basic metadata about the PG's creation epoch, the version
- for the most recent write to the PG, *last epoch started*, *last
- epoch clean*, and the beginning of the *current interval*. Any
- inter-OSD communication about PGs includes the *PG info*, such that
- any OSD that knows a PG exists (or once existed) also has a lower
- bound on *last epoch clean* or *last epoch started*.
+ basic metadata about the PG's creation epoch, the version for the most
+ recent write to the PG, the *last epoch started*, the *last epoch clean*,
+ and the beginning of the *current interval*. Any inter-OSD communication
+ about PGs includes the *PG info*, such that any OSD that knows a PG exists
+ (or once existed) also has a lower bound on *last epoch clean* or *last
+ epoch started*.
*PG log*
- a list of recent updates made to objects in a PG.
- Note that these logs can be truncated after all OSDs
- in the *acting set* have acknowledged up to a certain
- point.
+ a list of recent updates made to objects in a PG. These logs can be
+ truncated after all OSDs in the *acting set* have acknowledged the changes.
*missing set*
- Each OSD notes update log entries and if they imply updates to
- the contents of an object, adds that object to a list of needed
- updates. This list is called the *missing set* for that <OSD,PG>.
+ the set of all objects whose contents have not yet been updated to match
+ the log entries. Each OSD collates its own *missing set*, which is tracked
+ on a per-``<OSD,PG>`` basis.
*Authoritative History*
- a complete, and fully ordered set of operations that, if
- performed, would bring an OSD's copy of a Placement Group
- up to date.
+ a complete and fully-ordered set of operations that, if performed, would
+ bring an OSD's copy of a Placement Group up to date.
*epoch*
- a (monotonically increasing) OSD map version number
+ a (monotonically increasing) OSD map version number.
*last epoch start*
- the last epoch at which all nodes in the *acting set*
- for a particular placement group agreed on an
- *authoritative history*. At this point, *peering* is
- deemed to have been successful.
+ the last epoch at which all nodes in the *acting set* for a given placement
+ group agreed on an *authoritative history*. As of that epoch, *peering* is
+ deemed to have been successful.
*up_thru*
before a primary can successfully complete the *peering* process,
it must inform a monitor that is alive through the current
OSD map epoch by having the monitor set its *up_thru* in the osd
- map. This helps peering ignore previous *acting sets* for which
+ map. This helps peering ignore previous *acting sets* for which
peering never completed after certain sequences of failures, such as
the second interval below:
@@ -107,10 +102,9 @@ Concepts
- *acting set* = [B] (B restarts, A does not)
*last epoch clean*
- the last epoch at which all nodes in the *acting set*
- for a particular placement group were completely
- up to date (both PG logs and object contents).
- At this point, *recovery* is deemed to have been
+ the last epoch at which all nodes in the *acting set* for a given placement
+ group were completely up to date (this includes both the PG's logs and the
+ PG's object contents). At this point, *recovery* is deemed to have been
completed.
Description of the Peering Process