author    Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-21 11:54:28 +0000
committer Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-21 11:54:28 +0000
commit    e6918187568dbd01842d8d1d2c808ce16a894239 (patch)
tree      64f88b554b444a49f656b6c656111a145cbbaa28 /doc/dev/crimson
parent    Initial commit. (diff)
Adding upstream version 18.2.2.
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'doc/dev/crimson')
-rw-r--r--  doc/dev/crimson/crimson.rst          480
-rw-r--r--  doc/dev/crimson/error-handling.rst   158
-rw-r--r--  doc/dev/crimson/index.rst             11
-rw-r--r--  doc/dev/crimson/osd.rst               54
-rw-r--r--  doc/dev/crimson/pipeline.rst          97
-rw-r--r--  doc/dev/crimson/poseidonstore.rst    586
6 files changed, 1386 insertions, 0 deletions
diff --git a/doc/dev/crimson/crimson.rst b/doc/dev/crimson/crimson.rst
new file mode 100644
index 000000000..cbc20b773
--- /dev/null
+++ b/doc/dev/crimson/crimson.rst
@@ -0,0 +1,480 @@
+=======
+crimson
+=======
+
+Crimson is the code name of ``crimson-osd``, which is the next
+generation ``ceph-osd``. It improves performance when using fast network
+and storage devices, employing state-of-the-art technologies including
+DPDK and SPDK. BlueStore continues to support HDDs and slower SSDs.
+Crimson aims to be backward compatible with the classic ``ceph-osd``.
+
+.. highlight:: console
+
+Building Crimson
+================
+
+Crimson is not enabled by default. Enable it at build time by running::
+
+ $ WITH_SEASTAR=true ./install-deps.sh
+ $ mkdir build && cd build
+ $ cmake -DWITH_SEASTAR=ON ..
+
+Please note that `ASan`_ is enabled by default if Crimson is built from source
+cloned using ``git``.
+
+.. _ASan: https://github.com/google/sanitizers/wiki/AddressSanitizer
+
+Testing Crimson with cephadm
+============================
+
+The Ceph CI/CD pipeline builds containers with
+``crimson-osd`` substituted for ``ceph-osd``.
+
+Once a branch at commit <sha1> has been built and is available in
+``shaman``, you can deploy it using the cephadm instructions outlined
+in :ref:`cephadm` with the following adaptations.
+
+First, while performing the initial bootstrap, use the ``--image`` flag to
+use a Crimson build:
+
+.. prompt:: bash #
+
+ cephadm --image quay.ceph.io/ceph-ci/ceph:<sha1>-crimson --allow-mismatched-release bootstrap ...
+
+You'll likely need to supply the ``--allow-mismatched-release`` flag to
+use a non-release branch.
+
+Additionally, prior to deploying OSDs, you'll need to enable Crimson and
+direct the default pools to be created as Crimson pools. From the cephadm shell run:
+
+.. prompt:: bash #
+
+ ceph config set global 'enable_experimental_unrecoverable_data_corrupting_features' crimson
+ ceph osd set-allow-crimson --yes-i-really-mean-it
+ ceph config set mon osd_pool_default_crimson true
+
+The first command enables the ``crimson`` experimental feature. Crimson
+is highly experimental, and malfunctions including crashes
+and data loss are to be expected.
+
+The second enables the ``allow_crimson`` OSDMap flag. The monitor will
+not allow ``crimson-osd`` to boot without that flag.
+
+The last causes pools to be created by default with the ``crimson`` flag.
+Crimson pools are restricted to operations supported by Crimson.
+``Crimson-osd`` won't instantiate PGs from non-Crimson pools.
+
+Running Crimson
+===============
+
+As you might expect, Crimson does not yet have as extensive a feature set as ``ceph-osd``.
+
+object store backend
+--------------------
+
+At the moment, ``crimson-osd`` offers both native and alienized object store
+backends. The native object store backends perform IO using the SeaStar reactor.
+They are:
+
+.. describe:: cyanstore
+
+ CyanStore is modeled after memstore in the classic OSD.
+
+.. describe:: seastore
+
+ Seastore is still under active development.
+
+The alienized object store backends are backed by a thread pool, which
+is a proxy of the alienstore adaptor running in Seastar. The proxy issues
+requests to object stores running in alien threads, i.e., worker threads not
+managed by the Seastar framework. They are:
+
+.. describe:: memstore
+
+ The memory backed object store
+
+.. describe:: bluestore
+
+ The object store used by the classic ``ceph-osd``
+
+daemonize
+---------
+
+Unlike ``ceph-osd``, ``crimson-osd`` does not daemonize itself even if the
+``daemonize`` option is enabled. In order to read this option, ``crimson-osd``
+needs to get its config sharded service ready, but this sharded service lives
+in the Seastar reactor. If we fork a child process and exit the parent after
+starting the Seastar engine, that will leave us with a single thread which is
+a replica of the thread that called `fork()`_. Tackling this problem in Crimson
+would unnecessarily complicate the code.
+
+Since supported GNU/Linux distributions use ``systemd``, which is able to
+daemonize the application, there is no need to daemonize ourselves.
+Those using sysvinit can use ``start-stop-daemon`` to daemonize ``crimson-osd``.
+If this does not work out, a helper utility may be devised.
+
+.. _fork(): http://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html
+
+logging
+-------
+
+``crimson-osd`` currently uses the logging utility offered by Seastar. See
+``src/common/dout.h`` for the mapping between Ceph logging levels and
+the severity levels in Seastar. For instance, messages sent to ``derr``
+are issued using ``logger::error()``, and messages with a debug level
+greater than ``20`` are issued using ``logger::trace()``.
+
++---------+---------+
+| ceph | seastar |
++---------+---------+
+| < 0 | error |
++---------+---------+
+| 0 | warn |
++---------+---------+
+| [1, 6) | info |
++---------+---------+
+| [6, 20] | debug |
++---------+---------+
+| > 20 | trace |
++---------+---------+
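+
+The mapping can be pictured as a small dispatch on the Ceph debug level. Below
+is a minimal illustrative sketch (not the actual ``dout.h`` code), assuming a
+``seastar::logger`` instance named ``logger``:
+
+.. code-block:: c++
+
+   #include <seastar/util/log.hh>
+
+   static seastar::logger logger{"osd"};
+
+   // Dispatch a message according to the table above.
+   inline void emit(int ceph_level, const char* msg) {
+     if (ceph_level < 0) {
+       logger.error("{}", msg);
+     } else if (ceph_level == 0) {
+       logger.warn("{}", msg);
+     } else if (ceph_level < 6) {
+       logger.info("{}", msg);
+     } else if (ceph_level <= 20) {
+       logger.debug("{}", msg);
+     } else {
+       logger.trace("{}", msg);
+     }
+   }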
+
+Note that ``crimson-osd``
+does not send log messages directly to a specified ``log_file``. It writes
+log messages to stdout and/or syslog. This behavior can be
+changed using the ``--log-to-stdout`` and ``--log-to-syslog`` command line
+options. By default, ``--log-to-stdout`` is enabled, and ``--log-to-syslog`` is disabled.
+
+
+vstart.sh
+---------
+
+The following options are handy when using ``vstart.sh``:
+
+``--crimson``
+ Start ``crimson-osd`` instead of ``ceph-osd``.
+
+``--nodaemon``
+ Do not daemonize the service.
+
+``--redirect-output``
+ Redirect the ``stdout`` and ``stderr`` to ``out/$type.$num.stdout``.
+
+``--osd-args``
+ Pass extra command line options to ``crimson-osd`` or ``ceph-osd``.
+ This is useful for passing Seastar options to ``crimson-osd``. For
+ example, one can supply ``--osd-args "--memory 2G"`` to set the amount of
+ memory to use. Please refer to the output of::
+
+ crimson-osd --help-seastar
+
+ for additional Seastar-specific command line options.
+
+``--cyanstore``
+ Use CyanStore as the object store backend.
+
+``--bluestore``
+ Use the alienized BlueStore as the object store backend. This is the default.
+
+``--memstore``
+ Use the alienized MemStore as the object store backend.
+
+``--seastore``
+ Use SeaStore as the back end object store.
+
+``--seastore-devs``
+ Specify the block device used by SeaStore.
+
+``--seastore-secondary-devs``
+ Optional. SeaStore supports multiple devices. Enable this feature by
+ passing the block device to this option.
+
+``--seastore-secondary-devs-type``
+  Optional. Specify the type of secondary devices. When the secondary
+  device is slower than the main device passed to ``--seastore-devs``, the cold
+  data on the faster device will be evicted to the slower devices over time.
+  Valid types include ``HDD``, ``SSD`` (default), ``ZNS``, and ``RANDOM_BLOCK_SSD``.
+  Note that secondary devices should not be faster than the main device.
+
+To start a cluster with a single Crimson node, run::
+
+ $ MGR=1 MON=1 OSD=1 MDS=0 RGW=0 ../src/vstart.sh -n -x \
+ --without-dashboard --cyanstore \
+ --crimson --redirect-output \
+ --osd-args "--memory 4G"
+
+Here we assign 4 GiB memory and a single thread running on core-0 to ``crimson-osd``.
+
+Another SeaStore example::
+
+ $ MGR=1 MON=1 OSD=1 MDS=0 RGW=0 ../src/vstart.sh -n -x \
+ --without-dashboard --seastore \
+ --crimson --redirect-output \
+ --seastore-devs /dev/sda \
+ --seastore-secondary-devs /dev/sdb \
+ --seastore-secondary-devs-type HDD
+
+Stop this ``vstart`` cluster by running::
+
+ $ ../src/stop.sh --crimson
+
+Metrics and Tracing
+===================
+
+Crimson offers three ways to report stats and metrics.
+
+pg stats reported to mgr
+------------------------
+
+Crimson collects the per-pg, per-pool, and per-osd stats in an `MPGStats`
+message which is sent to the Ceph Manager. Manager modules can query
+them using the `MgrModule.get()` method.
+
+asock command
+-------------
+
+An admin socket command is offered for dumping metrics::
+
+ $ ceph tell osd.0 dump_metrics
+ $ ceph tell osd.0 dump_metrics reactor_utilization
+
+Here `reactor_utilization` is an optional string allowing us to filter
+the dumped metrics by prefix.
+
+Prometheus text protocol
+------------------------
+
+The listening port and address can be configured using the
+``--prometheus_port`` and related command line options;
+see `Prometheus`_ for more details.
+
+.. _Prometheus: https://github.com/scylladb/seastar/blob/master/doc/prometheus.md
+
+Profiling Crimson
+=================
+
+fio
+---
+
+``crimson-store-nbd`` exposes configurable ``FuturizedStore`` internals as an
+NBD server for use with ``fio``.
+
+In order to use ``fio`` to test ``crimson-store-nbd``, perform the below steps.
+
+#. You will need to install ``libnbd``, and compile it into ``fio``
+
+ .. prompt:: bash $
+
+ apt-get install libnbd-dev
+ git clone git://git.kernel.dk/fio.git
+ cd fio
+ ./configure --enable-libnbd
+ make
+
+#. Build ``crimson-store-nbd``
+
+ .. prompt:: bash $
+
+ cd build
+ ninja crimson-store-nbd
+
+#. Run the ``crimson-store-nbd`` server with a block device. To test with a
+   block device, specify the path to the raw device, for example ``/dev/nvme1n1``,
+   in place of the file created below.
+
+ .. prompt:: bash $
+
+ export disk_img=/tmp/disk.img
+ export unix_socket=/tmp/store_nbd_socket.sock
+ rm -f $disk_img $unix_socket
+ truncate -s 512M $disk_img
+ ./bin/crimson-store-nbd \
+ --device-path $disk_img \
+ --smp 1 \
+ --mkfs true \
+ --type transaction_manager \
+ --uds-path ${unix_socket} &
+
+ Below are descriptions of these command line arguments:
+
+ ``--smp``
+ The number of CPU cores to use (Symmetric MultiProcessor)
+
+ ``--mkfs``
+ Initialize the device first.
+
+  ``--type``
+    The backend to use. If ``transaction_manager`` is specified, SeaStore's
+    ``TransactionManager`` and ``BlockSegmentManager`` are used to emulate a
+    block device. Otherwise, this option is used to choose a backend of
+    ``FuturizedStore``, where the whole "device" is divided into multiple
+    fixed-size objects whose size is specified by ``--object-size``. So, if
+    you are only interested in testing the lower-level implementation of
+    SeaStore, such as the logical address translation layer and garbage
+    collection, without the object store semantics, ``transaction_manager``
+    is the better choice.
+
+#. Create a ``fio`` job file named ``nbd.fio``
+
+ .. code:: ini
+
+ [global]
+ ioengine=nbd
+ uri=nbd+unix:///?socket=${unix_socket}
+ rw=randrw
+ time_based
+ runtime=120
+ group_reporting
+ iodepth=1
+ size=512M
+
+ [job0]
+ offset=0
+
+#. Test the Crimson object store, using the custom ``fio`` built just now
+
+ .. prompt:: bash $
+
+ ./fio nbd.fio
+
+CBT
+---
+We can use `cbt`_ for performance tests::
+
+ $ git checkout main
+ $ make crimson-osd
+ $ ../src/script/run-cbt.sh --cbt ~/dev/cbt -a /tmp/baseline ../src/test/crimson/cbt/radosbench_4K_read.yaml
+ $ git checkout yet-another-pr
+ $ make crimson-osd
+ $ ../src/script/run-cbt.sh --cbt ~/dev/cbt -a /tmp/yap ../src/test/crimson/cbt/radosbench_4K_read.yaml
+ $ ~/dev/cbt/compare.py -b /tmp/baseline -a /tmp/yap -v
+ 19:48:23 - INFO - cbt - prefill/gen8/0: bandwidth: (or (greater) (near 0.05)):: 0.183165/0.186155 => accepted
+ 19:48:23 - INFO - cbt - prefill/gen8/0: iops_avg: (or (greater) (near 0.05)):: 46.0/47.0 => accepted
+ 19:48:23 - WARNING - cbt - prefill/gen8/0: iops_stddev: (or (less) (near 0.05)):: 10.4403/6.65833 => rejected
+ 19:48:23 - INFO - cbt - prefill/gen8/0: latency_avg: (or (less) (near 0.05)):: 0.340868/0.333712 => accepted
+ 19:48:23 - INFO - cbt - prefill/gen8/1: bandwidth: (or (greater) (near 0.05)):: 0.190447/0.177619 => accepted
+ 19:48:23 - INFO - cbt - prefill/gen8/1: iops_avg: (or (greater) (near 0.05)):: 48.0/45.0 => accepted
+ 19:48:23 - INFO - cbt - prefill/gen8/1: iops_stddev: (or (less) (near 0.05)):: 6.1101/9.81495 => accepted
+ 19:48:23 - INFO - cbt - prefill/gen8/1: latency_avg: (or (less) (near 0.05)):: 0.325163/0.350251 => accepted
+ 19:48:23 - INFO - cbt - seq/gen8/0: bandwidth: (or (greater) (near 0.05)):: 1.24654/1.22336 => accepted
+ 19:48:23 - INFO - cbt - seq/gen8/0: iops_avg: (or (greater) (near 0.05)):: 319.0/313.0 => accepted
+ 19:48:23 - INFO - cbt - seq/gen8/0: iops_stddev: (or (less) (near 0.05)):: 0.0/0.0 => accepted
+ 19:48:23 - INFO - cbt - seq/gen8/0: latency_avg: (or (less) (near 0.05)):: 0.0497733/0.0509029 => accepted
+ 19:48:23 - INFO - cbt - seq/gen8/1: bandwidth: (or (greater) (near 0.05)):: 1.22717/1.11372 => accepted
+ 19:48:23 - INFO - cbt - seq/gen8/1: iops_avg: (or (greater) (near 0.05)):: 314.0/285.0 => accepted
+ 19:48:23 - INFO - cbt - seq/gen8/1: iops_stddev: (or (less) (near 0.05)):: 0.0/0.0 => accepted
+ 19:48:23 - INFO - cbt - seq/gen8/1: latency_avg: (or (less) (near 0.05)):: 0.0508262/0.0557337 => accepted
+ 19:48:23 - WARNING - cbt - 1 tests failed out of 16
+
+Here we compile and run the same test against two branches: ``main`` and ``yet-another-pr``.
+We then compare the results. Along with every test case, a set of rules is defined to check for
+performance regressions when comparing the sets of test results. If a possible regression is found, the rule and
+corresponding test results are highlighted.
+
+.. _cbt: https://github.com/ceph/cbt
+
+Hacking Crimson
+===============
+
+
+Seastar Documents
+-----------------
+
+See the `Seastar Tutorial <https://github.com/scylladb/seastar/blob/master/doc/tutorial.md>`_,
+or build a browsable version and start an HTTP server::
+
+ $ cd seastar
+ $ ./configure.py --mode debug
+ $ ninja -C build/debug docs
+ $ python3 -m http.server -d build/debug/doc/html
+
+You might want to install ``pandoc`` and other dependencies beforehand.
+
+Debugging Crimson
+=================
+
+Debugging with GDB
+------------------
+
+The `tips`_ for debugging Scylla also apply to Crimson.
+
+.. _tips: https://github.com/scylladb/scylla/blob/master/docs/dev/debugging.md#tips-and-tricks
+
+Human-readable backtraces with addr2line
+----------------------------------------
+
+When a Seastar application crashes, it leaves us with a backtrace of addresses, like::
+
+ Segmentation fault.
+ Backtrace:
+ 0x00000000108254aa
+ 0x00000000107f74b9
+ 0x00000000105366cc
+ 0x000000001053682c
+ 0x00000000105d2c2e
+ 0x0000000010629b96
+ 0x0000000010629c31
+ 0x00002a02ebd8272f
+ 0x00000000105d93ee
+ 0x00000000103eff59
+ 0x000000000d9c1d0a
+ /lib/x86_64-linux-gnu/libc.so.6+0x000000000002409a
+ 0x000000000d833ac9
+ Segmentation fault
+
+The ``seastar-addr2line`` utility provided by Seastar can be used to map these
+addresses to functions. The script expects input on ``stdin``,
+so we need to copy and paste the above addresses, then send EOF by typing
+``control-D`` in the terminal. One might use ``echo`` or ``cat`` instead::
+
+ $ ../src/seastar/scripts/seastar-addr2line -e bin/crimson-osd
+
+ 0x00000000108254aa
+ 0x00000000107f74b9
+ 0x00000000105366cc
+ 0x000000001053682c
+ 0x00000000105d2c2e
+ 0x0000000010629b96
+ 0x0000000010629c31
+ 0x00002a02ebd8272f
+ 0x00000000105d93ee
+ 0x00000000103eff59
+ 0x000000000d9c1d0a
+ 0x00000000108254aa
+ [Backtrace #0]
+ seastar::backtrace_buffer::append_backtrace() at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:1136
+ seastar::print_with_backtrace(seastar::backtrace_buffer&) at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:1157
+ seastar::print_with_backtrace(char const*) at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:1164
+ seastar::sigsegv_action() at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:5119
+ seastar::install_oneshot_signal_handler<11, &seastar::sigsegv_action>()::{lambda(int, siginfo_t*, void*)#1}::operator()(int, siginfo_t*, void*) const at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:5105
+ seastar::install_oneshot_signal_handler<11, &seastar::sigsegv_action>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:5101
+ ?? ??:0
+ seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config) at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:5418
+ seastar::app_template::run_deprecated(int, char**, std::function<void ()>&&) at /home/kefu/dev/ceph/build/../src/seastar/src/core/app-template.cc:173 (discriminator 5)
+ main at /home/kefu/dev/ceph/build/../src/crimson/osd/main.cc:131 (discriminator 1)
+
+Note that ``seastar-addr2line`` is able to extract addresses from
+its input, so you can also paste the log messages as below::
+
+ 2020-07-22T11:37:04.500 INFO:teuthology.orchestra.run.smithi061.stderr:Backtrace:
+ 2020-07-22T11:37:04.500 INFO:teuthology.orchestra.run.smithi061.stderr: 0x0000000000e78dbc
+ 2020-07-22T11:37:04.501 INFO:teuthology.orchestra.run.smithi061.stderr: 0x0000000000e3e7f0
+ 2020-07-22T11:37:04.501 INFO:teuthology.orchestra.run.smithi061.stderr: 0x0000000000e3e8b8
+ 2020-07-22T11:37:04.501 INFO:teuthology.orchestra.run.smithi061.stderr: 0x0000000000e3e985
+ 2020-07-22T11:37:04.501 INFO:teuthology.orchestra.run.smithi061.stderr: /lib64/libpthread.so.0+0x0000000000012dbf
+
+Unlike the classic ``ceph-osd``, Crimson does not print a human-readable backtrace when it
+handles fatal signals like `SIGSEGV` or `SIGABRT`, and resolving addresses is
+more complicated with a stripped binary. So instead of planting a signal handler for
+those signals into Crimson, we can use `script/ceph-debug-docker.sh` to map
+addresses in the backtrace::
+
+ # assuming you are under the source tree of ceph
+ $ ./src/script/ceph-debug-docker.sh --flavor crimson master:27e237c137c330ebb82627166927b7681b20d0aa centos:8
+ ....
+ [root@3deb50a8ad51 ~]# wget -q https://raw.githubusercontent.com/scylladb/seastar/master/scripts/seastar-addr2line
+ [root@3deb50a8ad51 ~]# dnf install -q -y file
+ [root@3deb50a8ad51 ~]# python3 seastar-addr2line -e /usr/bin/crimson-osd
+ # paste the backtrace here
diff --git a/doc/dev/crimson/error-handling.rst b/doc/dev/crimson/error-handling.rst
new file mode 100644
index 000000000..185868e70
--- /dev/null
+++ b/doc/dev/crimson/error-handling.rst
@@ -0,0 +1,158 @@
+==============
+error handling
+==============
+
+
+In Seastar, a ``future`` represents a value not yet available but that can become
+available later. A ``future`` can be in one of the following states:
+
+* unavailable: the value is not available yet,
+* available: the value is available,
+* failed: an exception was thrown when computing the value. This exception has
+  been captured and stored in the ``future`` instance via ``std::exception_ptr``.
+
+In the last case, the exception can be processed using ``future::handle_exception()`` or
+``future::handle_exception_type()``. Seastar even provides ``future::or_terminate()`` to
+terminate the program if the future fails.
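+
+As a minimal sketch of this vanilla Seastar approach (assuming a hypothetical
+``might_fail()`` function that may fail with ``std::runtime_error``), an
+expected exception can be intercepted and replaced with a fallback value:
+
+.. code-block:: c++
+
+   #include <seastar/core/future.hh>
+   #include <stdexcept>
+
+   seastar::future<int> might_fail();
+
+   seastar::future<int> recovered() {
+     return might_fail().handle_exception_type(
+       [] (const std::runtime_error&) {
+         // an expected failure: substitute a fallback value instead of rethrowing
+         return 0;
+       });
+   }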
+
+But in Crimson, quite a few errors are not serious enough to fail the program entirely.
+For instance, if we try to look up an object by its object id, that operation could
+fail because the object does not exist or is corrupted, and we need to recover the object
+to fulfill the request instead of terminating the process.
+
+In other words, these errors are expected. Moreover, the performance of the unhappy path
+should also be on par with that of the happy path. Also, we want to have a way to ensure
+that all expected errors are handled. It should be something like the static analysis
+performed by the compiler, which emits a warning if any enum value is not handled in a
+``switch-case`` statement.
+
+Unfortunately, ``seastar::future`` is not able to satisfy these two requirements.
+
+* Seastar requires re-throwing an exception to dispatch between different types of
+  exceptions. This is neither performant nor scalable, as locking in the language
+  runtime can occur.
+* Seastar does not encode the expected exception type in the type of the returned
+ ``seastar::future``. Only the type of the value is encoded. This imposes huge
+ mental load on programmers as ensuring that all intended errors are indeed handled
+ requires manual code audit.
+
+.. highlight:: c++
+
+So, "errorator" is created. It is a wrapper around the vanilla ``seastar::future``.
+It addresses the performance and scalability issues while embedding the information
+about all expected types-of-errors to the type-of-future.::
+
+ using ertr = crimson::errorator<crimson::ct_error::enoent,
+ crimson::ct_error::einval>;
+
+In the above example we define an errorator that allows for two error types:
+
+* ``crimson::ct_error::enoent`` and
+* ``crimson::ct_error::einval``.
+
+These (and other ones in the ``crimson::ct_error`` namespace) are basically
+unthrowable wrappers over ``std::error_code`` to exclude accidental throwing
+and ensure signaling errors in a way that enables compile-time checking.
+
+The most fundamental thing in an errorator is a descendant of ``seastar::future``
+which can be used as e.g. function's return type::
+
+ static ertr::future<int> foo(int bar) {
+ if (bar == 42) {
+ return crimson::ct_error::einval::make();
+ } else {
+ return ertr::make_ready_future(bar);
+ }
+ }
+
+It's worth noting that returning an error that is not a part of the errorator's error set
+would result in a compile-time error::
+
+ static ertr::future<int> foo(int bar) {
+ // Oops, input_output_error is not allowed in `ertr`. static_assert() will
+ // terminate the compilation. This behaviour is absolutely fundamental for
+ // callers -- to figure out about all possible errors they need to worry
+ // about is enough to just take a look on the function's signature; reading
+ // through its implementation is not necessary anymore!
+ return crimson::ct_error::input_output_error::make();
+ }
+
+The errorator concept goes further. It not only provides callers with the information
+about all potential errors embedded in the function's type; it also ensures at the call
+site that all these errors are handled. As the reader probably knows, the main method
+in ``seastar::future`` is ``then()``. On an errorated future it is available, but only if the errorator's
+error set is empty (literally: ``errorator<>::future``); otherwise callers have
+to use ``safe_then()`` instead::
+
+  seastar::future<> baz() {
+    return foo(42).safe_then(
+      [] (const int bar) {
+        std::cout << "the optimistic path! got bar=" << bar << std::endl;
+        return ertr::now();
+      },
+      ertr::all_same_way([] (const std::error_code& err) {
+        // handling errors removes them from errorator's error set
+        std::cout << "the error path! got err=" << err << std::endl;
+        return ertr::now();
+      })).then([] {
+ // as all errors have been handled, errorator's error set became
+ // empty and the future instance returned from `safe_then()` has
+ // `then()` available!
+ return seastar::now();
+ });
+ }
+
+In the above example ``ertr::all_same_way`` has been used to handle all errors in the same
+manner. This is not obligatory -- a caller can handle each of them separately. Moreover,
+it can provide a handler for only a subset of errors. The price for that is the availability
+of ``then()``::
+
+ using einval_ertr = crimson::errorator<crimson::ct_error::einval>;
+
+  // we can't return seastar::future<> (aka errorator<>::future<>) here, as
+  // handling at this level deals only with enoent, leaving einval without a
+  // handler. Handling it becomes a responsibility of the caller of `baz()`.
+ einval_ertr::future<> baz() {
+ return foo(42).safe_then(
+ [] (const int bar) {
+        std::cout << "the optimistic path! got bar=" << bar << std::endl;
+ return ertr::now();
+ },
+ // provide a handler only for crimson::ct_error::enoent.
+ // crimson::ct_error::einval stays unhandled!
+ crimson::ct_error::enoent::handle([] {
+ std::cout << "the enoent error path!" << std::endl;
+ return ertr::now();
+ }));
+ // .safe_then() above returned `errorator<crimson::ct_error::einval>::future<>`
+ // which lacks `then()`.
+ }
+
+That is, handling errors removes them from the errorated future's error set. This works
+in the opposite direction too -- returning new errors in ``safe_then()`` appends them
+to the error set. Of course, this set must be compliant with the error set in the ``baz()``
+signature::
+
+ using broader_ertr = crimson::errorator<crimson::ct_error::enoent,
+ crimson::ct_error::einval,
+ crimson::ct_error::input_output_error>;
+
+ broader_ertr::future<> baz() {
+ return foo(42).safe_then(
+ [] (const int bar) {
+ std::cout << "oops, the optimistic path generates a new error!";
+ return crimson::ct_error::input_output_error::make();
+ },
+      // we have a special handler to delegate the handling up. For convenience,
+      // the same behaviour is available as a single-argument variant of
+      // `safe_then()`.
+ ertr::pass_further{});
+ }
+
+As can be seen, handling and signaling errors in ``safe_then()`` is basically
+an operation on the error set, checked at compile time.
+
+More details can be found in `the slides from ceph::errorator<> throw/catch-free,
+compile time-checked exceptions for seastar::future<>
+<https://www.slideshare.net/ScyllaDB/cepherrorator-throwcatchfree-compile-timechecked-exceptions-for-seastarfuture>`_
+presented at the Seastar Summit 2019.
diff --git a/doc/dev/crimson/index.rst b/doc/dev/crimson/index.rst
new file mode 100644
index 000000000..55f071825
--- /dev/null
+++ b/doc/dev/crimson/index.rst
@@ -0,0 +1,11 @@
+===============================
+Crimson developer documentation
+===============================
+
+.. rubric:: Contents
+
+.. toctree::
+ :glob:
+
+ *
+
diff --git a/doc/dev/crimson/osd.rst b/doc/dev/crimson/osd.rst
new file mode 100644
index 000000000..f7f132b3f
--- /dev/null
+++ b/doc/dev/crimson/osd.rst
@@ -0,0 +1,54 @@
+osd
+===
+
+.. graphviz::
+
+ digraph osd {
+ node [shape = doublecircle]; "start" "end";
+ node [shape = circle];
+ start -> preboot;
+ waiting_for_healthy [label = "waiting\nfor\nhealthy"];
+ waiting_for_healthy -> waiting_for_healthy [label = "tick"];
+ waiting_for_healthy -> preboot [label = "i am healthy!"];
+ preboot -> booting [label = "send(MOSDBoot)"];
+ booting -> active [label = "recv(osdmap<up>)"];
+ active -> prestop [label = "stop()"];
+ active -> preboot [label = "recv(osdmap<down>)"];
+ active -> end [label = "kill(SIGINT)"];
+ active -> waiting_for_healthy [label = "i am unhealthy!"]
+ prestop -> end [label = "recv(osdmap<down>)"];
+ }
+
+.. describe:: waiting_for_healthy
+
+   If an OSD daemon is able to connect to its heartbeat peers, and its own
+   internal heartbeat does not fail, it is considered healthy. Otherwise, it
+   puts itself in the `waiting_for_healthy` state, and checks its own
+   reachability and internal heartbeat periodically.
+
+.. describe:: preboot
+
+ OSD sends an `MOSDBoot` message to the connected monitor to inform the
+ cluster that it's ready to serve, so that the quorum can mark it `up`
+ in the osdmap.
+
+.. describe:: booting
+
+ Before being marked as `up`, an OSD has to stay in its `booting` state.
+
+.. describe:: active
+
+   Upon receiving an osdmap marking the OSD as `up`, it transitions to the
+   `active` state. After that, it is entitled to do its business. However, the
+   OSD service can be fully stopped or suspended for various reasons. For
+   instance, the OSD service can be stopped manually by the administrator, or
+   marked `stop` in the osdmap. If any of its IP addresses does not match the
+   corresponding one configured in the osdmap, it transitions back to `preboot`
+   if it considers itself healthy.
+
+.. describe:: prestop
+
+   The OSD transitions to `prestop` unconditionally upon a `stop` request.
+   But before bidding us farewell, it tries to get an acknowledgement from
+   the monitor by sending an `MOSDMarkMeDown` message and waiting for a response
+   in the form of an updated osdmap or another `MOSDMarkMeDown` message.
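+
+As a summary of the diagram and the state descriptions above, the transitions
+can be sketched as a small function (illustrative only; neither the types nor
+the event names are crimson's actual code):
+
+.. code-block:: c++
+
+   enum class osd_state { waiting_for_healthy, preboot, booting,
+                          active, prestop, stopped };
+   enum class osd_event { healthy, unhealthy, sent_boot, map_marks_up,
+                          map_marks_down, stop_requested, sigint };
+
+   osd_state next(osd_state s, osd_event e) {
+     switch (s) {
+     case osd_state::waiting_for_healthy:
+       return e == osd_event::healthy ? osd_state::preboot : s;
+     case osd_state::preboot:
+       return e == osd_event::sent_boot ? osd_state::booting : s;
+     case osd_state::booting:
+       return e == osd_event::map_marks_up ? osd_state::active : s;
+     case osd_state::active:
+       if (e == osd_event::map_marks_down) return osd_state::preboot;
+       if (e == osd_event::unhealthy)      return osd_state::waiting_for_healthy;
+       if (e == osd_event::stop_requested) return osd_state::prestop;
+       if (e == osd_event::sigint)         return osd_state::stopped;
+       return s;
+     case osd_state::prestop:
+       return e == osd_event::map_marks_down ? osd_state::stopped : s;
+     default:
+       return s;
+     }
+   }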
diff --git a/doc/dev/crimson/pipeline.rst b/doc/dev/crimson/pipeline.rst
new file mode 100644
index 000000000..e9115c6d7
--- /dev/null
+++ b/doc/dev/crimson/pipeline.rst
@@ -0,0 +1,97 @@
+==============================
+The ``ClientRequest`` pipeline
+==============================
+
+In crimson, exactly like in the classical OSD, a client request has data and
+ordering dependencies which must be satisfied before its processing (actually,
+a particular phase of it) can begin. As one of the goals behind crimson is to
+preserve compatibility with the existing OSD incarnation, the same semantics
+must be assured. An obvious example of such a data dependency is the fact that
+an OSD needs to have a version of the OSDMap that matches the one used by the client
+(``Message::get_min_epoch()``).
+
+If a dependency is not satisfied, the processing stops. It is crucial to note that
+the same must happen to all other requests that are sequenced-after (due to
+their ordering requirements).
+
+There are a few cases when the blocking of a client request can happen.
+
+
+  ``ClientRequest::ConnectionPipeline::await_map``
+    wait until a particular OSDMap version is available at the OSD level
+  ``ClientRequest::ConnectionPipeline::get_pg``
+    wait until a particular PG becomes available on the OSD
+  ``ClientRequest::PGPipeline::await_map``
+    wait for a PG to be advanced to a particular epoch
+  ``ClientRequest::PGPipeline::wait_for_active``
+    wait for a PG to become *active* (i.e. have ``is_active()`` asserted)
+  ``ClientRequest::PGPipeline::recover_missing``
+    wait for an object to be recovered (i.e. leave the ``missing`` set)
+  ``ClientRequest::PGPipeline::get_obc``
+    wait for an object to be available for locking. The ``obc`` will be locked
+    before this operation is allowed to continue
+  ``ClientRequest::PGPipeline::process``
+    wait if any other ``MOSDOp`` message is being handled against this PG
+
+At any moment, a ``ClientRequest`` being served should be in one and only one
+of the phases described above. Similarly, an object denoting a particular phase
+can host no more than a single ``ClientRequest`` at the same time. At a low level
+this is achieved with a combination of a barrier and an exclusive lock.
+Together they implement the semantics of a semaphore with a single slot for these
+exclusive phases.
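+
+Conceptually, each exclusive phase behaves like a single-slot Seastar
+semaphore. A minimal sketch (not the actual crimson pipeline primitives) could
+look like this:
+
+.. code-block:: c++
+
+   #include <seastar/core/future.hh>
+   #include <seastar/core/semaphore.hh>
+
+   struct exclusive_phase {
+     seastar::semaphore slot{1};  // a single slot => at most one occupant
+   };
+
+   // enter() resolves once the phase is free; the returned units release the
+   // slot automatically when dropped, i.e. when the request leaves the phase.
+   seastar::future<seastar::semaphore_units<>> enter(exclusive_phase& phase) {
+     return seastar::get_units(phase.slot, 1);
+   }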
+
+As the execution advances, a request enters the next phase and leaves the current one,
+freeing it for another ``ClientRequest`` instance. All these phases form a pipeline
+which assures that the order is preserved.
+
+These pipeline phases are divided into two ordering domains: ``ConnectionPipeline``
+and ``PGPipeline``. The former ensures order across a client connection while
+the latter does that across a PG. That is, requests originating from the same
+connection are executed in the same order as they were sent by the client.
+The same applies to the PG domain: when requests from multiple connections reach
+a PG, they are executed in the same order as they entered a first blocking phase
+of the ``PGPipeline``.
+
+Comparison with the classical OSD
+----------------------------------
+As the audience of this document are Ceph Developers, it seems reasonable to
+match the phases of crimson's ``ClientRequest`` pipeline with the blocking
+stages in the classical OSD. The names in the right column are names of
+containers (lists and maps) used to implement these stages. They are also
+already documented in the ``PG.h`` header.
+
++----------------------------------------+--------------------------------------+
+| crimson | ceph-osd waiting list |
++========================================+======================================+
+|``ConnectionPipeline::await_map`` | ``OSDShardPGSlot::waiting`` and |
+|``ConnectionPipeline::get_pg`` | ``OSDShardPGSlot::waiting_peering`` |
++----------------------------------------+--------------------------------------+
+|``PGPipeline::await_map`` | ``PG::waiting_for_map`` |
++----------------------------------------+--------------------------------------+
+|``PGPipeline::wait_for_active`` | ``PG::waiting_for_peered`` |
+| +--------------------------------------+
+| | ``PG::waiting_for_flush`` |
+| +--------------------------------------+
+| | ``PG::waiting_for_active`` |
++----------------------------------------+--------------------------------------+
+|To be done (``PG_STATE_LAGGY``) | ``PG::waiting_for_readable`` |
++----------------------------------------+--------------------------------------+
+|To be done | ``PG::waiting_for_scrub`` |
++----------------------------------------+--------------------------------------+
+|``PGPipeline::recover_missing`` | ``PG::waiting_for_unreadable_object``|
+| +--------------------------------------+
+| | ``PG::waiting_for_degraded_object`` |
++----------------------------------------+--------------------------------------+
+|To be done (proxying) | ``PG::waiting_for_blocked_object`` |
++----------------------------------------+--------------------------------------+
+|``PGPipeline::get_obc`` | *obc rwlocks* |
++----------------------------------------+--------------------------------------+
+|``PGPipeline::process`` | ``PG::lock`` (roughly) |
++----------------------------------------+--------------------------------------+
+
+
+As a final word, it is worth emphasizing that the ordering implementations
+in both the classical OSD and in crimson are stricter than the theoretical
+minimum required by the RADOS protocol. For instance, we could parallelize read
+operations targeting the same object at the price of extra complexity, but we
+don't -- simplicity has won.
diff --git a/doc/dev/crimson/poseidonstore.rst b/doc/dev/crimson/poseidonstore.rst
new file mode 100644
index 000000000..7c54c029a
--- /dev/null
+++ b/doc/dev/crimson/poseidonstore.rst
@@ -0,0 +1,586 @@
+===============
+ PoseidonStore
+===============
+
+Key concepts and goals
+======================
+
+* As one of the pluggable backend stores for Crimson, PoseidonStore targets only
+ high-end NVMe SSDs (not concerned with ZNS devices).
+* Designed entirely for low CPU consumption
+
+ - Hybrid update strategies for different data types (in-place, out-of-place) to
+ minimize CPU consumption by reducing host-side GC.
+ - Remove a black-box component like RocksDB and a file abstraction layer in BlueStore
+ to avoid unnecessary overheads (e.g., data copy and serialization/deserialization)
+  - Utilize NVMe features (atomic large write command, Atomic Write Unit Normal).
+    Make use of io_uring, the new kernel asynchronous I/O interface, to selectively use the
+    interrupt-driven mode for CPU efficiency (or polled mode for low latency).
+* Sharded data/processing model
+
+Background
+----------
+
+Both in-place and out-of-place update strategies have their pros and cons.
+
+* Log-structured store
+
+  A log-structured storage system is a typical example of the update-out-of-place approach.
+  It never modifies written data; writes always go to the end of the log, which enables I/O sequentialization.
+
+ * Pros
+
+ - Without a doubt, one sequential write is enough to store the data
+    - It naturally supports transactions (there is no overwrite, so the store can roll back
+      to the previous stable state)
+ - Flash friendly (it mitigates GC burden on SSDs)
+ * Cons
+
+ - There is host-side GC that induces overheads
+
+ - I/O amplification (host-side)
+ - More host-CPU consumption
+
+ - Slow metadata lookup
+ - Space overhead (live and unused data co-exist)
+
+* In-place update store
+
+ The update-in-place strategy has been used widely for conventional file systems such as ext4 and xfs.
+ Once a block has been placed in a given disk location, it doesn't move.
+ Thus, writes go to the corresponding location in the disk.
+
+ * Pros
+
+ - Less host-CPU consumption (No host-side GC is required)
+ - Fast lookup
+    - No additional space overhead as in a log-structured store, but there is internal fragmentation
+ * Cons
+
+    - More writes occur to record the data (metadata and data sections are separated)
+    - It cannot support transactions by itself. Some form of WAL is required to ensure update atomicity
+      in the general case
+    - Flash unfriendly (puts more burden on SSDs due to device-level GC)
+
+Motivation and Key idea
+-----------------------
+
+In modern distributed storage systems, a server node can be equipped with multiple
+NVMe storage devices. In fact, ten or more NVMe SSDs could be attached to a server.
+As a result, it is hard to achieve an NVMe SSD's full performance due to the limited CPU resources
+available in a server node. In such environments, the CPU tends to become a performance bottleneck.
+Thus, we should now focus on minimizing host-CPU consumption, which is the same as Crimson's objective.
+
+Towards an object store highly optimized for CPU consumption, three design choices have been made.
+
+* **PoseidonStore does not have a black-box component like RocksDB in BlueStore.**
+
+ Thus, it can avoid unnecessary data copy and serialization/deserialization overheads.
+ Moreover, we can remove an unnecessary file abstraction layer, which was required to run RocksDB.
+ Object data and metadata is now directly mapped to the disk blocks.
+ Eliminating all these overheads will reduce CPU consumption (e.g., pre-allocation, NVME atomic feature).
+
+* **PoseidonStore uses hybrid update strategies for different data sizes, similar to BlueStore.**
+
+ As we discussed, both in-place and out-of-place update strategies have their pros and cons.
+ Since CPU is only bottlenecked under small I/O workloads, we chose update-in-place for small I/Os to minimize CPU consumption
+  Since the CPU is only bottlenecked under small I/O workloads, we chose update-in-place for small I/Os to minimize CPU consumption
+  while choosing update-out-of-place for large I/Os to avoid double writes. Double write for small data may be better than host-GC overhead
+  in terms of CPU consumption in the long run: although it leaves GC entirely up to the SSDs, a high-end NVMe SSD has enough
+  power to handle the extra work, and SSD lifespan is not a practical concern these days (there is a sufficient program/erase
+  cycle limit [#f1]_). For large I/O workloads, on the other hand, the host can afford to perform host-side GC, and it can
+  garbage collect invalid objects more effectively when their size is large.
+* **PoseidonStore makes use of io_uring, the new kernel asynchronous I/O interface, to exploit interrupt-driven I/O.**
+
+ User-space driven I/O solutions like SPDK provide high I/O performance by avoiding syscalls and enabling zero-copy
+ access from the application. However, it does not support interrupt-driven I/O, which is only possible with kernel-space driven I/O.
+  Polling is good for low latency but bad for CPU efficiency. On the other hand, interrupts are good for CPU efficiency and bad for
+  low latency (though less so as the I/O size increases). Note that network acceleration solutions like DPDK also excessively consume
+ CPU resources for polling. Using polling both for network and storage processing aggravates CPU consumption.
+ Since network is typically much faster and has a higher priority than storage, polling should be applied only to network processing.
+
+Observation
+-----------
+
+Two data types in Ceph
+
+* Data (object data)
+
+ - The cost of double write is high
+ - The best method to store this data is in-place update
+
+    - At least two operations are required to store the data: 1) the data itself and 2) the location of
+      the data. Nevertheless, a constant number of operations is better than out-of-place updates
+      even if it aggravates WAF (write amplification) in SSDs
+
+* Metadata or small data (e.g., object_info_t, snapset, pg_log, and collection)
+
+ - Multiple small-sized metadata entries for an object
+  - The best solution to store this data is WAL + cache
+
+    - The efficient way to store metadata is to merge all metadata related to the data
+      and store it through a single write operation, even though it requires a background
+      flush to update the data partition
+
+
+Design
+======
+.. ditaa::
+
+ +-WAL partition-|----------------------Data partition-------------------------------+
+ | Sharded partition |
+ +-----------------------------------------------------------------------------------+
+ | WAL -> | | Super block | Freelist info | Onode radix tree info| Data blocks |
+ +-----------------------------------------------------------------------------------+
+ | Sharded partition 2
+ +-----------------------------------------------------------------------------------+
+ | WAL -> | | Super block | Freelist info | Onode radix tree info| Data blocks |
+ +-----------------------------------------------------------------------------------+
+ | Sharded partition N
+ +-----------------------------------------------------------------------------------+
+ | WAL -> | | Super block | Freelist info | Onode radix tree info| Data blocks |
+ +-----------------------------------------------------------------------------------+
+ | Global information (in reverse order)
+ +-----------------------------------------------------------------------------------+
+ | Global WAL -> | | SB | Freelist | |
+ +-----------------------------------------------------------------------------------+
+
+
+* WAL
+
+ - Log, metadata and small data are stored in the WAL partition
+ - Space within the WAL partition is continually reused in a circular manner
+ - Flush data to trim WAL as necessary
+* Disk layout
+
+  - The data blocks section holds both metadata blocks and data blocks
+ - Freelist manages the root of free space B+tree
+ - Super block contains management info for a data partition
+ - Onode radix tree info contains the root of onode radix tree
+
+
+I/O procedure
+-------------
+* Write
+
+ For incoming writes, data is handled differently depending on the request size;
+ data is either written twice (WAL) or written in a log-structured manner.
+
+ #. If Request Size ≤ Threshold (similar to minimum allocation size in BlueStore)
+
+ Write data and metadata to [WAL] —flush—> Write them to [Data section (in-place)] and
+ [Metadata section], respectively.
+
+ Since the CPU becomes the bottleneck for small I/O workloads, in-place update scheme is used.
+ Double write for small data may be better than host-GC overhead in terms of CPU consumption
+ in the long run
+ #. Else if Request Size > Threshold
+
+ Append data to [Data section (log-structure)] —> Write the corresponding metadata to [WAL]
+ —flush—> Write the metadata to [Metadata section]
+
+      For large I/O workloads, the host can afford to perform host-side GC.
+      Also, the host can garbage collect invalid objects more effectively when their size is large.
+
+      Note that the threshold can be configured to a very large number so that only scenario (1) occurs.
+      With this design, we can control the overall I/O procedure with the optimizations for crimson
+      as described above. A conceptual sketch of this size-based decision appears at the end of this section.
+
+ * Detailed flow
+
+     We make use of an NVMe write command which provides atomicity guarantees (Atomic Write Unit Power Fail).
+     For example, 512 KB of data can be atomically written at once without fsync().
+
+ * stage 1
+
+ - if the data is small
+ WAL (written) --> | TxBegin A | Log Entry | TxEnd A |
+       Append a log entry that contains pg_log, snapset, object_info_t and block allocation
+       using an NVMe atomic write command on the WAL
+ - if the data is large
+ Data partition (written) --> | Data blocks |
+ * stage 2
+
+ - if the data is small
+ No need.
+ - if the data is large
+ Then, append the metadata to WAL.
+ WAL --> | TxBegin A | Log Entry | TxEnd A |
+
+* Read
+
+  - Use the cached object metadata to find out the data location
+  - If not cached, we need to search the WAL after the checkpoint and the object meta partition to find the
+    latest metadata
+
+* Flush (WAL --> Data partition)
+
+  - Flush WAL entries that have been committed. There are two triggering conditions
+    (1. the WAL is close to full, 2. a flush signal is received).
+    We can mitigate the overhead of frequent flushes via batch processing, but this
+    delays completion.
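+
+The size-based write decision described above can be sketched as follows. The
+helper names are illustrative only (they are not PoseidonStore APIs); each hook
+is assumed to issue the corresponding NVMe atomic write or append internally:
+
+.. code-block:: c++
+
+   #include <cstddef>
+
+   struct write_txn;                      // illustrative transaction handle
+
+   void wal_append(write_txn&);           // TxBegin | Log Entry | TxEnd (data + metadata)
+   void data_append(write_txn&);          // log-structured data blocks
+   void wal_append_metadata(write_txn&);  // metadata-only WAL record
+
+   void submit_write(write_txn& txn, std::size_t request_size,
+                     std::size_t threshold) {
+     if (request_size <= threshold) {
+       // stage 1 only: data and metadata are logged atomically in the WAL;
+       // the in-place update of the data/metadata sections happens at flush time
+       wal_append(txn);
+     } else {
+       // stage 1: append data out-of-place; stage 2: log the metadata in the WAL
+       data_append(txn);
+       wal_append_metadata(txn);
+     }
+   }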
+
+
+Crash consistency
+------------------
+
+* Large case
+
+ #. Crash occurs right after writing Data blocks
+
+ - Data partition --> | Data blocks |
+     - We don't need to worry about this case. The data has not been allocated yet; the blocks will be reused.
+ #. Crash occurs right after WAL
+
+ - Data partition --> | Data blocks |
+ - WAL --> | TxBegin A | Log Entry | TxEnd A |
+ - Write procedure is completed, so there is no data loss or inconsistent state
+
+* Small case
+
+ #. Crash occurs right after writing WAL
+
+ - WAL --> | TxBegin A | Log Entry| TxEnd A |
+ - All data has been written
+
+
+Comparison
+----------
+
+* Best case (pre-allocation)
+
+  - Only writes to the WAL and the data partition are needed, without updating object metadata (for the location).
+* Worst case
+
+  - At least three additional writes are required: to the WAL, the object metadata, and the data blocks.
+  - If the flush from the WAL to the data partition occurs frequently, the radix tree onode structure needs to be
+    updated many times. To minimize such overhead, we can make use of batch processing to minimize updates on the tree
+    (the data related to an object has locality because it will share the same parent node, so updates can be minimized)
+
+* The WAL needs to be flushed when it is close to full or when a flush signal is received.
+
+  - The premise behind this design is that the OSD can manage the latest metadata as a single copy, so
+    appended entries do not need to be read back
+* Neither the best nor the worst case produces severe I/O amplification (it produces I/Os, but the I/O rate is constant),
+  unlike an LSM-tree DB (the proposed design is similar to an LSM-tree with only level-0)
+
+
+Detailed Design
+===============
+
+* Onode lookup
+
+ * Radix tree
+    Our design is entirely based on a prefix tree. Ceph already makes use of the characteristics of the OID's prefix to split or search
+    for an OID (e.g., pool id + hash + oid). So, a prefix tree fits well for storing and searching objects. Our scheme is designed
+    to look up the prefix tree efficiently.
+
+ * Sharded partition
+    A few bits (the leftmost bits of the hash) of the OID determine the sharded partition where the object is located.
+    For example, if the number of partitions is configured as four, the entire space of the hash in hobject_t
+    can be divided into four domains (0x0xxx ~ 0x3xxx, 0x4xxx ~ 0x7xxx, 0x8xxx ~ 0xBxxx and 0xCxxx ~ 0xFxxx).
+
+ * Ondisk onode
+
+ .. code-block:: c
+
+       // schematic only: extent_tree, btree and map stand for the on-disk
+       // structures referenced from the onode
+       struct onode {
+         extent_tree block_maps;  // data extents
+         btree omaps;             // root of the omap B+tree
+         map xattrs;              // extended attributes
+       };
+
+    The onode contains the radix tree nodes for lookup, which means we can search for objects using the tree node information in the onode.
+    Also, if the data size is small, the onode can embed the data and xattrs.
+    The onode is of fixed size (256 or 512 bytes). On the other hand, omaps and block_maps are variable-length and referenced via pointers in the onode.
+
+ .. ditaa::
+
+ +----------------+------------+--------+
+ | on\-disk onode | block_maps | omaps |
+ +----------+-----+------------+--------+
+ | ^ ^
+ | | |
+ +-----------+---------+
+
+
+ * Lookup
+    The location of the root of the onode tree is specified in the Onode radix tree info, so we can find out where an object
+    is located by using the root of the prefix tree. For example, the sharded partition is determined by the OID as described above.
+    Using the rest of the OID's bits and the radix tree, the lookup procedure finds the location of the onode.
+    The extent tree (block_maps) records where the data chunks are located, so we can finally figure out the data location.
+
+
+* Allocation
+
+ * Sharded partitions
+
+ The entire disk space is divided into several data chunks called sharded partition (SP).
+ Each SP has its own data structures to manage the partition.
+
+ * Data allocation
+
+    As we explained above, the management info (e.g., super block, freelist info, onode radix tree info) is pre-allocated
+    in each sharded partition. Given an OID, we can map any data in the data block section to the extent tree in the onode.
+    Blocks can be allocated by searching the free space tracking data structure (explained below).
+
+ ::
+
+ +-----------------------------------+
+ | onode radix tree root node block |
+ | (Per-SP Meta) |
+ | |
+ | # of records |
+ | left_sibling / right_sibling |
+ | +--------------------------------+|
+ | | keys[# of records] ||
+ | | +-----------------------------+||
+ | | | start onode ID |||
+ | | | ... |||
+ | | +-----------------------------+||
+ | +--------------------------------||
+ | +--------------------------------+|
+ | | ptrs[# of records] ||
+ | | +-----------------------------+||
+ | | | SP block number |||
+ | | | ... |||
+ | | +-----------------------------+||
+ | +--------------------------------+|
+ +-----------------------------------+
+
+ * Free space tracking
+    The free space is tracked on a per-SP basis. We can use an extent-based B+tree, as in XFS, for free space tracking.
+    The freelist info contains the root of the free space B+tree. The granularity is a data block in the data blocks section.
+    The data block is the smallest, fixed-size unit of data.
+
+ ::
+
+ +-----------------------------------+
+ | Free space B+tree root node block |
+ | (Per-SP Meta) |
+ | |
+ | # of records |
+ | left_sibling / right_sibling |
+ | +--------------------------------+|
+ | | keys[# of records] ||
+ | | +-----------------------------+||
+ | | | startblock / blockcount |||
+ | | | ... |||
+ | | +-----------------------------+||
+ | +--------------------------------||
+ | +--------------------------------+|
+ | | ptrs[# of records] ||
+ | | +-----------------------------+||
+ | | | SP block number |||
+ | | | ... |||
+ | | +-----------------------------+||
+ | +--------------------------------+|
+ +-----------------------------------+
+
+* Omap and xattr
+  In this design, omap and xattr data is tracked by a B+tree in the onode. The onode only holds the root node of the B+tree.
+  The root node contains entries which indicate where the omap keys exist.
+  So, once we know the onode, the omap can be found via the omap B+tree.
+
+* Fragmentation
+
+ - Internal fragmentation
+
+    We pack different types of data/metadata into a single block as much as possible to reduce internal fragmentation.
+    An extent-based B+tree may help reduce this further by allocating contiguous blocks that best fit the object
+
+ - External fragmentation
+
+    Frequent object creation/deletion may lead to external fragmentation.
+    In this case, we need cleaning work (GC-like) to address this.
+    For this, we are referring to NetApp's Continuous Segment Cleaning, which seems similar to SeaStore's approach:
+    *Countering Fragmentation in an Enterprise Storage System* (NetApp, ACM TOS, 2020)
+
+.. ditaa::
+
+
+ +---------------+-------------------+-------------+
+ | Freelist info | Onode radix tree | Data blocks +-------+
+ +---------------+---------+---------+-+-----------+ |
+ | | |
+ +--------------------+ | |
+ | | |
+ | OID | |
+ | | |
+ +---+---+ | |
+ | Root | | |
+ +---+---+ | |
+ | | |
+ v | |
+ /-----------------------------\ | |
+ | Radix tree | | v
+ +---------+---------+---------+ | /---------------\
+ | onode | ... | ... | | | Num Chunk |
+ +---------+---------+---------+ | | |
+ +--+ onode | ... | ... | | | <Offset, len> |
+ | +---------+---------+---------+ | | <Offset, len> +-------+
+ | | | ... | |
+ | | +---------------+ |
+ | | ^ |
+ | | | |
+ | | | |
+ | | | |
+ | /---------------\ /-------------\ | | v
+ +->| onode | | onode |<---+ | /------------+------------\
+ +---------------+ +-------------+ | | Block0 | Block1 |
+ | OID | | OID | | +------------+------------+
+ | Omaps | | Omaps | | | Data | Data |
+ | Data Extent | | Data Extent +-----------+ +------------+------------+
+ +---------------+ +-------------+
+
+WAL
+---
+Each SP has a WAL.
+The data written to the WAL consists of metadata updates, free space updates, and small data.
+Note that only data smaller than the predefined threshold needs to be written to the WAL.
+Larger data is written to unallocated free space and its onode's extent_tree is updated accordingly
+(also the on-disk extent tree). The WAL partition is statically allocated, separate from the pre-configured data partition.
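+
+A minimal sketch of the per-SP WAL bookkeeping implied above (illustrative
+only, not PoseidonStore code; wrap-around of individual entries is omitted):
+
+.. code-block:: c++
+
+   #include <cstdint>
+
+   struct wal_space {
+     uint64_t base;      // start offset of this SP's WAL partition on disk
+     uint64_t size;      // WAL partition size in bytes
+     uint64_t head = 0;  // next append position (logical, ever-increasing)
+     uint64_t tail = 0;  // oldest entry that has not been flushed yet
+
+     uint64_t used() const { return head - tail; }
+
+     // true when the entry fits; callers trigger a flush when it does not
+     bool can_append(uint64_t len) const { return used() + len <= size; }
+
+     // reserve space for an entry and return its on-disk offset
+     uint64_t append(uint64_t len) {
+       uint64_t off = base + (head % size);
+       head += len;
+       return off;
+     }
+
+     // called after committed entries have been applied to the data partition
+     void trim_to(uint64_t new_tail) { tail = new_tail; }
+   };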
+
+
+Partition and Reactor thread
+----------------------------
+In early-stage development, PoseidonStore will employ static allocation of partitions. The number of sharded partitions
+is fixed, and the size of each partition should also be configured before running the cluster.
+However, the number of partitions can grow as shown below; we leave this as future work.
+Also, each reactor thread has a static set of SPs.
+
+.. ditaa::
+
+ +------+------+-------------+------------------+
+ | SP 1 | SP N | --> <-- | global partition |
+ +------+------+-------------+------------------+
+
+
+
+Cache
+-----
+There are mainly two cache data structures: the onode cache and the block cache.
+They look like the following.
+
+#. Onode cache:
+ lru_map <OID, OnodeRef>;
+#. Block cache (data and omap):
+ Data cache --> lru_map <paddr, value>
+
+To fill the onode data structure, the target onode needs to be retrieved using the prefix tree.
+The block cache is used for caching block contents. For a transaction, all the updates to blocks
+(including object meta blocks and data blocks) are first performed in the in-memory block cache.
+After writing a transaction to the WAL, the dirty blocks are flushed to their respective locations in the
+respective partitions.
+PoseidonStore can configure the cache size for each type. A simple LRU eviction strategy can be used for both.
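+
+A minimal sketch of the ``lru_map`` used above (illustrative only, not
+PoseidonStore code) could look like this:
+
+.. code-block:: c++
+
+   #include <cstddef>
+   #include <cstdint>
+   #include <list>
+   #include <memory>
+   #include <unordered_map>
+   #include <utility>
+   #include <vector>
+
+   template <typename K, typename V>
+   class lru_map {
+     std::list<std::pair<K, V>> order;  // most recently used entries at the front
+     std::unordered_map<K, typename std::list<std::pair<K, V>>::iterator> index;
+     std::size_t capacity;
+   public:
+     explicit lru_map(std::size_t cap) : capacity(cap) {}
+
+     V* find(const K& k) {
+       auto it = index.find(k);
+       if (it == index.end()) {
+         return nullptr;
+       }
+       order.splice(order.begin(), order, it->second);  // mark as most recent
+       return &it->second->second;
+     }
+
+     void insert(const K& k, V v) {
+       if (auto* hit = find(k)) {
+         *hit = std::move(v);
+         return;
+       }
+       order.emplace_front(k, std::move(v));
+       index[k] = order.begin();
+       if (order.size() > capacity) {                   // evict the LRU entry
+         index.erase(order.back().first);
+         order.pop_back();
+       }
+     }
+   };
+
+   struct Onode;
+   using OnodeRef = std::shared_ptr<Onode>;
+
+   // onode cache: OID (keyed here by its hash) -> OnodeRef
+   // block cache: physical block address (paddr) -> block contents
+   using onode_cache_t = lru_map<uint64_t, OnodeRef>;
+   using block_cache_t = lru_map<uint64_t, std::vector<char>>;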
+
+
+Sharded partitions (with cross-SP transaction)
+----------------------------------------------
+The entire disk space is divided into a number of chunks called sharded partitions (SPs).
+The prefix of the parent collection ID (the original collection ID before collection splitting, that is, hobject.hash)
+is used to map any collection to an SP.
+We can use BlueStore's approach for collection splitting, changing the number of significant bits for the collection prefixes.
+Because the prefix of the parent collection ID does not change even after collection splitting, the mapping between
+the collection and the SP is maintained.
+The number of SPs may be configured to match the number of CPUs allocated for each disk so that each SP can hold
+a number of objects large enough for cross-SP transactions not to occur.
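+
+A minimal sketch of this mapping (a hypothetical helper, not PoseidonStore
+code; ``shard_bits`` is assumed to be a configuration value, e.g. 2 bits for
+four SPs):
+
+.. code-block:: c++
+
+   #include <cstdint>
+
+   inline uint32_t sp_of_collection(uint32_t parent_hobject_hash,
+                                    unsigned shard_bits) {
+     // The leading bits do not change when a collection is split, so the
+     // collection-to-SP mapping remains stable across splits.
+     return parent_hobject_hash >> (32u - shard_bits);
+   }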
+
+If a cross-SP transaction is needed, we could use the global WAL. The coordinator thread (which mainly manages the global
+partition) handles a cross-SP transaction by acquiring the source SP and target SP locks before processing it.
+The source and target SPs are likely blocked while it runs.
+
+For load-unbalanced situations,
+PoseidonStore can create partitions to make full use of the entire space efficiently and provide load balancing.
+
+
+CoW/Clone
+---------
+As for CoW/Clone, a clone has its own onode like other normal objects.
+
+Although each clone has its own onode, data blocks should be shared between the original object and clones
+if there are no changes on them to minimize the space overhead.
+To do so, the reference count for the data blocks is needed to manage those shared data blocks.
+
+To deal with data blocks which have a reference count, PoseidonStore makes use of a shared_blob,
+which maintains the referenced data blocks.
+
+As shown in the figure below,
+the shared_blob tracks the data blocks shared between other onodes by using a reference count.
+The shared_blobs are managed by the shared_blob_list in the superblock.
+
+
+.. ditaa::
+
+
+ /----------\ /----------\
+ | Object A | | Object B |
+ +----------+ +----------+
+ | Extent | | Extent |
+ +---+--+---+ +--+----+--+
+ | | | |
+ | | +----------+ |
+ | | | |
+ | +---------------+ |
+ | | | |
+ v v v v
+ +---------------+---------------+
+ | Data block 1 | Data block 2 |
+ +-------+-------+------+--------+
+ | |
+ v v
+ /---------------+---------------\
+ | shared_blob 1 | shared_blob 2 |
+ +---------------+---------------+ shared_blob_list
+ | refcount | refcount |
+ +---------------+---------------+
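+
+A minimal sketch of the reference-counted structures pictured above
+(illustrative only, not PoseidonStore's on-disk format):
+
+.. code-block:: c++
+
+   #include <cstdint>
+   #include <map>
+   #include <vector>
+
+   struct shared_blob {
+     uint32_t refcount = 0;         // number of onodes referencing these blocks
+     std::vector<uint64_t> blocks;  // physical addresses of the shared data blocks
+   };
+
+   // kept in the superblock: blob id -> shared_blob
+   using shared_blob_list = std::map<uint64_t, shared_blob>;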
+
+Plans
+=====
+
+All PRs should contain unit tests to verify their minimal functionality.
+
+* WAL and block cache implementation
+
+ As a first step, we are going to build the WAL including the I/O procedure to read/write the WAL.
+ With WAL development, the block cache needs to be developed together.
+ Besides, we are going to add an I/O library to read/write from/to the NVMe storage to
+ utilize NVMe feature and the asynchronous interface.
+
+* Radix tree and onode
+
+  First, submit a PR against this file with a more detailed on-disk layout and lookup strategy for the onode radix tree.
+  Follow up with an implementation based on the above design once the design PR is merged.
+  The second PR will be the implementation of the radix tree, which is the key structure for looking up
+  objects.
+
+* Extent tree
+
+  This PR adds the extent tree, which manages data blocks in the onode. We will build the extent tree and
+  demonstrate how it works when looking up an object.
+
+* B+tree for omap
+
+ We will put together a simple key/value interface for omap. This probably will be a separate PR.
+
+* CoW/Clone
+
+ To support CoW/Clone, shared_blob and shared_blob_list will be added.
+
+* Integration to Crimson as to I/O interfaces
+
+  At this stage, interfaces for interacting with Crimson such as queue_transaction(), read(), clone_range(), etc.
+  should work correctly.
+
+* Configuration
+
+ We will define Poseidon store configuration in detail.
+
+* Stress test environment and integration to teuthology
+
+ We will add stress tests and teuthology suites.
+
+.. rubric:: Footnotes
+
+.. [#f1] Stathis Maneas, Kaveh Mahdaviani, Tim Emami, Bianca Schroeder: A Study of SSD Reliability in Large Scale Enterprise Storage Deployments. FAST 2020: 137-149