From e6918187568dbd01842d8d1d2c808ce16a894239 Mon Sep 17 00:00:00 2001 From: Daniel Baumann Date: Sun, 21 Apr 2024 13:54:28 +0200 Subject: Adding upstream version 18.2.2. Signed-off-by: Daniel Baumann --- doc/dev/crimson/crimson.rst | 480 ++++++++++++++++++++++++++++++ doc/dev/crimson/error-handling.rst | 158 ++++++++++ doc/dev/crimson/index.rst | 11 + doc/dev/crimson/osd.rst | 54 ++++ doc/dev/crimson/pipeline.rst | 97 ++++++ doc/dev/crimson/poseidonstore.rst | 586 +++++++++++++++++++++++++++++++++++++ 6 files changed, 1386 insertions(+) create mode 100644 doc/dev/crimson/crimson.rst create mode 100644 doc/dev/crimson/error-handling.rst create mode 100644 doc/dev/crimson/index.rst create mode 100644 doc/dev/crimson/osd.rst create mode 100644 doc/dev/crimson/pipeline.rst create mode 100644 doc/dev/crimson/poseidonstore.rst (limited to 'doc/dev/crimson') diff --git a/doc/dev/crimson/crimson.rst b/doc/dev/crimson/crimson.rst new file mode 100644 index 000000000..cbc20b773 --- /dev/null +++ b/doc/dev/crimson/crimson.rst @@ -0,0 +1,480 @@ +======= +crimson +======= + +Crimson is the code name of ``crimson-osd``, which is the next +generation ``ceph-osd``. It improves performance when using fast network +and storage devices, employing state-of-the-art technologies including +DPDK and SPDK. BlueStore continues to support HDDs and slower SSDs. +Crimson aims to be backward compatible with the classic ``ceph-osd``. + +.. highlight:: console + +Building Crimson +================ + +Crimson is not enabled by default. Enable it at build time by running:: + + $ WITH_SEASTAR=true ./install-deps.sh + $ mkdir build && cd build + $ cmake -DWITH_SEASTAR=ON .. + +Please note, `ASan`_ is enabled by default if Crimson is built from a source +cloned using ``git``. + +.. _ASan: https://github.com/google/sanitizers/wiki/AddressSanitizer + +Testing crimson with cephadm +=============================== + +The Ceph CI/CD pipeline builds containers with +``crimson-osd`` subsitituted for ``ceph-osd``. + +Once a branch at commit has been built and is available in +``shaman``, you can deploy it using the cephadm instructions outlined +in :ref:`cephadm` with the following adaptations. + +First, while performing the initial bootstrap, use the ``--image`` flag to +use a Crimson build: + +.. prompt:: bash # + + cephadm --image quay.ceph.io/ceph-ci/ceph:-crimson --allow-mismatched-release bootstrap ... + +You'll likely need to supply the ``--allow-mismatched-release`` flag to +use a non-release branch. + +Additionally, prior to deploying OSDs, you'll need enable Crimson to +direct the default pools to be created as Crimson pools. From the cephadm shell run: + +.. prompt:: bash # + + ceph config set global 'enable_experimental_unrecoverable_data_corrupting_features' crimson + ceph osd set-allow-crimson --yes-i-really-mean-it + ceph config set mon osd_pool_default_crimson true + +The first command enables the ``crimson`` experimental feature. Crimson +is highly experimental, and malfunctions including crashes +and data loss are to be expected. + +The second enables the ``allow_crimson`` OSDMap flag. The monitor will +not allow ``crimson-osd`` to boot without that flag. + +The last causes pools to be created by default with the ``crimson`` flag. +Crimson pools are restricted to operations supported by Crimson. +``Crimson-osd`` won't instantiate PGs from non-Crimson pools. + +Running Crimson +=============== + +As you might expect, Crimson does not yet have as extensive a feature set as does ``ceph-osd``. 
+ +object store backend +-------------------- + +At the moment, ``crimson-osd`` offers both native and alienized object store +backends. The native object store backends perform IO using the SeaStar reactor. +They are: + +.. describe:: cyanstore + + CyanStore is modeled after memstore in the classic OSD. + +.. describe:: seastore + + Seastore is still under active development. + +The alienized object store backends are backed by a thread pool, which +is a proxy of the alienstore adaptor running in Seastar. The proxy issues +requests to object stores running in alien threads, i.e., worker threads not +managed by the Seastar framework. They are: + +.. describe:: memstore + + The memory backed object store + +.. describe:: bluestore + + The object store used by the classic ``ceph-osd`` + +daemonize +--------- + +Unlike ``ceph-osd``, ``crimson-osd`` does not daemonize itself even if the +``daemonize`` option is enabled. In order to read this option, ``crimson-osd`` +needs to ready its config sharded service, but this sharded service lives +in the Seastar reactor. If we fork a child process and exit the parent after +starting the Seastar engine, that will leave us with a single thread which is +a replica of the thread that called `fork()`_. Tackling this problem in Crimson +would unnecessarily complicate the code. + +Since supported GNU/Linux distributions use ``systemd``, which is able to +daemonize the application, there is no need to daemonize ourselves. +Those using sysvinit can use ``start-stop-daemon`` to daemonize ``crimson-osd``. +If this is does not work out, a helper utility may be devised. + +.. _fork(): http://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html + +logging +------- + +``Crimson-osd`` currently uses the logging utility offered by Seastar. See +``src/common/dout.h`` for the mapping between Ceph logging levels to +the severity levels in Seastar. For instance, messages sent to ``derr`` +will be issued using ``logger::error()``, and the messages with a debug level +greater than ``20`` will be issued using ``logger::trace()``. + ++---------+---------+ +| ceph | seastar | ++---------+---------+ +| < 0 | error | ++---------+---------+ +| 0 | warn | ++---------+---------+ +| [1, 6) | info | ++---------+---------+ +| [6, 20] | debug | ++---------+---------+ +| > 20 | trace | ++---------+---------+ + +Note that ``crimson-osd`` +does not send log messages directly to a specified ``log_file``. It writes +the logging messages to stdout and/or syslog. This behavior can be +changed using ``--log-to-stdout`` and ``--log-to-syslog`` command line +options. By default, ``log-to-stdout`` is enabled, and ``--log-to-syslog`` is disabled. + + +vstart.sh +--------- + +The following options aree handy when using ``vstart.sh``, + +``--crimson`` + Start ``crimson-osd`` instead of ``ceph-osd``. + +``--nodaemon`` + Do not daemonize the service. + +``--redirect-output`` + Redirect the ``stdout`` and ``stderr`` to ``out/$type.$num.stdout``. + +``--osd-args`` + Pass extra command line options to ``crimson-osd`` or ``ceph-osd``. + This is useful for passing Seastar options to ``crimson-osd``. For + example, one can supply ``--osd-args "--memory 2G"`` to set the amount of + memory to use. Please refer to the output of:: + + crimson-osd --help-seastar + + for additional Seastar-specific command line options. + +``--cyanstore`` + Use CyanStore as the object store backend. + +``--bluestore`` + Use the alienized BlueStore as the object store backend. This is the default. 
+ +``--memstore`` + Use the alienized MemStore as the object store backend. + +``--seastore`` + Use SeaStore as the back end object store. + +``--seastore-devs`` + Specify the block device used by SeaStore. + +``--seastore-secondary-devs`` + Optional. SeaStore supports multiple devices. Enable this feature by + passing the block device to this option. + +``--seastore-secondary-devs-type`` + Optional. Specify the type of secondary devices. When the secondary + device is slower than main device passed to ``--seastore-devs``, the cold + data in faster device will be evicted to the slower devices over time. + Valid types include ``HDD``, ``SSD``(default), ``ZNS``, and ``RANDOM_BLOCK_SSD`` + Note secondary devices should not be faster than the main device. + +``--seastore`` + Use SeaStore as the object store backend. + +To start a cluster with a single Crimson node, run:: + + $ MGR=1 MON=1 OSD=1 MDS=0 RGW=0 ../src/vstart.sh -n -x \ + --without-dashboard --cyanstore \ + --crimson --redirect-output \ + --osd-args "--memory 4G" + +Here we assign 4 GiB memory and a single thread running on core-0 to ``crimson-osd``. + +Another SeaStore example:: + + $ MGR=1 MON=1 OSD=1 MDS=0 RGW=0 ../src/vstart.sh -n -x \ + --without-dashboard --seastore \ + --crimson --redirect-output \ + --seastore-devs /dev/sda \ + --seastore-secondary-devs /dev/sdb \ + --seastore-secondary-devs-type HDD + +Stop this ``vstart`` cluster by running:: + + $ ../src/stop.sh --crimson + +Metrics and Tracing +=================== + +Crimson offers three ways to report stats and metrics. + +pg stats reported to mgr +------------------------ + +Crimson collects the per-pg, per-pool, and per-osd stats in a `MPGStats` +message which is sent to the Ceph Managers. Manager modules can query +them using the `MgrModule.get()` method. + +asock command +------------- + +An admin socket command is offered for dumping metrics:: + + $ ceph tell osd.0 dump_metrics + $ ceph tell osd.0 dump_metrics reactor_utilization + +Here `reactor_utilization` is an optional string allowing us to filter +the dumped metrics by prefix. + +Prometheus text protocol +------------------------ + +The listening port and address can be configured using the command line options of +`--prometheus_port` +see `Prometheus`_ for more details. + +.. _Prometheus: https://github.com/scylladb/seastar/blob/master/doc/prometheus.md + +Profiling Crimson +================= + +fio +--- + +``crimson-store-nbd`` exposes configurable ``FuturizedStore`` internals as an +NBD server for use with ``fio``. + +In order to use ``fio`` to test ``crimson-store-nbd``, perform the below steps. + +#. You will need to install ``libnbd``, and compile it into ``fio`` + + .. prompt:: bash $ + + apt-get install libnbd-dev + git clone git://git.kernel.dk/fio.git + cd fio + ./configure --enable-libnbd + make + +#. Build ``crimson-store-nbd`` + + .. prompt:: bash $ + + cd build + ninja crimson-store-nbd + +#. Run the ``crimson-store-nbd`` server with a block device. Specify + the path to the raw device, for example ``/dev/nvme1n1``, in place of the created + file for testing with a block device. + + .. 
prompt:: bash $ + + export disk_img=/tmp/disk.img + export unix_socket=/tmp/store_nbd_socket.sock + rm -f $disk_img $unix_socket + truncate -s 512M $disk_img + ./bin/crimson-store-nbd \ + --device-path $disk_img \ + --smp 1 \ + --mkfs true \ + --type transaction_manager \ + --uds-path ${unix_socket} & + + Below are descriptions of these command line arguments: + + ``--smp`` + The number of CPU cores to use (Symmetric MultiProcessor) + + ``--mkfs`` + Initialize the device first. + + ``--type`` + The back end to use. If ``transaction_manager`` is specified, SeaStore's + ``TransactionManager`` and ``BlockSegmentManager`` are used to emulate a + block device. Otherwise, this option is used to choose a backend of + ``FuturizedStore``, where the whole "device" is divided into multiple + fixed-size objects whose size is specified by ``--object-size``. So, if + you are only interested in testing the lower-level implementation of + SeaStore like logical address translation layer and garbage collection + without the object store semantics, ``transaction_manager`` would be a + better choice. + +#. Create a ``fio`` job file named ``nbd.fio`` + + .. code:: ini + + [global] + ioengine=nbd + uri=nbd+unix:///?socket=${unix_socket} + rw=randrw + time_based + runtime=120 + group_reporting + iodepth=1 + size=512M + + [job0] + offset=0 + +#. Test the Crimson object store, using the custom ``fio`` built just now + + .. prompt:: bash $ + + ./fio nbd.fio + +CBT +--- +We can use `cbt`_ for performance tests:: + + $ git checkout main + $ make crimson-osd + $ ../src/script/run-cbt.sh --cbt ~/dev/cbt -a /tmp/baseline ../src/test/crimson/cbt/radosbench_4K_read.yaml + $ git checkout yet-another-pr + $ make crimson-osd + $ ../src/script/run-cbt.sh --cbt ~/dev/cbt -a /tmp/yap ../src/test/crimson/cbt/radosbench_4K_read.yaml + $ ~/dev/cbt/compare.py -b /tmp/baseline -a /tmp/yap -v + 19:48:23 - INFO - cbt - prefill/gen8/0: bandwidth: (or (greater) (near 0.05)):: 0.183165/0.186155 => accepted + 19:48:23 - INFO - cbt - prefill/gen8/0: iops_avg: (or (greater) (near 0.05)):: 46.0/47.0 => accepted + 19:48:23 - WARNING - cbt - prefill/gen8/0: iops_stddev: (or (less) (near 0.05)):: 10.4403/6.65833 => rejected + 19:48:23 - INFO - cbt - prefill/gen8/0: latency_avg: (or (less) (near 0.05)):: 0.340868/0.333712 => accepted + 19:48:23 - INFO - cbt - prefill/gen8/1: bandwidth: (or (greater) (near 0.05)):: 0.190447/0.177619 => accepted + 19:48:23 - INFO - cbt - prefill/gen8/1: iops_avg: (or (greater) (near 0.05)):: 48.0/45.0 => accepted + 19:48:23 - INFO - cbt - prefill/gen8/1: iops_stddev: (or (less) (near 0.05)):: 6.1101/9.81495 => accepted + 19:48:23 - INFO - cbt - prefill/gen8/1: latency_avg: (or (less) (near 0.05)):: 0.325163/0.350251 => accepted + 19:48:23 - INFO - cbt - seq/gen8/0: bandwidth: (or (greater) (near 0.05)):: 1.24654/1.22336 => accepted + 19:48:23 - INFO - cbt - seq/gen8/0: iops_avg: (or (greater) (near 0.05)):: 319.0/313.0 => accepted + 19:48:23 - INFO - cbt - seq/gen8/0: iops_stddev: (or (less) (near 0.05)):: 0.0/0.0 => accepted + 19:48:23 - INFO - cbt - seq/gen8/0: latency_avg: (or (less) (near 0.05)):: 0.0497733/0.0509029 => accepted + 19:48:23 - INFO - cbt - seq/gen8/1: bandwidth: (or (greater) (near 0.05)):: 1.22717/1.11372 => accepted + 19:48:23 - INFO - cbt - seq/gen8/1: iops_avg: (or (greater) (near 0.05)):: 314.0/285.0 => accepted + 19:48:23 - INFO - cbt - seq/gen8/1: iops_stddev: (or (less) (near 0.05)):: 0.0/0.0 => accepted + 19:48:23 - INFO - cbt - seq/gen8/1: latency_avg: (or (less) (near 0.05)):: 
0.0508262/0.0557337 => accepted + 19:48:23 - WARNING - cbt - 1 tests failed out of 16 + +Here we compile and run the same test against two branches: ``main`` and ``yet-another-pr``. +We then compare the results. Along with every test case, a set of rules is defined to check for +performance regressions when comparing the sets of test results. If a possible regression is found, the rule and +corresponding test results are highlighted. + +.. _cbt: https://github.com/ceph/cbt + +Hacking Crimson +=============== + + +Seastar Documents +----------------- + +See `Seastar Tutorial `_ . +Or build a browsable version and start an HTTP server:: + + $ cd seastar + $ ./configure.py --mode debug + $ ninja -C build/debug docs + $ python3 -m http.server -d build/debug/doc/html + +You might want to install ``pandoc`` and other dependencies beforehand. + +Debugging Crimson +================= + +Debugging with GDB +------------------ + +The `tips`_ for debugging Scylla also apply to Crimson. + +.. _tips: https://github.com/scylladb/scylla/blob/master/docs/dev/debugging.md#tips-and-tricks + +Human-readable backtraces with addr2line +---------------------------------------- + +When a Seastar application crashes, it leaves us with a backtrace of addresses, like:: + + Segmentation fault. + Backtrace: + 0x00000000108254aa + 0x00000000107f74b9 + 0x00000000105366cc + 0x000000001053682c + 0x00000000105d2c2e + 0x0000000010629b96 + 0x0000000010629c31 + 0x00002a02ebd8272f + 0x00000000105d93ee + 0x00000000103eff59 + 0x000000000d9c1d0a + /lib/x86_64-linux-gnu/libc.so.6+0x000000000002409a + 0x000000000d833ac9 + Segmentation fault + +The ``seastar-addr2line`` utility provided by Seastar can be used to map these +addresses to functions. The script expects input on ``stdin``, +so we need to copy and paste the above addresses, then send EOF by inputting +``control-D`` in the terminal. One might use ``echo`` or ``cat`` instead`:: + + $ ../src/seastar/scripts/seastar-addr2line -e bin/crimson-osd + + 0x00000000108254aa + 0x00000000107f74b9 + 0x00000000105366cc + 0x000000001053682c + 0x00000000105d2c2e + 0x0000000010629b96 + 0x0000000010629c31 + 0x00002a02ebd8272f + 0x00000000105d93ee + 0x00000000103eff59 + 0x000000000d9c1d0a + 0x00000000108254aa + [Backtrace #0] + seastar::backtrace_buffer::append_backtrace() at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:1136 + seastar::print_with_backtrace(seastar::backtrace_buffer&) at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:1157 + seastar::print_with_backtrace(char const*) at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:1164 + seastar::sigsegv_action() at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:5119 + seastar::install_oneshot_signal_handler<11, &seastar::sigsegv_action>()::{lambda(int, siginfo_t*, void*)#1}::operator()(int, siginfo_t*, void*) const at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:5105 + seastar::install_oneshot_signal_handler<11, &seastar::sigsegv_action>()::{lambda(int, siginfo_t*, void*)#1}::_FUN(int, siginfo_t*, void*) at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:5101 + ?? 
??:0 + seastar::smp::configure(boost::program_options::variables_map, seastar::reactor_config) at /home/kefu/dev/ceph/build/../src/seastar/src/core/reactor.cc:5418 + seastar::app_template::run_deprecated(int, char**, std::function&&) at /home/kefu/dev/ceph/build/../src/seastar/src/core/app-template.cc:173 (discriminator 5) + main at /home/kefu/dev/ceph/build/../src/crimson/osd/main.cc:131 (discriminator 1) + +Note that ``seastar-addr2line`` is able to extract addresses from +its input, so you can also paste the log messages as below:: + + 2020-07-22T11:37:04.500 INFO:teuthology.orchestra.run.smithi061.stderr:Backtrace: + 2020-07-22T11:37:04.500 INFO:teuthology.orchestra.run.smithi061.stderr: 0x0000000000e78dbc + 2020-07-22T11:37:04.501 INFO:teuthology.orchestra.run.smithi061.stderr: 0x0000000000e3e7f0 + 2020-07-22T11:37:04.501 INFO:teuthology.orchestra.run.smithi061.stderr: 0x0000000000e3e8b8 + 2020-07-22T11:37:04.501 INFO:teuthology.orchestra.run.smithi061.stderr: 0x0000000000e3e985 + 2020-07-22T11:37:04.501 INFO:teuthology.orchestra.run.smithi061.stderr: /lib64/libpthread.so.0+0x0000000000012dbf + +Unlike the classic ``ceph-osd``, Crimson does not print a human-readable backtrace when it +handles fatal signals like `SIGSEGV` or `SIGABRT`. It is also more complicated +with a stripped binary. So instead of planting a signal handler for +those signals into Crimson, we can use `script/ceph-debug-docker.sh` to map +addresses in the backtrace:: + + # assuming you are under the source tree of ceph + $ ./src/script/ceph-debug-docker.sh --flavor crimson master:27e237c137c330ebb82627166927b7681b20d0aa centos:8 + .... + [root@3deb50a8ad51 ~]# wget -q https://raw.githubusercontent.com/scylladb/seastar/master/scripts/seastar-addr2line + [root@3deb50a8ad51 ~]# dnf install -q -y file + [root@3deb50a8ad51 ~]# python3 seastar-addr2line -e /usr/bin/crimson-osd + # paste the backtrace here diff --git a/doc/dev/crimson/error-handling.rst b/doc/dev/crimson/error-handling.rst new file mode 100644 index 000000000..185868e70 --- /dev/null +++ b/doc/dev/crimson/error-handling.rst @@ -0,0 +1,158 @@ +============== +error handling +============== + + +In Seastar, a ``future`` represents a value not yet available but that can become +available later. ``future`` can have one of following states: + +* unavailable: value is not available yet, +* value, +* failed: an exception was thrown when computing the value. This exception has + been captured and stored in the ``future`` instance via ``std::exception_ptr``. + +In the last case, the exception can be processed using ``future::handle_exception()`` or +``future::handle_exception_type()``. Seastar even provides ``future::or_terminate()`` to +terminate the program if the future fails. + +But in Crimson, quite a few errors are not serious enough to fail the program entirely. +For instance, if we try to look up an object by its object id, and that operation could +fail because the object does not exist or it is corrupted, we need to recover that object +for fulfilling the request instead of terminating the process. + +In other words, these errors are expected. Moreover, the performance of the unhappy path +should also be on par with that of the happy path. Also, we want to have a way to ensure +that all expected errors are handled. It should be something like the statical analysis +performed by compiler to spit a warning if any enum value is not handled in a ``switch-case`` +statement. + +Unfortunately, ``seastar::future`` is not able to satisfy these two requirements. 
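+
+For illustration, here is a minimal, hypothetical sketch (the ``object_not_found``
+exception and the ``lookup_object()``/``recover_object()`` helpers are made up for
+this example) of how such an expected error has to be handled with a plain
+``seastar::future``:
+
+.. code-block:: c++
+
+  #include <stdexcept>
+  #include <seastar/core/future.hh>
+
+  // hypothetical exception and helpers, used only for this sketch
+  struct object_not_found : std::runtime_error {
+    using std::runtime_error::runtime_error;
+  };
+
+  seastar::future<int> lookup_object(int oid);   // fails with object_not_found
+  seastar::future<int> recover_object(int oid);  // kicks off recovery instead
+
+  seastar::future<int> get_or_recover(int oid) {
+    return lookup_object(oid).handle_exception_type(
+      [oid] (const object_not_found&) {
+        // the "expected" error path: recover instead of terminating.
+        // dispatching on the exception type re-throws it internally, and the
+        // signature of lookup_object() does not tell callers that this error
+        // can happen at all.
+        return recover_object(oid);
+      });
+  }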
+
+* Seastar imposes re-throwing an exception to dispatch between different types of
+  exceptions. This is neither very performant nor scalable, as locking in the
+  language runtime can occur.
+* Seastar does not encode the expected exception type in the type of the returned
+  ``seastar::future``. Only the type of the value is encoded. This imposes a huge
+  mental load on programmers, as ensuring that all intended errors are indeed handled
+  requires a manual code audit.
+
+.. highlight:: c++
+
+This is why the "errorator" was created. It is a wrapper around the vanilla
+``seastar::future`` that addresses the performance and scalability issues while
+embedding the information about all expected types of errors into the type of the future::
+
+  using ertr = crimson::errorator<crimson::ct_error::enoent,
+                                  crimson::ct_error::einval>;
+
+In the above example we defined an errorator that allows for two error types:
+
+* ``crimson::ct_error::enoent`` and
+* ``crimson::ct_error::einval``.
+
+These (and other ones in the ``crimson::ct_error`` namespace) are basically
+unthrowable wrappers over ``std::error_code`` to exclude accidental throwing
+and ensure signaling errors in a way that enables compile-time checking.
+
+The most fundamental thing in an errorator is a descendant of ``seastar::future``
+which can be used as e.g. a function's return type::
+
+  static ertr::future<int> foo(int bar) {
+    if (bar == 42) {
+      return crimson::ct_error::einval::make();
+    } else {
+      return ertr::make_ready_future<int>(bar);
+    }
+  }
+
+It's worth noting that returning an error that is not a part of the errorator's error set
+would result in a compile-time error::
+
+  static ertr::future<int> foo(int bar) {
+    // Oops, input_output_error is not allowed in `ertr`. static_assert() will
+    // terminate the compilation. This behaviour is absolutely fundamental for
+    // callers -- to figure out all the possible errors they need to worry
+    // about, it is enough to just take a look at the function's signature;
+    // reading through its implementation is not necessary anymore!
+    return crimson::ct_error::input_output_error::make();
+  }
+
+The errorator concept goes further. It not only provides callers with the information
+about all potential errors embedded in the function's type; it also ensures at the caller
+site that all these errors are handled. As the reader probably knows, the main method
+in ``seastar::future`` is ``then()``. On an errorated future it is available only if the
+errorator's error set is empty (literally: ``errorator<>::future``); otherwise callers have
+to use ``safe_then()`` instead::
+
+  seastar::future<> baz() {
+    return foo(42).safe_then(
+      [] (const int bar) {
+        std::cout << "the optimistic path! got bar=" << bar << std::endl;
+        return ertr::now();
+      },
+      ertr::all_same_way([] (const std::error_code& err) {
+        // handling errors removes them from errorator's error set
+        std::cout << "the error path! got err=" << err << std::endl;
+        return ertr::now();
+      })).then([] {
+        // as all errors have been handled, errorator's error set became
+        // empty and the future instance returned from `safe_then()` has
+        // `then()` available!
+        return seastar::now();
+      });
+  }
+
+In the above example ``ertr::all_same_way`` has been used to handle all errors in the same
+manner. This is not obligatory -- a caller can handle each of them separately. Moreover,
+it can provide a handler for only a subset of errors.
The price for that is the availability +of ``then()``:: + + using einval_ertr = crimson::errorator; + + // we can't return seastar::future<> (aka errorator<>::future<>) as handling + // as this level deals only with enoent leaving einval without a handler. + // handling it becomes a responsibility of a caller of `baz()`. + einval_ertr::future<> baz() { + return foo(42).safe_then( + [] (const int bar) { + std::cout << "the optimistic path! got bar=" << bar << std::endl + return ertr::now(); + }, + // provide a handler only for crimson::ct_error::enoent. + // crimson::ct_error::einval stays unhandled! + crimson::ct_error::enoent::handle([] { + std::cout << "the enoent error path!" << std::endl; + return ertr::now(); + })); + // .safe_then() above returned `errorator::future<>` + // which lacks `then()`. + } + +That is, handling errors removes them from errorated future's error set. This works +in the opposite direction too -- returning new errors in ``safe_then()`` appends them +the error set. Of course, this set must be compliant with error set in the ``baz()``'s +signature:: + + using broader_ertr = crimson::errorator; + + broader_ertr::future<> baz() { + return foo(42).safe_then( + [] (const int bar) { + std::cout << "oops, the optimistic path generates a new error!"; + return crimson::ct_error::input_output_error::make(); + }, + // we have a special handler to delegate the handling up. For convenience, + // the same behaviour is available as single argument-taking variant of + // `safe_then()`. + ertr::pass_further{}); + } + +As it can be seen, handling and signaling errors in ``safe_then()`` is basically +an operation on the error set checked at compile-time. + +More details can be found in `the slides from ceph::errorator<> throw/catch-free, +compile time-checked exceptions for seastar::future<> +`_ +presented at the Seastar Summit 2019. diff --git a/doc/dev/crimson/index.rst b/doc/dev/crimson/index.rst new file mode 100644 index 000000000..55f071825 --- /dev/null +++ b/doc/dev/crimson/index.rst @@ -0,0 +1,11 @@ +=============================== +Crimson developer documentation +=============================== + +.. rubric:: Contents + +.. toctree:: + :glob: + + * + diff --git a/doc/dev/crimson/osd.rst b/doc/dev/crimson/osd.rst new file mode 100644 index 000000000..f7f132b3f --- /dev/null +++ b/doc/dev/crimson/osd.rst @@ -0,0 +1,54 @@ +osd +=== + +.. graphviz:: + + digraph osd { + node [shape = doublecircle]; "start" "end"; + node [shape = circle]; + start -> preboot; + waiting_for_healthy [label = "waiting\nfor\nhealthy"]; + waiting_for_healthy -> waiting_for_healthy [label = "tick"]; + waiting_for_healthy -> preboot [label = "i am healthy!"]; + preboot -> booting [label = "send(MOSDBoot)"]; + booting -> active [label = "recv(osdmap)"]; + active -> prestop [label = "stop()"]; + active -> preboot [label = "recv(osdmap)"]; + active -> end [label = "kill(SIGINT)"]; + active -> waiting_for_healthy [label = "i am unhealthy!"] + prestop -> end [label = "recv(osdmap)"]; + } + +.. describe:: waiting_for_healthy + + If an OSD daemon is able to connected to its heartbeat peers, and its own + internal heartbeat does not fail, it is considered healthy. Otherwise, it + puts itself in the state of `waiting_for_healthy`, and check its own + reachability and internal heartbeat periodically. + +.. describe:: preboot + + OSD sends an `MOSDBoot` message to the connected monitor to inform the + cluster that it's ready to serve, so that the quorum can mark it `up` + in the osdmap. + +.. 
describe:: booting
+
+   Before being marked as `up`, an OSD has to stay in its `booting` state.
+
+.. describe:: active
+
+   Upon receiving an osdmap marking the OSD as `up`, it transitions to the `active`
+   state. After that, it is entitled to do its business. But the OSD service
+   can be fully stopped or suspended for various reasons. For instance,
+   the OSD service can be stopped manually by the administrator, or marked `stop`
+   in the osdmap. Or, if any of its IP addresses does not match the
+   corresponding one configured in the osdmap, it transitions to `preboot` if
+   it considers itself healthy.
+
+.. describe:: prestop
+
+   The OSD transitions to `prestop` unconditionally upon a request to `stop`.
+   But before bidding us farewell, it tries to get an acknowledgement from
+   the monitor by sending an `MOSDMarkMeDown`, and waits for a response in the
+   form of an updated osdmap or another `MOSDMarkMeDown` message.
diff --git a/doc/dev/crimson/pipeline.rst b/doc/dev/crimson/pipeline.rst
new file mode 100644
index 000000000..e9115c6d7
--- /dev/null
+++ b/doc/dev/crimson/pipeline.rst
@@ -0,0 +1,97 @@
+==============================
+The ``ClientRequest`` pipeline
+==============================
+
+In crimson, exactly like in the classical OSD, a client request has data and
+ordering dependencies which must be satisfied before processing (actually,
+a particular phase of processing) can begin. As one of the goals behind crimson
+is to preserve compatibility with the existing OSD incarnation, the same
+semantics must be assured. An obvious example of such a data dependency is the
+fact that an OSD needs to have a version of the OSDMap that matches the one
+used by the client (``Message::get_min_epoch()``).
+
+If a dependency is not satisfied, the processing stops. It is crucial to note that
+the same must happen to all other requests that are sequenced-after (due to
+their ordering requirements).
+
+There are a few cases when the blocking of a client request can happen.
+
+
+  ``ClientRequest::ConnectionPipeline::await_map``
+    wait until a particular OSDMap version is available at the OSD level
+  ``ClientRequest::ConnectionPipeline::get_pg``
+    wait until a particular PG becomes available on the OSD
+  ``ClientRequest::PGPipeline::await_map``
+    wait on a PG being advanced to a particular epoch
+  ``ClientRequest::PGPipeline::wait_for_active``
+    wait for a PG to become *active* (i.e. have ``is_active()`` asserted)
+  ``ClientRequest::PGPipeline::recover_missing``
+    wait on an object to be recovered (i.e. leaving the ``missing`` set)
+  ``ClientRequest::PGPipeline::get_obc``
+    wait on an object to be available for locking. The ``obc`` will be locked
+    before this operation is allowed to continue
+  ``ClientRequest::PGPipeline::process``
+    wait if any other ``MOSDOp`` message is being handled against this PG
+
+At any moment, a ``ClientRequest`` being served should be in one and only one
+of the phases described above. Similarly, an object denoting a particular phase
+can host no more than a single ``ClientRequest`` at the same time. At a low level
+this is achieved with a combination of a barrier and an exclusive lock. Together
+they implement the semantics of a semaphore with a single slot for these exclusive
+phases.
+
+As the execution advances, a request enters the next phase and leaves the current
+one, freeing it for another ``ClientRequest`` instance. All these phases form a
+pipeline which assures the order is preserved.
+
+These pipeline phases are divided into two ordering domains: ``ConnectionPipeline``
+and ``PGPipeline``.
The former ensures order across a client connection while +the latter does that across a PG. That is, requests originating from the same +connection are executed in the same order as they were sent by the client. +The same applies to the PG domain: when requests from multiple connections reach +a PG, they are executed in the same order as they entered a first blocking phase +of the ``PGPipeline``. + +Comparison with the classical OSD +---------------------------------- +As the audience of this document are Ceph Developers, it seems reasonable to +match the phases of crimson's ``ClientRequest`` pipeline with the blocking +stages in the classical OSD. The names in the right column are names of +containers (lists and maps) used to implement these stages. They are also +already documented in the ``PG.h`` header. + ++----------------------------------------+--------------------------------------+ +| crimson | ceph-osd waiting list | ++========================================+======================================+ +|``ConnectionPipeline::await_map`` | ``OSDShardPGSlot::waiting`` and | +|``ConnectionPipeline::get_pg`` | ``OSDShardPGSlot::waiting_peering`` | ++----------------------------------------+--------------------------------------+ +|``PGPipeline::await_map`` | ``PG::waiting_for_map`` | ++----------------------------------------+--------------------------------------+ +|``PGPipeline::wait_for_active`` | ``PG::waiting_for_peered`` | +| +--------------------------------------+ +| | ``PG::waiting_for_flush`` | +| +--------------------------------------+ +| | ``PG::waiting_for_active`` | ++----------------------------------------+--------------------------------------+ +|To be done (``PG_STATE_LAGGY``) | ``PG::waiting_for_readable`` | ++----------------------------------------+--------------------------------------+ +|To be done | ``PG::waiting_for_scrub`` | ++----------------------------------------+--------------------------------------+ +|``PGPipeline::recover_missing`` | ``PG::waiting_for_unreadable_object``| +| +--------------------------------------+ +| | ``PG::waiting_for_degraded_object`` | ++----------------------------------------+--------------------------------------+ +|To be done (proxying) | ``PG::waiting_for_blocked_object`` | ++----------------------------------------+--------------------------------------+ +|``PGPipeline::get_obc`` | *obc rwlocks* | ++----------------------------------------+--------------------------------------+ +|``PGPipeline::process`` | ``PG::lock`` (roughly) | ++----------------------------------------+--------------------------------------+ + + +As the last word it might be worth to emphasize that the ordering implementations +in both classical OSD and in crimson are stricter than a theoretical minimum one +required by the RADOS protocol. For instance, we could parallelize read operations +targeting the same object at the price of extra complexity but we don't -- the +simplicity has won. diff --git a/doc/dev/crimson/poseidonstore.rst b/doc/dev/crimson/poseidonstore.rst new file mode 100644 index 000000000..7c54c029a --- /dev/null +++ b/doc/dev/crimson/poseidonstore.rst @@ -0,0 +1,586 @@ +=============== + PoseidonStore +=============== + +Key concepts and goals +====================== + +* As one of the pluggable backend stores for Crimson, PoseidonStore targets only + high-end NVMe SSDs (not concerned with ZNS devices). 
+* Designed entirely for low CPU consumption + + - Hybrid update strategies for different data types (in-place, out-of-place) to + minimize CPU consumption by reducing host-side GC. + - Remove a black-box component like RocksDB and a file abstraction layer in BlueStore + to avoid unnecessary overheads (e.g., data copy and serialization/deserialization) + - Utilize NVMe feature (atomic large write command, Atomic Write Unit Normal). + Make use of io_uring, new kernel asynchronous I/O interface, to selectively use the interrupt + driven mode for CPU efficiency (or polled mode for low latency). +* Sharded data/processing model + +Background +---------- + +Both in-place and out-of-place update strategies have their pros and cons. + +* Log-structured store + + Log-structured based storage system is a typical example that adopts an update-out-of-place approach. + It never modifies the written data. Writes always go to the end of the log. It enables I/O sequentializing. + + * Pros + + - Without a doubt, one sequential write is enough to store the data + - It naturally supports transaction (this is no overwrite, so the store can rollback + previous stable state) + - Flash friendly (it mitigates GC burden on SSDs) + * Cons + + - There is host-side GC that induces overheads + + - I/O amplification (host-side) + - More host-CPU consumption + + - Slow metadata lookup + - Space overhead (live and unused data co-exist) + +* In-place update store + + The update-in-place strategy has been used widely for conventional file systems such as ext4 and xfs. + Once a block has been placed in a given disk location, it doesn't move. + Thus, writes go to the corresponding location in the disk. + + * Pros + + - Less host-CPU consumption (No host-side GC is required) + - Fast lookup + - No additional space for log-structured, but there is internal fragmentation + * Cons + + - More writes occur to record the data (metadata and data section are separated) + - It cannot support transaction. Some form of WAL required to ensure update atomicity + in the general case + - Flash unfriendly (Give more burdens on SSDs due to device-level GC) + +Motivation and Key idea +----------------------- + +In modern distributed storage systems, a server node can be equipped with multiple +NVMe storage devices. In fact, ten or more NVMe SSDs could be attached on a server. +As a result, it is hard to achieve NVMe SSD's full performance due to the limited CPU resources +available in a server node. In such environments, CPU tends to become a performance bottleneck. +Thus, now we should focus on minimizing host-CPU consumption, which is the same as the Crimson's objective. + +Towards an object store highly optimized for CPU consumption, three design choices have been made. + +* **PoseidonStore does not have a black-box component like RocksDB in BlueStore.** + + Thus, it can avoid unnecessary data copy and serialization/deserialization overheads. + Moreover, we can remove an unnecessary file abstraction layer, which was required to run RocksDB. + Object data and metadata is now directly mapped to the disk blocks. + Eliminating all these overheads will reduce CPU consumption (e.g., pre-allocation, NVME atomic feature). + +* **PoseidonStore uses hybrid update strategies for different data size, similar to BlueStore.** + + As we discussed, both in-place and out-of-place update strategies have their pros and cons. 
+ Since CPU is only bottlenecked under small I/O workloads, we chose update-in-place for small I/Os to minimize CPU consumption + while choosing update-out-of-place for large I/O to avoid double write. Double write for small data may be better than host-GC overhead + in terms of CPU consumption in the long run. Although it leaves GC entirely up to SSDs, + +* **PoseidonStore makes use of io_uring, new kernel asynchronous I/O interface to exploit interrupt-driven I/O.** + + User-space driven I/O solutions like SPDK provide high I/O performance by avoiding syscalls and enabling zero-copy + access from the application. However, it does not support interrupt-driven I/O, which is only possible with kernel-space driven I/O. + Polling is good for low-latency but bad for CPU efficiency. On the other hand, interrupt is good for CPU efficiency and bad for + low-latency (but not that bad as I/O size increases). Note that network acceleration solutions like DPDK also excessively consume + CPU resources for polling. Using polling both for network and storage processing aggravates CPU consumption. + Since network is typically much faster and has a higher priority than storage, polling should be applied only to network processing. + +high-end NVMe SSD has enough powers to handle more works. Also, SSD lifespan is not a practical concern these days +(there is enough program-erase cycle limit [#f1]_). On the other hand, for large I/O workloads, the host can afford process host-GC. +Also, the host can garbage collect invalid objects more effectively when their size is large + +Observation +----------- + +Two data types in Ceph + +* Data (object data) + + - The cost of double write is high + - The best method to store this data is in-place update + + - At least two operations required to store the data: 1) data and 2) location of + data. Nevertheless, a constant number of operations would be better than out-of-place + even if it aggravates WAF in SSDs + +* Metadata or small data (e.g., object_info_t, snapset, pg_log, and collection) + + - Multiple small-sized metadata entries for an object + - The best solution to store this data is WAL + Using cache + + - The efficient way to store metadata is to merge all metadata related to data + and store it though a single write operation even though it requires background + flush to update the data partition + + +Design +====== +.. 
ditaa:: + + +-WAL partition-|----------------------Data partition-------------------------------+ + | Sharded partition | + +-----------------------------------------------------------------------------------+ + | WAL -> | | Super block | Freelist info | Onode radix tree info| Data blocks | + +-----------------------------------------------------------------------------------+ + | Sharded partition 2 + +-----------------------------------------------------------------------------------+ + | WAL -> | | Super block | Freelist info | Onode radix tree info| Data blocks | + +-----------------------------------------------------------------------------------+ + | Sharded partition N + +-----------------------------------------------------------------------------------+ + | WAL -> | | Super block | Freelist info | Onode radix tree info| Data blocks | + +-----------------------------------------------------------------------------------+ + | Global information (in reverse order) + +-----------------------------------------------------------------------------------+ + | Global WAL -> | | SB | Freelist | | + +-----------------------------------------------------------------------------------+ + + +* WAL + + - Log, metadata and small data are stored in the WAL partition + - Space within the WAL partition is continually reused in a circular manner + - Flush data to trim WAL as necessary +* Disk layout + + - Data blocks are metadata blocks or data blocks + - Freelist manages the root of free space B+tree + - Super block contains management info for a data partition + - Onode radix tree info contains the root of onode radix tree + + +I/O procedure +------------- +* Write + + For incoming writes, data is handled differently depending on the request size; + data is either written twice (WAL) or written in a log-structured manner. + + #. If Request Size ≤ Threshold (similar to minimum allocation size in BlueStore) + + Write data and metadata to [WAL] —flush—> Write them to [Data section (in-place)] and + [Metadata section], respectively. + + Since the CPU becomes the bottleneck for small I/O workloads, in-place update scheme is used. + Double write for small data may be better than host-GC overhead in terms of CPU consumption + in the long run + #. Else if Request Size > Threshold + + Append data to [Data section (log-structure)] —> Write the corresponding metadata to [WAL] + —flush—> Write the metadata to [Metadata section] + + For large I/O workloads, the host can afford process host-GC + Also, the host can garbage collect invalid objects more effectively when their size is large + + Note that Threshold can be configured to a very large number so that only the scenario (1) occurs. + With this design, we can control the overall I/O procedure with the optimizations for crimson + as described above. + + * Detailed flow + + We make use of a NVMe write command which provides atomicity guarantees (Atomic Write Unit Power Fail) + For example, 512 Kbytes of data can be atomically written at once without fsync(). + + * stage 1 + + - if the data is small + WAL (written) --> | TxBegin A | Log Entry | TxEnd A | + Append a log entry that contains pg_log, snapset, object_infot_t and block allocation + using NVMe atomic write command on the WAL + - if the data is large + Data partition (written) --> | Data blocks | + * stage 2 + + - if the data is small + No need. + - if the data is large + Then, append the metadata to WAL. 
+ WAL --> | TxBegin A | Log Entry | TxEnd A | + +* Read + + - Use the cached object metadata to find out the data location + - If not cached, need to search WAL after checkpoint and Object meta partition to find the + latest meta data + +* Flush (WAL --> Data partition) + + - Flush WAL entries that have been committed. There are two conditions + (1. the size of WAL is close to full, 2. a signal to flush). + We can mitigate the overhead of frequent flush via batching processing, but it leads to + delaying completion. + + +Crash consistency +------------------ + +* Large case + + #. Crash occurs right after writing Data blocks + + - Data partition --> | Data blocks | + - We don't need to care this case. Data is not allocated yet. The blocks will be reused. + #. Crash occurs right after WAL + + - Data partition --> | Data blocks | + - WAL --> | TxBegin A | Log Entry | TxEnd A | + - Write procedure is completed, so there is no data loss or inconsistent state + +* Small case + + #. Crash occurs right after writing WAL + + - WAL --> | TxBegin A | Log Entry| TxEnd A | + - All data has been written + + +Comparison +---------- + +* Best case (pre-allocation) + + - Only need writes on both WAL and Data partition without updating object metadata (for the location). +* Worst case + + - At least three writes are required additionally on WAL, object metadata, and data blocks. + - If the flush from WAL to the data partition occurs frequently, radix tree onode structure needs to be update + in many times. To minimize such overhead, we can make use of batch processing to minimize the update on the tree + (the data related to the object has a locality because it will have the same parent node, so updates can be minimized) + +* WAL needs to be flushed if the WAL is close to full or a signal to flush. + + - The premise behind this design is OSD can manage the latest metadata as a single copy. So, + appended entries are not to be read +* Either best of the worst case does not produce severe I/O amplification (it produce I/Os, but I/O rate is constant) + unlike LSM-tree DB (the proposed design is similar to LSM-tree which has only level-0) + + +Detailed Design +=============== + +* Onode lookup + + * Radix tree + Our design is entirely based on the prefix tree. Ceph already makes use of the characteristic of OID's prefix to split or search + the OID (e.g., pool id + hash + oid). So, the prefix tree fits well to store or search the object. Our scheme is designed + to lookup the prefix tree efficiently. + + * Sharded partition + A few bits (leftmost bits of the hash) of the OID determine a sharded partition where the object is located. + For example, if the number of partitions is configured as four, The entire space of the hash in hobject_t + can be divided into four domains (0x0xxx ~ 0x3xxx, 0x4xxx ~ 0x7xxx, 0x8xxx ~ 0xBxxx and 0xCxxx ~ 0xFxxx). + + * Ondisk onode + + .. code-block:: c + + struct onode { + extent_tree block_maps; + b+_tree omaps; + map xattrs; + } + + onode contains the radix tree nodes for lookup, which means we can search for objects using tree node information in onode. + Also, if the data size is small, the onode can embed the data and xattrs. + The onode is fixed size (256 or 512 byte). On the other hands, omaps and block_maps are variable-length by using pointers in the onode. + + .. 
ditaa:: + + +----------------+------------+--------+ + | on\-disk onode | block_maps | omaps | + +----------+-----+------------+--------+ + | ^ ^ + | | | + +-----------+---------+ + + + * Lookup + The location of the root of onode tree is specified on Onode radix tree info, so we can find out where the object + is located by using the root of prefix tree. For example, shared partition is determined by OID as described above. + Using the rest of the OID's bits and radix tree, lookup procedure find outs the location of the onode. + The extent tree (block_maps) contains where data chunks locate, so we finally figure out the data location. + + +* Allocation + + * Sharded partitions + + The entire disk space is divided into several data chunks called sharded partition (SP). + Each SP has its own data structures to manage the partition. + + * Data allocation + + As we explained above, the management infos (e.g., super block, freelist info, onode radix tree info) are pre-allocated + in each shared partition. Given OID, we can map any data in Data block section to the extent tree in the onode. + Blocks can be allocated by searching the free space tracking data structure (we explain below). + + :: + + +-----------------------------------+ + | onode radix tree root node block | + | (Per-SP Meta) | + | | + | # of records | + | left_sibling / right_sibling | + | +--------------------------------+| + | | keys[# of records] || + | | +-----------------------------+|| + | | | start onode ID ||| + | | | ... ||| + | | +-----------------------------+|| + | +--------------------------------|| + | +--------------------------------+| + | | ptrs[# of records] || + | | +-----------------------------+|| + | | | SP block number ||| + | | | ... ||| + | | +-----------------------------+|| + | +--------------------------------+| + +-----------------------------------+ + + * Free space tracking + The freespace is tracked on a per-SP basis. We can use extent-based B+tree in XFS for free space tracking. + The freelist info contains the root of free space B+tree. Granularity is a data block in Data blocks partition. + The data block is the smallest and fixed size unit of data. + + :: + + +-----------------------------------+ + | Free space B+tree root node block | + | (Per-SP Meta) | + | | + | # of records | + | left_sibling / right_sibling | + | +--------------------------------+| + | | keys[# of records] || + | | +-----------------------------+|| + | | | startblock / blockcount ||| + | | | ... ||| + | | +-----------------------------+|| + | +--------------------------------|| + | +--------------------------------+| + | | ptrs[# of records] || + | | +-----------------------------+|| + | | | SP block number ||| + | | | ... ||| + | | +-----------------------------+|| + | +--------------------------------+| + +-----------------------------------+ + +* Omap and xattr + In this design, omap and xattr data is tracked by b+tree in onode. The onode only has the root node of b+tree. + The root node contains entries which indicate where the key onode exists. + So, if we know the onode, omap can be found via omap b+tree. + +* Fragmentation + + - Internal fragmentation + + We pack different types of data/metadata in a single block as many as possible to reduce internal fragmentation. 
+ Extent-based B+tree may help reduce this further by allocating contiguous blocks that best fit for the object + + - External fragmentation + + Frequent object create/delete may lead to external fragmentation + In this case, we need cleaning work (GC-like) to address this. + For this, we are referring the NetApp’s Continuous Segment Cleaning, which seems similar to the SeaStore’s approach + Countering Fragmentation in an Enterprise Storage System (NetApp, ACM TOS, 2020) + +.. ditaa:: + + + +---------------+-------------------+-------------+ + | Freelist info | Onode radix tree | Data blocks +-------+ + +---------------+---------+---------+-+-----------+ | + | | | + +--------------------+ | | + | | | + | OID | | + | | | + +---+---+ | | + | Root | | | + +---+---+ | | + | | | + v | | + /-----------------------------\ | | + | Radix tree | | v + +---------+---------+---------+ | /---------------\ + | onode | ... | ... | | | Num Chunk | + +---------+---------+---------+ | | | + +--+ onode | ... | ... | | | | + | +---------+---------+---------+ | | +-------+ + | | | ... | | + | | +---------------+ | + | | ^ | + | | | | + | | | | + | | | | + | /---------------\ /-------------\ | | v + +->| onode | | onode |<---+ | /------------+------------\ + +---------------+ +-------------+ | | Block0 | Block1 | + | OID | | OID | | +------------+------------+ + | Omaps | | Omaps | | | Data | Data | + | Data Extent | | Data Extent +-----------+ +------------+------------+ + +---------------+ +-------------+ + +WAL +--- +Each SP has a WAL. +The data written to the WAL are metadata updates, free space update and small data. +Note that only data smaller than the predefined threshold needs to be written to the WAL. +The larger data is written to the unallocated free space and its onode's extent_tree is updated accordingly +(also on-disk extent tree). We statically allocate WAL partition aside from data partition pre-configured. + + +Partition and Reactor thread +---------------------------- +In early stage development, PoseidonStore will employ static allocation of partition. The number of sharded partitions +is fixed and the size of each partition also should be configured before running cluster. +But, the number of partitions can grow as below. We leave this as a future work. +Also, each reactor thread has a static set of SPs. + +.. ditaa:: + + +------+------+-------------+------------------+ + | SP 1 | SP N | --> <-- | global partition | + +------+------+-------------+------------------+ + + + +Cache +----- +There are mainly two cache data structures; onode cache and block cache. +It looks like below. + +#. Onode cache: + lru_map ; +#. Block cache (data and omap): + Data cache --> lru_map + +To fill the onode data structure, the target onode needs to be retrieved using the prefix tree. +Block cache is used for caching a block contents. For a transaction, all the updates to blocks +(including object meta block, data block) are first performed in the in-memory block cache. +After writing a transaction to the WAL, the dirty blocks are flushed to their respective locations in the +respective partitions. +PoseidonStore can configure cache size for each type. Simple LRU cache eviction strategy can be used for both. + + +Sharded partitions (with cross-SP transaction) +---------------------------------------------- +The entire disk space is divided into a number of chunks called sharded partitions (SP). +The prefixes of the parent collection ID (original collection ID before collection splitting. 
That is, hobject.hash) +is used to map any collections to SPs. +We can use BlueStore's approach for collection splitting, changing the number of significant bits for the collection prefixes. +Because the prefixes of the parent collection ID do not change even after collection splitting, the mapping between +the collection and SP are maintained. +The number of SPs may be configured to match the number of CPUs allocated for each disk so that each SP can hold +a number of objects large enough for cross-SP transaction not to occur. + +In case of need of cross-SP transaction, we could use the global WAL. The coordinator thread (mainly manages global partition) handles +cross-SP transaction via acquire the source SP and target SP locks before processing the cross-SP transaction. +Source and target probably are blocked. + +For the load unbalanced situation, +Poseidonstore can create partitions to make full use of entire space efficiently and provide load balaning. + + +CoW/Clone +--------- +As for CoW/Clone, a clone has its own onode like other normal objects. + +Although each clone has its own onode, data blocks should be shared between the original object and clones +if there are no changes on them to minimize the space overhead. +To do so, the reference count for the data blocks is needed to manage those shared data blocks. + +To deal with the data blocks which has the reference count, poseidon store makes use of shared_blob +which maintains the referenced data block. + +As shown the figure as below, +the shared_blob tracks the data blocks shared between other onodes by using a reference count. +The shared_blobs are managed by shared_blob_list in the superblock. + + +.. ditaa:: + + + /----------\ /----------\ + | Object A | | Object B | + +----------+ +----------+ + | Extent | | Extent | + +---+--+---+ +--+----+--+ + | | | | + | | +----------+ | + | | | | + | +---------------+ | + | | | | + v v v v + +---------------+---------------+ + | Data block 1 | Data block 2 | + +-------+-------+------+--------+ + | | + v v + /---------------+---------------\ + | shared_blob 1 | shared_blob 2 | + +---------------+---------------+ shared_blob_list + | refcount | refcount | + +---------------+---------------+ + +Plans +===== + +All PRs should contain unit tests to verify its minimal functionality. + +* WAL and block cache implementation + + As a first step, we are going to build the WAL including the I/O procedure to read/write the WAL. + With WAL development, the block cache needs to be developed together. + Besides, we are going to add an I/O library to read/write from/to the NVMe storage to + utilize NVMe feature and the asynchronous interface. + +* Radix tree and onode + + First, submit a PR against this file with a more detailed on disk layout and lookup strategy for the onode radix tree. + Follow up with implementation based on the above design once design PR is merged. + The second PR will be the implementation regarding radix tree which is the key structure to look up + objects. + +* Extent tree + + This PR is the extent tree to manage data blocks in the onode. We build the extent tree, and + demonstrate how it works when looking up the object. + +* B+tree for omap + + We will put together a simple key/value interface for omap. This probably will be a separate PR. + +* CoW/Clone + + To support CoW/Clone, shared_blob and shared_blob_list will be added. 
+* Integration with Crimson's I/O interfaces
+
+  At this stage, the interfaces for interacting with Crimson, such as queue_transaction(), read(), clone_range(), etc.,
+  should work correctly.
+
+* Configuration
+
+  We will define the PoseidonStore configuration options in detail.
+
+* Stress test environment and integration with teuthology
+
+  We will add stress tests and teuthology suites.
+
+.. rubric:: Footnotes
+
+.. [#f1] Stathis Maneas, Kaveh Mahdaviani, Tim Emami, Bianca Schroeder: A Study of SSD Reliability in Large Scale Enterprise Storage Deployments. FAST 2020: 137-149
--
cgit v1.2.3