From 19fcec84d8d7d21e796c7624e521b60d28ee21ed Mon Sep 17 00:00:00 2001
From: Daniel Baumann
Date: Sun, 7 Apr 2024 20:45:59 +0200
Subject: Adding upstream version 16.2.11+ds.

Signed-off-by: Daniel Baumann
---
 doc/dev/osd_internals/manifest.rst | 623 +++++++++++++++++++++++++++++++++++++
 1 file changed, 623 insertions(+)
 create mode 100644 doc/dev/osd_internals/manifest.rst

diff --git a/doc/dev/osd_internals/manifest.rst b/doc/dev/osd_internals/manifest.rst
new file mode 100644
index 000000000..2e8f9ae44
--- /dev/null
+++ b/doc/dev/osd_internals/manifest.rst
@@ -0,0 +1,623 @@

========
Manifest
========


Introduction
============

As described in ``../deduplication.rst``, adding transparent redirect
machinery to RADOS would enable a more capable tiering solution
than RADOS currently has with "cache/tiering".

See ``../deduplication.rst``

At a high level, each object has a piece of metadata embedded in
the ``object_info_t`` which can map subsets of the object data payload
to (refcounted) objects in other pools.

This document exists to detail:

1. Manifest data structures
2. RADOS operations for manipulating manifests
3. Status and Plans


Intended Usage Model
====================

RBD
---

For RBD, the primary goal is for either an OSD-internal agent or a
cluster-external agent to be able to transparently shift portions
of the constituent 4MB extents between a dedup pool and a hot base
pool.

As such, RBD operations (including class operations and snapshots)
must have the same observable results regardless of the current
status of the object.

Moreover, tiering/dedup operations must interleave with RBD operations
without changing the result.

Thus, here is a sketch of how I'd expect a tiering agent to perform
basic operations:

* Demote cold RBD chunk to slow pool:

  1. Read object, noting current user_version.
  2. In memory, run CDC implementation to fingerprint object.
  3. Write out each resulting extent to an object in the cold pool
     using the CAS class.
  4. Submit operation to base pool:

     * ``ASSERT_VER`` with the user version from the read to fail if the
       object has been mutated since the read.
     * ``SET_CHUNK`` for each of the extents to the corresponding object
       in the cold pool.
     * ``EVICT_CHUNK`` for each extent to free up space in the base pool.
       Results in each chunk being marked ``MISSING``.

  RBD users should then see either the state prior to the demotion or
  the state subsequent to it.

  Note that between 3 and 4, we potentially leak references, so a
  periodic scrub would be needed to validate refcounts.

* Promote cold RBD chunk to fast pool:

  1. Submit ``TIER_PROMOTE``

For clones, all of the above would be identical except that the
initial read would need a ``LIST_SNAPS`` to determine which clones exist
and the ``PROMOTE`` or ``SET_CHUNK``/``EVICT`` operations would need to include
the ``cloneid``.

RadosGW
-------

For reads, RADOS Gateway (RGW) could operate as RBD does above, relying on
the manifest machinery in the OSD to hide the distinction between the object
being dedup'd or present in the base pool.

For writes, RGW could operate as RBD does above, but could
optionally have the freedom to fingerprint prior to doing the write.
In that case, it could immediately write out the target objects to the
CAS pool and then atomically write an object with the corresponding
chunks set.
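To make the demote sequence sketched in the RBD section above concrete,
here is a minimal librados C++ sketch of steps 1 and 4 for a single
extent. This is a sketch under assumptions, not an agent implementation:
the ``IoCtx`` handles, the object names (``foo``, ``foo-chunk``), and the
single-extent layout are illustrative; steps 2 and 3 (CDC fingerprinting
and the CAS-class writes to the cold pool) are elided; and the
``CEPH_OSD_OP_FLAG_WITH_REFERENCE`` constant is assumed to be the
programmatic equivalent of the CLI's ``--with-reference``. ::

  #include <rados/librados.hpp>
  #include <string>

  // Demote one extent of "foo" from the base pool to the cold pool.
  // `base` and `cold` are IoCtx handles assumed to be open already.
  int demote_extent(librados::IoCtx& base, librados::IoCtx& cold)
  {
    // Step 1: read the object, noting the current user_version.
    librados::bufferlist bl;
    int r = base.read("foo", bl, 4 * 1024 * 1024 /* RBD extent */, 0);
    if (r < 0)
      return r;
    uint64_t ver = base.get_last_version();

    // Steps 2-3 (CDC fingerprinting, CAS-class writes of each extent
    // to the cold pool) are elided; assume one extent -> "foo-chunk".

    // Step 4: fail if the object has been mutated since the read,
    // then record the chunk mapping, taking a reference on the target.
    librados::ObjectWriteOperation op;
    op.assert_version(ver);                           // ASSERT_VER
    op.set_chunk(0, bl.length(), cold, "foo-chunk",   // SET_CHUNK
                 0, CEPH_OSD_OP_FLAG_WITH_REFERENCE); // flag name assumed
    // An EVICT_CHUNK op would follow here once it exists (see Cleanups).
    return base.operate("foo", &op);
  }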
Status and Future Work
======================

At the moment, initial versions of a manifest data structure along
with IO path support and RADOS control operations exist. This section
is meant to outline next steps.

At a high level, our future work plan is:

- Cleanups: Address immediate inconsistencies and shortcomings outlined
  in the next section.
- Testing: RADOS relies heavily on teuthology failure testing to validate
  features like cache/tiering. We'll need corresponding tests for
  manifest operations.
- Snapshots: We want to be able to deduplicate portions of clones
  below the level of the RADOS snapshot system. As such, the
  RADOS operations below need to be extended to work correctly on
  clones (e.g.: we should be able to call ``SET_CHUNK`` on a clone, clear the
  corresponding extent in the base pool, and correctly maintain OSD metadata).
- Cache/tiering: Ultimately, we'd like to be able to deprecate the existing
  cache/tiering implementation, but to do that we need to ensure that we
  can address the same use cases.


Cleanups
--------

The existing implementation has some things that need to be cleaned up:

* ``SET_REDIRECT``: Should create the object if it doesn't exist, otherwise
  one couldn't create an object atomically as a redirect.
* ``SET_CHUNK``:

  * Appears to trigger a new clone as user_modify gets set in
    ``do_osd_ops``. This probably isn't desirable; see the Snapshots section
    below for some options on how generally to mix these operations
    with snapshots. At a minimum, ``SET_CHUNK`` probably shouldn't set
    user_modify.
  * Appears to assume that the corresponding section of the object
    does not exist (sets ``FLAG_MISSING``) but does not check whether the
    corresponding extent exists already in the object. Should always
    leave the extent clean.
  * Appears to clear the manifest unconditionally if not chunked;
    that's probably wrong. We should return an error if it's a
    ``REDIRECT`` ::

      case CEPH_OSD_OP_SET_CHUNK:
        if (oi.manifest.is_redirect()) {
          result = -EINVAL;
          goto fail;
        }


* ``TIER_PROMOTE``:

  * ``SET_REDIRECT`` clears the contents of the object. ``PROMOTE`` appears
    to copy them back in, but does not unset the redirect or clear the
    reference. This violates the invariant that a redirect object
    should be empty in the base pool. In particular, as long as the
    redirect is set, it appears that all operations will be proxied
    even after the promote, defeating the purpose. We do want ``PROMOTE``
    to be able to atomically replace a redirect with the actual
    object, so the solution is to clear the redirect at the end of the
    promote.
  * For a chunked manifest, we appear to flush prior to promoting.
    Promotion will often be used to prepare an object for low latency
    reads and writes; accordingly, the only effect should be to read
    any ``MISSING`` extents into the base pool. No flushing should be done.

* High Level:

  * It appears that ``FLAG_DIRTY`` should never be used for an extent pointing
    at a dedup extent. Writing the mutated extent back to the dedup pool
    requires writing a new object since the previous one cannot be mutated,
    just as it would if it hadn't been dedup'd yet. Thus, we should always
    drop the reference and remove the manifest pointer.

  * There isn't currently a way to "evict" an object region. With the above
    change to ``SET_CHUNK`` to always retain the existing object region, we
    need an ``EVICT_CHUNK`` operation to then remove the extent; a sketch of
    what such a handler might look like follows this list.
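The following is a hypothetical sketch of an ``EVICT_CHUNK`` handler in
``do_osd_ops``, in the style of the ``SET_CHUNK`` fragment above. The op
code, the transaction helper, and the flag setter are illustrative
assumptions, not existing code; only the documented semantics (clear the
local extent, retain the manifest link, mark it ``MISSING``, and return
``EINVAL`` for an unmapped extent) are taken from this document. ::

      // Hypothetical: CEPH_OSD_OP_EVICT_CHUNK does not exist yet.
      case CEPH_OSD_OP_EVICT_CHUNK: {
        uint64_t off = op.extent.offset;
        uint64_t len = op.extent.length;
        auto p = oi.manifest.chunk_map.find(off);
        if (p == oi.manifest.chunk_map.end() ||
            p->second.length != len) {
          result = -EINVAL;  // extent is not present in the manifest
          goto fail;
        }
        // Punch out the local copy but retain the manifest link,
        // marking the extent MISSING so reads are redirected.
        t->zero(soid, off, len);
        p->second.set_flag(chunk_info_t::FLAG_MISSING);
        break;
      }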
Testing
-------

We rely heavily on randomized failure testing. As such, we need
to extend that testing to include dedup/manifest support as well. Here's
a short list of the touchpoints:

* Thrasher tests like ``qa/suites/rados/thrash/workloads/cache-snaps.yaml``

  That test, of course, tests the existing cache/tiering machinery. Add
  additional files to that directory that instead set up a dedup pool. Add
  support to ``ceph_test_rados`` (``src/test/osd/TestRados*``).

* RBD tests

  Add a test that runs an RBD workload concurrently with blind
  promote/evict operations.

* RGW

  Add a test that runs an RGW workload concurrently with blind
  promote/evict operations.


Snapshots
---------

Fundamentally, we need to be able to manipulate the manifest
status of clones because we want to be able to dynamically promote,
flush (if the state was dirty when the clone was created), and evict
extents from clones.

As such, the plan is to allow the ``object_manifest_t`` for each clone
to be independent. Here's an incomplete list of the high level
tasks:

* Modify the op processing pipeline to permit ``SET_CHUNK``, ``EVICT_CHUNK``
  to operate directly on clones.
* Ensure that recovery checks the object_manifest prior to trying to
  use the overlaps in clone_range. ``ReplicatedBackend::calc_*_subsets``
  are the two methods that would likely need to be modified.

See ``snaps.rst`` for a rundown of the ``librados`` snapshot system and OSD
support details. I'd like to call out one particular data structure
we may want to exploit.

The dedup-tool needs to be updated to use ``LIST_SNAPS`` to discover
clones as part of leak detection.

An important question is how we deal with the fact that many clones
will frequently have references to the same backing chunks at the same
offset. In particular, ``make_writeable`` will generally create a clone
that shares the same ``object_manifest_t`` references with the exception
of any extents modified in that transaction. The metadata that
commits as part of that transaction must therefore map onto the same
refcount as before because otherwise we'd have to first increment
refcounts on backing objects (or risk a reference to a dead object).
Thus, we introduce a simple convention: consecutive clones which
share a reference at the same offset share the same refcount. This
means that a write that invokes ``make_writeable`` may decrease refcounts,
but not increase them. This has some consequences for removing clones.
Consider the following sequence ::

  write foo [0, 1024)
  flush foo ->
    head: [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=1, refcount(bbb)=1
  snapshot 10
  write foo [0, 512) ->
    head: [512, 1024) bbb
    10 : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=1, refcount(bbb)=1
  flush foo ->
    head: [0, 512) ccc, [512, 1024) bbb
    10 : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=1, refcount(bbb)=1, refcount(ccc)=1
  snapshot 20
  write foo [0, 512) (same contents as the original write)
    head: [512, 1024) bbb
    20 : [0, 512) ccc, [512, 1024) bbb
    10 : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=?, refcount(bbb)=1, refcount(ccc)=1
  flush foo
    head: [0, 512) aaa, [512, 1024) bbb
    20 : [0, 512) ccc, [512, 1024) bbb
    10 : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=?, refcount(bbb)=1, refcount(ccc)=1

What should the refcount for ``aaa`` be at the end? By our
above rule, it should be ``2`` since the two ``aaa`` refs are not
contiguous.
However, consider removing clone ``20`` ::

  initial:
    head: [0, 512) aaa, [512, 1024) bbb
    20 : [0, 512) ccc, [512, 1024) bbb
    10 : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=2, refcount(bbb)=1, refcount(ccc)=1
  trim 20
    head: [0, 512) aaa, [512, 1024) bbb
    10 : [0, 512) aaa, [512, 1024) bbb
    refcount(aaa)=?, refcount(bbb)=1, refcount(ccc)=0

At this point, our rule dictates that ``refcount(aaa)`` is ``1``.
This means that removing ``20`` needs to check for refs held by
the clones on either side, which will then match.

See ``osd_types.h:object_manifest_t::calc_refs_to_drop_on_removal``
for the logic implementing this rule.

This seems complicated, but it gets us two valuable properties:

1) The refcount change from make_writeable will not block on
   incrementing a ref
2) We don't need to load the ``object_manifest_t`` for every clone
   to determine how to handle removing one -- just the ones
   immediately preceding and succeeding it.

All clone operations will need to consider adjacent ``chunk_maps``
when adding or removing references.

Cache/Tiering
-------------

There already exists a cache/tiering mechanism based on whiteouts.
One goal here should ultimately be for this manifest machinery to
provide a complete replacement.

See ``cache-pool.rst``

The manifest machinery already shares some code paths with the
existing cache/tiering code, mainly ``stat_flush``.

In no particular order, here's an incomplete list of things that need
to be wired up to provide feature parity:

* Online object access information: The OSD already has pool configs
  for maintaining bloom filters which provide estimates of access
  recency for objects. We probably need to modify this to permit
  hitset maintenance for a normal pool -- there are already
  ``CEPH_OSD_OP_PG_HITSET*`` interfaces for querying them.
* Tiering agent: The OSD already has a background tiering agent which
  would need to be modified to instead flush and evict using
  manifests.
* Use existing features regarding the cache flush policy, such as:

  - hitset
  - age, ratio, bytes

* Add tiering modes to ``manifest-tiering``:

  - Writeback
  - Read-only


Data Structures
===============

Each RADOS object contains an ``object_manifest_t`` embedded within the
``object_info_t`` (see ``osd_types.h``):

::

  struct object_manifest_t {
    enum {
      TYPE_NONE = 0,
      TYPE_REDIRECT = 1,
      TYPE_CHUNKED = 2,
    };
    uint8_t type;  // redirect, chunked, ...
    hobject_t redirect_target;
    std::map<uint64_t, chunk_info_t> chunk_map;
  };

The ``type`` enum reflects three possible states an object can be in:

1. ``TYPE_NONE``: normal RADOS object
2. ``TYPE_REDIRECT``: object payload is backed by a single object
   specified by ``redirect_target``
3. ``TYPE_CHUNKED``: object payload is distributed among objects with
   size and offset specified by the ``chunk_map``. ``chunk_map`` maps
   the offset of the chunk to a ``chunk_info_t`` as shown below, also
   specifying the ``length``, target ``OID``, and ``flags``.

::

  struct chunk_info_t {
    typedef enum {
      FLAG_DIRTY = 1,
      FLAG_MISSING = 2,
      FLAG_HAS_REFERENCE = 4,
      FLAG_HAS_FINGERPRINT = 8,
    } cflag_t;
    uint32_t offset;
    uint32_t length;
    hobject_t oid;
    cflag_t flags;  // FLAG_*
  };

``FLAG_DIRTY`` at this time can be set if an extent with a fingerprint
is written. This should be changed to drop the fingerprint instead.
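To illustrate how the ``chunk_map`` is keyed by chunk offset, here is a
minimal sketch of resolving which chunk backs a given byte of a
``TYPE_CHUNKED`` object. It uses simplified stand-ins for the structures
above and is not the OSD's actual lookup code. ::

  #include <cstdint>
  #include <map>

  // Simplified stand-ins: only the fields the lookup needs.
  struct chunk_info_t { uint32_t offset; uint32_t length; };
  using chunk_map_t = std::map<uint64_t, chunk_info_t>;  // offset -> chunk

  // Return the chunk covering byte `off`, or nullptr if the manifest
  // does not map that extent.
  const chunk_info_t* find_chunk(const chunk_map_t& m, uint64_t off)
  {
    auto it = m.upper_bound(off);  // first chunk starting strictly after off
    if (it == m.begin())
      return nullptr;              // no chunk starts at or before off
    --it;                          // candidate chunk starting at/before off
    return off < it->first + it->second.length ? &it->second : nullptr;
  }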
Request Handling
================

Similarly to cache/tiering, the initial touchpoint is
``maybe_handle_manifest_detail``.

For manifest operations listed below, we return ``NOOP`` and continue onto
dedicated handling within ``do_osd_ops``.

For redirect objects which haven't been promoted (apparently ``oi.size >
0`` indicates that it's present?), we proxy reads and writes.

For reads on ``TYPE_CHUNKED``, if ``can_proxy_chunked_read`` (basically, all
of the ops are reads of extents in the ``object_manifest_t chunk_map``),
we proxy requests to those objects.


RADOS Interface
================

To set up deduplication, one must provision two pools. One will act as the
base pool and the other will act as the chunk pool. The base pool needs to be
configured with the ``fingerprint_algorithm`` option as follows.

::

  ceph osd pool set $BASE_POOL fingerprint_algorithm sha1|sha256|sha512
  --yes-i-really-mean-it

Create objects ::

  rados -p base_pool put foo ./foo
  rados -p chunk_pool put foo-chunk ./foo-chunk

Make a manifest object ::

  rados -p base_pool set-chunk foo $START_OFFSET $END_OFFSET --target-pool chunk_pool foo-chunk $START_OFFSET --with-reference

Operations:

* ``set-redirect``

  Set a redirection between a ``base_object`` in the ``base_pool`` and a ``target_object``
  in the ``target_pool``.
  A redirected object will forward all operations from the client to the
  ``target_object``. ::

    void set_redirect(const std::string& tgt_obj, const IoCtx& tgt_ioctx,
                      uint64_t tgt_version, int flag = 0);

    rados -p base_pool set-redirect <base_object> --target-pool <target_pool> <target_object>

  Returns ``ENOENT`` if the object does not exist (TODO: why?)
  Returns ``EINVAL`` if the object already is a redirect.

  Takes a reference to the target as part of the operation; can possibly leak a ref
  if the acting set resets and the client dies between taking the ref and
  recording the redirect.

  Truncates the object, clears omap, and clears xattrs as a side effect.

  At the top of ``do_osd_ops``, does not set user_modify.

  This operation is not a user mutation and does not trigger a clone to be created.

  There are two purposes of ``set_redirect``:

  1. Redirect all operations to the target object (like a proxy)
  2. Cache when ``tier_promote`` is called (the redirect will be cleared at this time)

* ``set-chunk``

  Set the ``chunk-offset`` in a ``source_object`` to make a link between it and a
  ``target_object``. ::

    void set_chunk(uint64_t src_offset, uint64_t src_length, const IoCtx& tgt_ioctx,
                   std::string tgt_oid, uint64_t tgt_offset, int flag = 0);

    rados -p base_pool set-chunk <base_object> <offset> <length> --target-pool <chunk_pool> <target_object> <target_offset>

  Returns ``ENOENT`` if the object does not exist (TODO: why?)
  Returns ``EINVAL`` if the object already is a redirect.
  Returns ``EINVAL`` on an ill-formed parameter buffer.
  Returns ``ENOTSUPP`` if existing mapped chunks overlap with the new chunk mapping.

  Takes references to targets as part of the operation; can possibly leak refs
  if the acting set resets and the client dies between taking the refs and
  recording the chunks.

  Truncates the object, clears omap, and clears xattrs as a side effect.

  This operation is not a user mutation and does not trigger a clone to be created.

  TODO: ``SET_CHUNK`` appears to clear the manifest unconditionally if it's not chunked. ::

    if (!oi.manifest.is_chunked()) {
      oi.manifest.clear();
    }

* ``evict-chunk``

  Clears an extent from an object, leaving only the manifest link between
  it and the ``target_object``.
  ::

    void evict_chunk(
      uint64_t offset, uint64_t length, int flag = 0);

    rados -p base_pool evict-chunk <base_object> <offset> <length>

  Returns ``EINVAL`` if the extent is not present in the manifest.

  Note: this does not exist yet.


* ``tier-promote``

  Promotes the object, ensuring that subsequent reads and writes will be local ::

    void tier_promote();

    rados -p base_pool tier-promote <obj-name>

  Returns ``ENOENT`` if the object does not exist.

  For a redirect manifest, copies data to head.

  TODO: Promote on a redirect object needs to clear the redirect.

  For a chunked manifest, reads all ``MISSING`` extents into the base pool;
  subsequent reads and writes will be served from the base pool.

  Implementation Note: For a chunked manifest, calls ``start_copy`` on itself. The
  resulting ``copy_get`` operation will issue reads which will then be redirected by
  the normal manifest read machinery.

  Does not set the ``user_modify`` flag.

  Future work will involve adding support for specifying a ``clone_id``.

* ``unset-manifest``

  Unset the manifest info in an object that has a manifest. ::

    void unset_manifest();

    rados -p base_pool unset-manifest <obj-name>

  Clears manifest chunks or the redirect. Lazily releases references; may
  leak.

  ``do_osd_ops`` seems not to include it in the ``user_modify=false`` ``ignorelist``,
  and so will trigger a snapshot. Note that this will be true even for a
  redirect, though ``SET_REDIRECT`` does not flip ``user_modify``. This should
  be fixed -- ``unset-manifest`` should not be a ``user_modify``.

* ``tier-flush``

  Flush the object which has chunks to the chunk pool. ::

    void tier_flush();

    rados -p base_pool tier-flush <obj-name>

  Included in the ``user_modify=false`` ``ignorelist``; does not trigger a clone.

  Does not evict the extents.


ceph-dedup-tool
===============

``ceph-dedup-tool`` has two features: finding an optimal chunk offset for dedup chunking
and fixing the reference count (see ``./refcount.rst``).

* Find an optimal chunk offset

  a. Fixed chunk

     To find an optimal fixed chunk length, you need to run the following command many
     times while changing the ``chunk_size``. ::

       ceph-dedup-tool --op estimate --pool $POOL --chunk-size chunk_size
       --chunk-algorithm fixed --fingerprint-algorithm sha1|sha256|sha512

  b. Rabin chunk (Rabin-Karp algorithm)

     Rabin-Karp is a string-searching algorithm based
     on a rolling hash. But a rolling hash is not enough to do deduplication because
     we don't know the chunk boundary. So, we need content-based slicing using
     a rolling hash for content-defined chunking.
     The current implementation uses the simplest approach: look for chunk boundaries
     by inspecting the rolling hash for a pattern (such as the
     lower N bits all being zero).

     Users who want to use deduplication need to find an ideal chunk offset.
     To find the ideal chunk offset, users should discover
     the optimal configuration for their data workload via ``ceph-dedup-tool``.
     This information will then be used for object chunking through
     the ``set-chunk`` API. ::

       ceph-dedup-tool --op estimate --pool $POOL --min-chunk min_size
       --chunk-algorithm rabin --fingerprint-algorithm rabin

     ``ceph-dedup-tool`` has many options for utilizing ``rabin chunk``.
     These are the options for ``rabin chunk``. ::

       --mod-prime
       --rabin-prime
       --pow
       --chunk-mask-bit
       --window-size
       --min-chunk
       --max-chunk

     Users should refer to the following equation when using the above options
     for ``rabin chunk``.
     ::

       rabin_hash =
         (rabin_hash * rabin_prime + new_byte - old_byte * pow) % (mod_prime)

  c. Fixed chunk vs content-defined chunk

     Content-defined chunking may or may not be the optimal solution.
     For example,

     Data chunk ``A`` : ``abcdefgabcdefgabcdefg``

     Let's think about Data chunk ``A``'s deduplication. The ideal chunk offset is
     from ``1`` to ``7`` (``abcdefg``). So, if we use fixed chunk, ``7`` is the
     optimal chunk length.
     But, in the case of content-based slicing, the optimal chunk length
     might not be found (the dedup ratio will not be 100%),
     because we need to find optimal parameters such
     as the boundary bit, window size, and prime value. This is not as easy as
     the fixed chunk case.
     But content-defined chunking is very effective in the following case.

     Data chunk ``B`` : ``abcdefgabcdefgabcdefg``

     Data chunk ``C`` : ``Tabcdefgabcdefgabcdefg``


* Fix reference count

  The key idea behind reference counting for dedup is that errors should be
  false positives: ``(manifest object (no ref), chunk object (has ref))`` can
  happen instead of ``(manifest object (has ref), chunk object (no ref))``.
  To fix such inconsistencies, ``ceph-dedup-tool`` supports ``chunk_scrub``. ::

    ceph-dedup-tool --op chunk_scrub --chunk_pool $CHUNK_POOL
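Returning to the Rabin chunking options above, here is a minimal,
self-contained C++ sketch of the rolling-hash recurrence and the boundary
test. The parameter values are illustrative, not ``ceph-dedup-tool``'s
defaults. ::

  #include <cstdint>
  #include <deque>

  // Rolling-hash sketch of:
  //   rabin_hash = (rabin_hash * rabin_prime + new_byte
  //                 - old_byte * pow) % mod_prime
  // Parameter values below are illustrative only; mod_prime is kept
  // small enough that all intermediate products fit in uint64_t.
  struct RabinRoller {
    static constexpr uint64_t mod_prime = 1000000007ULL;
    static constexpr uint64_t rabin_prime = 3;
    static constexpr size_t window_size = 48;
    uint64_t pow = 1;   // rabin_prime^window_size % mod_prime
    uint64_t hash = 0;
    std::deque<unsigned char> window;

    RabinRoller() {
      for (size_t i = 0; i < window_size; ++i)
        pow = (pow * rabin_prime) % mod_prime;
    }

    void roll(unsigned char new_byte) {
      uint64_t old_byte = 0;
      if (window.size() == window_size) {
        old_byte = window.front();  // byte leaving the window
        window.pop_front();
      }
      window.push_back(new_byte);
      // Add mod_prime before subtracting to keep the value non-negative.
      hash = (hash * rabin_prime + new_byte
              + mod_prime - (old_byte * pow) % mod_prime) % mod_prime;
    }

    // Declare a chunk boundary when the low chunk_mask_bit bits are all
    // zero, mirroring the --chunk-mask-bit option.
    bool is_boundary(unsigned chunk_mask_bit) const {
      return (hash & ((1ULL << chunk_mask_bit) - 1)) == 0;
    }
  };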