summaryrefslogtreecommitdiffstats
path: root/doc/dev/osd_internals/manifest.rst
diff options
context:
space:
mode:
Diffstat (limited to 'doc/dev/osd_internals/manifest.rst')
-rw-r--r--doc/dev/osd_internals/manifest.rst623
1 files changed, 623 insertions, 0 deletions
diff --git a/doc/dev/osd_internals/manifest.rst b/doc/dev/osd_internals/manifest.rst
new file mode 100644
index 000000000..2e8f9ae44
--- /dev/null
+++ b/doc/dev/osd_internals/manifest.rst
@@ -0,0 +1,623 @@
+========
+Manifest
+========
+
+
+Introduction
+============
+
+As described in ``../deduplication.rst``, adding transparent redirect
+machinery to RADOS would enable a more capable tiering solution
+than RADOS currently has with "cache/tiering".
+
+See ``../deduplication.rst``
+
+At a high level, each object has a piece of metadata embedded in
+the ``object_info_t`` which can map subsets of the object data payload
+to (refcounted) objects in other pools.
+
+This document exists to detail:
+
+1. Manifest data structures
+2. Rados operations for manipulating manifests.
+3. Status and Plans
+
+
+Intended Usage Model
+====================
+
+RBD
+---
+
+For RBD, the primary goal is for either an OSD-internal agent or a
+cluster-external agent to be able to transparently shift portions
+of the consituent 4MB extents between a dedup pool and a hot base
+pool.
+
+As such, RBD operations (including class operations and snapshots)
+must have the same observable results regardless of the current
+status of the object.
+
+Moreover, tiering/dedup operations must interleave with RBD operations
+without changing the result.
+
+Thus, here is a sketch of how I'd expect a tiering agent to perform
+basic operations:
+
+* Demote cold RBD chunk to slow pool:
+
+ 1. Read object, noting current user_version.
+ 2. In memory, run CDC implementation to fingerprint object.
+ 3. Write out each resulting extent to an object in the cold pool
+ using the CAS class.
+ 4. Submit operation to base pool:
+
+ * ``ASSERT_VER`` with the user version from the read to fail if the
+ object has been mutated since the read.
+ * ``SET_CHUNK`` for each of the extents to the corresponding object
+ in the base pool.
+ * ``EVICT_CHUNK`` for each extent to free up space in the base pool.
+ Results in each chunk being marked ``MISSING``.
+
+ RBD users should then either see the state prior to the demotion or
+ subsequent to it.
+
+ Note that between 3 and 4, we potentially leak references, so a
+ periodic scrub would be needed to validate refcounts.
+
+* Promote cold RBD chunk to fast pool.
+
+ 1. Submit ``TIER_PROMOTE``
+
+For clones, all of the above would be identical except that the
+initial read would need a ``LIST_SNAPS`` to determine which clones exist
+and the ``PROMOTE`` or ``SET_CHUNK``/``EVICT`` operations would need to include
+the ``cloneid``.
+
+RadosGW
+-------
+
+For reads, RADOS Gateway (RGW) could operate as RBD does above relying on the
+manifest machinery in the OSD to hide the distinction between the object
+being dedup'd or present in the base pool
+
+For writes, RGW could operate as RBD does above, but could
+optionally have the freedom to fingerprint prior to doing the write.
+In that case, it could immediately write out the target objects to the
+CAS pool and then atomically write an object with the corresponding
+chunks set.
+
+Status and Future Work
+======================
+
+At the moment, initial versions of a manifest data structure along
+with IO path support and rados control operations exist. This section
+is meant to outline next steps.
+
+At a high level, our future work plan is:
+
+- Cleanups: Address immediate inconsistencies and shortcomings outlined
+ in the next section.
+- Testing: Rados relies heavily on teuthology failure testing to validate
+ features like cache/tiering. We'll need corresponding tests for
+ manifest operations.
+- Snapshots: We want to be able to deduplicate portions of clones
+ below the level of the rados snapshot system. As such, the
+ rados operations below need to be extended to work correctly on
+ clones (e.g.: we should be able to call ``SET_CHUNK`` on a clone, clear the
+ corresponding extent in the base pool, and correctly maintain OSD metadata).
+- Cache/tiering: Ultimately, we'd like to be able to deprecate the existing
+ cache/tiering implementation, but to do that we need to ensure that we
+ can address the same use cases.
+
+
+Cleanups
+--------
+
+The existing implementation has some things that need to be cleaned up:
+
+* ``SET_REDIRECT``: Should create the object if it doesn't exist, otherwise
+ one couldn't create an object atomically as a redirect.
+* ``SET_CHUNK``:
+
+ * Appears to trigger a new clone as user_modify gets set in
+ ``do_osd_ops``. This probably isn't desirable, see Snapshots section
+ below for some options on how generally to mix these operations
+ with snapshots. At a minimum, ``SET_CHUNK`` probably shouldn't set
+ user_modify.
+ * Appears to assume that the corresponding section of the object
+ does not exist (sets ``FLAG_MISSING``) but does not check whether the
+ corresponding extent exists already in the object. Should always
+ leave the extent clean.
+ * Appears to clear the manifest unconditionally if not chunked,
+ that's probably wrong. We should return an error if it's a
+ ``REDIRECT`` ::
+
+ case CEPH_OSD_OP_SET_CHUNK:
+ if (oi.manifest.is_redirect()) {
+ result = -EINVAL;
+ goto fail;
+ }
+
+
+* ``TIER_PROMOTE``:
+
+ * ``SET_REDIRECT`` clears the contents of the object. ``PROMOTE`` appears
+ to copy them back in, but does not unset the redirect or clear the
+ reference. This violates the invariant that a redirect object
+ should be empty in the base pool. In particular, as long as the
+ redirect is set, it appears that all operations will be proxied
+ even after the promote defeating the purpose. We do want ``PROMOTE``
+ to be able to atomically replace a redirect with the actual
+ object, so the solution is to clear the redirect at the end of the
+ promote.
+ * For a chunked manifest, we appear to flush prior to promoting.
+ Promotion will often be used to prepare an object for low latency
+ reads and writes, accordingly, the only effect should be to read
+ any ``MISSING`` extents into the base pool. No flushing should be done.
+
+* High Level:
+
+ * It appears that ``FLAG_DIRTY`` should never be used for an extent pointing
+ at a dedup extent. Writing the mutated extent back to the dedup pool
+ requires writing a new object since the previous one cannot be mutated,
+ just as it would if it hadn't been dedup'd yet. Thus, we should always
+ drop the reference and remove the manifest pointer.
+
+ * There isn't currently a way to "evict" an object region. With the above
+ change to ``SET_CHUNK`` to always retain the existing object region, we
+ need an ``EVICT_CHUNK`` operation to then remove the extent.
+
+
+Testing
+-------
+
+We rely really heavily on randomized failure testing. As such, we need
+to extend that testing to include dedup/manifest support as well. Here's
+a short list of the touchpoints:
+
+* Thrasher tests like ``qa/suites/rados/thrash/workloads/cache-snaps.yaml``
+
+ That test, of course, tests the existing cache/tiering machinery. Add
+ additional files to that directory that instead setup a dedup pool. Add
+ support to ``ceph_test_rados`` (``src/test/osd/TestRados*``).
+
+* RBD tests
+
+ Add a test that runs an RBD workload concurrently with blind
+ promote/evict operations.
+
+* RGW
+
+ Add a test that runs a rgw workload concurrently with blind
+ promote/evict operations.
+
+
+Snapshots
+---------
+
+Fundamentally we need to be able to manipulate the manifest
+status of clones because we want to be able to dynamically promote,
+flush (if the state was dirty when the clone was created), and evict
+extents from clones.
+
+As such, the plan is to allow the ``object_manifest_t`` for each clone
+to be independent. Here's an incomplete list of the high level
+tasks:
+
+* Modify the op processing pipeline to permit ``SET_CHUNK``, ``EVICT_CHUNK``
+ to operation directly on clones.
+* Ensure that recovery checks the object_manifest prior to trying to
+ use the overlaps in clone_range. ``ReplicatedBackend::calc_*_subsets``
+ are the two methods that would likely need to be modified.
+
+See ``snaps.rst`` for a rundown of the ``librados`` snapshot system and OSD
+support details. I'd like to call out one particular data structure
+we may want to exploit.
+
+The dedup-tool needs to be updated to use ``LIST_SNAPS`` to discover
+clones as part of leak detection.
+
+An important question is how we deal with the fact that many clones
+will frequently have references to the same backing chunks at the same
+offset. In particular, ``make_writeable`` will generally create a clone
+that shares the same ``object_manifest_t`` references with the exception
+of any extents modified in that transaction. The metadata that
+commits as part of that transaction must therefore map onto the same
+refcount as before because otherwise we'd have to first increment
+refcounts on backing objects (or risk a reference to a dead object)
+Thus, we introduce a simple convention: consecutive clones which
+share a reference at the same offset share the same refcount. This
+means that a write that invokes ``make_writeable`` may decrease refcounts,
+but not increase them. This has some conquences for removing clones.
+Consider the following sequence ::
+
+ write foo [0, 1024)
+ flush foo ->
+ head: [0, 512) aaa, [512, 1024) bbb
+ refcount(aaa)=1, refcount(bbb)=1
+ snapshot 10
+ write foo [0, 512) ->
+ head: [512, 1024) bbb
+ 10 : [0, 512) aaa, [512, 1024) bbb
+ refcount(aaa)=1, refcount(bbb)=1
+ flush foo ->
+ head: [0, 512) ccc, [512, 1024) bbb
+ 10 : [0, 512) aaa, [512, 1024) bbb
+ refcount(aaa)=1, refcount(bbb)=1, refcount(ccc)=1
+ snapshot 20
+ write foo [0, 512) (same contents as the original write)
+ head: [512, 1024) bbb
+ 20 : [0, 512) ccc, [512, 1024) bbb
+ 10 : [0, 512) aaa, [512, 1024) bbb
+ refcount(aaa)=?, refcount(bbb)=1
+ flush foo
+ head: [0, 512) aaa, [512, 1024) bbb
+ 20 : [0, 512) ccc, [512, 1024) bbb
+ 10 : [0, 512) aaa, [512, 1024) bbb
+ refcount(aaa)=?, refcount(bbb)=1, refcount(ccc)=1
+
+What should be the refcount for ``aaa`` be at the end? By our
+above rule, it should be ``2`` since the two ```aaa``` refs are not
+contiguous. However, consider removing clone ``20`` ::
+
+ initial:
+ head: [0, 512) aaa, [512, 1024) bbb
+ 20 : [0, 512) ccc, [512, 1024) bbb
+ 10 : [0, 512) aaa, [512, 1024) bbb
+ refcount(aaa)=2, refcount(bbb)=1, refcount(ccc)=1
+ trim 20
+ head: [0, 512) aaa, [512, 1024) bbb
+ 10 : [0, 512) aaa, [512, 1024) bbb
+ refcount(aaa)=?, refcount(bbb)=1, refcount(ccc)=0
+
+At this point, our rule dictates that ``refcount(aaa)`` is `1`.
+This means that removing ``20`` needs to check for refs held by
+the clones on either side which will then match.
+
+See ``osd_types.h:object_manifest_t::calc_refs_to_drop_on_removal``
+for the logic implementing this rule.
+
+This seems complicated, but it gets us two valuable properties:
+
+1) The refcount change from make_writeable will not block on
+ incrementing a ref
+2) We don't need to load the ``object_manifest_t`` for every clone
+ to determine how to handle removing one -- just the ones
+ immediately preceding and succeeding it.
+
+All clone operations will need to consider adjacent ``chunk_maps``
+when adding or removing references.
+
+Cache/Tiering
+-------------
+
+There already exists a cache/tiering mechanism based on whiteouts.
+One goal here should ultimately be for this manifest machinery to
+provide a complete replacement.
+
+See ``cache-pool.rst``
+
+The manifest machinery already shares some code paths with the
+existing cache/tiering code, mainly ``stat_flush``.
+
+In no particular order, here's in incomplete list of things that need
+to be wired up to provide feature parity:
+
+* Online object access information: The osd already has pool configs
+ for maintaining bloom filters which provide estimates of access
+ recency for objects. We probably need to modify this to permit
+ hitset maintenance for a normal pool -- there are already
+ ``CEPH_OSD_OP_PG_HITSET*`` interfaces for querying them.
+* Tiering agent: The osd already has a background tiering agent which
+ would need to be modified to instead flush and evict using
+ manifests.
+
+* Use exiting existing features regarding the cache flush policy such as
+ histset, age, ratio.
+ - hitset
+ - age, ratio, bytes
+
+* Add tiering-mode to ``manifest-tiering``
+ - Writeback
+ - Read-only
+
+
+Data Structures
+===============
+
+Each RADOS object contains an ``object_manifest_t`` embedded within the
+``object_info_t`` (see ``osd_types.h``):
+
+::
+
+ struct object_manifest_t {
+ enum {
+ TYPE_NONE = 0,
+ TYPE_REDIRECT = 1,
+ TYPE_CHUNKED = 2,
+ };
+ uint8_t type; // redirect, chunked, ...
+ hobject_t redirect_target;
+ std::map<uint64_t, chunk_info_t> chunk_map;
+ }
+
+The ``type`` enum reflects three possible states an object can be in:
+
+1. ``TYPE_NONE``: normal RADOS object
+2. ``TYPE_REDIRECT``: object payload is backed by a single object
+ specified by ``redirect_target``
+3. ``TYPE_CHUNKED: object payload is distributed among objects with
+ size and offset specified by the ``chunk_map``. ``chunk_map`` maps
+ the offset of the chunk to a ``chunk_info_t`` as shown below, also
+ specifying the ``length``, target `OID`, and ``flags``.
+
+::
+
+ struct chunk_info_t {
+ typedef enum {
+ FLAG_DIRTY = 1,
+ FLAG_MISSING = 2,
+ FLAG_HAS_REFERENCE = 4,
+ FLAG_HAS_FINGERPRINT = 8,
+ } cflag_t;
+ uint32_t offset;
+ uint32_t length;
+ hobject_t oid;
+ cflag_t flags; // FLAG_*
+
+
+``FLAG_DIRTY`` at this time can happen if an extent with a fingerprint
+is written. This should be changed to drop the fingerprint instead.
+
+
+Request Handling
+================
+
+Similarly to cache/tiering, the initial touchpoint is
+``maybe_handle_manifest_detail``.
+
+For manifest operations listed below, we return ``NOOP`` and continue onto
+dedicated handling within ``do_osd_ops``.
+
+For redirect objects which haven't been promoted (apparently ``oi.size >
+0`` indicates that it's present?) we proxy reads and writes.
+
+For reads on ``TYPE_CHUNKED``, if ``can_proxy_chunked_read`` (basically, all
+of the ops are reads of extents in the ``object_manifest_t chunk_map``),
+we proxy requests to those objects.
+
+
+RADOS Interface
+================
+
+To set up deduplication one must provision two pools. One will act as the
+base pool and the other will act as the chunk pool. The base pool need to be
+configured with the ``fingerprint_algorithm`` option as follows.
+
+::
+
+ ceph osd pool set $BASE_POOL fingerprint_algorithm sha1|sha256|sha512
+ --yes-i-really-mean-it
+
+Create objects ::
+
+ rados -p base_pool put foo ./foo
+ rados -p chunk_pool put foo-chunk ./foo-chunk
+
+Make a manifest object ::
+
+ rados -p base_pool set-chunk foo $START_OFFSET $END_OFFSET --target-pool chunk_pool foo-chunk $START_OFFSET --with-reference
+
+Operations:
+
+* ``set-redirect``
+
+ Set a redirection between a ``base_object`` in the ``base_pool`` and a ``target_object``
+ in the ``target_pool``.
+ A redirected object will forward all operations from the client to the
+ ``target_object``. ::
+
+ void set_redirect(const std::string& tgt_obj, const IoCtx& tgt_ioctx,
+ uint64_t tgt_version, int flag = 0);
+
+ rados -p base_pool set-redirect <base_object> --target-pool <target_pool>
+ <target_object>
+
+ Returns ``ENOENT`` if the object does not exist (TODO: why?)
+ Returns ``EINVAL`` if the object already is a redirect.
+
+ Takes a reference to target as part of operation, can possibly leak a ref
+ if the acting set resets and the client dies between taking the ref and
+ recording the redirect.
+
+ Truncates object, clears omap, and clears xattrs as a side effect.
+
+ At the top of ``do_osd_ops``, does not set user_modify.
+
+ This operation is not a user mutation and does not trigger a clone to be created.
+
+ There are two purposes of ``set_redirect``:
+
+ 1. Redirect all operation to the target object (like proxy)
+ 2. Cache when ``tier_promote`` is called (redirect will be cleared at this time).
+
+* ``set-chunk``
+
+ Set the ``chunk-offset`` in a ``source_object`` to make a link between it and a
+ ``target_object``. ::
+
+ void set_chunk(uint64_t src_offset, uint64_t src_length, const IoCtx& tgt_ioctx,
+ std::string tgt_oid, uint64_t tgt_offset, int flag = 0);
+
+ rados -p base_pool set-chunk <source_object> <offset> <length> --target-pool
+ <caspool> <target_object> <target-offset>
+
+ Returns ``ENOENT`` if the object does not exist (TODO: why?)
+ Returns ``EINVAL`` if the object already is a redirect.
+ Returns ``EINVAL`` if on ill-formed parameter buffer.
+ Returns ``ENOTSUPP`` if existing mapped chunks overlap with new chunk mapping.
+
+ Takes references to targets as part of operation, can possibly leak refs
+ if the acting set resets and the client dies between taking the ref and
+ recording the redirect.
+
+ Truncates object, clears omap, and clears xattrs as a side effect.
+
+ This operation is not a user mutation and does not trigger a clone to be created.
+
+ TODO: ``SET_CHUNK`` appears to clear the manifest unconditionally if it's not chunked. ::
+
+ if (!oi.manifest.is_chunked()) {
+ oi.manifest.clear();
+ }
+
+* ``evict-chunk``
+
+ Clears an extent from an object leaving only the manifest link between
+ it and the ``target_object``. ::
+
+ void evict_chunk(
+ uint64_t offset, uint64_t length, int flag = 0);
+
+ rados -p base_pool evict-chunk <offset> <length> <object>
+
+ Returns ``EINVAL`` if the extent is not present in the manifest.
+
+ Note: this does not exist yet.
+
+
+* ``tier-promote``
+
+ Promotes the object ensuring that subsequent reads and writes will be local ::
+
+ void tier_promote();
+
+ rados -p base_pool tier-promote <obj-name>
+
+ Returns ``ENOENT`` if the object does not exist
+
+ For a redirect manifest, copies data to head.
+
+ TODO: Promote on a redirect object needs to clear the redirect.
+
+ For a chunked manifest, reads all MISSING extents into the base pool,
+ subsequent reads and writes will be served from the base pool.
+
+ Implementation Note: For a chunked manifest, calls ``start_copy`` on itself. The
+ resulting ``copy_get`` operation will issue reads which will then be redirected by
+ the normal manifest read machinery.
+
+ Does not set the ``user_modify`` flag.
+
+ Future work will involve adding support for specifying a ``clone_id``.
+
+* ``unset-manifest``
+
+ Unset the manifest info in the object that has manifest. ::
+
+ void unset_manifest();
+
+ rados -p base_pool unset-manifest <obj-name>
+
+ Clears manifest chunks or redirect. Lazily releases references, may
+ leak.
+
+ ``do_osd_ops`` seems not to include it in the ``user_modify=false`` ``ignorelist``,
+ and so will trigger a snapshot. Note, this will be true even for a
+ redirect though ``SET_REDIRECT`` does not flip ``user_modify``. This should
+ be fixed -- ``unset-manifest`` should not be a ``user_modify``.
+
+* ``tier-flush``
+
+ Flush the object which has chunks to the chunk pool. ::
+
+ void tier_flush();
+
+ rados -p base_pool tier-flush <obj-name>
+
+ Included in the ``user_modify=false`` ``ignorelist``, does not trigger a clone.
+
+ Does not evict the extents.
+
+
+ceph-dedup-tool
+===============
+
+``ceph-dedup-tool`` has two features: finding an optimal chunk offset for dedup chunking
+and fixing the reference count (see ``./refcount.rst``).
+
+* Find an optimal chunk offset
+
+ a. Fixed chunk
+
+ To find out a fixed chunk length, you need to run the following command many
+ times while changing the ``chunk_size``. ::
+
+ ceph-dedup-tool --op estimate --pool $POOL --chunk-size chunk_size
+ --chunk-algorithm fixed --fingerprint-algorithm sha1|sha256|sha512
+
+ b. Rabin chunk(Rabin-Karp algorithm)
+
+ Rabin-Karp is a string-searching algorithm based
+ on a rolling hash. But a rolling hash is not enough to do deduplication because
+ we don't know the chunk boundary. So, we need content-based slicing using
+ a rolling hash for content-defined chunking.
+ The current implementation uses the simplest approach: look for chunk boundaries
+ by inspecting the rolling hash for pattern (like the
+ lower N bits are all zeroes).
+
+ Users who want to use deduplication need to find an ideal chunk offset.
+ To find out ideal chunk offset, users should discover
+ the optimal configuration for their data workload via ``ceph-dedup-tool``.
+ This information will then be used for object chunking through
+ the ``set-chunk`` API. ::
+
+ ceph-dedup-tool --op estimate --pool $POOL --min-chunk min_size
+ --chunk-algorithm rabin --fingerprint-algorithm rabin
+
+ ``ceph-dedup-tool`` has many options to utilize ``rabin chunk``.
+ These are options for ``rabin chunk``. ::
+
+ --mod-prime <uint64_t>
+ --rabin-prime <uint64_t>
+ --pow <uint64_t>
+ --chunk-mask-bit <uint32_t>
+ --window-size <uint32_t>
+ --min-chunk <uint32_t>
+ --max-chunk <uint64_t>
+
+ Users need to refer following equation to use above options for ``rabin chunk``. ::
+
+ rabin_hash =
+ (rabin_hash * rabin_prime + new_byte - old_byte * pow) % (mod_prime)
+
+ c. Fixed chunk vs content-defined chunk
+
+ Content-defined chunking may or not be optimal solution.
+ For example,
+
+ Data chunk ``A`` : ``abcdefgabcdefgabcdefg``
+
+ Let's think about Data chunk ``A``'s deduplication. The ideal chunk offset is
+ from ``1`` to ``7`` (``abcdefg``). So, if we use fixed chunk, ``7`` is optimal chunk length.
+ But, in the case of content-based slicing, the optimal chunk length
+ could not be found (dedup ratio will not be 100%).
+ Because we need to find optimal parameter such
+ as boundary bit, window size and prime value. This is as easy as fixed chunk.
+ But, content defined chunking is very effective in the following case.
+
+ Data chunk ``B`` : ``abcdefgabcdefgabcdefg``
+
+ Data chunk ``C`` : ``Tabcdefgabcdefgabcdefg``
+
+
+* Fix reference count
+
+ The key idea behind of reference counting for dedup is false-positive, which means
+ ``(manifest object (no ref),, chunk object(has ref))`` happen instead of
+ ``(manifest object (has ref), chunk 1(no ref))``.
+ To fix such inconsistencies, ``ceph-dedup-tool`` supports ``chunk_scrub``. ::
+
+ ceph-dedup-tool --op chunk_scrub --chunk_pool $CHUNK_POOL
+