1 files changed, 208 insertions, 0 deletions
diff --git a/doc/dev/osd_internals/log_based_pg.rst b/doc/dev/osd_internals/log_based_pg.rst
new file mode 100644
index 000000000..5d1e560c0
--- /dev/null
+++ b/doc/dev/osd_internals/log_based_pg.rst
@@ -0,0 +1,208 @@
+.. _log-based-pg:
+
+============
+Log Based PG
+============
+
+Background
+==========
+
+Why PrimaryLogPG?
+-----------------
+
+Currently, consistency for all ceph pool types is ensured by primary
+log-based replication. This goes for both erasure-coded (EC) and
+replicated pools.
+
+Primary log-based replication
+-----------------------------
+
+Reads must return data written by any write which completed (where the
+client could possibly have received a commit message).  There are lots
+of ways to handle this, but Ceph's architecture makes it easy for
+everyone at any map epoch to know who the primary is.  Thus, the easy
+answer is to route all writes for a particular PG through a single
+ordering primary and then out to the replicas.  Though we only
+actually need to serialize writes on a single RADOS object (and even then,
+the partial ordering only really needs to provide an ordering between
+writes on overlapping regions), we might as well serialize writes on
+the whole PG since it lets us represent the current state of the PG
+using two numbers: the epoch of the map on the primary in which the
+most recent write started (this is a bit stranger than it might seem
+since map distribution itself is asynchronous -- see Peering and the
+concept of interval changes) and an increasing per-PG version number
+-- this is referred to in the code with type ``eversion_t`` and stored as
+``pg_info_t::last_update``.  Furthermore, we maintain a log of "recent"
+operations extending back at least far enough to include any
+*unstable* writes (writes which have been started but not committed)
+and objects which aren't uptodate locally (see recovery and
+backfill).  In practice, the log will extend much further
+(``osd_min_pg_log_entries`` when clean and ``osd_max_pg_log_entries`` when not
+clean) because it's handy for quickly performing recovery.
+
+Using this log, as long as we talk to a non-empty subset of the OSDs
+which must have accepted any completed writes from the most recent
+interval in which we accepted writes, we can determine a conservative
+log which must contain any write which has been reported to a client
+as committed.  There is some freedom here, we can choose any log entry
+between the oldest head remembered by an element of that set (any
+newer cannot have completed without that log containing it) and the
+newest head remembered (clearly, all writes in the log were started,
+so it's fine for us to remember them) as the new head.  This is the
+main point of divergence between replicated pools and EC pools in
+``PG/PrimaryLogPG``: replicated pools try to choose the newest valid
+option to avoid the client needing to replay those operations and
+instead recover the other copies.  EC pools instead try to choose
+the *oldest* option available to them.
+
+The reason for this gets to the heart of the rest of the differences
+in implementation: one copy will not generally be enough to
+reconstruct an EC object.  Indeed, there are encodings where some log
+combinations would leave unrecoverable objects (as with a ``k=4,m=2`` encoding
+where 3 of the replicas remember a write, but the other 3 do not -- we
+don't have 3 copies of either version).  For this reason, log entries
+representing *unstable* writes (writes not yet committed to the
+client) must be rollbackable using only local information on EC pools.
+Log entries in general may therefore be rollbackable (and in that case,
+via a delayed application or via a set of instructions for rolling
+back an inplace update) or not.  Replicated pool log entries are
+never able to be rolled back.
+
+For more details, see ``PGLog.h/cc``, ``osd_types.h:pg_log_t``,
+``osd_types.h:pg_log_entry_t``, and peering in general.
+
+ReplicatedBackend/ECBackend unification strategy
+================================================
+
+PGBackend
+---------
+
+The fundamental difference between replication and erasure coding
+is that replication can do destructive updates while erasure coding
+cannot.  It would be really annoying if we needed to have two entire
+implementations of ``PrimaryLogPG`` since there
+are really only a few fundamental differences:
+
+#. How reads work -- async only, requires remote reads for EC
+#. How writes work -- either restricted to append, or must write aside and do a
+   tpc
+#. Whether we choose the oldest or newest possible head entry during peering
+#. A bit of extra information in the log entry to enable rollback
+
+and so many similarities
+
+#. All of the stats and metadata for objects
+#. The high level locking rules for mixing client IO with recovery and scrub
+#. The high level locking rules for mixing reads and writes without exposing
+   uncommitted state (which might be rolled back or forgotten later)
+#. The process, metadata, and protocol needed to determine the set of osds
+   which participated in the most recent interval in which we accepted writes
+#. etc.
+
+Instead, we choose a few abstractions (and a few kludges) to paper over the differences:
+
+#. ``PGBackend``
+#. ``PGTransaction``
+#. ``PG::choose_acting`` chooses between ``calc_replicated_acting`` and ``calc_ec_acting``
+#. Various bits of the write pipeline disallow some operations based on pool
+   type -- like omap operations, class operation reads, and writes which are
+   not aligned appends (officially, so far) for EC
+#. Misc other kludges here and there
+
+``PGBackend`` and ``PGTransaction`` enable abstraction of differences 1 and 2 above
+and the addition of 4 as needed to the log entries.
+
+The replicated implementation is in ``ReplicatedBackend.h/cc`` and doesn't
+require much additional explanation.  More detail on the ``ECBackend`` can be
+found in ``doc/dev/osd_internals/erasure_coding/ecbackend.rst``.
+
+PGBackend Interface Explanation
+===============================
+
+Note: this is from a design document that predated the Firefly release
+and is probably out of date w.r.t. some of the method names.
+
+Readable vs Degraded
+--------------------
+
+For a replicated pool, an object is readable IFF it is present on
+the primary (at the right version).  For an EC pool, we need at least
+`m` shards present to perform a read, and we need it on the primary.  For
+this reason, ``PGBackend`` needs to include some interfaces for determining
+when recovery is required to serve a read vs a write.  This also
+changes the rules for when peering has enough logs to prove that it
+
+Core Changes:
+
+- | ``PGBackend`` needs to be able to return ``IsPG(Recoverable|Readable)Predicate``
+  | objects to allow the user to make these determinations.
+
+Client Reads
+------------
+
+Reads from a replicated pool can always be satisfied
+synchronously by the primary OSD.  Within an erasure coded pool,
+the primary will need to request data from some number of replicas in
+order to satisfy a read.  ``PGBackend`` will therefore need to provide
+separate ``objects_read_sync`` and ``objects_read_async`` interfaces where
+the former won't be implemented by the ``ECBackend``.
+
+``PGBackend`` interfaces:
+
+- ``objects_read_sync``
+- ``objects_read_async``
+
+Scrubs
+------
+
+We currently have two scrub modes with different default frequencies:
+
+#. [shallow] scrub: compares the set of objects and metadata, but not
+   the contents
+#. deep scrub: compares the set of objects, metadata, and a CRC32 of
+   the object contents (including omap)
+
+The primary requests a scrubmap from each replica for a particular
+range of objects.  The replica fills out this scrubmap for the range
+of objects including, if the scrub is deep, a CRC32 of the contents of
+each object.  The primary gathers these scrubmaps from each replica
+and performs a comparison identifying inconsistent objects.
+
+Most of this can work essentially unchanged with erasure coded PG with
+the caveat that the ``PGBackend`` implementation must be in charge of
+actually doing the scan.
+
+
+``PGBackend`` interfaces:
+
+- ``be_*``
+
+Recovery
+--------
+
+The logic for recovering an object depends on the backend.  With
+the current replicated strategy, we first pull the object replica
+to the primary and then concurrently push it out to the replicas.
+With the erasure coded strategy, we probably want to read the
+minimum number of replica chunks required to reconstruct the object
+and push out the replacement chunks concurrently.
+
+Another difference is that objects in erasure coded PG may be
+unrecoverable without being unfound.  The ``unfound`` state
+should probably be renamed to ``unrecoverable``.  Also, the
+``PGBackend`` implementation will have to be able to direct the search
+for PG replicas with unrecoverable object chunks and to be able
+to determine whether a particular object is recoverable.
+
+
+Core changes:
+
+- ``s/unfound/unrecoverable``
+
+PGBackend interfaces:
+
+- `on_local_recover_start <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L60>`_
+- `on_local_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L66>`_
+- `on_global_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L78>`_
+- `on_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L83>`_
+- `begin_peer_recover <https://github.com/ceph/ceph/blob/firefly/src/osd/PGBackend.h#L90>`_