1 files changed, 415 insertions, 0 deletions
diff --git a/doc/dev/mon-osdmap-prune.rst b/doc/dev/mon-osdmap-prune.rst
new file mode 100644
index 000000000..6ff059b84
--- /dev/null
+++ b/doc/dev/mon-osdmap-prune.rst
@@ -0,0 +1,415 @@
+===========================
+FULL OSDMAP VERSION PRUNING
+===========================
+
+For each incremental osdmap epoch, the monitor will keep a full osdmap
+epoch in the store.
+
+While this is great when serving osdmap requests from clients, allowing
+us to fulfill their request without having to recompute the full osdmap
+from a myriad of incrementals, it can also become a burden once we start
+keeping an unbounded number of osdmaps.
+
+The monitors will attempt to keep a bounded number of osdmaps in the store.
+This number is defined (and configurable) via ``mon_min_osdmap_epochs``, and
+defaults to 500 epochs. Generally speaking, we will remove older osdmap
+epochs once we go over this limit.
+
+However, there are a few constraints to removing osdmaps. These are all
+defined in ``OSDMonitor::get_trim_to()``.
+
+In the event one of these conditions is not met, we may go over the bounds
+defined by ``mon_min_osdmap_epochs``. And if the cluster does not meet the
+trim criteria for some time (e.g., unclean pgs), the monitor may start
+keeping a lot of osdmaps. This can start putting pressure on the underlying
+key/value store, as well as on the available disk space.
+
+One way to mitigate this problem would be to stop keeping full osdmap
+epochs on disk. We would have to rebuild osdmaps on-demand, or grab them
+from cache if they had been recently served. We would still have to keep
+at least one osdmap, and apply all incrementals on top of either this
+oldest map epoch kept in the store or a more recent map grabbed from cache.
+While this would be feasible, it seems like a lot of cpu (and potentially
+IO) would be going into rebuilding osdmaps.
+
+Additionally, this would prevent the aforementioned problem going forward,
+but would do nothing for stores currently in a state that would truly
+benefit from not keeping osdmaps.
+
+This brings us to full osdmap pruning.
+
+Instead of not keeping full osdmap epochs, we are going to prune some of
+them when we have too many.
+
+Deciding whether we have too many will be dictated by a configurable option
+``mon_osdmap_full_prune_min`` (default: 10000). The pruning algorithm will be
+engaged once we go over this threshold.
+
+We will not remove all ``mon_osdmap_full_prune_min`` full osdmap epochs
+though. Instead, we are going to poke some holes in the sequence of full
+maps. By default, we will keep one full osdmap per 10 maps since the last
+map kept; i.e., if we keep epoch 1, we will also keep epoch 10 and remove
+full map epochs 2 to 9. The size of this interval is configurable with
+``mon_osdmap_full_prune_interval``.
+
+Essentially, we are proposing to keep ~10% of the full maps, but we will
+always honour the minimum number of osdmap epochs, as defined by
+``mon_min_osdmap_epochs``, and these won't be used for the count of the
+minimum versions to prune. For instance, if we have on-disk versions
+[1..50000], we would allow the pruning algorithm to operate only over
+osdmap epochs [1..49500); but, if have on-disk versions [1..10200], we
+won't be pruning because the algorithm would only operate on versions
+[1..9700), and this interval contains less versions than the minimum
+required by ``mon_osdmap_full_prune_min``.
+
+
+ALGORITHM
+=========
+
+Say we have 50,000 osdmap epochs in the store, and we're using the
+defaults for all configurable options.
+
+::
+
+    -----------------------------------------------------------
+    |1|2|..|10|11|..|100|..|1000|..|10000|10001|..|49999|50000|
+    -----------------------------------------------------------
+     ^ first                                            last ^
+
+We will prune when all the following constraints are met:
+
+1. number of versions is greater than ``mon_min_osdmap_epochs``;
+
+2. the number of versions between ``first`` and ``prune_to`` is greater (or
+   equal) than ``mon_osdmap_full_prune_min``, with ``prune_to`` being equal to
+   ``last`` minus ``mon_min_osdmap_epochs``.
+
+If any of these conditions fails, we will *not* prune any maps.
+
+Furthermore, if it is known that we have been pruning, but since then we
+are no longer satisfying at least one of the above constraints, we will
+not continue to prune. In essence, we only prune full osdmaps if the
+number of epochs in the store so warrants it.
+
+As pruning will create gaps in the sequence of full maps, we need to keep
+track of the intervals of missing maps. We do this by keeping a manifest of
+pinned maps -- i.e., a list of maps that, by being pinned, are not to be
+pruned.
+
+While pinned maps are not removed from the store, maps between two consecutive
+pinned maps will; and the number of maps to be removed will be dictated by the
+configurable option ``mon_osdmap_full_prune_interval``. The algorithm makes an
+effort to keep pinned maps apart by as many maps as defined by this option,
+but in the event of corner cases it may allow smaller intervals. Additionally,
+as this is a configurable option that is read any time a prune iteration
+occurs, there is the possibility this interval will change if the user changes
+this config option.
+
+Pinning maps is performed lazily: we will be pinning maps as we are removing
+maps. This grants us more flexibility to change the prune interval while
+pruning is happening, but also simplifies considerably the algorithm, as well
+as the information we need to keep in the manifest. Below we show a simplified
+version of the algorithm:::
+
+    manifest.pin(first)
+    last_to_prune = last - mon_min_osdmap_epochs
+
+    while manifest.get_last_pinned() + prune_interval < last_to_prune AND
+          last_to_prune - first > mon_min_osdmap_epochs AND
+          last_to_prune - first > mon_osdmap_full_prune_min AND
+          num_pruned < mon_osdmap_full_prune_txsize:
+      
+      last_pinned = manifest.get_last_pinned()
+      new_pinned = last_pinned + prune_interval
+      manifest.pin(new_pinned)
+      for e in (last_pinned .. new_pinned):
+        store.erase(e)
+        ++num_pruned
+
+In essence, the algorithm ensures that the first version in the store is
+*always* pinned. After all, we need a starting point when rebuilding maps, and
+we can't simply remove the earliest map we have; otherwise we would be unable
+to rebuild maps for the very first pruned interval.
+
+Once we have at least one pinned map, each iteration of the algorithm can
+simply base itself on the manifest's last pinned map (which we can obtain by
+reading the element at the tail of the manifest's pinned maps list).
+
+We'll next need to determine the interval of maps to be removed: all the maps
+from ``last_pinned`` up to ``new_pinned``, which in turn is nothing more than
+``last_pinned`` plus ``mon_osdmap_full_prune_interval``. We know that all maps
+between these two values, ``last_pinned`` and ``new_pinned`` can be removed,
+considering ``new_pinned`` has been pinned.
+
+The algorithm ceases to execute as soon as one of the two initial
+preconditions is not met, or if we do not meet two additional conditions that
+have no weight on the algorithm's correctness:
+
+1. We will stop if we are not able to create a new pruning interval properly
+   aligned with ``mon_osdmap_full_prune_interval`` that is lower than
+   ``last_pruned``. There is no particular technical reason why we enforce
+   this requirement, besides allowing us to keep the intervals with an
+   expected size, and preventing small, irregular intervals that would be
+   bound to happen eventually (e.g., pruning continues over the course of
+   several iterations, removing one or two or three maps each time).
+
+2. We will stop once we know that we have pruned more than a certain number of
+   maps. This value is defined by ``mon_osdmap_full_prune_txsize``, and
+   ensures we don't spend an unbounded number of cycles pruning maps. We don't
+   enforce this value religiously (deletes do not cost much), but we make an
+   effort to honor it.
+
+We could do the removal in one go, but we have no idea how long that would
+take. Therefore, we will perform several iterations, removing at most
+``mon_osdmap_full_prune_txsize`` osdmaps per iteration.
+
+In the end, our on-disk map sequence will look similar to::
+
+    ------------------------------------------
+    |1|10|20|30|..|49500|49501|..|49999|50000|
+    ------------------------------------------
+     ^ first                           last ^
+
+
+Because we are not pruning all versions in one go, we need to keep state
+about how far along on our pruning we are. With that in mind, we have
+created a data structure, ``osdmap_manifest_t``, that holds the set of pinned
+maps:::
+
+    struct osdmap_manifest_t:
+        set<version_t> pinned;
+
+Given we are only pinning maps while we are pruning, we don't need to keep
+track of additional state about the last pruned version. We know as a matter
+of fact that we have pruned all the intermediate maps between any two
+consecutive pinned maps.
+
+The question one could ask, though, is how can we be sure we pruned all the
+intermediate maps if, for instance, the monitor crashes. To ensure we are
+protected against such an event, we always write the osdmap manifest to disk
+on the same transaction that is deleting the maps. This way we have the
+guarantee that, if the monitor crashes, we will read the latest version of the
+manifest: either containing the newly pinned maps, meaning we also pruned the
+in-between maps; or we will find the previous version of the osdmap manifest,
+which will not contain the maps we were pinning at the time we crashed, given
+the transaction on which we would be writing the updated osdmap manifest was
+not applied (alongside with the maps removal).
+
+The osdmap manifest will be written to the store each time we prune, with an
+updated list of pinned maps. It is written in the transaction effectively
+pruning the maps, so we guarantee the manifest is always up to date. As a
+consequence of this criteria, the first time we will write the osdmap manifest
+is the first time we prune. If an osdmap manifest does not exist, we can be
+certain we do not hold pruned map intervals.
+
+We will rely on the manifest to ascertain whether we have pruned maps
+intervals. In theory, this will always be the on-disk osdmap manifest, but we
+make sure to read the on-disk osdmap manifest each time we update from paxos;
+this way we always ensure having an up to date in-memory osdmap manifest.
+
+Once we finish pruning maps, we will keep the manifest in the store, to
+allow us to easily find which maps have been pinned (instead of checking
+the store until we find a map). This has the added benefit of allowing us to
+quickly figure out which is the next interval we need to prune (i.e., last
+pinned plus the prune interval). This doesn't however mean we will forever
+keep the osdmap manifest: the osdmap manifest will no longer be required once
+the monitor trims osdmaps and the earliest available epoch in the store is
+greater than the last map we pruned.
+
+The same conditions from ``OSDMonitor::get_trim_to()`` that force the monitor
+to keep a lot of osdmaps, thus requiring us to prune, may eventually change
+and allow the monitor to remove some of its oldest maps.
+
+MAP TRIMMING
+------------
+
+If the monitor trims maps, we must then adjust the osdmap manifest to
+reflect our pruning status, or remove the manifest entirely if it no longer
+makes sense to keep it. For instance, take the map sequence from before, but
+let us assume we did not finish pruning all the maps.::
+
+    -------------------------------------------------------------
+    |1|10|20|30|..|490|500|501|502|..|49500|49501|..|49999|50000|
+    -------------------------------------------------------------
+     ^ first            ^ pinned.last()                   last ^
+
+    pinned = {1, 10, 20, ..., 490, 500}
+
+Now let us assume that the monitor will trim up to epoch 501. This means
+removing all maps prior to epoch 501, and updating the ``first_committed``
+pointer to ``501``. Given removing all those maps would invalidate our
+existing pruning efforts, we can consider our pruning has finished and drop
+our osdmap manifest. Doing so also simplifies starting a new prune, if all
+the starting conditions are met once we refreshed our state from the
+store.
+
+We would then have the following map sequence: ::
+
+    ---------------------------------------
+    |501|502|..|49500|49501|..|49999|50000|
+    ---------------------------------------
+     ^ first                        last ^
+
+However, imagine a slightly more convoluted scenario: the monitor will trim
+up to epoch 491. In this case, epoch 491 has been previously pruned from the
+store.
+
+Given we will always need to have the oldest known map in the store, before
+we trim we will have to check whether that map is in the prune interval
+(i.e., if said map epoch belongs to ``[ pinned.first()..pinned.last() )``).
+If so, we need to check if this is a pinned map, in which case we don't have
+much to be concerned aside from removing lower epochs from the manifest's
+pinned list. On the other hand, if the map being trimmed to is not a pinned
+map, we will need to rebuild said map and pin it, and only then will we remove
+the pinned maps prior to the map's epoch. 
+
+In this case, we would end up with the following sequence:::
+
+    -----------------------------------------------
+    |491|500|501|502|..|49500|49501|..|49999|50000|
+    -----------------------------------------------
+     ^   ^- pinned.last()                   last ^
+     `- first
+
+There is still an edge case that we should mention. Consider that we are
+going to trim up to epoch 499, which is the very last pruned epoch.
+
+Much like the scenario above, we would end up writing osdmap epoch 499 to
+the store; but what should we do about pinned maps and pruning?
+
+The simplest solution is to drop the osdmap manifest. After all, given we
+are trimming to the last pruned map, and we are rebuilding this map, we can
+guarantee that all maps greater than e 499 are sequential (because we have
+not pruned any of them). In essence, dropping the osdmap manifest in this
+case is essentially the same as if we were trimming over the last pruned
+epoch: we can prune again later if we meet the required conditions.
+
+And, with this, we have fully dwelled into full osdmap pruning. Later in this
+document one can find detailed `REQUIREMENTS, CONDITIONS & INVARIANTS` for the
+whole algorithm, from pruning to trimming. Additionally, the next section
+details several additional checks to guarantee the sanity of our configuration
+options. Enjoy.
+
+
+CONFIGURATION OPTIONS SANITY CHECKS
+-----------------------------------
+
+We perform additional checks before pruning to ensure all configuration
+options involved are sane:
+
+1. If ``mon_osdmap_full_prune_interval`` is zero we will not prune; we
+   require an actual positive number, greater than one, to be able to prune
+   maps. If the interval is one, we would not actually be pruning any maps, as
+   the interval between pinned maps would essentially be a single epoch. This
+   means we would have zero maps in-between pinned maps, hence no maps would
+   ever be pruned.
+
+2. If ``mon_osdmap_full_prune_min`` is zero we will not prune; we require a
+   positive, greater than zero, value so we know the threshold over which we
+   should prune. We don't want to guess.
+
+3. If ``mon_osdmap_full_prune_interval`` is greater than
+   ``mon_osdmap_full_prune_min`` we will not prune, as it is impossible to
+   ascertain a proper prune interval.
+
+4. If ``mon_osdmap_full_prune_txsize`` is lower than
+   ``mon_osdmap_full_prune_interval`` we will not prune; we require a
+   ``txsize`` with a value at least equal than ``interval``, and (depending on
+   the value of the latter) ideally higher.
+
+
+REQUIREMENTS, CONDITIONS & INVARIANTS
+-------------------------------------
+
+REQUIREMENTS
+~~~~~~~~~~~~
+
+* All monitors in the quorum need to support pruning.
+
+* Once pruning has been enabled, monitors not supporting pruning will not be
+  allowed in the quorum, nor will be allowed to synchronize.
+
+* Removing the osdmap manifest results in disabling the pruning feature quorum
+  requirement. This means that monitors not supporting pruning will be allowed
+  to synchronize and join the quorum, granted they support any other features
+  required.
+
+
+CONDITIONS & INVARIANTS
+~~~~~~~~~~~~~~~~~~~~~~~
+
+* Pruning has never happened, or we have trimmed past its previous
+  intervals:::
+
+    invariant: first_committed > 1
+
+    condition: pinned.empty() AND !store.exists(manifest)
+
+
+* Pruning has happened at least once:::
+
+    invariant: first_committed > 0
+    invariant: !pinned.empty())
+    invariant: pinned.first() == first_committed
+    invariant: pinned.last() < last_committed
+
+      precond: pinned.last() < prune_to AND
+               pinned.last() + prune_interval < prune_to
+
+     postcond: pinned.size() > old_pinned.size() AND
+               (for each v in [pinned.first()..pinned.last()]:
+                 if pinned.count(v) > 0: store.exists_full(v)
+                 else: !store.exists_full(v)
+               )
+
+
+* Pruning has finished:::
+
+    invariant: first_committed > 0
+    invariant: !pinned.empty()
+    invariant: pinned.first() == first_committed
+    invariant: pinned.last() < last_committed
+
+    condition: pinned.last() == prune_to OR
+               pinned.last() + prune_interval < prune_to
+
+
+* Pruning intervals can be trimmed:::
+
+    precond:   OSDMonitor::get_trim_to() > 0
+
+    condition: !pinned.empty()
+
+    invariant: pinned.first() == first_committed
+    invariant: pinned.last() < last_committed
+    invariant: pinned.first() <= OSDMonitor::get_trim_to()
+    invariant: pinned.last() >= OSDMonitor::get_trim_to()
+
+* Trim pruned intervals:::
+
+    invariant: !pinned.empty()
+    invariant: pinned.first() == first_committed
+    invariant: pinned.last() < last_committed
+    invariant: pinned.first() <= OSDMonitor::get_trim_to()
+    invariant: pinned.last() >= OSDMonitor::get_trim_to()
+
+    postcond:  pinned.empty() OR
+               (pinned.first() == OSDMonitor::get_trim_to() AND
+                pinned.last() > pinned.first() AND
+                (for each v in [0..pinned.first()]:
+                  !store.exists(v) AND
+                  !store.exists_full(v)
+                ) AND
+                (for each m in [pinned.first()..pinned.last()]:
+                  if pinned.count(m) > 0: store.exists_full(m)
+                  else: !store.exists_full(m) AND store.exists(m)
+                )
+               )
+    postcond:  !pinned.empty() OR
+               (!store.exists(manifest) AND
+                (for each v in [pinned.first()..pinned.last()]:
+                  !store.exists(v) AND
+                  !store.exists_full(v)
+                )
+               )
+