Diffstat (limited to 'doc/dev/mon-osdmap-prune.rst')

 -rw-r--r--  doc/dev/mon-osdmap-prune.rst  |  415
 1 file changed, 415 insertions, 0 deletions

diff --git a/doc/dev/mon-osdmap-prune.rst b/doc/dev/mon-osdmap-prune.rst
new file mode 100644
index 000000000..6ff059b84
--- /dev/null
+++ b/doc/dev/mon-osdmap-prune.rst
@@ -0,0 +1,415 @@

===========================
FULL OSDMAP VERSION PRUNING
===========================

For each incremental osdmap epoch, the monitor will keep a full osdmap
epoch in the store.

While this is great when serving osdmap requests from clients, allowing
us to fulfill their request without having to recompute the full osdmap
from a myriad of incrementals, it can also become a burden once we start
keeping an unbounded number of osdmaps.

The monitors will attempt to keep a bounded number of osdmaps in the store.
This number is defined (and configurable) via ``mon_min_osdmap_epochs``, and
defaults to 500 epochs. Generally speaking, we will remove older osdmap
epochs once we go over this limit.

However, there are a few constraints to removing osdmaps. These are all
defined in ``OSDMonitor::get_trim_to()``.

In the event one of these conditions is not met, we may go over the bounds
defined by ``mon_min_osdmap_epochs``. And if the cluster does not meet the
trim criteria for some time (e.g., unclean pgs), the monitor may start
keeping a lot of osdmaps. This can start putting pressure on the underlying
key/value store, as well as on the available disk space.

One way to mitigate this problem would be to stop keeping full osdmap
epochs on disk. We would have to rebuild osdmaps on-demand, or grab them
from cache if they had been recently served. We would still have to keep
at least one osdmap, and apply all incrementals on top of either this
oldest map epoch kept in the store or a more recent map grabbed from cache.
While this would be feasible, it seems like a lot of cpu (and potentially
IO) would go into rebuilding osdmaps.

Additionally, this would prevent the aforementioned problem going forward,
but would do nothing for stores already in a state that would truly
benefit from not keeping osdmaps.

This brings us to full osdmap pruning.

Instead of not keeping full osdmap epochs, we are going to prune some of
them when we have too many.

Whether we have too many is dictated by a configurable option
``mon_osdmap_full_prune_min`` (default: 10000). The pruning algorithm will be
engaged once we go over this threshold.

We will not remove all ``mon_osdmap_full_prune_min`` full osdmap epochs
though. Instead, we are going to poke some holes in the sequence of full
maps. By default, we will keep one full osdmap per 10 maps since the last
map kept; i.e., if we keep epoch 1, we will also keep epoch 10 and remove
full map epochs 2 to 9. The size of this interval is configurable with
``mon_osdmap_full_prune_interval``.

Essentially, we are proposing to keep ~10% of the full maps, but we will
always honour the minimum number of osdmap epochs, as defined by
``mon_min_osdmap_epochs``, and these won't be used for the count of the
minimum versions to prune. For instance, if we have on-disk versions
[1..50000], we would allow the pruning algorithm to operate only over
osdmap epochs [1..49500); but, if we have on-disk versions [1..10200], we
won't be pruning because the algorithm would only operate on versions
[1..9700), and this interval contains fewer versions than the minimum
required by ``mon_osdmap_full_prune_min``.
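
The interaction between these two options can be condensed into a small
standalone sketch; the ``should_prune`` name and the plain integer arguments
below are illustrative only, not the actual monitor code::

  #include <cstdint>
  #include <iostream>

  using version_t = uint64_t;

  // illustrative stand-ins for the configuration options discussed above
  constexpr version_t mon_min_osdmap_epochs     = 500;
  constexpr version_t mon_osdmap_full_prune_min = 10000;

  // Is the range of on-disk full maps [first, last] large enough to engage
  // the pruning algorithm?
  bool should_prune(version_t first, version_t last)
  {
    if (last - first < mon_min_osdmap_epochs)
      return false;                 // not even enough epochs to consider
    version_t prune_to = last - mon_min_osdmap_epochs;
    // prune only if [first, prune_to) holds at least
    // mon_osdmap_full_prune_min versions
    return prune_to - first >= mon_osdmap_full_prune_min;
  }

  int main()
  {
    std::cout << should_prune(1, 50000) << "\n";  // 1: [1..49500) is big enough
    std::cout << should_prune(1, 10200) << "\n";  // 0: [1..9700) is too small
  }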

ALGORITHM
=========

Say we have 50,000 osdmap epochs in the store, and we're using the
defaults for all configurable options.

::

  -----------------------------------------------------------
  |1|2|..|10|11|..|100|..|1000|..|10000|10001|..|49999|50000|
  -----------------------------------------------------------
   ^ first                                            last ^

We will prune when all the following constraints are met:

1. the number of versions is greater than ``mon_min_osdmap_epochs``;

2. the number of versions between ``first`` and ``prune_to`` is greater than
   (or equal to) ``mon_osdmap_full_prune_min``, with ``prune_to`` being equal
   to ``last`` minus ``mon_min_osdmap_epochs``.

If any of these conditions fails, we will *not* prune any maps.

Furthermore, if we have been pruning but no longer satisfy at least one of
the above constraints, we will not continue to prune. In essence, we only
prune full osdmaps if the number of epochs in the store warrants it.

As pruning will create gaps in the sequence of full maps, we need to keep
track of the intervals of missing maps. We do this by keeping a manifest of
pinned maps -- i.e., a list of maps that, by being pinned, are not to be
pruned.

While pinned maps are not removed from the store, the maps between two
consecutive pinned maps will be; the number of maps to be removed is dictated
by the configurable option ``mon_osdmap_full_prune_interval``. The algorithm
makes an effort to keep pinned maps apart by as many maps as defined by this
option, but in corner cases it may allow smaller intervals. Additionally,
because this option is read each time a prune iteration occurs, the interval
may change if the user changes the config option while pruning is in
progress.

Pinning maps is performed lazily: we will be pinning maps as we remove maps.
This grants us more flexibility to change the prune interval while pruning is
happening, and also considerably simplifies the algorithm, as well as the
information we need to keep in the manifest. Below we show a simplified
version of the algorithm::

  manifest.pin(first)
  last_to_prune = last - mon_min_osdmap_epochs

  while manifest.get_last_pinned() + prune_interval < last_to_prune AND
        last_to_prune - first > mon_min_osdmap_epochs AND
        last_to_prune - first > mon_osdmap_full_prune_min AND
        num_pruned < mon_osdmap_full_prune_txsize:

      last_pinned = manifest.get_last_pinned()
      new_pinned = last_pinned + prune_interval
      manifest.pin(new_pinned)
      for e in (last_pinned .. new_pinned):
        store.erase(e)
        ++num_pruned

In essence, the algorithm ensures that the first version in the store is
*always* pinned. After all, we need a starting point when rebuilding maps, and
we can't simply remove the earliest map we have; otherwise we would be unable
to rebuild maps for the very first pruned interval.

Once we have at least one pinned map, each iteration of the algorithm can
simply base itself on the manifest's last pinned map (which we can obtain by
reading the element at the tail of the manifest's pinned maps list).

Next, we determine the interval of maps to be removed: all the maps from
``last_pinned`` up to ``new_pinned``, where ``new_pinned`` is nothing more
than ``last_pinned`` plus ``mon_osdmap_full_prune_interval``. All the maps
between these two values, ``last_pinned`` and ``new_pinned``, can be removed,
given that ``new_pinned`` has been pinned.
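
For reference, the same loop can be written as a compilable, self-contained
sketch. The ``std::map`` standing in for the full-map store, the
``prune_pass`` name and the hard-coded option values are for illustration
only; the real implementation lives in ``OSDMonitor`` and queues its changes
in a store transaction::

  #include <cstdint>
  #include <map>
  #include <set>
  #include <string>

  using version_t = uint64_t;

  // illustrative values for the configurable options
  constexpr version_t mon_min_osdmap_epochs          = 500;
  constexpr version_t mon_osdmap_full_prune_min      = 10000;
  constexpr version_t mon_osdmap_full_prune_interval = 10;
  constexpr version_t mon_osdmap_full_prune_txsize   = 100;

  struct osdmap_manifest_t {
    std::set<version_t> pinned;
    void pin(version_t v) { pinned.insert(v); }
    version_t get_last_pinned() const { return *pinned.rbegin(); }
  };

  // One prune pass: pin every prune_interval-th epoch and erase the full
  // maps in between, stopping after roughly txsize removals.
  void prune_pass(std::map<version_t, std::string>& full_store,
                  osdmap_manifest_t& manifest,
                  version_t first, version_t last)
  {
    version_t num_pruned = 0;
    manifest.pin(first);
    const version_t last_to_prune = last - mon_min_osdmap_epochs;

    while (manifest.get_last_pinned() + mon_osdmap_full_prune_interval <
               last_to_prune &&
           last_to_prune - first > mon_min_osdmap_epochs &&
           last_to_prune - first > mon_osdmap_full_prune_min &&
           num_pruned < mon_osdmap_full_prune_txsize) {
      const version_t last_pinned = manifest.get_last_pinned();
      const version_t new_pinned = last_pinned + mon_osdmap_full_prune_interval;
      manifest.pin(new_pinned);
      for (version_t e = last_pinned + 1; e < new_pinned; ++e) {
        full_store.erase(e);  // queued in a single transaction in the monitor
        ++num_pruned;
      }
    }
  }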

The algorithm ceases to execute as soon as one of the two initial
preconditions is not met, or if we fail to meet two additional conditions
that have no weight on the algorithm's correctness:

1. We will stop if we are not able to create a new pruning interval properly
   aligned with ``mon_osdmap_full_prune_interval`` that is lower than
   ``last_to_prune``. There is no particular technical reason why we enforce
   this requirement, besides allowing us to keep the intervals at an expected
   size, and preventing the small, irregular intervals that would otherwise be
   bound to happen eventually (e.g., if pruning continues over the course of
   several iterations, removing one or two or three maps each time).

2. We will stop once we know that we have pruned more than a certain number of
   maps. This value is defined by ``mon_osdmap_full_prune_txsize``, and
   ensures we don't spend an unbounded number of cycles pruning maps. We don't
   enforce this value religiously (deletes do not cost much), but we make an
   effort to honor it.

We could do the removal in one go, but we have no idea how long that would
take. Therefore, we perform several iterations, removing at most
``mon_osdmap_full_prune_txsize`` osdmaps per iteration.

In the end, our on-disk map sequence will look similar to::

  ------------------------------------------
  |1|10|20|30|..|49500|49501|..|49999|50000|
  ------------------------------------------
   ^ first                          last ^

Because we are not pruning all versions in one go, we need to keep state
about how far along we are in our pruning. With that in mind, we have
created a data structure, ``osdmap_manifest_t``, that holds the set of pinned
maps::

  struct osdmap_manifest_t:
    set<version_t> pinned;

Given we only pin maps while we are pruning, we don't need to keep track of
additional state about the last pruned version. We know as a matter of fact
that we have pruned all the intermediate maps between any two consecutive
pinned maps.

The question one could ask, though, is how we can be sure we pruned all the
intermediate maps if, for instance, the monitor crashes. To ensure we are
protected against such an event, we always write the osdmap manifest to disk
in the same transaction that deletes the maps. This way we have the guarantee
that, if the monitor crashes, we will read the latest version of the manifest:
either it contains the newly pinned maps, meaning the in-between maps were
also pruned; or it is the previous version of the osdmap manifest, which does
not contain the maps we were pinning at the time we crashed, given that the
transaction that would have written the updated osdmap manifest (alongside the
map removals) was never applied.

The osdmap manifest is written to the store each time we prune, with an
updated list of pinned maps. It is written in the transaction effectively
pruning the maps, so we guarantee the manifest is always up to date. As a
consequence, the first time we write the osdmap manifest is the first time we
prune. If an osdmap manifest does not exist, we can be certain we do not hold
pruned map intervals.

We rely on the manifest to ascertain whether we have pruned map intervals. In
theory, this will always be the on-disk osdmap manifest, but we make sure to
read the on-disk osdmap manifest each time we update from paxos; this way we
always have an up-to-date in-memory osdmap manifest.
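
Because the manifest is the authoritative record of which full maps survived
pruning, it is also the natural starting point whenever one of the pruned
epochs has to be rebuilt (for instance, when trimming lands on a pruned epoch,
as discussed in the next section). A minimal sketch of such a rebuild,
assuming hypothetical ``load_full``/``load_inc``/``apply`` helpers rather than
the real ``OSDMap`` and store interfaces::

  #include <cstdint>
  #include <iterator>
  #include <set>

  using version_t = uint64_t;

  struct OSDMap {};          // stand-ins for the real map types
  struct Incremental {};

  // stubs standing in for store reads and incremental application
  OSDMap load_full(version_t) { return {}; }
  Incremental load_inc(version_t) { return {}; }
  void apply(OSDMap&, const Incremental&) {}

  // Rebuild the full map for epoch e: start from the closest pinned epoch at
  // or below e (one always exists, since the first version is always pinned)
  // and replay the incrementals up to e.
  OSDMap rebuild_full(const std::set<version_t>& pinned, version_t e)
  {
    auto it = std::prev(pinned.upper_bound(e));  // greatest pinned epoch <= e
    OSDMap m = load_full(*it);
    for (version_t v = *it + 1; v <= e; ++v)
      apply(m, load_inc(v));
    return m;
  }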

Once we finish pruning maps, we keep the manifest in the store, to allow us
to easily find which maps have been pinned (instead of scanning the store
until we find a map). This has the added benefit of allowing us to quickly
figure out the next interval we need to prune (i.e., the last pinned map plus
the prune interval). This doesn't mean, however, that we will keep the osdmap
manifest forever: it will no longer be required once the monitor trims osdmaps
and the earliest available epoch in the store is greater than the last map we
pruned.

The same conditions from ``OSDMonitor::get_trim_to()`` that force the monitor
to keep a lot of osdmaps, thus requiring us to prune, may eventually change
and allow the monitor to remove some of its oldest maps.

MAP TRIMMING
------------

If the monitor trims maps, we must then adjust the osdmap manifest to
reflect our pruning status, or remove the manifest entirely if it no longer
makes sense to keep it. For instance, take the map sequence from before, but
let us assume we did not finish pruning all the maps::

  -------------------------------------------------------------
  |1|10|20|30|..|490|500|501|502|..|49500|49501|..|49999|50000|
  -------------------------------------------------------------
   ^ first           ^ pinned.last()                   last ^

  pinned = {1, 10, 20, ..., 490, 500}

Now let us assume that the monitor will trim up to epoch 501. This means
removing all maps prior to epoch 501, and updating the ``first_committed``
pointer to ``501``. Given that removing all those maps would invalidate our
existing pruning efforts, we can consider our pruning finished and drop our
osdmap manifest. Doing so also simplifies starting a new prune, if all the
starting conditions are met once we have refreshed our state from the store.

We would then have the following map sequence::

  ---------------------------------------
  |501|502|..|49500|49501|..|49999|50000|
  ---------------------------------------
   ^ first                       last ^

However, imagine a slightly more convoluted scenario: the monitor will trim
up to epoch 491. In this case, epoch 491 has previously been pruned from the
store.

Given we will always need to have the oldest known map in the store, before
we trim we have to check whether that map is in the prune interval (i.e.,
whether said map epoch belongs to ``[ pinned.first()..pinned.last() )``).
If so, we need to check if this is a pinned map, in which case we don't have
much to be concerned about aside from removing the lower epochs from the
manifest's pinned list. On the other hand, if the map being trimmed to is not
a pinned map, we will need to rebuild said map and pin it, and only then will
we remove the pinned maps prior to the map's epoch.

In this case, we would end up with the following sequence::

  -----------------------------------------------
  |491|500|501|502|..|49500|49501|..|49999|50000|
  -----------------------------------------------
   ^   ^- pinned.last()                  last ^
   `- first

There is still an edge case that we should mention. Consider that we are
going to trim up to epoch 499, which is the very last pruned epoch.

Much like the scenario above, we would end up writing osdmap epoch 499 to
the store; but what should we do about pinned maps and pruning?

The simplest solution is to drop the osdmap manifest. After all, given we
are trimming to the last pruned map, and we are rebuilding this map, we can
guarantee that all maps greater than epoch 499 are sequential (because we
have not pruned any of them). Dropping the osdmap manifest in this case is
essentially the same as if we were trimming past the last pruned epoch: we
can prune again later if we meet the required conditions.
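
Putting these trimming scenarios together, the adjustment of the pinned set
can be sketched as follows. The ``adjust_pinned_on_trim`` name and the bare
``std::set`` are illustrative stand-ins, and the real code must also rebuild
and write the full map for ``trim_to`` when that epoch falls inside a pruned
interval::

  #include <cstdint>
  #include <set>

  using version_t = uint64_t;

  // Sketch: adjust the set of pinned epochs when the monitor trims up to
  // trim_to (maps prior to trim_to are removed; trim_to becomes the new
  // first_committed).
  void adjust_pinned_on_trim(std::set<version_t>& pinned, version_t trim_to)
  {
    if (pinned.empty())
      return;                                  // nothing was ever pruned
    if (trim_to + 1 >= *pinned.rbegin()) {
      // Trimming to (or past) the last pruned epoch: no gaps survive the
      // trim, so the manifest is no longer needed and is simply dropped.
      pinned.clear();
      return;
    }
    // If trim_to falls inside a pruned interval, its full map is rebuilt
    // from the previous pinned map and trim_to itself becomes pinned.
    pinned.insert(trim_to);
    // Forget every pinned epoch below the new first_committed.
    pinned.erase(pinned.begin(), pinned.find(trim_to));
  }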

And with this, we have covered full osdmap pruning in detail. Later in this
document one can find detailed `REQUIREMENTS, CONDITIONS & INVARIANTS` for the
whole algorithm, from pruning to trimming. Additionally, the next section
details several additional checks to guarantee the sanity of our configuration
options. Enjoy.


CONFIGURATION OPTIONS SANITY CHECKS
-----------------------------------

We perform additional checks before pruning to ensure all the configuration
options involved are sane:

1. If ``mon_osdmap_full_prune_interval`` is zero we will not prune; we
   require an actual positive number, greater than one, to be able to prune
   maps. If the interval were one, we would not actually be pruning any maps,
   as the interval between pinned maps would essentially be a single epoch.
   This means we would have zero maps in-between pinned maps, hence no maps
   would ever be pruned.

2. If ``mon_osdmap_full_prune_min`` is zero we will not prune; we require a
   value greater than zero so we know the threshold over which we should
   prune. We don't want to guess.

3. If ``mon_osdmap_full_prune_interval`` is greater than
   ``mon_osdmap_full_prune_min`` we will not prune, as it is impossible to
   ascertain a proper prune interval.

4. If ``mon_osdmap_full_prune_txsize`` is lower than
   ``mon_osdmap_full_prune_interval`` we will not prune; we require a
   ``txsize`` with a value at least equal to ``interval``, and (depending on
   the value of the latter) ideally higher.
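
Condensed into a single predicate, these checks look roughly like the
following sketch, which uses plain integers and an illustrative function name
rather than the monitor's actual configuration handling::

  #include <cstdint>

  bool prune_options_sane(uint64_t prune_min,
                          uint64_t prune_interval,
                          uint64_t prune_txsize)
  {
    if (prune_interval <= 1)            // 0 disables; 1 would never remove a map
      return false;
    if (prune_min == 0)                 // no threshold to prune over
      return false;
    if (prune_interval > prune_min)     // cannot derive a sensible interval
      return false;
    if (prune_txsize < prune_interval)  // txsize must cover at least one interval
      return false;
    return true;
  }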

REQUIREMENTS, CONDITIONS & INVARIANTS
-------------------------------------

REQUIREMENTS
~~~~~~~~~~~~

* All monitors in the quorum need to support pruning.

* Once pruning has been enabled, monitors not supporting pruning will not be
  allowed in the quorum, nor will they be allowed to synchronize.

* Removing the osdmap manifest results in disabling the pruning feature quorum
  requirement. This means that monitors not supporting pruning will be allowed
  to synchronize and join the quorum, provided they support any other required
  features.


CONDITIONS & INVARIANTS
~~~~~~~~~~~~~~~~~~~~~~~

* Pruning has never happened, or we have trimmed past its previous
  intervals::

    invariant: first_committed > 1

    condition: pinned.empty() AND !store.exists(manifest)


* Pruning has happened at least once::

    invariant: first_committed > 0
    invariant: !pinned.empty()
    invariant: pinned.first() == first_committed
    invariant: pinned.last() < last_committed

    precond:   pinned.last() < prune_to AND
               pinned.last() + prune_interval < prune_to

    postcond:  pinned.size() > old_pinned.size() AND
               (for each v in [pinned.first()..pinned.last()]:
                 if pinned.count(v) > 0: store.exists_full(v)
                 else: !store.exists_full(v)
               )


* Pruning has finished::

    invariant: first_committed > 0
    invariant: !pinned.empty()
    invariant: pinned.first() == first_committed
    invariant: pinned.last() < last_committed

    condition: pinned.last() == prune_to OR
               pinned.last() + prune_interval < prune_to


* Pruning intervals can be trimmed::

    precond:   OSDMonitor::get_trim_to() > 0

    condition: !pinned.empty()

    invariant: pinned.first() == first_committed
    invariant: pinned.last() < last_committed
    invariant: pinned.first() <= OSDMonitor::get_trim_to()
    invariant: pinned.last() >= OSDMonitor::get_trim_to()


* Trim pruned intervals::

    invariant: !pinned.empty()
    invariant: pinned.first() == first_committed
    invariant: pinned.last() < last_committed
    invariant: pinned.first() <= OSDMonitor::get_trim_to()
    invariant: pinned.last() >= OSDMonitor::get_trim_to()

    postcond:  pinned.empty() OR
               (pinned.first() == OSDMonitor::get_trim_to() AND
                pinned.last() > pinned.first() AND
                (for each v in [0..pinned.first()]:
                  !store.exists(v) AND
                  !store.exists_full(v)
                ) AND
                (for each m in [pinned.first()..pinned.last()]:
                  if pinned.count(m) > 0: store.exists_full(m)
                  else: !store.exists_full(m) AND store.exists(m)
                )
               )
    postcond:  !pinned.empty() OR
               (!store.exists(manifest) AND
                (for each v in [pinned.first()..pinned.last()]:
                  !store.exists(v) AND
                  !store.exists_full(v)
                )
               )