From 19fcec84d8d7d21e796c7624e521b60d28ee21ed Mon Sep 17 00:00:00 2001
From: Daniel Baumann
Date: Sun, 7 Apr 2024 20:45:59 +0200
Subject: Adding upstream version 16.2.11+ds.

Signed-off-by: Daniel Baumann
---
 doc/rados/operations/stretch-mode.rst | 215 ++++++++++++++++++++++++++++++++++
 1 file changed, 215 insertions(+)
 create mode 100644 doc/rados/operations/stretch-mode.rst

diff --git a/doc/rados/operations/stretch-mode.rst b/doc/rados/operations/stretch-mode.rst
new file mode 100644
index 000000000..6b4c9ba8a
--- /dev/null
+++ b/doc/rados/operations/stretch-mode.rst
@@ -0,0 +1,215 @@

.. _stretch_mode:

================
Stretch Clusters
================


Stretch Clusters
================
Ceph generally expects all parts of its network and overall cluster to be
equally reliable, with failures randomly distributed across the CRUSH map.
So you may lose a switch that knocks out a number of OSDs, but we expect
the remaining OSDs and monitors to route around that.

This is usually a good choice, but it may not work well in some
stretched cluster configurations where a significant part of your cluster
is stuck behind a single network component. For instance, consider a single
cluster located in multiple data centers, where you want to sustain the
loss of a full DC.

There are two standard configurations we've seen deployed, with either
two or three data centers (or, in clouds, availability zones). With two
zones, we expect each site to hold a copy of the data, and for a third
site to have a tiebreaker monitor (this can be a VM, or can have higher
latency than the main sites) to pick a winner if the network connection fails
and both DCs remain alive. For three sites, we expect a copy of the data and
an equal number of monitors in each site.

Note that the standard Ceph configuration will survive MANY failures of the
network or data centers and it will never compromise data consistency. If you
bring back enough Ceph servers following a failure, it will recover. If you
lose a data center, but can still form a quorum of monitors and have all the data
available (with enough copies to satisfy pools' ``min_size``, or CRUSH rules
that will re-replicate to meet it), Ceph will maintain availability.

What can't it handle?

Stretch Cluster Issues
======================
No matter what happens, Ceph will not compromise on data integrity
and consistency. If there's a failure in your network or a loss of nodes and
you can restore service, Ceph will return to normal functionality on its own.

But there are scenarios where you lose data availability despite having
enough servers available to satisfy Ceph's consistency and sizing constraints, or
where you may be surprised to not satisfy Ceph's constraints.
The first important category of these failures revolves around inconsistent
networks -- if there's a netsplit, Ceph may be unable to mark OSDs down and kick
them out of the acting PG sets despite the primary being unable to replicate data.
If this happens, IO will not be permitted, because Ceph can't satisfy its durability
guarantees.

The second important category of failures is when you think you have data replicated
across data centers, but the constraints aren't sufficient to guarantee this.
For instance, you might have data centers A and B, and your CRUSH rule targets 3 copies
and places a copy in each data center with a ``min_size`` of 2. The PG may go active with
2 copies in site A and no copies in site B, which means that if you then lose site A you
have lost data and Ceph can't operate on it. This situation is surprisingly difficult
to avoid with standard CRUSH rules.
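
If you want to check whether a pool's PGs are actually spread across both data
centers, one way (sketched here with a placeholder PG ID) is to look at a PG's
up and acting OSD sets and match the OSD IDs against the CRUSH tree:

.. prompt:: bash $

   ceph pg map 1.0
   ceph osd tree

If every copy of a PG maps to OSDs under the same datacenter bucket, you are
exposed to the failure described above.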

Stretch Mode
============
The new stretch mode is designed to handle the 2-site case. Three sites are
just as susceptible to netsplit issues, but are much more tolerant of
component availability outages than 2-site clusters are.

To enter stretch mode, you must set the location of each monitor, matching
your CRUSH map. For instance, to place ``mon.a`` in your first data center:

.. prompt:: bash $

   ceph mon set_location a datacenter=site1

Next, generate a CRUSH rule which will place 2 copies in each data center. This
will require editing the CRUSH map directly:

.. prompt:: bash $

   ceph osd getcrushmap > crush.map.bin
   crushtool -d crush.map.bin -o crush.map.txt

Now edit the ``crush.map.txt`` file to add a new rule. Here
there is only one other rule, so this is ID 1, but you may need
to use a different rule ID. We also have two datacenter buckets
named ``site1`` and ``site2``::

    rule stretch_rule {
            id 1
            type replicated
            min_size 1
            max_size 10
            step take site1
            step chooseleaf firstn 2 type host
            step emit
            step take site2
            step chooseleaf firstn 2 type host
            step emit
    }

Finally, inject the CRUSH map to make the rule available to the cluster:

.. prompt:: bash $

   crushtool -c crush.map.txt -o crush2.map.bin
   ceph osd setcrushmap -i crush2.map.bin

If you aren't already running your monitors in connectivity mode, do so with
the instructions in `Changing Monitor Elections`_.

.. _Changing Monitor Elections: ../change-mon-elections

And lastly, tell the cluster to enter stretch mode. Here, ``mon.e`` is the
tiebreaker and we are splitting across data centers. ``mon.e`` should also be
assigned a datacenter, one that differs from ``site1`` and ``site2``. For this
purpose you can create a third datacenter bucket named ``site3`` in your
CRUSH map and place ``mon.e`` there:

.. prompt:: bash $

   ceph mon set_location e datacenter=site3
   ceph mon enable_stretch_mode e stretch_rule datacenter

When stretch mode is enabled, the OSDs will only take PGs active when
they peer across data centers (or whatever other CRUSH bucket type
you specified), assuming both are alive. Pools will increase in size
from the default 3 to 4, expecting 2 copies in each site. OSDs will only
be allowed to connect to monitors in the same data center. New monitors
will not be allowed to join the cluster if they do not specify a location.

If all the OSDs and monitors from a data center become inaccessible
at once, the surviving data center will enter a degraded stretch mode. This
will issue a warning, reduce the ``min_size`` to 1, and allow
the cluster to go active with data in the single remaining site. Note that
we do not change the pool size, so you will also get warnings that the
pools are too small -- but a special stretch mode flag will prevent the OSDs
from creating extra copies in the remaining data center (so it will only keep
2 copies, as before).

When the missing data center comes back, the cluster will enter
recovery stretch mode. This changes the warning and allows peering, but
still only requires OSDs from the data center which was up the whole time.
When all PGs are in a known state, and are neither degraded nor incomplete,
the cluster transitions back to regular stretch mode, ends the warning,
restores ``min_size`` to its starting value (2), once again requires both
sites to peer, and no longer specifically requires the site which was up the
whole time when peering (so that you can fail over to the other site, if
necessary).
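
Once stretch mode is enabled, you may want to confirm that it took effect. A
simple sanity check (output will vary by cluster; this is only a sketch) is to
dump the monitor map, which should show each monitor's location and the
tiebreaker, and to list the pools, which should show replicated pools running
with ``size 4`` and ``min_size 2`` under the stretch rule:

.. prompt:: bash $

   ceph mon dump
   ceph osd pool ls detail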

Stretch Mode Limitations
========================
As implied by the setup, stretch mode only handles 2 sites with OSDs.

While it is not enforced, you should run 2 monitors in each site plus
a tiebreaker, for a total of 5. This is because OSDs can only connect
to monitors in their own site when in stretch mode.

You cannot use erasure-coded pools with stretch mode. If you try, it will
refuse, and it will not allow you to create EC pools once in stretch mode.

You must create your own CRUSH rule which provides 2 copies in each site, and
you must use 4 total copies with 2 in each site. If you have existing pools
with non-default size/min_size, Ceph will object when you attempt to
enable stretch mode.

Because it runs with ``min_size 1`` when degraded, you should only use stretch
mode with all-flash OSDs. This minimizes the time needed to recover once
connectivity is restored, and thus minimizes the potential for data loss.

Hopefully, future development will extend this feature to support EC pools and
running with more than 2 full sites.

Other commands
==============
If your tiebreaker monitor fails for some reason, you can replace it. Turn on
a new monitor and run:

.. prompt:: bash $

   ceph mon set_new_tiebreaker mon.<new_mon_name>

This command will protest if the new monitor is in the same location as existing
non-tiebreaker monitors. This command WILL NOT remove the previous tiebreaker
monitor; you should do so yourself.

Also, as of 16.2.7, if you are writing your own tooling for deploying Ceph, you can use a new
``--set-crush-location`` option when booting monitors, instead of running
``ceph mon set_location``. This option accepts only a single ``bucket=loc`` pair, e.g.
``ceph-mon --set-crush-location 'datacenter=a'``, which must match the
bucket type you specified when running ``enable_stretch_mode``.


When in degraded stretch mode, the cluster will go into "recovery" mode automatically
when the disconnected data center comes back. If that doesn't work, or if you want to
enable recovery mode early, you can invoke:

.. prompt:: bash $

   ceph osd force_recovery_stretch_mode --yes-i-really-mean-it

But this command should not be necessary; it is included to deal with
unanticipated situations.

When in recovery mode, the cluster should go back into normal stretch mode
when the PGs are healthy. If this doesn't happen, or if you want to force the
cross-data-center peering early and are willing to risk data downtime (or have
verified separately that all the PGs can peer, even if they aren't fully
recovered), you can invoke:

.. prompt:: bash $

   ceph osd force_healthy_stretch_mode --yes-i-really-mean-it

This command should not be necessary; it is included to deal with
unanticipated situations. But you might wish to invoke it to remove
the ``HEALTH_WARN`` state which recovery mode generates.
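
If you are unsure whether the cluster is still in recovery mode, checking the
health output before forcing anything is a reasonable first step; the
degraded- and recovery-related stretch mode warnings described above are
reported there:

.. prompt:: bash $

   ceph -s
   ceph health detail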