diff --git a/doc/rados/operations/stretch-mode.rst b/doc/rados/operations/stretch-mode.rst
new file mode 100644
index 000000000..6b4c9ba8a
--- /dev/null
+++ b/doc/rados/operations/stretch-mode.rst
@@ -0,0 +1,215 @@
+.. _stretch_mode:
+
+================
+Stretch Clusters
+================
+
+
+Stretch Clusters
+================
+Ceph generally expects all parts of its network and overall cluster to be
+equally reliable, with failures randomly distributed across the CRUSH map.
+For example, you may lose a switch that knocks out a number of OSDs, but Ceph
+expects the remaining OSDs and monitors to route around that failure.
+
+This is usually a good choice, but it may not work well in some
+stretched cluster configurations, where a significant part of your cluster
+sits behind a single network component. For instance, you might run a single
+cluster that spans multiple data centers and need it to survive the loss of a
+full DC.
+
+There are two standard configurations we've seen deployed, with either
+two or three data centers (or, in clouds, availability zones). With two
+zones, we expect each site to hold a copy of the data, and a third site to
+host a tiebreaker monitor (this can be a VM, or a monitor with higher latency
+than the main sites) that picks a winner if the network connection fails while
+both DCs remain alive. For three sites, we expect a copy of the data and an
+equal number of monitors in each site.
+
+Note that the standard Ceph configuration will survive MANY failures of the
+network or data centers and it will never compromise data consistency. If you
+bring back enough Ceph servers following a failure, it will recover. If you
+lose a data center, but can still form a quorum of monitors and have all the data
+available (with enough copies to satisfy pools' ``min_size``, or CRUSH rules
+that will re-replicate to meet it), Ceph will maintain availability.
+
+What can't it handle?
+
+Stretch Cluster Issues
+======================
+No matter what happens, Ceph will not compromise on data integrity
+and consistency. If there's a failure in your network or a loss of nodes and
+you can restore service, Ceph will return to normal functionality on its own.
+
+But there are scenarios where you lose data availability despite having
+enough servers available to satisfy Ceph's consistency and sizing constraints, or
+where you may be surprised to discover that your cluster does not satisfy those
+constraints.
+The first important category of failures revolves around inconsistent
+networks -- if there is a netsplit, Ceph may be unable to mark OSDs down and remove
+them from the acting PG sets even though the primary is unable to replicate data.
+If this happens, IO will not be permitted, because Ceph cannot satisfy its durability
+guarantees.
+
+The second important category of failures is when you think you have data replicated
+across data centers, but the constraints aren't sufficient to guarantee this.
+For instance, you might have data centers A and B, and your CRUSH rule targets 3 copies
+and places a copy in each data center with a ``min_size`` of 2. The PG may go active with
+2 copies in site A and no copies in site B, which means that if you then lose site A, you
+have lost the data and Ceph cannot operate on it. This situation is surprisingly difficult
+to avoid with standard CRUSH rules.
+
+Stretch Mode
+============
+The new stretch mode is designed to handle the 2-site case. Three sites are
+just as susceptible to netsplit issues, but are much more tolerant of
+component availability outages than 2-site clusters are.
+
+To enter stretch mode, you must set the location of each monitor, matching
+your CRUSH map. For instance, to place ``mon.a`` in your first data center:
+
+.. prompt:: bash $
+
+ ceph mon set_location a datacenter=site1
+
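+The other monitors' locations can be set in the same way (the tiebreaker
+monitor is covered below). A minimal sketch, assuming monitors ``b``, ``c``,
+and ``d`` are split across the same two data centers (the monitor names and
+their site assignments here are only an example):
+
+.. prompt:: bash $
+
+   ceph mon set_location b datacenter=site1
+   ceph mon set_location c datacenter=site2
+   ceph mon set_location d datacenter=site2
+
+Running ``ceph mon dump`` afterwards should show the location recorded for
+each monitor.
+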
+Next, generate a CRUSH rule which will place 2 copies in each data center. This
+will require editing the CRUSH map directly:
+
+.. prompt:: bash $
+
+ ceph osd getcrushmap > crush.map.bin
+ crushtool -d crush.map.bin -o crush.map.txt
+
+Now edit the ``crush.map.txt`` file to add a new rule. In this example
+there is only one other rule, so we use ID 1, but you may need
+to use a different rule ID. This example also assumes two datacenter buckets
+named ``site1`` and ``site2``::
+
+ rule stretch_rule {
+ id 1
+ type replicated
+ min_size 1
+ max_size 10
+ step take site1
+ step chooseleaf firstn 2 type host
+ step emit
+ step take site2
+ step chooseleaf firstn 2 type host
+ step emit
+ }
+
+Finally, inject the CRUSH map to make the rule available to the cluster:
+
+.. prompt:: bash $
+
+ crushtool -c crush.map.txt -o crush2.map.bin
+ ceph osd setcrushmap -i crush2.map.bin
+
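+To confirm that the new rule made it into the cluster's CRUSH map, you can
+list and inspect the rules (``stretch_rule`` is the name used above):
+
+.. prompt:: bash $
+
+   ceph osd crush rule ls
+   ceph osd crush rule dump stretch_rule
+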
+If you aren't already running your monitors in connectivity mode, do so by
+following the instructions in `Changing Monitor Elections`_.
+
+.. _Changing Monitor Elections: ../change-mon-elections
+
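+As described on that page, switching the election strategy generally amounts
+to a single command:
+
+.. prompt:: bash $
+
+   ceph mon set election_strategy connectivity
+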
+And lastly, tell the cluster to enter stretch mode. Here, ``mon.e`` is the
+tiebreaker monitor and we are splitting across data centers. ``mon.e`` should
+also be assigned a data center that differs from ``site1`` and ``site2``. For
+this purpose you can create another datacenter bucket named ``site3`` in your
+CRUSH map and place ``mon.e`` there:
+
+.. prompt:: bash $
+
+ ceph mon set_location e datacenter=site3
+ ceph mon enable_stretch_mode e stretch_rule datacenter
+
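+If the ``site3`` bucket does not already exist in your CRUSH map, you can
+create it before running the commands above; a minimal sketch, using the
+bucket name assumed in this example and attaching it under the default root:
+
+.. prompt:: bash $
+
+   ceph osd crush add-bucket site3 datacenter
+   ceph osd crush move site3 root=default
+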
+When stretch mode is enabled, the OSDs will only take PGs active when
+they peer across data centers (or whatever other CRUSH bucket type
+you specified), assuming both are alive. Pools will increase in size
+from the default 3 to 4, expecting 2 copies in each site. OSDs will only
+be allowed to connect to monitors in the same data center. New monitors
+will not be allowed to join the cluster if they do not specify a location.
+
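+You can verify the new per-pool replication settings after enabling stretch
+mode; for example, for a hypothetical replicated pool named ``mypool``,
+``size`` should now report 4 and ``min_size`` 2:
+
+.. prompt:: bash $
+
+   ceph osd pool get mypool size
+   ceph osd pool get mypool min_size
+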
+If all the OSDs and monitors from a data center become inaccessible
+at once, the surviving data center will enter a degraded stretch mode. This
+will issue a warning, reduce the pools' ``min_size`` to ``1``, and allow
+the cluster to go active with data in the single remaining site. Note that
+we do not change the pool size, so you will also get warnings that the
+pools are too small -- but a special stretch mode flag will prevent the OSDs
+from creating extra copies in the remaining data center (so it will only keep
+2 copies, as before).
+
+When the missing data center comes back, the cluster will enter
+recovery stretch mode. This changes the warning and allows peering, but
+still only requires OSDs from the data center which was up the whole time.
+When all PGs are in a known state, and are neither degraded nor incomplete,
+the cluster transitions back to regular stretch mode, ends the warning,
+restores ``min_size`` to its starting value (2), and once again requires both
+sites to peer rather than only the site that stayed up the whole time, so that
+you can fail over to the other site if necessary.
+
+
+Stretch Mode Limitations
+========================
+As implied by the setup, stretch mode only handles 2 sites with OSDs.
+
+While it is not enforced, you should run 2 monitors in each site plus
+a tiebreaker, for a total of 5. This is because OSDs can only connect
+to monitors in their own site when in stretch mode.
+
+You cannot use erasure-coded pools with stretch mode. If you try, Ceph will
+refuse, and it will not allow you to create EC pools once stretch mode is active.
+
+You must create your own CRUSH rule that places 2 copies in each site, and you
+must use 4 total copies with 2 in each site. If you have existing pools
+with a non-default ``size`` or ``min_size``, Ceph will object when you attempt to
+enable stretch mode.
+
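+To see whether any existing pools would conflict, you can review their
+``size`` and ``min_size`` settings before enabling stretch mode:
+
+.. prompt:: bash $
+
+   ceph osd pool ls detail
+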
+Because it runs with ``min_size 1`` when degraded, you should only use stretch
+mode with all-flash OSDs. This minimizes the time needed to recover once
+connectivity is restored, and thus minimizes the potential for data loss.
+
+Hopefully, future development will extend this feature to support EC pools and
+running with more than 2 full sites.
+
+Other commands
+==============
+If your tiebreaker monitor fails for some reason, you can replace it. Turn on
+a new monitor and run:
+
+.. prompt:: bash $
+
+ ceph mon set_new_tiebreaker mon.<new_mon_name>
+
+This command will protest if the new monitor is in the same location as existing
+non-tiebreaker monitors. This command WILL NOT remove the previous tiebreaker
+monitor; you should do so yourself.
+
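+Removing the old tiebreaker follows the usual monitor-removal procedure; after
+stopping the old daemon, something like the following (monitor name
+hypothetical):
+
+.. prompt:: bash $
+
+   ceph mon remove <old_mon_name>
+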
+Also, as of 16.2.7, if you are writing your own tooling for deploying Ceph, you can use a new
+``--set-crush-location`` option when booting monitors, instead of running
+``ceph mon set_location``. This option accepts only a single ``bucket=loc`` pair, e.g.
+``ceph-mon --set-crush-location 'datacenter=a'``, which must match the
+bucket type you specified when running ``enable_stretch_mode``.
+
+
+When in degraded stretch mode, the cluster will go into "recovery" mode automatically
+when the disconnected data center comes back. If that doesn't happen, or you want to
+enter recovery mode early, you can invoke:
+
+.. prompt:: bash $
+
+ ceph osd force_recovery_stretch_mode --yes-i-really-mean-it
+
+But this command should not be necessary; it is included to deal with
+unanticipated situations.
+
+When in recovery mode, the cluster should go back into normal stretch mode
+when the PGs are healthy. If this doesn't happen, or you want to force the
+cross-data-center peering early and are willing to risk data downtime (or have
+verified separately that all the PGs can peer, even if they aren't fully
+recovered), you can invoke:
+
+.. prompt:: bash $
+
+ ceph osd force_healthy_stretch_mode --yes-i-really-mean-it
+
+This command should not be necessary; it is included to deal with
+unanticipated situations. But you might wish to invoke it to remove
+the ``HEALTH_WARN`` state which recovery mode generates.