Diffstat (limited to 'doc/sphinx/Pacemaker_Explained/multi-site-clusters.rst')
-rw-r--r-- | doc/sphinx/Pacemaker_Explained/multi-site-clusters.rst | 341
1 file changed, 341 insertions, 0 deletions
diff --git a/doc/sphinx/Pacemaker_Explained/multi-site-clusters.rst b/doc/sphinx/Pacemaker_Explained/multi-site-clusters.rst
new file mode 100644
index 0000000..59d3f93
--- /dev/null
+++ b/doc/sphinx/Pacemaker_Explained/multi-site-clusters.rst
@@ -0,0 +1,341 @@

Multi-Site Clusters and Tickets
-------------------------------

Apart from local clusters, Pacemaker also supports multi-site clusters.
That means you can have multiple, geographically dispersed sites, each with a
local cluster. Failover between these clusters can be coordinated
manually by the administrator, or automatically by a higher-level entity called
a *Cluster Ticket Registry (CTR)*.

Challenges for Multi-Site Clusters
##################################

Typically, multi-site environments are too far apart to support
synchronous communication and data replication between the sites.
That leads to significant challenges:

- How do we make sure that a cluster site is up and running?

- How do we make sure that resources are only started once?

- How do we make sure that quorum can be reached between the different
  sites and a split-brain scenario avoided?

- How do we manage failover between sites?

- How do we deal with high latency in case of resources that need to be
  stopped?

The following sections explain how to meet these challenges.

Conceptual Overview
###################

Multi-site clusters can be considered as “overlay” clusters where
each cluster site corresponds to a cluster node in a traditional cluster.
The overlay cluster can be managed by a CTR in order to
guarantee that any cluster resource will be active
on no more than one cluster site. This is achieved by using
*tickets* that are treated as failover domains between cluster
sites, in case a site should go down.

The following sections explain the individual components and mechanisms
that were introduced for multi-site clusters in more detail.

Ticket
______

Tickets are, essentially, cluster-wide attributes. A ticket grants the
right to run certain resources on a specific cluster site. Resources can
be bound to a certain ticket by ``rsc_ticket`` constraints. Only if the
ticket is available at a site can the respective resources be started there.
Conversely, if the ticket is revoked, the resources depending on that
ticket must be stopped.

A ticket is thus similar to a *site quorum*, i.e. the permission to
manage and own resources associated with that site. (One can also think of the
current ``have-quorum`` flag as a special, cluster-wide ticket that is
granted in case of node majority.)

Tickets can be granted and revoked either manually by administrators
(which could be the default for classic enterprise clusters), or via
the automated CTR mechanism described below.

A ticket can only be owned by one site at a time. Initially, none
of the sites has a ticket. Each ticket must be granted once by the cluster
administrator.

The presence or absence of tickets for a site is stored in the CIB as
cluster status. With regard to a certain ticket, there are only two states
for a site: ``true`` (the site has the ticket) or ``false`` (the site does
not have the ticket). The absence of a certain ticket (during the initial
state of the multi-site cluster) is the same as the value ``false``.
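As an illustration only (attribute details may differ between Pacemaker
versions), a granted ticket is recorded in the CIB status section roughly
like this:

.. topic:: Sketch of ticket state in the CIB status section

   .. code-block:: xml

      <status>
        <tickets>
          <!-- this site currently holds ticketA -->
          <ticket_state id="ticketA" granted="true"/>
        </tickets>
      </status>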
Dead Man Dependency
___________________

A site can only activate resources safely if it can be sure that the
other site has deactivated them. However, after a ticket is revoked, it can
take a long time until all resources depending on that ticket are stopped
"cleanly", especially in case of cascaded resources. To cut that process
short, the concept of a *Dead Man Dependency* was introduced.

If a dead man dependency is in force and a ticket is revoked from a site, the
nodes that are hosting dependent resources are fenced. This considerably speeds
up the recovery process of the cluster and makes sure that resources can be
migrated more quickly.

This can be configured by specifying ``loss-policy="fence"`` in
``rsc_ticket`` constraints.

Cluster Ticket Registry
_______________________

A CTR is a coordinated group of network daemons that automatically handles
granting, revoking, and timing out tickets (instead of the administrator
revoking the ticket somewhere, waiting for everything to stop, and then
granting it on the desired site).

Pacemaker does not implement its own CTR, but interoperates with external
software designed for that purpose (similar to how resource and fencing agents
are not directly part of Pacemaker).

Participating clusters run the CTR daemons, which connect to each other, exchange
information about their connectivity, and vote on which site gets which
tickets.

A ticket is granted to a site only once the CTR is sure that the ticket
has been relinquished by the previous owner, implemented via a timer in most
scenarios. If a site loses connection to its peers, its tickets time out and
recovery occurs. After the connection timeout plus the recovery timeout has
passed, the other sites are allowed to re-acquire the ticket and start the
resources again.

This can also be thought of as a "quorum server", except that it is not
a single quorum ticket, but several.

Configuration Replication
_________________________

As usual, the CIB is synchronized within each cluster, but it is *not* synchronized
across cluster sites of a multi-site cluster. You have to configure the resources
that will be highly available across the multi-site cluster for every site
accordingly.

.. _ticket-constraints:

Configuring Ticket Dependencies
###############################

The ``rsc_ticket`` constraint lets you specify the resources depending on a certain
ticket. Together with the constraint, you can set a ``loss-policy`` that defines
what should happen to the respective resources if the ticket is revoked.

The attribute ``loss-policy`` can have the following values:

* ``fence:`` Fence the nodes that are running the relevant resources.

* ``stop:`` Stop the relevant resources.

* ``freeze:`` Do nothing to the relevant resources.

* ``demote:`` Demote relevant resources that are running in the promoted role.

.. topic:: Constraint that fences node if ``ticketA`` is revoked

   .. code-block:: xml

      <rsc_ticket id="rsc1-req-ticketA" rsc="rsc1" ticket="ticketA" loss-policy="fence"/>

The example above creates a constraint with the ID ``rsc1-req-ticketA``. It
defines that the resource ``rsc1`` depends on ``ticketA`` and that the node running
the resource should be fenced if ``ticketA`` is revoked.

If resource ``rsc1`` were a promotable resource, you might want to configure
that only its promoted role depends on ``ticketA``. With the following
configuration, ``rsc1`` will be demoted if ``ticketA`` is revoked:

.. topic:: Constraint that demotes ``rsc1`` if ``ticketA`` is revoked

   .. code-block:: xml

      <rsc_ticket id="rsc1-req-ticketA" rsc="rsc1" rsc-role="Promoted" ticket="ticketA" loss-policy="demote"/>

You can create multiple ``rsc_ticket`` constraints to let multiple resources
depend on the same ticket. However, ``rsc_ticket`` also supports resource sets
(see :ref:`s-resource-sets`), so one can easily list all the resources in one
``rsc_ticket`` constraint instead.

.. topic:: Ticket constraint for multiple resources

   .. code-block:: xml

      <rsc_ticket id="resources-dep-ticketA" ticket="ticketA" loss-policy="fence">
        <resource_set id="resources-dep-ticketA-0" role="Started">
          <resource_ref id="rsc1"/>
          <resource_ref id="group1"/>
          <resource_ref id="clone1"/>
        </resource_set>
        <resource_set id="resources-dep-ticketA-1" role="Promoted">
          <resource_ref id="ms1"/>
        </resource_set>
      </rsc_ticket>

In the example above, there are two resource sets, so we can list resources
with different roles in a single ``rsc_ticket`` constraint. There is no dependency
between the two resource sets, and there is no dependency among the
resources within a resource set. Each of the resources just depends on
``ticketA``.

Referencing resource templates in ``rsc_ticket`` constraints, and even
referencing them within resource sets, is also supported, as sketched below.
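As a purely illustrative sketch (the template name ``vm-template`` and the
constraint IDs below are hypothetical), a constraint that makes all resources
derived from a template depend on ``ticketA`` might look like this:

.. topic:: Hypothetical ticket constraint referencing a resource template

   .. code-block:: xml

      <rsc_ticket id="vms-dep-ticketA" ticket="ticketA" loss-policy="stop">
        <resource_set id="vms-dep-ticketA-0">
          <!-- referencing the template stands for the resources derived from it -->
          <resource_ref id="vm-template"/>
        </resource_set>
      </rsc_ticket>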
If you want other resources to depend on further tickets, create as many
constraints as necessary with ``rsc_ticket``.

Managing Multi-Site Clusters
############################

Granting and Revoking Tickets Manually
______________________________________

You can grant tickets to sites or revoke them from sites manually.
If you want to re-distribute a ticket, you should wait for
the dependent resources to stop cleanly at the previous site before you
grant the ticket to the new site.

Use the ``crm_ticket`` command-line tool to grant and revoke tickets.

To grant a ticket to this site:

   .. code-block:: none

      # crm_ticket --ticket ticketA --grant

To revoke a ticket from this site:

   .. code-block:: none

      # crm_ticket --ticket ticketA --revoke

.. important::

   If you are managing tickets manually, use the ``crm_ticket`` command with
   great care, because it cannot check whether the same ticket is already
   granted elsewhere.

Granting and Revoking Tickets via a Cluster Ticket Registry
___________________________________________________________

We will use `Booth <https://github.com/ClusterLabs/booth>`_ here as an example of
software that can be used with Pacemaker as a Cluster Ticket Registry. Booth
implements the `Raft <http://en.wikipedia.org/wiki/Raft_%28computer_science%29>`_
algorithm to guarantee distributed consensus among the different
cluster sites, and manages the ticket distribution (and thus the failover
process between sites).

Each of the participating clusters and *arbitrators* runs the Booth daemon
``boothd``.

An *arbitrator* is the multi-site equivalent of a quorum-only node in a local
cluster. If you have a setup with an even number of sites,
you need an additional instance to reach consensus about decisions such
as failover of resources across sites. In this case, add one or more
arbitrators running at additional sites. Arbitrators are single machines
that run a Booth instance in a special mode. An arbitrator is especially
important for a two-site scenario; otherwise there is no way for one site
to distinguish between a network failure between it and the other site, and
a failure of the other site.

The most common multi-site scenario is probably a multi-site cluster with two
sites and a single arbitrator on a third site. However, technically, there are
no limitations with regard to the number of sites and the number of
arbitrators involved.
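As a rough sketch only (the addresses, port, and expiry value below are
illustrative placeholders, and the exact syntax is defined by Booth rather than
Pacemaker), a Booth configuration for two sites plus one arbitrator might look
like this:

.. topic:: Sketch of a Booth configuration for two sites and one arbitrator

   .. code-block:: none

      # booth.conf (example values only)
      transport = UDP
      port = 9929
      # the two cluster sites
      site = 192.0.2.10
      site = 198.51.100.10
      # the arbitrator on a third site
      arbitrator = 203.0.113.10
      # a ticket managed by this Booth instance, with an illustrative expiry
      ticket = "ticketA"
          expire = 600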
``boothd`` at each site connects to its peers running at the other sites and
exchanges connectivity details. Once a ticket is granted to a site, the
Booth mechanism will manage the ticket automatically: if the site which
holds the ticket is out of service, the Booth daemons will vote on which
of the other sites will get the ticket. To protect against brief
connection failures, sites that lose the vote (either explicitly or
implicitly by being disconnected from the voting body) need to
relinquish the ticket after a time-out. This ensures that a
ticket will only be re-distributed after it has been relinquished by the
previous site. The resources that depend on that ticket will fail over
to the new site holding the ticket. The nodes that previously ran the
resources will be treated according to the ``loss-policy`` you set
within the ``rsc_ticket`` constraint.

Before Booth can manage a certain ticket within the multi-site cluster,
you initially need to grant it to a site manually via the ``booth`` command-line
tool. After you have initially granted a ticket to a site, ``boothd``
will take over and manage the ticket automatically.

.. important::

   The ``booth`` command-line tool can be used to grant, list, or
   revoke tickets and can be run on any machine where ``boothd`` is running.
   If you are managing tickets via Booth, use only ``booth`` for manual
   intervention, not ``crm_ticket``. That ensures the same ticket
   will only be owned by one cluster site at a time.

Booth Requirements
~~~~~~~~~~~~~~~~~~

* All clusters that will be part of the multi-site cluster must be based on
  Pacemaker.

* Booth must be installed on all cluster nodes and on all arbitrators that will
  be part of the multi-site cluster.

* Nodes belonging to the same cluster site should be synchronized via NTP. However,
  time synchronization is not required between the individual cluster sites.

General Management of Tickets
_____________________________

To display information about tickets:

   .. code-block:: none

      # crm_ticket --info

Or you can monitor them with:

   .. code-block:: none

      # crm_mon --tickets

To display the ``rsc_ticket`` constraints that apply to a ticket:

   .. code-block:: none

      # crm_ticket --ticket ticketA --constraints

When you want to do maintenance or a manual switch-over of a ticket,
revoking the ticket directly would trigger the loss policies. If
``loss-policy="fence"``, the dependent resources could not be gracefully
stopped or demoted, and even unrelated resources could be affected.

The proper way is to make the ticket *standby* first:

   .. code-block:: none

      # crm_ticket --ticket ticketA --standby

Then the dependent resources will be stopped or demoted gracefully without
triggering the loss policies.

When you have finished maintenance and want to activate the ticket again,
run:

   .. code-block:: none

      # crm_ticket --ticket ticketA --activate

For more information
####################

* `SUSE's Geo Clustering quick start <https://www.suse.com/documentation/sle-ha-geo-12/art_ha_geo_quick/data/art_ha_geo_quick.html>`_

* `Booth <https://github.com/ClusterLabs/booth>`_