Diffstat (limited to 'iredis/data/commands/cluster-failover.md')
-rw-r--r-- iredis/data/commands/cluster-failover.md | 96
1 files changed, 41 insertions, 55 deletions
diff --git a/iredis/data/commands/cluster-failover.md b/iredis/data/commands/cluster-failover.md
index c811c04..911eaea 100644
--- a/iredis/data/commands/cluster-failover.md
+++ b/iredis/data/commands/cluster-failover.md
@@ -1,81 +1,67 @@
-This command, that can only be sent to a Redis Cluster replica node, forces the
-replica to start a manual failover of its master instance.
+This command, that can only be sent to a Redis Cluster replica node, forces
+the replica to start a manual failover of its master instance.
A manual failover is a special kind of failover that is usually executed when
-there are no actual failures, but we wish to swap the current master with one of
-its replicas (which is the node we send the command to), in a safe way, without
-any window for data loss. It works in the following way:
+there are no actual failures, but we wish to swap the current master with one
+of its replicas (which is the node we send the command to), in a safe way,
+without any window for data loss. It works in the following way:
1. The replica tells the master to stop processing queries from clients.
-2. The master replies to the replica with the current _replication offset_.
-3. The replica waits for the replication offset to match on its side, to make
- sure it processed all the data from the master before it continues.
-4. The replica starts a failover, obtains a new configuration epoch from the
- majority of the masters, and broadcasts the new configuration.
-5. The old master receives the configuration update: unblocks its clients and
- starts replying with redirection messages so that they'll continue the chat
- with the new master.
-
-This way clients are moved away from the old master to the new master atomically
-and only when the replica that is turning into the new master has processed all
-of the replication stream from the old master.
+2. The master replies to the replica with the current *replication offset*.
+3. The replica waits for the replication offset to match on its side, to make sure it processed all the data from the master before it continues.
+4. The replica starts a failover, obtains a new configuration epoch from the majority of the masters, and broadcasts the new configuration.
+5. The old master receives the configuration update: unblocks its clients and starts replying with redirection messages so that they'll continue the chat with the new master.
+
+This way clients are moved away from the old master to the new master
+atomically and only when the replica that is turning into the new master
+has processed all of the replication stream from the old master.
## FORCE option: manual failover when the master is down
The command behavior can be modified by two options: **FORCE** and **TAKEOVER**.
If the **FORCE** option is given, the replica does not perform any handshake
-with the master, that may be not reachable, but instead just starts a failover
-ASAP starting from point 4. This is useful when we want to start a manual
-failover while the master is no longer reachable.
+with the master, that may be not reachable, but instead just starts a
+failover ASAP starting from point 4. This is useful when we want to start
+a manual failover while the master is no longer reachable.
-However using **FORCE** we still need the majority of masters to be available in
-order to authorize the failover and generate a new configuration epoch for the
-replica that is going to become master.
+However using **FORCE** we still need the majority of masters to be available
+in order to authorize the failover and generate a new configuration epoch
+for the replica that is going to become master.
## TAKEOVER option: manual failover without cluster consensus
There are situations where this is not enough, and we want a replica to failover
-without any agreement with the rest of the cluster. A real world use case for
-this is to mass promote replicas in a different data center to masters in order
-to perform a data center switch, while all the masters are down or partitioned
-away.
+without any agreement with the rest of the cluster. A real world use case
+for this is to mass promote replicas in a different data center to masters
+in order to perform a data center switch, while all the masters are down
+or partitioned away.
-The **TAKEOVER** option implies everything **FORCE** implies, but also does not
-use any cluster authorization in order to failover. A replica receiving
+The **TAKEOVER** option implies everything **FORCE** implies, but also does
+not use any cluster authorization in order to failover. A replica receiving
`CLUSTER FAILOVER TAKEOVER` will instead:
-1. Generate a new `configEpoch` unilaterally, just taking the current greatest
- epoch available and incrementing it if its local configuration epoch is not
- already the greatest.
-2. Assign itself all the hash slots of its master, and propagate the new
- configuration to every node which is reachable ASAP, and eventually to every
- other node.
-
-Note that **TAKEOVER violates the last-failover-wins principle** of Redis
-Cluster, since the configuration epoch generated by the replica violates the
-normal generation of configuration epochs in several ways:
-
-1. There is no guarantee that it is actually the higher configuration epoch,
- since, for example, we can use the **TAKEOVER** option within a minority, nor
- any message exchange is performed to generate the new configuration epoch.
-2. If we generate a configuration epoch which happens to collide with another
- instance, eventually our configuration epoch, or the one of another instance
- with our same epoch, will be moved away using the _configuration epoch
- collision resolution algorithm_.
+1. Generate a new `configEpoch` unilaterally, just taking the current greatest epoch available and incrementing it if its local configuration epoch is not already the greatest.
+2. Assign itself all the hash slots of its master, and propagate the new configuration to every node which is reachable ASAP, and eventually to every other node.
+
+Note that **TAKEOVER violates the last-failover-wins principle** of Redis Cluster, since the configuration epoch generated by the replica violates the normal generation of configuration epochs in several ways:
+
+1. There is no guarantee that it is actually the higher configuration epoch, since, for example, we can use the **TAKEOVER** option within a minority, nor any message exchange is performed to generate the new configuration epoch.
+2. If we generate a configuration epoch which happens to collide with another instance, eventually our configuration epoch, or the one of another instance with our same epoch, will be moved away using the *configuration epoch collision resolution algorithm*.
Because of this the **TAKEOVER** option should be used with care.
## Implementation details and notes
-`CLUSTER FAILOVER`, unless the **TAKEOVER** option is specified, does not
-execute a failover synchronously, it only _schedules_ a manual failover,
-bypassing the failure detection stage, so to check if the failover actually
-happened, `CLUSTER NODES` or other means should be used in order to verify that
-the state of the cluster changes after some time the command was sent.
+* `CLUSTER FAILOVER`, unless the **TAKEOVER** option is specified, does not execute a failover synchronously.
+ It only *schedules* a manual failover, bypassing the failure detection stage.
+* An `OK` reply is no guarantee that the failover will succeed.
+* A replica can only be promoted to a master if it is known as a replica by a majority of the masters in the cluster.
+ If the replica is a new node that has just been added to the cluster (for example after upgrading it), it may not yet be known to all the masters in the cluster.
+ To check that the masters are aware of a new replica, you can send `CLUSTER NODES` or `CLUSTER REPLICAS` to each of the master nodes and check that it appears as a replica, before sending `CLUSTER FAILOVER` to the replica.
+* To check that the failover has actually happened you can use `ROLE`, `INFO REPLICATION` (which indicates "role:master" after successful failover), or `CLUSTER NODES` to verify that the state of the cluster has changed sometime after the command was sent.
+* To check if the failover has failed, check the replica's log for "Manual failover timed out", which is logged if the replica has given up after a few seconds.
@return
-@simple-string-reply: `OK` if the command was accepted and a manual failover is
-going to be attempted. An error if the operation cannot be executed, for example
-if we are talking with a node which is already a master.
+@simple-string-reply: `OK` if the command was accepted and a manual failover is going to be attempted. An error if the operation cannot be executed, for example if we are talking with a node which is already a master.
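
Reviewer note (not part of the patch): the new implementation notes say that `OK` only means the failover was scheduled, and that `INFO REPLICATION` reports "role:master" once it has succeeded. A minimal Python sketch of that check follows. The helper names `role_from_info` and `wait_for_promotion`, the injected `fetch_info` callable, and the 10-second default timeout are all assumptions made for illustration; the documentation only says the replica gives up "after a few seconds".

```python
import time
from typing import Callable, Optional


def role_from_info(info_text: str) -> Optional[str]:
    """Return the value of the `role:` field from INFO REPLICATION output."""
    for line in info_text.splitlines():
        line = line.strip()
        if line.startswith("role:"):
            return line.split(":", 1)[1]
    return None  # no role field found in the reply


def wait_for_promotion(
    fetch_info: Callable[[], str],
    timeout: float = 10.0,   # assumed value, not taken from the Redis docs
    interval: float = 0.5,
) -> bool:
    """Poll until the node reports role:master or the timeout expires.

    `fetch_info` is a hypothetical callable that sends INFO REPLICATION
    to the former replica and returns the raw text reply.
    """
    deadline = time.monotonic() + timeout
    while True:
        if role_from_info(fetch_info()) == "master":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With a real client, `fetch_info` would wrap whatever call returns the raw `INFO REPLICATION` text. A `False` result is a cue to look in the replica's log for "Manual failover timed out".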