summaryrefslogtreecommitdiffstats
path: root/iredis/data/commands/cluster-failover.md
diff options
context:
space:
mode:
Diffstat (limited to 'iredis/data/commands/cluster-failover.md')
-rw-r--r--iredis/data/commands/cluster-failover.md81
1 files changed, 81 insertions, 0 deletions
diff --git a/iredis/data/commands/cluster-failover.md b/iredis/data/commands/cluster-failover.md
new file mode 100644
index 0000000..c811c04
--- /dev/null
+++ b/iredis/data/commands/cluster-failover.md
@@ -0,0 +1,81 @@
+This command, that can only be sent to a Redis Cluster replica node, forces the
+replica to start a manual failover of its master instance.
+
+A manual failover is a special kind of failover that is usually executed when
+there are no actual failures, but we wish to swap the current master with one of
+its replicas (which is the node we send the command to), in a safe way, without
+any window for data loss. It works in the following way:
+
+1. The replica tells the master to stop processing queries from clients.
+2. The master replies to the replica with the current _replication offset_.
+3. The replica waits for the replication offset to match on its side, to make
+ sure it processed all the data from the master before it continues.
+4. The replica starts a failover, obtains a new configuration epoch from the
+ majority of the masters, and broadcasts the new configuration.
+5. The old master receives the configuration update: unblocks its clients and
+ starts replying with redirection messages so that they'll continue the chat
+ with the new master.
+
+This way clients are moved away from the old master to the new master atomically
+and only when the replica that is turning into the new master has processed all
+of the replication stream from the old master.
+
+## FORCE option: manual failover when the master is down
+
+The command behavior can be modified by two options: **FORCE** and **TAKEOVER**.
+
+If the **FORCE** option is given, the replica does not perform any handshake
+with the master, that may be not reachable, but instead just starts a failover
+ASAP starting from point 4. This is useful when we want to start a manual
+failover while the master is no longer reachable.
+
+However using **FORCE** we still need the majority of masters to be available in
+order to authorize the failover and generate a new configuration epoch for the
+replica that is going to become master.
+
+## TAKEOVER option: manual failover without cluster consensus
+
+There are situations where this is not enough, and we want a replica to failover
+without any agreement with the rest of the cluster. A real world use case for
+this is to mass promote replicas in a different data center to masters in order
+to perform a data center switch, while all the masters are down or partitioned
+away.
+
+The **TAKEOVER** option implies everything **FORCE** implies, but also does not
+uses any cluster authorization in order to failover. A replica receiving
+`CLUSTER FAILOVER TAKEOVER` will instead:
+
+1. Generate a new `configEpoch` unilaterally, just taking the current greatest
+ epoch available and incrementing it if its local configuration epoch is not
+ already the greatest.
+2. Assign itself all the hash slots of its master, and propagate the new
+ configuration to every node which is reachable ASAP, and eventually to every
+ other node.
+
+Note that **TAKEOVER violates the last-failover-wins principle** of Redis
+Cluster, since the configuration epoch generated by the replica violates the
+normal generation of configuration epochs in several ways:
+
+1. There is no guarantee that it is actually the higher configuration epoch,
+ since, for example, we can use the **TAKEOVER** option within a minority, nor
+ any message exchange is performed to generate the new configuration epoch.
+2. If we generate a configuration epoch which happens to collide with another
+ instance, eventually our configuration epoch, or the one of another instance
+ with our same epoch, will be moved away using the _configuration epoch
+ collision resolution algorithm_.
+
+Because of this the **TAKEOVER** option should be used with care.
+
+## Implementation details and notes
+
+`CLUSTER FAILOVER`, unless the **TAKEOVER** option is specified, does not
+execute a failover synchronously, it only _schedules_ a manual failover,
+bypassing the failure detection stage, so to check if the failover actually
+happened, `CLUSTER NODES` or other means should be used in order to verify that
+the state of the cluster changes after some time the command was sent.
+
+@return
+
+@simple-string-reply: `OK` if the command was accepted and a manual failover is
+going to be attempted. An error if the operation cannot be executed, for example
+if we are talking with a node which is already a master.