iredis/data/commands/cluster-failover.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81

This command, that can only be sent to a Redis Cluster replica node, forces the
replica to start a manual failover of its master instance.

A manual failover is a special kind of failover that is usually executed when
there are no actual failures, but we wish to swap the current master with one of
its replicas (which is the node we send the command to), in a safe way, without
any window for data loss. It works in the following way:

1. The replica tells the master to stop processing queries from clients.
2. The master replies to the replica with the current _replication offset_.
3. The replica waits for the replication offset to match on its side, to make
   sure it processed all the data from the master before it continues.
4. The replica starts a failover, obtains a new configuration epoch from the
   majority of the masters, and broadcasts the new configuration.
5. The old master receives the configuration update: unblocks its clients and
   starts replying with redirection messages so that they'll continue the chat
   with the new master.

This way clients are moved away from the old master to the new master atomically
and only when the replica that is turning into the new master has processed all
of the replication stream from the old master.

## FORCE option: manual failover when the master is down

The command behavior can be modified by two options: **FORCE** and **TAKEOVER**.

If the **FORCE** option is given, the replica does not perform any handshake
with the master, that may be not reachable, but instead just starts a failover
ASAP starting from point 4. This is useful when we want to start a manual
failover while the master is no longer reachable.

However using **FORCE** we still need the majority of masters to be available in
order to authorize the failover and generate a new configuration epoch for the
replica that is going to become master.

## TAKEOVER option: manual failover without cluster consensus

There are situations where this is not enough, and we want a replica to failover
without any agreement with the rest of the cluster. A real world use case for
this is to mass promote replicas in a different data center to masters in order
to perform a data center switch, while all the masters are down or partitioned
away.

The **TAKEOVER** option implies everything **FORCE** implies, but also does not
uses any cluster authorization in order to failover. A replica receiving
`CLUSTER FAILOVER TAKEOVER` will instead:

1. Generate a new `configEpoch` unilaterally, just taking the current greatest
   epoch available and incrementing it if its local configuration epoch is not
   already the greatest.
2. Assign itself all the hash slots of its master, and propagate the new
   configuration to every node which is reachable ASAP, and eventually to every
   other node.

Note that **TAKEOVER violates the last-failover-wins principle** of Redis
Cluster, since the configuration epoch generated by the replica violates the
normal generation of configuration epochs in several ways:

1. There is no guarantee that it is actually the higher configuration epoch,
   since, for example, we can use the **TAKEOVER** option within a minority, nor
   any message exchange is performed to generate the new configuration epoch.
2. If we generate a configuration epoch which happens to collide with another
   instance, eventually our configuration epoch, or the one of another instance
   with our same epoch, will be moved away using the _configuration epoch
   collision resolution algorithm_.

Because of this the **TAKEOVER** option should be used with care.

## Implementation details and notes

`CLUSTER FAILOVER`, unless the **TAKEOVER** option is specified, does not
execute a failover synchronously, it only _schedules_ a manual failover,
bypassing the failure detection stage, so to check if the failover actually
happened, `CLUSTER NODES` or other means should be used in order to verify that
the state of the cluster changes after some time the command was sent.

@return

@simple-string-reply: `OK` if the command was accepted and a manual failover is
going to be attempted. An error if the operation cannot be executed, for example
if we are talking with a node which is already a master.