
Notes regarding the Galera resource agent
---

In the resource agent, bootstrapping a Galera cluster is implemented
as a series of small steps, using:

  * Two CIB attributes `last-committed` and `bootstrap` to elect a
    bootstrap node that will restart the cluster.

  * One CIB attribute `sync-needed` that indicates that a joining
    node is synchronizing its local database via SST.

  * A Master/Slave pacemaker resource which helps split the boot
    into steps, up to the point where a Galera node is available.

  * The recurring monitor action to coordinate the switch from one
    state to another.

How boot works
====

There are two things to know to understand how the resource agent
restarts a Galera cluster.

### Bootstrap the cluster with the right node

When synced, the nodes of a galera cluster have in common a last seqno,
which identifies the last transaction considered successful by a
majority of nodes in the cluster (think quorum).

To restart a cluster, the resource agent must ensure that it
bootstraps the cluster from a node which is up-to-date, i.e. which has
the highest seqno of all nodes.

As a result, if the resource agent cannot retrieve the seqno on all
nodes, it won't be able to safely identify a bootstrap node, and
will simply refuse to start the galera cluster.
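
The election rule above can be sketched in a few lines of Python. This
is only an illustration of the rule, not the agent's shell code: the
function name `elect_bootstrap_node` and the dictionary standing in for
the per-node `last-committed` values are hypothetical.

```python
from typing import Dict, Optional


def elect_bootstrap_node(last_committed: Dict[str, Optional[int]]) -> Optional[str]:
    """Return the node with the highest seqno, or None if election is unsafe.

    `last_committed` maps a node name to its reported seqno, with None
    meaning that the seqno could not be retrieved for that node.
    """
    if not last_committed:
        return None
    # Refuse to elect as soon as one node has not reported its seqno.
    if any(seqno is None for seqno in last_committed.values()):
        return None
    # Ties are broken by node name only to keep this sketch deterministic.
    return max(sorted(last_committed), key=lambda node: last_committed[node])


print(elect_bootstrap_node({"node1": 10, "node2": 12, "node3": 11}))
print(elect_bootstrap_node({"node1": 10, "node2": None}))
```

With the sample values, the first call elects `node2` and the second
one returns `None` because one seqno is missing.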

### Synchronizing nodes can be a long operation

Starting a bootstrap node is relatively fast, so it's performed
during the "promote" operation, which is a one-off, time-bounded
operation.

Subsequent nodes will need to synchronize via SST, which consists
of "pushing" an entire Galera DB from one node to another.

There is no perfect time-out, as time spent during synchronization
depends on the size of the DB. Thus, joiner nodes are started during
the "monitor" operation, which is a recurring operation that can
better track the progress of the SST.


State flow
====

General idea for starting Galera:

  * Before starting the Galera cluster each node needs to go in Slave
    state so that the agent records its last seqno into the CIB.
    __ This uses attribute last-committed __

  * Once all nodes are in Slave state, the agent can safely determine
    the last seqno and elect a bootstrap node (`detect_first_master()`).
    __ This uses attribute bootstrap __

  * The agent then sets the score of the elected bootstrap node to
    Master so that pacemaker promotes it and starts the first Galera
    server.

  * Once the first Master is running, the agent can start joiner
    nodes during the "monitor" operation and start monitoring
    their SST sync.
    __ This uses attribute sync-needed __

  * Only once SST is over on the joiner nodes does the agent promote
    them to Master. At this point, the entire Galera cluster is up.
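
The steps above can be walked through with a toy model. The sketch
below is a simplification written for this README, not a transcription
of the agent: the class `ClusterSketch` and its methods are
hypothetical, a plain dictionary stands in for the CIB, and promotion
is applied directly instead of going through pacemaker scores.

```python
from typing import Dict, Optional


class ClusterSketch:
    """Toy model of the boot sequence; `attrs` stands in for the CIB."""

    def __init__(self, seqnos: Dict[str, int]) -> None:
        self.seqnos = seqnos  # seqno known locally on each node
        self.role: Dict[str, str] = {n: "Stopped" for n in seqnos}
        self.attrs: Dict[str, dict] = {n: {} for n in seqnos}

    def start(self, node: str) -> None:
        # Enter Slave and publish the local seqno as last-committed.
        self.role[node] = "Slave"
        self.attrs[node]["last-committed"] = self.seqnos[node]

    def elect(self) -> Optional[str]:
        # Elect only once every node has published last-committed.
        if any("last-committed" not in a for a in self.attrs.values()):
            return None
        winner = max(sorted(self.attrs),
                     key=lambda n: self.attrs[n]["last-committed"])
        self.attrs[winner]["bootstrap"] = True
        return winner

    def promote_bootstrap(self, node: str) -> None:
        # The elected node starts the first Galera server.
        assert self.attrs[node].get("bootstrap"), "node was not elected"
        self.attrs[node].pop("last-committed", None)
        self.attrs[node].pop("bootstrap")
        self.role[node] = "Master"

    def monitor(self, node: str, sst_done: bool) -> None:
        # Joiners are started and tracked during recurring monitors.
        if self.role[node] != "Slave" or "Master" not in self.role.values():
            return
        if "sync-needed" not in self.attrs[node]:
            self.attrs[node].pop("last-committed", None)
            self.attrs[node]["sync-needed"] = True
        elif sst_done:
            self.attrs[node].pop("sync-needed")
            self.role[node] = "Master"


cluster = ClusterSketch({"a": 5, "b": 9, "c": 7})
for name in ("a", "b", "c"):
    cluster.start(name)
cluster.promote_bootstrap(cluster.elect())
for name in ("a", "c"):
    cluster.monitor(name, sst_done=False)  # joiner starts, SST begins
    cluster.monitor(name, sst_done=True)   # SST over, joiner promoted
print(cluster.role)
```

In this run `b` holds the highest seqno, so it bootstraps the cluster,
and `a` and `c` only become Master after a later monitor pass sees
their SST as finished.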


Attribute usage and liveness
====

Here is how attributes are created on a per-node basis. If you
modify the resource agent, make sure those properties still hold.

### last-committed

It is just a temporary hint for the resource agent to help
elect a bootstrap node. Once the bootstrap attribute is set on one
of the nodes, we can get rid of last-committed.

 - Used   : during Slave state to compare seqno
 - Created: before entering Slave state:
              . at startup in `galera_start()`
              . or when a Galera node is stopped in `galera_demote()`
 - Deleted: just before node starts in `galera_start_local_node()`;
            cleaned-up during `galera_demote()` and `galera_stop()`

We delete last-committed before starting Galera, to avoid race
conditions that could arise due to discrepancies between the CIB and
Galera.

### bootstrap

Attribute set on the node that is elected to bootstrap Galera.

- Used   : during promotion in `galera_start_local_node()`
- Created: at startup once all nodes have `last-committed`;
           or during monitor if all nodes have failed
- Deleted: in `galera_start_local_node()`, just after the bootstrap
           node started and is ready;
           cleaned-up during `galera_demote()` and `galera_stop()`

There cannot be more than one bootstrap node at any time, otherwise
the Galera cluster would stop replicating properly.

### sync-needed

While this attribute is set on a node, the Galera node is in JOIN
state, i.e. SST is in progress and the node cannot serve queries.

The resource agent relies on the underlying SST method to monitor
the progress of the SST. For instance, with `wsrep_sst_rsync`, a
timeout would be reported by rsync and the Galera node would go into
Non-primary state, which would make `galera_monitor()` fail.

- Used   : during recurring slave monitor in `check_sync_status()`
- Created: in `galera_start_local_node()`, just after the joiner
           node started and entered the Galera cluster
- Deleted: during recurring slave monitor in `check_sync_status()`
           as soon as the Galera node reports that it is synced.
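
As a rough illustration of the lifecycle above, one pass of the slave
monitor could be modelled as follows. The function name
`check_sync_status_sketch`, the return values, and the state strings
`"Synced"` and `"Non-primary"` are stand-ins chosen for this sketch;
the real agent queries the running Galera node instead.

```python
from typing import Dict


def check_sync_status_sketch(attrs: Dict[str, bool], node_state: str) -> str:
    """Model one recurring monitor pass on a joiner node.

    `attrs` holds the node's attributes (a stand-in for the CIB) and
    `node_state` is the state reported by the Galera node.
    """
    if node_state == "Non-primary":
        # E.g. the SST method timed out: the monitor fails.
        return "failed"
    if node_state == "Synced":
        # SST is over: the attribute is no longer needed.
        attrs.pop("sync-needed", None)
        return "synced"
    # Any other state: SST still in progress, keep the attribute.
    return "syncing"


node_attrs = {"sync-needed": True}
print(check_sync_status_sketch(node_attrs, "Joining"), node_attrs)
print(check_sync_status_sketch(node_attrs, "Synced"), node_attrs)
```

Because the check runs on every monitor interval, no single timeout
has to cover the whole SST.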

### no-grastate

If a Galera node was unexpectedly killed in the middle of a
replication, InnoDB can retain the equivalent of an XA transaction in
prepared state in its redo log. If so, mysqld cannot recover its state
(nor the last seqno) automatically, and a special recovery heuristic
has to be used to unblock the node.

This transient attribute is used to keep track of forced recoveries to
prevent bootstrapping a cluster from a recovered node when possible.

- Used   : during `detect_first_master()` to elect the bootstrap node
- Created: in `detect_last_commit()` if the node has a pending XA
           transaction to recover in the redo log
- Deleted: when a node is promoted to Master.
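
One possible reading of "when possible" is that the attribute only
acts as a tie-breaker: the highest seqno still wins, and among nodes
sharing it, a node without `no-grastate` is preferred. The sketch below
implements that reading only; it is an assumption, not a description
of what `detect_first_master()` actually does, and the function name
`elect_preferring_clean` is hypothetical.

```python
from typing import Dict, Set


def elect_preferring_clean(seqnos: Dict[str, int], no_grastate: Set[str]) -> str:
    """Pick the bootstrap node, avoiding recovered nodes on ties.

    `seqnos` maps each node to its last seqno; `no_grastate` is the set
    of nodes that went through a forced recovery.
    """
    # Sort key: highest seqno first, then nodes without the attribute,
    # then node name to keep the result deterministic.
    return min(seqnos, key=lambda n: (-seqnos[n], n in no_grastate, n))


print(elect_preferring_clean({"a": 9, "b": 9, "c": 3}, {"a"}))
print(elect_preferring_clean({"a": 9, "b": 8}, {"a"}))
```

The first call picks `b`, which ties with the recovered node `a`; the
second picks `a` because no other node has its seqno.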