Notes regarding the Galera resource agent --- In the resource agent, the action of bootstrapping a Galera cluster is implemented into a series of small steps, by using: * Two CIB attributes `last-committed` and `bootstrap` to elect a bootstrap node that will restart the cluster. * One CIB attribute `sync-needed` that will identify that joining nodes are in the process of synchronizing their local database via SST. * A Master/Slave pacemaker resource which helps splitting the boot into steps, up to a point where a galera node is available. * the recurring monitor action to coordinate switch from one state to another. How boot works ==== There are two things to know to understand how the resource agent restart a Galera cluster. ### Bootstrap the cluster with the right node When synced, the nodes of a galera cluster have in common a last seqno, which identifies the last transaction considered successful by a majority of nodes in the cluster (think quorum). To restart a cluster, the resource agent must ensure that it will bootstrap the cluster from an node which is up-to-date, i.e which has the highest seqno of all nodes. As a result, if the resource agent cannot retrieve the seqno on all nodes, it won't be able to safely identify a bootstrap node, and will simply refuse to start the galera cluster. ### synchronizing nodes can be a long operation Starting a bootstrap node is relatively fast, so it's performed during the "promote" operation, which is a one-off, time-bounded operation. Subsequent nodes will need to synchronize via SST, which consists in "pushing" an entire Galera DB from one node to another. There is no perfect time-out, as time spent during synchronization depends on the size of the DB. Thus, joiner nodes are started during the "monitor" operation, which is a recurring operation that can better track the progress of the SST. State flow ==== General idea for starting Galera: * Before starting the Galera cluster each node needs to go in Slave state so that the agent records its last seqno into the CIB. __ This uses attribute last-committed __ * When all node went in Slave, the agent can safely determine the last seqno and elect a bootstrap node (`detect_first_master()`). __ This uses attribute bootstrap __ * The agent then sets the score of the elected bootstrap node to Master so that pacemaker promote it and start the first Galera server. * Once the first Master is running, the agent can start joiner nodes during the "monitor" operation, and starts monitoring their SST sync. __ This uses attribute sync-needed __ * Only when SST is over on joiner nodes, the agent promotes them to Master. At this point, the entire Galera cluster is up. Attribute usage and liveness ==== Here is how attributes are created on a per-node basis. If you modify the resource agent make sure those properties still hold. ### last-committed It is just a temporary hint for the resource agent to help elect a bootstrap node. Once the bootstrap attribute is set on one of the nodes, we can get rid of last-committed. - Used : during Slave state to compare seqno - Created: before entering Slave state: . at startup in `galera_start()` . or when a Galera node is stopped in `galera_demote()` - Deleted: just before node starts in `galera_start_local_node()`; cleaned-up during `galera_demote()` and `galera_stop()` We delete last-committed before starting Galera, to avoid race conditions that could arise due to discrepancies between the CIB and Galera. ### bootstrap Attribute set on the node that is elected to bootstrap Galera. - Used : during promotion in `galera_start_local_node()` - Created: at startup once all nodes have `last-committed`; or during monitor if all nodes have failed - Deleted: in `galera_start_local_node()`, just after the bootstrap node started and is ready; cleaned-up during `galera_demote()` and `galera_stop()` There cannot be more than one bootstrap node at any time, otherwise the Galera cluster would stop replicating properly. ### sync-needed While this attribute is set on a node, the Galera node is in JOIN state, i.e. SST is in progress and the node cannot serve queries. The resource agent relies on the underlying SST method to monitor the progress of the SST. For instance, with `wsrep_sst_rsync`, timeout would be reported by rsync, the Galera node would go in Non-primary state, which would make `galera_monitor()` fail. - Used : during recurring slave monitor in `check_sync_status()` - Created: in `galera_start_local_node()`, just after the joiner node started and entered the Galera cluster - Deleted: during recurring slave monitor in `check_sync_status()` as soon as the Galera code reports to be SYNC-ed. ### no-grastate If a galera node was unexpectedly killed in a middle of a replication, InnoDB can retain the equivalent of a XA transaction in prepared state in its redo log. If so, mysqld cannot recover state (nor last seqno) automatically, and special recovery heuristic has to be used to unblock the node. This transient attribute is used to keep track of forced recoveries to prevent bootstrapping a cluster from a recovered node when possible. - Used : during `detect_first_master()` to elect the bootstrap node - Created: in `detect_last_commit()` if the node has a pending XA transaction to recover in the redo log - Deleted: when a node is promoted to Master.