Diffstat (limited to 'doc/sphinx/Clusters_from_Scratch/verification.rst')
-rw-r--r--  doc/sphinx/Clusters_from_Scratch/verification.rst | 222
1 file changed, 222 insertions, 0 deletions
diff --git a/doc/sphinx/Clusters_from_Scratch/verification.rst b/doc/sphinx/Clusters_from_Scratch/verification.rst
new file mode 100644
index 0000000..08fab31
--- /dev/null
+++ b/doc/sphinx/Clusters_from_Scratch/verification.rst
@@ -0,0 +1,222 @@

Start and Verify Cluster
------------------------

Start the Cluster
#################

Now that Corosync is configured, it is time to start the cluster.
The command below will start the ``corosync`` and ``pacemaker`` services on
both nodes in the cluster.

.. code-block:: console

    [root@pcmk-1 ~]# pcs cluster start --all
    pcmk-1: Starting Cluster...
    pcmk-2: Starting Cluster...

.. NOTE::

    An alternative to using the ``pcs cluster start --all`` command
    is to issue either of the below command sequences on each node in the
    cluster separately:

    .. code-block:: console

        # pcs cluster start
        Starting Cluster...

    or

    .. code-block:: console

        # systemctl start corosync.service
        # systemctl start pacemaker.service

.. IMPORTANT::

    In this example, we are not enabling the ``corosync`` and ``pacemaker``
    services to start at boot. If a cluster node fails or is rebooted, you will
    need to run ``pcs cluster start [<NODENAME> | --all]`` to start the cluster
    on it. While you can enable the services to start at boot (for example,
    using ``pcs cluster enable [<NODENAME> | --all]``), requiring a manual
    start of cluster services gives you the opportunity to do a post-mortem
    investigation of a node failure before returning it to the cluster.
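If automatic start-up does make sense in your environment, the command below
(a sketch only, shown here without output) should enable both services at boot
on every node; running ``systemctl enable corosync.service pacemaker.service``
on each node individually is a roughly equivalent alternative.

.. code-block:: console

    [root@pcmk-1 ~]# pcs cluster enable --all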
Verify Corosync Installation
################################

First, use ``corosync-cfgtool`` to check whether cluster communication is happy:

.. code-block:: console

    [root@pcmk-1 ~]# corosync-cfgtool -s
    Local node ID 1, transport knet
    LINK ID 0 udp
            addr    = 192.168.122.101
            status:
                    nodeid:  1:     localhost
                    nodeid:  2:     connected

We can see here that everything appears normal with our fixed IP address (not a
``127.0.0.x`` loopback address) listed as the ``addr``, and ``localhost`` and
``connected`` for the statuses of nodeid 1 and nodeid 2, respectively.

If you see something different, you might want to start by checking
the node's network, firewall, and SELinux configurations.
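For example, the commands below (a generic starting point, output omitted, and
assuming a distribution that uses ``firewalld`` and SELinux, as the hosts in
this guide do) show the node's addresses, the services currently allowed
through the firewall, and the SELinux mode:

.. code-block:: console

    [root@pcmk-1 ~]# ip addr show
    [root@pcmk-1 ~]# firewall-cmd --list-services
    [root@pcmk-1 ~]# getenforce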
Next, check the membership and quorum APIs:

.. code-block:: console

    [root@pcmk-1 ~]# corosync-cmapctl | grep members
    runtime.members.1.config_version (u64) = 0
    runtime.members.1.ip (str) = r(0) ip(192.168.122.101)
    runtime.members.1.join_count (u32) = 1
    runtime.members.1.status (str) = joined
    runtime.members.2.config_version (u64) = 0
    runtime.members.2.ip (str) = r(0) ip(192.168.122.102)
    runtime.members.2.join_count (u32) = 1
    runtime.members.2.status (str) = joined

    [root@pcmk-1 ~]# pcs status corosync

    Membership information
    ----------------------
        Nodeid      Votes Name
             1          1 pcmk-1 (local)
             2          1 pcmk-2

You should see both nodes have joined the cluster.

Verify Pacemaker Installation
#################################

Now that we have confirmed that Corosync is functional, we can check
the rest of the stack. Pacemaker has already been started, so verify
the necessary processes are running:

.. code-block:: console

    [root@pcmk-1 ~]# ps axf
      PID TTY      STAT   TIME COMMAND
        2 ?        S      0:00 [kthreadd]
    ...lots of processes...
    17121 ?        SLsl   0:01 /usr/sbin/corosync -f
    17133 ?        Ss     0:00 /usr/sbin/pacemakerd
    17134 ?        Ss     0:00  \_ /usr/libexec/pacemaker/pacemaker-based
    17135 ?        Ss     0:00  \_ /usr/libexec/pacemaker/pacemaker-fenced
    17136 ?        Ss     0:00  \_ /usr/libexec/pacemaker/pacemaker-execd
    17137 ?        Ss     0:00  \_ /usr/libexec/pacemaker/pacemaker-attrd
    17138 ?        Ss     0:00  \_ /usr/libexec/pacemaker/pacemaker-schedulerd
    17139 ?        Ss     0:00  \_ /usr/libexec/pacemaker/pacemaker-controld

If that looks OK, check the ``pcs status`` output:

.. code-block:: console

    [root@pcmk-1 ~]# pcs status
    Cluster name: mycluster

    WARNINGS:
    No stonith devices and stonith-enabled is not false

    Cluster Summary:
      * Stack: corosync
      * Current DC: pcmk-2 (version 2.1.2-4.el9-ada5c3b36e2) - partition with quorum
      * Last updated: Wed Jul 27 00:09:55 2022
      * Last change:  Wed Jul 27 00:07:08 2022 by hacluster via crmd on pcmk-2
      * 2 nodes configured
      * 0 resource instances configured

    Node List:
      * Online: [ pcmk-1 pcmk-2 ]

    Full List of Resources:
      * No resources

    Daemon Status:
      corosync: active/disabled
      pacemaker: active/disabled
      pcsd: active/enabled

Finally, ensure there are no start-up errors from ``corosync`` or ``pacemaker``
(aside from messages relating to not having STONITH configured, which are OK at
this point):

.. code-block:: console

    [root@pcmk-1 ~]# journalctl -b | grep -i error

.. NOTE::

    Other operating systems may report startup errors in other locations
    (for example, ``/var/log/messages``).

Repeat these checks on the other node. The results should be the same.

Explore the Existing Configuration
##################################

For those who are not afraid of XML, you can see the raw cluster
configuration and status by using the ``pcs cluster cib`` command.

.. topic:: The last XML you'll see in this document

    .. code-block:: console

        [root@pcmk-1 ~]# pcs cluster cib

    .. code-block:: xml

        <cib crm_feature_set="3.13.0" validate-with="pacemaker-3.8" epoch="5" num_updates="4" admin_epoch="0" cib-last-written="Wed Jul 27 00:07:08 2022" update-origin="pcmk-2" update-client="crmd" update-user="hacluster" have-quorum="1" dc-uuid="2">
          <configuration>
            <crm_config>
              <cluster_property_set id="cib-bootstrap-options">
                <nvpair id="cib-bootstrap-options-have-watchdog" name="have-watchdog" value="false"/>
                <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="2.1.2-4.el9-ada5c3b36e2"/>
                <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
                <nvpair id="cib-bootstrap-options-cluster-name" name="cluster-name" value="mycluster"/>
              </cluster_property_set>
            </crm_config>
            <nodes>
              <node id="1" uname="pcmk-1"/>
              <node id="2" uname="pcmk-2"/>
            </nodes>
            <resources/>
            <constraints/>
            <rsc_defaults>
              <meta_attributes id="build-resource-defaults">
                <nvpair id="build-resource-stickiness" name="resource-stickiness" value="1"/>
              </meta_attributes>
            </rsc_defaults>
          </configuration>
          <status>
            <node_state id="2" uname="pcmk-2" in_ccm="true" crmd="online" crm-debug-origin="do_state_transition" join="member" expected="member">
              <lrm id="2">
                <lrm_resources/>
              </lrm>
            </node_state>
            <node_state id="1" uname="pcmk-1" in_ccm="true" crmd="online" crm-debug-origin="do_state_transition" join="member" expected="member">
              <lrm id="1">
                <lrm_resources/>
              </lrm>
            </node_state>
          </status>
        </cib>

Before we make any changes, it's a good idea to check the validity of
the configuration.

.. code-block:: console

    [root@pcmk-1 ~]# pcs cluster verify --full
    Error: invalid cib:
    (unpack_resources)  error: Resource start-up disabled since no STONITH resources have been defined
    (unpack_resources)  error: Either configure some or disable STONITH with the stonith-enabled option
    (unpack_resources)  error: NOTE: Clusters with shared data need STONITH to ensure data integrity
    crm_verify: Errors found during check: config not valid

    Error: Errors have occurred, therefore pcs is unable to continue

As you can see, the tool has found some errors. The cluster will not start any
resources until we configure STONITH.
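As the ``crm_verify`` line in the output above suggests, ``pcs cluster verify``
relies on Pacemaker's own ``crm_verify`` tool. If you ever need to run the
check without ``pcs``, a roughly equivalent command (a sketch, output omitted)
is shown below; ``-L`` checks the live cluster configuration and ``-V`` makes
the messages more verbose.

.. code-block:: console

    [root@pcmk-1 ~]# crm_verify -L -V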