Diffstat (limited to 'doc/sphinx/Pacemaker_Development/components.rst')
-rw-r--r--   doc/sphinx/Pacemaker_Development/components.rst   489
1 files changed, 489 insertions, 0 deletions
diff --git a/doc/sphinx/Pacemaker_Development/components.rst b/doc/sphinx/Pacemaker_Development/components.rst
new file mode 100644
index 0000000..e14df26
--- /dev/null
+++ b/doc/sphinx/Pacemaker_Development/components.rst
@@ -0,0 +1,489 @@
+Coding Particular Pacemaker Components
+--------------------------------------
+
+The Pacemaker code can be intricate and difficult to follow. This chapter has
+some high-level descriptions of how individual components work.
+
+
+.. index::
+ single: controller
+ single: pacemaker-controld
+
+Controller
+##########
+
+``pacemaker-controld`` is the Pacemaker daemon that utilizes the other daemons
+to orchestrate actions that need to be taken in the cluster. It receives CIB
+change notifications from the CIB manager, passes the new CIB to the scheduler
+to determine whether anything needs to be done, uses the executor and fencer to
+execute any actions required, and sets failure counts (among other things) via
+the attribute manager.
+
+As might be expected, it has the most code of any of the daemons.
+
+.. index::
+ single: join
+
+Join sequence
+_____________
+
+Most daemons track their cluster peers using Corosync's membership and CPG
+only. The controller additionally requires peers to `join`, which ensures they
+are ready to be assigned tasks. Joining proceeds through a series of phases
+referred to as the `join sequence` or `join process`.
+
+A node's current join phase is tracked by the ``join`` member of ``crm_node_t``
+(used in the peer cache). It is an ``enum crm_join_phase`` that (ideally)
+progresses from the DC's point of view as follows:
+
+* The node starts at ``crm_join_none``.
+
+* The DC sends the node a `join offer` (``CRM_OP_JOIN_OFFER``), and the node
+ proceeds to ``crm_join_welcomed``. This can happen in three ways:
+
+ * The joining node will send a `join announce` (``CRM_OP_JOIN_ANNOUNCE``) at
+ its controller startup, and the DC will reply to that with a join offer.
+ * When the DC's peer status callback notices that the node has joined the
+ messaging layer, it registers ``I_NODE_JOIN`` (which leads to
+ ``A_DC_JOIN_OFFER_ONE`` -> ``do_dc_join_offer_one()`` ->
+ ``join_make_offer()``).
+ * After certain events (notably a new DC being elected), the DC will send all
+ nodes join offers (via ``A_DC_JOIN_OFFER_ALL`` -> ``do_dc_join_offer_all()``).
+
+ These can overlap. The DC can send a join offer and the node can send a join
+ announce at nearly the same time, so the node responds to the original join
+ offer while the DC responds to the join announce with a new join offer. The
+ situation resolves itself after looping a bit.
+
+* The node responds to join offers with a `join request`
+ (``CRM_OP_JOIN_REQUEST``, via ``do_cl_join_offer_respond()`` and
+ ``join_query_callback()``). When the DC receives the request, the
+ node proceeds to ``crm_join_integrated`` (via ``do_dc_join_filter_offer()``).
+
+* As each node is integrated, the current best CIB is sync'ed to each
+ integrated node via ``do_dc_join_finalize()``. As each integrated node's CIB
+ sync succeeds, the DC acks the node's join request (``CRM_OP_JOIN_ACKNAK``)
+ and the node proceeds to ``crm_join_finalized`` (via
+ ``finalize_sync_callback()`` + ``finalize_join_for()``).
+
+* Each node confirms the finalization ack (``CRM_OP_JOIN_CONFIRM`` via
+ ``do_cl_join_finalize_respond()``), including its current resource operation
+ history (via ``controld_query_executor_state()``). Once the DC receives this
+ confirmation, the node proceeds to ``crm_join_confirmed`` via
+ ``do_dc_join_ack()``.
+
+Once all nodes are confirmed, the DC calls ``do_dc_join_final()``, which checks
+for quorum and responds appropriately.
+
+When peers are lost, their join phase is reset to none (in various places).
+
+``crm_update_peer_join()`` updates a node's join phase.
+
+The DC increments the global ``current_join_id`` for each joining round, and
+rejects any (older) replies that don't match.
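+
+As a quick reference, here is a sketch of the phase values with the message
+that (from the DC's point of view) triggers each transition. The values are
+paraphrased from the public cluster headers; check the real definition of
+``enum crm_join_phase`` (which also includes a negative "nack" value) before
+relying on it.
+
+.. code-block:: c
+
+   enum crm_join_phase {
+       crm_join_none = 0,    /* not joined; also reset here when a peer is lost */
+       crm_join_welcomed,    /* DC sent CRM_OP_JOIN_OFFER to the node           */
+       crm_join_integrated,  /* DC accepted the node's CRM_OP_JOIN_REQUEST      */
+       crm_join_finalized,   /* CIB synced and CRM_OP_JOIN_ACKNAK sent          */
+       crm_join_confirmed,   /* DC received the node's CRM_OP_JOIN_CONFIRM      */
+   };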
+
+
+.. index::
+ single: fencer
+ single: pacemaker-fenced
+
+Fencer
+######
+
+``pacemaker-fenced`` is the Pacemaker daemon that handles fencing requests. In
+the broadest terms, fencing works like this:
+
+#. The initiator (an external program such as ``stonith_admin``, or the cluster
+ itself via the controller) asks the local fencer, "Hey, could you please
+ fence this node?"
+#. The local fencer asks all the fencers in the cluster (including itself),
+ "Hey, what fencing devices do you have access to that can fence this node?"
+#. Each fencer in the cluster replies with a list of available devices that
+ it knows about.
+#. Once the original fencer gets all the replies, it asks the most
+ appropriate fencer peer to actually carry out the fencing. It may send
+ out more than one such request if the target node must be fenced with
+ multiple devices.
+#. The chosen fencer(s) call the appropriate fencing resource agent(s) to
+ do the fencing, then reply to the original fencer with the result.
+#. The original fencer broadcasts the result to all fencers.
+#. Each fencer sends the result to each of its local clients (including, at
+ some point, the initiator).
+
+A more detailed description follows.
+
+.. index::
+ single: libstonithd
+
+Initiating a fencing request
+____________________________
+
+A fencing request can be initiated by the cluster or externally, using the
+libstonithd API.
+
+* The cluster always initiates fencing via
+ ``daemons/controld/controld_fencing.c:te_fence_node()`` (which calls the
+ ``fence()`` API method). This occurs when a transition graph synapse contains
+ a ``CRM_OP_FENCE`` XML operation.
+* The main external clients are ``stonith_admin`` and ``cts-fence-helper``.
+ The ``DLM`` project also uses Pacemaker for fencing.
+
+Highlights of the fencing API:
+
+* ``stonith_api_new()`` creates and returns a new ``stonith_t`` object, whose
+ ``cmds`` member has methods for connect, disconnect, fence, etc.
+* The ``fence()`` method creates and sends a ``STONITH_OP_FENCE`` XML request with
+ the desired action and target node. Callers do not have to choose or even
+ have any knowledge about particular fencing devices.
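+
+As a rough illustration, a minimal external client could look like the sketch
+below. The method signatures and the ``st_opt_sync_call`` option are from the
+author's reading of ``<crm/stonith-ng.h>`` and should be verified against the
+installed headers; the node name is illustrative.
+
+.. code-block:: c
+
+   #include <stdio.h>
+   #include <crm/stonith-ng.h>
+
+   int main(void)
+   {
+       stonith_t *st = stonith_api_new();
+       int rc = -1;
+
+       if (st == NULL) {
+           return 1;
+       }
+
+       /* Connect to the local fencer's IPC endpoint */
+       rc = st->cmds->connect(st, "fence-example", NULL);
+       if (rc == 0 /* pcmk_ok */) {
+           /* Ask for node1 to be rebooted, waiting up to 120 seconds. The
+            * caller never names a fencing device; the fencers choose one. */
+           rc = st->cmds->fence(st, st_opt_sync_call, "node1", "reboot", 120, 0);
+           printf("fence request returned %d\n", rc);
+           st->cmds->disconnect(st);
+       }
+       stonith_api_delete(st);
+       return (rc == 0)? 0 : 1;
+   }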
+
+Fencing queries
+_______________
+
+The function calls for a fencing request go something like this:
+
+The local fencer receives the client's request via an IPC or messaging
+layer callback, which calls:
+
+* ``stonith_command()``, which (for requests) calls
+
+ * ``handle_request()``, which (for ``STONITH_OP_FENCE`` from a client) calls
+
+ * ``initiate_remote_stonith_op()``, which creates a ``STONITH_OP_QUERY`` XML
+ request with the target, desired action, timeout, etc., then broadcasts
+ the operation to the cluster group (i.e. all fencer instances) and
+ starts a timer. The query is broadcast because (1) location constraints
+ might prevent the local node from accessing the stonith device directly,
+ and (2) even if the local node does have direct access, another node
+ might be preferred to carry out the fencing.
+
+Each fencer receives the original fencer's ``STONITH_OP_QUERY`` broadcast
+request via IPC or messaging layer callback, which calls:
+
+* ``stonith_command()``, which (for requests) calls
+
+ * ``handle_request()``, which (for ``STONITH_OP_QUERY`` from a peer) calls
+
+ * ``stonith_query()``, which calls
+
+ * ``get_capable_devices()`` with ``stonith_query_capable_device_cb()`` to add
+ device information to an XML reply and send it. (A message is
+ considered a reply if it contains ``T_STONITH_REPLY``, which is only
+ set by fencer peers, not clients.)
+
+The original fencer receives all peers' ``STONITH_OP_QUERY`` replies via IPC
+or messaging layer callback, which calls:
+
+* ``stonith_command()``, which (for replies) calls
+
+ * ``handle_reply()`` which (for ``STONITH_OP_QUERY``) calls
+
+ * ``process_remote_stonith_query()``, which allocates a new query result
+ structure, parses device information into it, and adds it to the
+ operation object. It increments the number of replies received for this
+ operation and compares that against the expected number of replies (i.e.
+ the number of active peers). If this is the last expected reply, it calls
+
+ * ``request_peer_fencing()``, which calculates the timeout and sends
+ ``STONITH_OP_FENCE`` request(s) to carry out the fencing. If the target
+ node has a fencing "topology" (which allows specifications such as
+ "this node can be fenced either with device A, or devices B and C in
+ combination"), it will choose the device(s), and send out as many
+ requests as needed. If it chooses a device, it will choose the peer; a
+ peer is preferred if it has "verified" access to the desired device,
+ meaning that it has the device "running" on it and thus has a monitor
+ operation ensuring reachability.
+
+Fencing operations
+__________________
+
+Each ``STONITH_OP_FENCE`` request goes something like this:
+
+The chosen peer fencer receives the ``STONITH_OP_FENCE`` request via IPC or
+messaging layer callback, which calls:
+
+* ``stonith_command()``, which (for requests) calls
+
+ * ``handle_request()``, which (for ``STONITH_OP_FENCE`` from a peer) calls
+
+ * ``stonith_fence()``, which calls
+
+ * ``schedule_stonith_command()`` (using the supplied device if
+ ``F_STONITH_DEVICE`` was set, otherwise the highest-priority capable
+ device obtained via ``get_capable_devices()`` with
+ ``stonith_fence_get_devices_cb()``), which adds the operation to the
+ device's pending operations list and triggers processing.
+
+The chosen peer fencer's mainloop is triggered and calls
+
+* ``stonith_device_dispatch()``, which calls
+
+ * ``stonith_device_execute()``, which pops off the next item from the device's
+ pending operations list. If acting as the (internally implemented) watchdog
+ agent, it panics the node; otherwise, it calls
+
+ * ``stonith_action_create()`` and ``stonith_action_execute_async()`` to
+ call the fencing agent.
+
+The chosen peer fencer's mainloop is triggered again once the fencing agent
+returns, and calls
+
+* ``stonith_action_async_done()``, which adds the results to an action object,
+ then calls its
+
+ * done callback (``st_child_done()``), which calls ``schedule_stonith_command()``
+ for a new device if there are further required actions to execute or if the
+ original action failed, then builds and sends an XML reply to the original
+ fencer (via ``send_async_reply()``), then checks whether any
+ pending actions are the same as the one just executed and merges them if so.
+
+Fencing replies
+_______________
+
+The original fencer receives the ``STONITH_OP_FENCE`` reply via IPC or
+messaging layer callback, which calls:
+
+* ``stonith_command()``, which (for replies) calls
+
+ * ``handle_reply()``, which calls
+
+ * ``fenced_process_fencing_reply()``, which calls either
+ ``request_peer_fencing()`` (to retry a failed operation, or try the next
+ device in a topology if appropriate, which issues a new
+ ``STONITH_OP_FENCE`` request, proceeding as before) or
+ ``finalize_op()`` (if the operation is definitively failed or
+ successful).
+
+ * ``finalize_op()`` broadcasts the result to all peers.
+
+Finally, all peers receive the broadcast result and call
+
+* ``finalize_op()``, which sends the result to all local clients.
+
+
+.. index::
+ single: fence history
+
+Fencing History
+_______________
+
+The fencer keeps a running history of all fencing operations. The bulk of the
+relevant code is in ``fenced_history.c`` and ensures the history is synchronized
+across all nodes even if a node leaves and rejoins the cluster.
+
+In libstonithd, this information is represented by ``stonith_history_t`` and is
+queryable via the ``history()`` method of ``stonith_api_operations_t``.
+``crm_mon`` and ``stonith_admin`` use this API to display the history.
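+
+For example, a client connected as in the earlier sketch could list the known
+history roughly as follows. The ``history()`` signature and the structure
+members used here are assumptions based on the author's reading of
+``<crm/stonith-ng.h>`` and may differ between versions.
+
+.. code-block:: c
+
+   static void print_fencing_history(stonith_t *st)
+   {
+       stonith_history_t *history = NULL;
+
+       /* A NULL target means "all nodes"; wait up to 120 seconds */
+       if (st->cmds->history(st, st_opt_sync_call, NULL, &history, 120) == 0) {
+           for (stonith_history_t *hp = history; hp != NULL; hp = hp->next) {
+               printf("%s of %s (requested by %s)\n",
+                      hp->action, hp->target, hp->client);
+           }
+           /* The caller owns the list; recent versions provide
+            * stonith_history_free() to release it. */
+       }
+   }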
+
+
+.. index::
+ single: scheduler
+ single: pacemaker-schedulerd
+ single: libpe_status
+ single: libpe_rules
+ single: libpacemaker
+
+Scheduler
+#########
+
+``pacemaker-schedulerd`` is the Pacemaker daemon that runs the Pacemaker
+scheduler for the controller, but "the scheduler" in general refers to related
+library code in ``libpe_status`` and ``libpe_rules`` (``lib/pengine/*.c``), and
+some of ``libpacemaker`` (``lib/pacemaker/pcmk_sched_*.c``).
+
+The purpose of the scheduler is to take a CIB as input and generate a
+transition graph (list of actions that need to be taken) as output.
+
+The controller invokes the scheduler by contacting the scheduler daemon via
+local IPC. Tools such as ``crm_simulate``, ``crm_mon``, and ``crm_resource``
+can also invoke the scheduler, but do so by calling the library functions
+directly. This allows them to run using a ``CIB_file`` without the cluster
+needing to be active.
+
+The main entry point for the scheduler code is
+``lib/pacemaker/pcmk_sched_allocate.c:pcmk__schedule_actions()``. It sets
+defaults and calls a series of functions for the scheduling. Some key steps:
+
+* ``unpack_cib()`` parses most of the CIB XML into data structures, and
+ determines the current cluster status.
+* ``apply_node_criteria()`` applies factors that make resources prefer certain
+ nodes, such as shutdown locks, location constraints, and stickiness.
+* ``pcmk__create_internal_constraints()`` creates internal constraints, such as
+ the implicit ordering for group members, or start actions being implicitly
+ ordered before promote actions.
+* ``pcmk__handle_rsc_config_changes()`` processes resource history entries in
+ the CIB status section. This is used to decide whether certain
+ actions need to be done, such as deleting orphan resources, forcing a restart
+ when a resource definition changes, etc.
+* ``allocate_resources()`` assigns resources to nodes.
+* ``schedule_resource_actions()`` schedules resource-specific actions (which
+ might or might not end up in the final graph).
+* ``pcmk__apply_orderings()`` processes ordering constraints in order to modify
+ action attributes such as optional or required.
+* ``pcmk__create_graph()`` creates the transition graph.
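+
+As a rough sketch of how a tool can drive this code directly (in the spirit of
+``crm_simulate``), the public ``libpe_status`` entry points can be used to
+unpack a CIB file and inspect the resulting working set. The internal
+``pcmk__schedule_actions()`` call is omitted because its argument list is
+version-specific; treat the function names and members below as assumptions to
+verify against the installed headers.
+
+.. code-block:: c
+
+   #include <stdio.h>
+   #include <glib.h>
+   #include <crm/common/xml.h>
+   #include <crm/pengine/status.h>
+
+   static void summarize_cib_file(const char *filename)
+   {
+       xmlNode *cib_xml = filename2xml(filename);
+       pe_working_set_t *data_set = pe_new_working_set();
+
+       data_set->input = cib_xml;   /* raw CIB XML to be unpacked */
+       cluster_status(data_set);    /* parse nodes, resources, and history */
+
+       printf("%u node(s), %u resource(s)\n",
+              g_list_length(data_set->nodes),
+              g_list_length(data_set->resources));
+
+       /* Free the working set (check whether your version also frees the
+        * input XML, or release it separately with free_xml()) */
+       pe_free_working_set(data_set);
+   }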
+
+Challenges
+__________
+
+Working with the scheduler is difficult. Challenges include:
+
+* It is far too much code to keep more than a small portion in your head at one
+ time.
+* Small changes can have large (and unexpected) effects. This is why we have a
+ large number of regression tests (``cts/cts-scheduler``), which should be run
+ after making code changes.
+* It produces an insane number of log messages at debug and trace levels.
+ You can put resource ID(s) in the ``PCMK_trace_tags`` environment variable to
+ enable trace-level messages only when related to specific resources.
+* Different parts of the main ``pe_working_set_t`` structure are finalized at
+ different points in the scheduling process, so you have to keep in mind
+ whether information you're using at one point in the code can possibly change
+ later. For example, data unpacked from the CIB can safely be used anytime
+ after ``unpack_cib()``, but actions may become optional or required anytime
+ before ``pcmk__create_graph()``. There's no easy way to deal with this.
+* Many names of struct members, functions, etc., are suboptimal, but are part
+ of the public API and cannot be changed until an API backward compatibility
+ break.
+
+
+.. index::
+ single: pe_working_set_t
+
+Cluster Working Set
+___________________
+
+The main data object for the scheduler is ``pe_working_set_t``, which contains
+all information needed about nodes, resources, constraints, etc., both as the
+raw CIB XML and parsed into more usable data structures, plus the resulting
+transition graph XML. The variable name is usually ``data_set``.
+
+.. index::
+ single: pe_resource_t
+
+Resources
+_________
+
+``pe_resource_t`` is the data object representing cluster resources. A resource
+has a variant: primitive (a.k.a. native), group, clone, or bundle.
+
+The resource object has members for two sets of methods:
+``resource_object_functions_t``, from the ``libpe_status`` public API, and
+``resource_alloc_functions_t``, whose implementation is internal to
+``libpacemaker``. The actual functions vary by variant.
+
+The object functions have basic capabilities such as unpacking the resource
+XML, and determining the current or planned location of the resource.
+
+The allocation functions have more obscure capabilities needed for scheduling,
+such as processing location and ordering constraints. For example,
+``pcmk__create_internal_constraints()`` simply calls the
+``internal_constraints()`` method for each top-level resource in the cluster.
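+
+Conceptually, that dispatch looks something like the sketch below; the
+method's exact argument list varies by version, so treat this as illustrative
+only.
+
+.. code-block:: c
+
+   static void create_internal_constraints_sketch(pe_working_set_t *data_set)
+   {
+       for (GList *iter = data_set->resources; iter != NULL; iter = iter->next) {
+           pe_resource_t *rsc = iter->data;
+
+           /* rsc->cmds points at the primitive, group, clone, or bundle
+            * implementation, selected by the resource's variant */
+           rsc->cmds->internal_constraints(rsc, data_set);
+       }
+   }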
+
+.. index::
+ single: pe_node_t
+
+Nodes
+_____
+
+Allocation of resources to nodes is done by choosing the node with the highest
+score for a given resource. The scheduler does a bunch of processing to
+generate the scores, then the actual allocation is straightforward.
+
+Node lists are frequently used. For example, ``pe_working_set_t`` has a
+``nodes`` member which is a list of all nodes in the cluster, and
+``pe_resource_t`` has a ``running_on`` member which is a list of all nodes on
+which the resource is (or might be) active. These are lists of ``pe_node_t``
+objects.
+
+The ``pe_node_t`` object contains a ``struct pe_node_shared_s *details`` member
+with all node information that is independent of resource allocation (the node
+name, etc.).
+
+The working set's ``nodes`` member contains the original of this information.
+All other node lists contain copies of ``pe_node_t`` where only the ``details``
+member points to the originals in the working set's ``nodes`` list. In this
+way, the other members of ``pe_node_t`` (such as ``weight``, which is the node
+score) may vary by node list, while the common details are shared.
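+
+A small illustration of that sharing follows. ``pe__copy_node()`` is assumed
+here to be the internal copy helper, and the node name is illustrative; the
+point is only that the copy has its own score but shares ``details``.
+
+.. code-block:: c
+
+   static void node_copy_example(pe_working_set_t *data_set)
+   {
+       pe_node_t *original = pe_find_node(data_set->nodes, "node1");
+       pe_node_t *copy = NULL;
+
+       if (original == NULL) {
+           return;
+       }
+       copy = pe__copy_node(original);     /* assumed copy helper */
+
+       copy->weight = 100;   /* private to whichever list holds this copy */
+       CRM_ASSERT(copy->details == original->details);  /* shared details */
+       free(copy);           /* copies are shallow; details are not freed */
+   }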
+
+.. index::
+ single: pe_action_t
+ single: pe_action_flags
+
+Actions
+_______
+
+``pe_action_t`` is the data object representing actions that might need to be
+taken. These could be resource actions, cluster-wide actions such as fencing a
+node, or "pseudo-actions" which are abstractions used as convenient points for
+ordering other actions against.
+
+It has a ``flags`` member which is a bitmask of ``enum pe_action_flags``. The
+most important of these are ``pe_action_runnable`` (if not set, the action is
+"blocked" and cannot be added to the transition graph) and
+``pe_action_optional`` (actions with this set will not be added to the
+transition graph; actions often start out as optional, and may become required
+later).
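+
+Code that examines an action typically tests these flags with
+``pcmk_is_set()``, roughly as in this sketch:
+
+.. code-block:: c
+
+   static void describe_action(const pe_action_t *action)
+   {
+       if (!pcmk_is_set(action->flags, pe_action_runnable)) {
+           /* blocked: cannot be added to the transition graph */
+
+       } else if (pcmk_is_set(action->flags, pe_action_optional)) {
+           /* left out of the graph unless something later makes it required */
+       }
+   }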
+
+
+.. index::
+ single: pcmk__colocation_t
+
+Colocations
+___________
+
+``pcmk__colocation_t`` is the data object representing colocations.
+
+Colocation constraints come into play in these parts of the scheduler code:
+
+* When sorting resources for assignment, so resources with the highest node
+ score are assigned first (see ``cmp_resources()``)
+* When updating node scores for resource assignment or promotion priority
+* When assigning resources, so any resources to be colocated with can be
+ assigned first, and so colocations affect where the resource is assigned
+* When choosing roles for promotable clone instances, so colocations involving
+ a specific role can affect which instances are promoted
+
+The resource allocation functions have several methods related to colocations:
+
+* ``apply_coloc_score()``: This applies a colocation's score to either the
+ dependent's allowed node scores (if called while resources are being
+ assigned) or the dependent's priority (if called while choosing promotable
+ instance roles). It can behave differently depending on whether it is being
+ called as the primary's method or as the dependent's method.
+* ``add_colocated_node_scores()``: This updates a table of nodes for a given
+ colocation attribute and score. It goes through colocations involving a given
+ resource, and updates the scores of the nodes in the table with the best
+ scores of nodes that match up according to the colocation criteria.
+* ``colocated_resources()``: This generates a list of all resources involved
+ in mandatory colocations (directly or indirectly via colocation chains) with
+ a given resource.
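+
+As a purely conceptual sketch (not the real implementation, which also handles
+roles, finite scores, and variant-specific cases), applying a mandatory
+colocation while assigning the dependent amounts to banning the dependent from
+nodes where the primary cannot run. The ``dependent`` and ``primary`` member
+names follow the terminology above but should be checked against the actual
+struct definition.
+
+.. code-block:: c
+
+   static void ban_dependent_where_primary_cannot_run(pcmk__colocation_t *colocation)
+   {
+       GHashTableIter iter;
+       pe_node_t *node = NULL;
+
+       g_hash_table_iter_init(&iter, colocation->dependent->allowed_nodes);
+       while (g_hash_table_iter_next(&iter, NULL, (gpointer *) &node)) {
+           pe_node_t *primary_node =
+               g_hash_table_lookup(colocation->primary->allowed_nodes,
+                                   node->details->id);
+
+           /* Ban the dependent from nodes where the primary cannot run */
+           if ((primary_node == NULL) || (primary_node->weight < 0)) {
+               node->weight = -CRM_SCORE_INFINITY;
+           }
+       }
+   }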
+
+
+.. index::
+ single: pe__ordering_t
+ single: pe_ordering
+
+Orderings
+_________
+
+Ordering constraints are simple in concept, but they are one of the most
+important, powerful, and difficult to follow aspects of the scheduler code.
+
+``pe__ordering_t`` is the data object representing an ordering, better thought
+of as a relationship between two actions, since the relation can be more
+complex than just "this one runs after that one".
+
+For an ordering "A then B", the code generally refers to A as "first" or
+"before", and B as "then" or "after".
+
+Much of the power comes from ``enum pe_ordering``, whose values are flags that
+determine how an ordering behaves. There are many obscure flags with big
+effects. A few examples:
+
+* ``pe_order_none`` means the ordering is disabled and will be ignored. It's 0,
+ meaning no flags set, so it must be compared with equality rather than
+ ``pcmk_is_set()``.
+* ``pe_order_optional`` means the ordering does not make either action
+ required, so it only applies if they both become required for other reasons.
+* ``pe_order_implies_first`` means that if action B becomes required for any
+ reason, then action A will become required as well.
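+
+For instance, code that inspects an ordering's flags has to treat the disabled
+case specially, roughly as in this sketch:
+
+.. code-block:: c
+
+   static void check_ordering_flags(enum pe_ordering flags)
+   {
+       if (flags == pe_order_none) {
+           /* disabled: zero means "no flags set", so this must be tested
+            * with equality rather than pcmk_is_set() */
+           return;
+       }
+       if (pcmk_is_set(flags, pe_order_implies_first)) {
+           /* if the "then" action becomes required, "first" becomes
+            * required as well */
+       }
+   }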