Diffstat (limited to 'doc/sphinx/Pacemaker_Development/components.rst')
-rw-r--r--   doc/sphinx/Pacemaker_Development/components.rst   489
1 file changed, 489 insertions, 0 deletions
diff --git a/doc/sphinx/Pacemaker_Development/components.rst b/doc/sphinx/Pacemaker_Development/components.rst
new file mode 100644
index 0000000..e14df26
--- /dev/null
+++ b/doc/sphinx/Pacemaker_Development/components.rst
@@ -0,0 +1,489 @@
Coding Particular Pacemaker Components
--------------------------------------

The Pacemaker code can be intricate and difficult to follow. This chapter has
some high-level descriptions of how individual components work.


.. index::
   single: controller
   single: pacemaker-controld

Controller
##########

``pacemaker-controld`` is the Pacemaker daemon that utilizes the other daemons
to orchestrate actions that need to be taken in the cluster. It receives CIB
change notifications from the CIB manager, passes the new CIB to the scheduler
to determine whether anything needs to be done, uses the executor and fencer to
execute any actions required, and sets failure counts (among other things) via
the attribute manager.

As might be expected, it has the most code of any of the daemons.

.. index::
   single: join

Join sequence
_____________

Most daemons track their cluster peers using Corosync's membership and CPG
only. The controller additionally requires peers to `join`, which ensures they
are ready to be assigned tasks. Joining proceeds through a series of phases
referred to as the `join sequence` or `join process`.

A node's current join phase is tracked by the ``join`` member of ``crm_node_t``
(used in the peer cache). It is an ``enum crm_join_phase`` that (ideally)
progresses from the DC's point of view as follows:

* The node initially starts at ``crm_join_none``.

* The DC sends the node a `join offer` (``CRM_OP_JOIN_OFFER``), and the node
  proceeds to ``crm_join_welcomed``. This can happen in three ways:

  * The joining node will send a `join announce` (``CRM_OP_JOIN_ANNOUNCE``) at
    its controller startup, and the DC will reply to that with a join offer.
  * When the DC's peer status callback notices that the node has joined the
    messaging layer, it registers ``I_NODE_JOIN`` (which leads to
    ``A_DC_JOIN_OFFER_ONE`` -> ``do_dc_join_offer_one()`` ->
    ``join_make_offer()``).
  * After certain events (notably a new DC being elected), the DC will send all
    nodes join offers (via ``A_DC_JOIN_OFFER_ALL`` ->
    ``do_dc_join_offer_all()``).

  These can overlap. The DC can send a join offer and the node can send a join
  announce at nearly the same time, so the node responds to the original join
  offer while the DC responds to the join announce with a new join offer. The
  situation resolves itself after looping a bit.

* The node responds to join offers with a `join request`
  (``CRM_OP_JOIN_REQUEST``, via ``do_cl_join_offer_respond()`` and
  ``join_query_callback()``). When the DC receives the request, the
  node proceeds to ``crm_join_integrated`` (via ``do_dc_join_filter_offer()``).

* As each node is integrated, the current best CIB is sync'ed to each
  integrated node via ``do_dc_join_finalize()``. As each integrated node's CIB
  sync succeeds, the DC acks the node's join request (``CRM_OP_JOIN_ACKNAK``)
  and the node proceeds to ``crm_join_finalized`` (via
  ``finalize_sync_callback()`` + ``finalize_join_for()``).

* Each node confirms the finalization ack (``CRM_OP_JOIN_CONFIRM`` via
  ``do_cl_join_finalize_respond()``), including its current resource operation
  history (via ``controld_query_executor_state()``). Once the DC receives this
  confirmation, the node proceeds to ``crm_join_confirmed`` via
  ``do_dc_join_ack()``.

Once all nodes are confirmed, the DC calls ``do_dc_join_final()``, which checks
for quorum and responds appropriately.

When peers are lost, their join phase is reset to none (in various places).

``crm_update_peer_join()`` updates a node's join phase.

The DC increments the global ``current_join_id`` for each joining round, and
rejects any (older) replies that don't match.
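
The shape of this bookkeeping can be illustrated with a small standalone
sketch. The enum, struct, and helper below are simplified stand-ins for
``enum crm_join_phase``, ``crm_node_t``, and ``crm_update_peer_join()``, not
the actual controller code:

.. code-block:: c

   #include <stdio.h>

   /* Simplified stand-ins for enum crm_join_phase and crm_node_t */
   enum join_phase {
       JOIN_NONE, JOIN_WELCOMED, JOIN_INTEGRATED, JOIN_FINALIZED,
       JOIN_CONFIRMED
   };

   struct peer {
       const char *name;
       enum join_phase join;
   };

   static int current_join_id = 1;   /* bumped by the DC per join round */

   /* Advance a peer's join phase, discarding replies from older rounds
    * (loosely modeled on what crm_update_peer_join() does) */
   static void update_peer_join(struct peer *p, int reply_join_id,
                                enum join_phase next_phase)
   {
       if (reply_join_id != current_join_id) {
           printf("%s: ignoring reply from old join round %d (current %d)\n",
                  p->name, reply_join_id, current_join_id);
           return;
       }
       if (next_phase > p->join) {   /* phases only move forward */
           p->join = next_phase;
       }
   }

   int main(void)
   {
       struct peer node1 = { "node1", JOIN_NONE };

       update_peer_join(&node1, 1, JOIN_WELCOMED);   /* join offer accepted */
       update_peer_join(&node1, 1, JOIN_INTEGRATED); /* join request received */
       update_peer_join(&node1, 0, JOIN_FINALIZED);  /* stale reply, ignored */
       printf("node1 ends round in phase %d\n", node1.join);
       return 0;
   }
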
.. index::
   single: fencer
   single: pacemaker-fenced

Fencer
######

``pacemaker-fenced`` is the Pacemaker daemon that handles fencing requests. In
the broadest terms, fencing works like this:

#. The initiator (an external program such as ``stonith_admin``, or the cluster
   itself via the controller) asks the local fencer, "Hey, could you please
   fence this node?"
#. The local fencer asks all the fencers in the cluster (including itself),
   "Hey, what fencing devices do you have access to that can fence this node?"
#. Each fencer in the cluster replies with a list of available devices that
   it knows about.
#. Once the original fencer gets all the replies, it asks the most
   appropriate fencer peer to actually carry out the fencing. It may send
   out more than one such request if the target node must be fenced with
   multiple devices.
#. The chosen fencer(s) call the appropriate fencing resource agent(s) to
   do the fencing, then reply to the original fencer with the result.
#. The original fencer broadcasts the result to all fencers.
#. Each fencer sends the result to each of its local clients (including, at
   some point, the initiator).

A more detailed description follows.

.. index::
   single: libstonithd

Initiating a fencing request
____________________________

A fencing request can be initiated by the cluster or externally, using the
libstonithd API.

* The cluster always initiates fencing via
  ``daemons/controld/controld_fencing.c:te_fence_node()`` (which calls the
  ``fence()`` API method). This occurs when a transition graph synapse contains
  a ``CRM_OP_FENCE`` XML operation.
* The main external clients are ``stonith_admin`` and ``cts-fence-helper``.
  The ``DLM`` project also uses Pacemaker for fencing.

Highlights of the fencing API:

* ``stonith_api_new()`` creates and returns a new ``stonith_t`` object, whose
  ``cmds`` member has methods for connect, disconnect, fence, etc.
* The ``fence()`` method creates and sends a ``STONITH_OP_FENCE`` XML request
  with the desired action and target node. Callers do not have to choose or
  even have any knowledge about particular fencing devices.
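
For illustration, a minimal client using these methods might look like the
sketch below. This is not code from the Pacemaker tree: the method names come
from the list above, while the exact argument lists and the client name are
assumptions that should be checked against ``include/crm/stonith-ng.h`` for
the version at hand.

.. code-block:: c

   #include <crm/stonith-ng.h>

   int main(void)
   {
       stonith_t *st = stonith_api_new();
       int rc = -1;

       if (st == NULL) {
           return 1;
       }

       /* Connect to the local fencer as a named client (assumed signature) */
       rc = st->cmds->connect(st, "example-client", NULL);
       if (rc == 0) {
           /* Ask that node1 be turned off, with a 120s timeout; the fencer
            * chooses the device(s), not the caller (assumed signature) */
           rc = st->cmds->fence(st, 0, "node1", "off", 120, 0);
           st->cmds->disconnect(st);
       }
       stonith_api_delete(st);
       return rc;
   }
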
Fencing queries
_______________

The function calls for a fencing request go something like this:

The local fencer receives the client's request via an IPC or messaging
layer callback, which calls

* ``stonith_command()``, which (for requests) calls

  * ``handle_request()``, which (for ``STONITH_OP_FENCE`` from a client) calls

    * ``initiate_remote_stonith_op()``, which creates a ``STONITH_OP_QUERY``
      XML request with the target, desired action, timeout, etc., then
      broadcasts the operation to the cluster group (i.e. all fencer
      instances) and starts a timer. The query is broadcast because
      (1) location constraints might prevent the local node from accessing
      the stonith device directly, and (2) even if the local node does have
      direct access, another node might be preferred to carry out the fencing.

Each fencer receives the original fencer's ``STONITH_OP_QUERY`` broadcast
request via IPC or messaging layer callback, which calls:

* ``stonith_command()``, which (for requests) calls

  * ``handle_request()``, which (for ``STONITH_OP_QUERY`` from a peer) calls

    * ``stonith_query()``, which calls

      * ``get_capable_devices()`` with ``stonith_query_capable_device_cb()``
        to add device information to an XML reply and send it. (A message is
        considered a reply if it contains ``T_STONITH_REPLY``, which is only
        set by fencer peers, not clients.)

The original fencer receives all peers' ``STONITH_OP_QUERY`` replies via IPC
or messaging layer callback, which calls:

* ``stonith_command()``, which (for replies) calls

  * ``handle_reply()``, which (for ``STONITH_OP_QUERY``) calls

    * ``process_remote_stonith_query()``, which allocates a new query result
      structure, parses device information into it, and adds it to the
      operation object. It increments the number of replies received for this
      operation, and compares it against the expected number of replies (i.e.
      the number of active peers), and if this is the last expected reply,
      calls

      * ``request_peer_fencing()``, which calculates the timeout and sends
        ``STONITH_OP_FENCE`` request(s) to carry out the fencing. If the
        target node has a fencing "topology" (which allows specifications such
        as "this node can be fenced either with device A, or devices B and C
        in combination"), it will choose the device(s), and send out as many
        requests as needed. If it chooses a device, it will choose the peer; a
        peer is preferred if it has "verified" access to the desired device,
        meaning that it has the device "running" on it and thus has a monitor
        operation ensuring reachability.
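
The tallying and peer selection can be pictured with a toy model. The struct
and helper names below are invented for illustration; the real logic
(including topology and timeout handling) lives in
``process_remote_stonith_query()`` and ``request_peer_fencing()``:

.. code-block:: c

   #include <stdio.h>
   #include <stdbool.h>

   /* Invented stand-in for one peer's query reply */
   struct query_reply {
       const char *peer;
       int ndevices;       /* devices this peer reported for the target */
       bool verified;      /* peer has a monitor proving device access */
   };

   /* Prefer a peer with "verified" access; fall back to any capable one */
   static const struct query_reply *choose_peer(const struct query_reply *r,
                                                int nreplies)
   {
       const struct query_reply *best = NULL;

       for (int i = 0; i < nreplies; i++) {
           if (r[i].ndevices == 0) {
               continue;   /* peer cannot fence the target at all */
           }
           if ((best == NULL) || (r[i].verified && !best->verified)) {
               best = &r[i];
           }
       }
       return best;
   }

   int main(void)
   {
       /* Pretend all expected replies (one per active peer) have arrived */
       struct query_reply replies[] = {
           { "node1", 0, false },  /* no capable device */
           { "node2", 1, false },  /* capable, but unverified */
           { "node3", 1, true },   /* capable and verified: preferred */
       };
       const struct query_reply *peer = choose_peer(replies, 3);

       printf("would send STONITH_OP_FENCE to %s\n",
              (peer != NULL) ? peer->peer : "nobody");
       return 0;
   }
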
Fencing operations
__________________

Each ``STONITH_OP_FENCE`` request goes something like this:

The chosen peer fencer receives the ``STONITH_OP_FENCE`` request via IPC or
messaging layer callback, which calls:

* ``stonith_command()``, which (for requests) calls

  * ``handle_request()``, which (for ``STONITH_OP_FENCE`` from a peer) calls

    * ``stonith_fence()``, which calls

      * ``schedule_stonith_command()`` (using the supplied device if
        ``F_STONITH_DEVICE`` was set, otherwise the highest-priority capable
        device obtained via ``get_capable_devices()`` with
        ``stonith_fence_get_devices_cb()``), which adds the operation to the
        device's pending operations list and triggers processing.

The chosen peer fencer's mainloop is triggered and calls

* ``stonith_device_dispatch()``, which calls

  * ``stonith_device_execute()``, which pops off the next item from the
    device's pending operations list. If acting as the (internally
    implemented) watchdog agent, it panics the node; otherwise it calls

    * ``stonith_action_create()`` and ``stonith_action_execute_async()`` to
      call the fencing agent.

The chosen peer fencer's mainloop is triggered again once the fencing agent
returns, and calls

* ``stonith_action_async_done()``, which adds the results to an action object,
  then calls its

  * done callback (``st_child_done()``), which calls
    ``schedule_stonith_command()`` for a new device if there are further
    required actions to execute or if the original action failed, then builds
    and sends an XML reply to the original fencer (via
    ``send_async_reply()``), then checks whether any pending actions are the
    same as the one just executed and merges them if so.

Fencing replies
_______________

The original fencer receives the ``STONITH_OP_FENCE`` reply via IPC or
messaging layer callback, which calls:

* ``stonith_command()``, which (for replies) calls

  * ``handle_reply()``, which calls

    * ``fenced_process_fencing_reply()``, which calls either
      ``request_peer_fencing()`` (to retry a failed operation, or try the next
      device in a topology if appropriate, which issues a new
      ``STONITH_OP_FENCE`` request, proceeding as before) or ``finalize_op()``
      (if the operation is definitively failed or successful).

      * ``finalize_op()`` broadcasts the result to all peers.

Finally, all peers receive the broadcast result and call

* ``finalize_op()``, which sends the result to all local clients.
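
The branch taken on each reply can be pictured with a toy model. Everything
below is invented for illustration; the real decision in
``fenced_process_fencing_reply()`` handles retries, timeouts, and topology
levels in much more detail:

.. code-block:: c

   #include <stdio.h>
   #include <stdbool.h>

   enum reply_rc { RC_OK, RC_FAILED };

   /* Invented stand-in for the in-progress operation */
   struct fencing_op {
       const char *target;
       bool using_topology;
       bool more_devices_left;   /* more topology devices left to try */
   };

   static void finalize_op(struct fencing_op *op, enum reply_rc rc)
   {
       printf("broadcast %s result for %s\n",
              (rc == RC_OK) ? "successful" : "failed", op->target);
   }

   static void request_peer_fencing(struct fencing_op *op)
   {
       printf("sending another STONITH_OP_FENCE for %s\n", op->target);
   }

   static void process_reply(struct fencing_op *op, enum reply_rc rc)
   {
       if ((rc != RC_OK) && op->using_topology && op->more_devices_left) {
           request_peer_fencing(op); /* try the next device in the topology */
       } else {
           finalize_op(op, rc);      /* definitively failed or successful */
       }
   }

   int main(void)
   {
       struct fencing_op op = { "node1", true, true };

       process_reply(&op, RC_FAILED);  /* retries with the next device */
       op.more_devices_left = false;
       process_reply(&op, RC_FAILED);  /* out of options: finalize as failed */
       return 0;
   }
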
.. index::
   single: fence history

Fencing history
_______________

The fencer keeps a running history of all fencing operations. The bulk of the
relevant code is in ``fenced_history.c`` and ensures the history is
synchronized across all nodes even if a node leaves and rejoins the cluster.

In libstonithd, this information is represented by ``stonith_history_t`` and
is queryable by the ``stonith_api_operations_t:history()`` method. ``crm_mon``
and ``stonith_admin`` use this API to display the history.


.. index::
   single: scheduler
   single: pacemaker-schedulerd
   single: libpe_status
   single: libpe_rules
   single: libpacemaker

Scheduler
#########

``pacemaker-schedulerd`` is the Pacemaker daemon that runs the Pacemaker
scheduler for the controller, but "the scheduler" in general refers to related
library code in ``libpe_status`` and ``libpe_rules`` (``lib/pengine/*.c``),
and some of ``libpacemaker`` (``lib/pacemaker/pcmk_sched_*.c``).

The purpose of the scheduler is to take a CIB as input and generate a
transition graph (list of actions that need to be taken) as output.

The controller invokes the scheduler by contacting the scheduler daemon via
local IPC. Tools such as ``crm_simulate``, ``crm_mon``, and ``crm_resource``
can also invoke the scheduler, but do so by calling the library functions
directly. This allows them to run using a ``CIB_file`` without the cluster
needing to be active.

The main entry point for the scheduler code is
``lib/pacemaker/pcmk_sched_allocate.c:pcmk__schedule_actions()``. It sets
defaults and calls a series of functions to perform the scheduling. Some key
steps:

* ``unpack_cib()`` parses most of the CIB XML into data structures, and
  determines the current cluster status.
* ``apply_node_criteria()`` applies factors that make resources prefer certain
  nodes, such as shutdown locks, location constraints, and stickiness.
* ``pcmk__create_internal_constraints()`` creates internal constraints, such
  as the implicit ordering for group members, or start actions being
  implicitly ordered before promote actions.
* ``pcmk__handle_rsc_config_changes()`` processes resource history entries in
  the CIB status section. This is used to decide whether certain actions need
  to be done, such as deleting orphan resources, forcing a restart when a
  resource definition changes, etc.
* ``allocate_resources()`` assigns resources to nodes.
* ``schedule_resource_actions()`` schedules resource-specific actions (which
  might or might not end up in the final graph).
* ``pcmk__apply_orderings()`` processes ordering constraints in order to
  modify action attributes such as optional or required.
* ``pcmk__create_graph()`` creates the transition graph.

Challenges
__________

Working with the scheduler is difficult. Challenges include:

* It is far too much code to keep more than a small portion in your head at
  one time.
* Small changes can have large (and unexpected) effects. This is why we have a
  large number of regression tests (``cts/cts-scheduler``), which should be
  run after making code changes.
* It produces an insane amount of log messages at debug and trace levels.
  You can put resource ID(s) in the ``PCMK_trace_tags`` environment variable
  to enable trace-level messages only when related to specific resources.
* Different parts of the main ``pe_working_set_t`` structure are finalized at
  different points in the scheduling process, so you have to keep in mind
  whether information you're using at one point of the code can possibly
  change later. For example, data unpacked from the CIB can safely be used
  anytime after ``unpack_cib()``, but actions may become optional or required
  anytime before ``pcmk__create_graph()``. There's no easy way to deal with
  this.
* Many names of struct members, functions, etc., are suboptimal, but are part
  of the public API and cannot be changed until an API backward compatibility
  break.


.. index::
   single: pe_working_set_t

Cluster Working Set
___________________

The main data object for the scheduler is ``pe_working_set_t``, which contains
all information needed about nodes, resources, constraints, etc., both as the
raw CIB XML and parsed into more usable data structures, plus the resulting
transition graph XML. The variable name is usually ``data_set``.

.. index::
   single: pe_resource_t

Resources
_________

``pe_resource_t`` is the data object representing cluster resources. A
resource has a variant: primitive (a.k.a. native), group, clone, or bundle.

The resource object has members for two sets of methods,
``resource_object_functions_t`` from the ``libpe_status`` public API, and
``resource_alloc_functions_t`` whose implementation is internal to
``libpacemaker``. The actual functions vary by variant.

The object functions have basic capabilities such as unpacking the resource
XML, and determining the current or planned location of the resource.

The allocation functions have more obscure capabilities needed for scheduling,
such as processing location and ordering constraints. For example,
``pcmk__create_internal_constraints()`` simply calls the
``internal_constraints()`` method for each top-level resource in the cluster.

.. index::
   single: pe_node_t

Nodes
_____

Allocation of resources to nodes is done by choosing the node with the highest
score for a given resource. The scheduler does a bunch of processing to
generate the scores, then the actual allocation is straightforward.

Node lists are frequently used. For example, ``pe_working_set_t`` has a
``nodes`` member which is a list of all nodes in the cluster, and
``pe_resource_t`` has a ``running_on`` member which is a list of all nodes on
which the resource is (or might be) active. These are lists of ``pe_node_t``
objects.

The ``pe_node_t`` object contains a ``struct pe_node_shared_s *details``
member with all node information that is independent of resource allocation
(the node name, etc.).

The working set's ``nodes`` member contains the originals of this information.
All other node lists contain copies of ``pe_node_t`` where only the
``details`` member points to the originals in the working set's ``nodes``
list. In this way, the other members of ``pe_node_t`` (such as ``weight``,
which is the node score) may vary by node list, while the common details are
shared. This copy-with-shared-details pattern is sketched below.
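
A minimal standalone sketch of that pattern, with invented type names rather
than the real ``pe_node_t`` definition:

.. code-block:: c

   #include <stdio.h>

   /* Stand-in for struct pe_node_shared_s: per-node facts that never vary */
   struct node_details {
       const char *uname;
       int online;
   };

   /* Stand-in for pe_node_t: a per-list wrapper whose weight can differ
    * from list to list while the details are shared */
   struct node {
       int weight;                     /* node score, specific to this list */
       struct node_details *details;   /* shared with every other copy */
   };

   int main(void)
   {
       struct node_details shared = { "node1", 1 };

       /* The same node as seen from two different resources' allowed-node
        * lists: the scores differ, the details are the same object */
       struct node rsc1_view = { 100, &shared };
       struct node rsc2_view = { -1000000, &shared };  /* i.e. -INFINITY */

       printf("%s scores %d for rsc1, %d for rsc2\n",
              rsc1_view.details->uname, rsc1_view.weight, rsc2_view.weight);
       return 0;
   }
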
.. index::
   single: pe_action_t
   single: pe_action_flags

Actions
_______

``pe_action_t`` is the data object representing actions that might need to be
taken. These could be resource actions, cluster-wide actions such as fencing a
node, or "pseudo-actions" which are abstractions used as convenient points for
ordering other actions against.

It has a ``flags`` member which is a bitmask of ``enum pe_action_flags``. The
most important of these are ``pe_action_runnable`` (if not set, the action is
"blocked" and cannot be added to the transition graph) and
``pe_action_optional`` (actions with this set will not be added to the
transition graph; actions often start out as optional, and may become required
later).
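
How these bitmask flags behave can be shown in a few lines of standalone C.
The flag values and ``is_set()`` helper below are illustrative only; the real
code uses ``enum pe_action_flags`` and the ``pcmk_is_set()`` macro:

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>
   #include <stdio.h>

   /* Illustrative flag values; not the real enum pe_action_flags */
   enum action_flags {
       action_optional = (1 << 0),
       action_runnable = (1 << 1),
   };

   /* Simplified equivalent of the pcmk_is_set() macro */
   static bool is_set(uint64_t flags, uint64_t bit)
   {
       return (flags & bit) == bit;
   }

   int main(void)
   {
       /* An action typically starts out optional and runnable ... */
       uint64_t flags = action_optional | action_runnable;

       /* ... and may later become required, e.g. via an ordering */
       flags &= ~(uint64_t) action_optional;

       if (is_set(flags, action_runnable)
           && !is_set(flags, action_optional)) {
           printf("action goes into the transition graph\n");
       }
       return 0;
   }
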
.. index::
   single: pcmk__colocation_t

Colocations
___________

``pcmk__colocation_t`` is the data object representing colocations.

Colocation constraints come into play in these parts of the scheduler code:

* When sorting resources for assignment, so resources with the highest node
  score are assigned first (see ``cmp_resources()``)
* When updating node scores for resource assignment or promotion priority
* When assigning resources, so any resources to be colocated with can be
  assigned first, and so colocations affect where the resource is assigned
* When choosing roles for promotable clone instances, so colocations involving
  a specific role can affect which instances are promoted

The resource allocation functions have several methods related to colocations:

* ``apply_coloc_score()``: This applies a colocation's score to either the
  dependent's allowed node scores (if called while resources are being
  assigned) or the dependent's priority (if called while choosing promotable
  instance roles). It can behave differently depending on whether it is being
  called as the primary's method or as the dependent's method.
* ``add_colocated_node_scores()``: This updates a table of nodes for a given
  colocation attribute and score. It goes through colocations involving a
  given resource, and updates the scores of the nodes in the table with the
  best scores of nodes that match up according to the colocation criteria.
* ``colocated_resources()``: This generates a list of all resources involved
  in mandatory colocations (directly or indirectly via colocation chains) with
  a given resource.


.. index::
   single: pe__ordering_t
   single: pe_ordering

Orderings
_________

Ordering constraints are simple in concept, but they are one of the most
important, powerful, and difficult to follow aspects of the scheduler code.

``pe__ordering_t`` is the data object representing an ordering, better thought
of as a relationship between two actions, since the relation can be more
complex than just "this one runs after that one".

For an ordering "A then B", the code generally refers to A as "first" or
"before", and B as "then" or "after".

Much of the power comes from ``enum pe_ordering``, which are flags that
determine how an ordering behaves. There are many obscure flags with big
effects. A few examples:

* ``pe_order_none`` means the ordering is disabled and will be ignored. It's
  0, meaning no flags set, so it must be compared with equality rather than
  ``pcmk_is_set()``.
* ``pe_order_optional`` means the ordering does not make either action
  required, so it only applies if they both become required for other reasons.
* ``pe_order_implies_first`` means that if action B becomes required for any
  reason, then action A will become required as well.
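
The ``pe_order_none`` caveat is easy to get wrong, so here is a standalone
illustration. The flag values and ``is_set()`` helper are invented stand-ins
(the real code uses ``enum pe_ordering`` and ``pcmk_is_set()``):

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>
   #include <stdio.h>

   /* Illustrative values; not the real enum pe_ordering */
   enum order_flags {
       order_none     = 0,        /* no bits set: needs an equality test */
       order_optional = (1 << 0),
   };

   /* Simplified equivalent of the pcmk_is_set() macro */
   static bool is_set(uint64_t flags, uint64_t bit)
   {
       return (flags & bit) == bit;
   }

   int main(void)
   {
       uint64_t flags = order_none;

       /* is_set(flags, order_none) is useless: (x & 0) == 0 holds for any
        * x, so this branch is taken even for enabled orderings */
       if (is_set(flags, order_none)) {
           printf("always true, even for enabled orderings\n");
       }

       /* The correct test for a disabled ordering is plain equality */
       if (flags == order_none) {
           printf("ordering is disabled and will be ignored\n");
       }
       return 0;
   }
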