diff options
Diffstat (limited to 'doc/sphinx/Pacemaker_Administration/agents.rst')
-rw-r--r-- | doc/sphinx/Pacemaker_Administration/agents.rst | 443 |
1 files changed, 443 insertions, 0 deletions
diff --git a/doc/sphinx/Pacemaker_Administration/agents.rst b/doc/sphinx/Pacemaker_Administration/agents.rst new file mode 100644 index 0000000..e5b17e2 --- /dev/null +++ b/doc/sphinx/Pacemaker_Administration/agents.rst @@ -0,0 +1,443 @@ +.. index:: + single: resource agent + +Resource Agents +--------------- + + +Action Completion +################# + +If one resource depends on another resource via constraints, the cluster will +interpret an expected result as sufficient to continue with dependent actions. +This may cause timing issues if the resource agent start returns before the +service is not only launched but fully ready to perform its function, or if the +resource agent stop returns before the service has fully released all its +claims on system resources. At a minimum, the start or stop should not return +before a status command would return the expected (started or stopped) result. + + +.. index:: + single: OCF resource agent + single: resource agent; OCF + +OCF Resource Agents +################### + +.. index:: + single: OCF resource agent; location + +Location of Custom Scripts +__________________________ + +OCF Resource Agents are found in ``/usr/lib/ocf/resource.d/$PROVIDER`` + +When creating your own agents, you are encouraged to create a new directory +under ``/usr/lib/ocf/resource.d/`` so that they are not confused with (or +overwritten by) the agents shipped by existing providers. + +So, for example, if you choose the provider name of big-corp and want a new +resource named big-app, you would create a resource agent called +``/usr/lib/ocf/resource.d/big-corp/big-app`` and define a resource: + +.. code-block: xml + + <primitive id="custom-app" class="ocf" provider="big-corp" type="big-app"/> + + +.. index:: + single: OCF resource agent; action + +Actions +_______ + +All OCF resource agents are required to implement the following actions. + +.. table:: **Required Actions for OCF Agents** + + +--------------+-------------+------------------------------------------------+ + | Action | Description | Instructions | + +==============+=============+================================================+ + | start | Start the | .. index:: | + | | resource | single: OCF resource agent; start | + | | | single: start action | + | | | | + | | | Return 0 on success and an appropriate | + | | | error code otherwise. Must not report | + | | | success until the resource is fully | + | | | active. | + +--------------+-------------+------------------------------------------------+ + | stop | Stop the | .. index:: | + | | resource | single: OCF resource agent; stop | + | | | single: stop action | + | | | | + | | | Return 0 on success and an appropriate | + | | | error code otherwise. Must not report | + | | | success until the resource is fully | + | | | stopped. | + +--------------+-------------+------------------------------------------------+ + | monitor | Check the | .. index:: | + | | resource's | single: OCF resource agent; monitor | + | | state | single: monitor action | + | | | | + | | | Exit 0 if the resource is running, 7 | + | | | if it is stopped, and any other OCF | + | | | exit code if it is failed. NOTE: The | + | | | monitor script should test the state | + | | | of the resource on the local machine | + | | | only. | + +--------------+-------------+------------------------------------------------+ + | meta-data | Describe | .. index:: | + | | the | single: OCF resource agent; meta-data | + | | resource | single: meta-data action | + | | | | + | | | Provide information about this | + | | | resource in the XML format defined by | + | | | the OCF standard. Exit with 0. NOTE: | + | | | This is *not* required to be performed | + | | | as root. | + +--------------+-------------+------------------------------------------------+ + +OCF resource agents may optionally implement additional actions. Some are used +only with advanced resource types such as clones. + +.. table:: **Optional Actions for OCF Resource Agents** + + +--------------+-------------+------------------------------------------------+ + | Action | Description | Instructions | + +==============+=============+================================================+ + | validate-all | This should | .. index:: | + | | validate | single: OCF resource agent; validate-all | + | | the | single: validate-all action | + | | instance | | + | | parameters | Return 0 if parameters are valid, 2 if | + | | provided. | not valid, and 6 if resource is not | + | | | configured. | + +--------------+-------------+------------------------------------------------+ + | promote | Bring the | .. index:: | + | | local | single: OCF resource agent; promote | + | | instance of | single: promote action | + | | a promotable| | + | | clone | Return 0 on success | + | | resource to | | + | | the promoted| | + | | role. | | + +--------------+-------------+------------------------------------------------+ + | demote | Bring the | .. index:: | + | | local | single: OCF resource agent; demote | + | | instance of | single: demote action | + | | a promotable| | + | | clone | Return 0 on success | + | | resource to | | + | | the | | + | | unpromoted | | + | | role. | | + +--------------+-------------+------------------------------------------------+ + | notify | Used by the | .. index:: | + | | cluster to | single: OCF resource agent; notify | + | | send | single: notify action | + | | the agent | | + | | pre- and | Must not fail. Must exit with 0 | + | | post- | | + | | notification| | + | | events | | + | | telling the | | + | | resource | | + | | what has | | + | | happened and| | + | | will happen.| | + +--------------+-------------+------------------------------------------------+ + | reload | Reload the | .. index:: | + | | service's | single: OCF resource agent; reload | + | | own | single: reload action | + | | config. | | + | | | Not used by Pacemaker | + +--------------+-------------+------------------------------------------------+ + | reload-agent | Make | .. index:: | + | | effective | single: OCF resource agent; reload-agent | + | | any changes | single: reload-agent action | + | | in instance | | + | | parameters | This is used when the agent can handle a | + | | marked as | change in some of its parameters more | + | | reloadable | efficiently than stopping and starting the | + | | in the | resource. | + | | agent's | | + | | meta-data. | | + +--------------+-------------+------------------------------------------------+ + | recover | Restart the | .. index:: | + | | service. | single: OCF resource agent; recover | + | | | single: recover action | + | | | | + | | | Not used by Pacemaker | + +--------------+-------------+------------------------------------------------+ + +.. important:: + + If you create a new OCF resource agent, use `ocf-tester` to verify that the + agent complies with the OCF standard properly. + + +.. index:: + single: OCF resource agent; return code + +How are OCF Return Codes Interpreted? +_____________________________________ + +The first thing the cluster does is to check the return code against +the expected result. If the result does not match the expected value, +then the operation is considered to have failed, and recovery action is +initiated. + +There are three types of failure recovery: + +.. table:: **Types of recovery performed by the cluster** + + +-------+--------------------------------------------+--------------------------------------+ + | Type | Description | Action Taken by the Cluster | + +=======+============================================+======================================+ + | soft | .. index:: | Restart the resource or move it to a | + | | single: OCF resource agent; soft error | new location | + | | | | + | | A transient error occurred | | + +-------+--------------------------------------------+--------------------------------------+ + | hard | .. index:: | Move the resource elsewhere and | + | | single: OCF resource agent; hard error | prevent it from being retried on the | + | | | current node | + | | A non-transient error that | | + | | may be specific to the | | + | | current node | | + +-------+--------------------------------------------+--------------------------------------+ + | fatal | .. index:: | Stop the resource and prevent it | + | | single: OCF resource agent; fatal error | from being started on any cluster | + | | | node | + | | A non-transient error that | | + | | will be common to all | | + | | cluster nodes (e.g. a bad | | + | | configuration was specified) | | + +-------+--------------------------------------------+--------------------------------------+ + +.. _ocf_return_codes: + +OCF Return Codes +________________ + +The following table outlines the different OCF return codes and the type of +recovery the cluster will initiate when a failure code is received. Although +counterintuitive, even actions that return 0 (aka. ``OCF_SUCCESS``) can be +considered to have failed, if 0 was not the expected return value. + +.. table:: **OCF Exit Codes and their Recovery Types** + + +-------+-----------------------+---------------------------------------------------+----------+ + | Exit | OCF Alias | Description | Recovery | + | Code | | | | + +=======+=======================+===================================================+==========+ + | 0 | OCF_SUCCESS | .. index:: | soft | + | | | single: OCF_SUCCESS | | + | | | single: OCF return code; OCF_SUCCESS | | + | | | pair: OCF return code; 0 | | + | | | | | + | | | Success. The command completed successfully. | | + | | | This is the expected result for all start, | | + | | | stop, promote and demote commands. | | + +-------+-----------------------+---------------------------------------------------+----------+ + | 1 | OCF_ERR_GENERIC | .. index:: | soft | + | | | single: OCF_ERR_GENERIC | | + | | | single: OCF return code; OCF_ERR_GENERIC | | + | | | pair: OCF return code; 1 | | + | | | | | + | | | Generic "there was a problem" error code. | | + +-------+-----------------------+---------------------------------------------------+----------+ + | 2 | OCF_ERR_ARGS | .. index:: | hard | + | | | single: OCF_ERR_ARGS | | + | | | single: OCF return code; OCF_ERR_ARGS | | + | | | pair: OCF return code; 2 | | + | | | | | + | | | The resource's parameter values are not valid on | | + | | | this machine (for example, a value refers to a | | + | | | file not found on the local host). | | + +-------+-----------------------+---------------------------------------------------+----------+ + | 3 | OCF_ERR_UNIMPLEMENTED | .. index:: | hard | + | | | single: OCF_ERR_UNIMPLEMENTED | | + | | | single: OCF return code; OCF_ERR_UNIMPLEMENTED | | + | | | pair: OCF return code; 3 | | + | | | | | + | | | The requested action is not implemented. | | + +-------+-----------------------+---------------------------------------------------+----------+ + | 4 | OCF_ERR_PERM | .. index:: | hard | + | | | single: OCF_ERR_PERM | | + | | | single: OCF return code; OCF_ERR_PERM | | + | | | pair: OCF return code; 4 | | + | | | | | + | | | The resource agent does not have | | + | | | sufficient privileges to complete the task. | | + +-------+-----------------------+---------------------------------------------------+----------+ + | 5 | OCF_ERR_INSTALLED | .. index:: | hard | + | | | single: OCF_ERR_INSTALLED | | + | | | single: OCF return code; OCF_ERR_INSTALLED | | + | | | pair: OCF return code; 5 | | + | | | | | + | | | The tools required by the resource are | | + | | | not installed on this machine. | | + +-------+-----------------------+---------------------------------------------------+----------+ + | 6 | OCF_ERR_CONFIGURED | .. index:: | fatal | + | | | single: OCF_ERR_CONFIGURED | | + | | | single: OCF return code; OCF_ERR_CONFIGURED | | + | | | pair: OCF return code; 6 | | + | | | | | + | | | The resource's parameter values are inherently | | + | | | invalid (for example, a required parameter was | | + | | | not given). | | + +-------+-----------------------+---------------------------------------------------+----------+ + | 7 | OCF_NOT_RUNNING | .. index:: | N/A | + | | | single: OCF_NOT_RUNNING | | + | | | single: OCF return code; OCF_NOT_RUNNING | | + | | | pair: OCF return code; 7 | | + | | | | | + | | | The resource is safely stopped. This should only | | + | | | be returned by monitor actions, not stop actions. | | + +-------+-----------------------+---------------------------------------------------+----------+ + | 8 | OCF_RUNNING_PROMOTED | .. index:: | soft | + | | | single: OCF_RUNNING_PROMOTED | | + | | | single: OCF return code; OCF_RUNNING_PROMOTED | | + | | | pair: OCF return code; 8 | | + | | | | | + | | | The resource is running in the promoted role. | | + +-------+-----------------------+---------------------------------------------------+----------+ + | 9 | OCF_FAILED_PROMOTED | .. index:: | soft | + | | | single: OCF_FAILED_PROMOTED | | + | | | single: OCF return code; OCF_FAILED_PROMOTED | | + | | | pair: OCF return code; 9 | | + | | | | | + | | | The resource is (or might be) in the promoted | | + | | | role but has failed. The resource will be | | + | | | demoted, stopped and then started (and possibly | | + | | | promoted) again. | | + +-------+-----------------------+---------------------------------------------------+----------+ + | 190 | OCF_DEGRADED | .. index:: | none | + | | | single: OCF_DEGRADED | | + | | | single: OCF return code; OCF_DEGRADED | | + | | | pair: OCF return code; 190 | | + | | | | | + | | | The resource is properly active, but in such a | | + | | | condition that future failures are more likely. | | + +-------+-----------------------+---------------------------------------------------+----------+ + | 191 | OCF_DEGRADED_PROMOTED | .. index:: | none | + | | | single: OCF_DEGRADED_PROMOTED | | + | | | single: OCF return code; OCF_DEGRADED_PROMOTED | | + | | | pair: OCF return code; 191 | | + | | | | | + | | | The resource is properly active in the promoted | | + | | | role, but in such a condition that future | | + | | | failures are more likely. | | + +-------+-----------------------+---------------------------------------------------+----------+ + | other | *none* | Custom error code. | soft | + +-------+-----------------------+---------------------------------------------------+----------+ + +Exceptions to the recovery handling described above: + +* Probes (non-recurring monitor actions) that find a resource active + (or in the promoted role) will not result in recovery action unless it is + also found active elsewhere. +* The recovery action taken when a resource is found active more than + once is determined by the resource's ``multiple-active`` property. +* Recurring actions that return ``OCF_ERR_UNIMPLEMENTED`` + do not cause any type of recovery. +* Actions that return one of the "degraded" codes will be treated the same as + if they had returned success, but status output will indicate that the + resource is degraded. + + +.. index:: + single: resource agent; LSB + single: LSB resource agent + single: init script + +LSB Resource Agents (Init Scripts) +################################## + +LSB Compliance +______________ + +The relevant part of the +`LSB specifications <http://refspecs.linuxfoundation.org/lsb.shtml>`_ +includes a description of all the return codes listed here. + +Assuming `some_service` is configured correctly and currently +inactive, the following sequence will help you determine if it is +LSB-compatible: + +#. Start (stopped): + + .. code-block:: none + + # /etc/init.d/some_service start ; echo "result: $?" + + * Did the service start? + * Did the echo command print ``result: 0`` (in addition to the init script's + usual output)? + +#. Status (running): + + .. code-block:: none + + # /etc/init.d/some_service status ; echo "result: $?" + + * Did the script accept the command? + * Did the script indicate the service was running? + * Did the echo command print ``result: 0`` (in addition to the init script's + usual output)? + +#. Start (running): + + .. code-block:: none + + # /etc/init.d/some_service start ; echo "result: $?" + + * Is the service still running? + * Did the echo command print ``result: 0`` (in addition to the init + script's usual output)? + +#. Stop (running): + + .. code-block:: none + + # /etc/init.d/some_service stop ; echo "result: $?" + + * Was the service stopped? + * Did the echo command print ``result: 0`` (in addition to the init + script's usual output)? + +#. Status (stopped): + + .. code-block:: none + + # /etc/init.d/some_service status ; echo "result: $?" + + * Did the script accept the command? + * Did the script indicate the service was not running? + * Did the echo command print ``result: 3`` (in addition to the init + script's usual output)? + +#. Stop (stopped): + + .. code-block:: none + + # /etc/init.d/some_service stop ; echo "result: $?" + + * Is the service still stopped? + * Did the echo command print ``result: 0`` (in addition to the init + script's usual output)? + +#. Status (failed): + + This step is not readily testable and relies on manual inspection of the script. + + The script can use one of the error codes (other than 3) listed in the + LSB spec to indicate that it is active but failed. This tells the + cluster that before moving the resource to another node, it needs to + stop it on the existing one first. + +If the answer to any of the above questions is no, then the script is not +LSB-compliant. Your options are then to either fix the script or write an OCF +agent based on the existing script. |