Diffstat (limited to 'doc/dev-guides/ra-dev-guide.asc')
-rw-r--r-- | doc/dev-guides/ra-dev-guide.asc | 2072 |
1 files changed, 2072 insertions, 0 deletions
diff --git a/doc/dev-guides/ra-dev-guide.asc b/doc/dev-guides/ra-dev-guide.asc new file mode 100644 index 0000000..7a788b6 --- /dev/null +++ b/doc/dev-guides/ra-dev-guide.asc @@ -0,0 +1,2072 @@ += The OCF Resource Agent Developer's Guide + +== Introduction + +This document is to serve as a guide and reference for all developers, +maintainers, and contributors working on OCF (Open Cluster Framework) +compliant cluster resource agents. It explains the anatomy and general +functionality of a resource agent, illustrates the resource agent API, +and provides valuable hints and tips to resource agent authors. + +=== What is a resource agent? + +A resource agent is an executable that manages a cluster resource. No +formal definition of a cluster resource exists, other than "anything a +cluster manages is a resource." Cluster resources can be as diverse as +IP addresses, file systems, database services, and entire virtual +machines -- to name just a few examples. + +=== Who or what uses a resource agent? + +Any Open Cluster Framework (OCF) compliant cluster management +application is capable of managing resources using the resource agents +described in this document. At the time of writing, two OCF compliant +cluster management applications exist for the Linux platform: + +* _Pacemaker_, a cluster manager supporting both the Corosync and + Heartbeat cluster messaging frameworks. Pacemaker evolved out of the + Linux-HA project. +* _RGmanager_, the cluster manager bundled in Red Hat Cluster + Suite. It supports the Corosync cluster messaging framework + exclusively. + +=== Which language is a resource agent written in? + +An OCF compliant resource agent can be implemented in _any_ +programming language. The API is not language specific. However, most +resource agents are implemented as shell scripts, which is why this +guide primarily uses example code written in shell language. + +=== Is there a naming convention? + +Yes! 
We have agreed to the following convention for resource agent +names: Please name resource agents using lower case letters, with +words separated by dashes (+example-agent-name+). + +Existing agents may or may not follow this convention, but it is the +intention to make sure future agents follow this rule. + +== API definitions + +=== Environment variables + +A resource agent receives all configuration information about the +resource it manages via environment variables. The names of these +environment variables are always the name of the resource parameter, +prefixed with +OCF_RESKEY_+. For example, if the resource has an +ip+ +parameter set to +192.168.1.1+, then the resource agent will have +access to an environment variable +OCF_RESKEY_ip+ holding that value. + +For any resource parameter that is not required to be set by the user +-- that is, its parameter definition in the resource agent metadata +does not specify +required="true"+ -- the resource agent must + +* Provide a reasonable default. This should be advertised in the + metadata. By convention, the resource agent uses a variable named + +OCF_RESKEY_<parametername>_default+ that holds this default. +* Alternatively, cater correctly for the value being empty. + +In addition, the cluster manager may also support _meta_ resource +parameters. These do not apply directly to the resource configuration, +but rather specify _how_ the cluster resource manager is expected to manage +the resource. For example, the Pacemaker cluster manager uses the ++target-role+ meta parameter to specify whether the resource should be +started or stopped. + +Meta parameters are passed into the resource agent in the ++OCF_RESKEY_CRM_meta_+ namespace, with any hyphens converted to +underscores. Thus, the +target-role+ attribute maps to an environment +variable named +OCF_RESKEY_CRM_meta_target_role+. + +The <<_script_variables>> section contains other system environment +variables.
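As a rough illustration of the conventions above, the following shell sketch applies a default to an optional parameter and derives the environment variable name for a meta parameter. The +superfrobnicate+ parameter is borrowed from the example metadata in this guide; its default value of +false+ and the +meta_env_name+ helper are assumptions made for this example only and are not part of the OCF shell function library.

```shell
#!/bin/sh
# Sketch only: the default value and meta_env_name are illustrative
# assumptions, not part of any OCF library.

# Default for the optional "superfrobnicate" parameter, following the
# OCF_RESKEY_<parametername>_default naming convention.
OCF_RESKEY_superfrobnicate_default="false"

# Apply the default only if the cluster manager did not set the parameter.
: "${OCF_RESKEY_superfrobnicate=${OCF_RESKEY_superfrobnicate_default}}"

# Hypothetical helper: print the name of the environment variable that
# carries a given meta parameter (hyphens become underscores).
meta_env_name() {
    printf 'OCF_RESKEY_CRM_meta_%s\n' "$(printf '%s' "$1" | tr '-' '_')"
}

echo "superfrobnicate=${OCF_RESKEY_superfrobnicate}"
meta_env_name target-role
```

Using +=+ rather than +:=+ in the expansion leaves a parameter that was explicitly set to an empty string untouched, so an agent written this way still has to cater for empty values.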
+ +=== Actions + +Any resource agent must support one command-line argument which +specifies the action the resource agent is about to execute. The +following actions must be supported by any resource agent: + +* +start+ -- starts the resource. +* +stop+ -- shuts down the resource. +* +monitor+ -- queries the resource for its state. +* +meta-data+ -- dumps the resource agent metadata. + +In addition, resource agents may optionally support the following +actions: + +* +promote+ -- turns a resource into the +Master+ role (Master/Slave + resources only). +* +demote+ -- turns a resource into the +Slave+ role (Master/Slave + resources only). +* +migrate_to+ and +migrate_from+ -- implement live migration of + resources. +* +validate-all+ -- validates a resource's configuration. +* +usage+ or +help+ -- displays a usage message when the resource + agent is invoked from the command line, rather than by the cluster + manager. +* +notify+ -- inform resource about changes in state of other clones. +* +status+ -- historical (deprecated) synonym for +monitor+. + +=== Timeouts + +Action timeouts are enforced outside the resource agent proper. It is +the cluster manager's responsibility to monitor how long a resource +agent action has been running, and terminate it if it does not meet +its completion deadline. Thus, resource agents need not themselves +check for any timeout expiry. + +Resource agents can, however, _advise_ the user of sensible timeout +values (which, when correctly set, will be duly enforced by the +cluster manager). See <<_metadata,the following section>> for details +on how a resource agent advertises its suggested timeouts. + +=== Metadata + +Every resource agent must describe its own purpose and supported +parameters in a set of XML metadata. This metadata is used by cluster +management applications for on-line help, and resource agent man pages +are generated from it as well. 
The following is a fictitious set of +metadata from an imaginary resource agent: + +[source,xml] +-------------------------------------------------------------------------- +<?xml version="1.0"?> +<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd"> +<resource-agent name="foobar" version="0.1"> + <version>1.0</version> + <longdesc lang="en"> +This is a fictitious example resource agent written for the +OCF Resource Agent Developers Guide. + </longdesc> + <shortdesc lang="en">Example resource agent + for budding OCF RA developers</shortdesc> + <parameters> + <parameter name="eggs" unique="0" required="1"> + <longdesc lang="en"> + Number of eggs, an example numeric parameter + </longdesc> + <shortdesc lang="en">Number of eggs</shortdesc> + <content type="integer"/> + </parameter> + <parameter name="superfrobnicate" unique="0" required="0"> + <longdesc lang="en"> + Enable superfrobnication, an example boolean parameter + </longdesc> + <shortdesc lang="en">Enable superfrobnication</shortdesc> + <content type="boolean" default="false"/> + </parameter> + <parameter name="datadir" unique="0" required="1"> + <longdesc lang="en"> + Data directory, an example string parameter + </longdesc> + <shortdesc lang="en">Data directory</shortdesc> + <content type="string"/> + </parameter> + </parameters> + <actions> + <action name="start" timeout="20" /> + <action name="stop" timeout="20" /> + <action name="monitor" timeout="20" + interval="10" depth="0" /> + <action name="notify" timeout="20" /> + <action name="reload" timeout="20" /> + <action name="migrate_to" timeout="20" /> + <action name="migrate_from" timeout="20" /> + <action name="meta-data" timeout="5" /> + <action name="validate-all" timeout="20" /> + </actions> +</resource-agent> +-------------------------------------------------------------------------- + +The +resource-agent+ element, of which there must only be one per +resource agent, defines the resource agent +name+ and +version+. 
The ++version+ element specifies the OCF version standard the metadata complies +with. + +The +longdesc+ and +shortdesc+ elements in +resource-agent+ provide a +long and short description of the resource agent's +functionality. While +shortdesc+ is a one-line description of what +the resource agent does and is usually used in terse listings, ++longdesc+ should give a full-blown description of the resource agent +in as much detail as possible. + +The +parameters+ element describes the resource agent parameters, and +should hold any number of +parameter+ children -- one for each +parameter that the resource agent supports. + +Every +parameter+ should, like the +resource-agent+ as a whole, come +with a +shortdesc+ and a +longdesc+, and also a +content+ child that +describes the parameter's expected content. + +On the +content+ element, there may be four different attributes: + +* +type+ describes the parameter type (+string+, +integer+, or + +boolean+). If unset, +type+ defaults to +string+. + +* +required+ indicates whether setting the parameter is mandatory + (+required="true"+) or optional (+required="false"+). + +* For optional parameters, it is customary to provide a sensible + default via the +default+ attribute. + +* Finally, the +unique+ attribute (allowed values: +true+ or +false+) + indicates that a specific value must be unique across the cluster, + for this parameter of this particular resource type. For example, a + highly available floating IP address is declared +unique+ -- as that + one IP address should run only once throughout the cluster, avoiding + duplicates. + +The +actions+ list defines the actions that the resource agent +advertises as supported. + +Every +action+ should list its own +timeout+ value. This is a +hint to the user what _minimal_ timeout should be configured for the +action. 
This is meant to cater for the fact that some resources are +quick to start and stop (IP addresses or filesystems, for example), +while others may take several minutes to do so (such as databases). + +In addition, recurring actions (such as +monitor+) should also specify +a recommended minimum +interval+, which is the time between two +consecutive invocations of the same action. Like +timeout+, this value +does not constitute a default -- it is merely a hint for the user +which action interval to configure, at minimum. + +== Return codes + +For any invocation, resource agents must exit with a defined return +code that informs the caller of the outcome of the invoked +action. The return codes are explained in detail in the following +subsections. + +=== +OCF_SUCCESS+ (0) + +The action completed successfully. This is the expected return code +for any successful +start+, +stop+, +promote+, +demote+, ++migrate_from+, +migrate_to+, +meta-data+, +help+, and +usage+ action. + +For +monitor+ (and its deprecated alias, +status+), however, a +modified convention applies: + +* For primitive (stateless) resources, +OCF_SUCCESS+ from +monitor+ + means that the resource is running. Non-running and gracefully + shut-down resources must instead return +OCF_NOT_RUNNING+. + +* For master/slave (stateful) resources, +OCF_SUCCESS+ from +monitor+ + means that the resource is running _in Slave mode_. Resources + running in Master mode must instead return +OCF_RUNNING_MASTER+, and + gracefully shut-down resources must instead return + +OCF_NOT_RUNNING+. + +=== +OCF_ERR_GENERIC+ (1) + +The action returned a generic error. A resource agent should use this +exit code only when none of the more specific error codes, defined +below, accurately describes the problem. + +The cluster resource manager interprets this exit code as a _soft_ +error.
This means that unless specifically configured otherwise, the +resource manager will attempt to recover a resource which failed with ++OCF_ERR_GENERIC+ in-place -- usually by restarting the resource on +the same node. + +=== +OCF_ERR_ARGS+ (2) + +The resource’s configuration is not valid on this machine. E.g. it +refers to a location not found on the node. + +NOTE: The resource agent should not return this error when instructed +to perform an action that it does not support. Instead, under those +circumstances, it should return +OCF_ERR_UNIMPLEMENTED+. + +=== +OCF_ERR_UNIMPLEMENTED+ (3) + +The resource agent was instructed to execute an action that the agent +does not implement. + +Not all resource agent actions are mandatory. +promote+, +demote+, ++migrate_to+, +migrate_from+, and +notify+, are all optional actions +which the resource agent may or may not implement. When a non-stateful +resource agent is misconfigured as a master/slave resource, for +example, then the resource agent should alert the user about this +misconfiguration by returning +OCF_ERR_UNIMPLEMENTED+ on the +promote+ +and +demote+ actions. + +=== +OCF_ERR_PERM+ (4) + +The action failed due to insufficient permissions. This may be due to +the agent not being able to open a certain file, to listen on a +specific socket, to write to a directory, or similar. + +The cluster resource manager interprets this exit code as a _hard_ +error. This means that unless specifically configured otherwise, the +resource manager will attempt to recover a resource which failed with +this error by restarting the resource on a different node (where the +permission problem may not exist). + +=== +OCF_ERR_INSTALLED+ (5) + +The action failed because a required component is missing on the node +where the action was executed. This may be due to a required binary +not being executable, or a vital configuration file being unreadable. + +The cluster resource manager interprets this exit code as a _hard_ +error. 
This means that unless specifically configured otherwise, the +resource manager will attempt to recover a resource which failed with +this error by restarting the resource on a different node (where the +required files or binaries may be present). + +=== +OCF_ERR_CONFIGURED+ (6) + +The action failed because the user misconfigured the resource. For +example, the user may have configured an alphanumeric string for a +parameter that really should be an integer. + +The cluster resource manager interprets this exit code as a _fatal_ +error. Since this is a configuration error that is present +cluster-wide, it would make no sense to recover such a resource on a +different node, let alone in-place. When a resource fails with this +error, the cluster manager will attempt to shut down the resource, and +wait for administrator intervention. + +=== +OCF_NOT_RUNNING+ (7) + +The resource was found not to be running. This is an exit code that +may be returned by the +monitor+ action exclusively. Note that this +implies that the resource has either _gracefully_ shut down, or has +never been started. + +If the resource is not running due to an error condition, the ++monitor+ action should instead return one of the +OCF_ERR_+ exit +codes or +OCF_FAILED_MASTER+. + +=== +OCF_RUNNING_MASTER+ (8) + +The resource was found to be running in the +Master+ role. This +applies only to stateful (Master/Slave) resources, and only to +their +monitor+ action. + +Note that there is no specific exit code for "running in slave +mode". This is because there is no functional distinction between a +primitive resource running normally, and a stateful resource running +as a slave. The +monitor+ action of a stateful resource running +normally in the +Slave+ role should simply return +OCF_SUCCESS+. + +=== +OCF_FAILED_MASTER+ (9) + +The resource was found to have failed in the +Master+ role. This +applies only to stateful (Master/Slave) resources, and only to their ++monitor+ action.
+ +The cluster resource manager interprets this exit code as a _soft_ +error. This means that unless specifically configured otherwise, the +resource manager will attempt to recover a resource which failed with ++OCF_FAILED_MASTER+ in-place -- usually by demoting, stopping, +starting and then promoting the resource on the same node. + + +== Resource agent structure + +A typical (shell-based) resource agent contains standard structural +items, in the order listed in this section. This section describes the +expected behavior of a resource agent with respect to the various +actions it supports, using a fictitious resource agent named +foobar+ +as an example. + +=== Resource agent interpreter + +Any resource agent implemented as a script must specify its +interpreter using standard "shebang" (+#!+) header syntax. + +[source,bash] +-------------------------------------------------------------------------- +#!/bin/sh +-------------------------------------------------------------------------- + +If a resource agent is written in shell, specifying the generic shell +interpreter (+#!/bin/sh+) is generally preferred, though not +required. Resource agents declared as +/bin/sh+ compatible must not +use constructs native to a specific shell (such as, for example, ++${!variable}+ syntax native to +bash+). It is advisable to +occasionally run such resource agents through a sanitization utility +such as +checkbashisms+. + +It is considered a regression to introduce a patch that will make a +previously +sh+ compatible resource agent suitable only for +bash+, ++ksh+, or any other non-generic shell. It is, however, perfectly +acceptable for a new resource agent to explicitly define a specific +shell, such as +/bin/bash+, as its interpreter.
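One construct that commonly breaks +/bin/sh+ compatibility is indirect variable expansion. The sketch below shows one possible portable replacement for +${!variable}+ based on +eval+. The +get_param+ helper and the +/srv/foobar+ value are hypothetical and serve only to illustrate the idea; they are not taken from the OCF shell function library.

```shell
#!/bin/sh
# Sketch only: get_param is a hypothetical helper, not an OCF function.

# Print the value of OCF_RESKEY_<name> without relying on the
# bash-specific ${!variable} indirect expansion.
get_param() {
    # Accept only characters valid in a variable name, so that the
    # eval below cannot be abused to run arbitrary commands.
    case "$1" in
        ''|*[!A-Za-z0-9_]*) return 1 ;;
    esac
    # An unset parameter expands to an empty string.
    eval "printf '%s\n' \"\${OCF_RESKEY_$1-}\""
}

# Example value, as the cluster manager might export it (assumed).
OCF_RESKEY_datadir="/srv/foobar"
get_param datadir
```

Restricting the name before calling +eval+ is a deliberate choice here: the lookup stays portable while rejecting input that could otherwise be interpreted as shell code.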
+ +=== Author and license information + +The resource agent should contain a comment listing the resource agent +author(s) and/or copyright holder(s), and stating the license that +applies to the resource agent: + +[source,bash] +-------------------------------------------------------------------------- +# +# Resource Agent for managing foobar resources. +# +# License: GNU General Public License (GPL) +# (c) 2008-2010 John Doe, Jane Roe, +# and Linux-HA contributors +-------------------------------------------------------------------------- + +When a resource agent refers to a license for which multiple versions +exist, it is assumed that the current version applies. + +=== Initialization + +Any shell resource agent should source the +ocf-shellfuncs+ function +library. With the syntax below, this is done in terms of ++$OCF_FUNCTIONS_DIR+, which -- for testing purposes, and also for +generating documentation -- may be overridden from the command line. + +[source,bash] +-------------------------------------------------------------------------- +# Initialization: +: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat} +. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs +-------------------------------------------------------------------------- + +=== Functions implementing resource agent actions + +What follows next are the functions implementing the resource agent's +advertised actions. The individual actions are described in detail in +<<_resource_agent_actions>>. + +=== Execution block + +This is the part of the resource agent that actually executes when the +resource agent is invoked. 
It typically follows a fairly standard +structure: + +[source,bash] +-------------------------------------------------------------------------- +# Make sure meta-data and usage always succeed +case $__OCF_ACTION in +meta-data) foobar_meta_data + exit $OCF_SUCCESS + ;; +usage|help) foobar_usage + exit $OCF_SUCCESS + ;; +esac + +# Anything other than meta-data and usage must pass validation +foobar_validate_all || exit $? + +# Translate each action into the appropriate function call +case $__OCF_ACTION in +start) foobar_start;; +stop) foobar_stop;; +status|monitor) foobar_monitor;; +promote) foobar_promote;; +demote) foobar_demote;; +notify) foobar_notify;; +reload) ocf_log info "Reloading..." + foobar_start + ;; +validate-all) ;; +*) foobar_usage + exit $OCF_ERR_UNIMPLEMENTED + ;; +esac +rc=$? + +# The resource agent may optionally log a debug message +ocf_log debug "${OCF_RESOURCE_INSTANCE} $__OCF_ACTION returned $rc" +exit $rc +-------------------------------------------------------------------------- + + +== Resource agent actions + +Each action is typically implemented in a separate function or method +in the resource agent. By convention, these are usually named ++<agent>_<action>+, so the function implementing the +start+ action in ++foobar+ would be named +foobar_start()+. + +As a general rule, whenever the resource agent encounters an error +that it is not able to recover, it is permitted to immediately exit, +throw an exception, or otherwise cease execution. Examples for this +include configuration issues, missing binaries, permission problems, +etc. It is not necessary to pass these errors up the call stack. + +It is the cluster manager's responsibility to initiate the appropriate +recovery action based on the user's configuration. The resource agent +should not guess at said configuration. + +=== +start+ action + +When invoked with the +start+ action, the resource agent must start +the resource if it is not yet running. 
This means that the agent must +verify the resource's configuration, query its state, and then start +it only if it is not running. A common way of doing this would be to +invoke the +validate_all+ and +monitor+ functions first, as in the +following example: + +[source,bash] +-------------------------------------------------------------------------- +foobar_start() { + # exit immediately if configuration is not valid + foobar_validate_all || exit $? + + # if resource is already running, bail out early + if foobar_monitor; then + ocf_log info "Resource is already running" + return $OCF_SUCCESS + fi + + # actually start up the resource here (make sure to immediately + # exit with an $OCF_ERR_ error code if anything goes seriously + # wrong) + ... + + # After the resource has been started, check whether it started up + # correctly. If the resource starts asynchronously, the agent may + # spin on the monitor function here -- if the resource does not + # start up within the defined timeout, the cluster manager will + # consider the start action failed + while ! foobar_monitor; do + ocf_log debug "Resource has not started yet, waiting" + sleep 1 + done + + # only return $OCF_SUCCESS if _everything_ succeeded as expected + return $OCF_SUCCESS +} +-------------------------------------------------------------------------- + + +=== +stop+ action + +When invoked with the +stop+ action, the resource agent must stop the +resource, if it is running. This means that the agent must verify the +resource configuration, query its state, and then stop it only if it +is currently running. A common way of doing this would be to invoke +the +validate_all+ and +monitor+ functions first. It is important to +understand that +stop+ is a force operation -- the resource agent must +do everything in its power to shut down the resource, short of +rebooting the node or shutting it off.
Consider the following example: + +[source,bash] +-------------------------------------------------------------------------- +foobar_stop() { + local rc + + # exit immediately if configuration is not valid + foobar_validate_all || exit $? + + foobar_monitor + rc=$? + case "$rc" in + "$OCF_SUCCESS") + # Currently running. Normal, expected behavior. + ocf_log debug "Resource is currently running" + ;; + "$OCF_RUNNING_MASTER") + # Running as a Master. Need to demote before stopping. + ocf_log info "Resource is currently running as Master" + foobar_demote || \ + ocf_log warn "Demote failed, trying to stop anyway" + ;; + "$OCF_NOT_RUNNING") + # Currently not running. Nothing to do. + ocf_log info "Resource is already stopped" + return $OCF_SUCCESS + ;; + esac + + # actually shut down the resource here (make sure to immediately + # exit with an $OCF_ERR_ error code if anything goes seriously + # wrong) + ... + + # After the resource has been stopped, check whether it shut down + # correctly. If the resource stops asynchronously, the agent may + # spin on the monitor function here -- if the resource does not + # shut down within the defined timeout, the cluster manager will + # consider the stop action failed + while foobar_monitor; do + ocf_log debug "Resource has not stopped yet, waiting" + sleep 1 + done + + # only return $OCF_SUCCESS if _everything_ succeeded as expected + return $OCF_SUCCESS + +} +-------------------------------------------------------------------------- + +NOTE: The expected exit code for a successful stop operation is ++$OCF_SUCCESS+, _not_ +$OCF_NOT_RUNNING+. + +IMPORTANT: A failed stop operation is a potentially dangerous +situation which the cluster manager will almost invariably try to +resolve by means of node fencing. In other words, the cluster manager +will forcibly evict from the cluster a node on which a stop operation +has failed. 
While this measure serves ultimately to protect data, it +does cause disruption to applications and their users. Thus, a +resource agent should make sure that it exits with an error only if +all avenues for proper resource shutdown have been exhausted. + +=== +monitor+ action + +The +monitor+ action queries the current status of a resource. It must +discern between three different states: + +* resource is currently running (return +$OCF_SUCCESS+); +* resource has stopped gracefully (return +$OCF_NOT_RUNNING+); +* resource has run into a problem and must be considered failed + (return the appropriate +$OCF_ERR_+ code to indicate the nature of the + problem). + + +[source,bash] +-------------------------------------------------------------------------- +foobar_monitor() { + local rc + + # exit immediately if configuration is not valid + foobar_validate_all || exit $? + + ocf_run frobnicate --test + + # This example assumes the following exit code convention + # for frobnicate: + # 0: running, and fully caught up with master + # 1: gracefully stopped + # any other: error + case "$?" in + 0) + rc=$OCF_SUCCESS + ocf_log debug "Resource is running" + ;; + 1) + rc=$OCF_NOT_RUNNING + ocf_log debug "Resource is not running" + ;; + *) + ocf_log err "Resource has failed" + exit $OCF_ERR_GENERIC + esac + + return $rc +} +-------------------------------------------------------------------------- + +Stateful (master/slave) resource agents may use a more elaborate +monitoring scheme where they can provide "hints" to the cluster +manager identifying which instance is best suited to assume the ++Master+ role. <<_specifying_a_master_preference>> explains the +details. + +NOTE: The cluster manager may invoke the +monitor+ action for a +_probe_, which is a test whether the resource is currently +running. Normally, the monitor operation would behave exactly the same +during a probe and a "real" monitor action. 
If a specific resource +does require special treatment for probes, however, the +ocf_is_probe+ +convenience function is available in the OCF shell functions library +for that purpose. + +=== +validate-all+ action + +The +validate-all+ action tests for correct resource agent +configuration and a working environment. +validate-all+ should exit +with one of the following return codes: + +* +$OCF_SUCCESS+ -- all is well, the configuration is valid and + usable. +* +$OCF_ERR_CONFIGURED+ -- the user has misconfigured the resource. +* +$OCF_ERR_INSTALLED+ -- the resource has possibly been configured + correctly, but a vital component is missing on the node where + +validate-all+ is being executed. +* +$OCF_ERR_PERM+ -- the resource is configured correctly and is not + missing any required components, but is suffering from a permission + issue (such as not being able to create a necessary file). + ++validate-all+ is usually wrapped in a function that is not only +called when explicitly invoking the corresponding action, but also -- +as a sanity check -- from just about any other function. Therefore, +the resource agent author must keep in mind that the function may be +invoked during the +start+, +stop+, and +monitor+ operations, and also +during probes. + +Probes pose a separate challenge for validation. During a probe (when +the cluster manager may expect the resource _not_ to be running on the +node where the probe is executed), some required components may be +_expected_ to not be available on the affected node. For example, this +includes any shared data on storage devices not available for reading +during the probe. The +validate-all+ function may thus need to treat +probes specially, using the +ocf_is_probe+ convenience function: + +[source,bash] +-------------------------------------------------------------------------- +foobar_validate_all() { + # Test for configuration errors first + if ! ocf_is_decimal $OCF_RESKEY_eggs; then + ocf_log err "eggs is not numeric!" 
+ exit $OCF_ERR_CONFIGURED + fi + + # Test for required binaries + check_binary frobnicate + + # Check for data directory (this may be on shared storage, so + # disable this test during probes) + if ! ocf_is_probe; then + if ! [ -d "$OCF_RESKEY_datadir" ]; then + ocf_log err "$OCF_RESKEY_datadir does not exist or is not a directory!" + exit $OCF_ERR_INSTALLED + fi + fi + + return $OCF_SUCCESS +} +-------------------------------------------------------------------------- + +=== +meta-data+ action + +The +meta-data+ action dumps the resource agent metadata to standard +output. The output must follow the metadata format as specified in +<<_metadata>>. + +[source,bash] +-------------------------------------------------------------------------- +foobar_meta_data() { + cat <<EOF +<?xml version="1.0"?> +<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd"> +<resource-agent name="foobar" version="0.1"> + <version>1.0</version> + <longdesc lang="en"> +... +EOF +} +-------------------------------------------------------------------------- + +=== +promote+ action + +The +promote+ action is optional. It must only be supported by +_stateful_ resource agents, which means agents that discern between +two distinct _roles_: +Master+ and +Slave+. +Slave+ is functionally +identical to the +Started+ state in a stateless resource agent. Thus, +while a regular (stateless) resource agent only needs to implement ++start+ and +stop+, a stateful resource agent must also support the ++promote+ action to be able to make a transition between the +Started+ +(+Slave+) and +Master+ roles. + +[source,bash] +-------------------------------------------------------------------------- +foobar_promote() { + local rc + + # exit immediately if configuration is not valid + foobar_validate_all || exit $? + + # test the resource's current state + foobar_monitor + rc=$? + case "$rc" in + "$OCF_SUCCESS") + # Running as slave. Normal, expected behavior.
+ ocf_log debug "Resource is currently running as Slave" + ;; + "$OCF_RUNNING_MASTER") + # Already a master. Unexpected, but not a problem. + ocf_log info "Resource is already running as Master" + return $OCF_SUCCESS + ;; + "$OCF_NOT_RUNNING") + # Currently not running. Need to start before promoting. + ocf_log info "Resource is currently not running" + foobar_start + ;; + *) + # Failed resource. Let the cluster manager recover. + ocf_log err "Unexpected error, cannot promote" + exit $rc + ;; + esac + + # actually promote the resource here (make sure to immediately + # exit with an $OCF_ERR_ error code if anything goes seriously + # wrong) + ocf_run frobnicate --master-mode || exit $OCF_ERR_GENERIC + + # After the resource has been promoted, check whether the + # promotion worked. If the resource promotion is asynchronous, the + # agent may spin on the monitor function here -- if the resource + # does not assume the Master role within the defined timeout, the + # cluster manager will consider the promote action failed. + while true; do + foobar_monitor + if [ $? -eq $OCF_RUNNING_MASTER ]; then + ocf_log debug "Resource promoted" + break + else + ocf_log debug "Resource still awaiting promotion" + sleep 1 + fi + done + + # only return $OCF_SUCCESS if _everything_ succeeded as expected + return $OCF_SUCCESS +} +-------------------------------------------------------------------------- + +=== +demote+ action + +The +demote+ action is optional. It must only be supported by +_stateful_ resource agents, which means agents that discern between +two distinct _roles_: +Master+ and +Slave+. +Slave+ is functionally +identical to the +Started+ state in a stateless resource agent. Thus, +while a regular (stateless) resource agent only needs to implement ++start+ and +stop+, a stateful resource agent must also support the ++demote+ action to be able to make a transition between the +Master+ +and +Started+ (+Slave+) roles.
+ +[source,bash] +-------------------------------------------------------------------------- +foobar_demote() { + local rc + + # exit immediately if configuration is not valid + foobar_validate_all || exit $? + + # test the resource's current state + foobar_monitor + rc=$? + case "$rc" in + "$OCF_RUNNING_MASTER") + # Running as master. Normal, expected behavior. + ocf_log debug "Resource is currently running as Master" + ;; + "$OCF_SUCCESS") + # Already running as slave. Nothing to do. + ocf_log debug "Resource is currently running as Slave" + return $OCF_SUCCESS + ;; + "$OCF_NOT_RUNNING") + # Currently not running. Getting a demote action + # in this state is unexpected. Exit with an error + # and let the cluster manager recover. + ocf_log err "Resource is currently not running" + exit $OCF_ERR_GENERIC + ;; + *) + # Failed resource. Let the cluster manager recover. + ocf_log err "Unexpected error, cannot demote" + exit $rc + ;; + esac + + # actually demote the resource here (make sure to immediately + # exit with an $OCF_ERR_ error code if anything goes seriously + # wrong) + ocf_run frobnicate --unset-master-mode || exit $OCF_ERR_GENERIC + + # After the resource has been demoted, check whether the + # demotion worked. If the resource demotion is asynchronous, the + # agent may spin on the monitor function here -- if the resource + # does not assume the Slave role within the defined timeout, the + # cluster manager will consider the demote action failed. + while true; do + foobar_monitor + if [ $? -eq $OCF_RUNNING_MASTER ]; then + ocf_log debug "Resource still demoting" + sleep 1 + else + ocf_log debug "Resource demoted" + break + fi + done + + # only return $OCF_SUCCESS if _everything_ succeeded as expected + return $OCF_SUCCESS +} +-------------------------------------------------------------------------- + +=== +migrate_to+ action + +The +migrate_to+ action can serve one of two purposes: + +* Initiate a native _push_ type migration for the resource. 
In other + words, instruct the resource to move _to_ a specific node from the + node it is currently running on. The resource agent knows about its + destination node via the +$OCF_RESKEY_CRM_meta_migrate_target+ environment + variable. + +* Freeze the resource in a _freeze/thaw_ (also known as + _suspend/resume_) type migration. In this mode, the resource does + not need any information about its destination node at this point. + +The example below illustrates a push type migration: + +[source,bash] +-------------------------------------------------------------------------- +foobar_migrate_to() { + # exit immediately if configuration is not valid + foobar_validate_all || exit $? + + # if resource is not running, bail out early + if ! foobar_monitor; then + ocf_log err "Resource is not running" + exit $OCF_ERR_GENERIC + fi + + # actually migrate the resource here (make sure to immediately + # exit with an $OCF_ERR_ error code if anything goes seriously + # wrong) + ocf_run frobnicate --migrate \ + --dest=$OCF_RESKEY_CRM_meta_migrate_target \ + || exit $OCF_ERR_GENERIC + ... + + # only return $OCF_SUCCESS if _everything_ succeeded as expected + return $OCF_SUCCESS +} +-------------------------------------------------------------------------- + +In contrast, a freeze/thaw type migration may implement its freeze +operation like this: + +[source,bash] +-------------------------------------------------------------------------- +foobar_migrate_to() { + # exit immediately if configuration is not valid + foobar_validate_all || exit $? + + # if resource is not running, bail out early + if ! foobar_monitor; then + ocf_log err "Resource is not running" + exit $OCF_ERR_GENERIC + fi + + # actually freeze the resource here (make sure to immediately + # exit with an $OCF_ERR_ error code if anything goes seriously + # wrong) + ocf_run frobnicate --freeze || exit $OCF_ERR_GENERIC + ... 
+ + # only return $OCF_SUCCESS if _everything_ succeeded as expected + return $OCF_SUCCESS +} +-------------------------------------------------------------------------- + + +=== +migrate_from+ action + +The +migrate_from+ action can serve one of two purposes: + +* Complete a native _push_ type migration for the resource. In other + words, check whether the migration has succeeded properly, and the + resource is running on the local node. The resource agent knows + about the migration source via the + +$OCF_RESKEY_CRM_meta_migrate_source+ environment variable. + +* Thaw the resource in a _freeze/thaw_ (also known as + _suspend/resume_) type migration. In this mode, the resource usually + does not need any information about its source node at this point. + +The example below illustrates a push type migration: + +[source,bash] +-------------------------------------------------------------------------- +foobar_migrate_from() { + # exit immediately if configuration is not valid + foobar_validate_all || exit $? + + # After the resource has been migrated, check whether it resumed + # correctly. If the resource starts asynchronously, the agent may + # spin on the monitor function here -- if the resource does not + # run within the defined timeout, the cluster manager will + # consider the migrate_from action failed + while ! foobar_monitor; do + ocf_log debug "Resource has not yet migrated, waiting" + sleep 1 + done + + # only return $OCF_SUCCESS if _everything_ succeeded as expected + return $OCF_SUCCESS +} +-------------------------------------------------------------------------- + +In contrast, a freeze/thaw type migration may implement its thaw +operation like this: + +[source,bash] +-------------------------------------------------------------------------- +foobar_migrate_from() { + # exit immediately if configuration is not valid + foobar_validate_all || exit $? 
+ + # actually thaw the resource here (make sure to immediately + # exit with an $OCF_ERR_ error code if anything goes seriously + # wrong) + ocf_run frobnicate --thaw || exit $OCF_ERR_GENERIC + + # After the resource has been migrated, check whether it resumed + # correctly. If the resource starts asynchronously, the agent may + # spin on the monitor function here -- if the resource does not + # run within the defined timeout, the cluster manager will + # consider the migrate_from action failed + while ! foobar_monitor; do + ocf_log debug "Resource has not yet migrated, waiting" + sleep 1 + done + + # only return $OCF_SUCCESS if _everything_ succeeded as expected + return $OCF_SUCCESS +} +-------------------------------------------------------------------------- + + +=== +notify+ action + +With notifications, instances of clones (and of master/slave +resources, which are an extended kind of clones) can inform each other +about their state. When notifications are enabled, certain actions on +any instance of a clone carry a +pre+ and +post+ notification. + +List of actions that trigger notifications: + +* start +* stop +* promote +* demote + +The cluster manager invokes the +notify+ operation on _all_ clone +instances. For +notify+ operations, additional environment variables +are passed into the resource agent during execution: + +* +$OCF_RESKEY_CRM_meta_notify_type+ -- the notification type (+pre+ + or +post+) + +* +$OCF_RESKEY_CRM_meta_notify_operation+ -- the operation (action) + that the notification is about (+start+, +stop+, +promote+, +demote+ + etc.) 
+ +* +$OCF_RESKEY_CRM_meta_notify_start_uname+ -- node name of the node + where the resource is being started (+start+ notifications only) + +* +$OCF_RESKEY_CRM_meta_notify_stop_uname+ -- node name of the node + where the resource is being stopped (+stop+ notifications only) + +* +$OCF_RESKEY_CRM_meta_notify_master_uname+ -- node name of the node + where the resource currently _is in_ the Master role + +* +$OCF_RESKEY_CRM_meta_notify_promote_uname+ -- node name of the node + where the resource currently _is being promoted to_ the Master role + (+promote+ notifications only) + +* +$OCF_RESKEY_CRM_meta_notify_demote_uname+ -- node name of the node + where the resource currently _is being demoted to_ the Slave role + (+demote+ notifications only) + +Notifications come in particularly handy for master/slave resources +using a "pull" scheme, where the master is a publisher and the slave a +subscriber. Since the master is obviously only available as such when +a promotion has occurred, the slaves can use a "pre-promote" +notification to configure themselves to subscribe to the right +publisher. + +Likewise, the subscribers may want to unsubscribe from the publisher +after it has relinquished its master status, and a "post-demote" +notification can be used for that purpose. + +Consider the example below to illustrate the concept. + +[source,bash] +-------------------------------------------------------------------------- +foobar_notify() { + local type_op + type_op="${OCF_RESKEY_CRM_meta_notify_type}-${OCF_RESKEY_CRM_meta_notify_operation}" + + ocf_log debug "Received $type_op notification." 
+ case "$type_op" in + 'pre-promote') + ocf_run frobnicate --slave-mode \ + --master=$OCF_RESKEY_CRM_meta_notify_promote_uname \ + || exit $OCF_ERR_GENERIC + ;; + 'post-demote') + ocf_run frobnicate --unset-slave-mode || exit $OCF_ERR_GENERIC + ;; + esac + + return $OCF_SUCCESS +} +-------------------------------------------------------------------------- + +NOTE: A master/slave resource agent may support a _multi-master_ +configuration, where there is possibly more than one master at any +given time. If that is the case, then the ++$OCF_RESKEY_CRM_meta_notify_*_uname+ variables may each contain a +space-separated list of host names, rather than a single host name as +shown in the example. Under those circumstances the resource agent +would have to properly iterate over this list. + +== Script variables + +This section outlines variables typically available to resource agents, +primarily for convenience purposes. For additional variables +available while the agent is being executed, refer to +<<_environment_variables>> and <<_return_codes>>. + +=== +$OCF_RA_VERSION_MAJOR+ + +The major version number of the resource agent API that the cluster +manager is currently using. + +=== +$OCF_RA_VERSION_MINOR+ + +The minor version number of the resource agent API that the cluster +manager is currently using. + +=== +$OCF_ROOT+ + +The root of the OCF resource agent hierarchy. This should never be +changed by a resource agent. This is usually +/usr/lib/ocf+. + +=== +$OCF_FUNCTIONS_DIR+ + +The directory where the resource agent shell function library, ++ocf-shellfuncs+, resides. This is usually defined in terms of ++$OCF_ROOT+ and should never be changed by a resource agent. This +variable may, however, be overridden from the command line while +testing a new or modified resource agent. + +=== +$OCF_EXIT_REASON_PREFIX+ + +Used as a prefix when printing error messages from the resource agent. 
+Script functions use this automatically, so no explicit use is required +for shell-based scripts. + +=== +$OCF_RESOURCE_INSTANCE+ + +The resource instance name. For primitive (non-clone, non-stateful) +resources, this is simply the resource name. For clones and stateful +resources, this is the primitive name, followed by a colon and the +clone instance number (such as +p_foobar:0+). + +=== +$OCF_RESOURCE_TYPE+ + +The resource type of the current resource, e.g. IPaddr2. + +=== +$OCF_RESOURCE_PROVIDER+ + +The resource provider, e.g. heartbeat. This may not be in all cluster +managers of Resource Agent API version 1.0. + +=== +$__OCF_ACTION+ + +The currently invoked action. This is exactly the first command-line +argument that the cluster manager specifies when it invokes the +resource agent. + +=== +$__SCRIPT_NAME+ + +The name of the resource agent. This is exactly the base name of the +resource agent script, with leading directory names removed. + +=== +$HA_RSCTMP+ + +A temporary directory for use by resource agents. The system startup +sequence (on any LSB compliant Linux distribution) guarantees that +this directory is emptied on system startup, so this directory will +not contain any stale data after a node reboot. + +== Convenience functions + +=== Logging: +ocf_log+ + +Resource agents should use the +ocf_log+ function for logging +purposes. This convenient logging wrapper is invoked as follows: + +[source,bash] +-------------------------------------------------------------------------- +ocf_log <severity> "Log message" +-------------------------------------------------------------------------- + +It supports the following severity levels: + +* +debug+ -- for debugging messages. Most logging configurations + suppress this level by default. +* +info+ -- for informational messages about the agent's behavior or + status. +* +warn+ -- for warnings. 
This is for any messages which reflect + unexpected behavior that does _not_ constitute an unrecoverable + error. +* +err+ -- for errors. As a general rule, this logging level should + only be used immediately prior to an +exit+ with the appropriate + error code. +* +crit+ -- for critical errors. As with +err+, this logging level + should not be used unless the resource agent also exits with an + error code. Very rarely used. + +=== Testing for binaries: +have_binary+ and +check_binary+ + +A resource agent may need to test for the availability of a specific +executable. The +have_binary+ convenience function comes in handy +here: + +[source,bash] +-------------------------------------------------------------------------- +if ! have_binary frobnicate; then + ocf_log warn "Missing frobnicate binary, frobnication disabled!" +fi +-------------------------------------------------------------------------- + +If a missing binary is a fatal problem for the resource, then the ++check_binary+ function should be used: + +[source,bash] +-------------------------------------------------------------------------- +check_binary frobnicate +-------------------------------------------------------------------------- + +Using +check_binary+ is a shorthand method for testing for the +existence (and executability) of the specified binary, and exiting +with +$OCF_ERR_INSTALLED+ if it cannot be found or executed. + +NOTE: Both +have_binary+ and +check_binary+ honor +$PATH+ when the +binary to test for is not specified as a full path. It is usually wise +to _not_ test for a full path, as binary installation paths may vary +by distribution or user policy. 
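The difference between the two helpers can be sketched with self-contained stand-ins. The names +have_binary_sketch+, +check_binary_sketch+ and +frobnication_mode+ below are hypothetical and invented for illustration; they are not the +ocf-shellfuncs+ implementations, and they assume that a plain +command -v+ lookup is an adequate approximation. A real agent would source the library and call +have_binary+ and +check_binary+ directly.

```shell
# Illustrative stand-ins only; a real agent sources ocf-shellfuncs instead.
OCF_ERR_INSTALLED=5   # OCF return code for a missing installation component

# Return 0 if the named command can be found via $PATH, nonzero otherwise.
# Assumption: command -v is close enough for a sketch (it also matches
# shell builtins and functions, which the real helper may treat differently).
have_binary_sketch() {
    command -v "$1" >/dev/null 2>&1
}

# Exit the calling shell with $OCF_ERR_INSTALLED if the command is missing.
check_binary_sketch() {
    if ! have_binary_sketch "$1"; then
        echo "ERROR: required binary $1 not found" >&2
        exit "$OCF_ERR_INSTALLED"
    fi
}

# Optional dependency: report whether a feature can be offered at all.
frobnication_mode() {
    if have_binary_sketch "$1"; then
        echo "enabled"
    else
        echo "disabled"
    fi
}
```

The design point is that the "have" variant only reports and leaves the decision to the caller, while the "check" variant terminates the agent, so it belongs in validation code for binaries the resource cannot work without.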
+ +=== Executing commands and capturing their output: +ocf_run+ + +Whenever a resource agent needs to execute a command and capture its +output, it should use the +ocf_run+ convenience function, invoked as +in this example: + +[source,bash] +-------------------------------------------------------------------------- +ocf_run frobnicate --spam=eggs || exit $OCF_ERR_GENERIC +-------------------------------------------------------------------------- + +With the command specified above, the resource agent will invoke ++frobnicate --spam=eggs+ and capture its output and +exit code. If the exit code is nonzero (indicating an error), ++ocf_run+ logs the command output with the +err+ logging severity, and +the resource agent subsequently exits. If the exit code is zero +(indicating success), any command output will be logged with the +info+ +logging severity. + +If the resource agent wishes to ignore the output of a successful +command execution, it can use the +-q+ flag with +ocf_run+. In the +example below, +ocf_run+ will only log output if the command exit code +is nonzero. + +[source,bash] +-------------------------------------------------------------------------- +ocf_run -q frobnicate --spam=eggs || exit $OCF_ERR_GENERIC +-------------------------------------------------------------------------- + +Finally, if the resource agent wants to log the output of a command +with a nonzero exit code with a severity _other_ than error, it may do +so by adding the +-info+ or +-warn+ option to +ocf_run+: + +[source,bash] +-------------------------------------------------------------------------- +ocf_run -warn frobnicate --spam=eggs +-------------------------------------------------------------------------- + +=== Locks: +ocf_take_lock+ and +ocf_release_lock_on_exit+ + +Occasionally, there may be different resources of the same type in a +cluster configuration that should not execute actions in +parallel. 
When a resource agent needs to guard against parallel +execution on the same machine, it can use the +ocf_take_lock+ and ++ocf_release_lock_on_exit+ convenience functions: + +[source,bash] +-------------------------------------------------------------------------- +LOCKFILE=${HA_RSCTMP}/foobar +ocf_release_lock_on_exit $LOCKFILE + +foobar_start() { + ... + ocf_take_lock $LOCKFILE + ... +} +-------------------------------------------------------------------------- + ++ocf_take_lock+ attempts to acquire the designated +$LOCKFILE+. When +it is unavailable, it sleeps a random amount of time between 0 and 1 +seconds, and retries. +ocf_release_lock_on_exit+ releases the lock +file when the agent exits (for any reason). + +=== Testing for numerical values: +ocf_is_decimal+ + +Specifically for parameter validation, it can be helpful to test +whether a given value is numeric. The +ocf_is_decimal+ function exists +for that purpose: + +[source,bash] +-------------------------------------------------------------------------- +foobar_validate_all() { + if ! ocf_is_decimal "$OCF_RESKEY_eggs"; then + ocf_log err "eggs is not numeric!" + exit $OCF_ERR_CONFIGURED + fi + ... +} +-------------------------------------------------------------------------- + +=== Testing for boolean values: +ocf_is_true+ + +When a resource agent defines a boolean parameter, the value +for this parameter may be specified by the user as +0+/+1+, ++true+/+false+, or +on+/+off+. Since it is tedious to test for all +these values from within the resource agent, the agent should instead +use the +ocf_is_true+ convenience function: + +[source,bash] +-------------------------------------------------------------------------- +if ocf_is_true $OCF_RESKEY_superfrobnicate; then + ocf_run frobnicate --super +fi +-------------------------------------------------------------------------- + +NOTE: If +ocf_is_true+ is used against an empty or nonexistent +variable, it always returns an exit code of +1+, which is equivalent +to +false+. 
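The behavior described in the note above can be tried without a cluster. In the sketch below, +is_true_sketch+ is a hypothetical stand-in that accepts only the three "true" spellings named in this section; the real +ocf_is_true+ from +ocf-shellfuncs+ may accept further spellings. +frobnicate_args+ is likewise an invented helper, not part of any library.

```shell
# Hypothetical stand-in for ocf_is_true, limited to the spellings
# mentioned in the text (1, true, on). Everything else, including an
# empty or missing argument, counts as false.
is_true_sketch() {
    case "$1" in
        1|true|on) return 0 ;;
        *)         return 1 ;;
    esac
}

# Print the extra command-line argument for a given parameter value;
# prints an empty line when the value is not one of the "true" spellings.
frobnicate_args() {
    if is_true_sketch "$1"; then
        echo "--super"
    else
        echo ""
    fi
}
```

Calling +frobnicate_args ""+ prints an empty line, so under these assumptions an empty boolean parameter behaves exactly like an explicit +off+.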
+ +=== Version comparison: +ocf_version_cmp+ + +A resource agent may want to check the version of software +installed. +ocf_version_cmp+ takes care of all the necessary +details. + +The return codes are + +* +0+ -- the first version is smaller (earlier) than the second +* +1+ -- the two versions are equal +* +2+ -- the first version is greater (later) than the second +* +3+ -- one of the arguments is not recognized as a version string + +The versions are allowed to contain digits, dots, and dashes. + +[source,bash] +-------------------------------------------------------------------------- +local v=`gooey --version` +ocf_version_cmp "$v" 12.0.8-1 +case $? in + 0) ocf_log err "we do not support version $v, it is too old" + exit $OCF_ERR_INSTALLED + ;; + [12]) ;; # we can work with versions >= 12.0.8-1 + 3) ocf_log err "gooey produced version <$v>, too funky for me" + exit $OCF_ERR_INSTALLED + ;; +esac +-------------------------------------------------------------------------- + +=== Pseudo resources: +ha_pseudo_resource+ + +"Pseudo resources" are those where the resource agent in fact does not +actually start or stop something akin to a runnable process, but +merely executes a single action and then needs some form of tracking +whether that action has been executed or not. The +portblock+ resource +agent is an example of this. + +Resource agents for pseudo resources can use a convenience function, ++ha_pseudo_resource+, which makes use of _tracking files_ to keep tabs +on the status of a resource. If +foobar+ was designed to manage a +pseudo resource, then its +start+ action could look like this: + +[source,bash] +-------------------------------------------------------------------------- +foobar_start() { + # exit immediately if configuration is not valid + foobar_validate_all || exit $? 
+ + # if resource is already running, bail out early + if foobar_monitor; then + ocf_log info "Resource is already running" + return $OCF_SUCCESS + fi + + # start the pseudo resource + ha_pseudo_resource ${OCF_RESOURCE_INSTANCE} start + + # After the resource has been started, check whether it started up + # correctly. If the resource starts asynchronously, the agent may + # spin on the monitor function here -- if the resource does not + # start up within the defined timeout, the cluster manager will + # consider the start action failed + while ! foobar_monitor; do + ocf_log debug "Resource has not started yet, waiting" + sleep 1 + done + + # only return $OCF_SUCCESS if _everything_ succeeded as expected + return $OCF_SUCCESS +} +-------------------------------------------------------------------------- + + +== Conventions + +This section contains a collection of conventions that have emerged in +the resource agent repositories over the years. Following these +conventions is by no means mandatory for resource agent authors, but +it is a good idea based on the +http://en.wikipedia.org/wiki/Principle_of_least_surprise[Principle of +Least Surprise] -- resource agents following these conventions will be +easier to understand, review, and use than those that do not. + +=== Well-known parameter names + +Several parameter names are supported by a number of resource +agents. 
For new resource agents, following these examples is generally +a good idea: + +* +binary+ -- the name of a binary that principally manages the + resource, such as a server daemon +* +config+ -- the full path to a configuration file +* +pid+ -- the full path to a file holding a process ID (PID) +* +log+ -- the full path to a log file +* +socket+ -- the full path to a UNIX socket that the resource manages +* +ip+ -- an IP address that a daemon binds to +* +port+ -- a TCP or UDP port that a daemon binds to + +Needless to say, resource agents should only implement any of these +parameters if they are sensible to use in the agent's context. + +=== Parameter defaults + +Defaults for resource agent parameters should be set by initializing +variables with the suffix +_default+: + +[source,bash] +-------------------------------------------------------------------------- +# Defaults +OCF_RESKEY_superfrobnicate_default=0 + +: ${OCF_RESKEY_superfrobnicate=${OCF_RESKEY_superfrobnicate_default}} +-------------------------------------------------------------------------- + +NOTE: The resource agent should make sure that it sets a default for +any parameter not marked as +required+ in the metadata. + + +=== Honoring +PATH+ for binaries + +When a resource agent supports a parameter designed to hold the name +of a binary (such as a daemon, or a client utility for querying +status), then that parameter should honor the +PATH+ environment +variable. Do not supply full paths. 
Thus, the following approach: + +[source,bash] +-------------------------------------------------------------------------- +# Good example -- do it this way +OCF_RESKEY_frobnicate_default="frobnicate" +: ${OCF_RESKEY_frobnicate="${OCF_RESKEY_frobnicate_default}"} +-------------------------------------------------------------------------- + +is much preferred over specifying a full path, as shown here: + +[source,bash] +-------------------------------------------------------------------------- +# Bad example -- avoid if you can +OCF_RESKEY_frobnicate_default="/usr/local/sbin/frobnicate" +: ${OCF_RESKEY_frobnicate="${OCF_RESKEY_frobnicate_default}"} +-------------------------------------------------------------------------- + +This rule holds for defaults, as well. + + + +== Special considerations + +=== Licensing + +Whenever possible, resource agent contributors are _encouraged_ to use +the GNU General Public License (GPL), version 2 and later, for any new +resource agents. The shell functions library does not strictly mandate +this, however, as it is licensed under the GNU Lesser General Public +License (LGPL), version 2.1 and later (so it can be used by non-GPL +agents). + +The resource agent _must_ explicitly state its own license in the +agent source code. + + +=== Locale settings + +When sourcing +ocf-shellfuncs+ as explained in <<_initialization>>, +any resource agent automatically sets +LANG+ and +LC_ALL+ to the +C+ +locale. Resource agents can thus expect to always operate in the +C+ +locale, and need not reset +LANG+ or any of the +LC_+ environment +variables themselves. 
+ + +=== Testing for running processes + +For testing whether a particular process (with a known process ID) is +currently running, a frequently found method is to send it a +0+ +signal and catch errors, similar to this example: + +[source,bash] +-------------------------------------------------------------------------- +if kill -s 0 `cat $daemon_pid_file`; then + ocf_log debug "Process is currently running" +else + ocf_log warn "Process is dead, removing pid file" + rm -f $daemon_pid_file +fi +-------------------------------------------------------------------------- + +IMPORTANT: An approach far superior to this example is to instead test +the _functionality_ of the daemon by connecting to it with a client +process, as shown in the example in +<<_literal_monitor_literal_action>>. + + +=== Specifying a master preference + +Stateful (master/slave) resources must set their own _master +preference_ -- they can thus provide hints to the cluster manager +which is the best instance to promote to the +Master+ role. + +IMPORTANT: It is acceptable for multiple instances to have identical +positive master preferences. In that case, the cluster resource +manager will automatically select a resource agent to +promote. However, if _all_ instances have the (default) master score +of zero, the cluster manager will not promote any instance at +all. Thus, it is crucial that at least one instance has a positive +master score. + +For this purpose, +crm_master+ comes in handy. This convenience +wrapper around +crm_attribute+ sets a node attribute named ++master-<<_literal_ocf_resource_instance_literal,$OCF_RESOURCE_INSTANCE>>+ +for the node it is being executed on, and fills this attribute with +the specified value. The cluster manager is then expected to translate +this into a promotion score for the corresponding instance, and base +its promotion preference on that score. 
+ +Stateful resource agents typically execute +crm_master+ during the +<<_literal_monitor_literal_action,+monitor+>> and/or +<<_literal_notify_literal_action,+notify+>> action. + +The following example assumes that the +foobar+ resource agent can +test the application's status by executing a binary that returns +certain exit codes based on whether + +* the resource is either in the master role, or is a slave that is + fully caught up with the master (at any rate, it has current data), + or +* the resource is in the slave role, but through some form of + asynchronous replication has "fallen behind" the master, or +* the resource has gracefully stopped, or +* the resource has unexpectedly failed. + +[source,bash] +-------------------------------------------------------------------------- +foobar_monitor() { + local rc + + # exit immediately if configuration is not valid + foobar_validate_all || exit $? + + ocf_run frobnicate --test + + # This example assumes the following exit code convention + # for frobnicate: + # 0: running, and fully caught up with master + # 1: gracefully stopped + # 2: running, but lagging behind master + # any other: error + case "$?" in + 0) + rc=$OCF_SUCCESS + ocf_log debug "Resource is running" + # Set a high master preference. The current master + # will always get this, plus 1. Any current slaves + # will get a high preference so that if the master + # fails, they are next in line to take over. 
+ crm_master -l reboot -v 100 + ;; + 1) + rc=$OCF_NOT_RUNNING + ocf_log debug "Resource is not running" + # Remove the master preference for this node + crm_master -l reboot -D + ;; + 2) + rc=$OCF_SUCCESS + ocf_log debug "Resource is lagging behind master" + # Set a low master preference: if the master fails + # right now, and there is another slave that does + # not lag behind the master, its higher master + # preference will win and that slave will become + # the new master + crm_master -l reboot -v 5 + ;; + *) + ocf_log err "Resource has failed" + exit $OCF_ERR_GENERIC + esac + + return $rc +} +-------------------------------------------------------------------------- + + +== Testing resource agents + +This section discusses automated testing for resource agents. Testing +is a vital aspect of development; it is crucial both for creating new +resource agents, and for modifying existing ones. + + +=== Testing with +ocf-tester+ + +The resource agents repository (and hence, any installed resource +agents package) contains a utility named +ocf-tester+. This shell +script allows you to conveniently and easily test the functionality of +your resource agent. + ++ocf-tester+ is commonly invoked, as +root+, like this: + +-------------------------------------------------------------------------- +ocf-tester -n <name> [-o <param>=<value> ... ] <resource agent> +-------------------------------------------------------------------------- + +* +<name>+ is an arbitrary resource name. + +* You may set any number of +<param>=<value>+ with the +-o+ option, + corresponding to any resource parameters you wish to set for + testing. + +* +<resource agent>+ is the full path to your resource agent. + +When invoked, +ocf-tester+ executes all mandatory actions and enforces +action behavior as explained in <<_resource_agent_actions>>. + +It also tests for optional actions. 
Optional actions must behave as +expected when advertised, but do not cause +ocf-tester+ to flag an +error if not implemented. + +IMPORTANT: +ocf-tester+ does not initiate "dry runs" of actions, nor +does it create resource dummies of any kind. Instead, it exercises the +actual resource agent as-is, whether that may include opening and +closing databases, mounting file systems, starting or stopping virtual +machines, etc. Use with care. + +For example, you could run +ocf-tester+ on the +foobar+ resource agent +as follows: + +-------------------------------------------------------------------------- +# ocf-tester -n foobartest \ + -o superfrobnicate=true \ + -o datadir=/tmp \ + /home/johndoe/ra-dev/foobar +Beginning tests for /home/johndoe/ra-dev/foobar... +* Your agent does not support the notify action (optional) +* Your agent does not support the reload action (optional) +/home/johndoe/ra-dev/foobar passed all tests +-------------------------------------------------------------------------- + +If the resource agent exhibits some difficult to grasp behaviour, +which is typically the case with just developed software, there +are +-v+ and +-d+ options to dump more output. If that does not +help, instruct +ocf-tester+ to trace the resource agent with ++-X+ (make sure to redirect output to a file, unless you are a +really fast reader). + +=== Testing with +ocft+ + ++ocft+ is a testing tool for resource agents. The main difference +to +ocf-tester+ is that +ocft+ can automate creating complex +testing environments. That includes package installation and +arbitrary shell scripting. 
+ +==== +ocft+ components + ++ocft+ consists of the following components: + +* A test case generator (+/usr/sbin/ocft+) -- generates shell + scripts from test case configuration files + +* Configuration files (+/usr/share/resource-agents/ocft/configs/+) -- + a configuration file contains environment setup and test cases + for one resource agent + +* The testing scripts are stored in +/var/lib/resource-agents/ocft/cases/+, + but normally there is no need to inspect them + +==== Customizing the testing environment + ++ocft+ modifies the runtime environment of the resource agent +either by changing environment variables (through the interface +defined by OCF) or by running ad-hoc shell scripts which can for +instance change permissions of a file or unmount a file system. + +==== How to test + +You need to know the software (resource) you want to test. Draw a +sketch of all interesting scenarios, with all expected and +unexpected conditions and how the resource agent should react to +them. Then you need to encode these conditions and the expected +outcomes as +ocft+ test cases. Running +ocft+ is then simple: + +--------------------------------------- +# ocft make <RA> +# ocft test <RA> +--------------------------------------- + +The first subcommand generates the scripts for your test cases +whereas the second runs them and checks the outcome. + +==== +ocft+ configuration file syntax + +There are four top level options each of which can contain +one or more sub-options. + +===== +CONFIG+ (top level option) + +This option is global and influences every test case. + + ** +AgentRoot+ (sub-option) +--------------------------------------- +AgentRoot /usr/lib/ocf/resource.d/xxx +--------------------------------------- + +Normally, we assume that the resource agent lives under the ++heartbeat+ provider. Use +AgentRoot+ to test an agent that is +distributed by another vendor. 
+ + ** +InstallPackage+ (sub-option) +--------------------------------------- +InstallPackage package [package2 [...]] +--------------------------------------- + +Install packages necessary for testing. The installation is +skipped if the packages have already been installed. + + ** +HangTimeout+ (sub-option) +--------------------------------------- +HangTimeout secs +--------------------------------------- + +The maximum time allowed for a single RA action. If this timer +expires, the action is considered to have failed. + +===== +SETUP-AGENT+ (top level option) +--------------------------------------- +SETUP-AGENT + bash commands +--------------------------------------- + +If the RA needs to be initialized before testing, you can put +bash code here for that purpose. The initialization is done only +once. If you need to reinitialize, delete the ++/tmp/.[AGENT_NAME]_set+ stamp file. + +===== +CASE+ (top level option) +--------------------------------------- +CASE "description" +--------------------------------------- + +This is the main building block of the test suite. Each test +case is described in one +CASE+ top level option. + +One case consists of several suboptions, typically followed by the ++RunAgent+ suboption. + + ** +Var+ (sub-option) +--------------------------------------- +Var VARIABLE=value +--------------------------------------- + +Sets an environment variable for the resource agent. Such variables +usually have the form +OCF_RESKEY_xxx+. Note that there must be no +blanks on either side of the "=". + + ** +Unvar+ (sub-option) +--------------------------------------- +Unvar VARIABLE [VARIABLE2 [...]] +--------------------------------------- + +Removes one or more environment variables. + + ** +Include+ (sub-option) +--------------------------------------- +Include macro_name +--------------------------------------- + +Includes the statements defined in the macro +macro_name+. See below +for a description of +CASE-BLOCK+.
+ +** +Bash+ (sub-option) +--------------------------------------- +Bash bash_codes +--------------------------------------- + +This option sets up the OS environment. You can insert arbitrary +bash code here to customize the system. Take care not to cause +unrecoverable damage to the system. + +** +BashAtExit+ (sub-option) +--------------------------------------- +BashAtExit bash_codes +--------------------------------------- + +This option restores the OS environment so that the next test case +can run correctly. Of course, you can also use the +Bash+ option for +that. However, if an error occurs during the test case, the script +quits immediately and never reaches your recovery code. To cover that +case, use +BashAtExit+, which restores the system environment +before the script quits. + +** +RunAgent+ (sub-option) +--------------------------------------- +RunAgent cmd [ret_value] +--------------------------------------- + +This option runs the resource agent. "cmd" is the action passed to the +resource agent, such as "start", "status", or "stop". The second +parameter is optional. If it is given, the value actually returned by +the resource agent is compared with the expected value. If they +differ, a bug has been found. + +It is also possible to execute a suboption on a remote host +instead of locally. The protocol used is ssh and the command is +run in the background. Just add the +@<ipaddr>+ suffix to the +suboption name. For instance: + +--------------------------------------- +Bash@192.168.1.100 date +--------------------------------------- + +would run the date program on the host +192.168.1.100+. + +NB: It is not clear how ssh can be automated, as we do not know the +environment in advance. Perhaps use "well-known" host names such +as "node2"? Also, if the command runs in the background, it is not +clear how the exit code is checked. Finally, does +Var@node+ make +sense? Or is the current environment somehow copied over? We +probably need an example here.
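
As a starting point for a local (non-remote) example, a single test case that combines the suboptions described so far might look like the sketch below. It is an untested illustration built on several assumptions: the +datadir+ parameter is borrowed from the +ocf-tester+ example earlier in this guide, the directory +/tmp/foobar-ocft+ is hypothetical, the indentation of suboptions and the use of +#+ comments are guesses, and the numeric return values (0 for success, 7 for "not running") assume the standard OCF return codes. Whether +ocft+ expects numeric codes or symbolic names should be checked against the installed version.

```
# Hypothetical sketch: path, parameter name and return codes are assumptions.
CONFIG
	HangTimeout 20

CASE "start, monitor and stop with a valid data directory"
	Var OCF_RESKEY_datadir=/tmp/foobar-ocft
	Bash mkdir -p /tmp/foobar-ocft
	BashAtExit rm -rf /tmp/foobar-ocft
	RunAgent start 0
	RunAgent monitor 0
	RunAgent stop 0
	RunAgent monitor 7
```

The cleanup is placed in +BashAtExit+ rather than in a trailing +Bash+ line so that the directory is removed even if one of the +RunAgent+ steps aborts the case.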
+ +Need examples in general. + +===== +CASE-BLOCK+ (top level option) +--------------------------------------- +CASE-BLOCK macro_name +--------------------------------------- + +The +CASE-BLOCK+ option defines a macro which can be +Include+d +in any +CASE+. All +CASE+ suboptions are valid in +CASE-BLOCK+. + + +== Installing and packaging resource agents + +This section discusses what to do with your resource agent once it is +done and tested -- where to install it, and how to include it in either +your own application package or in the Linux-HA resource agents +repository. + +=== Installing resource agents + +If you choose to include your resource agent in your own project, make +sure it installs into the correct location. Resource agents should +install into the +/usr/lib/ocf/resource.d/<provider>+ directory, where ++<provider>+ is the name of your project or any other name you wish to +identify the resource agent with. + +For example, if your +foobar+ resource agent is being packaged as part +of a project named +fortytwo+, then the correct full path to your +resource agent would be ++/usr/lib/ocf/resource.d/fortytwo/foobar+. Make sure your resource +agent installs with +0755+ (+-rwxr-xr-x+) permission bits. + +When installed this way, OCF-compliant cluster resource managers will +be able to properly identify, parse, and execute your resource +agent. The Pacemaker cluster manager, for example, would map the +above-mentioned installation path to the +ocf:fortytwo:foobar+ +resource type identifier. + +=== Packaging resource agents + +When you package resource agents as part of your own project, you +should apply the considerations outlined in this section. + +NOTE: If you instead prefer to submit your resource agent to the +Linux-HA resource agents repository, see +<<_submitting_resource_agents>> for information on doing so. 
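
Whatever the package format, the build has to stage the agent into the provider directory with the correct mode. A minimal sketch of that step is shown below, assuming an +install+ utility that supports +-d+ and +-m+ (as GNU coreutils does). The function name +stage_agent+ and its arguments are hypothetical and chosen only for illustration. They are not part of the resource-agents package.

```shell
#!/bin/sh
# Sketch only: stage_agent is a hypothetical helper, not part of resource-agents.
# Usage: stage_agent <agent-file> <provider> [<destdir>]
#   <agent-file>  path to the resource agent script
#   <provider>    provider name, e.g. fortytwo
#   <destdir>     optional staging root (empty means the live file system)
stage_agent() {
    agent=$1
    provider=$2
    destdir=${3:-}
    target_dir="$destdir/usr/lib/ocf/resource.d/$provider"

    # Create the provider directory, then copy the agent with mode 0755.
    install -d -m 0755 "$target_dir" &&
    install -m 0755 "$agent" "$target_dir/$(basename "$agent")"
}
```

For example, +stage_agent foobar fortytwo "$DESTDIR"+ would place the agent at +$DESTDIR/usr/lib/ocf/resource.d/fortytwo/foobar+ with +-rwxr-xr-x+ permissions.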
+ +==== RPM packaging + +It is recommended to put your OCF resource agent(s) in an RPM +sub-package, with the name +<toppackage>-resource-agents+. Ensure that +the package owns its provider directory, and depends on the upstream ++resource-agents+ package which lays out the directory hierarchy and +provides convenience shell functions. An example RPM spec snippet is +given below: + +-------------------------------------------------------------------------- +%package resource-agents +Summary: OCF resource agent for Foobar +Group: System Environment/Base +Requires: %{name} = %{version}-%{release}, resource-agents + +%description resource-agents +This package contains the OCF-compliant resource agents for Foobar. + +%files resource-agents +%defattr(755,root,root,-) +%dir %{_prefix}/lib/ocf/resource.d/fortytwo +%{_prefix}/lib/ocf/resource.d/fortytwo/foobar +-------------------------------------------------------------------------- + +NOTE: If an RPM spec file contains a +%package+ declaration, then RPM +considers this a sub-package which inherits top-level fields such as ++Name+, +Version+, +License+, etc. Sub-packages have the top-level +package name automatically prepended to their own name. Thus the snippet +above would create a sub-package named +foobar-resource-agents+ +(presuming the package +Name+ is +foobar+). + +==== Debian packaging + +For Debian packages, like for <<_rpm_packaging,RPMs>>, it is +recommended to create a separate package holding your resource agents, +which then should depend on the +cluster-agents+ package. + +NOTE: This section assumes that you are packaging with +debhelper+. 
+ +An example +debian/control+ snippet is given below: + +-------------------------------------------------------------------------- +Package: foobar-cluster-agents +Priority: extra +Architecture: all +Depends: cluster-agents +Description: OCF-compliant resource agents for Foobar +-------------------------------------------------------------------------- + +You will also create a separate +.install+ file. Sticking with the +example of installing the +foobar+ resource agent under the ++fortytwo+ provider, the +debian/foobar-cluster-agents.install+ file could +consist of the following content: + +-------------------------------------------------------------------------- +usr/lib/ocf/resource.d/fortytwo/foobar +-------------------------------------------------------------------------- + +=== Submitting resource agents + +If you choose not to bundle your resource agent with your own package, +but instead wish to submit it to the +https://github.com/ClusterLabs/resource-agents[upstream resource agents +repository on GitHub], please follow the steps outlined in this section. + +Create a fork of the +https://github.com/ClusterLabs/resource-agents[upstream repository], +clone it, and create a working branch with the following commands: + +-------------------------------------------------------------------------- +git clone https://github.com/<your-username>/resource-agents.git +cd resource-agents +git remote add upstream git@github.com:ClusterLabs/resource-agents.git +git checkout -b <new-branch> +-------------------------------------------------------------------------- + +Then, copy your resource agent into the +heartbeat+ subdirectory: +-------------------------------------------------------------------------- +cd heartbeat +cp /path/to/your/local/copy/of/foobar . +chmod 0755 foobar +cd ..
+-------------------------------------------------------------------------- + +Next, modify the +Makefile.am+ file in +resource-agents/heartbeat+ and +add your new resource agent to the +ocf_SCRIPTS+ list. This will make +sure the agent is properly installed. + +Then, open +Makefile.am+ in +resource-agents/doc/man+ and add ++ocf_heartbeat_<name>.7+ to the +man_MANS+ variable. This will +automatically generate a resource agent manual page from its metadata, +and then install that man page into the correct location. + +Now, add your new resource agent, and the two modifications to the +Makefiles, to your changeset: + +-------------------------------------------------------------------------- +git add heartbeat/foobar +git add heartbeat/Makefile.am +git add doc/man/Makefile.am +git commit +-------------------------------------------------------------------------- + +In your commit message, be sure to include a meaningful description, +for example: +-------------------------------------------------------------------------- +High: foobar: new resource agent + +This new resource agent adds functionality to manage a foobar service. +It supports being configured as a primitive or as a master/slave set, +and also optionally supports superfrobnication. +-------------------------------------------------------------------------- + +Now push the new branch to your fork on GitHub: +-------------------------------------------------------------------------- +git push -u origin <new-branch> +-------------------------------------------------------------------------- + +Create a Pull Request (PR) on GitHub that will be reviewed by the +upstream developers. + +Once your new resource agent has been accepted for merging, one of the +upstream developers will merge the Pull Request into the upstream +repository. At that point, you can update your main branch from +upstream, and remove your own branch.
+ +-------------------------------------------------------------------------- +git checkout main +git fetch upstream +git merge upstream/main +git branch -D <new-branch> +-------------------------------------------------------------------------- + +=== Maintaining resource agents + +If you maintain a specific resource agent, or you are making repeated +contributions to the codebase, it's usually a good idea to maintain +your own _fork_ of the +ClusterLabs/resource-agents+ repository on +GitHub. + +To do so, + +* https://github.com/signup[Create a GitHub account] if you do not + have one already. +* http://help.github.com/fork-a-repo/[Fork] the + https://github.com/ClusterLabs/resource-agents[+resource-agents+ + repository]. +* Clone your personal fork into a local working copy. + +As you work on resource agents, *please* commit early, and commit +often. You can always fold commits later with +git rebase -i+. + +Once you have made a number of changes that you would like others to +review, push them to your GitHub fork and send a post to the ++linux-ha-dev+ mailing list pointing people to it. + +After the review is done, fix up your tree with any requested changes, +and then issue a pull request. There are two ways of doing so: + +* You can use the +git request-pull+ utility to get a pre-populated + email skeleton summarizing your changesets. Add any information you + see fit, and send it to the list. It is a good idea to prefix your + email subject with +[GIT PULL]+ so upstream maintainers can pick the + message out easily. + +* You can also issue a pull request directly on GitHub. GitHub + automatically notifies upstream maintainers about new pull requests + by email. Please refer to + http://help.github.com/send-pull-requests/[github:help] for details + on initiating pull requests.