Diffstat (limited to 'doc/sphinx/Pacemaker_Explained/utilization.rst')
-rw-r--r-- | doc/sphinx/Pacemaker_Explained/utilization.rst | 264 |
1 files changed, 264 insertions, 0 deletions
diff --git a/doc/sphinx/Pacemaker_Explained/utilization.rst b/doc/sphinx/Pacemaker_Explained/utilization.rst
new file mode 100644
index 0000000..93c67cd
--- /dev/null
+++ b/doc/sphinx/Pacemaker_Explained/utilization.rst
@@ -0,0 +1,264 @@
+.. _utilization:
+
+Utilization and Placement Strategy
+----------------------------------
+
+Pacemaker decides where to place a resource according to the resource
+allocation scores on every node. The resource will be allocated to the
+node where the resource has the highest score.
+
+If the resource allocation scores on all the nodes are equal, by the default
+placement strategy, Pacemaker will choose a node with the least number of
+allocated resources for balancing the load. If the number of resources on each
+node is equal, the first eligible node listed in the CIB will be chosen to run
+the resource.
+
+Often, in real-world situations, different resources use significantly
+different proportions of a node's capacities (memory, I/O, etc.).
+The load cannot be balanced well based only on the number of resources
+allocated to a node. Besides, if resources are placed such that their combined
+requirements exceed the provided capacity, they may fail to start at all or
+run with degraded performance.
+
+To take these factors into account, Pacemaker allows you to configure:
+
+#. The capacity a certain node provides.
+
+#. The capacity a certain resource requires.
+
+#. An overall strategy for placement of resources.
+
+Utilization attributes
+######################
+
+To configure the capacity that a node provides or a resource requires,
+you can use *utilization attributes* in ``node`` and ``resource`` objects.
+You can name utilization attributes according to your preferences and define as
+many name/value pairs as your configuration needs. However, the attributes'
+values must be integers.
+
+.. topic:: Specifying CPU and RAM capacities of two nodes
+
+   .. code-block:: xml
+
+      <node id="node1" type="normal" uname="node1">
+        <utilization id="node1-utilization">
+          <nvpair id="node1-utilization-cpu" name="cpu" value="2"/>
+          <nvpair id="node1-utilization-memory" name="memory" value="2048"/>
+        </utilization>
+      </node>
+      <node id="node2" type="normal" uname="node2">
+        <utilization id="node2-utilization">
+          <nvpair id="node2-utilization-cpu" name="cpu" value="4"/>
+          <nvpair id="node2-utilization-memory" name="memory" value="4096"/>
+        </utilization>
+      </node>
+
+.. topic:: Specifying CPU and RAM consumed by several resources
+
+   .. code-block:: xml
+
+      <primitive id="rsc-small" class="ocf" provider="pacemaker" type="Dummy">
+        <utilization id="rsc-small-utilization">
+          <nvpair id="rsc-small-utilization-cpu" name="cpu" value="1"/>
+          <nvpair id="rsc-small-utilization-memory" name="memory" value="1024"/>
+        </utilization>
+      </primitive>
+      <primitive id="rsc-medium" class="ocf" provider="pacemaker" type="Dummy">
+        <utilization id="rsc-medium-utilization">
+          <nvpair id="rsc-medium-utilization-cpu" name="cpu" value="2"/>
+          <nvpair id="rsc-medium-utilization-memory" name="memory" value="2048"/>
+        </utilization>
+      </primitive>
+      <primitive id="rsc-large" class="ocf" provider="pacemaker" type="Dummy">
+        <utilization id="rsc-large-utilization">
+          <nvpair id="rsc-large-utilization-cpu" name="cpu" value="3"/>
+          <nvpair id="rsc-large-utilization-memory" name="memory" value="3072"/>
+        </utilization>
+      </primitive>
+
+A node is considered eligible for a resource if it has sufficient free
+capacity to satisfy the resource's requirements. The nature of the required
+or provided capacities is completely irrelevant to Pacemaker -- it just makes
+sure that all capacity requirements of a resource are satisfied before placing
+it on a node.
+
+Utilization attributes used on a node object can also be *transient* *(since 2.1.6)*.
+These attributes are added to a ``transient_attributes`` section for the node
+and are forgotten by the cluster when the node goes offline. The ``attrd_updater``
+tool can be used to set these attributes.
+
+.. topic:: Transient utilization attribute for node cluster-1
+
+   .. code-block:: xml
+
+      <transient_attributes id="cluster-1">
+        <utilization id="status-cluster-1">
+          <nvpair id="status-cluster-1-cpu" name="cpu" value="1"/>
+        </utilization>
+      </transient_attributes>
+
+.. note::
+
+   Utilization is supported for bundles *(since 2.1.3)*, but only for bundles
+   with an inner primitive. Any resource utilization values should be specified
+   for the inner primitive, but any priority meta-attribute should be specified
+   for the outer bundle.
+
+
+Placement Strategy
+##################
+
+After you have configured the capacities your nodes provide and the
+capacities your resources require, you need to set the ``placement-strategy``
+in the global cluster options, otherwise the capacity configurations have
+*no effect*.
+
+Four values are available for the ``placement-strategy``:
+
+* **default**
+
+  Utilization values are not taken into account at all.
+  Resources are allocated according to allocation scores. If scores are equal,
+  resources are evenly distributed across nodes.
+
+* **utilization**
+
+  Utilization values are taken into account *only* when deciding whether a node
+  is considered eligible (i.e. whether it has sufficient free capacity to satisfy
+  the resource's requirements). Load-balancing is still done based on the
+  number of resources allocated to a node.
+
+* **balanced**
+
+  Utilization values are taken into account when deciding whether a node
+  is eligible to serve a resource *and* when load-balancing, so an attempt is
+  made to spread the resources in a way that optimizes resource performance.
+
+* **minimal**
+
+  Utilization values are taken into account *only* when deciding whether a node
+  is eligible to serve a resource. For load-balancing, an attempt is made to
+  concentrate the resources on as few nodes as possible, thereby enabling
+  possible power savings on the remaining nodes.
+
+Set ``placement-strategy`` with ``crm_attribute``:
+
+   .. code-block:: none
+
+      # crm_attribute --name placement-strategy --update balanced
+
+Now Pacemaker will ensure the load from your resources will be distributed
+evenly throughout the cluster, without the need for convoluted sets of
+colocation constraints.
+
+Allocation Details
+##################
+
+Which node is preferred to get consumed first when allocating resources?
+________________________________________________________________________
+
+* The node with the highest node weight gets consumed first. Node weight
+  is a score maintained by the cluster to represent node health.
+
+* If multiple nodes have the same node weight:
+
+  * If ``placement-strategy`` is ``default`` or ``utilization``,
+    the node that has the least number of allocated resources gets consumed first.
+
+    * If their numbers of allocated resources are equal,
+      the first eligible node listed in the CIB gets consumed first.
+
+  * If ``placement-strategy`` is ``balanced``,
+    the node that has the most free capacity gets consumed first.
+
+    * If the free capacities of the nodes are equal,
+      the node that has the least number of allocated resources gets consumed first.
+
+      * If their numbers of allocated resources are equal,
+        the first eligible node listed in the CIB gets consumed first.
+
+  * If ``placement-strategy`` is ``minimal``,
+    the first eligible node listed in the CIB gets consumed first.
+
+Which node has more free capacity?
+__________________________________
+
+If only one type of utilization attribute has been defined, free capacity
+is a simple numeric comparison.
+
+If multiple types of utilization attributes have been defined, then
+the node that is numerically highest in the most attribute types
+has the most free capacity. For example:
+
+* If ``nodeA`` has more free ``cpus``, and ``nodeB`` has more free ``memory``,
+  then their free capacities are equal.
+
+* If ``nodeA`` has more free ``cpus``, while ``nodeB`` has more free ``memory``
+  and ``storage``, then ``nodeB`` has more free capacity.
+
+Which resource is preferred to be assigned first?
+_________________________________________________
+
+* The resource that has the highest ``priority`` (see :ref:`resource_options`) gets
+  allocated first.
+
+* If their priorities are equal, check whether they are already running. The
+  resource that has the highest score on the node where it's running gets allocated
+  first, to prevent resource shuffling.
+
+* If the scores above are equal or the resources are not running, the resource that
+  has the highest score on the preferred node gets allocated first.
+
+* If the scores above are equal, the first runnable resource listed in the CIB
+  gets allocated first.
+
+Limitations and Workarounds
+###########################
+
+The type of problem Pacemaker is dealing with here is known as the
+`knapsack problem <http://en.wikipedia.org/wiki/Knapsack_problem>`_ and falls into
+the `NP-complete <http://en.wikipedia.org/wiki/NP-complete>`_ category of computer
+science problems -- a fancy way of saying "it takes a really long time
+to solve".
+
+Clearly in an HA cluster, it's not acceptable to spend minutes, let alone hours
+or days, finding an optimal solution while services remain unavailable.
+
+So instead of trying to solve the problem completely, Pacemaker uses a
+*best effort* algorithm for determining which node should host a particular
+service. This means it arrives at a solution much faster than traditional
+linear programming algorithms, but at the price of possibly leaving some
+services stopped.
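The node-preference rules listed under "Allocation Details" are what such a best-effort pass applies each time it picks a node. The following Python sketch is only an illustration of those documented rules, not Pacemaker's scheduler code. The ``Node`` class, ``compare_free_capacity()`` and ``preference_order()`` are hypothetical names, and treating an attribute that a node does not define as ``0`` is an assumption.

```python
from dataclasses import dataclass, field
from functools import cmp_to_key


@dataclass
class Node:
    """Hypothetical stand-in for a cluster node (not a Pacemaker type)."""
    name: str
    weight: int = 0                            # node weight; higher is preferred
    free: dict = field(default_factory=dict)   # remaining capacity per attribute
    n_resources: int = 0                       # resources already allocated here


def compare_free_capacity(a, b):
    """Return >0 if a has more free capacity than b, <0 if less, 0 if equal.

    Each attribute type counts once, in favor of the node with the larger
    free value. Assumption: an attribute missing on a node counts as 0.
    """
    votes = 0
    for attr in set(a.free) | set(b.free):
        av, bv = a.free.get(attr, 0), b.free.get(attr, 0)
        votes += (av > bv) - (av < bv)
    return votes


def preference_order(nodes, strategy):
    """Sort nodes from most to least preferred; input order is CIB order."""
    def cmp(a, b):
        if a.weight != b.weight:
            return b.weight - a.weight         # highest node weight first
        if strategy == "minimal":
            return 0                           # fall through to CIB order
        if strategy == "balanced":
            c = compare_free_capacity(a, b)
            if c != 0:
                return -c                      # most free capacity first
        return a.n_resources - b.n_resources   # fewest resources first
    # sorted() is stable, so remaining ties keep CIB order. The pairwise
    # capacity comparison is not guaranteed to be transitive for three or
    # more nodes, so treat the resulting order as illustrative only.
    return sorted(nodes, key=cmp_to_key(cmp))


node_a = Node("nodeA", free={"cpus": 4, "memory": 1024})
node_b = Node("nodeB", free={"cpus": 2, "memory": 2048})
order = [n.name for n in preference_order([node_a, node_b], "balanced")]
```

In this sketch ``nodeA`` and ``nodeB`` each win one attribute type, so their free capacities compare as equal and the order falls back to resource count and then CIB order.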
+
+In the contrived example at the start of this chapter:
+
+* ``rsc-small`` would be allocated to ``node1``
+
+* ``rsc-medium`` would be allocated to ``node2``
+
+* ``rsc-large`` would remain inactive
+
+This is not ideal.
+
+There are various approaches to dealing with the limitations of
+Pacemaker's placement strategy:
+
+* **Ensure you have sufficient physical capacity.**
+
+  It might sound obvious, but if the physical capacity of your nodes is (close to)
+  maxed out by the cluster under normal conditions, then failover isn't going to
+  go well. Even without the utilization feature, you'll start hitting timeouts and
+  getting secondary failures.
+
+* **Build some buffer into the capabilities advertised by the nodes.**
+
+  Advertise slightly more resources than the nodes physically have, on the (usually
+  valid) assumption that a resource will not use 100% of the configured amount of
+  CPU, memory and so forth *all* the time. This practice is sometimes called *overcommit*.
+
+* **Specify resource priorities.**
+
+  If the cluster is going to sacrifice services, it should be the ones you care
+  about (comparatively) the least. Ensure that resource priorities are properly set
+  so that your most important resources are scheduled first.
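To see why the contrived example leaves ``rsc-large`` inactive, and what setting priorities changes, the single-pass placement can be sketched in Python. This is a simplified model under stated assumptions, not Pacemaker's implementation: ``place()`` is a hypothetical helper, it applies only the ``utilization`` strategy's tie-break (fewest allocated resources, then CIB order), and it ignores allocation scores, node weight and stickiness. The priority value ``10`` is illustrative.

```python
def place(resources, nodes):
    """Single-pass, best-effort placement with no backtracking.

    resources: list of (name, priority, needs) tuples in CIB order
    nodes:     dict of node name -> capacity dict, in CIB order
    Returns a dict of resource name -> node name, or None if unplaced.
    """
    free = {n: dict(cap) for n, cap in nodes.items()}
    count = {n: 0 for n in nodes}
    placement = {}
    # Highest priority first; sorted() is stable, so ties keep CIB order.
    for name, _prio, needs in sorted(resources, key=lambda r: -r[1]):
        eligible = [n for n in free
                    if all(free[n].get(k, 0) >= v for k, v in needs.items())]
        if not eligible:
            placement[name] = None             # resource stays stopped
            continue
        # Fewest allocated resources wins; min() keeps CIB order on ties.
        target = min(eligible, key=lambda n: count[n])
        for k, v in needs.items():
            free[target][k] = free[target].get(k, 0) - v
        count[target] += 1
        placement[name] = target
    return placement


nodes = {
    "node1": {"cpu": 2, "memory": 2048},
    "node2": {"cpu": 4, "memory": 4096},
}
small = {"cpu": 1, "memory": 1024}
medium = {"cpu": 2, "memory": 2048}
large = {"cpu": 3, "memory": 3072}

equal_priority = place(
    [("rsc-small", 0, small), ("rsc-medium", 0, medium), ("rsc-large", 0, large)],
    nodes)
large_first = place(
    [("rsc-small", 0, small), ("rsc-medium", 0, medium), ("rsc-large", 10, large)],
    nodes)
```

With equal priorities this model reproduces the outcome listed above: ``rsc-small`` on ``node1``, ``rsc-medium`` on ``node2``, and ``rsc-large`` unplaced. Giving ``rsc-large`` the higher priority places it on ``node2`` and ``rsc-small`` on ``node1``, leaving ``rsc-medium`` stopped instead. Priorities therefore choose which service is sacrificed; they do not make the single pass find a placement where everything fits.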