Diffstat (limited to 'doc/sphinx/Pacemaker_Explained/utilization.rst')
 -rw-r--r--  doc/sphinx/Pacemaker_Explained/utilization.rst | 264
 1 file changed, 264 insertions(+), 0 deletions(-)
diff --git a/doc/sphinx/Pacemaker_Explained/utilization.rst b/doc/sphinx/Pacemaker_Explained/utilization.rst
new file mode 100644
index 0000000..93c67cd
--- /dev/null
+++ b/doc/sphinx/Pacemaker_Explained/utilization.rst
@@ -0,0 +1,264 @@
+.. _utilization:
+
+Utilization and Placement Strategy
+----------------------------------
+
+Pacemaker decides where to place a resource according to the resource
+allocation scores on every node. The resource will be allocated to the
+node where the resource has the highest score.
+
+If the resource allocation scores on all the nodes are equal, then under the
+default placement strategy, Pacemaker will choose the node with the fewest
+allocated resources, in order to balance the load. If the number of resources on
+each node is equal, the first eligible node listed in the CIB will be chosen to
+run the resource.
+
+Often, in real-world situations, different resources use significantly
+different proportions of a node's capacities (memory, I/O, etc.), so the load
+cannot be balanced well by counting allocated resources alone. In addition, if
+resources are placed such that their combined requirements exceed the provided
+capacity, they may fail to start at all or may run with degraded performance.
+
+To take these factors into account, Pacemaker allows you to configure:
+
+#. The capacity a certain node provides.
+
+#. The capacity a certain resource requires.
+
+#. An overall strategy for placement of resources.
+
+Utilization attributes
+######################
+
+To configure the capacity that a node provides or a resource requires,
+you can use *utilization attributes* in ``node`` and ``resource`` objects.
+You can name utilization attributes according to your preferences and define as
+many name/value pairs as your configuration needs. However, the attributes'
+values must be integers.
+
+.. topic:: Specifying CPU and RAM capacities of two nodes
+
+ .. code-block:: xml
+
+ <node id="node1" type="normal" uname="node1">
+ <utilization id="node1-utilization">
+ <nvpair id="node1-utilization-cpu" name="cpu" value="2"/>
+ <nvpair id="node1-utilization-memory" name="memory" value="2048"/>
+ </utilization>
+ </node>
+ <node id="node2" type="normal" uname="node2">
+ <utilization id="node2-utilization">
+ <nvpair id="node2-utilization-cpu" name="cpu" value="4"/>
+ <nvpair id="node2-utilization-memory" name="memory" value="4096"/>
+ </utilization>
+ </node>
+
+.. topic:: Specifying CPU and RAM consumed by several resources
+
+ .. code-block:: xml
+
+ <primitive id="rsc-small" class="ocf" provider="pacemaker" type="Dummy">
+ <utilization id="rsc-small-utilization">
+ <nvpair id="rsc-small-utilization-cpu" name="cpu" value="1"/>
+ <nvpair id="rsc-small-utilization-memory" name="memory" value="1024"/>
+ </utilization>
+ </primitive>
+ <primitive id="rsc-medium" class="ocf" provider="pacemaker" type="Dummy">
+ <utilization id="rsc-medium-utilization">
+ <nvpair id="rsc-medium-utilization-cpu" name="cpu" value="2"/>
+ <nvpair id="rsc-medium-utilization-memory" name="memory" value="2048"/>
+ </utilization>
+ </primitive>
+ <primitive id="rsc-large" class="ocf" provider="pacemaker" type="Dummy">
+ <utilization id="rsc-large-utilization">
+ <nvpair id="rsc-large-utilization-cpu" name="cpu" value="3"/>
+ <nvpair id="rsc-large-utilization-memory" name="memory" value="3072"/>
+ </utilization>
+ </primitive>
+
+A node is considered eligible for a resource if it has sufficient free
+capacity to satisfy the resource's requirements. The nature of the required
+or provided capacities is completely irrelevant to Pacemaker -- it just makes
+sure that all capacity requirements of a resource are satisfied before placing
+the resource on a node.
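+
+The same capacities can also be set from the command line instead of by editing
+the XML directly. The following is a minimal usage sketch, assuming the
+``--utilization`` options that recent versions of ``crm_attribute`` and
+``crm_resource`` offer for this purpose (check your version's help output), and
+reusing the node and resource names from the examples above:
+
+.. code-block:: none
+
+   # crm_attribute --node node1 --name cpu --update 2 --utilization
+   # crm_resource --resource rsc-small --set-parameter cpu --parameter-value 1 --utilization
+
+The first command sets the ``cpu`` capacity provided by ``node1``; the second
+sets the ``cpu`` capacity required by ``rsc-small``.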
+
+Utilization attributes used on a node object can also be *transient* *(since 2.1.6)*.
+These attributes are added to a ``transient_attributes`` section for the node
+and are forgotten by the cluster when the node goes offline. The ``attrd_updater``
+tool can be used to set these attributes.
+
+.. topic:: Transient utilization attribute for node cluster-1
+
+ .. code-block:: xml
+
+ <transient_attributes id="cluster-1">
+ <utilization id="status-cluster-1">
+ <nvpair id="status-cluster-1-cpu" name="cpu" value="1"/>
+ </utilization>
+ </transient_attributes>
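+
+A minimal sketch of setting the transient attribute shown above with
+``attrd_updater``, assuming the ``--utilization`` option that accompanies
+transient utilization support in recent versions:
+
+.. code-block:: none
+
+   # attrd_updater --node cluster-1 --name cpu --update 1 --utilization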
+
+.. note::
+
+ Utilization is supported for bundles *(since 2.1.3)*, but only for bundles
+ with an inner primitive. Any resource utilization values should be specified
+ for the inner primitive, but any priority meta-attribute should be specified
+ for the outer bundle.
+
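+A rough sketch of where those settings go, using a hypothetical bundle ID,
+container image, and IP range, and omitting most of the bundle configuration:
+
+.. topic:: Utilization on a bundle's inner primitive, priority on the bundle
+
+   .. code-block:: xml
+
+      <bundle id="httpd-bundle">
+        <podman image="localhost/pcmk-httpd:latest" replicas="2"/>
+        <network ip-range-start="192.168.122.131" host-netmask="24"/>
+        <primitive id="httpd" class="ocf" provider="heartbeat" type="apache">
+          <utilization id="httpd-utilization">
+            <nvpair id="httpd-utilization-cpu" name="cpu" value="1"/>
+            <nvpair id="httpd-utilization-memory" name="memory" value="1024"/>
+          </utilization>
+        </primitive>
+        <meta_attributes id="httpd-bundle-meta">
+          <nvpair id="httpd-bundle-meta-priority" name="priority" value="10"/>
+        </meta_attributes>
+      </bundle>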
+
+Placement Strategy
+##################
+
+After you have configured the capacities your nodes provide and the
+capacities your resources require, you need to set the ``placement-strategy``
+in the global cluster options; otherwise, the capacity configurations have
+*no effect*.
+
+Four values are available for the ``placement-strategy``:
+
+* **default**
+
+ Utilization values are not taken into account at all.
+ Resources are allocated according to allocation scores. If scores are equal,
+ resources are evenly distributed across nodes.
+
+* **utilization**
+
+ Utilization values are taken into account *only* when deciding whether a node
+ is considered eligible (i.e. whether it has sufficient free capacity to satisfy
+ the resource's requirements). Load-balancing is still done based on the
+ number of resources allocated to a node.
+
+* **balanced**
+
+ Utilization values are taken into account when deciding whether a node
+ is eligible to serve a resource *and* when load-balancing, so an attempt is
+ made to spread the resources in a way that optimizes resource performance.
+
+* **minimal**
+
+ Utilization values are taken into account *only* when deciding whether a node
+ is eligible to serve a resource. For load-balancing, an attempt is made to
+ concentrate the resources on as few nodes as possible, thereby enabling
+ possible power savings on the remaining nodes.
+
+Set ``placement-strategy`` with ``crm_attribute``:
+
+ .. code-block:: none
+
+ # crm_attribute --name placement-strategy --update balanced
+
+Now Pacemaker will ensure the load from your resources will be distributed
+evenly throughout the cluster, without the need for convoluted sets of
+colocation constraints.
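+
+The value currently in effect can be checked with ``crm_attribute``'s query
+mode:
+
+.. code-block:: none
+
+   # crm_attribute --name placement-strategy --query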
+
+Allocation Details
+##################
+
+Which node is preferred to get consumed first when allocating resources?
+________________________________________________________________________
+
+* The node with the highest node weight gets consumed first. Node weight
+ is a score maintained by the cluster to represent node health.
+
+* If multiple nodes have the same node weight:
+
+ * If ``placement-strategy`` is ``default`` or ``utilization``,
+ the node that has the least number of allocated resources gets consumed first.
+
+ * If their numbers of allocated resources are equal,
+ the first eligible node listed in the CIB gets consumed first.
+
+ * If ``placement-strategy`` is ``balanced``,
+ the node that has the most free capacity gets consumed first.
+
+ * If the free capacities of the nodes are equal,
+ the node that has the least number of allocated resources gets consumed first.
+
+ * If their numbers of allocated resources are equal,
+ the first eligible node listed in the CIB gets consumed first.
+
+ * If ``placement-strategy`` is ``minimal``,
+ the first eligible node listed in the CIB gets consumed first.
+
+Which node has more free capacity?
+__________________________________
+
+If only one type of utilization attribute has been defined, free capacity
+is a simple numeric comparison.
+
+If multiple types of utilization attributes have been defined, then
+the node that is numerically highest in the most attribute types
+has the most free capacity. For example:
+
+* If ``nodeA`` has more free ``cpus``, and ``nodeB`` has more free ``memory``,
+ then their free capacities are equal.
+
+* If ``nodeA`` has more free ``cpus``, while ``nodeB`` has more free ``memory``
+ and ``storage``, then ``nodeB`` has more free capacity.
+
+Which resource is preferred to be assigned first?
+_________________________________________________
+
+* The resource that has the highest ``priority`` (see :ref:`resource_options`) gets
+ allocated first.
+
+* If their priorities are equal, check whether they are already running. The
+ resource that has the highest score on the node where it's running gets allocated
+ first, to prevent resource shuffling.
+
+* If the scores above are equal or the resources are not running, the resource
+  that has the highest score on the preferred node gets allocated first.
+
+* If the scores above are equal, the first runnable resource listed in the CIB
+ gets allocated first.
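+
+To inspect the assignment scores and utilization values that the cluster has
+calculated for the current configuration, the scheduler's view can be dumped
+with ``crm_simulate``. This is a usage sketch assuming its ``--show-scores``
+and ``--show-utilization`` options:
+
+.. code-block:: none
+
+   # crm_simulate --live-check --show-scores --show-utilization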
+
+Limitations and Workarounds
+###########################
+
+The type of problem Pacemaker is dealing with here is known as the
+`knapsack problem <http://en.wikipedia.org/wiki/Knapsack_problem>`_ and falls into
+the `NP-complete <http://en.wikipedia.org/wiki/NP-complete>`_ category of computer
+science problems -- a fancy way of saying "it takes a really long time
+to solve".
+
+Clearly, in an HA cluster, it's not acceptable to spend minutes, let alone hours
+or days, finding an optimal solution while services remain unavailable.
+
+So instead of trying to solve the problem completely, Pacemaker uses a
+*best effort* algorithm for determining which node should host a particular
+service. This means it arrives at a solution much faster than traditional
+linear programming algorithms, but at the price of potentially leaving some
+services stopped.
+
+In the contrived example at the start of this chapter:
+
+* ``rsc-small`` would be allocated to ``node1``
+
+* ``rsc-medium`` would be allocated to ``node2``
+
+* ``rsc-large`` would remain inactive
+
+This is not ideal: after the first two placements, neither node has the 3 units
+of ``cpu`` and 3072 units of ``memory`` that ``rsc-large`` requires, so it has
+to stay stopped.
+
+There are various approaches to dealing with the limitations of
+Pacemaker's placement strategy:
+
+* **Ensure you have sufficient physical capacity.**
+
+ It might sound obvious, but if the physical capacity of your nodes is (close to)
+ maxed out by the cluster under normal conditions, then failover isn't going to
+ go well. Even without the utilization feature, you'll start hitting timeouts and
+ getting secondary failures.
+
+* **Build some buffer into the capabilities advertised by the nodes.**
+
+  Advertise slightly more resources than the nodes physically have, on the
+  (usually valid) assumption that a resource will not use 100% of the configured
+  amount of CPU, memory and so forth *all* the time. This practice is sometimes
+  called *overcommit*.
+
+* **Specify resource priorities.**
+
+ If the cluster is going to sacrifice services, it should be the ones you care
+ about (comparatively) the least. Ensure that resource priorities are properly set
+ so that your most important resources are scheduled first.
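+
+For example, following the last recommendation, the ``rsc-large`` resource from
+the earlier example could be marked as the most important one. This is a minimal
+sketch using ``crm_resource`` to set the ``priority`` meta-attribute (the value
+``10`` here is arbitrary):
+
+.. code-block:: none
+
+   # crm_resource --resource rsc-large --meta --set-parameter priority --parameter-value 10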