summaryrefslogtreecommitdiffstats
path: root/doc/sphinx/Pacemaker_Explained/utilization.rst
blob: 93c67cdf3133b59ca423ea937f13dfb5acea40d1 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
.. _utilization:

Utilization and Placement Strategy
----------------------------------

Pacemaker decides where to place a resource according to the resource
allocation scores on every node. The resource will be allocated to the
node where the resource has the highest score.

If the resource allocation scores on all the nodes are equal, by the default
placement strategy, Pacemaker will choose a node with the least number of
allocated resources for balancing the load. If the number of resources on each
node is equal, the first eligible node listed in the CIB will be chosen to run
the resource.

Often, in real-world situations, different resources use significantly
different proportions of a node's capacities (memory, I/O, etc.).
We cannot balance the load ideally just according to the number of resources
allocated to a node. Besides, if resources are placed such that their combined
requirements exceed the provided capacity, they may fail to start completely or
run with degraded performance.

To take these factors into account, Pacemaker allows you to configure:

#. The capacity a certain node provides.

#. The capacity a certain resource requires.

#. An overall strategy for placement of resources.

Utilization attributes
######################

To configure the capacity that a node provides or a resource requires,
you can use *utilization attributes* in ``node`` and ``resource`` objects.
You can name utilization attributes according to your preferences and define as
many name/value pairs as your configuration needs. However, the attributes'
values must be integers.

.. topic:: Specifying CPU and RAM capacities of two nodes

   .. code-block:: xml

      <node id="node1" type="normal" uname="node1">
        <utilization id="node1-utilization">
          <nvpair id="node1-utilization-cpu" name="cpu" value="2"/>
          <nvpair id="node1-utilization-memory" name="memory" value="2048"/>
        </utilization>
      </node>
      <node id="node2" type="normal" uname="node2">
        <utilization id="node2-utilization">
          <nvpair id="node2-utilization-cpu" name="cpu" value="4"/>
          <nvpair id="node2-utilization-memory" name="memory" value="4096"/>
        </utilization>
      </node>

.. topic:: Specifying CPU and RAM consumed by several resources

   .. code-block:: xml

      <primitive id="rsc-small" class="ocf" provider="pacemaker" type="Dummy">
        <utilization id="rsc-small-utilization">
          <nvpair id="rsc-small-utilization-cpu" name="cpu" value="1"/>
          <nvpair id="rsc-small-utilization-memory" name="memory" value="1024"/>
        </utilization>
      </primitive>
      <primitive id="rsc-medium" class="ocf" provider="pacemaker" type="Dummy">
        <utilization id="rsc-medium-utilization">
          <nvpair id="rsc-medium-utilization-cpu" name="cpu" value="2"/>
          <nvpair id="rsc-medium-utilization-memory" name="memory" value="2048"/>
        </utilization>
      </primitive>
      <primitive id="rsc-large" class="ocf" provider="pacemaker" type="Dummy">
        <utilization id="rsc-large-utilization">
          <nvpair id="rsc-large-utilization-cpu" name="cpu" value="3"/>
          <nvpair id="rsc-large-utilization-memory" name="memory" value="3072"/>
        </utilization>
      </primitive>

A node is considered eligible for a resource if it has sufficient free
capacity to satisfy the resource's requirements. The nature of the required
or provided capacities is completely irrelevant to Pacemaker -- it just makes
sure that all capacity requirements of a resource are satisfied before placing
a resource to a node.

Utilization attributes used on a node object can also be *transient* *(since 2.1.6)*.
These attributes are added to a ``transient_attributes`` section for the node
and are forgotten by the cluster when the node goes offline.  The ``attrd_updater``
tool can be used to set these attributes.

.. topic:: Transient utilization attribute for node cluster-1

   .. code-block:: xml

      <transient_attributes id="cluster-1">
        <utilization id="status-cluster-1">
          <nvpair id="status-cluster-1-cpu" name="cpu" value="1"/>
        </utilization>
      </transient_attributes>

.. note::

   Utilization is supported for bundles *(since 2.1.3)*, but only for bundles
   with an inner primitive. Any resource utilization values should be specified
   for the inner primitive, but any priority meta-attribute should be specified
   for the outer bundle.


Placement Strategy
##################

After you have configured the capacities your nodes provide and the
capacities your resources require, you need to set the ``placement-strategy``
in the global cluster options, otherwise the capacity configurations have
*no effect*.

Four values are available for the ``placement-strategy``: 

* **default**

   Utilization values are not taken into account at all.
   Resources are allocated according to allocation scores. If scores are equal,
   resources are evenly distributed across nodes.

* **utilization**

   Utilization values are taken into account *only* when deciding whether a node
   is considered eligible (i.e. whether it has sufficient free capacity to satisfy
   the resource's requirements). Load-balancing is still done based on the
   number of resources allocated to a node. 

* **balanced**

   Utilization values are taken into account when deciding whether a node
   is eligible to serve a resource *and* when load-balancing, so an attempt is
   made to spread the resources in a way that optimizes resource performance.

* **minimal**

   Utilization values are taken into account *only* when deciding whether a node
   is eligible to serve a resource. For load-balancing, an attempt is made to
   concentrate the resources on as few nodes as possible, thereby enabling
   possible power savings on the remaining nodes. 

Set ``placement-strategy`` with ``crm_attribute``:

   .. code-block:: none

      # crm_attribute --name placement-strategy --update balanced

Now Pacemaker will ensure the load from your resources will be distributed
evenly throughout the cluster, without the need for convoluted sets of
colocation constraints.

Allocation Details
##################

Which node is preferred to get consumed first when allocating resources?
________________________________________________________________________

* The node with the highest node weight gets consumed first. Node weight
  is a score maintained by the cluster to represent node health.

* If multiple nodes have the same node weight:

 * If ``placement-strategy`` is ``default`` or ``utilization``,
   the node that has the least number of allocated resources gets consumed first.

   * If their numbers of allocated resources are equal,
     the first eligible node listed in the CIB gets consumed first.

 * If ``placement-strategy`` is ``balanced``,
   the node that has the most free capacity gets consumed first.

   * If the free capacities of the nodes are equal,
     the node that has the least number of allocated resources gets consumed first.

     * If their numbers of allocated resources are equal,
       the first eligible node listed in the CIB gets consumed first.

 * If ``placement-strategy`` is ``minimal``,
   the first eligible node listed in the CIB gets consumed first.

Which node has more free capacity?
__________________________________

If only one type of utilization attribute has been defined, free capacity
is a simple numeric comparison.

If multiple types of utilization attributes have been defined, then
the node that is numerically highest in the the most attribute types
has the most free capacity. For example:

* If ``nodeA`` has more free ``cpus``, and ``nodeB`` has more free ``memory``,
  then their free capacities are equal.

* If ``nodeA`` has more free ``cpus``, while ``nodeB`` has more free ``memory``
  and ``storage``, then ``nodeB`` has more free capacity.

Which resource is preferred to be assigned first?
_________________________________________________

* The resource that has the highest ``priority`` (see :ref:`resource_options`) gets
  allocated first.

* If their priorities are equal, check whether they are already running. The
  resource that has the highest score on the node where it's running gets allocated
  first, to prevent resource shuffling.

* If the scores above are equal or the resources are not running, the resource has
  the highest score on the preferred node gets allocated first.

* If the scores above are equal, the first runnable resource listed in the CIB
  gets allocated first.

Limitations and Workarounds
###########################

The type of problem Pacemaker is dealing with here is known as the
`knapsack problem <http://en.wikipedia.org/wiki/Knapsack_problem>`_ and falls into
the `NP-complete <http://en.wikipedia.org/wiki/NP-complete>`_ category of computer
science problems -- a fancy way of saying "it takes a really long time
to solve".

Clearly in a HA cluster, it's not acceptable to spend minutes, let alone hours
or days, finding an optimal solution while services remain unavailable.

So instead of trying to solve the problem completely, Pacemaker uses a
*best effort* algorithm for determining which node should host a particular
service. This means it arrives at a solution much faster than traditional
linear programming algorithms, but by doing so at the price of leaving some
services stopped.

In the contrived example at the start of this chapter:

* ``rsc-small`` would be allocated to ``node1``

* ``rsc-medium`` would be allocated to ``node2``

* ``rsc-large`` would remain inactive

Which is not ideal.

There are various approaches to dealing with the limitations of
pacemaker's placement strategy:

* **Ensure you have sufficient physical capacity.**

   It might sound obvious, but if the physical capacity of your nodes is (close to)
   maxed out by the cluster under normal conditions, then failover isn't going to
   go well. Even without the utilization feature, you'll start hitting timeouts and
   getting secondary failures.

* **Build some buffer into the capabilities advertised by the nodes.**

   Advertise slightly more resources than we physically have, on the (usually valid)
   assumption that a resource will not use 100% of the configured amount of
   CPU, memory and so forth *all* the time. This practice is sometimes called *overcommit*.

* **Specify resource priorities.**

   If the cluster is going to sacrifice services, it should be the ones you care
   about (comparatively) the least. Ensure that resource priorities are properly set
   so that your most important resources are scheduled first.