= Resource testing =

Never created a pacemaker cluster configuration before? Please
read on.

Ever created a pacemaker configuration without errors? All
resources worked from the get go on all your nodes? Really? We
want a photo of you!

Seriously, getting a cluster resource definition right is so
error prone that I think I have only ever managed it with
`Dummy` resources. There are many intricate details that have to
be just right, and all of them are stuffed into a single place as
simple name-value attributes. Then there are multiple nodes, each
with a complex system environment that is inevitably in flux
(entropy, anybody?).

Now, once you have defined your set of resources and are about
to _commit_ the configuration (at that point it usually takes a
deep breath to do so), be ready to meet an avalanche of error
messages, not all of which are easy to understand or follow. Not
to mention that you need to read the logs too. Even though we do
have a link:history-tutorial.html[tool] to help with digging through
the logs, it is going to be an interesting experience and not one
we recommend if you are just starting with pacemaker clusters.
Even the experts can save a lot of time and headaches by
following the advice below.

== Basic usage ==

Enter resource testing. It is a special feature designed to help
users find problems in resource configurations.

The usage is very simple:

----
crm(live)configure# rsctest web-server 
Probing resources ..
testing on xen-f: apache web-ip
testing on xen-g: apache web-ip
crm(live)configure# 
----

What actually happened above and what is it good for? From the
output we can infer that the `web-server` resource is actually a
group comprising one apache web server and one IP address.
Indeed:

----
crm(live)configure# show web-server 
group web-server apache web-ip \
        meta target-role="Stopped"
crm(live)configure# 
----

The `rsctest` command first establishes that the resources are
stopped on all nodes in the cluster. It then tests the resources
on every node, in the order defined by the resource group. It does
this by manually starting the resources, one by one, then running
a "monitor" operation for each resource to make sure that it is
healthy, and finally stopping the resources in reverse order.

Since there is no additional output, the test passed. It looks
like we have a properly defined web server group.
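
If you are curious, each of these steps corresponds roughly to
invoking the OCF resource agent by hand with the appropriate
action. Here is a minimal sketch for the `apache` primitive,
assuming the default OCF root; note that `rsctest` also passes
the configured resource parameters as `OCF_RESKEY_*` environment
variables, which are omitted here:

----
# sketch only: resource parameters (OCF_RESKEY_*) are omitted
export OCF_ROOT=/usr/lib/ocf
RA=/usr/lib/ocf/resource.d/heartbeat/apache
$RA start;   echo "start:   $?"
$RA monitor; echo "monitor: $?"
$RA stop;    echo "stop:    $?"
----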

== Reporting problems ==

Now, the above run was not very interesting so let's spoil the
idyll:

----
xen-f:~ # mv /etc/apache2/httpd.conf /tmp
----

We moved the apache configuration file away on node `xen-f`.  The
`apache` resource should fail now:

----
crm(live)configure# rsctest web-server 
Probing resources ..
testing on xen-f: apache
host xen-f (exit code 5)
xen-f stderr:
2013/10/17_16:51:26 ERROR: Configuration file /etc/apache2/httpd.conf not found!
2013/10/17_16:51:26 ERROR: environment is invalid, resource considered stopped

testing on xen-g: apache web-ip
crm(live)configure# 
----

As expected, `apache` failed to start on node `xen-f`. When the
cluster resource manager runs an operation on a resource, all
messages are logged (there is no terminal attached to the
cluster, anyway). All one can see in the resource status is the
exit code and its generic interpretation; in this case, it is an
installation problem.

For instance, the `crm status` output could look like this:

----
xen-f:~ # crm status
Last updated: Thu Oct 17 19:21:44 2013
Last change: Thu Oct 17 19:21:28 2013 by root via crm_resource on xen-f
...
Failed actions:
    apache_start_0 on xen-f 'not installed' (5): call=2074, status=complete,
last-rc-change='Thu Oct 17 19:21:31 2013', queued=164ms, exec=0ms
----

That does not look very informative. With `rsctest` we can
immediately see what the problem is. It saves us prowling the
logs looking for messages from the `apache` resource agent.

Note that the IP address is not tested, because the resource it
depends on could not be started.
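
Once the configuration file we moved away is restored, the test
should pass again, with both resources tested on both nodes and
no error output:

----
xen-f:~ # mv /tmp/httpd.conf /etc/apache2/
xen-f:~ # crm configure rsctest web-server
----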

== What is tested? ==

The start, monitor, and stop operations, in exactly that order,
are tested for every resource specified. Note that normally the
latter two operations should never fail if the resource agent is
well implemented. The RA should under normal circumstances be
able to stop or monitor a started resource. However, this is
_not_ a replacement for resource agent testing. If that is what
you are looking for, see
http://www.linux-ha.org/doc/dev-guides/_testing_resource_agents.html[the
RA testing chapter] of the RA development guide.
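
For a quick smoke test of an agent itself, the `ocf-tester`
utility (shipped with the resource-agents package) exercises all
of its actions. A minimal sketch, assuming the default OCF root
and the apache agent's `configfile` parameter:

----
# assumes /usr/lib/ocf and the apache "configfile" parameter
ocf-tester -n apache-test \
    -o configfile=/etc/apache2/httpd.conf \
    /usr/lib/ocf/resource.d/heartbeat/apache
----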

== Protecting resources ==

The `rsctest` command goes to great lengths to prevent starting a
resource on more than one node at the same time. For some
resources that would actually mean data corruption, and we
certainly don't want that to happen.

----
xen-f:~ # (echo start web-server; echo show web-server) | crm -w resource
resource web-server is running on: xen-g 
xen-f:~ # crm configure rsctest web-server
Probing resources .WARNING: apache:probe: resource running at xen-g
.WARNING: web-ip:probe: resource running at xen-g

Stop all resources before testing!
xen-f:~ # crm configure rsctest web-server xen-f
Probing resources .WARNING: apache:probe: resource running at xen-g
.WARNING: web-ip:probe: resource running at xen-g

Stop all resources before testing!
xen-f:~ # 
----

As you can see, if `rsctest` finds any of the resources running
on any node, it refuses to run any tests.
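
So, stop the resources first and wait for them to settle, for
example:

----
xen-f:~ # crm -w resource stop web-server
xen-f:~ # crm configure rsctest web-server
----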

== Multi-state and clone resources ==

Apart from groups, `rsctest` can also handle the other two
special kinds of resources: clones and multi-state (master-slave)
resources. Let's take a look at a `drbd`-based configuration:

----
crm(live)configure# show ms_drbd_nfs drbd0-vg 
primitive drbd0-vg ocf:heartbeat:LVM \
        params volgrpname="drbd0-vg"
primitive p_drbd_nfs ocf:linbit:drbd \
        meta target-role="Stopped" \
        params drbd_resource="nfs" \
        op monitor interval="15" role="Master" \
        op monitor interval="30" role="Slave" \
        op start interval="0" timeout="300" \
        op stop interval="0" timeout="120"
ms ms_drbd_nfs p_drbd_nfs \
        meta notify="true" clone-max="2"
crm(live)configure# 
----

The `nfs` drbd resource contains a volume group `drbd0-vg`.

----
crm(live)configure# rsctest ms_drbd_nfs drbd0-vg 
Probing resources ..
testing on xen-f: p_drbd_nfs drbd0-vg
testing on xen-g: p_drbd_nfs drbd0-vg
crm(live)configure# 
----

For multi-state (master-slave) resources, the resource motions
involved are somewhat more complex: the resource is first started
on both nodes and then promoted to master on the node where the
next resource is to be tested (in this case the volume group).
Afterwards it is demoted back to slave there and promoted to
master on the other node, so that the dependent resources can be
tested on that node too.

Note that even though we asked for `ms_drbd_nfs` to be tested,
the output shows `p_drbd_nfs`, which is the primitive
encapsulated in the master-slave resource. You can specify either
one, and optionally limit the test to particular nodes, as shown
below.
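
For example, either of the following forms can be used, the
second one referring to the primitive directly and limiting the
test to node `xen-f`:

----
crm(live)configure# rsctest ms_drbd_nfs drbd0-vg
crm(live)configure# rsctest p_drbd_nfs drbd0-vg xen-f
----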

== Stonith resources ==

Stonith resources are also special and get separate treatment:
only the device status is tested, because actually fencing nodes
was deemed too drastic. Please use `node fence` to test a fencing
device's effectiveness, as shown below. It also does not matter
whether the stonith resource is "running" on any node: being
started is just something that happens virtually in the
`stonithd` process.
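
Be aware that, unlike `rsctest`, this really does fence the node:

----
xen-f:~ # crm node fence xen-g
----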

== Summary ==

- use `rsctest` to make sure that the resources can be started
  correctly on all nodes

- `rsctest` protects resources by making sure beforehand that
  none of them is currently running on any of the cluster nodes

- `rsctest` understands groups, master-slave (multi-state), and
  clone resources, but nothing else of the configuration
  (constraints or any other placement/order cluster configuration
  elements)

- it is up to the user to test resources only on nodes which are
  really supposed to run them, and in the proper order (if that
  order is expressed via constraints)

- `rsctest` cannot protect resources that are running on nodes
  which are not part of the cluster, nor can it guard against bad
  RA implementations (but neither could a cluster resource manager)

- `rsctest` was designed as a debugging and configuration aid, and is
  not intended to provide full Resource Agent test coverage.

== `crmsh` help and online resources (_sic!_) ==

- link:crm.8.html#topics_Testing[`crm help Testing`]

- link:crm.8.html#cmdhelp_configure_rsctest[`crm configure help
rsctest`]