:man source:   hb_report
:man version:  1.2
:man manual:   Pacemaker documentation

hb_report(8)
============


NAME
----
hb_report - create report for CRM based clusters (Pacemaker)


SYNOPSIS
--------
*hb_report* -f {time|"cts:"testnum} [-t time] [-u user] [-l file]
       [-n nodes] [-E files] [-p patt] [-L patt] [-e prog]
       [-X ssh-options] [-MSDCZAQVsvhd] [dest]


DESCRIPTION
-----------
hb_report(8) is a utility that collects all information (logs,
configuration files, system information, etc.) relevant to
Pacemaker (CRM) over a given period of time.


OPTIONS
-------
dest::
	The report name. It may include the path of the directory in
	which to put the report tarball. If left out, the tarball is
	created in the current directory and named
	"hb_report-current_date", for instance hb_report-Wed-03-Mar-2010.

*-d*::
	Don't create the compressed tar, but leave the result in a
	directory.

*-f* { time | "cts:"testnum }::
	The start time from which to collect logs. The time may be in
	any format understood by the Date::Parse perl module. For CTS
	tests, specify the "cts:" string followed by the test number.
	This option is required.

*-t* time::
	The end time to which to collect logs. Defaults to now.

*-n* nodes::
	A list of space-separated hostnames (cluster members).
	hb_report may try to find out the set of nodes by itself, but
	if it runs on the loghost which, as is usually the case,
	does not belong to the cluster, that may be difficult. Also,
	OpenAIS doesn't contain a list of nodes and if Pacemaker is
	not running, there is no way to find it out automatically.
	This option is cumulative (i.e. use -n "a b" or -n a -n b).

*-l* file::
	Log file location. If, for whatever reason, hb_report cannot
	find the log file, you can specify its absolute path.

*-E* files::
	Extra log files to collect. This option is cumulative. By
	default, /var/log/messages is collected along with the
	cluster logs.

*-M*::
	Don't collect extra log files, but only the file containing
	messages from the cluster subsystems.

*-L* patt::
	A list of regular expressions to match in log files for
	analysis. This option is additive (default: "CRIT: ERROR:").

*-p* patt::
	Additional patterns to match the names of parameters which contain
	sensitive information. This option is additive (default: "passw.*").

*-Q*::
	Quick run. Gathering some system information can be expensive.
	With this option, such operations are skipped and information
	collection is thus sped up. The operations considered I/O or
	CPU intensive are: verifying installed packages content,
	sanitizing files for sensitive information, and producing dot
	files from PE inputs.

*-A*::
	This is an OpenAIS cluster. hb_report has some heuristics to
	find the cluster stack, but that is not always reliable.
	By default, hb_report assumes that it is run on a Heartbeat
	cluster.

*-u* user::
	The ssh user. hb_report will try to log in to other nodes
	without specifying a user, then as "root", and finally as
	"hacluster". If you have another user for administration over
	ssh, please use this option.

*-X* ssh-options::
	Extra ssh options. These will be added to every ssh
	invocation. Alternatively, use `$HOME/.ssh/config` to set up
	the desired ssh connection options.

*-S*::
	Single node operation. Run hb_report only on this node and
	don't try to start slave collectors on other members of the
	cluster. Under normal circumstances this option is not
	needed. Use it if ssh(1) to other nodes does not work.

*-Z*::
	If the destination directory exists, remove it instead of
	exiting (this is the default for CTS).

*-V*::
	Print the version including the last repository changeset.

*-v*::
	Increase verbosity. Normally used to debug unexpected
	behaviour.

*-h*::
	Show usage and some examples.

*-D* (obsolete)::
	Don't invoke editor to fill the description text file.

*-e* prog (obsolete)::
	Your favourite text editor. Defaults to $EDITOR, vim, vi,
	emacs, or nano, whichever is found first.

*-C* (obsolete)::
	Remove the destination directory once the report has been put
	in a tarball.

EXAMPLES
--------
Last night during the backup there were several warnings
encountered (logserver is the log host):

	logserver# hb_report -f 3:00 -t 4:00 -n "node1 node2" report

collects everything from all nodes from 3am to 4am last night.
The files are compressed to a tarball report.tar.bz2.

Just found a problem during testing:

	# note the current time
	node1# date
	Fri Sep 11 18:51:40 CEST 2009
	node1# /etc/init.d/heartbeat start
	node1# nasty-command-that-breaks-things
	node1# sleep 120 #wait for the cluster to settle
	node1# hb_report -f 18:51 hb1

	# if hb_report can't figure out that this is corosync
	node1# hb_report -f 18:51 -A hb1

	# if hb_report can't figure out the cluster members
	node1# hb_report -f 18:51 -n "node1 node2" hb1

The files are compressed to a tarball hb1.tar.bz2.

INTERPRETING RESULTS
--------------------
The compressed tar archive is the final product of hb_report.
This is one example of its content, for a CTS test case on a
three node OpenAIS cluster:

	$ ls -RF 001-Restart

	001-Restart:
	analysis.txt     events.txt  logd.cf       s390vm13/  s390vm16/
	description.txt  ha-log.txt  openais.conf  s390vm14/

	001-Restart/s390vm13:
	STOPPED  crm_verify.txt  hb_uuid.txt  openais.conf@   sysinfo.txt
	cib.txt  dlm_dump.txt    logd.cf@     pengine/        sysstats.txt
	cib.xml  events.txt      messages     permissions.txt

	001-Restart/s390vm13/pengine:
	pe-input-738.bz2  pe-input-740.bz2  pe-warn-450.bz2
	pe-input-739.bz2  pe-warn-449.bz2   pe-warn-451.bz2

	001-Restart/s390vm14:
	STOPPED  crm_verify.txt  hb_uuid.txt  openais.conf@   sysstats.txt
	cib.txt  dlm_dump.txt    logd.cf@     permissions.txt
	cib.xml  events.txt      messages     sysinfo.txt

	001-Restart/s390vm16:
	STOPPED  crm_verify.txt  hb_uuid.txt  messages        sysinfo.txt
	cib.txt  dlm_dump.txt    hostcache    openais.conf@   sysstats.txt
	cib.xml  events.txt      logd.cf@     permissions.txt

The top directory contains information which pertains to the
cluster or event as a whole. Files with exactly the same content
on all nodes will also be at the top, with per-node links created
(as is the case with openais.conf and logd.cf in this example).

The cluster log files are named ha-log.txt regardless of the
actual log file name on the system. If the log is found on the
loghost, then it is placed in the top directory. If not, the top
directory ha-log.txt contains the logs of all nodes merged and
sorted by time. Files named messages are excerpts of
/var/log/messages from the nodes.
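
As an illustration of what "merged and sorted by time" means, the
following sketch merges per-node logs by their syslog timestamps. It is
not what `hb_report` runs (it combines logs with its own perl utility);
the function name `merge_logs` is hypothetical, the `-M` month key of
sort(1) is assumed to be available (GNU and BSD sort provide it), and all
messages are assumed to fall within one year, because traditional syslog
timestamps carry no year.

```shell
# merge_logs FILE...
# Merge syslog-style logs ("Sep 11 18:51:40 host ...") into one stream
# ordered by month, day, and time of day.
# Hypothetical helper, not part of hb_report.
merge_logs() {
    # -s keeps the input order of lines that carry the same timestamp
    LC_ALL=C sort -s -k1,1M -k2,2n -k3,3 "$@"
}
```

For example, `merge_logs s390vm13/ha-log.txt s390vm14/ha-log.txt` would
print the lines of both files interleaved by time.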

Most files are copied verbatim or contain the output of a
command. For instance, cib.xml is a copy of the CIB found in
/var/lib/heartbeat/crm/cib.xml. crm_verify.txt is the output of the
crm_verify(8) program.

Some files are the result of more involved processing:

	*analysis.txt*::
	A set of log messages matching user defined patterns (may be
	provided with the -L option).

	*events.txt*::
	A set of log messages matching event patterns. It should
	provide information about major cluster motions without
	unnecessary details.  These patterns are devised by the
	cluster experts.  Currently, the patterns cover membership
	and quorum changes, resource starts and stops, fencing
	(stonith) actions, and cluster starts and stops. events.txt
	is always generated for each node. If the central cluster
	log was found, a combined events.txt for all nodes is
	generated as well.

	*permissions.txt*::
	File and directory permissions are among the more common
	causes of problems. hb_report looks for a set of predefined
	directories and checks their permissions. Any issues are
	reported here.

	*backtraces.txt*::
	gdb generated backtrace information for cores dumped
	within the specified period.

	*sysinfo.txt*::
	Various release information about the platform, kernel,
	operating system, packages, and anything else deemed to be
	relevant. The static part of the system.

	*sysstats.txt*::
	Output of various system commands such as ps(1), uptime(1),
	netstat(8), and ifconfig(8). The dynamic part of the system.

description.txt should contain a user supplied description of the
problem, but since it is very seldom used, it will be dropped
from future releases.

PREREQUISITES
-------------

ssh::
	It is not strictly required, but you won't regret having a
	password-less ssh. It is not too difficult to set up and will save
	you a lot of time. If you can't have it, for example because your
	security policy does not allow such a thing, or you just prefer
	menial work, then you will have to resort to semi-manual,
	semi-automated report generation. See below for instructions.
	+
	If you need to supply a password for your passphrase/login, then
	always use the `-u` option.
	+
	For extra ssh(1) options, if you're too lazy to set up
	$HOME/.ssh/config, use the `-X` option. Do not forget to put
	the options in quotes.

sudo::
	If the ssh user (as specified with the `-u` option) is other
	than `root`, then `hb_report` uses `sudo` to collect the
	information which is readable only by the `root` user. In that
	case the `sudoers` file must be set up properly. The
	user (or group to which the user belongs) should have the
	following line:
	+
	<user> ALL = NOPASSWD: /usr/sbin/hb_report
	+
	See the `sudoers(5)` man page for more details.

Times::
	In order to find files and messages in the given period and to
	parse the `-f` and `-t` options, `hb_report` uses perl and one of the
	`Date::Parse` or `Date::Manip` perl modules. Note that you need
	only one of these. Furthermore, on nodes which have no logs and
	where you don't run `hb_report` directly, no date parsing is
	necessary. In other words, if you run this on a loghost then you
	don't need these perl modules on the cluster nodes.
	+
	On rpm based distributions, you can find `Date::Parse` in
	`perl-TimeDate` and on Debian and its derivatives in
	`libtimedate-perl`.

Core dumps::
	To backtrace core dumps, gdb and the packages with the
	debugging info are needed. The debug info packages may be
	installed at the time the report is created. Let's hope that
	you will rarely need this.

TIMES
-----

Specifying times can at times be a nuisance. That is why we have
chosen to use one of the perl modules--they allow a certain
freedom when specifying dates. You can either read the instructions
at the
http://search.cpan.org/dist/TimeDate/lib/Date/Parse.pm#EXAMPLE_DATES[Date::Parse examples page]
or just rely on common sense and try stuff like:

	3:00          (today at 3am)
	15:00         (today at 3pm)
	2007/9/1 2pm  (September 1st at 2pm)
	Tue Sep 15 20:46:27 CEST 2009 (September 15th etc)

`hb_report` will (probably) complain if it can't figure out what
you mean.

Try to delimit the event as closely as possible in order to reduce
the size of the report, while still leaving a minute or two on
either side for good measure.

`-f` is not optional. And don't forget to quote dates when they
contain spaces.


Should I send all this to the rest of the Internet?
---------------------------------------------------

By default, the sensitive data in CIB and PE files is not mangled
by `hb_report` because that makes PE input files mostly useless.
If you still have no other option but to send the report to a
public mailing list and do not want the sensitive data to be
included, use the `-s` option. Without this option, `hb_report`
will issue a warning if it finds information which should not be
exposed. By default, parameters matching 'passw.*' are considered
sensitive.  Use the `-p` option to specify additional regular
expressions to match variable names which may contain information
you don't want to leak. For example:

	# hb_report -f 18:00 -p "user.*" -p "secret.*" /var/tmp/report

Heartbeat's ha.cf is always sanitized. Logs and other files are
not filtered.
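
To get a feeling for which parameters the patterns would catch, the
name matching can be approximated as below. This is a rough sketch, not
the check `hb_report` performs: the function name `sensitive_names` is
hypothetical, the CIB is assumed to hold one `<nvpair ... name="..."
value="..."/>` element per line, and each pattern is assumed to be
matched against the whole parameter name.

```shell
# sensitive_names CIB PATTERN...
# Print the nvpair names found in CIB that match any PATTERN
# (extended regular expressions, matched against the whole name).
# Hypothetical helper, not part of hb_report.
sensitive_names() {
    cib=$1
    shift
    ere=
    for patt in "$@"; do
        # join the patterns into one alternation: (p1)|(p2)|...
        ere="${ere:+$ere|}($patt)"
    done
    # pull out the name attribute of each nvpair, then filter the names
    sed -n 's/.*<nvpair[^>]* name="\([^"]*\)".*/\1/p' "$cib" |
        grep -E -x -- "$ere" | sort -u
}
```

Running `sensitive_names cib.xml 'passw.*' 'user.*'` lists the names
which the default pattern plus `-p "user.*"` would be expected to flag
under these assumptions.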

LOGS
----

It may be tricky to find syslog logs. The scheme used is to log a
unique message on all nodes and then look it up in the usual
syslog locations. This procedure is not foolproof, in particular
if the syslog files are in a non-standard directory. We look in
/var/log, /var/logs, /var/syslog, /var/adm, /var/log/ha, and
/var/log/cluster. If we can't find the logs, please supply
their location:

	# hb_report -f 5pm -l /var/log/cluster1/ha-log -S /tmp/report_node1
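
The marker scheme described above can be tried by hand when `hb_report`
fails to locate a log. The sketch below only illustrates the idea and is
not the code `hb_report` uses: the function name `find_marked_logs`, the
marker text, and the use of logger(1) to write the marker are all
assumptions.

```shell
# find_marked_logs MARKER DIR...
# Print the files directly under each DIR that contain MARKER.
# Hypothetical helper, not part of hb_report.
find_marked_logs() {
    marker=$1
    shift
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        # -l lists matching files, -F treats the marker as a fixed
        # string; finding nothing in a directory is not an error
        grep -l -F -- "$marker" "$dir"/* 2>/dev/null || true
    done
    return 0
}

# Possible use on a node (marker text is an assumption):
#   marker="hb_report-mark-$$"
#   logger "$marker"
#   sleep 1
#   find_marked_logs "$marker" /var/log /var/logs /var/syslog \
#       /var/adm /var/log/ha /var/log/cluster
```

Whatever file this prints is a candidate for the `-l` option.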

If you have different log locations on different nodes, well,
perhaps you'd like to make them the same and make life easier for
everybody.

Files starting with "ha-" are preferred. If syslog sends
messages to more than one file and one of them is named ha-log or
ha-debug, that file is favoured over syslog or messages.

hb_report also supports archived logs in case the period
specified extends that far into the past. The archives must reside
in the same directory as the current log and their names must
be prefixed with the name of the current log (for example,
syslog-1.gz or messages-20090105.bz2).
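
To check whether the archives on a node follow this naming rule, a
simple glob is enough. The helper below is a hypothetical sketch (the
name `list_archives` is not part of `hb_report`); it only lists
candidates and does not look inside them.

```shell
# list_archives LOG
# Print files in LOG's directory whose names start with LOG's name,
# e.g. messages-20090105.bz2 for /var/log/messages.
# Hypothetical helper, not part of hb_report.
list_archives() {
    log=$1
    # "?*" requires at least one extra character, which excludes LOG itself
    for f in "$log"?*; do
        # an unmatched glob stays literal, so check that the file exists
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```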

If there is no separate log for the cluster, possibly unrelated
messages from other programs are included. We don't filter logs,
but just pick a segment for the period you specified.
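
Picking a segment by time, without any filtering by program, amounts to
something like the following. It is a simplified sketch under stated
assumptions, not the method `hb_report` uses: the function name
`log_segment` is hypothetical, every line is assumed to start with a
traditional syslog timestamp, and the whole period is assumed to lie
within a single day.

```shell
# log_segment FROM TO [FILE...]
# Print syslog lines whose time of day (third field, HH:MM:SS) lies
# between FROM and TO, inclusive.
# Hypothetical helper, not part of hb_report.
log_segment() {
    from=$1
    to=$2
    shift 2
    # HH:MM:SS strings of equal width compare correctly as plain strings
    awk -v from="$from" -v to="$to" '$3 >= from && $3 <= to' "$@"
}
```

Messages from unrelated programs logged in the same window pass through
unchanged, which matches the behaviour described above.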

MANUAL REPORT COLLECTION
------------------------

So, your ssh doesn't work. In that case, you will have to run
this procedure on all nodes. Use `-S` so that `hb_report` doesn't
bother with ssh:

	# hb_report -f 5:20pm -t 5:30pm -S /tmp/report_node1

If you also have a log host which is not in the cluster, then
you'll have to copy the log to one of the nodes and tell us where
it is:

	# hb_report -f 5:20pm -t 5:30pm -l /var/tmp/ha-log -S /tmp/report_node1

OPERATION
---------
hb_report collects files and other information in a fairly
straightforward way. The most complex tasks are discovering the
log file locations (if syslog is used, which is the most common
case) and coordinating the operation on multiple nodes.

The instance of hb_report running on the host where it was
invoked is the master instance. Instances running on other nodes
are slave instances. The master instance communicates with slave
instances by ssh. There are multiple ssh invocations per run, so
it is essential that ssh works without a password, i.e. with
public key authentication and authorized_keys.

The operation consists of three phases. Each phase must finish
on all nodes before the next one can commence. The first phase
consists of logging unique messages through syslog on all nodes.
This is the shortest of all phases.

The second phase is the most involved. During this phase all
local information is collected, which includes:

- logs (both current and archived if the start time is far in the past)
- various configuration files (corosync, heartbeat, logd)
- the CIB (both as xml and as represented by the crm shell)
- pengine inputs (if this node was the DC at any point in
  time over the given period)
- system information and status
- package information and status
- dlm lock information
- backtraces (if there were core dumps)

The third phase is collecting information from all nodes and
analyzing it. The analysis consists of the following tasks:

- identify files equal on all nodes which may then be moved to
  the top directory
- save log messages matching user defined patterns
  (defaults to ERRORs and CRITical conditions)
- report if there were coredumps and by whom
- report crm_verify(8) results
- save log messages matching major events to events.txt
- in case logging is configured without loghost, node logs and
  events files are combined using a perl utility
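
The first of these tasks, moving files that are identical on all nodes
to the top directory and leaving per-node links behind, can be sketched
as follows. This is a minimal illustration and not the code `hb_report`
runs; the function name `hoist_if_identical` is hypothetical, and the
layout (one subdirectory per node under the report directory) is taken
from the listing in INTERPRETING RESULTS.

```shell
# hoist_if_identical TOP FILE NODE...
# If FILE has the same content in every TOP/NODE directory, move one
# copy to TOP and replace the per-node copies with symlinks to it.
# Returns non-zero and changes nothing if the copies differ.
# Hypothetical helper, not part of hb_report.
hoist_if_identical() {
    top=$1
    file=$2
    shift 2
    first=$1
    [ -f "$top/$first/$file" ] || return 1
    for node in "$@"; do
        # cmp -s is silent and fails on any difference or missing file
        cmp -s "$top/$first/$file" "$top/$node/$file" || return 1
    done
    mv "$top/$first/$file" "$top/$file" || return 1
    for node in "$@"; do
        rm -f "$top/$node/$file"
        # relative link; assumes FILE sits directly in the node directory
        ln -s "../$file" "$top/$node/$file" || return 1
    done
}
```

Called as `hoist_if_identical 001-Restart logd.cf s390vm13 s390vm14
s390vm16`, it would produce the `logd.cf@` links shown in the example
listing.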


BUGS
----
Finding logs may at times be extremely difficult, depending on
how weird the syslog configuration is. It would be nice to ask
syslog-ng developers to provide a way to find out the log
destination based on facility and priority.

If you think you found a bug, please rerun with the -v option and
attach the output to bugzilla.

hb_report can function in a satisfactory way only if ssh works to
all nodes using authorized_keys (without password).

There are way too many options.


AUTHOR
------
Written by Dejan Muhamedagic, <dejan@suse.de>


RESOURCES
---------
Pacemaker: <http://clusterlabs.org/>

Heartbeat and other Linux HA resources: <http://linux-ha.org/wiki>

OpenAIS: <http://www.openais.org/>

Corosync: <http://www.corosync.org/>


SEE ALSO
--------
Date::Parse(3)


COPYING
-------
Copyright \(C) 2007-2009 Dejan Muhamedagic. Free use of this
software is granted under the terms of the GNU General Public License (GPL).