Diffstat (limited to 'doc/hb_report.8.txt')
-rw-r--r-- doc/hb_report.8.txt 478
1 files changed, 478 insertions, 0 deletions
diff --git a/doc/hb_report.8.txt b/doc/hb_report.8.txt
new file mode 100644
index 0000000..5efbc32
--- /dev/null
+++ b/doc/hb_report.8.txt
@@ -0,0 +1,478 @@
+:man source: hb_report
+:man version: 1.2
+:man manual: Pacemaker documentation
+
+hb_report(8)
+============
+
+
+NAME
+----
+hb_report - create report for CRM based clusters (Pacemaker)
+
+
+SYNOPSIS
+--------
+*hb_report* -f {time|"cts:"testnum} [-t time] [-u user] [-l file]
+ [-n nodes] [-E files] [-p patt] [-L patt] [-e prog]
+ [-X ssh-options] [-MSDCZAQVsvhd] [dest]
+
+
+DESCRIPTION
+-----------
+hb_report(8) is a utility to collect all information (logs,
+configuration files, system information, etc.) relevant to
+Pacemaker (CRM) over the given period of time.
+
+
+OPTIONS
+-------
+dest::
+ The report name. It may also include a path where the report
+ tarball should be put. If left out, the tarball is created in
+ the current directory and named "hb_report-current_date", for
+ instance hb_report-Wed-03-Mar-2010.
+
+*-d*::
+ Don't create the compressed tar, but leave the result in a
+ directory.
+
+*-f* { time | "cts:"testnum }::
+ The start time from which to collect logs. The time may be in any
+ format understood by the Date::Parse perl module. For cts tests,
+ specify the "cts:" string followed by the test number. This
+ option is required.
+
+*-t* time::
+ The end time to which to collect logs. Defaults to now.
+
+*-n* nodes::
+ A list of space-separated hostnames (cluster members).
+ hb_report may try to find out the set of nodes by itself, but
+ if it runs on the loghost which, as is usually the case,
+ does not belong to the cluster, that may be difficult. Also,
+ OpenAIS doesn't contain a list of nodes and if Pacemaker is
+ not running, there is no way to find it out automatically.
+ This option is cumulative (i.e. use -n "a b" or -n a -n b).
+
+*-l* file::
+ Log file location. If, for whatever reason, hb_report cannot
+ find the log file, you can specify its absolute path.
+
+*-E* files::
+ Extra log files to collect. This option is cumulative. By
+ default, /var/log/messages is collected along with the
+ cluster logs.
+
+*-M*::
+ Don't collect extra log files, but only the file containing
+ messages from the cluster subsystems.
+
+*-L* patt::
+ A list of regular expressions to match in log files for
+ analysis. This option is additive (default: "CRIT: ERROR:").
+
+*-p* patt::
+ Additional patterns to match names of parameters which contain
+ sensitive information. This option is additive (default: "passw.*").
+
+*-Q*::
+ Quick run. Gathering some system information can be expensive.
+ With this option, such operations are skipped, which speeds up
+ information collecting. The following operations are considered
+ I/O or CPU intensive: verifying installed packages content,
+ sanitizing files for sensitive information, and producing dot
+ files from PE inputs.
+
+*-A*::
+ This is an OpenAIS cluster. hb_report has some heuristics to
+ find the cluster stack, but that is not always reliable.
+ By default, hb_report assumes that it is run on a Heartbeat
+ cluster.
+
+*-u* user::
+ The ssh user. hb_report will try to log in to other nodes
+ without specifying a user, then as "root", and finally as
+ "hacluster". If you have another user for administration over
+ ssh, please use this option.
+
+*-X* ssh-options::
+ Extra ssh options. These will be added to every ssh
+ invocation. Alternatively, use `$HOME/.ssh/config` to set up
+ desired ssh connection options.
+
+*-S*::
+ Single node operation. Run hb_report only on this node and
+ don't try to start slave collectors on other members of the
+ cluster. Under normal circumstances this option is not
+ needed. Use if ssh(1) does not work to other nodes.
+
+*-Z*::
+ If the destination directory exists, remove it instead of
+ exiting (this is the default for CTS).
+
+*-V*::
+ Print the version including the last repository changeset.
+
+*-v*::
+ Increase verbosity. Normally used to debug unexpected
+ behaviour.
+
+*-h*::
+ Show usage and some examples.
+
+*-D* (obsolete)::
+ Don't invoke editor to fill the description text file.
+
+*-e* prog (obsolete)::
+ Your favourite text editor. Defaults to $EDITOR, vim, vi,
+ emacs, or nano, whichever is found first.
+
+*-C* (obsolete)::
+ Remove the destination directory once the report has been put
+ in a tarball.
+
+EXAMPLES
+--------
+Last night during the backup there were several warnings
+encountered (logserver is the log host):
+
+ logserver# hb_report -f 3:00 -t 4:00 -n "node1 node2" report
+
+collects everything from all nodes from 3am to 4am last night.
+The files are compressed to a tarball report.tar.bz2.
+
+Just found a problem during testing:
+
+ # note the current time
+ node1# date
+ Fri Sep 11 18:51:40 CEST 2009
+ node1# /etc/init.d/heartbeat start
+ node1# nasty-command-that-breaks-things
+ node1# sleep 120 #wait for the cluster to settle
+ node1# hb_report -f 18:51 hb1
+
+ # if hb_report can't figure out that this is corosync
+ node1# hb_report -f 18:51 -A hb1
+
+ # if hb_report can't figure out the cluster members
+ node1# hb_report -f 18:51 -n "node1 node2" hb1
+
+The files are compressed to a tarball hb1.tar.bz2.
+
+INTERPRETING RESULTS
+--------------------
+The compressed tar archive is the final product of hb_report.
+This is one example of its content, for a CTS test case on a
+three node OpenAIS cluster:
+
+ $ ls -RF 001-Restart
+
+ 001-Restart:
+ analysis.txt events.txt logd.cf s390vm13/ s390vm16/
+ description.txt ha-log.txt openais.conf s390vm14/
+
+ 001-Restart/s390vm13:
+ STOPPED crm_verify.txt hb_uuid.txt openais.conf@ sysinfo.txt
+ cib.txt dlm_dump.txt logd.cf@ pengine/ sysstats.txt
+ cib.xml events.txt messages permissions.txt
+
+ 001-Restart/s390vm13/pengine:
+ pe-input-738.bz2 pe-input-740.bz2 pe-warn-450.bz2
+ pe-input-739.bz2 pe-warn-449.bz2 pe-warn-451.bz2
+
+ 001-Restart/s390vm14:
+ STOPPED crm_verify.txt hb_uuid.txt openais.conf@ sysstats.txt
+ cib.txt dlm_dump.txt logd.cf@ permissions.txt
+ cib.xml events.txt messages sysinfo.txt
+
+ 001-Restart/s390vm16:
+ STOPPED crm_verify.txt hb_uuid.txt messages sysinfo.txt
+ cib.txt dlm_dump.txt hostcache openais.conf@ sysstats.txt
+ cib.xml events.txt logd.cf@ permissions.txt
+
+The top directory contains information which pertains to the
+cluster or event as a whole. Files with exactly the same content
+on all nodes will also be at the top, with per-node links created
+(as is the case in this example with openais.conf and logd.cf).
+
+The cluster log files are named ha-log.txt regardless of the
+actual log file name on the system. If it is found on the
+loghost, then it is placed in the top directory. If not, the top
+directory ha-log.txt contains the logs of all nodes merged and sorted by
+time. Files named messages are excerpts of /var/log/messages from
+nodes.
+
+Most files are copied verbatim or contain the output of a
+command. For instance, cib.xml is a copy of the CIB found in
+/var/lib/heartbeat/crm/cib.xml. crm_verify.txt is the output of the
+crm_verify(8) program.
+
+Some files are the result of more involved processing:
+
+ *analysis.txt*::
+ A set of log messages matching user defined patterns (may be
+ provided with the -L option).
+
+ *events.txt*::
+ A set of log messages matching event patterns. It should
+ provide information about major cluster motions without
+ unnecessary details. These patterns are devised by the
+ cluster experts. Currently, the patterns cover membership
+ and quorum changes, resource starts and stops, fencing
+ (stonith) actions, and cluster starts and stops. events.txt
+ is always generated for each node. If the central cluster log
+ was found, a combined events.txt for all nodes is generated too.
+
+ *permissions.txt*::
+ File and directory permissions are one of the more common
+ causes of problems. hb_report looks for a set of predefined
+ directories and checks their permissions. Any issues are
+ reported here.
+
+ *backtraces.txt*::
+ gdb generated backtrace information for cores dumped
+ within the specified period.
+
+ *sysinfo.txt*::
+ Various release information about the platform, kernel,
+ operating system, packages, and anything else deemed to be
+ relevant. The static part of the system.
+
+ *sysstats.txt*::
+ Output of various system commands such as ps(1), uptime(1),
+ netstat(8), and ifconfig(8). The dynamic part of the system.
+
+description.txt should contain a user supplied description of the
+problem, but since it is very seldom used, it will be dropped
+from future releases.
+
+PREREQUISITES
+-------------
+
+ssh::
+ It is not strictly required, but you won't regret having a
+ password-less ssh. It is not too difficult to set up and will save
+ you a lot of time. If you can't have it, for example because your
+ security policy does not allow such a thing, or you just prefer
+ menial work, then you will have to resort to the semi-manual
+ semi-automated report generation. See below for instructions.
+ +
+ If you need to supply a password or passphrase to log in, then
+ always use the `-u` option.
+ +
+ For extra ssh(1) options, if you're too lazy to set up
+ $HOME/.ssh/config, use the `-X` option. Do not forget to put
+ the options in quotes.
+
+sudo::
+ If the ssh user (as specified with the `-u` option) is other
+ than `root`, then `hb_report` uses `sudo` to collect the
+ information which is readable only by the `root` user. In that
+ case it is required to set up the `sudoers` file properly. The
+ user (or group to which the user belongs) should have the
+ following line:
+ +
+ <user> ALL = NOPASSWD: /usr/sbin/hb_report
+ +
+ See the `sudoers(5)` man page for more details.
+
+Times::
+ In order to find files and messages in the given period and to
+ parse the `-f` and `-t` options, `hb_report` uses perl and one of the
+ `Date::Parse` or `Date::Manip` perl modules. Note that you need
+ only one of these. Furthermore, on nodes which have no logs and
+ where you don't run `hb_report` directly, no date parsing is
+ necessary. In other words, if you run this on a loghost then you
+ don't need these perl modules on the cluster nodes.
+ +
+ On rpm based distributions, you can find `Date::Parse` in
+ `perl-TimeDate` and on Debian and its derivatives in
+ `libtimedate-perl`.
+
+Core dumps::
+ To backtrace core dumps, gdb and the packages with the
+ debugging info are needed. The debug info packages may be
+ installed at the time the report is created. Let's hope that
+ you will seldom need this.
+
+TIMES
+-----
+
+Specifying times can at times be a nuisance. That is why we have
+chosen to use one of the perl modules--they do allow certain
+freedom when talking about dates. You can either read the instructions
+at the
+http://search.cpan.org/dist/TimeDate/lib/Date/Parse.pm#EXAMPLE_DATES[Date::Parse
+examples page]
+or just rely on common sense and try stuff like:
+
+ 3:00 (today at 3am)
+ 15:00 (today at 3pm)
+ 2007/9/1 2pm (September 1st at 2pm)
+ Tue Sep 15 20:46:27 CEST 2009 (September 15th etc)
+
+`hb_report` will (probably) complain if it can't figure out what
+you mean.
+
+Try to delimit the event as closely as possible in order to reduce
+the size of the report, but still leave a minute or two around
+for good measure.
+
+`-f` is not optional. And don't forget to quote dates when they
+contain spaces.
+
+
+Should I send all this to the rest of the Internet?
+---------------------------------------------------
+
+By default, the sensitive data in CIB and PE files is not mangled
+by `hb_report` because that makes PE input files mostly useless.
+If you still have no other option but to send the report to a
+public mailing list and do not want the sensitive data to be
+included, use the `-s` option. Without this option, `hb_report`
+will issue a warning if it finds information which should not be
+exposed. By default, parameters matching 'passw.*' are considered
+sensitive. Use the `-p` option to specify additional regular
+expressions to match variable names which may contain information
+you don't want to leak. For example:
+
+ # hb_report -f 18:00 -p "user.*" -p "secret.*" /var/tmp/report
+
+Heartbeat's ha.cf is always sanitized. Logs and other files are
+not filtered.
+
+LOGS
+----
+
+It may be tricky to find syslog logs. The scheme used is to log a
+unique message on all nodes and then look it up in the usual
+syslog locations. This procedure is not foolproof, in particular
+if the syslog files are in a non-standard directory. We look in
+/var/log /var/logs /var/syslog /var/adm /var/log/ha
+/var/log/cluster. In case we can't find the logs, please supply
+their location:
+
+ # hb_report -f 5pm -l /var/log/cluster1/ha-log -S /tmp/report_node1
+
+If you have different log locations on different nodes, well,
+perhaps you'd like to make them the same and make life easier for
+everybody.
+
+Files starting with "ha-" are preferred. In case syslog sends
+messages to more than one file, if one of them is named ha-log or
+ha-debug those will be favoured over syslog or messages.
+
+hb_report also supports archived logs in case the period
+specified extends that far into the past. The archives must reside
+in the same directory as the current log and their names must
+be prefixed with the name of the current log (syslog-1.gz or
+messages-20090105.bz2).
+
+If there is no separate log for the cluster, possibly unrelated
+messages from other programs are included. We don't filter logs,
+but just pick a segment for the period you specified.
+
+MANUAL REPORT COLLECTION
+------------------------
+
+So, your ssh doesn't work. In that case, you will have to run
+this procedure on all nodes. Use `-S` so that `hb_report` doesn't
+bother with ssh:
+
+ # hb_report -f 5:20pm -t 5:30pm -S /tmp/report_node1
+
+If you also have a log host which is not in the cluster, then
+you'll have to copy the log to one of the nodes and tell us where
+it is:
+
+ # hb_report -f 5:20pm -t 5:30pm -l /var/tmp/ha-log -S /tmp/report_node1
+
+OPERATION
+---------
+hb_report collects files and other information in a fairly
+straightforward way. The most complex tasks are discovering the
+log file locations (if syslog is used, which is the most common
+case) and coordinating the operation on multiple nodes.
+
+The instance of hb_report running on the host where it was
+invoked is the master instance. Instances running on other nodes
+are slave instances. The master instance communicates with slave
+instances by ssh. There are multiple ssh invocations per run, so
+it is essential that ssh works without a password, i.e. with
+public key authentication and authorized_keys.
+
+The operation consists of three phases. Each phase must finish
+on all nodes before the next one can commence. The first phase
+consists of logging unique messages through syslog on all nodes.
+This is the shortest of all phases.
+
+The second phase is the most involved. During this phase all
+local information is collected, which includes:
+
+- logs (both current and archived if the start time is far in the past)
+- various configuration files (corosync, heartbeat, logd)
+- the CIB (both as xml and as represented by the crm shell)
+- pengine inputs (if this node was the DC at any point in
+ time over the given period)
+- system information and status
+- package information and status
+- dlm lock information
+- backtraces (if there were core dumps)
+
+The third phase is collecting information from all nodes and
+analyzing it. The analysis consists of the following tasks:
+
+- identify files equal on all nodes which may then be moved to
+ the top directory
+- save log messages matching user defined patterns
+ (defaults to ERRORs and CRITical conditions)
+- report if there were coredumps and by whom
+- report crm_verify(8) results
+- save log messages matching major events to events.txt
+- in case logging is configured without loghost, node logs and
+ events files are combined using a perl utility
+
+
+BUGS
+----
+Finding logs may at times be extremely difficult, depending on
+how weird the syslog configuration is. It would be nice to ask
+syslog-ng developers to provide a way to find out the log
+destination based on facility and priority.
+
+If you think you found a bug, please rerun with the -v option and
+attach the output to bugzilla.
+
+hb_report can function in a satisfactory way only if ssh works to
+all nodes using authorized_keys (without a password).
+
+There are way too many options.
+
+
+AUTHOR
+------
+Written by Dejan Muhamedagic, <dejan@suse.de>
+
+
+RESOURCES
+---------
+Pacemaker: <http://clusterlabs.org/>
+
+Heartbeat and other Linux HA resources: <http://linux-ha.org/wiki>
+
+OpenAIS: <http://www.openais.org/>
+
+Corosync: <http://www.corosync.org/>
+
+
+SEE ALSO
+--------
+Date::Parse(3)
+
+
+COPYING
+-------
+Copyright \(C) 2007-2009 Dejan Muhamedagic. Free use of this
+software is granted under the terms of the GNU General Public License (GPL).
+
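
A note on the LOGS section of the man page added above: the log discovery scheme it describes (log a unique message, look it up in the usual syslog locations, prefer files starting with "ha-") could be sketched in shell roughly as follows. This is only an illustration under stated assumptions, not hb_report's actual code. The function names `find_marked_logs` and `pick_cluster_log` are hypothetical, only the top level of each directory is searched, and the marker is assumed to have been written beforehand, for instance with logger(1).

```shell
#!/bin/sh
# Sketch only: NOT hb_report's implementation. Function names are hypothetical.

# Directories which, according to the man page, are searched for syslog files.
LOG_DIRS="/var/log /var/logs /var/syslog /var/adm /var/log/ha /var/log/cluster"

# find_marked_logs MARKER [DIR...]
# Print every regular file found directly in the given directories
# (default: $LOG_DIRS) that contains MARKER as a fixed string.
find_marked_logs() {
    marker=$1
    shift
    # Fall back to the default directory list when none is given.
    [ $# -gt 0 ] || set -- $LOG_DIRS
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        for f in "$dir"/*; do
            [ -f "$f" ] || continue
            if grep -qF -- "$marker" "$f" 2>/dev/null; then
                printf '%s\n' "$f"
            fi
        done
    done
}

# pick_cluster_log
# Read candidate paths on stdin and print one of them: the first whose
# file name starts with "ha-" if there is one, else the first candidate.
pick_cluster_log() {
    first=
    while IFS= read -r path; do
        case ${path##*/} in
            ha-*) printf '%s\n' "$path"; return 0 ;;
        esac
        [ -n "$first" ] || first=$path
    done
    [ -n "$first" ] && printf '%s\n' "$first"
}

# Possible usage on one node (assumes logger(1) and a running syslog daemon):
#   marker="probe-$(hostname)-$$"
#   logger "$marker"; sleep 1
#   find_marked_logs "$marker" | pick_cluster_log
```

If nothing is printed, the log is presumably kept in a non-standard directory, which is the case the `-l` option is meant for.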