Diffstat (limited to 'doc/hb_report.8.txt')
-rw-r--r-- | doc/hb_report.8.txt | 478
1 file changed, 478 insertions, 0 deletions

diff --git a/doc/hb_report.8.txt b/doc/hb_report.8.txt
new file mode 100644
index 0000000..5efbc32
--- /dev/null
+++ b/doc/hb_report.8.txt
@@ -0,0 +1,478 @@
:man source: hb_report
:man version: 1.2
:man manual: Pacemaker documentation

hb_report(8)
============


NAME
----
hb_report - create report for CRM based clusters (Pacemaker)


SYNOPSIS
--------
*hb_report* -f {time|"cts:"testnum} [-t time] [-u user] [-l file]
    [-n nodes] [-E files] [-p patt] [-L patt] [-e prog]
    [-MSDCZAQVsvhd] [dest]


DESCRIPTION
-----------
hb_report is a utility to collect all information (logs,
configuration files, system information, etc) relevant to
Pacemaker (CRM) over a given period of time.


OPTIONS
-------
dest::
    The report name. It can also contain a path specifying where to
    put the report tarball. If left out, the tarball is created in
    the current directory and named "hb_report-current_date", for
    instance hb_report-Wed-03-Mar-2010.

*-d*::
    Don't create the compressed tar, but leave the result in a
    directory.

*-f* { time | "cts:"testnum }::
    The start time from which to collect logs. The time is in the
    format used by the Date::Parse perl module. For CTS tests,
    specify the "cts:" string followed by the test number. This
    option is required.

*-t* time::
    The end time up to which to collect logs. Defaults to now.

*-n* nodes::
    A list of space separated hostnames (cluster members).
    hb_report may try to find out the set of nodes by itself, but
    if it runs on the loghost which, as is usually the case, does
    not belong to the cluster, that may be difficult. Also,
    OpenAIS doesn't keep a list of nodes, and if Pacemaker is not
    running there is no way to find it out automatically.
    This option is cumulative (i.e. use -n "a b" or -n a -n b).

*-l* file::
    Log file location. If, for whatever reason, hb_report cannot
    find the log files, you can specify their absolute path.

*-E* files::
    Extra log files to collect. This option is cumulative. By
    default, /var/log/messages is collected along with the
    cluster logs.

*-M*::
    Don't collect extra log files, but only the file containing
    messages from the cluster subsystems.

*-L* patt::
    A list of regular expressions to match in log files for
    analysis. This option is additive (default: "CRIT: ERROR:").

*-p* patt::
    Additional patterns to match parameter names which contain
    sensitive information. This option is additive (default: "passw.*").

*-Q*::
    Quick run. Gathering some system information can be expensive.
    With this option, such operations are skipped and information
    collection is sped up. The operations considered I/O or CPU
    intensive are: verifying the contents of installed packages,
    sanitizing files for sensitive information, and producing dot
    files from PE inputs.

*-A*::
    This is an OpenAIS cluster. hb_report has some heuristics to
    find the cluster stack, but they are not always reliable.
    By default, hb_report assumes that it is run on a Heartbeat
    cluster.

*-u* user::
    The ssh user. hb_report will try to login to other nodes
    without specifying a user, then as "root", and finally as
    "hacluster". If you use another user for administration over
    ssh, please use this option.

*-X* ssh-options::
    Extra ssh options. These will be added to every ssh
    invocation. Alternatively, use `$HOME/.ssh/config` to set up
    the desired ssh connection options (see the sketch following
    this option list).

*-S*::
    Single node operation. Run hb_report only on this node and
    don't try to start slave collectors on other members of the
    cluster. Under normal circumstances this option is not
    needed. Use it if ssh(1) does not work to the other nodes.

*-Z*::
    If the destination directory exists, remove it instead of
    exiting (this is the default for CTS).

*-V*::
    Print the version including the last repository changeset.

*-v*::
    Increase verbosity. Normally used to debug unexpected
    behaviour.

*-h*::
    Show usage and some examples.

*-D* (obsolete)::
    Don't invoke an editor to fill the description text file.

*-e* prog (obsolete)::
    Your favourite text editor. Defaults to $EDITOR, vim, vi,
    emacs, or nano, whichever is found first.

*-C* (obsolete)::
    Remove the destination directory once the report has been put
    in a tarball.
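
As mentioned under `-u` and `-X`, ssh behaviour can also be configured
once in `$HOME/.ssh/config` instead of being passed on every run. The
following is only a minimal sketch; the host names, user, and key path
are illustrative placeholders, not anything hb_report requires:

    Host node1 node2
        User hacluster
        IdentityFile ~/.ssh/id_rsa
        StrictHostKeyChecking no

With such an entry in place, plain hb_report invocations should be
able to reach the other nodes without extra `-u` or `-X` arguments.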

EXAMPLES
--------
Last night during the backup several warnings were encountered
(logserver is the log host):

    logserver# hb_report -f 3:00 -t 4:00 -n "node1 node2" report

collects everything from all nodes from 3am to 4am last night.
The files are compressed into a tarball, report.tar.bz2.

Just found a problem during testing:

    # note the current time
    node1# date
    Fri Sep 11 18:51:40 CEST 2009
    node1# /etc/init.d/heartbeat start
    node1# nasty-command-that-breaks-things
    node1# sleep 120 # wait for the cluster to settle
    node1# hb_report -f 18:51 hb1

    # if hb_report can't figure out that this is corosync
    node1# hb_report -f 18:51 -A hb1

    # if hb_report can't figure out the cluster members
    node1# hb_report -f 18:51 -n "node1 node2" hb1

The files are compressed into a tarball, hb1.tar.bz2.
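
To get a first impression of what was collected, it is usually enough
to unpack the tarball and skim the summary files described in the next
section. This is only a sketch, assuming the report above was named
hb1 and that a merged ha-log.txt ended up in the top directory:

    $ tar xjf hb1.tar.bz2
    $ less hb1/analysis.txt hb1/events.txt
    $ grep -iE "error|crit" hb1/ha-log.txt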

INTERPRETING RESULTS
--------------------
The compressed tar archive is the final product of hb_report.
This is one example of its content, for a CTS test case on a
three node OpenAIS cluster:

    $ ls -RF 001-Restart

    001-Restart:
    analysis.txt     events.txt  logd.cf       s390vm13/  s390vm16/
    description.txt  ha-log.txt  openais.conf  s390vm14/

    001-Restart/s390vm13:
    STOPPED  crm_verify.txt  hb_uuid.txt  openais.conf@    sysinfo.txt
    cib.txt  dlm_dump.txt    logd.cf@     pengine/         sysstats.txt
    cib.xml  events.txt      messages     permissions.txt

    001-Restart/s390vm13/pengine:
    pe-input-738.bz2  pe-input-740.bz2  pe-warn-450.bz2
    pe-input-739.bz2  pe-warn-449.bz2   pe-warn-451.bz2

    001-Restart/s390vm14:
    STOPPED  crm_verify.txt  hb_uuid.txt  openais.conf@    sysstats.txt
    cib.txt  dlm_dump.txt    logd.cf@     permissions.txt
    cib.xml  events.txt      messages     sysinfo.txt

    001-Restart/s390vm16:
    STOPPED  crm_verify.txt  hb_uuid.txt  messages         sysinfo.txt
    cib.txt  dlm_dump.txt    hostcache    openais.conf@    sysstats.txt
    cib.xml  events.txt      logd.cf@     permissions.txt

The top directory contains information which pertains to the
cluster or event as a whole. Files with exactly the same content
on all nodes are also placed at the top, with per-node links
created (as is the case in this example with openais.conf and
logd.cf).

The cluster log files are named ha-log.txt regardless of the
actual log file name on the system. If a log is found on the
loghost, it is placed in the top directory. If not, the top
directory ha-log.txt contains the logs of all nodes, merged and
sorted by time. Files named messages are excerpts of
/var/log/messages from the nodes.

Most files are copied verbatim or contain the output of a
command. For instance, cib.xml is a copy of the CIB found in
/var/lib/heartbeat/crm/cib.xml, and crm_verify.txt is the output
of the crm_verify(8) program.

Some files are the result of more involved processing:

*analysis.txt*::
    A set of log messages matching user defined patterns (which
    may be provided with the -L option).

*events.txt*::
    A set of log messages matching event patterns. It should
    provide information about major cluster motions without
    unnecessary detail. These patterns are devised by the
    cluster experts. Currently, the patterns cover membership
    and quorum changes, resource starts and stops, fencing
    (stonith) actions, and cluster starts and stops. events.txt
    is always generated for each node; if a central cluster log
    was found, a combined version for all nodes is produced as
    well.

*permissions.txt*::
    One of the more common causes of problems is wrong file and
    directory permissions. hb_report checks the permissions of a
    set of predefined directories. Any issues are reported here.

*backtraces.txt*::
    gdb generated backtrace information for cores dumped within
    the specified period.

*sysinfo.txt*::
    Various release information about the platform, kernel,
    operating system, packages, and anything else deemed
    relevant: the static part of the system.

*sysstats.txt*::
    Output of various system commands such as ps(1), uptime(1),
    netstat(8), and ifconfig(8): the dynamic part of the system.

description.txt should contain a user supplied description of the
problem, but since it is very seldom used, it will be dropped
from future releases.

PREREQUISITES
-------------

ssh::
    It is not strictly required, but you won't regret setting up
    password-less ssh. It is not too difficult to set up and will
    save you a lot of time. If you can't have it, for example
    because your security policy does not allow such a thing, or
    you just prefer menial work, then you will have to resort to
    the semi-manual semi-automated report generation. See below
    for instructions.

    If you need to supply a password for your passphrase/login,
    then always use the `-u` option.

    For extra ssh(1) options, if you're too lazy to set up
    $HOME/.ssh/config, use the `-X` option. Do not forget to put
    the options in quotes.

sudo::
    If the ssh user (as specified with the `-u` option) is other
    than `root`, then `hb_report` uses `sudo` to collect the
    information which is readable only by the `root` user. In
    that case it is required to set up the `sudoers` file
    properly. The user (or the group to which the user belongs)
    should have the following line:

    <user> ALL = NOPASSWD: /usr/sbin/hb_report

    See the `sudoers(5)` man page for more details.

Times::
    In order to find files and messages in the given period and
    to parse the `-f` and `-t` options, `hb_report` uses perl and
    one of the `Date::Parse` or `Date::Manip` perl modules. Note
    that you need only one of these. Furthermore, on nodes which
    have no logs and where you don't run `hb_report` directly, no
    date parsing is necessary. In other words, if you run this on
    a loghost then you don't need these perl modules on the
    cluster nodes. A quick way to check that the module is
    available is sketched right after this list.

    On rpm based distributions, you can find `Date::Parse` in
    `perl-TimeDate`, and on Debian and its derivatives in
    `libtimedate-perl`.

Core dumps::
    To backtrace core dumps, gdb is needed along with the
    packages carrying the debugging info. The debug info packages
    may be installed at the time the report is created. Let's
    hope that you will need this really seldom.
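
To verify up front that date parsing will work (see the Times
prerequisite above), `Date::Parse` can be exercised directly from the
shell. This is only an illustrative sketch, not something hb_report
itself runs; the date is taken from the TIMES section below:

    # prints an error and exits non-zero if the module is missing
    $ perl -MDate::Parse -e 1
    # prints the epoch seconds for a date in a supported format
    $ perl -MDate::Parse -e 'print str2time("2007/9/1 2pm"), "\n"'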

TIMES
-----

Specifying times can at times be a nuisance. That is why we have
chosen to use one of the perl modules: they allow a certain
freedom when writing dates. You can either read the instructions
at the
http://search.cpan.org/dist/TimeDate/lib/Date/Parse.pm#EXAMPLE_DATES[Date::Parse
examples page] or just rely on common sense and try stuff like:

    3:00          (today at 3am)
    15:00         (today at 3pm)
    2007/9/1 2pm  (September 1st 2007 at 2pm)
    Tue Sep 15 20:46:27 CEST 2009 (September 15th etc)

`hb_report` will (probably) complain if it can't figure out what
you mean.

Try to delimit the event as closely as possible in order to
reduce the size of the report, but still leave a minute or two
around it for good measure.

`-f` is not optional. And don't forget to quote dates when they
contain spaces.


Should I send all this to the rest of the Internet?
---------------------------------------------------

By default, the sensitive data in the CIB and PE files is not
mangled by `hb_report` because that would make the PE input files
mostly useless. If you still have no other option but to send the
report to a public mailing list and do not want the sensitive
data to be included, use the `-s` option. Without this option,
`hb_report` will issue a warning if it finds information which
should not be exposed. By default, parameters matching 'passw.*'
are considered sensitive. Use the `-p` option to specify
additional regular expressions to match variable names which may
contain information you don't want to leak. For example:

    # hb_report -f 18:00 -p "user.*" -p "secret.*" /var/tmp/report

Heartbeat's ha.cf is always sanitized. Logs and other files are
not filtered.

LOGS
----

It may be tricky to find syslog logs. The scheme used is to log a
unique message on all nodes and then look it up in the usual
syslog locations. This procedure is not foolproof, in particular
if the syslog files are in a non-standard directory. We look in
/var/log /var/logs /var/syslog /var/adm /var/log/ha
/var/log/cluster. In case we can't find the logs, please supply
their location:

    # hb_report -f 5pm -l /var/log/cluster1/ha-log -S /tmp/report_node1

If you have different log locations on different nodes, well,
perhaps you'd like to make them the same and make life easier for
everybody.

Files starting with "ha-" are preferred. In case syslog sends
messages to more than one file, if one of them is named ha-log or
ha-debug those will be favoured over syslog or messages.

hb_report also supports archived logs in case the specified
period extends that far into the past. The archives must reside
in the same directory as the current log, and their names must be
prefixed with the name of the current log (syslog-1.gz or
messages-20090105.bz2).

If there is no separate log for the cluster, possibly unrelated
messages from other programs are included. We don't filter logs,
but just pick a segment for the period you specified.

MANUAL REPORT COLLECTION
------------------------

So, your ssh doesn't work. In that case, you will have to run
this procedure on all nodes. Use `-S` so that `hb_report` doesn't
bother with ssh:

    # hb_report -f 5:20pm -t 5:30pm -S /tmp/report_node1

If you also have a log host which is not in the cluster, then
you'll have to copy the log to one of the nodes and tell us where
it is:

    # hb_report -f 5:20pm -t 5:30pm -l /var/tmp/ha-log -S /tmp/report_node1
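
When collecting reports manually, the per-node tarballs still have to
be gathered in one place before they can be examined or attached to a
bug report. A sketch, assuming two nodes, node1 as the gathering
point, and destination names chosen arbitrarily (the tarball name
follows the given destination):

    node1# hb_report -f 5:20pm -t 5:30pm -S /tmp/report_node1
    node2# hb_report -f 5:20pm -t 5:30pm -S /tmp/report_node2
    node2# scp /tmp/report_node2.tar.bz2 node1:/tmp/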

OPERATION
---------
hb_report collects files and other information in a fairly
straightforward way. The most complex tasks are discovering the
log file locations (if syslog is used, which is the most common
case) and coordinating the operation on multiple nodes.

The instance of hb_report running on the host where it was
invoked is the master instance. Instances running on the other
nodes are slave instances. The master instance communicates with
the slave instances over ssh. There are multiple ssh invocations
per run, so it is essential that ssh works without a password,
i.e. with public key authentication and authorized_keys.

The operation consists of three phases. Each phase must finish
on all nodes before the next one can commence. The first phase
consists of logging unique messages through syslog on all nodes.
This is the shortest of all phases.

The second phase is the most involved. During this phase all
local information is collected, which includes:

- logs (both current and archived if the start time is far in the past)
- various configuration files (corosync, heartbeat, logd)
- the CIB (both as xml and as represented by the crm shell)
- pengine inputs (if this node was the DC at any point in
  time over the given period)
- system information and status
- package information and status
- dlm lock information
- backtraces (if there were core dumps)

The third phase is collecting information from all nodes and
analyzing it. The analysis consists of the following tasks:

- identify files which are equal on all nodes and may then be
  moved to the top directory
- save log messages matching user defined patterns
  (defaults to ERRORs and CRITical conditions)
- report whether there were core dumps and by whom
- report crm_verify(8) results
- save log messages matching major events to events.txt
- in case logging is configured without a loghost, combine the
  node logs and events files using a perl utility


BUGS
----
Finding logs may at times be extremely difficult, depending on
how weird the syslog configuration is. It would be nice to ask
the syslog-ng developers to provide a way to find out the log
destination based on facility and priority.

If you think you found a bug, please rerun with the -v option and
attach the output to bugzilla.

hb_report can function in a satisfactory way only if ssh works to
all nodes using authorized_keys (without a password).

There are way too many options.


AUTHOR
------
Written by Dejan Muhamedagic, <dejan@suse.de>


RESOURCES
---------
Pacemaker: <http://clusterlabs.org/>

Heartbeat and other Linux HA resources: <http://linux-ha.org/wiki>

OpenAIS: <http://www.openais.org/>

Corosync: <http://www.corosync.org/>


SEE ALSO
--------
Date::Parse(3)


COPYING
-------
Copyright \(C) 2007-2009 Dejan Muhamedagic. Free use of this
software is granted under the terms of the GNU General Public
License (GPL).