doc/website-v1/history-guide.adoc


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275

= Cluster history =
:source-highlighter: pygments

This guide should help administrators and consultants tackle
issues in Pacemaker cluster installations. We concentrate
on troubleshooting and analysis methods with the crmsh history.

Cluster leaves numerous traces behind, more than any other
system.  The logs and the rest are spread among all cluster nodes
and multiple directories. The obvious difficulty is to show that
information in a consolidated manner. This is where crmsh
history helps.

Hopefully, the guide will help you investigate your
specific issue with more efficiency and less effort.

== Sample cluster

In <<Listing 1>> a modestly complex sample cluster is shown with
which we can experiment and break in some hopefully instructive
ways.

NOTE: We won't be going into how to setup nodes or configure the
	  cluster. For that, please refer to the
	  link:/start-guide[Getting Started] document.

[source,crmsh]
[caption="Listing 1: "]
.Sample cluster configuration[[Listing 1]]
-----------------
include::include/history-guide/sample-cluster.conf.crm[]
-----------------

If you're new to clusters, that configuration may look
overwhelming. A graphical presentation in <<Image 1>> of the
essential elements and relations between them is easier on the eye
(and the mind).

[caption="Image 1: "]
.Sample cluster configuration as a graph[[Image 1]]
image::/img/history-guide/sample-cluster.conf.png[link="/img/history-guide/sample-cluster.conf.png"]

As homework, try to match the two cluster representations.

== Quick (& dirty) start

For the impatient, we give here a few examples of history use.

Most of the time you will be dealing with various resource
(a.k.a. application) induced phenomena. For instance, while
preparing this document we noticed that a probe failed repeatedly
on a node which wasn't even running the resource <<Listing 2>>.

[source,ansiclr]
[caption="Listing 2: "]
.crm status output[[Listing 2]]
-----------------
include::include/history-guide/status-probe-fail.typescript[]
-----------------

The history +resource+ command shows log messages relevant to the
supplied resource <<Listing 3>>.

[source,ansiclr]
[caption="Listing 3: "]
.Logs on failed +nfs-server+ probe operation[[Listing 3]]
-----------------
include::include/history-guide/nfs-probe-err.typescript[]
-----------------

<1> NFS server error message.
<2> Warning about a non-existing user id.

NOTE: Messages logged by resource agents are always tagged with
      'type(ID)' (in <<Listing 3>>: +nfsserver(nfs-server)+).
      +
	  Everything dumped to +stderr/stdout+ (in <<Listing 3>>:
	  +id: rpcuser: no such user+) is captured and subsequently
	  logged by +lrmd+. The +stdout+ output is at the 'info'
	  severity which is by default _not_ logged by pacemaker
	  since version 1.1.12.

At the very top we see error message reporting that the
NFS server is running, but some other stuff, apparently
unexpectedly, is not. However, we know that it cannot be
running on the 'c' node as it is already running on the 'a' node.
Not being able to figure out what is going on, we had to turn on
tracing of the resource agent. <<Listing 4>> shows how to do
that.

[source,ansiclr]
[caption="Listing 4: "]
.Set `nfs-server` probe operation resource tracing[[Listing 4]]
-----------------
include::include/history-guide/resource-trace.typescript[]
-----------------

Trace of the +nfsserver+ RA revealed that the +nfs-server+ init
script (used internally by the resource agent) _always_ exits
with success for status. That was actually due to the recent port
to systemd and erroneous interpretation of `systemctl status`
semantics: it always exits with success (due to some paradigm
shift, we guess). FYI, `systemctl is-active` should be used
instead and it does report a service status as expected.

As a bonus, a minor issue about a non-existing user id +rpcuser+
is also revealed.

NOTE: Messages in the crm history log output are colored
      depending on the originating host.

The rest of this document gives more details about crmsh history.
If you're more of a just-try-it-out person, enter +crm history+
and experiment. With +history+ commands you cannot really break
anything (fingers crossed).

== Introduction to crmsh `history`

The history crmsh feature, as the name suggests, deals with the
past. It was conceived as a facility to bring to the fore all
trails pacemaker cluster leaves behind which are relevant to a
particular resource, node, or event. It is used in the first
place as a troubleshooting tool, but it can also be helpful in
studying pacemaker clusters.

To begin, we run the `info` command which gives an overview, as
shown in <<Listing 5>>.

[source,ansiclr]
[caption="Listing 5: "]
.Basic history information[[Listing 5]]
-----------------
include::include/history-guide/info.typescript[]
-----------------

The `timeframe` command limits the observed period and helps
focus on the events of interest. Here we wanted to look at the
10 minute period. Two transitions were executed during this time.

== Transitions

Transitions are basic units capturing cluster movements
(resource operations and node events). A transition
consists of a set of actions to reach a desired cluster
status as specified in the cluster configuration by the
user.

Every configuration or status change results in a transition.

Every transition is also a CIB, which is how cluster
configuration and status are stored. Transitions are saved
to files, the so called PE (Policy Engine) inputs.

In  <<Listing 6>> we show how to display transitions.
The listing is annotated to explain the output in more detail.


[source,ansiclr]
[caption="Listing 6: "]
.Viewing transitions[[Listing 6]]
-----------------
include::include/history-guide/basic-transition.typescript[]
-----------------

<1> The transition command without arguments displays the latest
transition.
<2> Graph of transition actions is provided by `graphviz`. See
<<Image 2>>.
<3> Output of `crm_simulate` with irrelevant stuff edited out.
`crm_simulate` was formerly known as `ptest`.
<4> Transition summary followed by selection of log messages.
History weeds out messages which are of lesser importance. See
<<Listing 8>> if you want to see what history has been hiding
from you here.

Incidentally, if you wonder why all transitions in these examples
are green, that is not because they were green in any sense of
the color, but just due to that being color of node 'c': as
chance would have it, 'c' was calling shots at the time (being
Designated Coordinator or DC). That is also why all `crmd` and
`pengine` messages are coming from 'c'.

NOTE: Transitions are the basis of pacemaker operation, make sure
that you understand them.

What you cannot see in the listing is a graph generated and shown
in a separate window in your X11 display. <<Image 2>> may not be
very involved, but we reckon it's as good a start as starts go.

[caption="Image 2: "]
.Graph for transition 1907[[Image 2]]
image::/img/history-guide/smallapache-start.png[link="/img/history-guide/smallapache-start.png"]

It may sometimes be useful to see what changed between two
transitions. History `diff` command is in action in <<Listing 7>>.

[source,ansiclr]
[caption="Listing 7: "]
.Viewing transitions[[Listing 7]]
-----------------
include::include/history-guide/diff.typescript[]
-----------------

<1> Configuration diff between two last transitions. Transitions
may be referenced with indexes starting at 0 and going backwards.
<2> Status diff between two last transitions.

Whereas configuration diff is (hopefully) obvious, status diff
needs some explanation: the status section of the PE inputs
(transitions) always lags behind the configuration.  This is
because at the time the transition is saved to a file, the
actions of that transition are yet to be executed. So, the status
section of transition _N_ corresponds to the configuration _N-1_.

[source,ansiclr]
[caption="Listing 8: "]
.Full transition log[[Listing 8]]
-----------------
include::include/history-guide/transition-log.typescript[]
-----------------

== Resource and node events

Apart from transitions, events such as resource start or stop are
what we usually want to examine. In our extremely exciting
example of apache resource restart, the history `resource`
command picks the most interesting resource related messages as
shown in <<Listing 9>>. Again, history shows only the most
important log parts.

NOTE: If you want to see more detail (which may not always be
	  recommendable), then use the history `detail` command to
	  increase the level of detail displayed.

[source,ansiclr]
[caption="Listing 9: "]
.Resource related messages[[Listing 9]]
-----------------
include::include/history-guide/resource.typescript[]
-----------------

Node related events are node start and stop (cluster-wise),
node membership changes, and stonith events (aka node fence).
We'll refrain from showing examples of the history `node`
command--it is analogue to the `resource` command.

== Viewing logs

History `log` command, unsurprisingly, displays logs. The
messages from various nodes are weaved and shown in different
colors for the sake of easier viewing. Unlike other history
commands, `log` shows all messages captured in the report. If you
find some of them irrelevant they can be filtered out:
the `exclude` command takes extended regular expressions and it
is additive. We usually set the exclude expression to at least
`ssh|systemd|kernel`. Use `exclude clear` to remove all
expressions. And don't forget the `timeframe` command that
imposes a time window on the report.

== External reports, configurations, and graphs

The information source history works with is `hb_report`
generated report. Even when examining live cluster, `hb_report` is
run behind the scene to collect the data before presenting it to
the user. Well, at least to generate the first report: there is a
special procedure for log refreshing and collecting new PE
inputs, which runs much faster than creating a report from
scratch. However, juggling with multiple sources, appending logs,
moving time windows, may not always be foolproof, and if
the source gets borked you can always ask for a brand new report
with `refresh force`.

Analyzing reports from external source is no different from what
we've seen so far. In fact, there's a `source` command which
tells history where to look for data.