# Agent-based Checks <a id="agent-based-checks-addon"></a>

If the remote services are not directly accessible through the network, a
local agent installation that exposes the results to check queries can
come in handy.

Prior to installing and configuring an agent service, evaluate the possible
options based on these requirements:

* Security (authentication, TLS certificates, secure connection handling, etc.)
* Connection direction
    * Master/satellite can execute commands directly or
    * Agent sends back passive/external check results
* Availability on specific OS types and versions
    * Packages available
* Configuration and initial setup
* Updates and maintenance, compatibility

Available agent types:

* [Icinga Agent](07-agent-based-monitoring.md#agent-based-checks-icinga) on Linux/Unix and Windows
* [SSH](07-agent-based-monitoring.md#agent-based-checks-ssh) on Linux/Unix
* [SNMP](07-agent-based-monitoring.md#agent-based-checks-snmp) on Linux/Unix and hardware
* [SNMP Traps](07-agent-based-monitoring.md#agent-based-checks-snmp-traps) as passive check results
* [REST API](07-agent-based-monitoring.md#agent-based-checks-rest-api) for passive external check results
* [NSClient++](07-agent-based-monitoring.md#agent-based-checks-nsclient) and [WMI](07-agent-based-monitoring.md#agent-based-checks-wmi) on Windows


## Icinga Agent <a id="agent-based-checks-icinga"></a>

For the most common setups on Linux/Unix and Windows, we recommend
setting up the Icinga agent in a [distributed environment](06-distributed-monitoring.md#distributed-monitoring).

![Icinga 2 Distributed Master with Agents](images/distributed-monitoring/icinga2_distributed_monitoring_scenarios_master_with_agents.png)

Key benefits:

* Directly integrated into the distributed monitoring stack of Icinga
* Works on Linux/Unix and Windows
* Secure communication with TLS
* Connection can be established from both sides. Once connected, command execution and check results are exchanged.
    * Master/satellite connects to agent
    * Agent connects to parent satellite/master
* Same configuration language and binaries
* Troubleshooting docs and community best practices

Follow the setup and configuration instructions [here](06-distributed-monitoring.md#distributed-monitoring-setup-agent-satellite).

On Windows hosts, the Icinga agent can query a local NSClient++ service
for additional checks in case there are no plugins available.

![Icinga 2 Windows Setup](images/distributed-monitoring/icinga2_windows_setup_wizard_01.png)

## SSH <a id="agent-based-checks-ssh"></a>

> **Tip**
>
> This is the recommended way for systems where the Icinga agent is not available,
> be it due to specific hardware architectures, old systems, or a policy that forbids installing additional software.

This method uses the SSH service on the remote host to execute
an arbitrary plugin command line. The output and exit code are
returned and used by the core.

The `check_by_ssh` plugin takes care of this. It is available in the
[Monitoring Plugins](https://www.monitoring-plugins.org/) package.
For your convenience, the Icinga template library provides the [by_ssh](10-icinga-template-library.md#plugin-check-command-by-ssh)
CheckCommand already.

### SSH: Preparations <a id="agent-based-checks-ssh-preparations"></a>

Create an SSH key pair for the Icinga daemon user. In case the user has no shell, temporarily enable one.
When asked for a passphrase, **do not set one** and press Enter.

```bash
sudo su - icinga

ssh-keygen -b 4096 -t rsa -C "icinga@$(hostname) user for check_by_ssh" -f $HOME/.ssh/id_rsa
```

On the remote agent, create the icinga user and set a temporary password.

```bash
useradd -m icinga
passwd icinga
```

Copy the public key from the Icinga server to the remote agent, e.g. with `ssh-copy-id`
or manually into `/home/icinga/.ssh/authorized_keys`.
This will ask for the password once.

```bash
sudo su - icinga

ssh-copy-id -i $HOME/.ssh/id_rsa icinga@ssh-agent1.localdomain
```

After the SSH key is copied, test the connection **at least once** and
accept the host key verification. If you forget about this step, checks will
become UNKNOWN later.

```bash
ssh -i $HOME/.ssh/id_rsa icinga@ssh-agent1.localdomain
```

After the SSH key login works, disable the previously enabled logins.

* Lock the remote agent user's password with `passwd -l icinga`
* Remove the shell of the local icinga user again, if you enabled it temporarily

Also, ensure that the permissions are correct for the `.ssh` directory
as otherwise logins will fail.

* `.ssh` directory: 700
* `.ssh/id_rsa.pub` public key file: 644
* `.ssh/id_rsa` private key file: 600
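
The permissions listed above can be applied in one step. The following is a minimal sketch, assuming the key pair lives in the default `$HOME/.ssh` location used earlier. The function name `fix_ssh_permissions` is made up for this example.

```bash
#!/bin/bash
# Hypothetical helper: apply the permissions listed above to an .ssh directory.
# Usage: fix_ssh_permissions [DIRECTORY]   (defaults to $HOME/.ssh)
fix_ssh_permissions() {
    local ssh_dir="${1:-$HOME/.ssh}"

    chmod 700 "$ssh_dir" || return 1
    # Only touch key files that actually exist.
    [ -f "$ssh_dir/id_rsa" ] && chmod 600 "$ssh_dir/id_rsa"
    [ -f "$ssh_dir/id_rsa.pub" ] && chmod 644 "$ssh_dir/id_rsa.pub"
    return 0
}
```

Run it as the icinga user, for example `fix_ssh_permissions "$HOME/.ssh"`.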


### SSH: Configuration <a id="agent-based-checks-ssh-config"></a>

First, create a host object which has SSH configured and enabled.
Mark it, e.g. with the custom variable `agent_type`, so that service
apply rules can match on it later. Best practice is to
store that in a specific template, either in the static configuration
or inside the Director.

```
template Host "ssh-agent" {
  check_command = "hostalive"

  vars.agent_type = "ssh"
  vars.os_type = "linux"
}

object Host "ssh-agent1.localdomain" {
  import "ssh-agent"

  address = "192.168.56.115"
}
```

Example for monitoring the remote users:

```
apply Service "users" {
  check_command = "by_ssh"

  vars.by_ssh_command = [ "/usr/lib/nagios/plugins/check_users" ]

  // Follows the same principle as with command arguments, e.g. for ordering
  vars.by_ssh_arguments = {
    "-w" = {
      value = "$users_wgreater$" // Can reference an existing custom variable defined on the host or service, evaluated at runtime
    }
    "-c" = {
      value = "$users_cgreater$"
    }
  }

  vars.users_wgreater = 3
  vars.users_cgreater = 5

  assign where host.vars.os_type == "linux" && host.vars.agent_type == "ssh"
}
```
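
When a `by_ssh` service misbehaves, it can help to run the plugin by hand as the icinga user. The sketch below only prints a `check_by_ssh` command line that roughly corresponds to the apply rule above. It assumes `check_by_ssh` is installed in the same plugin directory as `check_users`; the arguments the ITL actually passes may differ in detail.

```bash
#!/bin/bash
# Assemble a check_by_ssh command line roughly matching the "users" service above.
# Assumption: check_by_ssh lives in the same plugin directory as check_users.
PLUGIN_DIR="/usr/lib/nagios/plugins"

build_by_ssh_command() {
    local address="$1" warn="$2" crit="$3"
    # -H: remote host, -C: command line executed on the remote host
    printf '%s/check_by_ssh -H %s -C "%s/check_users -w %s -c %s"\n' \
        "$PLUGIN_DIR" "$address" "$PLUGIN_DIR" "$warn" "$crit"
}

build_by_ssh_command "192.168.56.115" 3 5
```

Copy the printed line into a shell of the icinga user on the Icinga server. The plugin output and exit code are the same values the core would receive.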

A more advanced example with better arguments is shown in [this blogpost](https://www.netways.de/blog/2016/03/21/check_by_ssh-mit-icinga-2/).


## SNMP <a id="agent-based-checks-snmp"></a>

The SNMP daemon runs on the remote system and answers SNMP queries sent by plugin scripts.
The [Monitoring Plugins](https://www.monitoring-plugins.org/) package provides
the `check_snmp` plugin binary, but there are plenty of [existing plugins](05-service-monitoring.md#service-monitoring-plugins)
for specific use cases already around, for example monitoring Cisco routers.

The following example uses the [SNMP ITL](10-icinga-template-library.md#plugin-check-command-snmp)
CheckCommand and sets the `snmp_oid` custom variable. A service is created for all hosts which
have `agent_type` set to `snmp` and a non-empty `snmp_community` custom variable.

```
template Host "snmp-agent" {
  check_command = "hostalive"

  vars.agent_type = "snmp"

  vars.snmp_community = "public-icinga"
}

object Host "snmp-agent1.localdomain" {
  import "snmp-agent"
}
```

```
apply Service "uptime" {
  import "generic-service"

  check_command = "snmp"
  vars.snmp_oid = "1.3.6.1.2.1.1.3.0"
  vars.snmp_miblist = "DISMAN-EVENT-MIB"

  assign where host.vars.agent_type == "snmp" && host.vars.snmp_community != ""
}
```

If no `snmp_miblist` is specified, the plugin will default to `ALL`. As the number of available MIB files
on the system increases, so does the load generated by this plugin if no `MIB` is specified.
As such, it is recommended to always specify at least one `MIB`.
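
Before defining the service, the community and OID can be verified by hand. The following sketch prints a `check_snmp` call for the uptime OID with the MIB list restricted as recommended. The plugin path is an assumption; the host and community are taken from the example above.

```bash
#!/bin/bash
# Print a check_snmp call for the uptime OID used in the service above.
# Assumptions: plugin path, and that the host answers SNMP with this community.
CHECK_SNMP="/usr/lib/nagios/plugins/check_snmp"

print_snmp_uptime_check() {
    local host="$1" community="$2"
    local oid="1.3.6.1.2.1.1.3.0"      # same OID as vars.snmp_oid
    local miblist="DISMAN-EVENT-MIB"   # restrict MIB loading instead of ALL
    # -H: host, -C: community, -o: OID, -m: MIB list
    printf '%s -H %s -C %s -o %s -m %s\n' \
        "$CHECK_SNMP" "$host" "$community" "$oid" "$miblist"
}

print_snmp_uptime_check "snmp-agent1.localdomain" "public-icinga"
```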

Additional SNMP plugins are available using the [Manubulon SNMP Plugins](10-icinga-template-library.md#snmp-manubulon-plugin-check-commands).

For network monitoring, community members recommend
[nwc_health](05-service-monitoring.md#service-monitoring-network), for example.


## SNMP Traps and Passive Check Results <a id="agent-based-checks-snmp-traps"></a>

SNMP Traps can be received and filtered by using [SNMPTT](http://snmptt.sourceforge.net/)
and specific trap handlers passing the check results to Icinga 2.

Following the SNMPTT [Format](http://snmptt.sourceforge.net/docs/snmptt.shtml#SNMPTT.CONF-FORMAT)
documentation and the Icinga external command syntax found [here](24-appendix.md#external-commands-list-detail),
we can create generic services that can accommodate any number of hosts for a given scenario.

### Simple SNMP Traps <a id="simple-traps"></a>

A simple example might be monitoring host reboots indicated by an SNMP agent reset.
It is important to build the event so that it resets automatically after a
notification has been dispatched. Set up the manual check parameters so that the
event can also be reset from an initial unhandled state or after a missed reset event.

Add a directive in `snmptt.conf`:

```
EVENT coldStart .1.3.6.1.6.3.1.1.5.1 "Status Events" Normal
FORMAT Device reinitialized (coldStart)
EXEC echo "[$@] PROCESS_SERVICE_CHECK_RESULT;$A;Coldstart;2;The snmp agent has reinitialized." >> /var/run/icinga2/cmd/icinga2.cmd
SDESC
A coldStart trap signifies that the SNMPv2 entity, acting
in an agent role, is reinitializing itself and that its
configuration may have been altered.
EDESC
```

1. Define the `EVENT` as per your need.
2. Construct the `EXEC` statement with the service name matching your template
applied to your _n_ hosts. The host address inferred by SNMPTT will be the
correlating factor. You can have SNMPTT provide host names or IP addresses to
match your Icinga convention.
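
The `EXEC` line above writes one external command in the form `[timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;state;output`. To test the service without waiting for a real trap, the same line can be produced by hand. This is a minimal sketch; the function name `format_passive_result` and the host name are chosen for illustration.

```bash
#!/bin/bash
# Format a PROCESS_SERVICE_CHECK_RESULT external command line.
# Fields: [timestamp] COMMAND;host;service;state;plugin output
format_passive_result() {
    local timestamp="$1" host="$2" service="$3" state="$4" output="$5"
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$timestamp" "$host" "$service" "$state" "$output"
}

# Same shape as the EXEC line above, with the SNMPTT variables filled in by hand.
format_passive_result "$(date +%s)" "host.domain.com" "Coldstart" 2 \
    "The snmp agent has reinitialized."
```

Appending the printed line to the command pipe used in the `EXEC` statement submits the result, provided the host and service names match existing objects.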

> **Note**
>
> Replace the deprecated command pipe EXEC statement with a curl call
> to the REST API action [process-check-result](12-icinga2-api.md#icinga2-api-actions-process-check-result).

Add an `EventCommand` configuration object for the passive service auto reset event.

```
object EventCommand "coldstart-reset-event" {
  command = [ ConfigDir + "/conf.d/custom/scripts/coldstart_reset_event.sh" ]

  arguments = {
    "-i" = "$service.state_id$"
    "-n" = "$host.name$"
    "-s" = "$service.name$"
  }
}
```

Create the `coldstart_reset_event.sh` shell script to pass the expanded variable
data in. The `$service.state_id$` is important in order to prevent an endless loop
of event firing after the service has been reset.

```bash
#!/bin/bash

SERVICE_STATE_ID=""
HOST_NAME=""
SERVICE_NAME=""

show_help()
{
cat <<-EOF
	Usage: ${0##*/} [-h] -i SERVICE_STATE_ID -n HOST_NAME -s SERVICE_NAME
	Writes a coldstart reset event to the Icinga command pipe.

	  -h                  Display this help and exit.
	  -i SERVICE_STATE_ID The associated service state id.
	  -n HOST_NAME        The associated host name.
	  -s SERVICE_NAME     The associated service name.
EOF
}

while getopts "hi:n:s:" opt; do
    case "$opt" in
      h)
          show_help
          exit 0
          ;;
      i)
          SERVICE_STATE_ID=$OPTARG
          ;;
      n)
          HOST_NAME=$OPTARG
          ;;
      s)
          SERVICE_NAME=$OPTARG
          ;;
      '?')
          show_help >&2
          exit 1
          ;;
      esac
done

if [ -z "$SERVICE_STATE_ID" ]; then
    show_help
    printf "\n  Error: -i required.\n"
    exit 1
fi

if [ -z "$HOST_NAME" ]; then
    show_help
    printf "\n  Error: -n required.\n"
    exit 1
fi

if [ -z "$SERVICE_NAME" ]; then
    show_help
    printf "\n  Error: -s required.\n"
    exit 1
fi

# Only reset when the service is not OK; this prevents an endless event loop.
if [ "$SERVICE_STATE_ID" -gt 0 ]; then
    echo "[$(date +%s)] PROCESS_SERVICE_CHECK_RESULT;$HOST_NAME;$SERVICE_NAME;0;Auto-reset ($(date +"%m-%d-%Y %T"))." >> /var/run/icinga2/cmd/icinga2.cmd
fi
```

> **Note**
>
> Replace the deprecated command pipe write in the script above with a curl call
> to the REST API action [process-check-result](12-icinga2-api.md#icinga2-api-actions-process-check-result).

Finally, create the `Service` and assign it:

```
apply Service "Coldstart" {
  import "generic-service-custom"

  check_command         = "dummy"
  event_command         = "coldstart-reset-event"

  enable_notifications  = 1
  enable_active_checks  = 0
  enable_passive_checks = 1
  enable_flapping       = 0
  volatile              = 1
  enable_perfdata       = 0

  vars.dummy_state      = 0
  vars.dummy_text       = "Manual reset."

  vars.sla              = "24x7"

  assign where (host.vars.os == "Linux" || host.vars.os == "Windows")
}
```

### Complex SNMP Traps <a id="complex-traps"></a>

A more complex example might be passing dynamic data from a trap's varbind list
for a backup scenario where the backup software dispatches status updates. By
utilizing active and passive checks, the older freshness concept can be leveraged.

By defining the active check as a hard failed state, a missed backup can be reported.
As long as passive updates keep arriving within the check interval, the active check is bypassed.

Add a directive in `snmptt.conf`:

```
EVENT enterpriseSpecific <YOUR OID> "Status Events" Normal
FORMAT Enterprise specific trap
EXEC echo "[$@] PROCESS_SERVICE_CHECK_RESULT;$A;$1;$2;$3" >> /var/run/icinga2/cmd/icinga2.cmd
SDESC
An enterprise specific trap.
The varbinds in order denote the Icinga service name, state and text.
EDESC
```

1. Define the `EVENT` as per your need using your actual OID.
2. The service name, state and text are extracted from the first three varbinds.
This has the advantage of accommodating an unlimited set of use cases.

> **Note**
>
> Replace the deprecated command pipe EXEC statement with a curl call
> to the REST API action [process-check-result](12-icinga2-api.md#icinga2-api-actions-process-check-result).

Create a `Service` for the specific use case associated to the host. If the host
matches and the first varbind value is `Backup`, SNMPTT will submit the corresponding
passive update with the state and text from the second and third varbind:

```
object Service "Backup" {
  import "generic-service-custom"

  host_name             = "host.domain.com"
  check_command         = "dummy"

  enable_notifications  = 1
  enable_active_checks  = 1
  enable_passive_checks = 1
  enable_flapping       = 0
  volatile              = 1
  max_check_attempts    = 1
  check_interval        = 87000
  enable_perfdata       = 0

  vars.sla              = "24x7"
  vars.dummy_state      = 2
  vars.dummy_text       = "No passive check result received."
}
```
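
The `check_interval` of 87000 seconds is slightly longer than one day. One way to read it, which is an interpretation rather than something the configuration states, is a daily backup plus a ten-minute grace period before the active `dummy` check reports the missing update:

```bash
#!/bin/bash
# 24 hours plus a 10 minute grace period, in seconds.
# The split into "one day + grace" is an assumption about the intent.
day=$((24 * 60 * 60))
grace=$((10 * 60))
echo $((day + grace))   # prints 87000
```

For backups on a different schedule, adjust both values accordingly.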


## Agents sending Check Results via REST API <a id="agent-based-checks-rest-api"></a>

Whenever the remote host cannot run the Icinga agent, or a backup script
should just send its current state after finishing, you can use the [REST API](12-icinga2-api.md#icinga2-api)
as secure transport and send [passive external check results](08-advanced-topics.md#external-check-results).

Use the [process-check-result](12-icinga2-api.md#icinga2-api-actions-process-check-result) API action to send the external passive check result.
You can either use `curl` or implement the HTTP requests in your preferred programming
language. Examples for API clients are available in [this chapter](12-icinga2-api.md#icinga2-api-clients).

Feeding check results from remote hosts requires the host/service
objects configured on the master/satellite instance.
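
A `curl` based submission could look like the following sketch. The host and service names, the API URL and the credentials are placeholders and need to match your own setup. The payload builder performs no JSON escaping, so it assumes that names and output contain no double quotes or backslashes.

```bash
#!/bin/bash
# Build the JSON body for the process-check-result API action.
# Assumption: values contain no double quotes or backslashes (no escaping here).
build_check_result_payload() {
    local host="$1" service="$2" exit_status="$3" output="$4"
    printf '{ "type": "Service", "filter": "host.name==\\"%s\\" && service.name==\\"%s\\"", "exit_status": %s, "plugin_output": "%s" }\n' \
        "$host" "$service" "$exit_status" "$output"
}

# Send the result. API_URL and API_CREDENTIALS are placeholders for this example.
# -k skips certificate verification and is only acceptable for a first test.
send_check_result() {
    local api_url="${API_URL:-https://localhost:5665}"
    curl -k -s -S -u "${API_CREDENTIALS:-root:icinga}" \
        -H 'Accept: application/json' \
        -X POST "$api_url/v1/actions/process-check-result" \
        -d "$(build_check_result_payload "$@")"
}

# Show the payload without sending anything.
build_check_result_payload "host.domain.com" "Backup" 0 "Backup finished."
```

A backup script would call `send_check_result "host.domain.com" "Backup" 0 "Backup finished."` as its last step, using an API user that is restricted to the `actions/process-check-result` permission.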

## NSClient++ on Windows <a id="agent-based-checks-nsclient"></a>

[NSClient++](https://nsclient.org/) works on both Windows and Linux platforms and is well
known for its magnificent Windows support. There are alternatives like the WMI interface,
but using `NSClient++` will allow you to run local scripts similar to check plugins fetching
the required output and performance counters.

> **Tip**
>
> Best practice is to use the Icinga agent as secure execution
> bridge (`check_nt` and `check_nrpe` are considered insecure)
> and query the NSClient++ service [locally](06-distributed-monitoring.md#distributed-monitoring-windows-nscp).

You can use the `check_nt` plugin from the Monitoring Plugins project to query NSClient++.
Icinga 2 provides the [nscp check command](10-icinga-template-library.md#plugin-check-command-nscp) for this.

Example:

```
object Service "disk" {
  import "generic-service"

  host_name = "remote-windows-host"

  check_command = "nscp"

  vars.nscp_variable = "USEDDISKSPACE"
  vars.nscp_params = "c"
  vars.nscp_warn = 70
  vars.nscp_crit = 80
}
```

For details on the `NSClient++` configuration, please refer to the [official documentation](https://docs.nsclient.org/).

## WMI on Windows <a id="agent-based-checks-wmi"></a>

The most popular plugin is [check_wmi_plus](https://edcint.co.nz/checkwmiplus/).

> Check WMI Plus uses the Windows Management Interface (WMI) to check for common services (cpu, disk, services, eventlog…) on Windows machines. It requires the open source wmi client for Linux.

Community examples:

* [Icinga 2 check_wmi_plus example by 18pct](https://18pct.com/icinga2-check_wmi_plus-example/)
* [Agent-less monitoring with WMI](https://www.devlink.de/linux/icinga2-nagios-agentless-monitoring-von-windows/)