diff options
Diffstat (limited to 'doc/07-agent-based-monitoring.md')
-rw-r--r-- | doc/07-agent-based-monitoring.md | 484 |
1 files changed, 484 insertions, 0 deletions
diff --git a/doc/07-agent-based-monitoring.md b/doc/07-agent-based-monitoring.md new file mode 100644 index 0000000..51e41ac --- /dev/null +++ b/doc/07-agent-based-monitoring.md @@ -0,0 +1,484 @@ +# Agent-based Checks <a id="agent-based-checks-addon"></a> + +If the remote services are not directly accessible through the network, a +local agent installation exposing the results to check queries can +become handy. + +Prior to installing and configuration an agent service, evaluate possible +options based on these requirements: + +* Security (authentication, TLS certificates, secure connection handling, etc.) +* Connection direction + * Master/satellite can execute commands directly or + * Agent sends back passive/external check results +* Availability on specific OS types and versions + * Packages available +* Configuration and initial setup +* Updates and maintenance, compatibility + +Available agent types: + +* [Icinga Agent](07-agent-based-monitoring.md#agent-based-checks-icinga) on Linux/Unix and Windows +* [SSH](07-agent-based-monitoring.md#agent-based-checks-ssh) on Linux/Unix +* [SNMP](07-agent-based-monitoring.md#agent-based-checks-snmp) on Linux/Unix and hardware +* [SNMP Traps](07-agent-based-monitoring.md#agent-based-checks-snmp-traps) as passive check results +* [REST API](07-agent-based-monitoring.md#agent-based-checks-rest-api) for passive external check results +* [NSClient++](07-agent-based-monitoring.md#agent-based-checks-nsclient) and [WMI](07-agent-based-monitoring.md#agent-based-checks-wmi) on Windows + + +## Icinga Agent <a id="agent-based-checks-icinga"></a> + +For the most common setups on Linux/Unix and Windows, we recommend +to setup the Icinga agent in a [distributed environment](06-distributed-monitoring.md#distributed-monitoring). + +![Icinga 2 Distributed Master with Agents](images/distributed-monitoring/icinga2_distributed_monitoring_scenarios_master_with_agents.png) + +Key benefits: + +* Directly integrated into the distributed monitoring stack of Icinga +* Works on Linux/Unix and Windows +* Secure communication with TLS +* Connection can be established from both sides. Once connected, command execution and check results are exchanged. + * Master/satellite connects to agent + * Agent connects to parent satellite/master +* Same configuration language and binaries +* Troubleshooting docs and community best practices + +Follow the setup and configuration instructions [here](06-distributed-monitoring.md#distributed-monitoring-setup-agent-satellite). + +On Windows hosts, the Icinga agent can query a local NSClient++ service +for additional checks in case there are no plugins available. + +![Icinga 2 Windows Setup](images/distributed-monitoring/icinga2_windows_setup_wizard_01.png) + +## SSH <a id="agent-based-checks-ssh"></a> + +> **Tip** +> +> This is the recommended way for systems where the Icinga agent is not available +> Be it specific hardware architectures, old systems or forbidden to install an additional software. + +This method uses the SSH service on the remote host to execute +an arbitrary plugin command line. The output and exit code is +returned and used by the core. + +The `check_by_ssh` plugin takes care of this. It is available in the +[Monitoring Plugins](https://www.monitoring-plugins.org/) package. +For your convenience, the Icinga template library provides the [by_ssh](10-icinga-template-library.md#plugin-check-command-by-ssh) +CheckCommand already. + +### SSH: Preparations <a id="agent-based-checks-ssh-preparations"></a> + +SSH key pair for the Icinga daemon user. In case the user has no shell, temporarily enable this. +When asked for a passphrase, **do not set it** and press enter. + +```bash +sudo su - icinga + +ssh-keygen -b 4096 -t rsa -C "icinga@$(hostname) user for check_by_ssh" -f $HOME/.ssh/id_rsa +``` + +On the remote agent, create the icinga user and generate a temporary password. + +```bash +useradd -m icinga +passwd icinga +``` + +Copy the public key from the Icinga server to the remote agent, e.g. with `ssh-copy-id` +or manually into `/home/icinga/.ssh/authorized_keys`. +This will ask for the password once. + +```bash +sudo su - icinga + +ssh-copy-id -i $HOME/.ssh/id_rsa icinga@ssh-agent1.localdomain +``` + +After the SSH key is copied, test at the connection **at least once** and +accept the host key verification. If you forget about this step, checks will +become UNKNOWN later. + +```bash +ssh -i $HOME/.ssh/id_rsa icinga@ssh-agent1.localdomain +``` + +After the SSH key login works, disable the previously enabled logins. + +* Remote agent user's password with `passwd -l icinga` +* Local icinga user terminal + +Also, ensure that the permissions are correct for the `.ssh` directory +as otherwise logins will fail. + +* `.ssh` directory: 700 +* `.ssh/id_rsa.pub` public key file: 644 +* `.ssh/id_rsa` private key file: 600 + + +### SSH: Configuration <a id="agent-based-checks-ssh-config"></a> + +First, create a host object which has SSH configured and enabled. +Mark this e.g. with the custom variable `agent_type` to later +use this for service apply rule matches. Best practice is to +store that in a specific template, either in the static configuration +or inside the Director. + +``` +template Host "ssh-agent" { + check_command = "hostalive" + + vars.agent_type = "ssh" + vars.os_type = "linux" +} + +object Host "ssh-agent1.localdomain" { + import "ssh-agent" + + address = "192.168.56.115" +} +``` + +Example for monitoring the remote users: + +``` +apply Service "users" { + check_command = "by_ssh" + + vars.by_ssh_command = [ "/usr/lib/nagios/plugins/check_users" ] + + // Follows the same principle as with command arguments, e.g. for ordering + vars.by_ssh_arguments = { + "-w" = { + value = "$users_wgreater$" // Can reference an existing custom variable defined on the host or service, evaluated at runtime + } + "-c" = { + value = "$users_cgreater$" + } + } + + vars.users_wgreater = 3 + vars.users_cgreater = 5 + + assign where host.vars.os_type == "linux" && host.vars.agent_type == "ssh" +} +``` + +A more advanced example with better arguments is shown in [this blogpost](https://www.netways.de/blog/2016/03/21/check_by_ssh-mit-icinga-2/). + + +## SNMP <a id="agent-based-checks-snmp"></a> + +The SNMP daemon runs on the remote system and answers SNMP queries by plugin scripts. +The [Monitoring Plugins](https://www.monitoring-plugins.org/) package provides +the `check_snmp` plugin binary, but there are plenty of [existing plugins](05-service-monitoring.md#service-monitoring-plugins) +for specific use cases already around, for example monitoring Cisco routers. + +The following example uses the [SNMP ITL](10-icinga-template-library.md#plugin-check-command-snmp) +CheckCommand and sets the `snmp_oid` custom variable. A service is created for all hosts which +have the `snmp-community` custom variable. + +``` +template Host "snmp-agent" { + check_command = "hostalive" + + vars.agent_type = "snmp" + + vars.snmp_community = "public-icinga" +} + +object Host "snmp-agent1.localdomain" { + import "snmp-agent" +} +``` + +``` +apply Service "uptime" { + import "generic-service" + + check_command = "snmp" + vars.snmp_oid = "1.3.6.1.2.1.1.3.0" + vars.snmp_miblist = "DISMAN-EVENT-MIB" + + assign where host.vars.agent_type == "snmp" && host.vars.snmp_community != "" +} +``` + +If no `snmp_miblist` is specified, the plugin will default to `ALL`. As the number of available MIB files +on the system increases so will the load generated by this plugin if no `MIB` is specified. +As such, it is recommended to always specify at least one `MIB`. + +Additional SNMP plugins are available using the [Manubulon SNMP Plugins](10-icinga-template-library.md#snmp-manubulon-plugin-check-commands). + +For network monitoring, community members advise to use [nwc_health](05-service-monitoring.md#service-monitoring-network) +for example. + + +## SNMP Traps and Passive Check Results <a id="agent-based-checks-snmp-traps"></a> + +SNMP Traps can be received and filtered by using [SNMPTT](http://snmptt.sourceforge.net/) +and specific trap handlers passing the check results to Icinga 2. + +Following the SNMPTT [Format](http://snmptt.sourceforge.net/docs/snmptt.shtml#SNMPTT.CONF-FORMAT) +documentation and the Icinga external command syntax found [here](24-appendix.md#external-commands-list-detail) +we can create generic services that can accommodate any number of hosts for a given scenario. + +### Simple SNMP Traps <a id="simple-traps"></a> + +A simple example might be monitoring host reboots indicated by an SNMP agent reset. +Building the event to auto reset after dispatching a notification is important. +Setup the manual check parameters to reset the event from an initial unhandled +state or from a missed reset event. + +Add a directive in `snmptt.conf` + +``` +EVENT coldStart .1.3.6.1.6.3.1.1.5.1 "Status Events" Normal +FORMAT Device reinitialized (coldStart) +EXEC echo "[$@] PROCESS_SERVICE_CHECK_RESULT;$A;Coldstart;2;The snmp agent has reinitialized." >> /var/run/icinga2/cmd/icinga2.cmd +SDESC +A coldStart trap signifies that the SNMPv2 entity, acting +in an agent role, is reinitializing itself and that its +configuration may have been altered. +EDESC +``` + +1. Define the `EVENT` as per your need. +2. Construct the `EXEC` statement with the service name matching your template +applied to your _n_ hosts. The host address inferred by SNMPTT will be the +correlating factor. You can have snmptt provide host names or ip addresses to +match your Icinga convention. + +> **Note** +> +> Replace the deprecated command pipe EXEC statement with a curl call +> to the REST API action [process-check-result](12-icinga2-api.md#icinga2-api-actions-process-check-result). + +Add an `EventCommand` configuration object for the passive service auto reset event. + +``` +object EventCommand "coldstart-reset-event" { + command = [ ConfigDir + "/conf.d/custom/scripts/coldstart_reset_event.sh" ] + + arguments = { + "-i" = "$service.state_id$" + "-n" = "$host.name$" + "-s" = "$service.name$" + } +} +``` + +Create the `coldstart_reset_event.sh` shell script to pass the expanded variable +data in. The `$service.state_id$` is important in order to prevent an endless loop +of event firing after the service has been reset. + +```bash +#!/bin/bash + +SERVICE_STATE_ID="" +HOST_NAME="" +SERVICE_NAME="" + +show_help() +{ +cat <<-EOF + Usage: ${0##*/} [-h] -n HOST_NAME -s SERVICE_NAME + Writes a coldstart reset event to the Icinga command pipe. + + -h Display this help and exit. + -i SERVICE_STATE_ID The associated service state id. + -n HOST_NAME The associated host name. + -s SERVICE_NAME The associated service name. +EOF +} + +while getopts "hi:n:s:" opt; do + case "$opt" in + h) + show_help + exit 0 + ;; + i) + SERVICE_STATE_ID=$OPTARG + ;; + n) + HOST_NAME=$OPTARG + ;; + s) + SERVICE_NAME=$OPTARG + ;; + '?') + show_help + exit 0 + ;; + esac +done + +if [ -z "$SERVICE_STATE_ID" ]; then + show_help + printf "\n Error: -i required.\n" + exit 1 +fi + +if [ -z "$HOST_NAME" ]; then + show_help + printf "\n Error: -n required.\n" + exit 1 +fi + +if [ -z "$SERVICE_NAME" ]; then + show_help + printf "\n Error: -s required.\n" + exit 1 +fi + +if [ "$SERVICE_STATE_ID" -gt 0 ]; then + echo "[`date +%s`] PROCESS_SERVICE_CHECK_RESULT;$HOST_NAME;$SERVICE_NAME;0;Auto-reset (`date +"%m-%d-%Y %T"`)." >> /var/run/icinga2/cmd/icinga2.cmd +fi +``` + +> **Note** +> +> Replace the deprecated command pipe EXEC statement with a curl call +> to the REST API action [process-check-result](12-icinga2-api.md#icinga2-api-actions-process-check-result). + +Finally create the `Service` and assign it: + +``` +apply Service "Coldstart" { + import "generic-service-custom" + + check_command = "dummy" + event_command = "coldstart-reset-event" + + enable_notifications = 1 + enable_active_checks = 0 + enable_passive_checks = 1 + enable_flapping = 0 + volatile = 1 + enable_perfdata = 0 + + vars.dummy_state = 0 + vars.dummy_text = "Manual reset." + + vars.sla = "24x7" + + assign where (host.vars.os == "Linux" || host.vars.os == "Windows") +} +``` + +### Complex SNMP Traps <a id="complex-traps"></a> + +A more complex example might be passing dynamic data from a traps varbind list +for a backup scenario where the backup software dispatches status updates. By +utilizing active and passive checks, the older freshness concept can be leveraged. + +By defining the active check as a hard failed state, a missed backup can be reported. +As long as the most recent passive update has occurred, the active check is bypassed. + +Add a directive in `snmptt.conf` + +``` +EVENT enterpriseSpecific <YOUR OID> "Status Events" Normal +FORMAT Enterprise specific trap +EXEC echo "[$@] PROCESS_SERVICE_CHECK_RESULT;$A;$1;$2;$3" >> /var/run/icinga2/cmd/icinga2.cmd +SDESC +An enterprise specific trap. +The varbinds in order denote the Icinga service name, state and text. +EDESC +``` + +1. Define the `EVENT` as per your need using your actual oid. +2. The service name, state and text are extracted from the first three varbinds. +This has the advantage of accommodating an unlimited set of use cases. + +> **Note** +> +> Replace the deprecated command pipe EXEC statement with a curl call +> to the REST API action [process-check-result](12-icinga2-api.md#icinga2-api-actions-process-check-result). + +Create a `Service` for the specific use case associated to the host. If the host +matches and the first varbind value is `Backup`, SNMPTT will submit the corresponding +passive update with the state and text from the second and third varbind: + +``` +object Service "Backup" { + import "generic-service-custom" + + host_name = "host.domain.com" + check_command = "dummy" + + enable_notifications = 1 + enable_active_checks = 1 + enable_passive_checks = 1 + enable_flapping = 0 + volatile = 1 + max_check_attempts = 1 + check_interval = 87000 + enable_perfdata = 0 + + vars.sla = "24x7" + vars.dummy_state = 2 + vars.dummy_text = "No passive check result received." +} +``` + + +## Agents sending Check Results via REST API <a id="agent-based-checks-rest-api"></a> + +Whenever the remote agent cannot run the Icinga agent, or a backup script +should just send its current state after finishing, you can use the [REST API](12-icinga2-api.md#icinga2-api) +as secure transport and send [passive external check results](08-advanced-topics.md#external-check-results). + +Use the [process-check-result](12-icinga2-api.md#icinga2-api-actions-process-check-result) API action to send the external passive check result. +You can either use `curl` or implement the HTTP requests in your preferred programming +language. Examples for API clients are available in [this chapter](12-icinga2-api.md#icinga2-api-clients). + +Feeding check results from remote hosts requires the host/service +objects configured on the master/satellite instance. + +## NSClient++ on Windows <a id="agent-based-checks-nsclient"></a> + +[NSClient++](https://nsclient.org/) works on both Windows and Linux platforms and is well +known for its magnificent Windows support. There are alternatives like the WMI interface, +but using `NSClient++` will allow you to run local scripts similar to check plugins fetching +the required output and performance counters. + +> **Tip** +> +> Best practice is to use the Icinga agent as secure execution +> bridge (`check_nt` and `check_nrpe` are considered insecure) +> and query the NSClient++ service [locally](06-distributed-monitoring.md#distributed-monitoring-windows-nscp). + +You can use the `check_nt` plugin from the Monitoring Plugins project to query NSClient++. +Icinga 2 provides the [nscp check command](10-icinga-template-library.md#plugin-check-command-nscp) for this: + +Example: + +``` +object Service "disk" { + import "generic-service" + + host_name = "remote-windows-host" + + check_command = "nscp" + + vars.nscp_variable = "USEDDISKSPACE" + vars.nscp_params = "c" + vars.nscp_warn = 70 + vars.nscp_crit = 80 +} +``` + +For details on the `NSClient++` configuration please refer to the [official documentation](https://docs.nsclient.org/). + +## WMI on Windows <a id="agent-based-checks-wmi"></a> + +The most popular plugin is [check_wmi_plus](https://edcint.co.nz/checkwmiplus/). + +> Check WMI Plus uses the Windows Management Interface (WMI) to check for common services (cpu, disk, sevices, eventlog…) on Windows machines. It requires the open source wmi client for Linux. + +Community examples: + +* [Icinga 2 check_wmi_plus example by 18pct](https://18pct.com/icinga2-check_wmi_plus-example/) +* [Agent-less monitoring with WMI](https://www.devlink.de/linux/icinga2-nagios-agentless-monitoring-von-windows/) |