Diffstat (limited to 'collectors/apps.plugin/README.md')
-rw-r--r-- | collectors/apps.plugin/README.md | 172 |
1 files changed, 91 insertions, 81 deletions
diff --git a/collectors/apps.plugin/README.md b/collectors/apps.plugin/README.md
index ee5c6971a..1b682bc65 100644
--- a/collectors/apps.plugin/README.md
+++ b/collectors/apps.plugin/README.md
@@ -5,9 +5,9 @@
 To achieve this task, it iterates through the whole process tree, collecting resource usage information
 for every process found running.

-Since netdata needs to present this information in charts and track them through time,
+Since Netdata needs to present this information in charts and track them through time,
 instead of presenting a `top` like list, `apps.plugin` uses a pre-defined list of **process groups**
-to which it assigns all running processes. This list is [customizable](apps_groups.conf) and netdata
+to which it assigns all running processes. This list is [customizable](apps_groups.conf) and Netdata
 ships with a good default for most cases (to edit it on your system run `/etc/netdata/edit-config apps_groups.conf`).

 So, `apps.plugin` builds a process tree (much like `ps fax` does in Linux), and groups
@@ -15,7 +15,7 @@ processes together (evaluating both child and parent processes) so that the resu
 a predefined set of members (of course, only process groups found running are reported).

 > If you find that `apps.plugin` categorizes standard applications as `other`, we would be
-> glad to accept pull requests improving the [defaults](apps_groups.conf) shipped with netdata.
+> glad to accept pull requests improving the [defaults](apps_groups.conf) shipped with Netdata.

 Unlike traditional process monitoring tools (like `top`), `apps.plugin` is able to account the resource
 utilization of exit processes. Their utilization is accounted at their currently running parents.
@@ -26,45 +26,50 @@ that fork/spawn other short lived processes hundreds of times per second.

 `apps.plugin` provides charts for 3 sections:

-1. Per application charts as **Applications** at netdata dashboards
-2. Per user charts as **Users** at netdata dashboards
-3. Per user group charts as **User Groups** at netdata dashboards
+1. Per application charts as **Applications** at Netdata dashboards
+2. Per user charts as **Users** at Netdata dashboards
+3. Per user group charts as **User Groups** at Netdata dashboards

 Each of these sections provides the same number of charts:

-- CPU Utilization
- - Total CPU usage
- - User / System CPU usage
-- Disk I/O
- - Physical Reads / Writes
- - Logical Reads / Writes
- - Open Unique Files (if a file is found open multiple times, it is counted just once)
-- Memory
- - Real Memory Used (non shared)
- - Virtual Memory Allocated
- - Minor Page Faults (i.e. memory activity)
-- Processes
- - Threads Running
- - Processes Running
- - Pipes Open
-- Swap Memory
- - Swap Memory Used
- - Major Page Faults (i.e. swap activity)
-- Network
- - Sockets Open
+- CPU Utilization
+ - Total CPU usage
+ - User / System CPU usage
+- Disk I/O
+ - Physical Reads / Writes
+ - Logical Reads / Writes
+ - Open Unique Files (if a file is found open multiple times, it is counted just once)
+- Memory
+ - Real Memory Used (non shared)
+ - Virtual Memory Allocated
+ - Minor Page Faults (i.e. memory activity)
+- Processes
+ - Threads Running
+ - Processes Running
+ - Pipes Open
+ - Carried Over Uptime (since the Netdata restart)
+ - Minimum Uptime
+ - Average Uptime
+ - Maximum Uptime
+
+- Swap Memory
+ - Swap Memory Used
+ - Major Page Faults (i.e. swap activity)
+- Network
+ - Sockets Open

 The above are reported:

-- For **Applications** per [target configured](apps_groups.conf).
-- For **Users** per username or UID (when the username is not available).
-- For **User Groups** per groupname or GID (when groupname is not available).
+- For **Applications** per [target configured](apps_groups.conf).
+- For **Users** per username or UID (when the username is not available).
+- For **User Groups** per groupname or GID (when groupname is not available).
 ## Performance

 `apps.plugin` is a complex piece of software and has a lot of work to do
 We are proud that `apps.plugin` is a lot faster compared to any other similar tool,
 while collecting a lot more information for the processes, however the fact is that
-this plugin requires more CPU resources than the netdata daemon itself.
+this plugin requires more CPU resources than the `netdata` daemon itself.

 Under Linux, for each process running, `apps.plugin` reads several `/proc` files
 per process. Doing this work per-second, especially on hosts with several thousands
@@ -80,7 +85,7 @@ To do this, edit `/etc/netdata/netdata.conf` and find this section:
 # command options =
 ```

-Uncomment the line `update every` and set it to a higher number. If you just set it to ` 2 `,
+Uncomment the line `update every` and set it to a higher number. If you just set it to `2`,
 its CPU resources will be cut in half, and data collection will be once every 2 seconds.

 ## Configuration
@@ -105,28 +110,28 @@ undesirable, the line `other: *` should be appended to the `apps_groups.conf`.

 The process names are the ones returned by:

- - `ps -e` or `cat /proc/PID/stat`
- - in case of substring mode (see below): `/proc/PID/cmdline`
+- `ps -e` or `cat /proc/PID/stat`
+- in case of substring mode (see below): `/proc/PID/cmdline`

 To add process names with spaces, enclose them in quotes (single or double)
-example: ` 'Plex Media Serv' ` or ` "my other process" `.
+example: `'Plex Media Serv'` or `"my other process"`.

-You can add an asterisk ` * ` at the beginning and/or the end of a process:
+You can add an asterisk `*` at the beginning and/or the end of a process:

- - `*name` *suffix* mode: will search for processes ending with `name` (at `/proc/PID/stat`)
- - `name*` *prefix* mode: will search for processes beginning with `name` (at `/proc/PID/stat`)
- - `*name*` *substring* mode: will search for `name` in the whole command line (at `/proc/PID/cmdline`)
+- `*name` _suffix_ mode: will search for processes ending with `name` (at `/proc/PID/stat`)
+- `name*` _prefix_ mode: will search for processes beginning with `name` (at `/proc/PID/stat`)
+- `*name*` _substring_ mode: will search for `name` in the whole command line (at `/proc/PID/cmdline`)

-If you enter even just one *name* (substring), `apps.plugin` will process
+If you enter even just one _name_ (substring), `apps.plugin` will process
 `/proc/PID/cmdline` for all processes (of course only once per process: when they are first seen).

-To add processes with single quotes, enclose them in double quotes: ` "process with this ' single quote" `
+To add processes with single quotes, enclose them in double quotes: `"process with this ' single quote"`

-To add processes with double quotes, enclose them in single quotes: ` 'process with this " double quote' `
+To add processes with double quotes, enclose them in single quotes: `'process with this " double quote'`

-If a group or process name starts with a ` - `, the dimension will be hidden from the chart (cpu chart only).
+If a group or process name starts with a `-`, the dimension will be hidden from the chart (cpu chart only).

-If a process starts with a ` + `, debugging will be enabled for it (debugging produces a lot of output - do not enable it in production systems).
+If a process starts with a `+`, debugging will be enabled for it (debugging produces a lot of output - do not enable it in production systems).

 You can add any number of groups. Only the ones found running will affect the charts generated.
 However, producing charts with hundreds of dimensions may slow down your web browser.
@@ -135,17 +140,24 @@ The order of the entries in this list is important: the first that matches a pro
 ones at the top. Processes not matched by any row, will inherit it from their parents or children.

 The order also controls the order of the dimensions on the generated charts (although applications started
-after apps.plugin is started, will be appended to the existing list of dimensions the netdata daemon maintains).
+after apps.plugin is started, will be appended to the existing list of dimensions the `netdata` daemon maintains).
+
+There are a few command line options you can pass to `apps.plugin`. The list of available options can be acquired with the `--help` flag. The options can be set in the `netdata.conf` file. For example, to disable user and user group charts you should set
+
+```
+[plugin:apps]
+ command options = without-users without-groups
+```

 ## Permissions

 `apps.plugin` requires additional privileges to collect all the information it needs.
 The problem is described in issue #157.

-When netdata is installed, `apps.plugin` is given the capabilities `cap_dac_read_search,cap_sys_ptrace+ep`.
+When Netdata is installed, `apps.plugin` is given the capabilities `cap_dac_read_search,cap_sys_ptrace+ep`.
 If this fails (i.e. `setcap` fails), `apps.plugin` is setuid to `root`.

-#### linux capabilities in containers
+### linux capabilities in containers

 There are a few cases, like `docker` and `virtuozzo` containers, where `setcap` succeeds, but the capabilities
 are silently ignored (in `lxc` containers `setcap` fails).
@@ -158,15 +170,15 @@ chown root:netdata /usr/libexec/netdata/plugins.d/apps.plugin
 chmod 4750 /usr/libexec/netdata/plugins.d/apps.plugin
 ```

-You will have to run these, every time you update netdata.
+You will have to run these, every time you update Netdata.
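
Reviewer note: since the `chown`/`chmod` commands in the hunk above have to be repeated after every update, it can help to check whether they are still in effect. The Python sketch below is illustrative only and is not part of Netdata: the function names are hypothetical, the path in the usage line is the one quoted in the README, and group ownership (`netdata`) is deliberately not checked.

```python
import os
import stat


def mode_is_4750(mode: int) -> bool:
    """True when the permission bits equal 4750:
    setuid, rwx for the owner, r-x for the group, nothing for others."""
    # S_IMODE drops the file-type bits and keeps only the permission bits.
    return stat.S_IMODE(mode) == 0o4750


def check_plugin(path: str) -> bool:
    """True when `path` is owned by root (uid 0) and carries mode 4750."""
    info = os.stat(path)
    return info.st_uid == 0 and mode_is_4750(info.st_mode)
```

A possible call would be `check_plugin("/usr/libexec/netdata/plugins.d/apps.plugin")`. A `False` result after an update suggests the two commands need to be run again.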
 ## Security

 `apps.plugin` performs a hard-coded function of building the process tree in memory,
-iterating forever, collecting metrics for each running process and sending them to netdata.
-This is a one-way communication, from `apps.plugin` to netdata.
+iterating forever, collecting metrics for each running process and sending them to Netdata.
+This is a one-way communication, from `apps.plugin` to Netdata.

-So, since `apps.plugin` cannot be instructed by netdata for the actions it performs,
+So, since `apps.plugin` cannot be instructed by Netdata for the actions it performs,
 we think it is pretty safe to allow it have these increased privileges.

 Keep in mind that `apps.plugin` will still run without escalated permissions,
@@ -189,28 +201,27 @@ Here is an example for the process group `sql` at `https://registry.my-netdata.i

 Netdata is able give you a lot more badges for your app. Examples below for process group `sql`:

-- CPU usage: ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.cpu&dimensions=sql&value_color=green=0%7Corange%3C50%7Cred)
-- Disk Physical Reads ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.preads&dimensions=sql&value_color=green%3C100%7Corange%3C1000%7Cred)
-- Disk Physical Writes ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.pwrites&dimensions=sql&value_color=green%3C100%7Corange%3C1000%7Cred)
-- Disk Logical Reads ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.lreads&dimensions=sql&value_color=green%3C100%7Corange%3C1000%7Cred)
-- Disk Logical Writes ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.lwrites&dimensions=sql&value_color=green%3C100%7Corange%3C1000%7Cred)
-- Open Files ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.files&dimensions=sql&value_color=green%3E30%7Cred)
-- Real Memory ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.mem&dimensions=sql&value_color=green%3C100%7Corange%3C200%7Cred)
-- Virtual Memory ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.vmem&dimensions=sql&value_color=green%3C100%7Corange%3C1000%7Cred)
-- Swap Memory ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.swap&dimensions=sql&value_color=green=0%7Cred)
-- Minor Page Faults ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.minor_faults&dimensions=sql&value_color=green%3C100%7Corange%3C1000%7Cred)
-- Processes ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.processes&dimensions=sql&value_color=green%3E0%7Cred)
-- Threads ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.threads&dimensions=sql&value_color=green%3E=28%7Cred)
-- Major Faults (swap activity) ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.major_faults&dimensions=sql&value_color=green=0%7Cred)
-- Open Pipes ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.pipes&dimensions=sql&value_color=green=0%7Cred)
-- Open Sockets ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.sockets&dimensions=sql&value_color=green%3E=3%7Cred)
-
+- CPU usage: ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.cpu&dimensions=sql&value_color=green=0%7Corange%3C50%7Cred)
+- Disk Physical Reads ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.preads&dimensions=sql&value_color=green%3C100%7Corange%3C1000%7Cred)
+- Disk Physical Writes ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.pwrites&dimensions=sql&value_color=green%3C100%7Corange%3C1000%7Cred)
+- Disk Logical Reads ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.lreads&dimensions=sql&value_color=green%3C100%7Corange%3C1000%7Cred)
+- Disk Logical Writes ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.lwrites&dimensions=sql&value_color=green%3C100%7Corange%3C1000%7Cred)
+- Open Files ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.files&dimensions=sql&value_color=green%3E30%7Cred)
+- Real Memory ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.mem&dimensions=sql&value_color=green%3C100%7Corange%3C200%7Cred)
+- Virtual Memory ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.vmem&dimensions=sql&value_color=green%3C100%7Corange%3C1000%7Cred)
+- Swap Memory ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.swap&dimensions=sql&value_color=green=0%7Cred)
+- Minor Page Faults ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.minor_faults&dimensions=sql&value_color=green%3C100%7Corange%3C1000%7Cred)
+- Processes ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.processes&dimensions=sql&value_color=green%3E0%7Cred)
+- Threads ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.threads&dimensions=sql&value_color=green%3E=28%7Cred)
+- Major Faults (swap activity) ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.major_faults&dimensions=sql&value_color=green=0%7Cred)
+- Open Pipes ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.pipes&dimensions=sql&value_color=green=0%7Cred)
+- Open Sockets ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.sockets&dimensions=sql&value_color=green%3E=3%7Cred)

 For more information about badges check [Generating Badges](../../web/api/badges)

 ## Comparison with console tools

-Ssh to a server running netdata and execute this:
+SSH to a server running Netdata and execute this:

 ```sh
 while true; do ls -l /var/run >/dev/null; done
@@ -225,7 +236,7 @@ identify the process that consumes so much CPU**.

 Here is what common Linux console monitoring tools report:

-#### top
+### top

 `top` reports that `bash` is using just 14%.

@@ -249,7 +260,7 @@ KiB Swap: 0 total, 0 free, 0 used. 753712 avail Mem
 2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
 ```

-#### htop
+### htop

 Exactly like `top`, `htop` is providing an incomplete breakdown of the system CPU utilization.

@@ -267,7 +278,7 @@ Exactly like `top`, `htop` is providing an incomplete breakdown of the system CP
 7019 netdata 20 0 138M 21016 2712 S 0.0 2.1 0:00.14 /usr/sbin/netdata
 ```

-#### atop
+### atop

 `atop` also fails to break down CPU usage.

@@ -290,14 +301,13 @@ NET | eth0 ---- | pcki 16 | pcko 15 | si 1 Kbps | so 4 Kbps |
 7009 0.01s 0.01s 0K 0K 0K 4K -- - S 0% netdata
 ```

-#### glances
+### glances

 And the same is true for `glances`. The system runs at 100%, but `glances` reports only 17%
 per process utilization.

 Note also, that being a `python` program, `glances` uses 1.6% CPU while it runs.

-
 ```
 localhost Uptime: 3 days, 21:42:00

@@ -318,24 +328,24 @@ FILE SYS Used Total 0.3 2.1 7009 netdata 0 S /usr/sbin/netdata
 / (vda1) 1.56G 29.5G 0.0 0.0 17 root 0 S oom_reaper
 ```

-#### why this happens?
+### why does this happen?

 All the console tools report usage based on the processes found running *at the moment they
 examine the process tree*. So, they see just one `ls` command, which is actually very quick
 with minor CPU utilization. But the shell, is spawning hundreds of them, one after another
 (much like shell scripts do).

-#### what netdata reports?
+### What does Netdata report?

 The total CPU utilization of the system:

 ![image](https://cloud.githubusercontent.com/assets/2662304/21076212/9198e5a6-bf2e-11e6-9bc0-6bdea25befb2.png)
-<br/>_**Figure 1**: The system overview section at netdata, just a few seconds after the command was run_
+<br/>***Figure 1**: The system overview section at Netdata, just a few seconds after the command was run*

 And at the applications `apps.plugin` breaks down CPU usage per application:

 ![image](https://cloud.githubusercontent.com/assets/2662304/21076220/c9687848-bf2e-11e6-8d81-348592c5aca2.png)
-<br/>_**Figure 2**: The Applications section at netdata, just a few seconds after the command was run_
+<br/>***Figure 2**: The Applications section at Netdata, just a few seconds after the command was run*

 So, the `ssh` session is using 95% CPU time.

@@ -344,7 +354,7 @@ Why `ssh`?

 `apps.plugin` groups all processes based on its configuration file [`/etc/netdata/apps_groups.conf`](apps_groups.conf)
 (to edit it on your system run `/etc/netdata/edit-config apps_groups.conf`).
-The default configuration has nothing for `bash`, but it has for `sshd`, so netdata accumulates
+The default configuration has nothing for `bash`, but it has for `sshd`, so Netdata accumulates
 all ssh sessions to a dimension on the charts, called `ssh`. This includes all the processes in
 the process tree of `sshd`, **including the exited children**.

@@ -353,9 +363,9 @@ the process tree of `sshd`, **including the exited children**.
 > `apps.plugin` does not use these mechanisms. The process grouping made by `apps.plugin` works
 > on any Linux, `systemd` based or not.

-#### a more technical description of how netdata works
+#### a more technical description of how Netdata works

-netdata reads `/proc/<pid>/stat` for all processes, once per second and extracts `utime` and
+Netdata reads `/proc/<pid>/stat` for all processes, once per second and extracts `utime` and
 `stime` (user and system cpu utilization), much like all the console tools do.

 But it [also extracts `cutime` and `cstime`](https://github.com/netdata/netdata/blob/62596cc6b906b1564657510ca9135c08f6d4cdda/src/apps_plugin.c#L636-L642)
@@ -369,7 +379,7 @@ been reported for it prior to this iteration.

 It is even trickier, because walking through the entire process tree takes some time itself. So,
 if you sum the CPU utilization of all processes, you might have more CPU time than the reported
-total cpu time of the system. netdata solves this, by adapting the per process cpu utilization to
+total cpu time of the system. Netdata solves this, by adapting the per process cpu utilization to
 the total of the system. [Netdata adds charts that document this normalization](https://london.my-netdata.io/default.html#menu_netdata_submenu_apps_plugin).

-[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fcollectors%2Fapps.plugin%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)]()
+[![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fcollectors%2Fapps.plugin%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>)
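
Reviewer note: the accounting described in the last hunks (reading `utime`, `stime`, `cutime` and `cstime` from `/proc/<pid>/stat`, then adapting per-process values to the system total) can be illustrated with a short Python sketch. This is not the plugin's C implementation. The function names are hypothetical, the field positions follow the `proc(5)` manual page, and plain proportional scaling is an assumption, since the README only says the values are adapted to the total of the system.

```python
from typing import Dict, List


def parse_cpu_ticks(stat_line: str) -> Dict[str, int]:
    """Extract utime, stime, cutime and cstime (clock ticks) from one
    /proc/<pid>/stat line."""
    # Field 2 (the command name) is wrapped in parentheses and may itself
    # contain spaces or parentheses, so split after the LAST ')'.
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    # rest[0] is field 3 (state), so field N of proc(5) sits at rest[N - 3].
    utime, stime, cutime, cstime = (int(rest[n - 3]) for n in (14, 15, 16, 17))
    return {"utime": utime, "stime": stime, "cutime": cutime, "cstime": cstime}


def normalize(per_process: List[float], system_total: float) -> List[float]:
    """Scale per-process values down proportionally when their sum exceeds
    the system total (assumed strategy, for illustration only)."""
    summed = sum(per_process)
    if summed <= system_total:
        return list(per_process)
    return [value * system_total / summed for value in per_process]


if __name__ == "__main__":
    # Made-up sample line; the command name deliberately contains ')' and spaces.
    sample = "4242 (my (odd) proc) S 1 4242 4242 0 -1 4194304 100 2000 0 3 17 5 230 41 20 0 1 0 12345 0 0"
    print(parse_cpu_ticks(sample))
    print(normalize([60.0, 60.0], 100.0))
```

The `cutime` and `cstime` fields are what make exited children visible: they carry the CPU time of children the process has already waited for, which tools reading only `utime` and `stime` never see.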