author:    Daniel Baumann <daniel.baumann@progress-linux.org>    2021-12-01 06:15:04 +0000
committer: Daniel Baumann <daniel.baumann@progress-linux.org>    2021-12-01 06:15:04 +0000
commit:    e970e0b37b8bd7f246feb3f70c4136418225e434 (patch)
tree:      0b67c0ca45f56f2f9d9c5c2e725279ecdf52d2eb /collectors/ebpf.plugin/README.md
parent:    Adding upstream version 1.31.0. (diff)
download:  netdata-e970e0b37b8bd7f246feb3f70c4136418225e434.tar.xz
           netdata-e970e0b37b8bd7f246feb3f70c4136418225e434.zip

Adding upstream version 1.32.0. (upstream/1.32.0)

Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>

Diffstat (limited to 'collectors/ebpf.plugin/README.md')

-rw-r--r--  collectors/ebpf.plugin/README.md  722
1 file changed, 583 insertions, 139 deletions
diff --git a/collectors/ebpf.plugin/README.md b/collectors/ebpf.plugin/README.md
index 1e593786..60f1fd74 100644
--- a/collectors/ebpf.plugin/README.md
+++ b/collectors/ebpf.plugin/README.md
@@ -1,35 +1,52 @@
<!--
title: "eBPF monitoring with Netdata"
-description: "Use Netdata's extended Berkeley Packet Filter (eBPF) collector to monitor kernel-level metrics about your complex applications with per-second granularity."
+description: "Use Netdata's extended Berkeley Packet Filter (eBPF) collector to monitor kernel-level metrics about your
+complex applications with per-second granularity."
custom_edit_url: https://github.com/netdata/netdata/edit/master/collectors/ebpf.plugin/README.md
sidebar_label: "eBPF"
-->

# eBPF monitoring with Netdata

-Netdata's extended Berkeley Packet Filter (eBPF) collector monitors kernel-level metrics for file descriptors, virtual
-filesystem IO, and process management on Linux systems. You can use our eBPF collector to analyze how and when a process
-accesses files, when it makes system calls, whether it leaks memory or creating zombie processes, and more.
+eBPF consists of a wide toolchain that ultimately outputs a set of bytecode that runs inside the eBPF virtual
+machine (VM), which lives inside the Linux kernel. Each program is executed in response to a [tracepoint
+or kprobe](#probes-and-tracepoints) activation.

-Netdata's eBPF monitoring toolkit uses two custom eBPF programs. The default, called `entry`, monitors calls to a
-variety of kernel functions, such as `do_sys_open`, `__close_fd`, `vfs_read`, `vfs_write`, `_do_fork`, and more. The
-`return` program also monitors the return of each kernel functions to deliver more granular metrics about how your
-system and its applications interact with the Linux kernel.
+Netdata has written many eBPF programs, which, when compiled and integrated into the Netdata Agent, can collect
+a wide array of data about the host that would otherwise be impossible to gather. This data is truly unique, giving
+the Netdata Agent access to high-value information that is normally hard to capture.

-eBPF monitoring can help you troubleshoot and debug how applications interact with the Linux kernel. See our [guide on
-troubleshooting apps with eBPF metrics](/docs/guides/troubleshoot/monitor-debug-applications-ebpf.md) for configuration
-and troubleshooting tips.
+eBPF monitoring can help you troubleshoot and debug how applications interact with the Linux kernel. See
+our [guide on troubleshooting apps with eBPF metrics](/docs/guides/troubleshoot/monitor-debug-applications-ebpf.md) for
+configuration and troubleshooting tips.

<figure>
  <img src="https://user-images.githubusercontent.com/1153921/74746434-ad6a1e00-5222-11ea-858a-a7882617ae02.png" alt="An example of VFS charts, made possible by the eBPF collector plugin" />
  <figcaption>An example of VFS charts made possible by the eBPF collector plugin.</figcaption>
</figure>

-## Enable the collector on Linux
+## Probes and Tracepoints
+
+The following two features from the Linux kernel are used by Netdata to run eBPF programs:
+
+- Kprobes and return probes (kretprobe): Probes can be inserted at virtually any kernel instruction. When eBPF runs in
+  `entry` mode, it attaches only `kprobes` to internal functions, monitoring calls and some arguments every time a
+  function is called. You can also change the configuration to use [`return`](#global) mode, which additionally monitors
+  the return of these functions so that possible failures can be detected.
+- Tracepoints are hooks to call specific functions. Tracepoints are more stable than `kprobes` and are preferred when
+  both options are available.
+
+In each case, wherever a normal kprobe, kretprobe, or tracepoint would have run its hook function, an eBPF program is
+run instead, performing various collection logic before letting the kernel continue its normal control flow.
+
+There are other methods by which eBPF programs can be triggered that are not currently supported, such as uprobes,
+which allow hooking into arbitrary user-space functions in a manner similar to kprobes.
+
+## Manually enable the collector on Linux

**The eBPF collector is installed and enabled by default on most new installations of the Agent**. The eBPF collector
-does not currently work with [static build installations](/packaging/installer/methods/kickstart-64.md), but improved
-support is in active development.
+does not currently work with [static build installations](/packaging/installer/methods/kickstart-64.md) for kernels older
+than `4.11`, but improved support is in active development.

eBPF monitoring only works on Linux systems and with specific Linux kernels, including all kernels newer than `4.11.0`,
and all kernels on CentOS 7.6 or later.

@@ -39,72 +56,403 @@ section for details.

## Charts

-The eBPF collector creates an **eBPF** menu in the Agent's dashboard along with three sub-menus: **File**, **VFS**, and
-**Process**. All the charts in this section update every second. The collector stores the actual value inside of its
-process, but charts only show the difference between the values collected in the previous and current seconds.
+The eBPF collector creates charts on different menus, like System Overview, Memory, MD arrays, Disks, Filesystem,
+Mount Points, Networking Stack, systemd Services, and Applications.
+
+The collector stores the actual value inside of its process, but charts only show the difference between the values
+collected in the previous and current seconds.
+
+### System overview
+
+Not all charts within the System Overview menu are enabled by default, because they add around 100ns of overhead per
+function call. That number is small from a human perspective, but the functions are called many times, so the overhead
+adds up on a busy host. See the [configuration](#configuration) section for details about how to enable them.
+
+#### Processes
+
+Internally, the Linux kernel treats both processes and threads as `tasks`. To create a thread, the kernel offers a few
+system calls: `fork(2)`, `vfork(2)`, and `clone(2)`. To generate this chart, the eBPF
+collector uses the following `tracepoints` and `kprobes`:
+
+- `sched/sched_process_fork`: Tracepoint called after a call to `fork(2)`, `vfork(2)`, or `clone(2)`.
+- `sched/sched_process_exec`: Tracepoint called after an exec-family syscall.
+- `kprobe/kernel_clone`: This is the main [`fork()`](https://elixir.bootlin.com/linux/v5.10/source/kernel/fork.c#L2415)
+  routine since kernel `5.10.0` was released.
+- `kprobe/_do_fork`: Like `kernel_clone`, but this was the main function between kernels `4.2.0` and `5.9.16`.
+- `kprobe/do_fork`: This was the main function before kernel `4.2.0`.
+
+#### Process Exit
+
+Ending a task requires two steps. The first is a call to the internal function `do_exit`, which notifies the operating
+system that the task is finishing its work. The second step is to release the kernel information with the internal
+function `release_task`.
The difference between the two dimensions can help you discover
+[zombie processes](https://en.wikipedia.org/wiki/Zombie_process). To get the metrics, the collector uses:
+
+- `sched/sched_process_exit`: Tracepoint called after a task exits.
+- `kprobe/release_task`: This function is called when a process exits, as the kernel still needs to remove the process
+  descriptor.
+
+#### Task error
+
+The functions responsible for ending tasks do not return values, so this chart contains information about failures on
+process and thread creation only.
+
+#### Swap
+
+Inside the swap submenu the eBPF plugin creates the chart `swapcalls`; this chart shows when processes call the
+functions [`swap_readpage` and `swap_writepage`](https://hzliu123.github.io/linux-kernel/Page%20Cache%20in%20Linux%202.6.pdf),
+which are responsible for doing IO on swap memory. To capture the exact moment that an access to swap happens,
+the collector attaches `kprobes` to the cited functions.
+
+#### Soft IRQ
+
+The following `tracepoints` are used to measure time usage for soft IRQs:
+
+- [`irq/softirq_entry`](https://www.kernel.org/doc/html/latest/core-api/tracepoint.html#c.trace_softirq_entry): Called
+  before the softirq handler.
+- [`irq/softirq_exit`](https://www.kernel.org/doc/html/latest/core-api/tracepoint.html#c.trace_softirq_exit): Called when
+  the softirq handler returns.
+
+#### Hard IRQ
+
+The following tracepoints are used to measure the latency of servicing a
+hardware interrupt request (hard IRQ).
+
+- [`irq/irq_handler_entry`](https://www.kernel.org/doc/html/latest/core-api/tracepoint.html#c.trace_irq_handler_entry):
+  Called immediately before the IRQ action handler.
+- [`irq/irq_handler_exit`](https://www.kernel.org/doc/html/latest/core-api/tracepoint.html#c.trace_irq_handler_exit):
+  Called immediately after the IRQ action handler returns.
+- `irq_vectors`: These are traces from `irq_handler_entry` and
+  `irq_handler_exit` when an IRQ is handled. The following elements from the vector
+  are triggered:
+  - `irq_vectors/local_timer_entry`
+  - `irq_vectors/local_timer_exit`
+  - `irq_vectors/reschedule_entry`
+  - `irq_vectors/reschedule_exit`
+  - `irq_vectors/call_function_entry`
+  - `irq_vectors/call_function_exit`
+  - `irq_vectors/call_function_single_entry`
+  - `irq_vectors/call_function_single_exit`
+  - `irq_vectors/irq_work_entry`
+  - `irq_vectors/irq_work_exit`
+  - `irq_vectors/error_apic_entry`
+  - `irq_vectors/error_apic_exit`
+  - `irq_vectors/thermal_apic_entry`
+  - `irq_vectors/thermal_apic_exit`
+  - `irq_vectors/threshold_apic_entry`
+  - `irq_vectors/threshold_apic_exit`
+  - `irq_vectors/deferred_error_entry`
+  - `irq_vectors/deferred_error_exit`
+  - `irq_vectors/spurious_apic_entry`
+  - `irq_vectors/spurious_apic_exit`
+  - `irq_vectors/x86_platform_ipi_entry`
+  - `irq_vectors/x86_platform_ipi_exit`
+
+#### IPC shared memory
+
+To monitor shared memory system call counts, the following `kprobes` are used:
+
+- `shmget`: Runs when [`shmget`](https://man7.org/linux/man-pages/man2/shmget.2.html) is called.
+- `shmat`: Runs when [`shmat`](https://man7.org/linux/man-pages/man2/shmat.2.html) is called.
+- `shmdt`: Runs when [`shmdt`](https://man7.org/linux/man-pages/man2/shmat.2.html) is called.
+- `shmctl`: Runs when [`shmctl`](https://man7.org/linux/man-pages/man2/shmctl.2.html) is called.
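
To see these counters move on a test host, one quick way to generate shared-memory syscalls is the `ipcmk`, `ipcs`, and
`ipcrm` tools from util-linux. This is only a sketch, and it exercises `shmget` and `shmctl` rather than all four probes:

```bash
# Create a 1 MiB System V shared memory segment (ipcmk calls shmget internally).
ipcmk -M 1048576

# List the existing segments and note the shmid of the one just created.
ipcs -m

# Remove the segment again; ipcrm issues shmctl(IPC_RMID). Replace <shmid> with the id printed above.
ipcrm -m <shmid>
```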
+ +### Memory + +In the memory submenu the eBPF plugin creates two submenus **page cache** and **synchronization** with the following +organization: + +* Page Cache + * Page cache ratio + * Dirty pages + * Page cache hits + * Page cache misses +* Synchronization + * File sync + * Memory map sync + * File system sync + * File range sync + +#### Page cache ratio + +The chart `cachestat_ratio` shows how processes are accessing page cache. In a normal scenario, we expect values around +100%, which means that the majority of the work on the machine is processed in memory. To calculate the ratio, Netdata +attaches `kprobes` for kernel functions: + +- `add_to_page_cache_lru`: Page addition. +- `mark_page_accessed`: Access to cache. +- `account_page_dirtied`: Dirty (modified) pages. +- `mark_buffer_dirty`: Writes to page cache. + +#### Dirty pages + +On `cachestat_dirties` Netdata demonstrates the number of pages that were modified. This chart shows the number of calls +to the function `mark_buffer_dirty`. + +#### Page cache hits + +A page cache hit is when the page cache is successfully accessed with a read operation. We do not count pages that were +added relatively recently. + +#### Page cache misses + +A page cache miss means that a page was not inside memory when the process tried to access it. This chart shows the +result of the difference for calls between functions `add_to_page_cache_lru` and `account_page_dirtied`. + +#### File sync + +This chart shows calls to synchronization methods, [`fsync(2)`](https://man7.org/linux/man-pages/man2/fdatasync.2.html) +and [`fdatasync(2)`](https://man7.org/linux/man-pages/man2/fdatasync.2.html), to transfer all modified page caches +for the files on disk devices. These calls block until the disk reports that the transfer has been completed. They flush +data for specific file descriptors. + +#### Memory map sync + +The chart shows calls to [`msync(2)`](https://man7.org/linux/man-pages/man2/msync.2.html) syscalls. This syscall flushes +changes to a file that was mapped into memory using [`mmap(2)`](https://man7.org/linux/man-pages/man2/mmap.2.html). + +#### File system sync + +This chart monitors calls demonstrating commits from filesystem caches to disk. Netdata attaches `kprobes` for +[`sync(2)`](https://man7.org/linux/man-pages/man2/sync.2.html), and [`syncfs(2)`](https://man7.org/linux/man-pages/man2/sync.2.html). + +#### File range sync + +This chart shows calls to [`sync_file_range(2)`](https://man7.org/linux/man-pages/man2/sync_file_range.2.html) which +synchronizes file segments with disk. + +> Note: This is the most dangerous syscall to synchronize data, according to its manual. + +### Multiple Device (MD) arrays + +The eBPF plugin shows multi-device flushes happening in real time. This can be used to explain some spikes happening +in [disk latency](#disk) charts. + +By default, MD flush is disabled. To enable it, configure your +`/etc/netdata/ebpf.d.conf` file as: + +```conf +[global] + mdflush = yes +``` + +#### MD flush + +To collect data related to Linux multi-device (MD) flushing, the following kprobe is used: + +- `kprobe/md_flush_request`: called whenever a request for flushing multi-device data is made. + +### Disk + +The eBPF plugin also shows a chart in the Disk section when the `disk` thread is enabled. This will create the +chart `disk_latency_io` for each disk on the host. 
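
The `disk` thread is one of the optional programs listed in the `[ebpf programs]` section below. A minimal sketch of
turning it on, assuming it uses the same `ebpf.d.conf` toggle pattern as the other optional threads:

```conf
[ebpf programs]
    disk = yes
```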
The following tracepoints are used: + +- [`block/block_rq_issue`](https://www.kernel.org/doc/html/latest/core-api/tracepoint.html#c.trace_block_rq_issue): + IO request operation to a device drive. +- [`block/block_rq_complete`](https://www.kernel.org/doc/html/latest/core-api/tracepoint.html#c.trace_block_rq_complete): + IO operation completed by device. + +### Filesystem + +This group has charts demonstrating how applications interact with the Linux +kernel to open and close file descriptors. It also brings latency charts for +several different filesystems. -### File +#### ext4 -This group has two charts demonstrating how software interacts with the Linux kernel to open and close file descriptors. +To measure the latency of executing some actions in an +[ext4](https://elixir.bootlin.com/linux/latest/source/fs/ext4) filesystem, the +collector needs to attach `kprobes` and `kretprobes` for each of the following +functions: + +- `ext4_file_read_iter`: Function used to measure read latency. +- `ext4_file_write_iter`: Function used to measure write latency. +- `ext4_file_open`: Function used to measure open latency. +- `ext4_sync_file`: Function used to measure sync latency. + +#### ZFS + +To measure the latency of executing some actions in a zfs filesystem, the +collector needs to attach `kprobes` and `kretprobes` for each of the following +functions: + +- `zpl_iter_read`: Function used to measure read latency. +- `zpl_iter_write`: Function used to measure write latency. +- `zpl_open`: Function used to measure open latency. +- `zpl_fsync`: Function used to measure sync latency. + +#### XFS + +To measure the latency of executing some actions in an +[xfs](https://elixir.bootlin.com/linux/latest/source/fs/xfs) filesystem, the +collector needs to attach `kprobes` and `kretprobes` for each of the following +functions: + +- `xfs_file_read_iter`: Function used to measure read latency. +- `xfs_file_write_iter`: Function used to measure write latency. +- `xfs_file_open`: Function used to measure open latency. +- `xfs_file_fsync`: Function used to measure sync latency. + +#### NFS + +To measure the latency of executing some actions in an +[nfs](https://elixir.bootlin.com/linux/latest/source/fs/nfs) filesystem, the +collector needs to attach `kprobes` and `kretprobes` for each of the following +functions: + +- `nfs_file_read`: Function used to measure read latency. +- `nfs_file_write`: Function used to measure write latency. +- `nfs_file_open`: Functions used to measure open latency. +- `nfs4_file_open`: Functions used to measure open latency for NFS v4. +- `nfs_getattr`: Function used to measure sync latency. + +#### btrfs + +To measure the latency of executing some actions in a [btrfs](https://elixir.bootlin.com/linux/latest/source/fs/btrfs/file.c) +filesystem, the collector needs to attach `kprobes` and `kretprobes` for each of the following functions: + +> Note: We are listing two functions used to measure `read` latency, but we use either `btrfs_file_read_iter` or +`generic_file_read_iter`, depending on kernel version. + +- `btrfs_file_read_iter`: Function used to measure read latency since kernel `5.10.0`. +- `generic_file_read_iter`: Like `btrfs_file_read_iter`, but this function was used before kernel `5.10.0`. +- `btrfs_file_write_iter`: Function used to write data. +- `btrfs_file_open`: Function used to open files. +- `btrfs_sync_file`: Function used to synchronize data to filesystem. 
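
If you only need one of these filesystems, the per-filesystem toggles can be trimmed. A sketch that keeps only the
`ext4` latency charts, using the `[filesystem]` options documented in the Filesystem configuration section later in
this document, might be:

```conf
[filesystem]
    btrfsdist = no
    ext4dist = yes
    nfsdist = no
    xfsdist = no
    zfsdist = no
```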
#### File descriptor -This chart contains two dimensions that show the number of calls to the functions `do_sys_open` and `__close_fd`. Most -software do not commonly call these functions directly, but they are behind the system calls `open(2)`, `openat(2)`, -and `close(2)`. +To give metrics related to `open` and `close` events, instead of attaching kprobes for each syscall used to do these +events, the collector attaches `kprobes` for the common function used for syscalls: + +- [`do_sys_open`](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-5.html ): Internal function used to + open files. +- [`do_sys_openat2`](https://elixir.bootlin.com/linux/v5.6/source/fs/open.c#L1162): + Function called from `do_sys_open` since version `5.6.0`. +- [`close_fd`](https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg2271761.html): Function used to close file + descriptor since kernel `5.11.0`. +- `__close_fd`: Function used to close files before version `5.11.0`. #### File error This chart shows the number of times some software tried and failed to open or close a file descriptor. -### VFS +#### VFS + +The Linux Virtual File System (VFS) is an abstraction layer on top of a +concrete filesystem like the ones listed in the parent section, e.g. `ext4`. -A [virtual file system](https://en.wikipedia.org/wiki/Virtual_file_system) (VFS) is a layer on top of regular -filesystems. The functions present inside this API are used for all filesystems, so it's possible the charts in this -group won't show _all_ the actions that occurred on your system. +In this section we list the mechanism by which we gather VFS data, and what +charts are consequently created. -#### Deleted objects +##### VFS eBPF Hooks -This chart monitors calls for `vfs_unlink`. This function is responsible for removing objects from the file system. +To measure the latency and total quantity of executing some VFS-level +functions, ebpf.plugin needs to attach kprobes and kretprobes for each of the +following functions: -#### IO +- `vfs_write`: Function used monitoring the number of successful & failed + filesystem write calls, as well as the total number of written bytes. +- `vfs_writev`: Same function as `vfs_write` but for vector writes (i.e. a + single write operation using a group of buffers rather than 1). +- `vfs_read`: Function used for monitoring the number of successful & failed + filesystem read calls, as well as the total number of read bytes. +- `vfs_readv` Same function as `vfs_read` but for vector reads (i.e. a single + read operation using a group of buffers rather than 1). +- `vfs_unlink`: Function used for monitoring the number of successful & failed + filesystem unlink calls. +- `vfs_fsync`: Function used for monitoring the number of successful & failed + filesystem fsync calls. +- `vfs_open`: Function used for monitoring the number of successful & failed + filesystem open calls. +- `vfs_create`: Function used for monitoring the number of successful & failed + filesystem create calls. + +##### VFS Deleted objects + +This chart monitors calls to `vfs_unlink`. This function is responsible for removing objects from the file system. + +##### VFS IO This chart shows the number of calls to the functions `vfs_read` and `vfs_write`. -#### IO bytes +##### VFS IO bytes -This chart also monitors `vfs_read` and `vfs_write`, but instead shows the total of bytes read and written with these -functions. 
+This chart also monitors `vfs_read` and `vfs_write` but, instead of the number of calls, it shows the total amount of +bytes read and written with these functions. The Agent displays the number of bytes written as negative because they are moving down to disk. -#### IO errors +##### VFS IO errors The Agent counts and shows the number of instances where a running program experiences a read or write error. -### Process +##### VFS Create -For this group, the eBPF collector monitors process/thread creation and process end, and then displays any errors in the -following charts. +This chart shows the number of calls to `vfs_create`. This function is responsible for creating files. -#### Process thread +##### VFS Synchronization -Internally, the Linux kernel treats both processes and threads as `tasks`. To create a thread, the kernel offers a few -system calls: `fork(2)`, `vfork(2)` and `clone(2)`. In turn, each of these system calls use the function `_do_fork`. To -generate this chart, the eBPF collector monitors `_do_fork` to populate the `process` dimension, and monitors -`sys_clone` to identify threads. +This chart shows the number of calls to `vfs_fsync`. This function is responsible for calling `fsync(2)` or +`fdatasync(2)` on a file. You can see more details in the Synchronization section. -#### Exit +##### VFS Open -Ending a task requires two steps. The first is a call to the internal function `do_exit`, which notifies the operating -system that the task is finishing its work. The second step is to release the kernel information with the internal -function `release_task`. The difference between the two dimensions can help you discover [zombie -processes](https://en.wikipedia.org/wiki/Zombie_process). +This chart shows the number of calls to `vfs_open`. This function is responsible for opening files. -#### Task error +#### Directory Cache -The functions responsible for ending tasks do not return values, so this chart contains information about failures on -process and thread creation. +Metrics for directory cache are collected using kprobe for `lookup_fast`, because we are interested in the number of +times this function is accessed. On the other hand, for `d_lookup` we are not only interested in the number of times it +is accessed, but also in possible errors, so we need to attach a `kretprobe`. For this reason, the following is used: + +- [`lookup_fast`](https://lwn.net/Articles/649115/): Called to look at data inside the directory cache. +- [`d_lookup`](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/dcache.c?id=052b398a43a7de8c68c13e7fa05d6b3d16ce6801#n2223): + Called when the desired file is not inside the directory cache. + +### Mount Points + +The following `kprobes` are used to collect `mount` & `unmount` call counts: + +- [`mount`](https://man7.org/linux/man-pages/man2/mount.2.html): mount filesystem on host. +- [`umount`](https://man7.org/linux/man-pages/man2/umount.2.html): umount filesystem on host. + +### Networking Stack + +Netdata monitors socket bandwidth attaching `kprobes` for internal functions. + +#### TCP functions + +This chart demonstrates calls to functions `tcp_sendmsg`, `tcp_cleanup_rbuf`, and `tcp_close`; these functions are used +to send & receive data and to close connections when `TCP` protocol is used. + +#### TCP bandwidth + +Like the previous chart, this one also monitors `tcp_sendmsg` and `tcp_cleanup_rbuf`, but instead of showing the number +of calls, it demonstrates the number of bytes sent and received. 
+ +#### TCP retransmit + +This chart demonstrates calls to function `tcp_retransmit` that is responsible for executing TCP retransmission when the +receiver did not return the packet during the expected time. + +#### UDP functions + +This chart demonstrates calls to functions `udp_sendmsg` and `udp_recvmsg`, which are responsible for sending & +receiving data for connections when the `UDP` protocol is used. + +#### UDP bandwidth + +Like the previous chart, this one also monitors `udp_sendmsg` and `udp_recvmsg`, but instead of showing the number of +calls, it monitors the number of bytes sent and received. + +### Apps + +#### OOM Killing + +These are tracepoints related to [OOM](https://en.wikipedia.org/wiki/Out_of_memory) killing processes. + +- `oom/mark_victim`: Monitors when an oomkill event happens. ## Configuration @@ -134,7 +482,7 @@ cd /etc/netdata/ # Replace with your Netdata configuration directory, if not / The `[global]` section defines settings for the whole eBPF collector. -#### ebpf load mode +#### eBPF load mode The collector has two different eBPF programs. These programs monitor the same functions inside the kernel, but they monitor, process, and display different kinds of information. @@ -143,43 +491,20 @@ By default, this plugin uses the `entry` mode. Changing this mode can create sig system, but also offer valuable information if you are developing or debugging software. The `ebpf load mode` option accepts the following values: -- `entry`: This is the default mode. In this mode, the eBPF collector only monitors calls for the functions described - in the sections above, and does not show charts related to errors. -- `return`: In the `return` mode, the eBPF collector monitors the same kernel functions as `entry`, but also creates - new charts for the return of these functions, such as errors. Monitoring function returns can help in debugging - software, such as failing to close file descriptors or creating zombie processes. -- `update every`: Number of seconds used for eBPF to send data for Netdata. -- `pid table size`: Defines the maximum number of PIDs stored inside the application hash table. - +- `entry`: This is the default mode. In this mode, the eBPF collector only monitors calls for the functions described in + the sections above, and does not show charts related to errors. +- `return`: In the `return` mode, the eBPF collector monitors the same kernel functions as `entry`, but also creates new + charts for the return of these functions, such as errors. Monitoring function returns can help in debugging software, + such as failing to close file descriptors or creating zombie processes. +- `update every`: Number of seconds used for eBPF to send data for Netdata. +- `pid table size`: Defines the maximum number of PIDs stored inside the application hash table. + #### Integration with `apps.plugin` The eBPF collector also creates charts for each running application through an integration with the [`apps.plugin`](/collectors/apps.plugin/README.md). This integration helps you understand how specific applications interact with the Linux kernel. -When the integration is enabled, your dashboard will also show the following charts using low-level Linux metrics: - -- eBPF file - - Number of calls to open files. (`apps.file_open`) - - Number of files closed. (`apps.file_closed`) - - Number of calls to open files that returned errors. - - Number of calls to close files that returned errors. -- eBPF syscall - - Number of calls to delete files. 
(`apps.file_deleted`) - - Number of calls to `vfs_write`. (`apps.vfs_write_call`) - - Number of calls to `vfs_read`. (`apps.vfs_read_call`) - - Number of bytes written with `vfs_write`. (`apps.vfs_write_bytes`) - - Number of bytes read with `vfs_read`. (`apps.vfs_read_bytes`) - - Number of calls to write a file that returned errors. - - Number of calls to read a file that returned errors. -- eBPF process - - Number of process created with `do_fork`. (`apps.process_create`) - - Number of threads created with `do_fork` or `__x86_64_sys_clone`, depending on your system's kernel version. (`apps.thread_create`) - - Number of times that a process called `do_exit`. (`apps.task_close`) -- eBPF net - - Number of bytes sent. (`apps.bandwidth_sent`) - - Number of bytes received. (`apps.bandwidth_recv`) - If you want to _disable_ the integration with `apps.plugin` along with the above charts, change the setting `apps` to `no`. @@ -188,30 +513,129 @@ If you want to _disable_ the integration with `apps.plugin` along with the above apps = yes ``` -When the integration is enabled, eBPF collector allocates memory for each process running. The total - allocated memory has direct relationship with the kernel version. When the eBPF plugin is running on kernels newer than `4.15`, - it uses per-cpu maps to speed up the update of hash tables. This also implies storing data for the same PID - for each processor it runs. +When the integration is enabled, eBPF collector allocates memory for each process running. The total allocated memory +has direct relationship with the kernel version. When the eBPF plugin is running on kernels newer than `4.15`, it uses +per-cpu maps to speed up the update of hash tables. This also implies storing data for the same PID for each processor +it runs. + +#### Integration with `cgroups.plugin` -#### `[ebpf programs]` +The eBPF collector also creates charts for each cgroup through an integration with the +[`cgroups.plugin`](/collectors/cgroups.plugin/README.md). This integration helps you understand how a specific cgroup +interacts with the Linux kernel. + +The integration with `cgroups.plugin` is disabled by default to avoid creating overhead on your system. If you want to +_enable_ the integration with `cgroups.plugin`, change the `cgroups` setting to `yes`. + +```conf +[global] + cgroups = yes +``` + +If you do not need to monitor specific metrics for your `cgroups`, you can enable `cgroups` inside +`ebpf.d.conf`, and then disable the plugin for a specific `thread` by following the steps in the +[Configuration](#configuration) section. + +#### Integration Dashboard Elements + +When an integration is enabled, your dashboard will also show the following cgroups and apps charts using low-level +Linux metrics: + +> Note: The parenthetical accompanying each bulleted item provides the chart name. + +- mem + - Number of processes killed due out of memory. (`oomkills`) +- process + - Number of processes created with `do_fork`. (`process_create`) + - Number of threads created with `do_fork` or `clone (2)`, depending on your system's kernel + version. (`thread_create`) + - Number of times that a process called `do_exit`. (`task_exit`) + - Number of times that a process called `release_task`. (`task_close`) + - Number of times that an error happened to create thread or process. (`task_error`) +- swap + - Number of calls to `swap_readpage`. (`swap_read_call`) + - Number of calls to `swap_writepage`. (`swap_write_call`) +- network + - Number of bytes sent. 
(`total_bandwidth_sent`) + - Number of bytes received. (`total_bandwidth_recv`) + - Number of calls to `tcp_sendmsg`. (`bandwidth_tcp_send`) + - Number of calls to `tcp_cleanup_rbuf`. (`bandwidth_tcp_recv`) + - Number of calls to `tcp_retransmit_skb`. (`bandwidth_tcp_retransmit`) + - Number of calls to `udp_sendmsg`. (`bandwidth_udp_send`) + - Number of calls to `udp_recvmsg`. (`bandwidth_udp_recv`) +- file access + - Number of calls to open files. (`file_open`) + - Number of calls to open files that returned errors. (`open_error`) + - Number of files closed. (`file_closed`) + - Number of calls to close files that returned errors. (`file_error_closed`) +- vfs + - Number of calls to `vfs_unlink`. (`file_deleted`) + - Number of calls to `vfs_write`. (`vfs_write_call`) + - Number of calls to write a file that returned errors. (`vfs_write_error`) + - Number of calls to `vfs_read`. (`vfs_read_call`) + - Number of bytes written with `vfs_write`. (`vfs_write_bytes`) + - Number of bytes read with `vfs_read`. (`vfs_read_bytes`) + - Number of calls to read a file that returned errors. (`vfs_read_error`) + - Number of calls to `vfs_fsync`. (`vfs_fsync`) + - Number of calls to sync file that returned errors. (`vfs_fsync_error`) + - Number of calls to `vfs_open`. (`vfs_open`) + - Number of calls to open file that returned errors. (`vfs_open_error`) + - Number of calls to `vfs_create`. (`vfs_create`) + - Number of calls to open file that returned errors. (`vfs_create_error`) +- page cache + - Ratio of pages accessed. (`cachestat_ratio`) + - Number of modified pages ("dirty"). (`cachestat_dirties`) + - Number of accessed pages. (`cachestat_hits`) + - Number of pages brought from disk. (`cachestat_misses`) +- directory cache + - Ratio of files available in directory cache. (`dc_hit_ratio`) + - Number of files accessed. (`dc_reference`) + - Number of files accessed that were not in cache. (`dc_not_cache`) + - Number of files not found. (`dc_not_found`) +- ipc shm + - Number of calls to `shm_get`. (`shmget_call`) + - Number of calls to `shm_at`. (`shmat_call`) + - Number of calls to `shm_dt`. (`shmdt_call`) + - Number of calls to `shm_ctl`. (`shmctl_call`) + +### `[ebpf programs]` The eBPF collector enables and runs the following eBPF programs by default: -- `cachestat`: Netdata's eBPF data collector creates charts about the memory page cache. When the integration with - [`apps.plugin`](/collectors/apps.plugin/README.md) is enabled, this collector creates charts for the whole host _and_ - for each application. -- `dcstat` : This eBPF program creates charts that show information about file access using directory cache. It appends - `kprobes` for `lookup_fast()` and `d_lookup()` to identify if files are inside directory cache, outside and - files are not found. -- `process`: This eBPF program creates charts that show information about process creation, VFS IO, and files removed. - When in `return` mode, it also creates charts showing errors when these operations are executed. -- `network viewer`: This eBPF program creates charts with information about `TCP` and `UDP` functions, including the - bandwidth consumed by each. -- `sync`: Montitor calls for syscalls sync(2), fsync(2), fdatasync(2), syncfs(2), msync(2), and sync_file_range(2). +- `fd` : This eBPF program creates charts that show information about calls to open files. +- `mount`: This eBPF program creates charts that show calls to syscalls mount(2) and umount(2). 
+- `shm`: This eBPF program creates charts that show calls to syscalls shmget(2), shmat(2), shmdt(2), and shmctl(2).
+- `sync`: Monitors calls to the syscalls sync(2), fsync(2), fdatasync(2), syncfs(2), msync(2), and sync_file_range(2).
+- `network viewer`: This eBPF program creates charts with information about `TCP` and `UDP` functions, including the
+  bandwidth consumed by each.
+- `vfs`: This eBPF program creates charts that show information about VFS (Virtual File System) functions.
+- `process`: This eBPF program creates charts that show information about process life. When in `return` mode, it also
+  creates charts showing errors when these operations are executed.
+- `hardirq`: This eBPF program creates charts that show information about time spent servicing individual hardware
+  interrupt requests (hard IRQs).
+- `softirq`: This eBPF program creates charts that show information about time spent servicing individual software
+  interrupt requests (soft IRQs).
+- `oomkill`: This eBPF program creates a chart that shows OOM kills for all applications recognized via
+  the `apps.plugin` integration. Note that this program will show application charts regardless of whether apps
+  integration is turned on or off.
+
+You can also enable the following eBPF programs:
+
+- `cachestat`: Netdata's eBPF data collector creates charts about the memory page cache. When the integration with
+  [`apps.plugin`](/collectors/apps.plugin/README.md) is enabled, this collector creates charts for the whole host _and_
+  for each application.
+- `dcstat`: This eBPF program creates charts that show information about file access using the directory cache. It appends
+  `kprobes` for `lookup_fast()` and `d_lookup()` to identify whether files are inside the directory cache, outside of
+  it, or not found.
+- `disk`: This eBPF program creates charts that show information about disk latency independent of filesystem.
+- `filesystem`: This eBPF program creates charts that show information about some filesystem latency.
+- `swap`: This eBPF program creates charts that show information about swap access.
+- `mdflush`: This eBPF program creates charts that show information about
+  multi-device software flushes.

## Thread configuration

-You can configure each thread of the eBPF data collector by editing either the `cachestat.conf`, `process.conf`,
+You can configure each thread of the eBPF data collector by editing either the `cachestat.conf`, `process.conf`, or
`network.conf` files. Use [`edit-config`](/docs/configure/nodes.md) from your Netdata config directory:

```bash
@@ -225,10 +649,16 @@ The following configuration files are available:

- `cachestat.conf`: Configuration for the `cachestat` thread.
- `dcstat.conf`: Configuration for the `dcstat` thread.
+- `disk.conf`: Configuration for the `disk` thread.
+- `fd.conf`: Configuration for the `file descriptor` thread.
+- `filesystem.conf`: Configuration for the `filesystem` thread.
+- `hardirq.conf`: Configuration for the `hardirq` thread.
- `process.conf`: Configuration for the `process` thread.
-- `network.conf`: Configuration for the `network viewer` thread. This config file overwrites the global options and
-  also lets you specify which network the eBPF collector monitors.
+- `network.conf`: Configuration for the `network viewer` thread. This config file overwrites the global options and also
+  lets you specify which network the eBPF collector monitors.
+- `softirq.conf`: Configuration for the `softirq` thread.
- `sync.conf`: Configuration for the `sync` thread.
+- `vfs.conf`: Configuration for the `vfs` thread. ### Network configuration @@ -237,7 +667,7 @@ are divided in the following sections: #### `[network connections]` -You can configure the information shown on `outbound` and `inbound` charts with the settings in this section. +You can configure the information shown on `outbound` and `inbound` charts with the settings in this section. ```conf [network connections] @@ -249,24 +679,24 @@ You can configure the information shown on `outbound` and `inbound` charts with ``` When you define a `ports` setting, Netdata will collect network metrics for that specific port. For example, if you -write `ports = 19999`, Netdata will collect only connections for itself. The `hostnames` setting accepts -[simple patterns](/libnetdata/simple_pattern/README.md). The `ports`, and `ips` settings accept negation (`!`) to - deny specific values or asterisk alone to define all values. +write `ports = 19999`, Netdata will collect only connections for itself. The `hostnames` setting accepts +[simple patterns](/libnetdata/simple_pattern/README.md). The `ports`, and `ips` settings accept negation (`!`) to deny +specific values or asterisk alone to define all values. In the above example, Netdata will collect metrics for all ports between 1 and 443, with the exception of 53 (domain) and 145. The following options are available: -- `ports`: Define the destination ports for Netdata to monitor. -- `hostnames`: The list of hostnames that can be resolved to an IP address. -- `ips`: The IP or range of IPs that you want to monitor. You can use IPv4 or IPv6 addresses, use dashes to define a - range of IPs, or use CIDR values. The default behavior is to only collect data for private IP addresses, but this - can be changed with the `ips` setting. - -By default, Netdata displays up to 500 dimensions on network connection charts. If there are more possible dimensions, -they will be bundled into the `other` dimension. You can increase the number of shown dimensions by changing the `maximum -dimensions` setting. +- `ports`: Define the destination ports for Netdata to monitor. +- `hostnames`: The list of hostnames that can be resolved to an IP address. +- `ips`: The IP or range of IPs that you want to monitor. You can use IPv4 or IPv6 addresses, use dashes to define a + range of IPs, or use CIDR values. The default behavior is to only collect data for private IP addresses, but this can + be changed with the `ips` setting. + +By default, Netdata displays up to 500 dimensions on network connection charts. If there are more possible dimensions, +they will be bundled into the `other` dimension. You can increase the number of shown dimensions by changing +the `maximum dimensions` setting. The dimensions for the traffic charts are created using the destination IPs of the sockets by default. This can be changed setting `resolve hostname ips = yes` and restarting Netdata, after this Netdata will create dimensions using @@ -274,8 +704,9 @@ the `hostnames` every time that is possible to resolve IPs to their hostnames. #### `[service name]` -Netdata uses the list of services in `/etc/services` to plot network connection charts. If this file does not contain the -name for a particular service you use in your infrastructure, you will need to add it to the `[service name]` section. +Netdata uses the list of services in `/etc/services` to plot network connection charts. 
If this file does not contain
+the name for a particular service you use in your infrastructure, you will need to add it to the `[service name]`
+section.

For example, Netdata's default port (`19999`) is not listed in `/etc/services`. To associate that port with the Netdata
service in network connection charts, and thus see the name of the service instead of its port, define it:

@@ -287,7 +718,7 @@ service in network connection charts, and thus see the name of the service inste

### Sync configuration

-The sync configuration has specific options to disable monitoring for syscalls, as default option all syscalls are
+The sync configuration has specific options to disable monitoring for syscalls; by default, all syscalls are
monitored.

```conf
@@ -300,6 +731,22 @@ monitored.
    sync_file_range = yes
```

+### Filesystem configuration
+
+The filesystem configuration has specific options to disable monitoring for filesystems; by default, all filesystems are
+monitored.
+
+```conf
+[filesystem]
+  btrfsdist = yes
+  ext4dist = yes
+  nfsdist = yes
+  xfsdist = yes
+  zfsdist = yes
+```
+
+The eBPF program `nfsdist` monitors only `nfs` mount points.
+
## Troubleshooting

If the eBPF collector does not work, you can troubleshoot it by running the `ebpf.plugin` command and investigating its

@@ -330,17 +777,18 @@ curl -sSL https://raw.githubusercontent.com/netdata/kernel-collector/master/tool

If this script returns no output, your system is ready to compile and run the eBPF collector.

-If you see a warning about a missing kernel configuration (`KPROBES KPROBES_ON_FTRACE HAVE_KPROBES BPF BPF_SYSCALL
-BPF_JIT`), you will need to recompile your kernel to support this configuration. The process of recompiling Linux
-kernels varies based on your distribution and version. Read the documentation for your system's distribution to learn
-more about the specific workflow for recompiling the kernel, ensuring that you set all the necessary
+If you see a warning about a missing kernel
+configuration (`KPROBES KPROBES_ON_FTRACE HAVE_KPROBES BPF BPF_SYSCALL BPF_JIT`), you will need to recompile your kernel
+to support this configuration. The process of recompiling Linux kernels varies based on your distribution and version.
+Read the documentation for your system's distribution to learn more about the specific workflow for recompiling the
+kernel, ensuring that you set all the necessary configuration options:

-- [Ubuntu](https://wiki.ubuntu.com/Kernel/BuildYourOwnKernel)
-- [Debian](https://kernel-team.pages.debian.net/kernel-handbook/ch-common-tasks.html#s-common-official)
-- [Fedora](https://fedoraproject.org/wiki/Building_a_custom_kernel)
-- [CentOS](https://wiki.centos.org/HowTos/Custom_Kernel)
-- [Arch Linux](https://wiki.archlinux.org/index.php/Kernel/Traditional_compilation)
-- [Slackware](https://docs.slackware.com/howtos:slackware_admin:kernelbuilding)
+- [Ubuntu](https://wiki.ubuntu.com/Kernel/BuildYourOwnKernel)
+- [Debian](https://kernel-team.pages.debian.net/kernel-handbook/ch-common-tasks.html#s-common-official)
+- [Fedora](https://fedoraproject.org/wiki/Building_a_custom_kernel)
+- [CentOS](https://wiki.centos.org/HowTos/Custom_Kernel)
+- [Arch Linux](https://wiki.archlinux.org/index.php/Kernel/Traditional_compilation)
+- [Slackware](https://docs.slackware.com/howtos:slackware_admin:kernelbuilding)

### Mount `debugfs` and `tracefs`

@@ -353,19 +801,20 @@ sudo mount -t debugfs nodev /sys/kernel/debug
sudo mount -t tracefs nodev /sys/kernel/tracing
```

If they are already mounted, you will see an error.
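One way to check whether they are already mounted, assuming `findmnt` from util-linux is available:

```bash
# Show any existing debugfs/tracefs mounts; no output means they still need to be mounted.
findmnt -t debugfs,tracefs
```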
You can also configure your system's `/etc/fstab` configuration to -mount these filesystems on startup. More information can be found in the [ftrace documentation](https://www.kernel.org/doc/Documentation/trace/ftrace.txt). +mount these filesystems on startup. More information can be found in +the [ftrace documentation](https://www.kernel.org/doc/Documentation/trace/ftrace.txt). ## Performance -eBPF monitoring is complex and produces a large volume of metrics. We've discovered scenarios where the eBPF plugin +eBPF monitoring is complex and produces a large volume of metrics. We've discovered scenarios where the eBPF plugin significantly increases kernel memory usage by several hundred MB. -If your node is experiencing high memory usage and there is no obvious culprit to be found in the `apps.mem` chart, -consider testing for high kernel memory usage by [disabling eBPF monitoring](#configuration). Next, -[restart Netdata](/docs/configure/start-stop-restart.md) with `sudo systemctl restart netdata` to see if system -memory usage (see the `system.ram` chart) has dropped significantly. +If your node is experiencing high memory usage and there is no obvious culprit to be found in the `apps.mem` chart, +consider testing for high kernel memory usage by [disabling eBPF monitoring](#configuration). Next, +[restart Netdata](/docs/configure/start-stop-restart.md) with `sudo systemctl restart netdata` to see if system memory +usage (see the `system.ram` chart) has dropped significantly. -Beginning with `v1.31`, kernel memory usage is configurable via the [`pid table size` setting](#ebpf-load-mode) +Beginning with `v1.31`, kernel memory usage is configurable via the [`pid table size` setting](#ebpf-load-mode) in `ebpf.conf`. ## SELinux @@ -423,7 +872,7 @@ allow unconfined_service_t self:bpf { map_create map_read map_write prog_load pr Then compile your `netdata_ebpf.te` file with the following commands to create a binary that loads the new policies: ```bash -# checkmodule -M -m -o netdata_ebpf.mod netdata_ebpf.te +# checkmodule -M -m -o netdata_ebpf.mod netdata_ebpf.te # semodule_package -o netdata_ebpf.pp -m netdata_ebpf.mod ``` @@ -450,9 +899,4 @@ shows how the lockdown module impacts `ebpf.plugin` based on the selected option If you or your distribution compiled the kernel with the last combination, your system cannot load shared libraries required to run `ebpf.plugin`. -## Cleaning `kprobe_events` -The eBPF collector adds entries to the file `/sys/kernel/debug/tracing/kprobe_events`, and cleans them on exit, unless -another process prevents it. If you need to clean the eBPF entries safely, you can manually run the script -`/usr/libexec/netdata/plugins.d/reset_netdata_trace.sh`. - [![analytics](https://www.google-analytics.com/collect?v=1&aip=1&t=pageview&_s=1&ds=github&dr=https%3A%2F%2Fgithub.com%2Fnetdata%2Fnetdata&dl=https%3A%2F%2Fmy-netdata.io%2Fgithub%2Fcollectors%2Febpf.plugin%2FREADME&_u=MAC~&cid=5792dfd7-8dc4-476b-af31-da2fdb9f93d2&tid=UA-64295674-3)](<>) |