diff options
Diffstat (limited to 'docs/guides/troubleshoot')
-rw-r--r-- | docs/guides/troubleshoot/monitor-debug-applications-ebpf.md | 270 | ||||
-rw-r--r-- | docs/guides/troubleshoot/troubleshooting-agent-with-cloud-connection.md | 117 |
2 files changed, 387 insertions, 0 deletions
diff --git a/docs/guides/troubleshoot/monitor-debug-applications-ebpf.md b/docs/guides/troubleshoot/monitor-debug-applications-ebpf.md new file mode 100644 index 0000000..3ebca54 --- /dev/null +++ b/docs/guides/troubleshoot/monitor-debug-applications-ebpf.md @@ -0,0 +1,270 @@ +<!-- +title: "Monitor, troubleshoot, and debug applications with eBPF metrics" +description: "Use Netdata's built-in eBPF metrics collector to monitor, troubleshoot, and debug your custom application using low-level kernel feedback." +image: /img/seo/guides/troubleshoot/monitor-debug-applications-ebpf.png +custom_edit_url: https://github.com/netdata/netdata/edit/master/docs/guides/troubleshoot/monitor-debug-applications-ebpf.md +--> + +# Monitor, troubleshoot, and debug applications with eBPF metrics + +When trying to troubleshoot or debug a finicky application, there's no such thing as too much information. At Netdata, +we developed programs that connect to the [_extended Berkeley Packet Filter_ (eBPF) virtual +machine](/collectors/ebpf.plugin/README.md) to help you see exactly how specific applications are interacting with the +Linux kernel. With these charts, you can root out bugs, discover optimizations, diagnose memory leaks, and much more. + +This means you can see exactly how often, and in what volume, the application creates processes, opens files, writes to +filesystem using virtual filesystem (VFS) functions, and much more. Even better, the eBPF collector gathers metrics at +an _event frequency_, which is even faster than Netdata's beloved 1-second granularity. When you troubleshoot and debug +applications with eBPF, rest assured you miss not even the smallest meaningful event. + +Using this guide, you'll learn the fundamentals of setting up Netdata to give you kernel-level metrics from your +application so that you can monitor, troubleshoot, and debug to your heart's content. + +## Configure `apps.plugin` to recognize your custom application + +To start troubleshooting an application with eBPF metrics, you need to ensure your Netdata dashboard collects and +displays those metrics independent from any other process. + +You can use the `apps_groups.conf` file to configure which applications appear in charts generated by +[`apps.plugin`](/collectors/apps.plugin/README.md). Once you edit this file and create a new group for the application +you want to monitor, you can see how it's interacting with the Linux kernel via real-time eBPF metrics. + +Let's assume you have an application that runs on the process `custom-app`. To monitor eBPF metrics for that application +separate from any others, you need to create a new group in `apps_groups.conf` and associate that process name with it. + +Open the `apps_groups.conf` file in your Netdata configuration directory. + +```bash +cd /etc/netdata # Replace this path with your Netdata config directory +sudo ./edit-config apps_groups.conf +``` + +Scroll down past the explanatory comments and stop when you see `# NETDATA processes accounting`. Above that, paste in +the following text, which creates a new `dev` group with the `custom-app` process. Replace `custom-app` with the name of +your application's process name. + +Your file should now look like this: + +```conf +... +# ----------------------------------------------------------------------------- +# Custom applications to monitor with apps.plugin and ebpf.plugin + +dev: custom-app + +# ----------------------------------------------------------------------------- +# NETDATA processes accounting +... +``` + +Restart Netdata with `sudo systemctl restart netdata`, or the [appropriate +method](/docs/configure/start-stop-restart.md) for your system, to begin seeing metrics for this particular +group+process. You can also add additional processes to the same group. + +You can set up `apps_groups.conf` to more show more precise eBPF metrics for any application or service running on your +system, even if it's a standard package like Redis, Apache, or any other [application/service Netdata collects +from](/collectors/COLLECTORS.md). + +```conf +# ----------------------------------------------------------------------------- +# Custom applications to monitor with apps.plugin and ebpf.plugin + +dev: custom-app +database: *redis* +apache: *apache* + +# ----------------------------------------------------------------------------- +# NETDATA processes accounting +... +``` + +Now that you have `apps_groups.conf` set up to monitor your application/service, you can also set up the eBPF collector +to show other charts that will help you debug and troubleshoot how it interacts with the Linux kernel. + +## Configure the eBPF collector to monitor errors + +The eBPF collector has [two possible modes](/collectors/ebpf.plugin#ebpf-load-mode): `entry` and `return`. The default +is `entry`, and only monitors calls to kernel functions, but the `return` also monitors and charts _whether these calls +return in error_. + +Let's turn on the `return` mode for more granularity when debugging Firefox's behavior. + +```bash +cd /etc/netdata # Replace this path with your Netdata config directory +sudo ./edit-config ebpf.d.conf +``` + +Replace `entry` with `return`: + +```conf +[global] + ebpf load mode = return + disable apps = no + +[ebpf programs] + process = yes + network viewer = yes +``` + +Restart Netdata with `sudo systemctl restart netdata`, or the [appropriate +method](/docs/configure/start-stop-restart.md) for your system. + +## Get familiar with per-application eBPF metrics and charts + +Visit the Netdata dashboard at `http://NODE:19999`, replacing `NODE` with the hostname or IP of the system you're using +to monitor this application. Scroll down to the **Applications** section. These charts now feature a `firefox` dimension +with metrics specific to that process. + +Pay particular attention to the charts in the **ebpf file**, **ebpf syscall**, **ebpf process**, and **ebpf net** +sub-sections. These charts are populated by low-level Linux kernel metrics thanks to eBPF, and showcase the volume of +calls to open/close files, call functions like `do_fork`, IO activity on the VFS, and much more. + +See the [eBPF collector documentation](/collectors/ebpf.plugin/README.md#integration-with-appsplugin) for the full list +of per-application charts. + +Let's show some examples of how you can first identify normal eBPF patterns, then use that knowledge to identify +anomalies in a few simulated scenarios. + +For example, the following screenshot shows the number of open files, failures to open files, and closed files on a +Debian 10 system. The first spike is from configuring/compiling a small C program, then from running Apache's `ab` tool +to benchmark an Apache web server. + +![An example of eBPF +charts](https://user-images.githubusercontent.com/1153921/85311677-a8380c80-b46a-11ea-9735-babaedc22fdb.png) + +In these charts, you can see first a spike in syscalls to open and close files from the configure/build process, +followed by a similar spike from the Apache benchmark. + +> 👋 Don't forget that you can view chart data directly via Netdata's API! +> +> For example, open your browser and navigate to `http://NODE:19999/api/v1/data?chart=apps.file_open`, replacing `NODE` +> with the IP address or hostname of your Agent. The API returns JSON of that chart's dimensions and metrics, which you +> can use in other operations. +> +> To see other charts, replace `apps.file_open` with the context of the chart you want to see data for. +> +> To see all the API options, visit our [Swagger +> documentation](https://editor.swagger.io/?url=https://raw.githubusercontent.com/netdata/netdata/master/web/api/netdata-swagger.yaml) +> and look under the **/data** section. + +## Troubleshoot and debug applications with eBPF + +The actual method of troubleshooting and debugging any application with Netdata's eBPF metrics depends on the +application, its place within your stack, and the type of issue you're trying to root cause. This guide won't be able to +explain how to troubleshoot _any_ application with eBPF metrics, but it should give you some ideas on how to start with +your own systems. + +The value of using Netdata to collect and visualize eBPF metrics is that you don't have to rely on existing (complex) +command line eBPF programs or, even worse, write your own eBPF program to get the information you need. + +Let's walk through some scenarios where you might find value in eBPF metrics. + +### Benchmark application performance + +You can use eBPF metrics to profile the performance of your applications, whether they're custom or a standard Linux +service, like a web server or database. + +For example, look at the charts below. The first spike represents running a Redis benchmark _without_ pipelining +(`redis-benchmark -n 1000000 -t set,get -q`). The second spike represents the same benchmark _with_ pipelining +(`redis-benchmark -n 1000000 -t set,get -q -P 16`). + +![Screenshot of eBPF metrics during a Redis +benchmark](https://user-images.githubusercontent.com/1153921/84916168-91607700-b072-11ea-8fec-b76df89315aa.png) + +The performance optimization is clear from the speed at which the benchmark finished (the horizontal length of the +spike) and the reduced write/read syscalls and bytes written to disk. + +You can run similar performance benchmarks against any application, view the results on a Linux kernel level, and +continuously improve the performance of your infrastructure. + +### Inspect for leaking file descriptors + +If your application runs fine and then only crashes after a few hours, leaking file descriptors may be to blame. + +Check the **Number of open files (apps.file_open)** and **Files closed (apps.file_closed)** for discrepancies. These +metrics should be more or less equal. If they diverge, with more open files than closed, your application may not be +closing file descriptors properly. + +See, for example, the volume of files opened and closed by `apps.plugin` itself. Because the eBPF collector is +monitoring these syscalls at an event level, you can see at any given second that the open and closed numbers as equal. + +This isn't to say Netdata is _perfect_, but at least `apps.plugin` doesn't have a file descriptor problem. + +![Screenshot of open and closed file +descriptors](https://user-images.githubusercontent.com/1153921/84816048-c57f5d80-afc8-11ea-9684-d2b923d5d2b2.png) + +### Pin down syscall failures + +If you enabled the eBPF collector's `return` mode as mentioned [in a previous +step](#configure-the-ebpf-collector-to-monitor-errors), you can view charts related to how often a given application's +syscalls return in failure. + +By understanding when these failures happen, and when, you might be able to diagnose a bug in your application. + +To diagnose potential issues with an application, look at the **Fails to open files (apps.file_open_error)**, **Fails to +close files (apps.file_close_error)**, **Fails to write (apps.vfs_write_error)**, and **Fails to read +(apps.vfs_read_error)** charts for failed syscalls coming from your application. If you see any, look to the surrounding +charts for anomalies at the same time frame, or correlate with other activity in the application or on the system to get +closer to the root cause. + +### Investigate zombie processes + +Look for the trio of **Process started (apps.process_create)**, **Threads started (apps.thread_create)**, and **Tasks +closed (apps.task_close)** charts to investigate situations where an application inadvertently leaves [zombie +processes](https://en.wikipedia.org/wiki/Zombie_process). + +These processes, which are terminated and don't use up system resources, can still cause issues if your system runs out +of available PIDs to allocate. + +For example, the chart below demonstrates a [zombie factory +program](https://www.refining-linux.org/archives/7-Dr.-Frankenlinux-or-how-to-create-zombie-processes.html) in action. + +![Screenshot of eBPF showing evidence of a zombie +process](https://user-images.githubusercontent.com/1153921/84831957-27e45800-afe1-11ea-9fe2-fdd910915366.png) + +Starting at 14:51:49, Netdata sees the `zombie` group creating one new process every second, but no closed tasks. This +continues for roughly 30 seconds, at which point the factory program was killed with `SIGINT`, which results in the 31 +closed tasks in the subsequent second. + +Zombie processes may not be catastrophic, but if you're developing an application on Linux, you should eliminate them. +If a service in your stack creates them, you should consider filing a bug report. + +## View eBPF metrics in Netdata Cloud + +You can also show per-application eBPF metrics in Netdata Cloud. This could be particularly useful if you're running the +same application on multiple systems and want to correlate how it performs on each target, or if you want to share your +findings with someone else on your team. + +If you don't already have a Netdata Cloud account, go [sign in](https://app.netdata.cloud) and get started for free. +Read the [get started with Cloud guide](https://learn.netdata.cloud/docs/cloud/get-started) for a walkthrough of +connecting nodes to and other fundamentals. + +Once you've added one or more nodes to a Space in Netdata Cloud, you can see aggregated eBPF metrics in the [Overview +dashboard](/docs/visualize/overview-infrastructure.md) under the same **Applications** or **eBPF** sections that you +find on the local Agent dashboard. Or, [create new dashboards](/docs/visualize/create-dashboards.md) using eBPF metrics +from any number of distributed nodes to see how your application interacts with multiple Linux kernels on multiple Linux +systems. + +Now that you can see eBPF metrics in Netdata Cloud, you can [invite your +team](https://learn.netdata.cloud/docs/cloud/manage/invite-your-team) and share your findings with others. + +## What's next? + +Debugging and troubleshooting an application takes a special combination of practice, experience, and sheer luck. With +Netdata's eBPF metrics to back you up, you can rest assured that you see every minute detail of how your application +interacts with the Linux kernel. + +If you're still trying to wrap your head around what we offer, be sure to read up on our accompanying documentation and +other resources on eBPF monitoring with Netdata: + +- [eBPF collector](/collectors/ebpf.plugin/README.md) +- [eBPF's integration with `apps.plugin`](/collectors/apps.plugin/README.md#integration-with-ebpf) +- [Linux eBPF monitoring with Netdata](https://www.netdata.cloud/blog/linux-ebpf-monitoring-with-netdata/) + +The scenarios described above are just the beginning when it comes to troubleshooting with eBPF metrics. We're excited +to explore others and see what our community dreams up. If you have other use cases, whether simulated or real-world, +we'd love to hear them: [info@netdata.cloud](mailto:info@netdata.cloud). + +Happy troubleshooting! + + diff --git a/docs/guides/troubleshoot/troubleshooting-agent-with-cloud-connection.md b/docs/guides/troubleshoot/troubleshooting-agent-with-cloud-connection.md new file mode 100644 index 0000000..3bb5ace --- /dev/null +++ b/docs/guides/troubleshoot/troubleshooting-agent-with-cloud-connection.md @@ -0,0 +1,117 @@ +<!-- +title: "Troubleshoot Agent-Cloud connectivity issues" +description: "A simple guide to troubleshoot occurrences where the Agent is showing as offline after claiming." +custom_edit_url: https://github.com/netdata/netdata/edit/master/guides/troubleshoot/troubleshooting-agent-with-cloud-connection.md +--> + +# Troubleshoot Agent-Cloud connectivity issues + +When you are claiming a node, you might not be able to immediately see it online in Netdata Cloud. +This could be due to an error in the claiming process or a temporary outage of some services. + +We identified some scenarios that might cause this delay and possible actions you could take to overcome each situation. + +The most common explanation for the delay usually falls into one of the following three categories: + +- [The claiming process of the kickstart script was unsuccessful](#the-claiming-process-of-the-kickstart-script-was-unsuccessful) +- [Claiming on an older, deprecated version of the Agent](#claiming-on-an-older-deprecated-version-of-the-agent) +- [Network issues while connecting to the Cloud](#network-issues-while-connecting-to-the-cloud) + +## The claiming process of the kickstart script was unsuccessful + +Here, we will try to define some edge cases you might encounter when claiming a node. + +### The kickstart script auto-claimed the Agent but there was no error message displayed + +The kickstart script will install/update your Agent and then try to claim the node to the Cloud (if tokens are provided). To +complete the second part, the Agent must be running. In some platforms, the Netdata service cannot be enabled by default +and you must do it manually, using the following steps: + +1. Check if the Agent is running: + + ```bash + systemctl status netdata + ``` + + The expected output should contain info like this: + + ```bash + Active: active (running) since Wed 2022-07-06 12:25:02 EEST; 1h 40min ago + ``` + +2. Enable and start the Netdata Service. + + ```bash + systemctl enable netdata + systemctl start netdata + ``` + +3. Retry the kickstart claiming process. + +:::note + +In some cases a simple restart of the Agent can fix the issue. +Read more about [Starting, Stopping and Restarting the Agent](/docs/configure/start-stop-restart.md). + +::: + +## Claiming on an older, deprecated version of the Agent + +Make sure that you are using the latest version of Netdata if you are using the [Claiming script](https://learn.netdata.cloud/docs/agent/claim#claiming-script). + +With the introduction of our new architecture, Agents running versions lower than `v1.32.0` can face claiming problems, so we recommend you [update the Netdata Agent](https://learn.netdata.cloud/docs/agent/packaging/installer/update) to the latest stable version. + +## Network issues while connecting to the Cloud + +### Verify that your IP is whitelisted from Netdata Cloud + +Most of the nodes change IPs dynamically. It is possible that your current IP has been restricted from accessing `api.netdata.cloud` due to security concerns. + +To verify this: + +1. Check the Agent's `aclk-state`. + + ```bash + sudo netdatacli aclk-state | grep "Banned By Cloud" + ``` + + The output will contain a line indicating if the IP is banned from `api.netdata.cloud`: + + ```bash + Banned By Cloud: yes + ``` + +2. If your node's IP is banned, you can: + + - Contact our team to whitelist your IP by submitting a ticket in the [Netdata forum](https://community.netdata.cloud/) + - Change your node's IP + +### Make sure that your node has internet connectivity and can resolve network domains + +1. Try to reach a well known host: + + ```bash + ping 8.8.8.8 + ``` + +2. If you can reach external IPs, then check your domain resolution. + + ```bash + host api.netdata.cloud + ``` + + The expected output should be something like this: + + ```bash + api.netdata.cloud is an alias for main-ingress-545609a41fcaf5d6.elb.us-east-1.amazonaws.com. + main-ingress-545609a41fcaf5d6.elb.us-east-1.amazonaws.com has address 54.198.178.11 + main-ingress-545609a41fcaf5d6.elb.us-east-1.amazonaws.com has address 44.207.131.212 + main-ingress-545609a41fcaf5d6.elb.us-east-1.amazonaws.com has address 44.196.50.41 + ``` + + :::info + + There will be cases in which the firewall restricts network access. In those cases, you need to whitelist `api.netdata.cloud` and `mqtt.netdata.cloud` domains to be able to see your nodes in Netdata Cloud. + If you can't whitelist domains in your firewall, you can whitelist the IPs that the above command will produce, but keep in mind that they can change without any notice. + + ::: |