From be1c7e50e1e8809ea56f2c9d472eccd8ffd73a97 Mon Sep 17 00:00:00 2001
From: Daniel Baumann
Date: Fri, 19 Apr 2024 04:57:58 +0200
Subject: Adding upstream version 1.44.3.

Signed-off-by: Daniel Baumann
---
 health/guides/cpu/10min_cpu_iowait.md | 36 ++++++++++++++++++++++++++++++++++
 health/guides/cpu/10min_cpu_usage.md  | 37 +++++++++++++++++++++++++++++++++++
 health/guides/cpu/20min_steal_cpu.md  | 18 +++++++++++++++++
 3 files changed, 91 insertions(+)
 create mode 100644 health/guides/cpu/10min_cpu_iowait.md
 create mode 100644 health/guides/cpu/10min_cpu_usage.md
 create mode 100644 health/guides/cpu/20min_steal_cpu.md

(limited to 'health/guides/cpu')

diff --git a/health/guides/cpu/10min_cpu_iowait.md b/health/guides/cpu/10min_cpu_iowait.md
new file mode 100644
index 00000000..b05530e8
--- /dev/null
+++ b/health/guides/cpu/10min_cpu_iowait.md
@@ -0,0 +1,36 @@
+### Understand the alert
+
+This alarm calculates the average `iowait` over 10-minute intervals. `iowait` is the percentage of time during which the CPU has been idle while at least one I/O request was in progress.
+
+I/O, at the process level, is the use of read and write services, such as reading data from a physical drive.
+
+It's important to note that while a process waits on I/O, the system can schedule other processes, but `iowait` is measured specifically while the CPU is idle.
+
+A common example of when this alert might be triggered is when your CPU requests some data and the device responsible for it can't deliver it fast enough. As a result, the CPU (at the next clock interrupt) is idle, so you
+encounter `iowait`. If this persists for some time and the average of the collected metrics exceeds the threshold defined in the `.conf` file, the alert is raised because the CPU is being bottlenecked by your system’s disks.
+
+### Troubleshooting Section
+
+- Check for main I/O related processes and hardware issues
+
+Generally, this issue is caused by slow hard drives that cannot keep up with the speed of your CPU. You can see the percentage of `iowait` by going to your node on Netdata Cloud and clicking the `iowait` dimension under the Total CPU Utilization chart.
+
+- You can use `vmstat` (or `vmstat 1`, to set a delay between updates in seconds)
+
+The `b` column under `procs` shows the number of processes blocked waiting for I/O to complete.
+
+After that, you can use `ps`, specifically `ps -eo s,user,cmd | grep '^[D]'`, to list the processes whose state code starts with `D`, which means uninterruptible sleep (usually I/O).
+
+- It could be helpful to close some of the main consumer processes, but Netdata strongly suggests knowing exactly which processes you are closing and being certain that they are not necessary.
+
+- If there are few processes that you can terminate (or you need them for your workflow), you would have to upgrade your system’s drives; if you have an HDD, upgrading to an SSD or an NVMe drive would greatly improve this metric.
+
+### Are you operating a database?
+
+In a database environment, you would want to optimize your operations. Check for potential inserts on large data sets, keeping in mind that `write` operations take more time than `read` operations. You should also search for
+complex requests, like large joins and queries over a big data set. These can introduce `iowait` and need to be optimized.
+
+### Useful resources
+
+- [What exactly is "iowait"?](https://serverfault.com/questions/12679/can-anyone-explain-precisely-what-iowait-is)
+
diff --git a/health/guides/cpu/10min_cpu_usage.md b/health/guides/cpu/10min_cpu_usage.md
new file mode 100644
index 00000000..17e153f6
--- /dev/null
+++ b/health/guides/cpu/10min_cpu_usage.md
@@ -0,0 +1,37 @@
+### Understand the alert
+
+This alarm calculates the average CPU utilization over a period of 10 minutes, **excluding** `iowait`, `nice` and `steal` values.
+
+*Note that on FreeBSD, the alert excludes only `nice`.*
+
+`iowait` is the percentage of time the CPU waits on a disk for I/O to complete; it occurs when the CPU is bottlenecked by the disk. During this time the CPU is idle, waiting only on the I/O.
+
+The `nice` value of a processor is the time it has spent running low-priority processes. Low-priority processes are those with a `nice` value greater than 0 (on UNIX-like systems, a higher `nice` value indicates a lower priority).
+
+`steal`, in a virtual machine, is the percentage of time that a virtual CPU has to wait for an available host CPU to run on. If this metric goes up, it means that your VM is not getting the processing power it needs.
+
+### Troubleshooting Section
+
+- Processes slowing down your CPU
+
+There are two primary cases in which this alarm is raised, and determining which applies to you requires understanding your own scenario.
+
+1. High CPU utilization with a high `nice` value means that the system is working through low-priority processes, and if a high-priority process needs CPU time, it can get it at any time.
+2. High CPU utilization with a low `nice` value means that the CPU is busy with high-priority processes, so new processes will have to wait for CPU time.
+
+The latter scenario is worth investigating to determine whether a process is slowing down your CPU. We suggest you go to your node on Netdata Cloud and click the `nice` dimension under the `Total CPU Utilization` chart to see the value. You can then check per-process CPU usage using `top`:
+
+If you're using Linux:
+```
+root@netdata~ # top -o +%CPU -i
+```
+
+And for FreeBSD:
+```
+root@netdata~ # top -o cpu -I
+```
+
+Here, you can see which processes are the main CPU consumers in the `CPU` column.
+
+It could be helpful to close some of the main consumer processes, but Netdata strongly suggests knowing exactly which processes you are closing and being certain that they are not necessary.
+
diff --git a/health/guides/cpu/20min_steal_cpu.md b/health/guides/cpu/20min_steal_cpu.md
new file mode 100644
index 00000000..e87c6f05
--- /dev/null
+++ b/health/guides/cpu/20min_steal_cpu.md
@@ -0,0 +1,18 @@
+### Understand the alert
+
+This alarm calculates the average CPU `steal` time over the last 20 minutes.
+
+`steal`, in a virtual machine, is the percentage of time that a virtual CPU has to wait for an available host CPU to run on. If this metric goes up, it means that your VM is not getting the processing power it needs.
+
+### Troubleshoot the alert
+
+Check for CPU quota and host issues.
+
+Generally, if `steal` is high, it could mean one of the following:
+
+- Another VM on the host system is hogging the CPU.
+- System services on the host system are monopolizing the CPU (for example, system updates).
+- The host CPUs are over-committed (you have more virtual CPUs assigned to VMs than the host system has physical CPUs) and too many VMs need CPU time simultaneously.
+- The VM itself has a CPU quota that is too low.
+
+To resolve this, you can increase the CPU resources of that particular VM and, if the alert persists, move the guest to a different *physical* server.
-- 
cgit v1.2.3