author Daniel Baumann <daniel.baumann@progress-linux.org> 2024-03-09 13:19:22 +0000
committer Daniel Baumann <daniel.baumann@progress-linux.org> 2024-03-09 13:19:22 +0000
commit c21c3b0befeb46a51b6bf3758ffa30813bea0ff0
tree 9754ff1ca740f6346cf8483ec915d4054bc5da2d /health/guides/cpu/10min_cpu_iowait.md
parent Adding upstream version 1.43.2.
Adding upstream version 1.44.3. (upstream/1.44.3)
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'health/guides/cpu/10min_cpu_iowait.md')
-rw-r--r-- health/guides/cpu/10min_cpu_iowait.md | 36
1 files changed, 36 insertions, 0 deletions
diff --git a/health/guides/cpu/10min_cpu_iowait.md b/health/guides/cpu/10min_cpu_iowait.md
new file mode 100644
index 000000000..b05530e84
--- /dev/null
+++ b/health/guides/cpu/10min_cpu_iowait.md
@@ -0,0 +1,36 @@
+### Understand the alert
+
+This alert calculates the average `iowait` over the last 10 minutes. `iowait` is the percentage of time during which the CPU was idle while at least one I/O request was in progress.
+
+I/O, at the process level, is the use of the read and write services, such as reading data from a physical drive.
+
+It's important to note that during the time a process waits on I/O, the system can schedule other processes, but `iowait` is measured specifically while the CPU is idle.
+
+A common example of when this alert might be triggered is when your CPU requests some data and the device responsible for it can't deliver it fast enough. As a result the CPU (in the next clock interrupt) is idle, so you
+encounter `iowait`. If this persists and the 10-minute average exceeds the threshold defined in the alert's `.conf` file, the alert is raised because the CPU is being bottlenecked by your system’s disks.
+
+### Troubleshooting Section
+
+- Check for main I/O related processes and hardware issues
+
+Generally, this issue is caused by having slow hard drives that cannot keep up with the speed of your CPU. You can see the percentage of `iowait` by going to your node on Netdata Cloud and clicking the `iowait` dimension under the Total CPU Utilization chart.
+
+- You can use `vmstat` (or `vmstat 1`, to print a new report every second)
+
+In the output, the `b` column under `procs` shows the number of processes in uninterruptible sleep, which usually means they are blocked waiting for I/O to complete. The `wa` column under `cpu` shows the percentage of time spent in `iowait`.
+
+After that, you can use `ps`, specifically `ps -eo s,user,cmd | grep '^D'`, to list the processes whose state code starts with `D`, which means uninterruptible sleep (usually I/O).
+
+- It could be helpful to close any of the main consumer processes, but Netdata strongly suggests knowing exactly what processes you are closing and being certain that they are not necessary.
+
+- If you see that you don't have a lot of processes that you can terminate (or you need them for your workflow), then you would have to upgrade your system’s drives; if you have an HDD, upgrading to an SSD or an NVMe drive would make a great impact on this metric.
+
+### Are you operating a database?
+
+In a database environment, you would want to optimize your operations. Check for inserts on large data sets, keeping in mind that `write` operations typically take more time than `read` operations. You should also look for
+ complex requests, such as large joins and queries over big data sets. These can introduce `iowait` and need to be optimized.
+
+### Useful resources
+
+- [What exactly is "iowait"?](https://serverfault.com/questions/12679/can-anyone-explain-precisely-what-iowait-is)
+
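
The guide's "Understand the alert" section describes `iowait` as a percentage of CPU time. As a rough illustration of where such a number can come from, the sketch below computes it from two samples of the aggregate `cpu` line in `/proc/stat` on Linux. It is not Netdata's implementation; the function names `parse_cpu_line` and `iowait_percent` are hypothetical, and it covers a single interval rather than the 10-minute average the alert uses.

```python
def parse_cpu_line(line):
    """Return the counters from the aggregate 'cpu' line of /proc/stat as ints."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    return [int(value) for value in fields[1:]]


def iowait_percent(before, after):
    """Percentage of CPU time accounted as iowait between two samples."""
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest and guest_nice are already counted inside user and nice,
    # so only the first eight fields are summed for the total.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total
```

To try it on a live system, read the first line of `/proc/stat` twice a few seconds apart and pass both strings to `iowait_percent`.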
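
The `vmstat` step in the troubleshooting section can also be scripted. The sketch below parses the default `vmstat` table by column name, so the blocked-process count (`b`) and the iowait percentage (`wa`) can be read without counting columns by hand. `parse_vmstat` is a hypothetical helper, and it assumes the default output layout with numeric columns only (for example, no timestamp option).

```python
def parse_vmstat(output):
    """Return one dict per sample row, keyed by vmstat's column names."""
    names = None
    rows = []
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "r":
            # Column-name row, e.g. "r b swpd free ..."; vmstat repeats it periodically.
            names = fields
        elif names and fields[0].isdigit():
            rows.append(dict(zip(names, (int(f) for f in fields))))
        # Anything else (the "procs ---memory--- ..." banner) is ignored.
    return rows
```

With the text captured from `vmstat 1 5`, something like `[(row["b"], row["wa"]) for row in parse_vmstat(text)]` gives the two values per sample.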
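
The `ps` filter from the troubleshooting section can be expressed the same way. This sketch takes the text produced by `ps -eo s,user,cmd` and keeps only the processes whose state code starts with `D` (uninterruptible sleep). The name `uninterruptible` is hypothetical, and the sketch assumes the three-column layout that this exact `ps` invocation prints.

```python
def uninterruptible(ps_output):
    """Return (user, cmd) pairs for processes whose state code starts with D."""
    blocked = []
    for line in ps_output.splitlines():
        # Split into state, user, and the full command line (which may contain spaces).
        parts = line.split(None, 2)
        # The header row begins with "S", so it never matches the D filter.
        if len(parts) == 3 and parts[0].startswith("D"):
            blocked.append((parts[1], parts[2]))
    return blocked
```

Running it on a few captures taken seconds apart helps separate processes that are persistently blocked on I/O from ones that were only briefly in the `D` state.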