summaryrefslogtreecommitdiffstats
path: root/health/guides/cpu/10min_cpu_iowait.md
diff options
context:
space:
mode:
Diffstat (limited to 'health/guides/cpu/10min_cpu_iowait.md')
-rw-r--r--health/guides/cpu/10min_cpu_iowait.md36
1 files changed, 36 insertions, 0 deletions
diff --git a/health/guides/cpu/10min_cpu_iowait.md b/health/guides/cpu/10min_cpu_iowait.md
new file mode 100644
index 000000000..b05530e84
--- /dev/null
+++ b/health/guides/cpu/10min_cpu_iowait.md
@@ -0,0 +1,36 @@
+### Understand the alert
+
+This alarm calculates the average time of `iowait` through 10 minute interval periods. `iowait` is the percentage of time where there has been at least one I/O request in progress while the CPU has been idle.
+
+I/O -at a process level- is the use of the read and write services, such as reading data from a physical drive.
+
+It's important to note that during the time a process waits on I/O, the system can schedule other processes, but `iowait` is measured specifically while the CPU is idle.
+
+A common example of when this alert might be triggered would be when your CPU requests some data and the device responsible for it can't deliver it fast enough. As a result the CPU (in the next clock interrupt) is idle, so you
+encounter `iowait`. If this persists for some time and the average from the metrics we gather exceeds the value that is being checked in the `.conf` file, then the alert is raised because the CPU is being bottlenecked by your system’s disks.
+
+### Troubleshooting Section
+
+- Check for main I/O related processes and hardware issues
+
+Generally, this issue is caused by having slow hard drives that cannot keep up with the speed of your CPU. You can see the percentage of `iowait` by going to your node on Netdata Cloud and clicking the `iowait` dimension under the Total CPU Utilization chart.
+
+- You can use `vmstat` (or `vmstat 1`, to set a delay between updates in seconds)
+
+The `procs` column, shows the number of processes blocked waiting for I/O to complete.
+
+After that, you can use `ps` and specifically `ps -eo s,user,cmd | grep ^[D]` to fetch the processes that their state code starts with `D` which means uninterruptible sleep (usually IO).
+
+- It could be helpful to close any of the main consumer processes, but Netdata strongly suggests knowing exactly what processes you are closing and being certain that they are not necessary.
+
+- If you see that you don't have a lot of processes that you can terminate (or you need them for your workflow), then you would have to upgrade your system’s drives; if you have an HDD, upgrading to an SSD or an NVME drive would make a great impact on this metric.
+
+### Are you operating a database?
+
+In a database environment, you would want to optimize your operations. Check for potential inserts on large data sets, keeping in mind that `write` operations take more time than `read`. You should also search for
+ complex requests, like large joins and queries over a big data set. These can introduce `iowait` and need to be optimized.
+
+### Useful resources
+
+- [What exactly is "iowait"?](https://serverfault.com/questions/12679/can-anyone-explain-precisely-what-iowait-is)
+