Diffstat (limited to 'health/guides/load')
-rw-r--r-- health/guides/load/load_average_1.md 51
-rw-r--r-- health/guides/load/load_average_15.md 55
-rw-r--r-- health/guides/load/load_average_5.md 66
-rw-r--r-- health/guides/load/load_cpu_number.md 48
4 files changed, 220 insertions, 0 deletions
diff --git a/health/guides/load/load_average_1.md b/health/guides/load/load_average_1.md
new file mode 100644
index 000000000..1f33f8ff5
--- /dev/null
+++ b/health/guides/load/load_average_1.md
@@ -0,0 +1,51 @@
+### Understand the alert
+
+This alert tracks the system `load average` (CPU and I/O demand) over a period of one minute. If you receive this alert, your system is "overloaded."
+
+### What does "load average" mean?
+
+On a Linux machine, the `system load average` measures the **number of threads that are currently working and those waiting to work** (on CPU, disk, or uninterruptible locks). Simply stated: **the system load average measures the number of threads that aren't idle.**
+
+### What does "overloaded" mean?
+
+Let's look at a single-core system and think of its core as a single lane on a bridge. In this example, a car represents a process:
+
+- At a load average of 0.5, traffic flows freely: the bridge is at 50% of its capacity.
+- At a load average of 1, the bridge is full: it is 100% utilized.
+- At a load average of 2 (remember, this is a single-core machine), one full lane of cars is crossing the bridge while **another** full lane of cars waits to cross.
+
+This is how you can picture CPU load, but keep in mind that `load average` also counts I/O demand, where an analogous picture applies.
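+
+The bridge picture extends to multi-core machines if you divide the load average by the number of cores. The following is a minimal sketch, assuming a Linux system where `/proc/loadavg` and `nproc` are available; `load_per_core` is a hypothetical helper name used only for illustration:
+
+```shell
+# load_per_core LOAD CORES -> prints the load average divided by the core count
+# (hypothetical helper, for illustration only)
+load_per_core() {
+  awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
+}
+
+# The first field of /proc/loadavg is the one-minute load average.
+if [ -r /proc/loadavg ] && command -v nproc >/dev/null; then
+  read -r load1 _ < /proc/loadavg
+  load_per_core "$load1" "$(nproc)"
+fi
+```
+
+A result around 1.00 corresponds to the full bridge. Because the load average also counts threads waiting on I/O, a high value does not by itself prove that the CPU is saturated.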
+
+### Troubleshoot the alert
+
+- Determine if the problem is CPU load or I/O load
+
+To get a report about your system statistics, use `vmstat` (or `vmstat 1` to print a new report every second).
+
+The `procs` columns show:
+- `r`: The number of runnable processes (running or waiting for run time).
+- `b`: The number of processes blocked waiting for I/O to complete.
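+
+As a rough illustration of how the `r` and `b` values can be read together, the sketch below compares them against the core count. The rule of thumb and the helper name `classify_load` are assumptions made for this example, not an exact diagnosis and not part of `vmstat`:
+
+```shell
+# classify_load R B CORES -> prints "cpu", "io", or "ok"
+# Hypothetical helper: a rough rule of thumb, not an exact diagnosis.
+classify_load() {
+  r=$1 b=$2 cores=$3
+  if [ "$r" -gt "$cores" ]; then
+    echo "cpu"   # more runnable processes than cores: CPU demand exceeds capacity
+  elif [ "$b" -gt 0 ]; then
+    echo "io"    # processes are blocked, usually waiting on I/O
+  else
+    echo "ok"
+  fi
+}
+
+# Use the last of three one-second samples; the first report shows averages since boot.
+if command -v vmstat >/dev/null; then
+  vmstat 1 3 | tail -n 1 | { read -r r b _; classify_load "$r" "$b" "$(nproc)"; }
+fi
+```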
+
+- Check per-process CPU/disk usage to find the top consumers
+
+1. To see the processes that are the main CPU consumers, use the task manager program `top` like this:
+
+ ```
+ top -o +%CPU -i
+ ```
+
+2. Use `iotop`:
+ `iotop` is a tool similar to `top` that monitors disk I/O usage per process. If you don't have it, [install it](https://www.tecmint.com/iotop-monitor-linux-disk-io-activity-per-process/) first.
+ ```
+ sudo iotop
+ ```
+
+3. Reduce the load by closing any unnecessary processes among the main consumers. Before closing a process, double-check that it is not needed.
+
+### Useful resources
+
+1. [UNIX Load Average Part 1: How It Works](https://www.helpsystems.com/resources/guides/unix-load-average-part-1-how-it-works)
+2. [UNIX Load Average Part 2: Not Your Average Average](https://www.helpsystems.com/resources/guides/unix-load-average-part-2-not-your-average-average)
+3. [Understanding Linux CPU Load](https://scoutapm.com/blog/understanding-load-averages)
+4. [Linux Load Averages: Solving the Mystery](https://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html)
+5. [Understanding Linux Process States](https://access.redhat.com/sites/default/files/attachments/processstates_20120831.pdf)
diff --git a/health/guides/load/load_average_15.md b/health/guides/load/load_average_15.md
new file mode 100644
index 000000000..ba8b1e3e0
--- /dev/null
+++ b/health/guides/load/load_average_15.md
@@ -0,0 +1,55 @@
+### Understand the alert
+
+This alert tracks the system `load average` (CPU and I/O demand) over a period of fifteen minutes. If you receive this alert, your system is "overloaded."
+
+The alert is raised to warning when the metric exceeds 2 times the expected value, and it is cleared when the metric drops to 1.75 times the expected value or below.
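+
+The two thresholds form a hysteresis band, which keeps the alert from flapping when the load hovers around the raise point. The sketch below illustrates that logic only; `next_status` is a hypothetical helper and is not Netdata's own expression:
+
+```shell
+# next_status LOAD EXPECTED STATUS -> prints "warning" or "clear"
+# Hypothetical helper illustrating the 2x raise / 1.75x clear thresholds.
+next_status() {
+  awk -v load="$1" -v expected="$2" -v status="$3" 'BEGIN {
+    # While already in warning, apply the lower clear threshold (1.75x).
+    factor = (status == "warning") ? 1.75 : 2
+    result = (load > expected * factor) ? "warning" : "clear"
+    print result
+  }'
+}
+
+next_status 8.5 4 clear     # above 2 x 4 = 8: raised to warning
+next_status 7.5 4 warning   # still above 1.75 x 4 = 7: stays in warning
+next_status 6.9 4 warning   # below 7: cleared
+```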
+
+For more information on how alerts are calculated, see the [documentation](https://learn.netdata.cloud/docs/agent/health/reference#expressions).
+
+### What does "load average" mean?
+
+On a Linux machine, the `system load average` measures the **number of threads that are currently working and those waiting to work** (on CPU, disk, or uninterruptible locks). Simply stated: **the system load average measures the number of threads that aren't idle.**
+
+### What does "overloaded" mean?
+
+Let's look at a single-core system and think of its core as a single lane on a bridge. In this example, a car represents a process:
+
+- At a load average of 0.5, traffic flows freely: the bridge is at 50% of its capacity.
+- At a load average of 1, the bridge is full: it is 100% utilized.
+- At a load average of 2 (remember, this is a single-core machine), one full lane of cars is crossing the bridge while **another** full lane of cars waits to cross.
+
+This is how you can picture CPU load, but keep in mind that `load average` also counts I/O demand, where an analogous picture applies.
+
+### Troubleshoot the alert
+
+- Determine if the problem is CPU load or I/O load
+
+To get a report about your system statistics, use `vmstat` (or `vmstat 1` to print a new report every second).
+
+The `procs` columns show:
+- `r`: The number of runnable processes (running or waiting for run time).
+- `b`: The number of processes blocked waiting for I/O to complete.
+
+- Check per-process CPU/disk usage to find the top consumers
+
+1. To see the processes that are the main CPU consumers, use the task manager program `top` like this:
+
+ ```
+ top -o +%CPU -i
+ ```
+
+2. Use `iotop`:
+ `iotop` is a tool similar to `top` that monitors disk I/O usage per process. If you don't have it, [install it](https://www.tecmint.com/iotop-monitor-linux-disk-io-activity-per-process/) first.
+ ```
+ sudo iotop
+ ```
+
+3. Reduce the load by closing any unnecessary processes among the main consumers. Before closing a process, double-check that it is not needed.
+
+### Useful resources
+
+1. [UNIX Load Average Part 1: How It Works](https://www.helpsystems.com/resources/guides/unix-load-average-part-1-how-it-works)
+2. [UNIX Load Average Part 2: Not Your Average Average](https://www.helpsystems.com/resources/guides/unix-load-average-part-2-not-your-average-average)
+3. [Understanding Linux CPU Load](https://scoutapm.com/blog/understanding-load-averages)
+4. [Linux Load Averages: Solving the Mystery](https://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html)
+5. [Understanding Linux Process States](https://access.redhat.com/sites/default/files/attachments/processstates_20120831.pdf)
diff --git a/health/guides/load/load_average_5.md b/health/guides/load/load_average_5.md
new file mode 100644
index 000000000..6eacfcec9
--- /dev/null
+++ b/health/guides/load/load_average_5.md
@@ -0,0 +1,66 @@
+### Understand the alert
+
+This alert tracks the system `load average` (CPU and I/O demand) over a period of five minutes. If you receive this alert, your system is "overloaded."
+
+The alert is raised to warning when the metric exceeds 4 times the expected value, and it is cleared when the metric drops to 3.5 times the expected value or below.
+
+For more information on how alerts are calculated, see the [documentation](https://learn.netdata.cloud/docs/agent/health/reference#expressions).
+
+
+### What does "load average" mean?
+
+On a Linux machine, the `system load average` measures the **number of threads that are currently working and those waiting to work** (on CPU, disk, or uninterruptible locks). Simply stated: **the system load average measures the number of threads that aren't idle.**
+
+### What does "overloaded" mean?
+
+Let's look at a single-core system and think of its core as a single lane on a bridge. In this example, a car represents a process:
+
+- At a load average of 0.5, traffic flows freely: the bridge is at 50% of its capacity.
+- At a load average of 1, the bridge is full: it is 100% utilized.
+- At a load average of 2 (remember, this is a single-core machine), one full lane of cars is crossing the bridge while **another** full lane of cars waits to cross.
+
+This is how you can picture CPU load, but keep in mind that `load average` also counts I/O demand, where an analogous picture applies.
+
+### Useful resources
+
+1. [UNIX Load Average Part 1: How It Works](https://www.helpsystems.com/resources/guides/unix-load-average-part-1-how-it-works)
+2. [UNIX Load Average Part 2: Not Your Average Average](https://www.helpsystems.com/resources/guides/unix-load-average-part-2-not-your-average-average)
+3. [Understanding Linux CPU Load](https://scoutapm.com/blog/understanding-load-averages)
+4. [Linux Load Averages: Solving the Mystery](https://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html)
+5. [Understanding Linux Process States](https://access.redhat.com/sites/default/files/attachments/processstates_20120831.pdf)
+
+
+### Troubleshoot the alert
+
+- Determine if the problem is CPU-bound or I/O-bound
+
+First, check whether you are dealing with a CPU load problem or an I/O load problem.
+
+1. To get a report about your system statistics, use `vmstat` (or `vmstat 1` to print a new report every second).
+
+The `procs` columns show:
+- `r`: The number of runnable processes (running or waiting for run time).
+- `b`: The number of processes blocked waiting for I/O to complete.
+
+2. List your currently running processes using the `ps` command:
+
+Piping the `ps` output through `grep` keeps only the processes whose state code starts with `R` (running or runnable, on the run queue) or `D` (uninterruptible sleep, usually I/O).
+
+3. Reduce the load by closing any unnecessary processes among the main consumers. Before closing a process, double-check that it is not needed.
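+
+The `ps` and `grep` combination from step 2 could look like the sketch below. It assumes the procps `ps` found on most Linux distributions; `filter_busy` is a hypothetical helper name, not an existing tool:
+
+```shell
+# filter_busy: keep only lines whose first column (the state code) starts with R or D.
+# Hypothetical helper; it is a thin wrapper around grep.
+filter_busy() {
+  grep '^[RD]'
+}
+
+# s = state code, followed by the owner and the full command line.
+command -v ps >/dev/null && ps -eo s,user,cmd | filter_busy
+```
+
+Many lines starting with `R` point toward CPU demand, while many lines starting with `D` point toward I/O waits.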
+
+- Check per-process CPU/disk usage to find the top consumers
+
+1. To see the processes that are the main CPU consumers, use the task manager program `top` like this:
+
+ ```
+ top -o +%CPU -i
+ ```
+
+2. Use `iotop`:
+ `iotop` is a tool similar to `top` that monitors disk I/O usage per process. If you don't have it, [install it](https://www.tecmint.com/iotop-monitor-linux-disk-io-activity-per-process/) first.
+ ```
+ sudo iotop
+ ```
+
+3. Reduce the load by closing any unnecessary processes among the main consumers. Before closing a process, double-check that it is not needed.
+
diff --git a/health/guides/load/load_cpu_number.md b/health/guides/load/load_cpu_number.md
new file mode 100644
index 000000000..250a6d069
--- /dev/null
+++ b/health/guides/load/load_cpu_number.md
@@ -0,0 +1,48 @@
+### Understand the alert
+
+This alert, `load_cpu_number`, calculates the base trigger point for load average alarms, which helps identify when the system is overloaded. The alert checks the maximum number of CPUs in the system over the past 1 minute. If there is only one CPU, the trigger is set at 2.
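+
+The rule above can be sketched in a few lines of shell. This is an illustration only: `load_trigger` is a hypothetical helper, and it assumes that with two or more CPUs the trigger simply equals the CPU count:
+
+```shell
+# load_trigger CPUS -> prints the base trigger point for the load average alerts
+# Hypothetical helper; assumes the trigger equals the CPU count, with a floor of 2.
+load_trigger() {
+  if [ "$1" -lt 2 ]; then
+    echo 2       # a single-CPU system gets a trigger of 2
+  else
+    echo "$1"
+  fi
+}
+
+# nproc (GNU coreutils) prints the number of available processing units.
+command -v nproc >/dev/null && load_trigger "$(nproc)"
+```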
+
+### What does load average mean?
+
+The term `system load average` on a Linux machine measures the number of threads that are currently working and those waiting to work (CPU, disk, uninterruptible locks). In simpler terms, the load average measures the number of threads that aren't idle.
+
+### What does overloaded mean?
+
+An overloaded system is when the demand on the system's resources (CPUs, disks, etc.) is higher than its capacity to handle tasks. This can lead to increased wait times, slower processing, and in worst cases, system crashes.
+
+### Troubleshoot the alert
+
+1. Determine the current load average on the system:
+
+ Use the `uptime` command in the terminal to see the current load average:
+ ```
+ uptime
+ ```
+
+2. Identify if the problem is CPU load or I/O load:
+
+ Use `vmstat` (or `vmstat 1` to print a new report every second) to get a report on system statistics.
+
+ The `procs` columns show:
+ - `r`: The number of runnable processes (running or waiting for run time).
+ - `b`: The number of processes blocked waiting for I/O to complete.
+
+3. Check per-process CPU/disk usage to find the top consumers:
+
+ a. Use `top` to see the processes that are the main CPU consumers:
+ ```
+ top -o +%CPU -i
+ ```
+
+ b. Use `iotop` to monitor Disk I/O usage (install it if not available):
+ ```
+ sudo iotop
+ ```
+
+4. Minimize the load by closing any unnecessary main consumer processes. Double-check if the process you want to close is necessary.
+
+### Useful resources
+
+1. [Unix Load Average Part 1: How It Works](https://www.helpsystems.com/resources/guides/unix-load-average-part-1-how-it-works)
+2. [Unix Load Average Part 2: Not Your Average Average](https://www.helpsystems.com/resources/guides/unix-load-average-part-2-not-your-average-average)
+3. [Understanding Linux Process States](https://access.redhat.com/sites/default/files/attachments/processstates_20120831.pdf)
\ No newline at end of file