Diffstat (limited to 'health/guides/disks')
-rw-r--r-- health/guides/disks/10min_disk_backlog.md | 10
-rw-r--r-- health/guides/disks/10min_disk_utilization.md | 28
-rw-r--r-- health/guides/disks/bcache_cache_dirty.md | 74
-rw-r--r-- health/guides/disks/bcache_cache_errors.md | 66
-rw-r--r-- health/guides/disks/disk_inode_usage.md | 23
-rw-r--r-- health/guides/disks/disk_space_usage.md | 19
6 files changed, 220 insertions, 0 deletions
diff --git a/health/guides/disks/10min_disk_backlog.md b/health/guides/disks/10min_disk_backlog.md
new file mode 100644
index 00000000..9b0a275b
--- /dev/null
+++ b/health/guides/disks/10min_disk_backlog.md
@@ -0,0 +1,10 @@
+### Understand the alert
+
+This alert presents the average backlog size of the disk raising this alarm over the last 10 minutes.
+
+This alert is escalated to warning when the metric exceeds 5000 ms.
+
+### What is "disk backlog"?
+
+Backlog is an indication of the duration of pending disk operations. On every I/O event, the system multiplies the time spent doing I/O since the last update of this field by the number of pending operations. While not exact, this metric indicates the expected completion time of the operations in progress.
+
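For the `10min_disk_backlog` guide above, the accumulation it describes can be sketched in a few lines of Python. This is only an illustration of the idea, not Netdata's collector code. It assumes a Linux `/proc/diskstats` layout in which the eleventh statistic after the device name is the weighted time spent doing I/O, in milliseconds. The two sample lines and the function names are hypothetical.

```python
def weighted_io_ms(diskstats_line: str) -> int:
    """Return the weighted I/O time (ms) from one /proc/diskstats line."""
    fields = diskstats_line.split()
    # fields[0:3] are major, minor and device name; the 11th statistic
    # (index 13) is assumed to be the weighted time spent doing I/O, in ms.
    return int(fields[13])


def average_backlog_ms(before_ms: int, after_ms: int, elapsed_s: float) -> float:
    """Backlog accumulated per second of wall-clock time between two samples."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (after_ms - before_ms) / elapsed_s


# Two hypothetical samples of the same device, taken 10 seconds apart.
sample_a = "   8       0 sda 100 0 800 50 200 0 1600 150 0 120 200"
sample_b = "   8       0 sda 900 0 7200 450 1800 0 14400 1350 3 9000 60200"

backlog = average_backlog_ms(weighted_io_ms(sample_a), weighted_io_ms(sample_b), 10.0)
print(f"average backlog: {backlog:.0f} ms")
```

With these made-up samples the result is above the 5000 threshold quoted in the guide, which is the situation the alert is meant to flag.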
diff --git a/health/guides/disks/10min_disk_utilization.md b/health/guides/disks/10min_disk_utilization.md
new file mode 100644
index 00000000..41a987a4
--- /dev/null
+++ b/health/guides/disks/10min_disk_utilization.md
@@ -0,0 +1,28 @@
+### Understand the alert
+
+This alert presents the average percentage of time the disk was busy over the last 10 minutes. If you receive it, the disk is under high load and spent most of that time servicing
+read or write requests.
+
+This alert is triggered in a warning state when the metric exceeds 98%.
+
+This metric is the same as the `%util` column in the output of `iostat -x`.
+
+### Troubleshoot the alert
+
+- Check per-process disk usage to find the top consumers. (If you got this alert for a device that serves requests in parallel, you can ignore it.)
+
+On Linux, use `iotop`:
+ ```
+ sudo iotop
+ ```
+ The `IO` column shows which processes are the main disk I/O consumers.
+
+On FreeBSD, use `top` in I/O mode, sorted by total operations:
+ ```
+ top -m io -o total
+ ```
+### Useful resources
+
+1. [Two traps in iostat: %util and svctm](https://brooker.co.za/blog/2014/07/04/iostat-pct.html)
+
+2. `iotop` is a tool similar to `top` that monitors disk I/O usage. If you don't have it, [install it](https://www.tecmint.com/iotop-monitor-linux-disk-io-activity-per-process/).
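For the `10min_disk_utilization` guide above, the busy-time percentage can be sketched as follows. The sketch assumes that the tenth statistic after the device name in a Linux `/proc/diskstats` line is the number of milliseconds the device spent doing I/O. The function names and the sample counter values are hypothetical, and the code is an illustration rather than the way `iostat` or Netdata is implemented.

```python
WARNING_THRESHOLD = 98.0  # percent, as stated in the guide


def io_busy_ms(diskstats_line: str) -> int:
    """Return the time spent doing I/O (ms) from one /proc/diskstats line."""
    fields = diskstats_line.split()
    # Index 12 is assumed to be the 10th statistic: milliseconds spent doing I/O.
    return int(fields[12])


def utilization_percent(before_ms: int, after_ms: int, elapsed_s: float) -> float:
    """Share of the sampling interval during which the device was busy."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return 100.0 * (after_ms - before_ms) / (elapsed_s * 1000.0)


# Hypothetical counter values sampled 10 seconds apart.
util = utilization_percent(120, 10020, 10.0)
print(f"utilization: {util:.1f}% (warning: {util > WARNING_THRESHOLD})")
```

Because this only measures how long at least one request was in flight, a device that serves requests in parallel can show a high value while still having spare capacity, which is why the guide says the alert can be ignored for such devices.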
diff --git a/health/guides/disks/bcache_cache_dirty.md b/health/guides/disks/bcache_cache_dirty.md
new file mode 100644
index 00000000..11b74e52
--- /dev/null
+++ b/health/guides/disks/bcache_cache_dirty.md
@@ -0,0 +1,74 @@
+### Understand the Alert
+
+`bcache` is a cache in the block layer of the Linux kernel. **It allows fast storage devices**, such as SSDs (Solid State Drives), **to act as a cache for slower storage devices**, such as HDDs (Hard Disk Drives). The result is **hybrid volumes with improved performance**. Generally, a cache device is divided up into `buckets`, matching the physical disk's erase blocks.
+
+This alert indicates that your SSD cache is too small for your workload and is overpopulated with `dirty` data and `metadata`.
+
+You can think of `bcache_cache_dirty` as the `bcache` analogue of `dirty memory`, which is memory that has been changed but not yet written out to disk. For example, you make a change to a file but do not save it; the change is held in memory, waiting to be written to disk. Likewise, `dirty` data on `bcache` is data that is stored on the cache device and is waiting to be written to the backing device (normally your HDD).
+
+Because `dirty` data has not yet reached the backing device, the cache device and the backing device are not safe to separate, even when the system is shut down.
+`metadata`, in general, is data that provides information about other data.
+
+### Troubleshoot the Alert
+
+- Upgrade your cache's capacity
+
+This alert is raised when more than 70% *(for warning status)* of your cache is populated by `dirty` data and `metadata`. It means that your current cache device doesn't have the capacity to support your workload. Using a device with a larger
+capacity as the cache can solve the problem.
+
+- Monitor cache usage regularly
+
+Keep an eye on the cache usage regularly to understand the pattern of how your cache gets filled up with dirty data and metadata. This can help you better manage the cache and take proactive measures before facing a performance bottleneck.
+
+ To check the cache hit ratio over the last five minutes, use `cat` on the corresponding file in the cache's sysfs directory:
+
+ ```
+ cat /sys/fs/bcache/<CACHE_DEV_UUID>/cache0/bcache/stats_five_minute/cache_hit_ratio
+ ```
+
+ Replace `<CACHE_DEV_UUID>` with your cache device's UUID.
+
+- Periodically write dirty data to the backing device
+
+If the cache frequently fills with dirty data, you can make bcache write dirty data to the backing device more aggressively to free space in the cache. This helps most when your caching device isn't constantly at full capacity.
+
+ To do this, lower the `writeback_percent` target of the backing device. According to the kernel documentation, when this value is nonzero bcache tries to keep about that percentage of the cache dirty by throttling background writeback. Run the following as root:
+
+ ```
+ echo 0 > /sys/fs/bcache/<CACHE_DEV_UUID>/bdev0/writeback_percent
+ ```
+
+ Replace `<CACHE_DEV_UUID>` with your cache device's UUID. `bdev0` is the first backing device attached to the cache set. With a value of 0, bcache no longer holds dirty data back, so it is written to the backing device without throttling.
+
+- Check for I/O bottlenecks
+
+If you experience performance issues with bcache, it's essential to identify the cause, which could be I/O bottlenecks. Look for any I/O errors or an overloaded I/O subsystem that may be affecting your cache device's performance.
+
+ To check I/O statistics, you can use tools like `iotop`, `iostat` or `vmstat`:
+
+ ```bash
+ iotop
+ iostat -x -d -z -t 5 5 # run 5 times with a 5-second interval between each report
+ vmstat -d
+ ```
+
+ Analyze the output and look for any signs of a bottleneck, such as excessive disk utilization, slow transfer speeds, or high I/O wait times.
+
+- Optimize cache configuration
+
+Review your current cache configuration and make sure it's optimized for your system's workload. In some cases, adjusting cache settings could help improve the hit ratio and reduce the amount of dirty data.
+
+ To view the bcache settings:
+
+ ```
+ cat /sys/fs/bcache/<CACHE_DEV_UUID>/cache0/bcache/*
+ ```
+
+ Replace `<CACHE_DEV_UUID>` with your cache device's UUID.
+
+ You can also make changes to the cache settings by echoing the new values to the corresponding sysfs files. Please refer to the [Cache Settings section in the Bcache documentation](https://www.kernel.org/doc/Documentation/bcache.txt) for more details.
+
+### Useful resources
+
+1. [Bcache documentation](https://www.kernel.org/doc/Documentation/bcache.txt)
+2. [Arch Linux Wiki: Bcache](https://wiki.archlinux.org/title/bcache)
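For the `bcache_cache_dirty` guide above, the 70% check on `dirty` data plus `metadata` can be sketched in Python. The sketch assumes the percentages come from the cache device's `priority_stats` sysfs file and that the file contains lines of the form `Name: N%`. The sample text, the file layout and the function names are assumptions made for illustration and may differ from what your kernel reports.

```python
import re

WARNING_THRESHOLD = 70  # percent, as stated in the guide

# Hypothetical file content; the real layout may differ between kernel versions.
SAMPLE_PRIORITY_STATS = """Unused:\t\t4%
Clean:\t\t21%
Dirty:\t\t72%
Metadata:\t3%
Average:\t1205
"""


def parse_percentages(text: str) -> dict:
    """Collect every 'Name: N%' line into a {name: N} mapping (names lowercased)."""
    pattern = re.compile(r"^([^:\n]+):\s*(\d+)%\s*$", re.MULTILINE)
    return {m.group(1).strip().lower(): int(m.group(2)) for m in pattern.finditer(text)}


def dirty_share(stats: dict) -> int:
    """Percentage of the cache holding dirty data or metadata."""
    return stats.get("dirty", 0) + stats.get("metadata", 0)


stats = parse_percentages(SAMPLE_PRIORITY_STATS)
share = dirty_share(stats)
print(f"dirty + metadata: {share}% (warning: {share > WARNING_THRESHOLD})")
```

Lines without a percentage, such as the `Average` line in the sample, are skipped by the pattern, so only the percentage fields end up in the mapping.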
diff --git a/health/guides/disks/bcache_cache_errors.md b/health/guides/disks/bcache_cache_errors.md
new file mode 100644
index 00000000..5256c480
--- /dev/null
+++ b/health/guides/disks/bcache_cache_errors.md
@@ -0,0 +1,66 @@
+### Understand the alert
+
+This alert is triggered when the number of read races in the last minute on a `bcache` system has increased. A read race occurs when a `bucket` is reused and invalidated while it's being read from the cache. In this situation, the data is reread from the slower backing device.
+
+### What is bcache?
+
+`bcache` is a cache within the block layer of the Linux kernel. It enables fast storage devices, such as SSDs (Solid State Drives), to act as a cache for slower storage devices like HDDs (Hard Disk Drives). This creates hybrid volumes with improved performance. A cache device is usually divided into `buckets` that match the physical disk's erase blocks.
+
+### Troubleshoot the alert
+
+1. Verify the current `bcache` read race count:
+
+ ```
+ grep -H . /sys/fs/bcache/*/internal/cache_read_races
+ ```
+
+ This command prints the cumulative number of read races for each `bcache` cache set.
+
+2. Identify the affected backing device:
+
+ You can determine the affected backing device by checking the `/sys/fs/bcache` directory. Look for the symbolic link that points to the problematic device.
+
+ ```
+ ls -l /sys/fs/bcache
+ ```
+
+ This command will show the list of devices with corresponding names.
+
+3. Monitor the cache device's performance:
+
+ Use `iostat` to check the cache device's I/O performance.
+
+ ```
+ iostat -x -h -p /dev/YOUR_CACHE_DEVICE
+ ```
+
+ Note that you should replace `YOUR_CACHE_DEVICE` with the actual cache device name.
+
+4. Check how full the cache is and how much dirty data it holds for the backing device:
+
+ Use the following commands to inspect the cache and backing devices:
+
+ ```
+ # for the cache device (/dev/YOUR_CACHE_DEVICE): unused, dirty and metadata percentages
+ cat /sys/block/YOUR_CACHE_DEVICE/bcache/priority_stats
+
+ # for the backing device (/dev/YOUR_BACKING_DEVICE): amount of dirty data held in the cache
+ cat /sys/block/YOUR_BACKING_DEVICE/bcache/dirty_data
+ ```
+
+ Replace `YOUR_CACHE_DEVICE` and `YOUR_BACKING_DEVICE` with the respective device names.
+
+5. Optimize the cache:
+
+ - If the cache utilization is high, consider increasing the cache size or adding more cache devices.
+ - If the cache device is heavily utilized, consider upgrading it to a faster SSD.
+ - In case the read races persist, consider using a [priority caching strategy](https://www.kernel.org/doc/html/latest/admin-guide/bcache.html#priority-caching).
+
+ You may also need to review your system's overall I/O load and adjust your caching strategy accordingly.
+
+### Useful resources
+
+1. [Bcache: Caching beyond just RAM](https://lwn.net/Articles/394672/)
+2. [Kernel Documentation - Bcache](https://www.kernel.org/doc/html/latest/admin-guide/bcache.html)
+3. [Arch Linux Wiki - Bcache](https://wiki.archlinux.org/title/bcache)
+4. [Wikipedia - Bcache](https://en.wikipedia.org/wiki/Bcache)
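For the `bcache_cache_errors` guide above, detecting an increase in read races amounts to comparing two readings of a cumulative counter. The Python sketch below assumes the counter is exposed as `internal/cache_read_races` under each cache set directory in `/sys/fs/bcache`, following the kernel documentation. The function names are hypothetical, and the demonstration uses a throwaway directory tree so that it runs on machines without bcache.

```python
import tempfile
from pathlib import Path


def read_race_counts(sysfs_root: str = "/sys/fs/bcache") -> dict:
    """Map each cache set directory name to its cumulative read race counter."""
    counts = {}
    for counter in Path(sysfs_root).glob("*/internal/cache_read_races"):
        cache_set = counter.parent.parent.name
        counts[cache_set] = int(counter.read_text().strip())
    return counts


def increases(previous: dict, current: dict) -> dict:
    """Return only the cache sets whose counter grew between two readings."""
    return {
        name: value - previous.get(name, 0)
        for name, value in current.items()
        if value > previous.get(name, 0)
    }


# Demonstration against a fake sysfs tree with a hypothetical cache set name.
with tempfile.TemporaryDirectory() as root:
    counter_file = Path(root, "example-uuid", "internal", "cache_read_races")
    counter_file.parent.mkdir(parents=True)
    counter_file.write_text("3\n")
    first = read_race_counts(root)
    counter_file.write_text("7\n")
    second = read_race_counts(root)

new_races = increases(first, second)
print(new_races)
```

On a real system you would call `read_race_counts()` with its default root twice, one minute apart, and inspect the result of `increases`.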
diff --git a/health/guides/disks/disk_inode_usage.md b/health/guides/disks/disk_inode_usage.md
new file mode 100644
index 00000000..3c916106
--- /dev/null
+++ b/health/guides/disks/disk_inode_usage.md
@@ -0,0 +1,23 @@
+### Understand the alert
+
+This alarm presents the percentage of used `inodes` on a particular disk.
+
+The number of `inodes` in use reflects the number of files and folders you have. An `inode` is a data structure containing metadata about a file. All filenames are internally mapped to their respective `inode` numbers, so if you have a
+lot of files, you are using a lot of `inodes`.
+
+If the alarm is raised, it means that your storage device is running out of `inode` space. Each disk has a **limit on the number of `inodes` it can store**, determined by its size.
+
+Many modern filesystems use dynamically allocated `inodes` instead of a static table. These should not be presented on the charts associated with this alarm, and should not ever trigger it. If such a filesystem **does** trigger this alarm, and it's constantly reporting max `inode` usage, it's probably a bug in the filesystem driver. Some such filesystems incorrectly report having max `inode` count when they should not because they have no max limit, and in turn they trigger a false positive alarm.
+
+### Troubleshoot the alert
+
+Clear cache files or delete unnecessary files and folders
+
+- To reduce the number of `inodes` you currently use, clear your cache and delete any unnecessary files and folders on your system.
+
+We strongly suggest that you exercise a high degree of caution when cleaning up drives and removing files. Make sure that you delete only files you are certain are unnecessary.
+
+### Useful resources
+
+1. [Linux Inodes](https://www.javatpoint.com/linux-inodes)
+2. [Understanding UNIX / Linux filesystem Inodes](https://www.cyberciti.biz/tips/understanding-unixlinux-filesystem-inodes.html)
\ No newline at end of file
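For the `disk_inode_usage` guide above, the used-inode percentage can be sketched in Python with `os.statvfs`, which is available on Unix-like systems. The function names are hypothetical. Treating a reported total of zero as "no fixed inode limit" is an assumption that mirrors the guide's remark about filesystems with dynamically allocated inodes.

```python
import os
from typing import Optional


def inode_usage_percent(total: int, free: int) -> Optional[float]:
    """Used inodes as a percentage, or None when no fixed total is reported."""
    if total <= 0:
        # Assumption: a zero total means inodes are allocated dynamically.
        return None
    return 100.0 * (total - free) / total


def mount_inode_usage(path: str = "/") -> Optional[float]:
    """Inode usage of the filesystem that contains `path`."""
    st = os.statvfs(path)
    return inode_usage_percent(st.f_files, st.f_ffree)


if __name__ == "__main__":
    usage = mount_inode_usage("/")
    if usage is None:
        print("/ reports no fixed inode limit")
    else:
        print(f"/ inode usage: {usage:.1f}%")
```

The same figures are what `df -i` reports per mount point, so either can be used to confirm which filesystem raised the alarm before deleting anything.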
diff --git a/health/guides/disks/disk_space_usage.md b/health/guides/disks/disk_space_usage.md
new file mode 100644
index 00000000..14663942
--- /dev/null
+++ b/health/guides/disks/disk_space_usage.md
@@ -0,0 +1,19 @@
+### Understand the alert
+
+This alarm presents the percentage of used space of a particular disk. If it is close to 100%, it means that your storage device is running out of space. If the particular disk raising the alarm is full, the system could experience slowdowns and even crashes.
+
+### Troubleshoot the alert
+
+Clean or upgrade the drive.
+
+If your storage device is full and the alert is raised, you have two options:
+
+- Clean up your drive by removing any unnecessary files (files in the trash directory, cache files, etc.) to free up space. Some locations that are usually safe to clean are:
+ - Files under `/var/cache`
+ - Old logs in `/var/log`
+ - Old crash reports in `/var/crash` or `/var/dump`
+ - The `.cache` directory in user home directories
+
+- If your workflow requires all the space that is currently used, consider upgrading the disk that raised the alarm, because its capacity is too small for your needs.
+
+Netdata strongly suggests that you be careful when cleaning up drives and removing files. Make sure that you delete only files you are certain are unnecessary.
\ No newline at end of file
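For the `disk_space_usage` guide above, a read-only Python sketch can show how full a filesystem is and how much space the cleanup candidates listed in the guide occupy. It deletes nothing. The function names are hypothetical, and the percentage is computed from `shutil.disk_usage`, which may differ slightly from the figure `df` or Netdata reports because reserved blocks are counted differently.

```python
import os
import shutil


def used_percent(path: str = "/") -> float:
    """Used space of the filesystem containing `path`, as a percentage of its total."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def directory_size_bytes(root: str) -> int:
    """Sum the sizes of regular files below `root`, skipping symbolic links."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            file_path = os.path.join(dirpath, name)
            if os.path.islink(file_path):
                continue
            try:
                total += os.path.getsize(file_path)
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


if __name__ == "__main__":
    print(f"/ is {used_percent('/'):.1f}% full")
    # Cleanup candidates taken from the guide.
    for candidate in ("/var/cache", "/var/log", "/var/crash", os.path.expanduser("~/.cache")):
        if os.path.isdir(candidate):
            size_mib = directory_size_bytes(candidate) / (1024 * 1024)
            print(f"{candidate}: {size_mib:.1f} MiB")
```

Directories that the current user cannot read are silently skipped, so running the sketch as an unprivileged user may under-report the sizes.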