Diffstat
-rw-r--r-- health/guides/btrfs/btrfs_allocated.md | 75
-rw-r--r-- health/guides/btrfs/btrfs_data.md | 30
-rw-r--r-- health/guides/btrfs/btrfs_device_corruption_errors.md | 57
-rw-r--r-- health/guides/btrfs/btrfs_device_flush_errors.md | 54
-rw-r--r-- health/guides/btrfs/btrfs_device_generation_errors.md | 52
-rw-r--r-- health/guides/btrfs/btrfs_device_read_errors.md | 50
-rw-r--r-- health/guides/btrfs/btrfs_device_write_errors.md | 42
-rw-r--r-- health/guides/btrfs/btrfs_metadata.md | 70
-rw-r--r-- health/guides/btrfs/btrfs_system.md | 75
9 files changed, 505 insertions, 0 deletions
diff --git a/health/guides/btrfs/btrfs_allocated.md b/health/guides/btrfs/btrfs_allocated.md
new file mode 100644
index 00000000..690d45d0
--- /dev/null
+++ b/health/guides/btrfs/btrfs_allocated.md
@@ -0,0 +1,75 @@
+### Understand the alert
+
+Btrfs is a modern copy-on-write (CoW) filesystem for Linux aimed at implementing advanced features while also focusing on fault tolerance, repair, and easy administration. Btrfs is intended to address the lack of pooling, snapshots, checksums, and integral multi-device spanning in Linux file systems.
+
+Unlike most filesystems, Btrfs allocates disk space in two distinct stages. The first stage allocates chunks of physical disk space for use by a particular type of filesystem block: data blocks (which store actual file data), metadata blocks (which store inodes and other file metadata), or system blocks (which store metadata about the filesystem itself). The second stage then allocates actual blocks within those chunks for use by the filesystem. This metric tracks space usage in the first allocation stage.
+
+The Netdata Agent monitors the percentage of allocated Btrfs physical disk space.
+
+### Troubleshoot the alert
+
+- Add more physical space
+
+Adding a new disk always depends on your infrastructure, disk RAID configuration, encryption, etc. An easy way to add a new disk to a filesystem is:
+
+1. Determine which disk you want to add and to which path, then add it
+   ```
+   btrfs device add -f /dev/<new_disk> <path>
+   ```
+
+2. If you get an error that the drive is already mounted, unmount it and run the command again
+   ```
+   btrfs device add -f /dev/<new_disk> <path>
+   ```
+3. See the newly added disk
+   ```
+   btrfs filesystem show
+   ```
+4. Balance the system to make use of the new drive.
+   ```
+   btrfs balance start <path>
+   ```
+
+- Delete snapshots
+
+You can identify and delete snapshots that you no longer need.
+
+1. Find the snapshots for a specific path.
+   ```
+   sudo btrfs subvolume list -s <path>
+   ```
+
+2. Delete a snapshot that you do not need any more.
+   ```
+   btrfs subvolume delete <path>/@some_dir-snapshot-test
+   ```
+
+- Enable a compression mechanism
+
+1. Apply compression to existing files. This command will re-compress the `/mount/point` path with the `zstd` compression algorithm.
+
+   ```
+   btrfs filesystem defragment -r -v -czstd /mount/point
+   ```
+
+- Enable a deduplication mechanism
+
+Using copy-on-write, Btrfs is able to copy files or whole subvolumes without actually copying the data. However, when a file is altered, a new proper copy is created. Deduplication takes this a step further by actively identifying blocks of data which share common sequences and combining them into an extent with the same copy-on-write semantics.
+
+Tools dedicated to deduplicating a Btrfs-formatted partition include duperemove, bees, and dduper. These projects are third party, and it is strongly suggested that you check their status before you decide to use them.
+
+- Perform a balance
+
+Especially in a Btrfs filesystem with multiple disks, data and metadata might be unevenly allocated across the disks.
+```
+btrfs balance start -musage=10 -dusage=10 -susage=5 /mount/point
+```
+This command will attempt to relocate data/metadata/system data in empty or near-empty chunks (at most X% used, where X is the value passed to each `usage` filter), allowing the space to be reclaimed and reassigned between data and metadata. If the balance command ends with "Done, had to relocate 0 out of XX chunks", then you need to increase the `-dusage`/`-musage` percentage parameters until at least some chunks are relocated.
+
+### Useful resources
+
+1. [The Btrfs filesystem on Arch Linux website](https://wiki.archlinux.org/title/btrfs)
+2. [The Btrfs filesystem on kernel.org website](https://btrfs.wiki.kernel.org)
+3. [duperemove](https://github.com/markfasheh/duperemove)
+4. [bees](https://github.com/Zygo/bees)
+5. [dduper](https://github.com/lakshmipathi/dduper)
diff --git a/health/guides/btrfs/btrfs_data.md b/health/guides/btrfs/btrfs_data.md
new file mode 100644
index 00000000..7782b2d8
--- /dev/null
+++ b/health/guides/btrfs/btrfs_data.md
@@ -0,0 +1,30 @@
+### Understand the alert
+
+This alert is triggered when the percentage of used Btrfs data space exceeds the configured threshold. Btrfs (B-tree file system) is a modern copy-on-write (CoW) filesystem for Linux which focuses on fault tolerance, repair, and easy administration. This filesystem also provides advanced features like snapshots, checksums, and multi-device spanning.
+
+### What does high Btrfs data usage mean?
+
+High Btrfs data usage indicates that a significant amount of the allocated space for data blocks in the filesystem is being used. This could be a result of many factors, such as large files, numerous smaller files, or multiple snapshots.
+
+### Troubleshoot the alert
+
+Before you attempt any troubleshooting, make sure you have backed up your data to prevent potential data loss or corruption.
+
+1. **Add more physical space**: You can add a new disk to the filesystem, depending on your infrastructure and disk RAID configuration. Remember to unmount the drive if it's already mounted, then use the `btrfs device add` command to add the new disk and balance the filesystem.
+
+2. **Delete snapshots**: Review the snapshots in your Btrfs filesystem and delete any unnecessary snapshots. Use the `btrfs subvolume list` command to find snapshots and `btrfs subvolume delete` to remove them.
+
+3. **Enable compression**: By enabling compression, you can save disk space without deleting files or snapshots. Add the `compress=alg` mount option in your `fstab` configuration file or during the mount procedure, where `alg` is the compression algorithm you want to use (e.g., `zlib`, `lzo`, `zstd`). You can apply compression to existing files using the `btrfs filesystem defragment` command.
+
+4. **Enable deduplication**: Implement deduplication to identify and merge blocks of data with common sequences using copy-on-write semantics. You can use third-party tools dedicated to Btrfs deduplication, such as duperemove, bees, and dduper. However, research their stability and reliability before employing them.
+
+5. **Perform a balance**: If the data and metadata are unevenly allocated among disks, especially in Btrfs filesystems with multiple disks, you can perform a balance operation to reallocate space between data and metadata. Use the `btrfs balance start` command with appropriate `usage` filters to start the balance process.
+
+### Useful resources
+
+1. [Btrfs Wiki](https://btrfs.wiki.kernel.org)
+2. [The Btrfs filesystem on the Arch Linux website](https://wiki.archlinux.org/title/btrfs)
+3. [Ubuntu man pages for Btrfs commands](https://manpages.ubuntu.com/manpages/bionic/man8)
+4. [duperemove](https://github.com/markfasheh/duperemove)
+5. [bees](https://github.com/Zygo/bees)
+6. [dduper](https://github.com/lakshmipathi/dduper)
\ No newline at end of file
diff --git a/health/guides/btrfs/btrfs_device_corruption_errors.md b/health/guides/btrfs/btrfs_device_corruption_errors.md
new file mode 100644
index 00000000..98fd4b44
--- /dev/null
+++ b/health/guides/btrfs/btrfs_device_corruption_errors.md
@@ -0,0 +1,57 @@
+### Understand the alert
+
+This alert monitors the `corruption_errs` metric in the `btrfs.device_errors` chart. If you receive this alert, it means that your system's BTRFS file system has encountered one or more corruption errors in the past 10 minutes. These errors indicate data inconsistencies on the file system that could lead to data loss or other issues.
+
+### What are BTRFS corruption errors?
+
+BTRFS (B-Tree File System) is a modern, fault-tolerant, and highly scalable file system used in several Linux distributions. Corruption errors in a BTRFS file system refer to inconsistencies in the data structures that the file system relies on to store and manage data. Such inconsistencies can stem from software bugs, hardware failures, or other causes.
+
+### Troubleshoot the alert
+
+1. Check for system messages:
+
+   Review your system's kernel message log (`dmesg` output) for any BTRFS-related errors or warnings. These messages can provide insights into the cause of the corruption and help you diagnose the issue.
+
+   ```
+   dmesg | grep -i btrfs
+   ```
+
+2. Run a file system check:
+
+   Use the `btrfs scrub` command to scan the file system for inconsistencies and attempt to automatically repair them. Note that this command may take a long time to complete, depending on the size of your BTRFS file system.
+
+   ```
+   sudo btrfs scrub start /path/to/btrfs/mountpoint
+   ```
+
+   After the scrub finishes, check the status with:
+
+   ```
+   sudo btrfs scrub status /path/to/btrfs/mountpoint
+   ```
+
+3. Assess your storage hardware
+
+   In some cases, BTRFS corruption errors may be caused by failing storage devices, such as a disk drive nearing the end of its lifetime. Check the S.M.A.R.T. status of your disks using the `smartctl` tool to identify potential hardware issues.
+
+   ```
+   sudo smartctl -a /dev/sdX
+   ```
+
+   Replace `/dev/sdX` with the actual device path of your disk.
+
+4. Update your system
+
+   Ensuring that your system has the latest kernel, BTRFS tools package, and other relevant updates can help prevent software-related corruption errors.
+
+   For example, on Ubuntu or Debian-based systems, you can update with:
+
+   ```
+   sudo apt-get update
+   sudo apt-get upgrade
+   ```
+
+5. Back up essential data
+
+   As file system corruption might result in data loss, ensure that you have proper backups of any critical data stored on your BTRFS file system. Regularly back up your data to an external or secondary storage device.
+
diff --git a/health/guides/btrfs/btrfs_device_flush_errors.md b/health/guides/btrfs/btrfs_device_flush_errors.md
new file mode 100644
index 00000000..c9bb1b11
--- /dev/null
+++ b/health/guides/btrfs/btrfs_device_flush_errors.md
@@ -0,0 +1,54 @@
+### Understand the alert
+
+This alert indicates that `BTRFS` flush errors have been detected on your file system. If you receive this alert, it means that your system has encountered problems while flushing data from memory to disk, which may result in data corruption or data loss.
+
+### What is BTRFS?
+
+`BTRFS` (B-Tree File System) is a modern, copy-on-write (CoW) file system for Linux designed to address various weaknesses in traditional file systems. It provides advanced features like data pooling, snapshots, and checksums that enhance fault tolerance.
+
+### Troubleshoot the alert
+
+1. Verify the alert
+
+Check the `Netdata` dashboard or query the monitoring API to confirm that the alert is genuine and not a false positive.
+
+2. Review and analyze syslog
+
+Check your system's `/var/log/syslog` or `/var/log/messages`, looking for `BTRFS`-related errors. These messages can provide essential information about the cause of the flush errors.
+
+3. Confirm BTRFS status
+
+Run the following command to display the state of the BTRFS file system and ensure it is mounted and healthy:
+
+```
+sudo btrfs filesystem show
+```
+
+4. Check disk space
+
+Ensure your system has sufficient disk space allocated to the BTRFS file system. A full or nearly full disk might cause flush errors. You can use the `df -h` command to examine the available disk space.
+
+5. Check system I/O usage
+
+Use the `iotop` command to inspect disk I/O usage for any abnormally high activity, which could be related to the flush errors.
+
+```
+sudo iotop
+```
+
+6. Upgrade or roll back BTRFS version
+
+Verify that you are using a stable version of the BTRFS utilities and kernel module. If not, consider upgrading or rolling back to a more stable version.
+
+7. Inspect hardware health
+
+Inspect your disks and RAM for possible hardware problems, as these can cause flush errors. SMART data can help assess disk health (`smartctl -a /dev/sdX`), and `memtest86+` can be used to test RAM.
+
+8. Create backups
+
+Take backups of your critical BTRFS data immediately to avoid potential data loss due to flush errors.
+
+### Useful resources
+
+1. [BTRFS official website](https://btrfs.wiki.kernel.org/index.php/Main_Page)
+2. [BTRFS utilities on GitHub](https://github.com/kdave/btrfs-progs)
diff --git a/health/guides/btrfs/btrfs_device_generation_errors.md b/health/guides/btrfs/btrfs_device_generation_errors.md
new file mode 100644
index 00000000..b357b83e
--- /dev/null
+++ b/health/guides/btrfs/btrfs_device_generation_errors.md
@@ -0,0 +1,52 @@
+### Understand the alert
+
+This alert is about `BTRFS generation errors`. When you receive this alert, it means that your BTRFS file system has encountered errors during its operation.
+
+### What are BTRFS generation errors?
+
+BTRFS is a modern copy-on-write (CoW) filesystem, which is developed to address various weaknesses in traditional Linux file systems. It features snapshots, checksums, and scrubbing to find and repair errors.
+
+A `BTRFS generation error` occurs when the file system encounters issues while updating the data and metadata associated with a snapshot or subvolume. This could be due to software bugs, hardware issues, or data corruption.
+
+### Troubleshoot the alert
+
+1. Verify the issue: Check your system logs for any BTRFS-related errors to further understand the problem. This can be done using the `dmesg` command:
+
+   ```
+   sudo dmesg | grep BTRFS
+   ```
+
+2. Check the BTRFS filesystem status: Use the `btrfs filesystem show` command to get information about your BTRFS filesystem, including the UUID, total size, used size, and device information:
+
+   ```
+   sudo btrfs filesystem show
+   ```
+
+3. Perform a BTRFS scrub: Scrubbing is a process that scans the entire filesystem, verifies the data and metadata, and attempts to repair any detected errors. Run the following command to start a scrub operation:
+
+   ```
+   sudo btrfs scrub start -Bd /path/to/btrfs/mountpoint
+   ```
+
+   The `-B` flag keeps the scrub in the foreground and prints statistics when it finishes, and the `-d` flag prints separate statistics for each device.
+
+4. Monitor scrub progress: From another terminal, you can monitor the scrub progress using the `btrfs scrub status` command:
+
+   ```
+   sudo btrfs scrub status /path/to/btrfs/mountpoint
+   ```
+
+5. Analyze scrub results: The scrub operation will provide information about the total data scrubbed, the number of errors found, and the number of errors fixed. This information can help you determine the extent of the issue and any further action required.
+
+6. Address BTRFS issues: Depending on the nature of the errors, you may need to take further action, such as updating the BTRFS tools, updating your Linux kernel, or even replacing faulty hardware to resolve the errors.
+
+7. Set up a regular scrub schedule: You can schedule regular scrubs to keep your BTRFS filesystem healthy. This can be done using `cron`. For example, you can add the following line to `/etc/crontab` to run a scrub on the 1st of each month:
+
+   ```
+   0 0 1 * * root btrfs scrub start -B /path/to/btrfs/mountpoint
+   ```
+
+### Useful resources
+
+1. [BTRFS Wiki Homepage](https://btrfs.wiki.kernel.org/index.php/Main_Page)
+2. [Btrfs Documentation](https://www.kernel.org/doc/Documentation/filesystems/btrfs.txt)
diff --git a/health/guides/btrfs/btrfs_device_read_errors.md b/health/guides/btrfs/btrfs_device_read_errors.md
new file mode 100644
index 00000000..684cd0be
--- /dev/null
+++ b/health/guides/btrfs/btrfs_device_read_errors.md
@@ -0,0 +1,50 @@
+### Understand the alert
+
+This alert monitors the number of BTRFS read errors on a device. If you receive this alert, it means that your system has encountered at least one BTRFS read error in the last 10 minutes.
+
+### What are BTRFS read errors?
+
+BTRFS (B-Tree File System) is a modern file system designed for Linux. BTRFS read errors are instances where the file system fails to read data from a device. This can occur for various reasons, such as hardware failure, file system corruption, or disk problems.
+
+### Troubleshoot the alert
+
+1. Check system logs for BTRFS errors
+
+   Review the output from the following command to identify any BTRFS errors:
+   ```
+   sudo journalctl -k | grep -i BTRFS
+   ```
+
+2. Identify the affected BTRFS device and partition
+
+   List all BTRFS devices with their respective information by running the following command:
+   ```
+   sudo btrfs filesystem show
+   ```
+
+3. Perform a BTRFS filesystem check
+
+   To check the integrity of the BTRFS file system, run the following command on the unmounted file system, replacing `<device>` with the affected device path:
+   ```
+   sudo btrfs check --readonly <device>
+   ```
+   Note: Be careful when using the `--repair` option, as it may cause data loss. It is recommended to take a backup before attempting a repair.
+
+4. Verify the disk health
+
+   Check the disk health using SMART tools to determine if there are any hardware issues. This can be done by first installing `smartmontools` if not already installed:
+   ```
+   sudo apt install smartmontools
+   ```
+   Then run a disk health check on the affected device:
+   ```
+   sudo smartctl -a <device>
+   ```
+
+5. Analyze the read error patterns
+
+   If the read errors are happening consistently or increasing, consider replacing the affected device with a new one or adding redundancy to the system by using RAID or BTRFS built-in features.
+
+### Useful resources
+
+1. [smartmontools documentation](https://www.smartmontools.org/)
diff --git a/health/guides/btrfs/btrfs_device_write_errors.md b/health/guides/btrfs/btrfs_device_write_errors.md
new file mode 100644
index 00000000..cdf22172
--- /dev/null
+++ b/health/guides/btrfs/btrfs_device_write_errors.md
@@ -0,0 +1,42 @@
+### Understand the alert
+
+This alert is triggered when BTRFS (B-tree file system) encounters write errors on your system. BTRFS is a modern copy-on-write (CoW) filesystem designed to address various weaknesses in traditional Linux file systems. If you receive this alert, it means that there have been issues with writing data to the file system.
+
+### What are BTRFS write errors?
+
+BTRFS write errors can occur when there are problems with the underlying storage devices, such as bad disks or data corruption. These errors may result in data loss or the inability to write new data to the file system. It is important to address these errors to prevent potential data loss and maintain the integrity of your file system.
+
+### Troubleshoot the alert
+
+- Check the BTRFS system status
+
+Execute the following command to get the current error counters of your BTRFS devices:
+```
+sudo btrfs device stats [Mount point]
+```
+Replace `[Mount point]` with the actual mount point of your BTRFS file system.
+
+- Examine system logs for potential issues
+
+Check the kernel log for any signs of issues with the BTRFS file system or underlying storage devices:
+```
+sudo journalctl -k | grep -i btrfs
+```
+
+- Check the health of the storage devices
+
+Use the `smartctl` tool to assess the health of your storage devices. For example, to check the device `/dev/sda`, use the following command:
+```
+sudo smartctl -a /dev/sda
+```
+
+- Repair the BTRFS file system
+
+If there are issues with the file system, run the following command on the unmounted file system to repair it:
+```
+sudo btrfs check --repair <device>
+```
+Replace `<device>` with the device path of your BTRFS file system.
+
+**WARNING:** The `--repair` option should be used with caution, as it may result in data loss under certain circumstances. It is recommended to back up your data before attempting to repair the file system.
+
diff --git a/health/guides/btrfs/btrfs_metadata.md b/health/guides/btrfs/btrfs_metadata.md
new file mode 100644
index 00000000..6c44ee09
--- /dev/null
+++ b/health/guides/btrfs/btrfs_metadata.md
@@ -0,0 +1,70 @@
+### Understand the alert
+
+The `btrfs_metadata` alert calculates the percentage of used Btrfs metadata space for a Btrfs filesystem. If you receive this alert, it indicates that a high percentage of your Btrfs filesystem's metadata space is in use.
+
+### Troubleshoot the alert
+
+**Warning: Data is valuable. Before performing any actions, make sure to take necessary backup steps. Netdata is not responsible for any loss or corruption of data, database, or software.**
+
+1. **Add more physical space**
+
+   - Determine which disk you want to add and to which path, then add it:
+     ```
+     root@netdata~ # btrfs device add -f /dev/<new_disk> <path>
+     ```
+
+   - If you get an error that the drive is already mounted, unmount it and run the command again:
+     ```
+     root@netdata~ # btrfs device add -f /dev/<new_disk> <path>
+     ```
+
+   - Check the newly added disk:
+     ```
+     root@netdata~ # btrfs filesystem show
+     ```
+
+   - Balance the system to make use of the new drive:
+     ```
+     root@netdata~ # btrfs balance start <path>
+     ```
+
+2. **Delete snapshots**
+
+   - List the snapshots for a specific path:
+     ```
+     root@netdata~ # sudo btrfs subvolume list -s <path>
+     ```
+
+   - Delete an unnecessary snapshot:
+     ```
+     root@netdata~ # btrfs subvolume delete <path>/@some_dir-snapshot-test
+     ```
+
+3. **Enable a compression mechanism**
+
+   Enable compression for new writes by adding the `compress=alg` option to the `fstab` configuration file (or during the `mount` procedure). Replace `alg` with `zlib`, `lzo`, `zstd`, or `no` (for no compression). To apply compression to existing files, for example to re-compress the `/mount/point` path with `zstd` compression:
+
+   ```
+   root@netdata # btrfs filesystem defragment -r -v -czstd /mount/point
+   ```
+
+4. **Enable a deduplication mechanism**
+
+   Deduplication tools like duperemove, bees, and dduper can help identify blocks of data sharing common sequences and combine extents via copy-on-write semantics. Ensure you check the status of these third-party tools before using them.
+
+   - [duperemove](https://github.com/markfasheh/duperemove)
+   - [bees](https://github.com/Zygo/bees)
+   - [dduper](https://github.com/lakshmipathi/dduper)
+
+5. **Perform a balance**
+
+   Relocate data/metadata/system data out of empty or near-empty chunks, especially for Btrfs filesystems with multiple disks, allowing space to be reassigned:
+
+   ```
+   root@netdata # btrfs balance start -musage=50 -dusage=10 -susage=5 /mount/point
+   ```
+
+### Useful resources
+
+1. [The Btrfs filesystem on Arch Linux website](https://wiki.archlinux.org/title/btrfs)
+2. [The Btrfs filesystem on kernel.org website](https://btrfs.wiki.kernel.org)
\ No newline at end of file
diff --git a/health/guides/btrfs/btrfs_system.md b/health/guides/btrfs/btrfs_system.md
new file mode 100644
index 00000000..82d321ed
--- /dev/null
+++ b/health/guides/btrfs/btrfs_system.md
@@ -0,0 +1,75 @@
+### Understand the alert
+
+The `btrfs_system` alert monitors the percentage of used Btrfs system space. If you receive this alert, it means that your Btrfs system space usage has reached a critical level and could potentially cause issues on your system.
+
+### Troubleshoot the alert
+
+**Important**: Data is priceless. Before you perform any action, make sure that you have taken any necessary backup steps. Netdata is not liable for any loss or corruption of any data, database, or software.
+
+1. Add more physical space
+
+   Adding a new disk always depends on your infrastructure, disk RAID configuration, encryption, etc. To add a new disk to a filesystem:
+
+   - Determine which disk you want to add and to which path, then add it:
+     ```
+     root@netdata~ # btrfs device add -f /dev/<new_disk> <path>
+     ```
+   - If you get an error that the drive is already mounted, unmount it and run the command again:
+     ```
+     root@netdata~ # btrfs device add -f /dev/<new_disk> <path>
+     ```
+   - See the newly added disk:
+     ```
+     root@netdata~ # btrfs filesystem show
+     Label: none uuid: d6b9d7bc-5978-2677-ac2e-0e68204b2c7b
+     Total devices 2 FS bytes used 192.00KiB
+     devid 1 size 10.01GiB used 536.00MiB path /dev/sda1
+     devid 2 size 10.01GiB used 0.00B path /dev/sdb
+     ```
+   - Balance the system to make use of the new drive:
+     ```
+     root@netdata~ # btrfs balance start <path>
+     ```
+
+2. Delete snapshots
+
+   You can identify and delete snapshots that you no longer need.
+
+   - Find the snapshots for a specific path:
+     ```
+     root@netdata~ # sudo btrfs subvolume list -s <path>
+     ```
+   - Delete a snapshot that you do not need any more:
+     ```
+     root@netdata~ # btrfs subvolume delete <path>/@some_dir-snapshot-test
+     ```
+
+3. Enable a compression mechanism
+
+   - Apply compression to existing files. This command will re-compress the `/mount/point` path with the `zstd` compression algorithm:
+     ```
+     root@netdata # btrfs filesystem defragment -r -v -czstd /mount/point
+     ```
+
+4. Enable a deduplication mechanism
+
+   Tools dedicated to deduplicating a Btrfs-formatted partition include duperemove, bees, and dduper. These projects are third party, and it is strongly suggested that you check their status before you decide to use them.
+
+   - [duperemove](https://github.com/markfasheh/duperemove)
+   - [bees](https://github.com/Zygo/bees)
+   - [dduper](https://github.com/lakshmipathi/dduper)
+
+5. Perform a balance
+
+   Especially in a Btrfs filesystem with multiple disks, data and metadata might be unevenly allocated across the disks.
+
+   ```
+   root@netdata # btrfs balance start -musage=10 -dusage=10 -susage=50 /mount/point
+   ```
+
+   > This command will attempt to relocate data/metadata/system data in empty or near-empty chunks (at most X% used, where X is the value passed to each `usage` filter), allowing the space to be reclaimed and reassigned between data and metadata. If the balance command ends with "Done, had to relocate 0 out of XX chunks", then you need to increase the `-dusage`/`-musage` percentage parameters until at least some chunks are relocated.
+
+### Useful resources
+
+1. [The Btrfs filesystem on Arch Linux website](https://wiki.archlinux.org/title/btrfs)
+2. [The Btrfs filesystem on kernel.org website](https://btrfs.wiki.kernel.org)
\ No newline at end of file
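
Note on the device error guides: `btrfs_device_write_errors.md` points readers at `btrfs device stats`, whose per-device counters correspond to the error types these alerts cover. Below is a minimal sketch of filtering that output down to the counters that are non-zero. It assumes the usual `[device].counter value` line format printed by `btrfs device stats`. The helper name `nonzero_btrfs_errors` is hypothetical and is not part of btrfs-progs or of this commit.

```shell
#!/bin/sh
# Sketch only: print the non-zero counters found in `btrfs device stats` output.
# Assumption: each input line looks like "[/dev/sda1].read_io_errs   3".
# `nonzero_btrfs_errors` is a hypothetical helper name used for illustration.
nonzero_btrfs_errors() {
    awk '
        # Keep lines of the form "[device].counter value" whose value is > 0,
        # and print them as "[device].counter=value".
        /^\[.*\]\.[a-z_]+[ \t]+[0-9]+$/ && $NF > 0 {
            print $1 "=" $NF
        }
    '
}
```

One possible use is `sudo btrfs device stats /mount/point | nonzero_btrfs_errors`, where `/mount/point` stands for the mount point of the affected filesystem. Empty output would mean that every counter is zero.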