Diffstat (limited to 'health/guides/mdstat')
-rw-r--r-- health/guides/mdstat/mdstat_disks.md | 26
-rw-r--r-- health/guides/mdstat/mdstat_mismatch_cnt.md | 15
-rw-r--r-- health/guides/mdstat/mdstat_nonredundant_last_collected.md | 55
3 files changed, 96 insertions, 0 deletions
diff --git a/health/guides/mdstat/mdstat_disks.md b/health/guides/mdstat/mdstat_disks.md
new file mode 100644
index 00000000..c3daf961
--- /dev/null
+++ b/health/guides/mdstat/mdstat_disks.md
@@ -0,0 +1,26 @@
+### Understand the alert
+
+This alert reports the number of devices in the down state for the RAID array that raised it. If you receive this alert, the array is degraded and one or more of its devices are faulty or missing.
+
+### What is a "degraded array" event?
+
+When one or more disks in a RAID array fail, the array can enter degraded mode, a fallback mode that generally allows continued use of the array. In this mode, the array either loses the performance benefit of the RAID technique (for example, a RAID-1 mirror across two disks falls back to the performance of a single drive when one of them fails) or suffers severe performance penalties because the damaged data must be reconstructed from error correction data.
+
+### Troubleshoot the alert
+
+- Check for faulty or offline devices
+
+Having a degraded array means that one or more devices are faulty or missing. To fix this issue, check for faulty devices by running:
+```
+mdadm --detail <RAIDDEVICE>
+```
+Replace `<RAIDDEVICE>` with the path of your RAID device, for example `/dev/md0`.
+
+To recover the array, replace the faulty devices or bring back any offline devices.
+
+### Useful resources
+
+1. [Degraded Mode](https://en.wikipedia.org/wiki/Degraded_mode)
+2. [Mdadm recover degraded array procedure](https://www.thomas-krenn.com/en/wiki/Mdadm_recover_degraded_Array_procedure)
+3. [mdadm Manual page](https://linux.die.net/man/8/mdadm)
+4. [mdadm cheat sheet](https://www.ducea.com/2009/03/08/mdadm-cheat-sheet/)
\ No newline at end of file
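To decide which array to pass to `mdadm --detail`, the member-status field in `/proc/mdstat` can be scanned for missing devices. The following is a minimal sketch, assuming the common `/proc/mdstat` layout in which each array's status line carries a field such as `[UU]` or `[U_]`, where `_` marks a missing member. `list_degraded` is a hypothetical helper name, not part of mdadm or Netdata.

```shell
#!/bin/sh
# list_degraded: hypothetical helper (not an mdadm or Netdata command).
# Prints the name of every md array whose member-status field, such as
# [U_], contains "_", which marks a missing or failed member.
# Reads mdstat-formatted text from the file in $1 (default: /proc/mdstat).
list_degraded() {
    awk '
        # An array entry starts with a line like "md0 : active raid1 ...".
        /^md[^ ]* : / { name = $1 }
        # The first [U_]-style field after that line is the member status.
        name != "" && match($0, /\[[U_]+\]/) {
            if (index(substr($0, RSTART, RLENGTH), "_"))
                print name
            name = ""
        }
    ' "${1:-/proc/mdstat}"
}
```

Calling `list_degraded` with no argument reads `/proc/mdstat`. Each printed name can then be inspected with `mdadm --detail /dev/<name>`.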
diff --git a/health/guides/mdstat/mdstat_mismatch_cnt.md b/health/guides/mdstat/mdstat_mismatch_cnt.md
new file mode 100644
index 00000000..7a156e38
--- /dev/null
+++ b/health/guides/mdstat/mdstat_mismatch_cnt.md
@@ -0,0 +1,15 @@
+### Understand the alert
+
+This alert reports the number of unsynchronized blocks for the affected RAID array. A high number of unsynchronized blocks might indicate that data on the array is corrupted.
+
+This alert is raised to warning when the metric exceeds 1024 unsynchronized blocks.
+
+### Troubleshoot the alert
+
+There is no standard approach to troubleshooting this alert because the possible causes vary.
+
+For example, one cause might be swap space located on the array, which is relatively harmless. However, this alert can also be triggered by hardware issues, which can lead to inconsistencies between the disks.
+
+### Useful resources
+
+[Serverfault | Reasons for high mismatch_cnt on a RAID1/10 array](https://serverfault.com/questions/885565/what-are-raid-1-10-mismatch-cnt-0-causes-except-for-swap-file/885574#885574)
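The 1024 threshold in `mdstat_mismatch_cnt.md` can be checked by hand against the counter that the kernel exposes in sysfs. The following is a minimal sketch, assuming the counter is read from `/sys/block/md0/md/mismatch_cnt`, where `md0` is a placeholder array name. `mismatch_status` is a hypothetical helper, and its output format is illustrative only.

```shell
#!/bin/sh
# mismatch_status: hypothetical helper (not a Netdata or mdadm command).
# Reads a mismatch count from the file in $1 and compares it with the
# threshold in $2 (default: 1024, the warning value quoted in the guide).
mismatch_status() {
    count=$(cat "$1") || return 1
    if [ "$count" -gt "${2:-1024}" ]; then
        echo "warning ($count)"
    else
        echo "ok ($count)"
    fi
}
```

For example, `mismatch_status /sys/block/md0/md/mismatch_cnt` prints either `ok` or `warning` followed by the current count for `md0`.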
diff --git a/health/guides/mdstat/mdstat_nonredundant_last_collected.md b/health/guides/mdstat/mdstat_nonredundant_last_collected.md
new file mode 100644
index 00000000..f76c6148
--- /dev/null
+++ b/health/guides/mdstat/mdstat_nonredundant_last_collected.md
@@ -0,0 +1,55 @@
+### Understand the alert
+
+This alert, `mdstat_nonredundant_last_collected`, is triggered when the Netdata Agent fails to collect data from the Multiple Device (md) driver for a certain period. The md driver is used to manage software RAID arrays in Linux.
+
+### What is the md driver?
+
+The md (multiple device) driver is responsible for managing software RAID arrays on Linux systems. It provides a way to combine multiple physical disks into a single logical disk, increasing capacity and providing redundancy, depending on the RAID level. Monitoring the status of these devices is crucial to ensure data integrity and redundancy.
+
+### Troubleshoot the alert
+
+1. Check the status of the md driver:
+
+ To inspect the status of the RAID arrays managed by the md driver, use the `cat` command:
+
+ ```
+ cat /proc/mdstat
+ ```
+
+ This displays the status and configuration of all active RAID arrays. Look for any abnormal status, such as failed disks or degraded arrays, and replace or fix the affected devices as needed.
+
+2. Verify the Netdata configuration:
+
+ Ensure that the Netdata Agent is properly configured to collect data from the md driver. Open the `netdata.conf` configuration file found in `/etc/netdata/` or `/opt/netdata/etc/netdata/`, and look for the `[plugin:proc:/proc/mdstat]` section.
+
+ Make sure that the `enabled` option is set to `yes`. In `netdata.conf`, a line commented out with `#` shows the default value:
+
+ ```
+ [plugin:proc:/proc/mdstat]
+ # enabled = yes
+ ```
+
+ If you make any changes to the configuration, restart the Netdata Agent for the changes to take effect:
+
+ ```
+ sudo systemctl restart netdata
+ ```
+
+3. Check the md driver data collection:
+
+ After verifying the Netdata configuration, check if data collection is successful. On the Netdata dashboard, go to the "Disks" section, and look for "mdX" (where "X" is a number) in the list of available disks. If you can see the charts for your RAID array(s), it means data collection is working correctly.
+
+4. Investigate system logs:
+
+ If the issue persists, check the system logs for any errors or messages related to the md driver or Netdata Agent. You can use `journalctl` for this purpose:
+
+ ```
+ journalctl -u netdata
+ ```
+
+ Look for any error messages or warnings that could indicate the cause of the problem.
+
+### Useful resources
+
+1. [Linux RAID: A Quick Guide](https://www.cyberciti.biz/tips/linux-raid-increase-resync-rebuild-speed.html)
+2. [Netdata Agent Configuration Guide](https://learn.netdata.cloud/docs/agent/daemon/config)
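Before working through the Netdata configuration, it can help to confirm that `/proc/mdstat` is readable and actually lists arrays, since there is nothing to collect otherwise. The following is a minimal sketch, assuming array entries start with a line of the form `mdX : ...`. `count_md_arrays` is a hypothetical helper name, not a Netdata command.

```shell
#!/bin/sh
# count_md_arrays: hypothetical helper (not a Netdata command).
# Counts array entries ("mdX : ...") in mdstat-formatted text read from
# the file in $1 (default: /proc/mdstat). Returns non-zero if the file
# cannot be read, for example when the md driver is not loaded.
count_md_arrays() {
    file=${1:-/proc/mdstat}
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 1
    fi
    awk '/^md[^ ]* : / { n++ } END { print n + 0 }' "$file"
}
```

A result of `0`, or a read error, points to the system rather than to the Netdata configuration. A non-zero count means the data is available, so the remaining steps apply.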