Adding upstream version 1.44.3.upstream/1.44.3 upstream

Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
author: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-19 02:57:58 +0000
committer: Daniel Baumann <daniel.baumann@progress-linux.org> 2024-04-19 02:57:58 +0000
commit: be1c7e50e1e8809ea56f2c9d472eccd8ffd73a97 (patch)
tree: 9754ff1ca740f6346cf8483ec915d4054bc5da2d /health/guides/nvme
parent: Initial commit. (diff)
download: netdata-upstream.tar.xz
netdata-upstream.zip
1 files changed, 36 insertions, 0 deletions
diff --git a/health/guides/nvme/nvme_device_critical_warnings_state.md b/health/guides/nvme/nvme_device_critical_warnings_state.md
new file mode 100644
index 00000000..a12381bb
--- /dev/null
+++ b/health/guides/nvme/nvme_device_critical_warnings_state.md
@@ -0,0 +1,36 @@
+### Understand the alert
+
+This alert is triggered when an `NVMe device` experiences `critical warnings`. The alert is focusing on your `NVMe` (Non-Volatile Memory Express) SSD storage device which is designed for high-performance and low-latency storage.
+
+### What does critical warnings mean?
+
+A critical warning state indicates that the NVMe device has experienced an event, error, or condition which could negatively impact performance, data integrity or device longevity. This could result from a variety of reasons such as high temperature, hardware failures, internal errors, or device reaching end of life.
+
+### Troubleshoot the alert
+
+1. Identify the affected NVMe device(s):
+
+This alert provides information in the `info` field about the affected device. It should look like: "NVMe device ${label:device} has critical warnings", where `${label:device}` will be replaced with the actual device name.
+
+2. Check device SMART information:
+
+`SMART` (Self-Monitoring, Analysis, and Reporting Technology) provides detailed information about the current health and performance of your NVMe device. To check SMART information for the affected NVMe device, use `smartctl` command:
+
+   ```
+   sudo smartctl -a /dev/nvme0n1
+   ```
+
+   Replace `/dev/nvme0n1` with the actual device name identified in step 1.
+
+3. Evaluate the SMART information for critical issues:
+
+Review the output of the `smartctl` command to identify the critical warnings or any other concerning attributes. You might see high temperature, high uncorrectable error counts, or high percent of used endurance. These values might help you diagnose the issue with your NVMe device.
+
+4. Take appropriate action based on SMART data:
+
+- If the temperature of the device is high, ensure proper cooling and airflow in the system.
+- If the device is reaching its end of life, plan for a replacement or backup.
+- If the device has high uncorrectable error counts, consider backing up critical data and contact the manufacturer for support, as this could indicate a possible hardware failure.
+
+Make sure to replace, stop the usage of, or seek support for the problematic NVMe device(s) depending on the analysis.
+
author	Daniel Baumann <daniel.baumann@progress-linux.org>	2024-04-19 02:57:58 +0000
committer	Daniel Baumann <daniel.baumann@progress-linux.org>	2024-04-19 02:57:58 +0000
commit	be1c7e50e1e8809ea56f2c9d472eccd8ffd73a97 (patch)
tree	9754ff1ca740f6346cf8483ec915d4054bc5da2d /health/guides/nvme
parent	Initial commit. (diff)
download	netdata-upstream.tar.xz netdata-upstream.zip