summaryrefslogtreecommitdiffstats
path: root/src/health/guides/nvme
diff options
context:
space:
mode:
Diffstat (limited to 'src/health/guides/nvme')
-rw-r--r--src/health/guides/nvme/nvme_device_critical_warnings_state.md36
1 files changed, 36 insertions, 0 deletions
diff --git a/src/health/guides/nvme/nvme_device_critical_warnings_state.md b/src/health/guides/nvme/nvme_device_critical_warnings_state.md
new file mode 100644
index 000000000..a12381bb3
--- /dev/null
+++ b/src/health/guides/nvme/nvme_device_critical_warnings_state.md
@@ -0,0 +1,36 @@
+### Understand the alert
+
+This alert is triggered when an `NVMe device` experiences `critical warnings`. The alert is focusing on your `NVMe` (Non-Volatile Memory Express) SSD storage device which is designed for high-performance and low-latency storage.
+
+### What does critical warnings mean?
+
+A critical warning state indicates that the NVMe device has experienced an event, error, or condition which could negatively impact performance, data integrity or device longevity. This could result from a variety of reasons such as high temperature, hardware failures, internal errors, or device reaching end of life.
+
+### Troubleshoot the alert
+
+1. Identify the affected NVMe device(s):
+
+This alert provides information in the `info` field about the affected device. It should look like: "NVMe device ${label:device} has critical warnings", where `${label:device}` will be replaced with the actual device name.
+
+2. Check device SMART information:
+
+`SMART` (Self-Monitoring, Analysis, and Reporting Technology) provides detailed information about the current health and performance of your NVMe device. To check SMART information for the affected NVMe device, use `smartctl` command:
+
+ ```
+ sudo smartctl -a /dev/nvme0n1
+ ```
+
+ Replace `/dev/nvme0n1` with the actual device name identified in step 1.
+
+3. Evaluate the SMART information for critical issues:
+
+Review the output of the `smartctl` command to identify the critical warnings or any other concerning attributes. You might see high temperature, high uncorrectable error counts, or high percent of used endurance. These values might help you diagnose the issue with your NVMe device.
+
+4. Take appropriate action based on SMART data:
+
+- If the temperature of the device is high, ensure proper cooling and airflow in the system.
+- If the device is reaching its end of life, plan for a replacement or backup.
+- If the device has high uncorrectable error counts, consider backing up critical data and contact the manufacturer for support, as this could indicate a possible hardware failure.
+
+Make sure to replace, stop the usage of, or seek support for the problematic NVMe device(s) depending on the analysis.
+