Diffstat (limited to 'health/guides/memory')
-rw-r--r-- health/guides/memory/1hour_ecc_memory_correctable.md 37
-rw-r--r-- health/guides/memory/1hour_ecc_memory_uncorrectable.md 27
-rw-r--r-- health/guides/memory/1hour_memory_hw_corrupted.md 19
3 files changed, 83 insertions, 0 deletions
diff --git a/health/guides/memory/1hour_ecc_memory_correctable.md b/health/guides/memory/1hour_ecc_memory_correctable.md
new file mode 100644
index 000000000..1893bbf7e
--- /dev/null
+++ b/health/guides/memory/1hour_ecc_memory_correctable.md
@@ -0,0 +1,37 @@
+### Understand the alert
+
+This alert, `1hour_ecc_memory_correctable`, monitors the number of Error-Correcting Code (ECC) correctable errors that occur within an hour. If you receive this alert, ECC correctable errors were detected in your system's memory. They do not pose an immediate threat, but they may indicate that a memory module is slowly deteriorating.
+
+### ECC Memory
+
+ECC memory is a type of computer data storage that can detect and correct the most common kinds of internal data corruption. It is used in systems that require high reliability and stability, such as servers or mission-critical applications.
+
+### Troubleshoot the alert
+
+1. Inspect the memory modules
+
+ If the alert is triggered, start by physically checking the memory modules in the system. Ensure that the contacts are clean, and all modules are firmly seated in their respective slots.
+
+2. Perform a memory test
+
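+ Run a thorough memory test to identify memory chips with problems that can cause the ECC errors. Memtest86+ runs from boot media, outside the operating system. Alternatively, `memtester` tests memory from userspace on a running system; on Debian-based distributions, you can install and run it as follows: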
+
+ ```
+ sudo apt-get install memtester
+ sudo memtester 1024M 5
+ ```
+
+ Replace `1024M` with the amount of memory you'd like to test (in MB) and `5` with the number of loops for the test.
+
+3. Monitor the errors
+
+ Monitor the frequency of ECC correctable errors. Keep a record of when they occur and if there are any patterns or trends. If errors continue to occur, move to step 4.
+
+4. Replace faulty memory modules
+
+ If ECC correctable errors persist, identify the memory modules with the highest error rates and consider replacing them as a preventive measure. This will help maintain the reliability and stability of your system.
+
+### Useful resources
+
+1. [Memtest86+ - Advanced Memory Diagnostic Tool](https://www.memtest.org/)
+2. [How to Diagnose, Check, and Test for Bad Memory](https://www.computerhope.com/issues/ch001089.htm)
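Step 4 of the correctable-errors guide above asks you to identify the memory modules with the highest error rates. A minimal sketch of one way to do that on Linux follows. It assumes the kernel's EDAC driver is loaded and exposes per-DIMM counters under `/sys/devices/system/edac/mc`; the `dimm*` entries may be absent on older kernels or unsupported hardware. The function name `edac_ce_by_dimm` is hypothetical and not part of the guide.

```shell
# Print "<ce_count> <controller>/<dimm> <label>" for every DIMM that EDAC
# reports, highest correctable-error count first.
# Assumption: the EDAC sysfs layout mc*/dimm*/{dimm_ce_count,dimm_label}.
edac_ce_by_dimm() {
    root="${1:-/sys/devices/system/edac/mc}"
    for dimm in "$root"/mc*/dimm*; do
        # Skip when the glob matched nothing or the counter is unreadable.
        [ -r "$dimm/dimm_ce_count" ] || continue
        count=$(cat "$dimm/dimm_ce_count")
        label=$(cat "$dimm/dimm_label" 2>/dev/null)
        mc=$(basename "$(dirname "$dimm")")
        printf '%s %s/%s %s\n' "$count" "$mc" "$(basename "$dimm")" "$label"
    done | sort -rn
}

# Prints nothing if EDAC exposes no per-DIMM counters on this machine.
edac_ce_by_dimm
```

Running this periodically and comparing the output also covers step 3, since a counter that keeps growing on one DIMM points to a likely candidate for replacement.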
diff --git a/health/guides/memory/1hour_ecc_memory_uncorrectable.md b/health/guides/memory/1hour_ecc_memory_uncorrectable.md
new file mode 100644
index 000000000..509ff5448
--- /dev/null
+++ b/health/guides/memory/1hour_ecc_memory_uncorrectable.md
@@ -0,0 +1,27 @@
+### Understand the alert
+
+This alert, `1hour_ecc_memory_uncorrectable`, indicates that ECC (Error-Correcting Code) uncorrectable errors were detected in your system's memory within the last hour. ECC errors are caused by issues in the system's RAM (Random Access Memory). These uncorrectable errors are severe and may lead to system crashes or data corruption.
+
+### What are ECC errors?
+
+ECC memory is designed to detect and, in some cases, correct data corruption in the memory, preventing system crashes and providing overall system stability. ECC errors fall into two categories:
+
+1. **Correctable Errors**: These are errors that the ECC memory can detect and correct, preventing system crashes and ensuring data integrity.
+2. **Uncorrectable Errors**: These are more severe errors that the ECC memory cannot correct, often requiring faulty memory modules to be replaced to prevent system crashes and data corruption.
+
+### Troubleshoot the alert
+
+- **Inspect the memory modules**: Power off the system and check the memory modules for any signs of damage or poor contact with the socket. Ensure that the memory modules are seated firmly and there is proper contact.
+
+- **Run memory diagnostics**: Run a memory diagnostic tool, such as [Memtest86+](https://www.memtest.org/), to identify memory errors and verify the memory's health. If errors are detected, the affected memory modules need to be replaced.
+
+- **Replace faulty memory modules**: If uncorrectable errors continue occurring or if diagnostics identify faulty memory modules, consider replacing them. Before doing so, check if the memory modules are still covered under warranty.
+
+- **Check system logs**: Review system logs, such as Event Viewer on Windows or `/var/log` on Linux systems, for any related messages or errors that may help to diagnose the issue further.
+
+- **Update firmware**: Ensure your system's firmware and BIOS are up to date. Manufacturers often release stability and performance improvements that can resolve or mitigate ECC errors.
+
+
+### Useful resources
+
+1. [How to Check Memory Problems in Linux](https://www.cyberciti.biz/faq/linux-check-memory-usage/)
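The "Check system logs" step in the uncorrectable-errors guide above can be narrowed to the messages that usually accompany memory errors. The sketch below is one possible approach, not a definitive one. The function name `filter_memory_errors` is hypothetical, and the search patterns are an assumption about typical kernel wording, so adjust them if your platform logs differently.

```shell
# Keep only log lines that usually accompany memory errors.
# Assumption: the kernel tags such lines with "EDAC", "mce:",
# "Machine check", or "Hardware Error"; wording varies by platform.
filter_memory_errors() {
    grep -iE 'edac|mce:|machine check|hardware error'
}

# Scan the kernel ring buffer. Reading it may require root on some systems,
# so errors are silenced and an empty result is not treated as a failure.
dmesg 2>/dev/null | filter_memory_errors || true
```

The same filter can read from any other log source on standard input, for example a file under `/var/log`.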
diff --git a/health/guides/memory/1hour_memory_hw_corrupted.md b/health/guides/memory/1hour_memory_hw_corrupted.md
new file mode 100644
index 000000000..1be030480
--- /dev/null
+++ b/health/guides/memory/1hour_memory_hw_corrupted.md
@@ -0,0 +1,19 @@
+
+### Understand the alert
+The Linux kernel keeps track of the system memory state. You can find the values it tracks in the [man pages](https://man7.org/linux/man-pages/man5/proc.5.html) under the `/proc/meminfo` subsection. One of the values that the kernel reports is `HardwareCorrupted`, the amount of memory, in kibibytes (1 KiB = 1024 bytes), that the hardware has identified as physically corrupted and that the kernel has set aside so it does not get used.
+
+The Netdata Agent monitors this value. This alert indicates that memory is corrupted due to a hardware failure. Although the most likely cause is a failing RAM chip, it can also be caused by incorrect seating or improper contact between the socket and the memory module.
+
+### Troubleshoot the alert
+
+Most of the time, uncorrectable errors cause the system to panic and then reboot or shut down. If the system keeps running, its tolerance level is high enough to avoid a panic, but you must still identify the defective module immediately.
+
+`memtester` is a userspace utility for testing the memory subsystem for faults.
+
+You may also receive this error as a result of incorrect seating or improper contact between the socket and the RAM module. Check both before you consider replacing the RAM module.
+
+### Useful resources
+
+1. [man pages /proc](https://man7.org/linux/man-pages/man5/proc.5.html)
+2. [memtester homepage](https://pyropus.ca/software/memtester/)
+
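The `HardwareCorrupted` value described in the last guide above can be read directly from `/proc/meminfo`. The following is a minimal sketch. The function name `hw_corrupted_kib` is hypothetical, and the field is assumed to be present, which requires a kernel built with `CONFIG_MEMORY_FAILURE`.

```shell
# Print the HardwareCorrupted value (in KiB) from a meminfo-style file.
# Defaults to /proc/meminfo; prints nothing if the field is absent.
hw_corrupted_kib() {
    awk '$1 == "HardwareCorrupted:" { print $2 }' "${1:-/proc/meminfo}"
}

kib=$(hw_corrupted_kib 2>/dev/null)
if [ "${kib:-0}" -gt 0 ]; then
    echo "Kernel has set aside ${kib} KiB of corrupted memory"
else
    echo "No hardware-corrupted memory reported"
fi
```

A non-zero value matches the condition this alert reports on, so it is a quick way to confirm the alert from a shell on the affected host.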