summaryrefslogtreecommitdiffstats
path: root/health/guides/memory/1hour_ecc_memory_correctable.md
blob: 1893bbf7eab518b11e6301da7308fd06f80de46f (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
### Understand the alert

This alert, `1hour_ecc_memory_correctable`, monitors the number of Error Correcting Code (ECC) correctable errors that occur within an hour. If you receive this alert, it means that there are ECC correctable errors in your system's memory. While it does not pose an immediate threat, it may indicate that a memory module is slowly deteriorating. 

### ECC Memory

ECC memory is a type of computer data storage that can detect and correct the most common kinds of internal data corruption. It is used in systems that require high reliability and stability, such as servers or mission-critical applications.

### Troubleshoot the alert

1. Inspect the memory modules

   If the alert is triggered, start by physically checking the memory modules in the system. Ensure that the contacts are clean, and all modules are firmly seated in their respective slots.

2. Perform a memory test

   Run a thorough memory test using a tool like Memtest86+. This will help identify if any memory chips have problems that can cause the ECC errors.

   ```
   sudo apt-get install memtester
   sudo memtester 1024M 5
   ```

   Replace `1024M` with the amount of memory you'd like to test (in MB) and `5` with the number of loops for the test.

3. Monitor the errors

   Monitor the frequency of ECC correctable errors. Keep a record of when they occur and if there are any patterns or trends. If errors continue to occur, move to step 4.

4. Replace faulty memory modules

   If ECC correctable errors persist, identify the memory modules with the highest error rates and consider replacing them as a preventive measure. This will help maintain the reliability and stability of your system.

### Useful resources

1. [Memtest86+ - Advanced Memory Diagnostic Tool](https://www.memtest.org/)
2. [How to Diagnose, Check, and Test for Bad Memory](https://www.computerhope.com/issues/ch001089.htm)