Diffstat (limited to 'health/guides/ipmi')
-rw-r--r-- health/guides/ipmi/ipmi_events.md | 38
-rw-r--r-- health/guides/ipmi/ipmi_sensors_states.md | 41
2 files changed, 79 insertions, 0 deletions
diff --git a/health/guides/ipmi/ipmi_events.md b/health/guides/ipmi/ipmi_events.md
new file mode 100644
index 000000000..284abd4cd
--- /dev/null
+++ b/health/guides/ipmi/ipmi_events.md
@@ -0,0 +1,38 @@
+### Understand the alert
+
+This alert is triggered when events are recorded in the IPMI System Event Log (SEL). Events can be informational, warning, or critical. The alert enters a warning state when the number of events in the IPMI SEL is greater than 0, meaning there are recorded events that may require your attention.
+
+### What is IPMI SEL?
+
+The Intelligent Platform Management Interface (IPMI) System Event Log (SEL) is a log that records events related to hardware components and firmware on a server. These events can provide insight into potential issues with the server's hardware or firmware, which could impact the server's overall performance or stability.
+
+### Troubleshoot the alert
+
+1. **Use `ipmitool` to view the IPMI SEL events:**
+
+ You can view the System Event Log using the `ipmitool` command. If you don't have `ipmitool` installed, you might need to install it first. Once `ipmitool` is installed, use the following command to list the SEL events:
+
+ ```
+ ipmitool sel list
+ ```
+
+ This command displays each recorded event with its event ID, timestamp, and a brief description.
+
+2. **Identify and resolve issues:**
+
+ Review the listed events to identify any critical or warning events that require immediate attention. To resolve them, you may need to consult your server's hardware documentation or apply firmware updates.
+
+3. **Clear the IPMI SEL events (optional):**
+
+ If you have resolved the issues or if the events listed are no longer relevant, you can clear the IPMI SEL events using the following command:
+
+ ```
+ ipmitool sel clear
+ ```
+
+ Note: Clearing the SEL events may cause you to lose important historical information related to your hardware components and firmware. Be cautious when using this command, and ensure that you have resolved any critical issues before clearing the event log.
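+
+Before clearing the log as described in step 3, it can help to keep a copy of it. The following is a minimal sketch, assuming a POSIX shell and sufficient privileges to run `ipmitool`. The function name `backup_and_clear_sel` and the backup path are hypothetical and not part of `ipmitool`:
+
+```shell
+# Hypothetical helper: save the SEL to a file, then clear it only if
+# the backup succeeded and is not empty.
+backup_and_clear_sel() {
+    backup_file="$1"
+    # Write the current SEL entries to the backup file.
+    ipmitool sel list > "$backup_file" || return 1
+    # Refuse to clear if nothing was written (empty log or failed read).
+    [ -s "$backup_file" ] || return 1
+    ipmitool sel clear
+}
+```
+
+For example, `backup_and_clear_sel /var/tmp/sel-backup.txt` (an illustrative path) would leave the log untouched if the backup file ends up empty.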
+
+### Useful resources
+
+1. [IPMITOOL GitHub Repository](https://github.com/ipmitool/ipmitool)
+2. [IPMITOOL Manual Page](https://linux.die.net/man/1/ipmitool)
diff --git a/health/guides/ipmi/ipmi_sensors_states.md b/health/guides/ipmi/ipmi_sensors_states.md
new file mode 100644
index 000000000..e7521a306
--- /dev/null
+++ b/health/guides/ipmi/ipmi_sensors_states.md
@@ -0,0 +1,41 @@
+### Understand the alert
+
+This alert is related to the IPMI (Intelligent Platform Management Interface) sensors in your system. IPMI is a hardware management interface used for monitoring server health and collecting information on various hardware components. The alert is triggered when any IPMI sensor reports a condition outside its normal operating range and is in a warning or critical state.
+
+### Troubleshoot the alert
+
+1. Check IPMI sensor status:
+
+ To check the status of IPMI sensors, you can use the `ipmi-sensors` command with appropriate flags. For instance:
+
+ ```
+ sudo ipmi-sensors --output-sensor-state
+ ```
+
+ This command lists each sensor together with its current state, allowing you to determine which ones are in a warning or critical state.
+
+2. Analyze sensor data:
+
+ Based on the output obtained in the previous step, identify the sensors that are causing the alert. Take note of their current values and thresholds.
+
+ To obtain more detailed information, you can also use the `-v` (verbose) flag with the command:
+
+ ```
+ sudo ipmi-sensors -v --output-sensor-state
+ ```
+
+3. Investigate the cause of the issue:
+
+ Once you have identified the sensors in a non-nominal state, start investigating the root cause of the issue. This may involve checking the hardware components, system logs, or contacting your hardware vendor for additional support.
+
+4. Resolve the issue:
+
+ Based on your investigation, take the necessary steps to resolve the issue. This may include replacing faulty hardware, addressing configuration errors, or applying firmware updates.
+
+5. Verify resolution:
+
+ After addressing the issue, use the `ipmi-sensors` command to check the status of the affected sensors. Ensure that they have returned to the nominal state, and no additional warning or critical conditions are being reported.
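+
+The checks in steps 1 and 5 can be narrowed to the sensors that need attention. The following is a minimal sketch, assuming that `ipmi-sensors --output-sensor-state` prints pipe-delimited rows with the sensor name in the second column and the state in the fourth (`ID | Name | Type | State | Reading | Units | Event`). Verify this layout against your FreeIPMI version. The function name `non_nominal_sensors` is hypothetical:
+
+```shell
+# Hypothetical helper: read ipmi-sensors output on stdin and print
+# "name: state" for every sensor in a Warning or Critical state.
+# Assumes column 2 is the sensor name and column 4 is the state.
+non_nominal_sensors() {
+    awk -F'|' '{
+        name = $2; state = $4
+        # Trim the padding around each field.
+        gsub(/^[ \t]+|[ \t]+$/, "", name)
+        gsub(/^[ \t]+|[ \t]+$/, "", state)
+        if (state == "Warning" || state == "Critical")
+            print name ": " state
+    }'
+}
+```
+
+A possible invocation is `sudo ipmi-sensors --output-sensor-state | non_nominal_sensors`. No output would indicate that no sensor is reported as warning or critical under the assumed layout.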
+
+### Useful resources
+
+1. ["ipmi-sensors" manual page](https://www.gnu.org/software/freeipmi/manpages/man8/ipmi-sensors.8.html)